Observability in a LangGraph graph: what Langfuse sees that the log doesn't
Logs cover what happened inside each node. They don't answer "did the fallback rate climb in the last 30 minutes?" For that, you need Langfuse.
In previous posts I showed how DSPy classifies intent and how the custom embedding-based router decides macro intent. Both share one trait: they can degrade silently.
DSPy degrades when the provider updates the model. The semantic router degrades when user vocabulary changes. Neither raises an exception when this happens. You discover it through response quality (or user complaints).
Observability is what changes that. This post is about instrumenting a LangGraph graph with Langfuse to track exactly where each decision happened and to detect degradation before it reaches the user.
What structured logging doesn't solve
The structured logger covers what happened inside each node. It doesn't answer questions like:
- Which scope did DSPy route to in the last 500 queries? What's the distribution?
- Did the fallback rate to geral climb in the last 30 minutes?
- How many responses did critique_node intervene in today? By which violation type?
- In how many requests did dspy.Refine need more than one attempt?
These questions require an aggregated view across multiple executions, not the log of a single execution. That's where Langfuse comes in.
Base instrumentation: @observe on nodes
The simplest entry point is the @observe decorator on graph nodes:
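A minimal sketch of a decorated node, assuming the Langfuse Python SDK's observe decorator with its capture_input and capture_output flags. The node body, the "query" state key, and the "catalog" scope are hypothetical placeholders, and the no-op fallback only exists so the snippet runs without the SDK installed.

```python
from typing import Any, Dict

try:
    from langfuse import observe  # import path in recent Langfuse Python SDKs
except ImportError:
    # Assumption: no-op stand-in so the sketch runs without the SDK installed.
    def observe(func=None, **_kwargs):
        def wrap(f):
            return f
        return wrap(func) if callable(func) else wrap


@observe(name="router_node", capture_input=False, capture_output=False)
def router_node(state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical node: returns only the state keys it updates."""
    query = state.get("query", "")
    # Placeholder decision; a real node would call the DSPy classifier here.
    scope = "catalog" if "product" in query.lower() else "geral"
    return {"scope": scope}
```

The span still records the node's name and timing; with both capture flags off, the state passed in and the value returned are not attached to it.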
capture_input=False and capture_output=False aren't paranoia. In any system with user data, capturing the full graph state in Langfuse means capturing conversation history, session identifiers, and potentially personal data. What you want in the trace isn't the state - it's the decision metadata.
update_trace_metadata: injecting context into the active trace
Inside any node, update_trace_metadata accumulates metadata in the current request's trace:
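One way such a helper can look, as a sketch rather than the actual implementation: it assumes update_trace_metadata is a thin project-level wrapper that forwards a node's decision fields to whatever callable updates the active trace (for example, the Langfuse client's update_current_trace method, which accepts a metadata argument). The in-memory backend below is a stand-in for illustration.

```python
from typing import Any, Callable, Dict


def update_trace_metadata(update_trace: Callable[..., Any], **fields: Any) -> Dict[str, Any]:
    """Send one node's decision metadata to the active trace.

    `update_trace` is an assumption: any callable accepting `metadata=`.
    """
    # Drop unset fields so a node never overwrites a key with None.
    clean = {key: value for key, value in fields.items() if value is not None}
    if clean:
        update_trace(metadata=clean)
    return clean


# In-memory stand-in for the tracing backend: merges metadata per trace.
trace: Dict[str, Any] = {}


def fake_update_trace(metadata: Dict[str, Any]) -> None:
    trace.update(metadata)


update_trace_metadata(fake_update_trace, anaphora_resolved=True)
update_trace_metadata(fake_update_trace, scope="catalog", refine_attempts=None)
print(trace)  # {'anaphora_resolved': True, 'scope': 'catalog'}
```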
Each node contributes its metadata to the same trace. In Langfuse, you see the full trace of a request with all those fields aggregated, without cross-referencing logs from different workers.
Tags and user_id: segmentation without personal data
In the supervisor, when assembling the graph configuration:
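A sketch of that configuration, assuming the Langfuse callback handler for LangChain/LangGraph reads langfuse_user_id and langfuse_tags from the run's metadata. The function name, its parameters, and the tag format are hypothetical.

```python
from typing import Any, Dict, List


def build_graph_config(
    internal_user_id: str, channel: str, environment: str, degraded: bool
) -> Dict[str, Any]:
    """Build the config dict passed to graph.invoke(..., config=...)."""
    tags: List[str] = [f"channel:{channel}", f"env:{environment}"]
    if degraded:
        tags.append("degraded")
    return {
        # "callbacks": [langfuse_handler],  # the Langfuse callback handler goes here
        "metadata": {
            # Opaque internal ID only: never a national ID, email, or phone number.
            "langfuse_user_id": internal_user_id,
            "langfuse_tags": tags,
        },
    }


config = build_graph_config("u_8f3a", channel="whatsapp", environment="prod", degraded=True)
print(config["metadata"]["langfuse_tags"])
```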
The langfuse_user_id is an internal identifier - never a national ID, email, or phone number. Tags let you filter traces by channel, environment, and degraded mode in the dashboard without exposing personal data.
RouterMetricsCollector: in-process metrics with Redis backend
Langfuse tracks individual executions. For aggregated real-time metrics - fallback rate, scope distribution, anaphora hits - the system has its own collector:
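A minimal in-process version of such a collector, as a sketch: only the class name comes from this post, while the method names and counter keys are assumptions chosen for illustration.

```python
import threading
from collections import Counter
from typing import Dict


class RouterMetricsCollector:
    """Thread-safe in-process counters for routing decisions."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def record(self, scope: str, anaphora_hit: bool = False) -> None:
        """Count one routing decision."""
        with self._lock:
            self._counts["total"] += 1
            self._counts[f"scope:{scope}"] += 1
            if anaphora_hit:
                self._counts["anaphora_hit_count"] += 1

    def fallback_rate(self, fallback_scope: str = "geral") -> float:
        """Share of decisions that landed on the fallback scope."""
        with self._lock:
            total = self._counts["total"]
            return self._counts[f"scope:{fallback_scope}"] / total if total else 0.0

    def snapshot(self) -> Dict[str, int]:
        """Copy of all counters, e.g. for a scope-distribution endpoint."""
        with self._lock:
            return dict(self._counts)


collector = RouterMetricsCollector()
for scope in ["catalog", "geral", "catalog", "orders"]:
    collector.record(scope, anaphora_hit=(scope == "orders"))
print(collector.fallback_rate())  # 0.25
```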
The Redis backend uses HINCRBY, compatible with multi-worker deploys where each process has its own in-process counter but all write to the same HASH:
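A sketch of that write path, assuming a redis-py style client exposing hincrby(name, key, amount). The class name, the hash key "router:metrics", and the single-worker thread pool are illustrative choices, and InMemoryRedis is a stand-in so the example runs without a server.

```python
import threading
from collections import Counter
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Dict


class RedisBackedCounter:
    """Local counter plus a non-blocking HINCRBY against a shared hash."""

    def __init__(self, redis_client: Any, hash_key: str = "router:metrics") -> None:
        self._redis = redis_client
        self._hash_key = hash_key
        self._lock = threading.Lock()
        self.local: Counter = Counter()
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _push(self, field: str, amount: int) -> bool:
        try:
            self._redis.hincrby(self._hash_key, field, amount)
            return True
        except Exception:
            # Redis being down must not affect routing; the local count is already updated.
            return False

    def incr(self, field: str, amount: int = 1) -> Future:
        with self._lock:
            self.local[field] += amount
        # Submit and return immediately: the caller never waits on the network.
        return self._pool.submit(self._push, field, amount)


class InMemoryRedis:
    """Stand-in for a Redis client, for demonstration only."""

    def __init__(self) -> None:
        self.hashes: Dict[str, Dict[str, int]] = {}

    def hincrby(self, name: str, key: str, amount: int = 1) -> int:
        bucket = self.hashes.setdefault(name, {})
        bucket[key] = bucket.get(key, 0) + amount
        return bucket[key]


shared = InMemoryRedis()
worker_a = RedisBackedCounter(shared)
worker_b = RedisBackedCounter(shared)
pending = [worker_a.incr("fallback_count"), worker_b.incr("fallback_count")]
for future in pending:
    future.result(timeout=5)  # only the demo waits; the hot path would not
print(shared.hashes["router:metrics"]["fallback_count"])  # 2
```

Because HINCRBY is atomic on the server, two workers incrementing the same field never lose an update, while each worker's local counter reflects only its own traffic.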
The increment is fire-and-forget: it doesn't block router_node's hot path. If Redis is unavailable, the in-process counter still works, and the fallback alert still fires.
What to monitor to detect DSPy degradation
In router_node, after each inference:
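A sketch of that recording step, with assumptions stated up front: classify stands in for the DSPy module and is assumed to return a dict with scope, anaphora_hit, and coerced keys. Those keys, the factory function, and the fallback_count counter name are hypothetical; anaphora_hit_count and coercion_fallback_count are the counters discussed in this section.

```python
from collections import Counter
from typing import Any, Callable, Dict

FALLBACK_SCOPE = "geral"


def make_router_node(
    classify: Callable[[str], Dict[str, Any]], counters: Counter
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Wrap a classifier into a node that records one decision per call."""

    def router_node(state: Dict[str, Any]) -> Dict[str, Any]:
        result = classify(state["query"])
        scope = result.get("scope") or FALLBACK_SCOPE
        counters["total"] += 1
        counters[f"scope:{scope}"] += 1
        if scope == FALLBACK_SCOPE:
            counters["fallback_count"] += 1
        if result.get("anaphora_hit"):
            counters["anaphora_hit_count"] += 1
        if result.get("coerced"):
            counters["coercion_fallback_count"] += 1
        return {"scope": scope}

    return router_node


def stub_classify(query: str) -> Dict[str, Any]:
    # Placeholder for the DSPy classifier: knows only one scope.
    if "price" in query:
        return {"scope": "catalog", "anaphora_hit": "it" in query.split()}
    return {"scope": None, "coerced": True}


counters: Counter = Counter()
node = make_router_node(stub_classify, counters)
node({"query": "what is the price of it"})
node({"query": "tell me a joke"})
print(dict(counters))
```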
The three indicators that matter most:
- Fallback rate to geral: when DSPy can't classify the query, it returns scope="geral". A rate above 25% signals that compiled demos are outdated relative to user vocabulary. Fix: expand the dataset and recompile.
- Anaphora resolution rate: if anaphora_hit_count/total drops abruptly, the regex patterns of the anaphora resolver stopped matching user messages. Fix: review the patterns.
- Forced coercion rate: if coercion_fallback_count rises, the LLM started returning output formats the coercion layer can't parse. Fix: inspect dspy.inspect_history() and revise the Signature.
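The three indicators above reduce to simple ratios over a counter snapshot. A sketch, assuming a snapshot dict with total, fallback_count, anaphora_hit_count, and coercion_fallback_count keys (fallback_count and the function name are hypothetical). Only the 25% fallback threshold comes from the text; the other two rates are reported without a threshold because what counts as an "abrupt" change depends on your baseline.

```python
from typing import Dict, Union

FALLBACK_RATE_THRESHOLD = 0.25  # the 25% fallback threshold mentioned above


def routing_health(counts: Dict[str, int]) -> Dict[str, Union[float, bool]]:
    """Turn raw counters into the three rates plus a fallback alert flag."""
    total = counts.get("total", 0)
    if total == 0:
        return {
            "fallback_rate": 0.0,
            "anaphora_rate": 0.0,
            "coercion_rate": 0.0,
            "fallback_alert": False,
        }
    fallback_rate = counts.get("fallback_count", 0) / total
    return {
        "fallback_rate": fallback_rate,
        "anaphora_rate": counts.get("anaphora_hit_count", 0) / total,
        "coercion_rate": counts.get("coercion_fallback_count", 0) / total,
        "fallback_alert": fallback_rate > FALLBACK_RATE_THRESHOLD,
    }


report = routing_health(
    {"total": 200, "fallback_count": 60, "anaphora_hit_count": 50, "coercion_fallback_count": 4}
)
print(report["fallback_rate"], report["fallback_alert"])  # 0.3 True
```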
What Langfuse shows that the log doesn't
With update_trace_metadata on every node, Langfuse aggregates into a single trace:
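An illustrative picture of that aggregation, not a real trace export: every field name and value here is hypothetical, and the node names other than router_node and critique_node are invented for the example. Each node contributes a small dict, and the trace ends up holding the union.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical per-node contributions, in execution order.
node_metadata: List[Tuple[str, Dict[str, Any]]] = [
    ("anaphora_node", {"anaphora_resolved": True}),
    ("router_node", {"scope": "catalog_search"}),
    ("generation_node", {"refine_attempts": 2}),
    ("critique_node", {"critique_intervened": True, "violation_type": "missing_disclaimer"}),
]

# What the single trace holds once every node has reported.
trace_metadata: Dict[str, Any] = {}
for _node_name, fields in node_metadata:
    trace_metadata.update(fields)

for key, value in trace_metadata.items():
    print(f"{key}: {value}")
```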
In 5 seconds of execution you know: the anaphora was resolved, DSPy routed to catalog search, Refine needed 2 attempts, and critique injected a missing disclaimer.
Without that trace, you have 5 logs in different files, with no direct correlation, no timeline.
Important items
- capture_input=False on every span that touches user messages.
- capture_output=False on every span that returns graph state.
- What goes to Langfuse is decision metadata - scope, selected tool, latency, compliance flags - not the data itself.
- User data belongs in the legal archive. Decision metadata belongs in the observability system. Mixing the two creates privacy issues and inflates Langfuse cost without adding debugging value.
Next week: how the generation module uses runtime self-correction with a reward function, and the latency trade-off it creates.