Logs Are Not Enough
You can have perfect logs and still have no idea what your system is doing.
We’ve become obsessed with logging. Structured logs, log levels, distributed tracing, retention policies, indexing strategies. Teams spend weeks building robust logging infrastructure, confident that comprehensive observability will follow. But when an incident hits and you’re staring at thousands of chronological entries, each one technically correct, you realize the truth: you have perfect records of everything that happened and no understanding of why any of it mattered.
Logs Without Narrative
Logs answer what happened, not why it mattered. They are atomized snapshots, chronologically ordered but context-poor, each event recorded in isolation from the larger decision flow it belongs to. When something goes wrong, engineers reconstruct meaning manually: scrolling through timestamps, correlating request IDs across services, building mental models of causation from scattered breadcrumbs. This happens under time pressure, often at 3 a.m., with users waiting and stakeholders demanding answers. That’s not observability; that’s archaeology. You’re excavating the past from fragments, hoping you’ve assembled the pieces correctly, never quite certain you haven’t missed the critical connection that explains why the system did what it did.
The fundamental problem is correlation versus causation. Systems produce oceans of data but no storyline connecting events to outcomes. You can see that service A called service B, that a database query returned, that a cache was hit, that a message was published. But you can’t see why service A decided to call service B instead of service C, why that specific query was chosen, whether the cache hit was expected or a surprise, whether that message represented success or fallback logic. Without intent, logs become noise. The more complex the system grows, the more services involved, the more layers of abstraction, the less useful raw logs become. Volume isn’t the answer to confusion; volume is usually the cause.
From Events to Decisions
Decision logs transform systems from recorders into narrators. Instead of logging that an event occurred, you log why it occurred and what alternatives were considered. “We chose path X because condition Y was true and threshold Z was exceeded.” “Rule A applied to this request, but Rule B was skipped because field C was missing.” “This record was ignored due to failing validation on condition D, which we added last month to handle malformed upstream data.” Each entry doesn’t just timestamp an action; it captures the reasoning behind it, the context that mattered, and the branches not taken. The system explains its choices in real time rather than leaving engineers to infer intent from circumstantial evidence.
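To make that concrete, here is a minimal sketch of what a decision log entry could look like in Python. Everything in it is hypothetical and invented for illustration: the `Decision` record, the `decide_review_path` function, the rule names, and the 500.0 threshold. It is one possible shape, not a prescribed format.

```python
import json
import logging
from dataclasses import asdict, dataclass, field
from typing import Any

logger = logging.getLogger("decisions")


@dataclass
class Decision:
    """One decision log entry: what was chosen, why, and what was not taken."""
    action: str                                   # the path the system chose
    reason: str                                   # the condition that drove the choice
    inputs: dict[str, Any] = field(default_factory=dict)
    rules_applied: list[str] = field(default_factory=list)
    rules_skipped: dict[str, str] = field(default_factory=dict)  # rule -> why skipped

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def decide_review_path(order: dict[str, Any], amount_threshold: float = 500.0) -> Decision:
    """Hypothetical routing rule that records its reasoning alongside its result."""
    amount = order["amount"]
    if amount > amount_threshold:
        decision = Decision(
            action="manual_review",
            reason=f"amount {amount} exceeded threshold {amount_threshold}",
            inputs=dict(order),
        )
        decision.rules_applied.append("high_amount_review")
    else:
        decision = Decision(
            action="auto_approve",
            reason=f"amount {amount} within threshold {amount_threshold}",
            inputs=dict(order),
        )
        decision.rules_applied.append("default_auto_approve")

    # Record the branch not taken, and why, instead of skipping it silently.
    if "country" in order:
        decision.rules_applied.append("country_risk_check")
    else:
        decision.rules_skipped["country_risk_check"] = "field 'country' missing"

    logger.info(decision.to_json())
    return decision
```

Calling `decide_review_path({"amount": 750.0})` yields an entry whose `action` is `manual_review` and whose `rules_skipped` explains that the country check never ran because the field was absent. The design choice worth noting is that skipped rules are first-class data: the absence of a check is recorded as deliberately as its presence.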
This fundamentally changes operations. Debugging becomes inspection rather than inference. Instead of building theories about what might have happened and searching logs for supporting evidence, you read the system’s own explanation of what it decided and why. Onboarding accelerates because new engineers can follow decision trails through actual incidents, seeing not just what broke but why the system responded the way it did, which safeguards triggered, which didn’t, and what that reveals about system assumptions. Trust increases throughout the organization because the system shows its work. Product managers can see why certain requests were rejected. Support teams can explain to users why their action produced a specific result. Everyone stops relying on the engineers who built the system to interpret its behavior, because the system interprets itself.
When the Logs Lied by Telling the Truth
We had a payment processing system that occasionally failed to charge customers despite successful API responses from the payment provider. The logs were meticulous. Every request was logged, every response captured, and timestamps were synchronized across services. When we investigated, the logs showed exactly what we expected: request sent, 200 response received, payment ID stored, confirmation event published. Every line was accurate. Every timestamp made sense. The logs told the truth. But they didn’t tell us that the payment provider’s 200 response actually indicated a pending state that required later verification, not immediate success. They didn’t explain that our system interpreted any non-error response as completion. They didn’t show that our understanding of what a 200 meant was an unstated assumption, and one the payment provider didn’t share.
We added a single field to our decision logs: payment_status_interpretation. It recorded not just the status code, but what our system interpreted that code to mean and which downstream actions resulted from that interpretation. The next time it happened, the answer was immediate. The log showed “received status 200, interpreted as SUCCESS based on rule: any_ok_response_is_complete, skipped verification step because assumption: success_is_final.” The entire incident collapsed from mystery to obvious misconfiguration. We hadn’t needed more logs; we’d needed one explanatory field that made our assumption visible. The system was finally telling the story, not just recording events.
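A rough sketch of how such a field might be populated is below. Only the field name payment_status_interpretation and the two labels any_ok_response_is_complete and success_is_final come from the incident above. The function name, the nested keys, the 2xx range check, the failure branch, and the action names are assumptions made for illustration, and the code deliberately reproduces the flawed interpretation so that the log exposes it.

```python
import json
import logging
from typing import Any

logger = logging.getLogger("payments.decisions")


def interpret_provider_status(status_code: int) -> dict[str, Any]:
    """Log the status code together with what the system took it to mean."""
    if 200 <= status_code < 300:
        # The flawed rule from the incident: any OK response counts as done.
        interpreted_as = "SUCCESS"
        rule = "any_ok_response_is_complete"
        skipped_steps = {"verification": "assumption: success_is_final"}
        downstream_actions = ["store_payment_id", "publish_confirmation"]
    else:
        # Hypothetical failure branch, included only to complete the sketch.
        interpreted_as = "FAILURE"
        rule = "non_ok_response_is_error"
        skipped_steps = {}
        downstream_actions = ["mark_payment_failed"]

    entry = {
        "event": "payment_provider_response",
        "status_code": status_code,
        "payment_status_interpretation": {
            "interpreted_as": interpreted_as,
            "rule": rule,
            "skipped_steps": skipped_steps,
            "downstream_actions": downstream_actions,
        },
    }
    logger.info(json.dumps(entry))
    return entry
```

With an entry like this, a reader sees next to the 200 itself that verification was skipped and which assumption justified skipping it, which is exactly the connection the original logs could not make.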
Key Takeaways
Logs capture activity, not reasoning; they show what happened without explaining why it mattered
Explanation requires intent, not volume; more data without more context is just more noise
Systems should narrate their own behavior, turning operators from archaeologists into readers
If a new engineer joined your team tomorrow, could your system explain itself to them? Or would they need to find the person who built it three years ago?
What’s the incident that your logs captured perfectly but couldn’t explain? Reply and let me know.


