observability debt is a debt
why most platforms have more telemetry than they can afford, fewer answers than they need, and a quiet monthly bill nobody is measuring.
pramod
co-founder
Every platform has an observability strategy. In most of them, the strategy is "turn everything on and hope." The bills that result are not shocking because they arrive gradually — one instrumented library at a time, one cardinality explosion at a time. By the time anyone looks, the observability spend is a measurable fraction of the compute spend, the incident response is still slow, and the dashboard that would answer the current question is in someone's private folder on a tool that nobody has logged into for a month.
Observability debt is a debt. It compounds. It is discharged not by adding more instrumentation but by deleting instrumentation that is not earning its cost.
the three things telemetry is for
There are, in the end, three reasons to emit a piece of telemetry:
- Alerting — you will page someone when this metric crosses a threshold.
- Debugging — during an incident, you will need this data to find root cause.
- Analysis — you will run a query against this data to make a product or capacity decision.
Every piece of telemetry in your system should map to one of those three. If a metric, log line, or span exists for none of the three reasons, it is cost without return. Delete it.
This sounds obvious and it is, and yet every client we audit is paying for at least one of the following: metrics that no alert references and no dashboard uses; structured logs whose fields are never queried; traces whose spans are sampled at 100% for services nobody investigates. The bill is real and the answer is the same in every case: the telemetry was added for a reason that has since evaporated, and nobody took it out.
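The audit described above can be sketched mechanically. This is a minimal illustration, assuming you can export three inventories from your observability vendor: all metric names, the names referenced by alerts, and the names referenced by dashboards or saved queries. All names and data here are hypothetical.

```python
# Telemetry that maps to none of the three purposes (alerting,
# debugging, analysis) is cost without return. A set difference
# finds the candidates for deletion.

def unused_telemetry(all_metrics, alert_refs, dashboard_refs, query_refs):
    """Return metric names that no alert, dashboard, or query uses."""
    used = set(alert_refs) | set(dashboard_refs) | set(query_refs)
    return sorted(set(all_metrics) - used)

# Hypothetical inventories exported from a vendor.
metrics = {"http.requests", "http.errors", "cache.hits", "legacy.queue.depth"}
alerts = {"http.errors"}
dashboards = {"http.requests", "cache.hits"}
queries = set()

print(unused_telemetry(metrics, alerts, dashboards, queries))
# → ['legacy.queue.depth']
```

The output is a candidate list, not a kill list: a human still confirms the reason for each metric has genuinely evaporated before deleting it.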
the cardinality question
The single largest multiplier on observability cost is cardinality — how many unique tag combinations exist on a metric, which is the product of each tag's unique values. A metric tagged with service, region, and status_code has a modest cardinality of a few thousand series. Add user_id and you have multiplied that by the number of users — six or seven orders of magnitude for a consumer product. The cost follows.
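The multiplication is worth seeing as arithmetic. A back-of-envelope sketch, with illustrative counts that are assumptions rather than measurements from any real system:

```python
# Worst-case unique time series = product of per-tag cardinalities.
from math import prod

def series_count(tag_cardinalities):
    """Estimate unique series for a metric from its tags' unique values."""
    return prod(tag_cardinalities.values())

base = {"service": 40, "region": 6, "status_code": 8}
print(series_count(base))                            # 1920 series: modest

exploded = dict(base, user_id=5_000_000)             # add one tag...
print(series_count(exploded))                        # 9,600,000,000 series
print(series_count(exploded) // series_count(base))  # multiplier: 5,000,000
```

The worst case rarely materialises in full, since not every combination occurs, but billing is driven by the combinations that do, and a user-scoped tag makes nearly every one of them occur.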
Most teams know this. Most teams have, nonetheless, at least one high-cardinality metric they are paying for. The question to ask, quarterly, is: for each of our top ten metrics by cost, what is the highest-cardinality tag, and does any alert or dashboard depend on that tag? If the answer to the second question is no, remove the tag. If the answer is yes but the use case is "occasional debugging," move the tag to a span attribute — traces handle high-cardinality data cheaply and selectively. If the answer is yes and the use case is "always-on," accept the cost and move on.
You will be surprised how many cases fall into the first bucket.
the debugging bias
Engineers instinctively over-instrument for debugging. The reasoning is simple: when something breaks in production, you cannot go back and add the log line you wish you had. You must have logged it already. So engineers log everything, all the time, for the 1% of the time they will need it.
The correct architecture — which we have successfully moved clients to three times in the last year — is a two-tier observability model. The first tier is always-on, low-cost, and unambitious: structured logs at INFO or above, RED metrics (rate, errors, duration) per service boundary, and head-sampled traces at 1%. The second tier is on-demand: when a developer wants to investigate, they can trigger full trace sampling for a service, or DEBUG logs for a particular user session, for a bounded time window. The platform provides this as a self-service tool.
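The second tier hinges on one mechanism: a sampling rate that can be raised for one service, for a bounded window, and then falls back to the baseline on its own. A toy sketch of that mechanism — not a real tracing SDK, and the class and method names are invented for illustration:

```python
# Two-tier sampling: a 1% always-on baseline, with self-service,
# time-bounded boosts to full sampling for a single service.
import random
import time

class TwoTierSampler:
    def __init__(self, baseline=0.01):
        self.baseline = baseline
        self.overrides = {}  # service -> (rate, expiry timestamp)

    def boost(self, service, rate=1.0, duration_s=900):
        """Self-service deep dive: full sampling for a bounded window."""
        self.overrides[service] = (rate, time.time() + duration_s)

    def rate_for(self, service):
        rate, expiry = self.overrides.get(service, (self.baseline, 0.0))
        if time.time() < expiry:
            return rate
        self.overrides.pop(service, None)  # window elapsed: revert quietly
        return self.baseline

    def should_sample(self, service):
        return random.random() < self.rate_for(service)

sampler = TwoTierSampler()
print(sampler.rate_for("checkout"))        # 0.01 — the quiet baseline
sampler.boost("checkout", duration_s=900)  # investigate for 15 minutes
print(sampler.rate_for("checkout"))        # 1.0 while the window is open
```

The bounded window is the important design choice: nobody has to remember to turn the deep dive off, so the expensive state cannot become the permanent state.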
The result is a quieter baseline and a deeper dive when you need it. The total cost drops, sometimes dramatically. The incident response gets faster, because the signal-to-noise ratio is better. Nothing is lost, because the lower-volume baseline still captures what matters, and the on-demand deep-dive captures everything else precisely when you want it.
the dashboard graveyard
Every organisation with more than two years of observability history has a dashboard graveyard: hundreds, sometimes thousands, of dashboards created for a specific incident, a specific review, or a specific engineer's investigation. They remain, indexed, searchable, largely unused. Each one has a link from somewhere; none of them is a source of truth.
Dashboards should be versioned, owned, and pruned. Ownership in practice means each dashboard has a clearly named owner and a date at which it will be deleted unless renewed. Most dashboards should not survive their first year. A small number — the handful that every on-call engineer actually uses — should be treated as product assets, maintained like any other shared code, and tested against live data the way you test any other production system.
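The renewal rule can be enforced with very little machinery, assuming each dashboard carries an owner and a renewal date in its metadata. The field names here are hypothetical; most vendors let you attach tags or descriptions that can hold them.

```python
# Expiry check for dashboards: anything past its renewal date is
# listed for its owner to renew or delete.
from datetime import date

def expired(dashboards, today):
    """Return names of dashboards past their renew-by date."""
    return [d["name"] for d in dashboards if d["renew_by"] < today]

# Hypothetical inventory.
dashboards = [
    {"name": "oncall-overview", "owner": "platform", "renew_by": date(2030, 1, 1)},
    {"name": "incident-4711-scratch", "owner": "asha", "renew_by": date(2024, 3, 1)},
]
print(expired(dashboards, date(2025, 6, 1)))
# → ['incident-4711-scratch']
```

Run on a schedule and posted to the owning team's channel, this turns pruning from a heroic cleanup into a standing default.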
the quarterly ritual
The discipline that works, the one we install in every engagement, is a quarterly observability review. It takes a team of two half a day. The agenda is:
- top ten metrics by cost — what are they used for?
- top ten high-cardinality tags — do any alerts depend on them?
- list of dashboards unused in the last 90 days — which get deleted?
- total observability spend as a fraction of compute spend — is it trending up or down?
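The agenda above can be produced as a report rather than assembled by hand, assuming you can export per-metric cost and dashboard last-viewed dates from your vendor. All numbers and names below are illustrative assumptions.

```python
# Quarterly observability review: top metrics by cost, dashboards
# unused in the stale window, and spend as a fraction of compute.
from datetime import date, timedelta

def quarterly_report(metric_costs, dashboard_views, obs_spend,
                     compute_spend, today, stale_days=90):
    top = sorted(metric_costs, key=metric_costs.get, reverse=True)[:10]
    cutoff = today - timedelta(days=stale_days)
    stale = sorted(n for n, last in dashboard_views.items() if last < cutoff)
    return {
        "top_metrics_by_cost": top,
        "stale_dashboards": stale,
        "obs_fraction_of_compute": round(obs_spend / compute_spend, 3),
    }

report = quarterly_report(
    metric_costs={"http.requests": 900, "user.events": 4200, "cache.hits": 120},
    dashboard_views={"oncall-overview": date(2025, 5, 20),
                     "migration-2023": date(2023, 11, 2)},
    obs_spend=18_000, compute_spend=150_000, today=date(2025, 6, 1),
)
print(report["top_metrics_by_cost"][0])   # 'user.events'
print(report["stale_dashboards"])         # ['migration-2023']
print(report["obs_fraction_of_compute"])  # 0.12
```

The report answers the first, third, and fourth agenda items; the cardinality question still needs a human looking at each metric's tags.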
Teams that do this typically save 30% to 50% on their observability bill within a year. More importantly, they keep their mental model of what they instrument aligned with what they use. The goal is not to spend as little as possible; it is to spend only on telemetry that earns its keep.
Observability is a product. Pay for it like one.