Ask HN: How do you keep system context from rotting over time?
12 points by kennethops 8 hours ago | 13 comments
Former SRE here, looking for advice.

I know there are a lot of tools focused on root cause analysis after things break. Cool, but that’s not what’s wearing me down. What actually hurts is the constant context switching while trying to understand how a system fits together, what depends on what, and what changed recently.

As systems grow, this feels like it gets exponentially harder. Add logs and now you’ve created a million new events to reason about. Add another database and suddenly you’re dealing with subnet constraints or a DB choice that’s expensive as hell, and no one noticed until later. Everyone knows their slice, but the full picture lives nowhere, so bit rot just keeps creeping in.

This feels even worse now that AI agents are pushing large amounts of code and config changes quickly. Things move faster, but shared understanding falls behind even faster.

I’m honestly stuck on how people handle this well in practice. For folks dealing with real production systems, what’s actually helped? Diagrams, docs, tribal knowledge, tooling, something else? Where does it break down?





Every company I've worked with has started with an ER diagram for their primary database (and insisted on it, in fact), only to give up when it became too complex. You quickly hit the point where no one can understand it.

You then eventually have that same pattern happen with services, where people give up on mapping the full thing out as well.

What I've done for my current team is to list the "downstream" services, what we use them for, who to contact, etc. It only goes one level deep, but it's something that someone can read quickly during an incident.
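A one-level-deep list like that can live as a small data file that renders to plain text. A minimal sketch, assuming a hypothetical `Downstream` record; the service names, purposes, and contacts below are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downstream:
    """One directly-called service: what it is used for and who owns it."""
    name: str
    used_for: str
    contact: str


# Hypothetical entries; replace with your team's real downstream services.
DOWNSTREAMS = [
    Downstream("search-index", "product search results", "#search-team"),
    Downstream("payments-api", "charging cards at checkout", "#payments-oncall"),
]


def render_runbook(deps):
    """Render the list as plain text, sorted by name for quick scanning."""
    lines = ["Downstream services (one level deep):"]
    for d in sorted(deps, key=lambda d: d.name):
        lines.append(f"- {d.name}: {d.used_for} (contact: {d.contact})")
    return "\n".join(lines)


print(render_runbook(DOWNSTREAMS))
```

Keeping it to one level is the point: each team maintains only the edges it owns, so nobody has to keep a full graph current.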


Sorry, what is an ER diagram?

First hits on DDG, anonymous Google, Bing

ERD/ Entity Relationship Diagram https://www.lucidchart.com/pages/er-diagrams

ERM / Entity-Relationship Model https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_mo...

(same-same, ERD is the more common acronym)


That is what I figured it would be, but you never know anymore with the number of acronyms thrown around nowadays.

Monitoring tools (APM) will show dependencies (web calls, databases, etc) and should contain things like deployment markers and trend lines.

All of those endpoints should be documented in an environment variable or similar as well.

The breakdown is when you don't instrument the same tooling everywhere.

Documentation is generally out of date by the time you finish writing it so I don't really bother with much detail there.
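If endpoints really do live in environment variables, a rough dependency list can be derived from them instead of written by hand. A minimal sketch, assuming a hypothetical convention where every endpoint variable ends in `_URL`; the variable names and hosts in `SAMPLE_ENV` are invented:

```python
from urllib.parse import urlparse


def endpoints_from_env(env, suffix="_URL"):
    """Map each '<NAME>_URL' variable to the host it points at."""
    found = {}
    for key, value in env.items():
        if not key.endswith(suffix):
            continue
        host = urlparse(value).hostname  # None if the value has no host part
        if host:
            found[key[: -len(suffix)].lower()] = host
    return dict(sorted(found.items()))


# Hypothetical environment; in practice you would pass os.environ.
SAMPLE_ENV = {
    "PAYMENTS_URL": "https://payments.internal.example:8443/v1",
    "DATABASE_URL": "postgres://app@db.internal.example:5432/app",
    "LOG_LEVEL": "info",
}

for name, host in endpoints_from_env(SAMPLE_ENV).items():
    print(f"{name} -> {host}")
```

This only works as well as the naming convention is followed, which is the same "instrument it the same way everywhere" caveat as above.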


This has been my experience as well. IMO documentation is one of the few areas where AI can be good today.

I don't think OP is looking for context from the AI model perspective but rather a process for maintaining a mental picture of the system architecture and managing complexity.

I'm not sure I've seen any good vendors but I remember seeing a reverse devops tool posted a few days ago that would reverse engineer your VMs into Ansible code. If that got extended to your entire environment, that would almost be an auto documenting process.


Context rots when it stays implicit. Make the system model an explicit artifact with fixed inputs and checkpoints, then update it on purpose. Otherwise you keep rebuilding the same picture from scratch.
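One way to read "checkpoints" is a hard review deadline on the model document. A minimal sketch, assuming the last review date is recorded somewhere you can read it; the function name and the 30-day default are arbitrary choices, not anything from a real tool:

```python
from datetime import date, timedelta


def review_overdue(last_reviewed, today, max_age_days=30):
    """Return True when the system model has gone unreviewed for too long."""
    return today - last_reviewed > timedelta(days=max_age_days)


# Hypothetical dates for illustration.
if review_overdue(date(2024, 1, 1), date(2024, 3, 1)):
    print("System model review is overdue; update it before the next change lands.")
```

Wired into CI, a check like this turns "update it on purpose" into something that fails loudly instead of drifting quietly.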

I'm honestly looking for both. I haven't found a vendor that does this well for just humans, nor am I seeing something that can expose this context, read-only, to all of the AI coding agents.

I will check that tool out.


If the system is so good, why constantly change the context?

One thing that’s clearly helped: using CLAUDE.md / agent instructions as de facto architecture docs. If the agent needs to understand system boundaries to work effectively, those docs actually get maintained.

But how do you ensure the .md file is able to see all of the details of the infra?

You don't. It's a map of intent, not infra state: what exists, why, and what talks to what. Live state still needs IaC and observability. The .md captures the 'why' that Terraform can't.
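The two can still be compared cheaply. A minimal sketch, assuming you can produce one list of component names mentioned in the .md and another from your IaC or observability tooling; how you extract either list is left out, and all names below are hypothetical:

```python
def intent_drift(documented, deployed):
    """Compare components named in the doc with components that actually exist."""
    documented, deployed = set(documented), set(deployed)
    return {
        "undocumented": sorted(deployed - documented),  # exists, but no 'why' written down
        "stale_in_doc": sorted(documented - deployed),  # described, but no longer deployed
    }


# Hypothetical inputs for illustration.
doc_components = ["api-gateway", "orders-db", "legacy-cron"]
live_components = ["api-gateway", "orders-db", "events-queue"]

print(intent_drift(doc_components, live_components))
```

It says nothing about whether the written 'why' is still true, only that the map and the territory list the same things.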


