
How to document microservices: patterns that actually work at scale

Monolith docs scale badly. Microservices docs fail differently — fragmented, ownerless, constantly stale. Here are the patterns that actually hold up.

Yorjander Hernandez
Co-founder & CTO · 7 min read

Your monolith had one README. Forty microservices later, you have forty READMEs — and none of them match what the service actually does.

This isn’t a discipline problem. It’s a structural one. Documentation practices built for monoliths collapse at microservice scale because the assumptions break: one codebase, one team, one release cycle. Microservices shatter all three. The result is documentation that’s fragmented across repos, owned by nobody in particular, and perpetually a few deploys behind reality.

The teams that solve this don’t write more documentation. They change the architecture of how documentation is produced and maintained. Here’s what that looks like.

Why microservices break traditional documentation

A monolith is comprehensible from one place. You can clone the repo, read the top-level docs, and build a mental model of the whole system. Documenting a monolith is a bounded problem: hard to do well, but the scope is known.

A microservices architecture is a distributed system, and its documentation inherits the failure modes of distributed systems: inconsistency between copies, partitions between teams and repos, and no single source of truth.

The specific breakdowns look like this:

  • Context fragmentation. The payment-service knows it calls fraud-detection, but fraud-detection’s docs don’t say who calls it or why. You need to read both repos, and the relationship between them lives in neither.
  • Stale contracts. One team upgrades their service’s response schema. The calling service’s docs still describe the old fields. The error shows up in production, not in reviews.
  • Dead READMEs. A service changes ownership. The new team doesn’t know the README exists, or doesn’t know it’s wrong. The old team doesn’t know there are readers relying on it.
  • No discoverability. A new engineer needs to find the service that handles email notifications. There’s no catalog, no index. They ask in Slack. Two senior engineers give different answers.

A 2023 survey of 500 engineering teams by DX found that lack of documentation discoverability was the second-biggest drag on developer productivity, behind only slow CI pipelines. At 20+ services, it becomes the first.

What actually needs documenting — the four layers

Not everything needs the same depth. Teams that try to fully document every service at every layer end up documenting nothing because the overhead is too high. The pattern that scales is recognizing four distinct layers, each with different owners and different update cadences.

Layer 1: Service overview. One page per service. What it does, who owns it, what its external dependencies are, and what SLOs it carries. This belongs in the repo’s README.md and gets updated whenever the service’s scope changes — which is infrequent.

Layer 2: API reference. Every endpoint, event schema, or gRPC method the service exposes. This should be auto-generated from code or spec, not hand-written. Hand-written API docs drift from reality within weeks, and nobody notices. Auto-generated docs can only fall behind when generation fails, and a failing build gets noticed immediately.

Layer 3: Dependency map. Which services this service calls, which data stores it owns, which message queues it publishes to or consumes from. This is the layer most teams skip, and it’s the layer that causes the most incidents.

Layer 4: Runbook. What to do when the service is on fire. Common failure modes, alert definitions, escalation paths, links to dashboards. This is the only layer that engineers will definitely read under pressure, so it needs to be correct.

# Example: service-manifest.yaml — a lightweight Layer 1 document
service:
  name: payment-processor
  owner: payments-team
  slack: "#payments-eng"
  slo:
    availability: 99.95%
    p99_latency_ms: 200

dependencies:
  calls:
    - fraud-detection-service
    - ledger-service
  publishes_to:
    - payments.completed (kafka)
  consumes_from:
    - orders.placed (kafka)
  data_stores:
    - postgres: payments-db
    - redis: payment-sessions

runbook: docs/runbook.md
api_spec: openapi/payment-processor.yaml

This manifest is machine-readable and human-readable. It’s checked into the repo. It’s the input to a service catalog. It’s the truth.
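To make "machine-readable" concrete, here is a minimal sketch of the kind of check a catalog pipeline might run on each manifest. It assumes the YAML has already been parsed into a Python dict by whatever YAML library you use. The function name `validate_manifest` and the list of required fields are illustrative assumptions, not part of any standard.

```python
# Hypothetical required fields, mirroring the example manifest above.
REQUIRED_PATHS = [
    ("service", "name"),
    ("service", "owner"),
    ("service", "slo"),
    ("dependencies",),
    ("runbook",),
    ("api_spec",),
]


def validate_manifest(manifest: dict) -> list[str]:
    """Return the dotted paths of required fields that are missing or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node = manifest
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if node in (None, "", [], {}):
            missing.append(".".join(path))
    return missing


# A manifest that forgot to point at its API spec.
example = {
    "service": {
        "name": "payment-processor",
        "owner": "payments-team",
        "slo": {"availability": "99.95%", "p99_latency_ms": 200},
    },
    "dependencies": {"calls": ["fraud-detection-service", "ledger-service"]},
    "runbook": "docs/runbook.md",
}

problems = validate_manifest(example)  # ["api_spec"]
```

Run in CI, a non-empty result would fail the build, so an incomplete manifest never reaches the catalog.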

The ownership problem

At 5 services, “the team” owns the docs. At 25 services with 8 squads, “the team” means nobody. Undefined ownership is the single biggest reason microservices documentation fails — not tooling, not effort, not writing skill.

The pattern that works is explicit, code-adjacent ownership. Each service has a single named team in a CODEOWNERS-style manifest, and that team is responsible for Layers 1 and 4. Layers 2 and 3 are generated — no team needs to own them because no human writes them.

❌ The ownership anti-pattern:

# README.md (last edited 14 months ago)
## Owners
"The platform team"

✅ The ownership pattern that holds:

# service-manifest.yaml
owner:
  team: payments-eng
  primary: [email protected]
  secondary: [email protected]
  on_call_rotation: PagerDuty-Payments

The difference isn’t pedantic. When a service is misbehaving at 2 AM, “the platform team” is not an actionable contact. A named team with a PagerDuty rotation is.
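One way to enforce the difference is to reject free-text owners outright. The sketch below assumes the owner block has been parsed into Python. The field names follow the manifest example above, and `ownership_problems` is a hypothetical helper name.

```python
def ownership_problems(owner) -> list[str]:
    """List the reasons an owner entry is not actionable during an incident."""
    if not isinstance(owner, dict):
        return ["owner must be a structured block, not free text"]
    problems = []
    for field in ("team", "primary", "on_call_rotation"):
        value = owner.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing {field}")
    return problems


vague = ownership_problems("The platform team")
partial = ownership_problems({"team": "payments-eng"})
```

Here `vague` holds the single free-text complaint, and `partial` reports that `primary` and `on_call_rotation` are missing.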

For the writing responsibility itself, the rule that scales is: the person who ships the feature writes the doc update in the same PR. Not afterward. Not as a follow-up ticket. In the PR. Doc updates that leave the PR become lost work — the follow-up ticket is deprioritized, the context is forgotten, and the doc never gets written.

Teams that enforce this report a 60–70% reduction in documentation debt accumulation. The cost is a 10–15 minute overhead per feature PR. That’s a favorable trade.
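The "in the PR" rule is easy to check mechanically. The following is a sketch only: it assumes you can get the list of changed paths from your CI system (for example from `git diff --name-only` against the base branch), and the doc locations and code suffixes are assumptions you would adapt to your own repo layout.

```python
from pathlib import PurePosixPath

# Assumed layout: docs live in README.md, service-manifest.yaml, docs/ or openapi/.
DOC_PREFIXES = ("docs/", "openapi/")
DOC_FILES = {"README.md", "service-manifest.yaml"}
CODE_SUFFIXES = {".py", ".go", ".java", ".ts"}


def is_doc(path: str) -> bool:
    return path in DOC_FILES or path.startswith(DOC_PREFIXES)


def is_code(path: str) -> bool:
    return not is_doc(path) and PurePosixPath(path).suffix in CODE_SUFFIXES


def pr_needs_doc_update(changed_files: list[str]) -> bool:
    """True when a PR changes code but touches no documentation file."""
    touches_code = any(is_code(p) for p in changed_files)
    touches_docs = any(is_doc(p) for p in changed_files)
    return touches_code and not touches_docs


code_only = pr_needs_doc_update(["src/handler.py"])
code_and_docs = pr_needs_doc_update(["src/handler.py", "docs/runbook.md"])
```

A check like this is deliberately blunt: it cannot tell whether the doc change is any good, only that someone opened the file. Most teams would pair it with an explicit opt-out label for pure refactors.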

Patterns that work at scale

Three structural patterns hold up once you’re past 15 services.

Service catalogs. A single indexed page (or internal site) that lists every service, its owner, its SLO, its API, and its status. Built from the service-manifest.yaml files in each repo, aggregated by a CI pipeline. Engineers can search by team, by dependency, by data store. New engineers can orient themselves without asking anyone.

Tools like Backstage provide this out of the box, but they require effort to populate. The manifests-in-repo approach described above makes population automatic — the manifest is already in the repo, the catalog just reads it.
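As an illustration of what "the catalog just reads it" can mean, here is a small sketch that aggregates parsed manifests into two indexes: services by owner, and a reverse dependency map answering "who calls this service?". The manifests are assumed to be dicts shaped like the example earlier. `refund-service` and `risk-team` are made-up names for the demo.

```python
from collections import defaultdict


def build_catalog(manifests: list[dict]) -> dict:
    """Index services by owner and compute who calls each service."""
    by_owner = defaultdict(list)
    called_by = defaultdict(list)
    for m in manifests:
        name = m["service"]["name"]
        by_owner[m["service"]["owner"]].append(name)
        # Invert the "calls" edges so each callee can see its callers.
        for callee in m.get("dependencies", {}).get("calls", []):
            called_by[callee].append(name)
    return {"by_owner": dict(by_owner), "called_by": dict(called_by)}


manifests = [
    {
        "service": {"name": "payment-processor", "owner": "payments-team"},
        "dependencies": {"calls": ["fraud-detection-service", "ledger-service"]},
    },
    {
        "service": {"name": "refund-service", "owner": "payments-team"},
        "dependencies": {"calls": ["ledger-service"]},
    },
    {"service": {"name": "fraud-detection-service", "owner": "risk-team"}},
]

catalog = build_catalog(manifests)
callers_of_ledger = catalog["called_by"]["ledger-service"]
```

The reverse map is the piece that no single repo can hold on its own: `fraud-detection-service` never declares its callers, yet the catalog can list them.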

Auto-generated API reference. If your service has an OpenAPI spec or a gRPC proto file, your API reference should be generated from it. Treat hand-written API reference docs the same way you treat hand-written SQL migrations: possible, but not the default.

# openapi/payment-processor.yaml (excerpt)
paths:
  /v1/payments:
    post:
      operationId: createPayment
      summary: Initiate a new payment
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/PaymentRequest'
      responses:
        '202':
          description: Payment accepted and queued for processing
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaymentResponse'
        '422':
          description: Validation error — malformed request body

A spec like this can drive your API reference page, your SDK client, and your mock server. One source, three artifacts. When the endpoint changes, you update the spec and regenerate all three together.
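In practice you would use an existing OpenAPI renderer, but the principle fits in a few lines. This sketch assumes the spec has been parsed into a dict and renders a bare-bones text reference from it. `render_reference` is a hypothetical name, and the example spec is a trimmed copy of the excerpt above.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def render_reference(spec: dict) -> str:
    """Render a minimal plain-text API reference from an OpenAPI-style dict."""
    lines = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            lines.append(f"{method.upper()} {path}: {op.get('summary', '')}")
            for status, resp in sorted(op.get("responses", {}).items()):
                lines.append(f"  {status}: {resp.get('description', '')}")
    return "\n".join(lines)


spec = {
    "paths": {
        "/v1/payments": {
            "post": {
                "summary": "Initiate a new payment",
                "responses": {
                    "202": {"description": "Payment accepted and queued for processing"},
                    "422": {"description": "Validation error"},
                },
            }
        }
    }
}

reference = render_reference(spec)
```

Because the page is a pure function of the spec, there is nothing for a human to forget to update.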

Architecture Decision Records (ADRs). Microservices accumulate decisions — why this service was split from that one, why Kafka instead of RabbitMQ, why this schema looks the way it does. Without ADRs, that context lives in the heads of the engineers who were in the room and leaves when they do.

ADRs are short (one page), checked into the repo, and never deleted — only superseded. They answer the question “why does this exist” that no amount of code comments can answer.
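"Never deleted, only superseded" implies a chain you can follow. The sketch below assumes each ADR carries an optional `superseded_by` field. That field name and the ADR identifiers are illustrative assumptions, since ADR templates vary.

```python
def current_adr(adrs: dict[str, dict], start: str) -> str:
    """Follow superseded_by links from an ADR to the one currently in force."""
    seen = set()
    adr_id = start
    while adrs[adr_id].get("superseded_by"):
        if adr_id in seen:
            raise ValueError(f"supersession cycle at {adr_id}")
        seen.add(adr_id)
        adr_id = adrs[adr_id]["superseded_by"]
    return adr_id


# Hypothetical records: the queueing decision was revisited once.
adrs = {
    "ADR-0003": {"title": "Use RabbitMQ for events", "superseded_by": "ADR-0011"},
    "ADR-0011": {"title": "Use Kafka for events"},
}

in_force = current_adr(adrs, "ADR-0003")  # "ADR-0011"
```

A reader who lands on the old record gets pointed to the live decision, while the original reasoning stays in the repo.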

Keeping everything in sync when services change constantly

The honest answer is that you can’t keep documentation in sync through effort alone. Effort-based sync requires everyone to remember, have time, and care equally. None of those hold at scale.

The systems that work make sync automatic or make staleness visible.

Automatic sync means documentation is generated from the source of truth (code, spec, manifest) at build time. Layer 2 (API reference) and Layer 3 (dependency maps) should be entirely in this category. If a schema changes without updating the spec, the build fails — and failing builds get fixed.
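A build-time drift check can be as simple as a set difference. This sketch assumes you can extract two sets of `(method, path)` pairs: one from the spec and one from the routes your framework actually registers. How you collect them depends on your stack, and the `GET` route in the example is made up.

```python
Operation = tuple[str, str]  # (HTTP method, path)


def spec_drift(spec_ops: set[Operation], served_ops: set[Operation]) -> dict:
    """Compare documented operations against the ones the service really serves."""
    return {
        "undocumented": sorted(served_ops - spec_ops),
        "unimplemented": sorted(spec_ops - served_ops),
    }


spec_ops = {("POST", "/v1/payments")}
served_ops = {("POST", "/v1/payments"), ("GET", "/v1/payments/{id}")}

drift = spec_drift(spec_ops, served_ops)
build_should_fail = any(drift.values())
```

Here the undocumented `GET` route is enough to fail the build, which is the point: the spec gets fixed before the change ships, not after a consumer trips over it.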

Visible staleness means surfacing when a human-written doc hasn’t been touched in N days while the code it describes has changed. This is the mechanism for Layer 1 (service overview) and Layer 4 (runbooks) — you can’t auto-generate a runbook, but you can alert the owning team that the runbook hasn’t been updated in 90 days while the service has had 40 deploys.
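The staleness rule combines two signals: the doc is old, and the service has changed since. A minimal sketch, assuming you can look up the doc's last-modified date and a deploy count from your own tooling. The default thresholds are placeholders, not recommendations.

```python
from datetime import date


def is_stale(
    doc_last_updated: date,
    deploys_since_update: int,
    today: date,
    max_age_days: int = 90,
    min_deploys: int = 1,
) -> bool:
    """Flag a doc that is old AND whose service has changed since it was written."""
    age_days = (today - doc_last_updated).days
    return age_days >= max_age_days and deploys_since_update >= min_deploys


today = date(2024, 4, 15)
old_and_changed = is_stale(date(2024, 1, 1), 40, today)    # True
old_but_untouched = is_stale(date(2024, 1, 1), 0, today)   # False
recent = is_stale(date(2024, 4, 1), 40, today)             # False
```

Requiring both conditions matters: an old runbook for a service that hasn't changed is not stale, it's just stable.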

❌ What doesn’t work:

  • A quarterly “docs review” calendar event that everyone skips
  • A docs debt backlog that grows and never shrinks
  • Asking engineers to “keep docs updated” with no tooling support

✅ What does work:

  • CI checks that fail if the OpenAPI spec doesn’t match the deployed routes
  • Automated PRs opened against the owning team when staleness thresholds are crossed
  • Doc coverage reports in the same dashboard as test coverage reports

The cultural shift is treating documentation staleness as a build signal, not a process reminder. A broken test fails the CI pipeline. A stale runbook should create the same friction.

How GitDocAI handles documentation at the service level

GitDocAI is built around the assumption that documentation has to live next to code to stay current. For microservices teams, this means connecting each repo’s manifest and spec files to a shared documentation layer that updates on every push.

When a service’s OpenAPI spec changes, the reference page updates automatically. When a service-manifest.yaml changes, the service catalog reflects it within minutes. When a runbook hasn’t been touched in 60 days but the owning team has shipped 30 PRs, an alert surfaces in their review queue.

The pattern GitDocAI enforces is the one described in this post: Layers 2 and 3 are generated, Layers 1 and 4 are owned and prompted. No team has to build the pipeline themselves — it’s the default.

If you’re managing documentation across more than five services and things are starting to drift, the architecture described here scales to hundreds. The tooling to support it is at gitdoc.ai — start with a free workspace and connect your first repo in under five minutes.
