Foundational Architecture

How I Evaluate and Stabilize Distributed Systems

A practical framework for understanding, diagnosing, and hardening distributed systems before they scale.

Opening

Most distributed systems don’t fail because of obvious architectural mistakes.

They fail at the boundaries:

  • Retries
  • State ownership
  • Workflow recovery

These issues don’t show up in diagrams. They show up under load, during partial failure, and in the gaps between services.


The Lens

When evaluating a system, the goal is not to understand how it is described but how it behaves under real conditions.

Where I look first:

  • Retry loops that amplify failure (or back-pressure)
  • Unclear or duplicated state across services
  • Tight coupling that makes small changes risky
  • Workflows that don’t recover cleanly from partial failure

Most systems grow brittle as they grow harder to understand.

The first objective is to make the system legible:

  • Clarify service boundaries
  • Make state ownership explicit
  • Understand how workflows behave under failure

Once a system is legible, it becomes possible to reason about it.
Once it is understandable, it can be stabilized.


Designing Complex Distributed Systems

One of the more complex systems I’ve architected was the event-driven architecture behind Salesforce Einstein Bots. It evolved into a large-scale, multi-service workflow system processing billions of events per month, and is documented in Building a Fault-Tolerant Data Pipeline for Chatbots.

At its core, the system coordinated:

  • Real-time user interactions
  • Asynchronous processing pipelines
  • External API integrations
  • Long-running, stateful workflows

Kafka-backed pipelines were used for real-time aggregation and signal processing, providing visibility into system behavior as it evolved.

The challenge was not just throughput; it was maintaining correctness and recoverability across long-running, stateful interactions.

Two key tradeoffs shaped the system.

Consistency vs Availability

The system favored eventual consistency with strong idempotency guarantees:

  • Explicit retry strategies
  • Deduplication at service boundaries
  • Clearly defined state transitions

This improved resilience under failure, but required discipline to avoid inconsistent outcomes.
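As a rough illustration of how deduplication and explicit state transitions fit together, here is a minimal in-memory sketch. The state names, the `ALLOWED` table, and the `Workflow` class are hypothetical and chosen only for this example; a production system would keep the seen-event set and the state in a durable store.

```python
from dataclasses import dataclass, field

# Hypothetical workflow states and the transitions allowed between them.
ALLOWED = {
    "received": {"processing"},
    "processing": {"completed", "failed"},
    "failed": {"processing"},  # a failed step may be retried
    "completed": set(),        # terminal state
}


@dataclass
class Workflow:
    state: str = "received"
    seen_event_ids: set = field(default_factory=set)

    def apply(self, event_id: str, target: str) -> bool:
        """Apply a transition at most once per event_id.

        Returns False when the event was already applied (duplicate delivery),
        and raises ValueError when the transition is not in the ALLOWED table.
        """
        if event_id in self.seen_event_ids:
            return False  # duplicate: safe no-op
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
        self.seen_event_ids.add(event_id)
        return True
```

Because the duplicate check runs before the transition check, a redelivered event is ignored even after the workflow has moved on, which is what makes a retry safe.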

Centralization vs Decomposition

Early designs centralized too much orchestration, creating bottlenecks and limiting flexibility.

The system evolved toward event-driven decomposition:

  • Services owned their own state
  • Behavior emerged through event flow rather than centralized coordination

This improved scalability and separation of concerns, but increased the need for:

  • Well-defined contracts (Avro/Protobuf schemas, idempotency guarantees at service boundaries)
  • Observability
  • End-to-end traceability across services

In practice, the hardest part was not building the initial pipeline, but making the system observable and operable at scale. Once bottlenecks are visible, the system becomes tractable.
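One common way to get end-to-end traceability in an event-driven system is to carry identifiers in a shared event envelope. The sketch below is an assumption about what such an envelope could look like, not the schema used in the system described here; `Envelope`, `derive`, and every field name are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    event_type: str
    payload: dict
    schema_version: int = 1
    # Unique per event; usable as a deduplication key by consumers.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Shared by every event in the same end-to-end workflow.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # The event that directly caused this one, if any.
    causation_id: Optional[str] = None


def derive(parent: Envelope, event_type: str, payload: dict) -> Envelope:
    """Create a follow-up event that stays on the parent's trace."""
    return Envelope(
        event_type=event_type,
        payload=payload,
        correlation_id=parent.correlation_id,
        causation_id=parent.event_id,
    )
```

With this shape, filtering logs on one `correlation_id` reconstructs a full workflow, and following `causation_id` links shows which event triggered which.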


What to Keep vs What to Change

When evaluating an existing system, I group components into three categories:

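Category                             | Characteristics                                                       | Action
-------------------------------------|-----------------------------------------------------------------------|-----------------------
Stable and well-bounded              | Clear ownership, predictable behavior                                 | Keep
Works today but structurally fragile | Retry, state, or boundary issues                                      | Refactor incrementally
Fundamentally misaligned             | Unclear ownership, hidden coupling, or over-centralized orchestration | Redesign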

Large rewrites are avoided unless the system is actively blocking progress.

Most systems improve significantly through:

  • Tightening contracts
  • Making state explicit
  • Restructuring workflow boundaries

The priority is always:

  1. Make the system understandable
  2. Make it resilient under real conditions

Once those are in place, the system becomes much easier to evolve.


Failure Modes: When Retries Amplify Instability

A common failure pattern in event-driven systems is retry amplification.

In one system, a downstream service began intermittently failing under load.
Upstream services retried aggressively, re-enqueuing the same events multiple times.

Because idempotency was not consistently enforced:

  • Duplicate events were processed as new work
  • State became inconsistent
  • Downstream systems were overloaded
  • System latency degraded

The issue was not the initial failure but that the retry strategy amplified it.

The failure mode becomes clear when visualized:

[Figure: Retry Amplification vs Controlled Retry]

Fundamental Improvements

  • Idempotency enforced at all service boundaries
  • Staged retry queues introduced with increasing delay intervals
  • Dead-letter queues added to isolate persistent failures
  • Retry policies tightened with exponential backoff and circuit breakers
  • State transitions made explicit so workflows could resume safely
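The staged retry queues and dead-letter queue from the list above can be sketched as a small routing function. The queue names, the delays, and `route_failure` are hypothetical values picked for illustration, not the configuration of the system described here.

```python
from typing import Optional, Tuple

# Hypothetical retry stages: (queue name, delay in seconds before redelivery).
RETRY_STAGES = [("retry-5s", 5), ("retry-1m", 60), ("retry-10m", 600)]
DEAD_LETTER = "dead-letter"


def route_failure(attempt: int) -> Tuple[str, Optional[int]]:
    """Choose where an event goes after its `attempt`-th failure (1-based).

    Each failure moves the event to a slower queue; once the stages are
    exhausted, the event is parked in the dead-letter queue with no delay.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if attempt <= len(RETRY_STAGES):
        return RETRY_STAGES[attempt - 1]
    return DEAD_LETTER, None
```

Because the number of stages is fixed, every event has a bounded number of redeliveries, so a persistently failing event ends up isolated instead of circulating indefinitely.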

Observability was improved through:

  • End-to-end tracing
  • Correlation across services
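For the retry-policy item above (exponential backoff and circuit breakers), here is a minimal sketch of both ideas. It assumes a "full jitter" backoff variant and a simple consecutive-failure breaker; the class name, thresholds, and defaults are hypothetical, and a real deployment would more likely use an established resilience library.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Stops calls after repeated failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Open: reject until the cool-down has elapsed, then permit a trial call.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each downstream call, report the outcome with `record_success()` or `record_failure()`, and sleep for `backoff_delay(attempt)` between attempts. The jitter spreads retries out so that many clients do not retry in lockstep.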

The lesson

Retries are not a reliability mechanism on their own.

Without clear state management and idempotent boundaries, retries amplify failure instead of containing it.


Event-Driven Boundaries: Publisher vs Consumer

Validation, retries, and schema enforcement are shared responsibilities in event-driven systems, but not equally.

Publisher responsibilities

  • Enforce schema and contract correctness
  • Validate required fields (business logic)
  • Prevent malformed events
  • Handle short-lived transport retries
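A publisher-side contract check might look like the sketch below. It is a hand-rolled stand-in for what a schema registry with Avro or Protobuf would normally enforce; `REQUIRED_FIELDS` and `validate_for_publish` are hypothetical names, and the field set is an assumption made for the example.

```python
# Hypothetical contract: required field name -> expected Python type.
REQUIRED_FIELDS = {
    "event_id": str,
    "event_type": str,
    "occurred_at": str,  # assumed to be an ISO 8601 timestamp string
    "payload": dict,
}


def validate_for_publish(event: dict) -> list:
    """Return a list of contract violations; an empty list means safe to publish."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors
```

Rejecting the event before it reaches the broker keeps malformed messages out of every downstream consumer, instead of forcing each one to defend against them separately.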

Consumer responsibilities

  • Enforce semantic validation
  • Ensure idempotency
  • Control retry behavior
  • Manage recovery logic

A message can be structurally valid but still be:

  • Duplicated
  • Out of order
  • No longer relevant

The principle:

Publishers protect the contract. Consumers protect correctness.

Robust systems require both.
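On the consumer side, the three cases above (duplicated, out of order, no longer relevant) can each be checked before any work is done. This is a minimal in-memory sketch under stated assumptions: events carry a per-entity `sequence` number and an epoch-seconds `timestamp`, and `ConsumerGuard` and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConsumerGuard:
    max_age_seconds: float = 300.0
    seen_ids: set = field(default_factory=set)
    # entity_id -> highest sequence number already applied
    last_sequence: dict = field(default_factory=dict)

    def check(self, event: dict, now: float) -> str:
        """Classify a structurally valid event before processing it."""
        if event["event_id"] in self.seen_ids:
            return "duplicate"
        if now - event["timestamp"] > self.max_age_seconds:
            return "stale"
        if event["sequence"] <= self.last_sequence.get(event["entity_id"], -1):
            return "out_of_order"
        # Accepted: remember it so later redeliveries and older events are rejected.
        self.seen_ids.add(event["event_id"])
        self.last_sequence[event["entity_id"]] = event["sequence"]
        return "accept"
```

What to do with a rejected event is a separate policy decision. Dropping duplicates is usually safe, while stale or out-of-order events may need to be logged or routed for review.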


Documenting Systems That Weren’t Built by You

In some cases, I’ve designed and built the system from the ground up, as with the event-driven architecture behind Einstein Bots.

But more often, the challenge is stepping into an existing system: one with history, partial understanding, and evolving constraints.

In those situations, documentation is not the starting point. Reconstruction is.

The system is examined from multiple perspectives:

  • Code structure
  • Service boundaries
  • Runtime behavior
  • Infrastructure
  • Data flow
  • End-to-end workflows

The goal is to understand:

  • Who owns state
  • Where sources of truth live
  • Where synchronous vs asynchronous transitions occur

Documentation is built in layers:

System map

High-level view of services, integrations, queues, and boundaries

Workflow traces

End-to-end flows including:

  • Happy paths
  • Failure paths
  • Retries
  • Recovery behavior

Contracts

  • Event schemas
  • Payloads
  • Ownership boundaries
  • Validation responsibilities

Operational model

  • Deployment strategy
  • Environment differences
  • Observability
  • Failure modes

Documentation is validated by walking through it with engineers familiar with the system, because documentation is only useful if it helps someone reason about the system under pressure.


Selected Work

The patterns described here are drawn from production systems built and operated at scale.

  • Building a Fault-Tolerant Data Pipeline for Chatbots
    Event-driven pipeline design for conversational systems, including retry strategies, failure isolation, and large-scale processing.
    Salesforce Engineering

  • Building a Scalable Event Pipeline with Heroku and Salesforce
    Distributed event pipeline architecture, focusing on scalability, reliability, and system boundary design.
    Salesforce Engineering


Closing

The goal is not to describe the system, but to make the system and its design legible enough that a team can evolve it safely.


If you’re working on a system like this and want a second set of eyes before scaling, reach out.