Foundational Architecture

How I Evaluate and Stabilize Distributed Systems

A practical framework for understanding, diagnosing, and hardening distributed systems before they scale.

April 3, 2026

Opening

Most distributed systems don’t fail because of obvious architectural mistakes.

They fail at the boundaries:

Retries
State ownership
Workflow recovery

These issues don’t show up in diagrams. They show up under load, during partial failure, and in the gaps between services.

The Lens

When evaluating a system, the goal is not to understand how it is described but how it behaves under real conditions.

Where I look first:

Retry loops that amplify failure (or back-pressure)
Unclear or duplicated state across services
Tight coupling that makes small changes risky
Workflows that don’t recover cleanly from partial failure

Most systems become brittle as they become harder to understand.

The first objective is to make the system legible:

Clarify service boundaries
Make state ownership explicit
Understand how workflows behave under failure

Once a system is legible, it becomes possible to reason about it.
Once it is understandable, it can be stabilized.

Designing Complex Distributed Systems

One of the more complex systems I’ve architected was the event-driven architecture behind Salesforce Einstein Bots. It evolved into a large-scale, multi-service workflow system processing billions of events per month, and is documented in Building a Fault-Tolerant Data Pipeline for Chatbots.

At its core, the system coordinated:

Real-time user interactions
Asynchronous processing pipelines
External API integrations
Long-running, stateful workflows

Kafka-backed pipelines were used for real-time aggregation and signal processing, providing visibility into system behavior as it evolved.

The challenge was not just throughput; it was maintaining correctness and recoverability across long-running, stateful interactions.

Two key tradeoffs shaped the system.

Consistency vs Availability

The system favored eventual consistency with strong idempotency guarantees:

Explicit retry strategies
Deduplication at service boundaries
Clearly defined state transitions

This improved resilience under failure, but required discipline to avoid inconsistent outcomes.
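As a minimal sketch of deduplication at a service boundary, assuming every event carries a publisher-assigned unique ID: the names here (`Event`, `event_id`, `IdempotentConsumer`) are hypothetical, and the in-memory set stands in for whatever durable store a real service would use.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str    # hypothetical unique ID assigned by the publisher
    session_id: str
    delta: int


@dataclass
class IdempotentConsumer:
    """Applies each event at most once, keyed by event_id."""
    seen: set = field(default_factory=set)      # stand-in for a durable dedup store
    totals: dict = field(default_factory=dict)  # state owned by this service

    def handle(self, event: Event) -> bool:
        """Return True if the event changed state, False if it was a duplicate."""
        if event.event_id in self.seen:
            return False
        current = self.totals.get(event.session_id, 0)
        self.totals[event.session_id] = current + event.delta
        self.seen.add(event.event_id)
        return True


consumer = IdempotentConsumer()
first = Event("evt-1", "session-a", 5)
results = [
    consumer.handle(first),                            # applied
    consumer.handle(first),                            # redelivery, ignored
    consumer.handle(Event("evt-2", "session-a", 3)),   # applied
]
```

In a real service the state update and the dedup record would need to be committed together; otherwise a crash between the two steps reintroduces the inconsistency the check is meant to prevent.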

Centralization vs Decomposition

Early designs centralized too much orchestration, creating bottlenecks and limiting flexibility.

The system evolved toward event-driven decomposition:

Services owned their own state
Behavior emerged through event flow rather than centralized coordination

This improved scalability and separation of concerns, but increased the need for:

Well-defined contracts (Avro/Protobuf schemas, idempotency guarantees at service boundaries)
Observability
End-to-end traceability across services

In practice, the hardest part was not building the initial pipeline, but making the system observable and operable at scale. Once bottlenecks are visible, the system becomes tractable.

What to Keep vs What to Change

When evaluating an existing system, I group components into three categories:

Category	Characteristics	Action
Stable and well-bounded	Clear ownership, predictable behavior	Keep
Works today but structurally fragile	Retry, state, or boundary issues	Refactor incrementally
Fundamentally misaligned	Unclear ownership, hidden coupling, or over-centralized orchestration	Redesign

Large rewrites are avoided unless the system is actively blocking progress.

Most systems improve significantly through:

Tightening contracts
Making state explicit
Restructuring workflow boundaries

The priority is always:

Make the system understandable
Make it resilient under real conditions

Once those are in place, the system becomes much easier to evolve.

Failure Modes: When Retries Amplify Instability

A common failure pattern in event-driven systems is retry amplification.

In one system, a downstream service began intermittently failing under load.
Upstream services retried aggressively, re-enqueuing the same events multiple times.

Because idempotency was not consistently enforced:

Duplicate events were processed as new work
State became inconsistent
Downstream systems were overloaded
System latency degraded

The issue was not the initial failure but that the retry strategy amplified it.
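A rough way to see the amplification is to count worst-case calls. The sketch below assumes each layer above the failing service retries independently and exhausts its attempts; the numbers are illustrative, not taken from the incident described above.

```python
def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the failing service for one original request,
    if every retrying layer above it exhausts its attempts independently."""
    return attempts_per_layer ** layers


# Illustrative: 1 initial try + 3 retries = 4 attempts per layer.
table = {layers: worst_case_attempts(4, layers) for layers in (1, 2, 3)}
```

With four attempts per layer, one retrying layer sends 4 calls downstream, two layers send 16, and three send 64, all for a single original request.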

The failure mode becomes clear when visualized:

[Figure: Retry Amplification vs Controlled Retry]

Fundamental Improvements

Idempotency enforced at all service boundaries
Staged retry queues introduced with increasing delay intervals
Dead-letter queues added to isolate persistent failures
Retry policies tightened with exponential backoff and circuit breakers
State transitions made explicit so workflows could resume safely
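The retry-related items above can be sketched roughly as follows. The queue names, delays, and function names are hypothetical; appropriate values depend on the workload and the broker in use.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stage names and delays (seconds), in increasing order.
RETRY_STAGES = [("retry-5s", 5), ("retry-30s", 30), ("retry-5m", 300)]
DEAD_LETTER = "dead-letter"


@dataclass(frozen=True)
class Routing:
    queue: str
    delay_seconds: Optional[int]  # None for the dead-letter queue


def route_failure(failed_attempts: int) -> Routing:
    """Pick the next queue for an event that has failed `failed_attempts` times."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    index = failed_attempts - 1
    if index < len(RETRY_STAGES):
        name, delay = RETRY_STAGES[index]
        return Routing(name, delay)
    # Stages exhausted: isolate the event instead of retrying forever.
    return Routing(DEAD_LETTER, None)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff for in-process retries: base * 2^attempt, capped."""
    return min(cap, base * 2 ** attempt)


routes = [route_failure(n).queue for n in range(1, 5)]
```

Staging retries in separate queues keeps slow, repeatedly failing events out of the main path, and the dead-letter queue gives persistent failures a place to be inspected rather than recycled.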

Observability was improved through:

End-to-end tracing
Correlation across services
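One common way to get correlation across services is to carry a single ID through every hop. This is a minimal sketch under that assumption; the header name and helper functions are hypothetical rather than part of any specific tracing library.

```python
import uuid

CORRELATION_HEADER = "x-correlation-id"  # hypothetical header name


def with_correlation(headers: dict) -> dict:
    """Return a copy of headers carrying a correlation ID, reusing one if present."""
    out = dict(headers)
    if CORRELATION_HEADER not in out:
        out[CORRELATION_HEADER] = str(uuid.uuid4())
    return out


def log_line(service: str, message: str, headers: dict) -> str:
    """Format a log line that can be joined across services by correlation ID."""
    return f"correlation_id={headers[CORRELATION_HEADER]} service={service} msg={message}"


inbound = with_correlation({})          # the edge service mints the ID
downstream = with_correlation(inbound)  # later hops reuse it unchanged
lines = [
    log_line("ingest", "accepted", inbound),
    log_line("worker", "processed", downstream),
]
```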

The lesson

Retries are not a reliability mechanism on their own.

Without clear state management and idempotent boundaries, retries amplify failure instead of containing it.
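To illustrate what explicit state transitions can look like, here is a small sketch with a hypothetical workflow. The states and allowed transitions are assumptions chosen for the example, not the states of any system described here.

```python
from enum import Enum


class State(Enum):
    RECEIVED = "received"
    CALLING_API = "calling_api"
    COMPLETED = "completed"
    FAILED = "failed"


# Every legal move is listed; anything else is rejected.
ALLOWED = {
    State.RECEIVED: {State.CALLING_API},
    State.CALLING_API: {State.COMPLETED, State.FAILED},
    State.FAILED: {State.CALLING_API},  # retry path
    State.COMPLETED: set(),             # terminal
}


class InvalidTransition(Exception):
    pass


def transition(current: State, target: State) -> State:
    """Return the new state, treating a replayed transition as a no-op."""
    if target == current:
        return current  # a redelivered event must not move the workflow
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Because a retried event either repeats the current state or is rejected, a workflow can be resumed from its last recorded state without guessing what already happened.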

Event-Driven Boundaries: Publisher vs Consumer

Validation, retries, and schema enforcement are shared responsibilities in event-driven systems, but they are not shared equally.

Publisher responsibilities

Enforce schema and contract correctness
Validate required fields (business logic)
Prevent malformed events
Handle short-lived transport retries
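A publisher-side contract check might look roughly like this. It uses plain dictionaries as a stand-in for an Avro or Protobuf schema, and the field names and function names are hypothetical.

```python
import json

# Hypothetical contract: field name -> required Python type.
REQUIRED_FIELDS = {"event_id": str, "session_id": str, "occurred_at": str}


def validate(event: dict) -> list:
    """Return a list of contract violations; an empty list means publishable."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors


def serialize_for_publish(event: dict) -> bytes:
    """Refuse to serialize a malformed event so it never reaches the broker."""
    errors = validate(event)
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps(event, sort_keys=True).encode("utf-8")


problems = validate({"event_id": "e1", "session_id": 7})
```

Rejecting the event before it is published keeps malformed data out of every downstream consumer, which is cheaper than having each consumer defend against it separately.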

Consumer responsibilities

Enforce semantic validation
Ensure idempotency
Control retry behavior
Manage recovery logic

A message can be structurally valid but still be:

Duplicated
Out of order
No longer relevant

The principle:

Publishers protect the contract. Consumers protect correctness.

Robust systems require both.
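On the consumer side, the three cases named above (duplicated, out of order, no longer relevant) can be checked before any work is done. This sketch assumes each message carries a per-key sequence number and an expiry time; both fields, and the `SemanticGate` name, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str
    key: str           # e.g. a conversation or entity ID
    sequence: int      # hypothetical per-key sequence number
    expires_at: float  # hypothetical expiry, in seconds since the epoch


@dataclass
class SemanticGate:
    """Classifies structurally valid messages before they are processed."""
    seen_ids: set = field(default_factory=set)
    last_sequence: dict = field(default_factory=dict)

    def classify(self, msg: Message, now: float) -> str:
        if msg.message_id in self.seen_ids:
            return "duplicate"
        if now >= msg.expires_at:
            return "expired"
        if msg.sequence <= self.last_sequence.get(msg.key, -1):
            return "out_of_order"
        self.seen_ids.add(msg.message_id)
        self.last_sequence[msg.key] = msg.sequence
        return "accept"


gate = SemanticGate()
outcomes = [
    gate.classify(Message("m1", "c1", 1, 100.0), now=10.0),
    gate.classify(Message("m1", "c1", 1, 100.0), now=11.0),
    gate.classify(Message("m0", "c1", 0, 100.0), now=12.0),
    gate.classify(Message("m2", "c1", 2, 5.0), now=13.0),
]
```

What to do with each non-accepted outcome (drop, park, or reorder) is a separate policy decision; the sketch only makes the classification explicit.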

Documenting Systems That Weren’t Built by You

In some cases, I’ve designed and built the system from the ground up, as with the event-driven architecture behind Einstein Bots.

But more often, the challenge is stepping into an existing system: one with history, partial understanding, and evolving constraints.

In those situations, documentation is not the starting point. Reconstruction is.

The system is examined from multiple perspectives:

Code structure
Service boundaries
Runtime behavior
Infrastructure
Data flow
End-to-end workflows

The goal is to understand:

Who owns state
Where sources of truth live
Where synchronous vs asynchronous transitions occur

Documentation is built in layers:

System map

High-level view of services, integrations, queues, and boundaries

Workflow traces

End-to-end flows including:

Happy paths
Failure paths
Retries
Recovery behavior

Contracts

Event schemas
Payloads
Ownership boundaries
Validation responsibilities

Operational model

Deployment strategy
Environment differences
Observability
Failure modes

Documentation is validated by walking through it with engineers familiar with the system, because documentation is only useful if it helps someone reason about the system under pressure.

Selected Work

The patterns described here are drawn from production systems built and operated at scale.

Building a Fault-Tolerant Data Pipeline for Chatbots
Event-driven pipeline design for conversational systems, including retry strategies, failure isolation, and large-scale processing.
Salesforce Engineering

Building a Scalable Event Pipeline with Heroku and Salesforce
Distributed event pipeline architecture, focusing on scalability, reliability, and system boundary design.
Salesforce Engineering

Closing

The goal is not to describe the system, but to make the system and its design legible enough that a team can evolve it safely.

If you’re working on a system like this and want a second set of eyes before scaling, reach out.
