Finished by Roberto Vitillo

Understanding Distributed Systems

The core trade-offs of distributed systems, especially failure, replication, time, consensus, and why coordination always has a cost.

Started

26 May 2022

1 day span

Progress

344/344

100% complete

What I Learned

  • Distributed systems are defined by partial failure, uncertain timing, and coordination trade-offs.
  • Replication improves resilience and scale but introduces consistency and ordering problems.
  • System design gets better when trade-offs are made explicit instead of hidden behind abstractions.

What stayed with me

What I like about this book is that it stays close to the operational reality of distributed systems. It explains why things get hard once a program crosses machine boundaries: clocks drift, messages arrive late, nodes fail independently, and the system still has to make progress.

It is a good mental-model book. Instead of treating consensus, replication, and partition tolerance as isolated topics, it frames them as recurring consequences of the same basic fact: shared state across unreliable boundaries is expensive.

Notes I wanted to keep

  • Network calls are not function calls with worse latency. They have different failure semantics.
  • Every coordination mechanism buys safety by spending availability, latency, or complexity.
  • Wall-clock time is a weak source of truth in distributed systems: clocks drift between nodes, so timestamps alone cannot reliably order events.
  • Idempotency and retries are practical survival tools, not optional polish.
  • This complements Designing Data-Intensive Applications (second edition) as a shorter pass over the same family of trade-offs.
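
To make the notes on failure semantics, retries, and idempotency concrete for myself, here is a minimal Python sketch. It is not taken from the book. FlakyLedger, deposit_with_retries, and the replies_to_drop parameter are hypothetical names I made up for illustration; the "server" is an in-memory object that does the work and then loses the reply, which is the case a plain function call never has to handle.

```python
import uuid


class FlakyLedger:
    """Hypothetical in-memory 'server' that applies a deposit, then may lose the reply."""

    def __init__(self, replies_to_drop):
        self.balance = 0
        self._results = {}  # idempotency key -> result of the first successful apply
        self._replies_to_drop = replies_to_drop

    def deposit(self, key, amount):
        # Apply the operation at most once per key; later calls replay the stored result.
        if key not in self._results:
            self.balance += amount
            self._results[key] = self.balance
        # Simulate a lost reply: the work was done, but the caller cannot tell.
        if self._replies_to_drop > 0:
            self._replies_to_drop -= 1
            raise TimeoutError("reply lost")
        return self._results[key]


def deposit_with_retries(ledger, amount, max_attempts=5):
    # One key per logical operation, reused on every retry of that operation.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return ledger.deposit(key, amount)
        except TimeoutError:
            continue  # a real client would wait with backoff and jitter here
    raise RuntimeError("gave up after retries")


ledger = FlakyLedger(replies_to_drop=2)
final_balance = deposit_with_retries(ledger, 100)
print(final_balance, ledger.balance)
```

The client sends the deposit three times because the first two replies are lost, yet the balance moves only once. Without the key, each retry would have been applied as a new deposit, which is why retries are only safe when the operation is idempotent.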