Production System Design
Production system design focused on reliability, correctness, and failure handling across payments and booking lifecycles.
Overview
This case study documents how I designed production workflows where correctness is non-negotiable: payments, booking confirmation, and cancellation paths with direct financial consequences. The patterns described here come from GigWizard, spanning mobile clients, a serverless backend, a database, and a third-party payment provider—an environment where partial failures, retries, and out-of-order events are expected.
Rather than describing product features, this page focuses on the system design decisions that kept workflows safe under real-world behaviour: idempotent processing, explicit lifecycle states, transactional state transitions, and operational recovery paths.
Anchor Workflows
These workflows formed the correctness boundary of the system:
- Payment capture and release
  Once money moves, you cannot "retry casually." Duplicate processing is a critical failure.
- Booking confirmation lifecycle
  Double-confirmation, skipped states, or stale transitions lead to disputes and broken trust.
- Cancellations with penalties
  Incorrect timing or state can trigger incorrect refunds or payouts.
System Boundaries
The workflows crossed multiple independently failing components:
- Mobile clients — intermittent connectivity, retries, duplicate user actions
- Serverless backend — concurrency, retries, execution time limits
- Database — transactional updates, read-optimised models, eventual consistency edges
- Payment provider — asynchronous callbacks and webhooks with at-least-once delivery
- Email / push notifications — best-effort delivery, non-critical side effects
Payment & Booking Lifecycle Flow
Failure Modes Designed For
The system was designed assuming the following would occur regularly:
- Duplicate callbacks / webhooks from the payment provider
- Concurrent processing of the same event by multiple workers
- Out-of-order state transitions caused by delayed events
- Retries after partial failure, including execution retries after timeouts
- Timeouts after side effects, where core writes succeed but notifications fail
Core Invariants
These invariants defined correctness and were enforced throughout the system:
- Funds are released at most once
- A booking cannot be confirmed without a successful payment hold or capture
- State transitions are monotonic, never moving backwards
- Irreversible actions are idempotent, making duplicate events safe no-ops
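The monotonicity invariant can be made mechanical by ranking lifecycle states and rejecting any transition that would lower the rank. A minimal sketch; the state names and their ordering here are illustrative assumptions, not the production schema:

```typescript
// Assign each lifecycle state a rank; a valid transition must strictly
// increase it, so a booking can never move "backwards".
// State names and ordering are illustrative, not the production schema.
const stateRank: Record<string, number> = {
  requested: 0,
  payment_held: 1,
  confirmed: 2,
  completed: 3,
};

// Returns true only for strictly forward-moving transitions.
function isMonotonic(from: string, to: string): boolean {
  return stateRank[to] > stateRank[from];
}
```

Because the check is a pure comparison, it can run inside the same transaction that applies the transition, making the invariant enforceable rather than aspirational.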
Design Patterns Applied
1) Explicit lifecycle states
Critical workflows were modelled using explicit booking lifecycle states. Each transition was validated against the current state before execution.
Why it matters
- Prevents implicit state derived from scattered flags
- Makes invalid transitions unrepresentable
- Simplifies recovery and manual intervention
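An explicit lifecycle can be expressed as an allow-list of transitions, so anything not listed is unrepresentable by construction. The sketch below assumes illustrative state names, not the production schema:

```typescript
type BookingState =
  | "requested"     // client created a booking request
  | "payment_held"  // payment provider authorised a hold
  | "confirmed"     // booking locked in
  | "completed"     // service delivered
  | "cancelled";    // terminal

// Explicit allow-list of transitions; everything not listed is invalid.
const allowed: Record<BookingState, BookingState[]> = {
  requested: ["payment_held", "cancelled"],
  payment_held: ["confirmed", "cancelled"],
  confirmed: ["completed", "cancelled"],
  completed: [],
  cancelled: [],
};

// Validate against the current state before applying any transition.
function transition(current: BookingState, next: BookingState): BookingState {
  if (!allowed[current].includes(next)) {
    throw new Error(`invalid transition ${current} -> ${next}`);
  }
  return next;
}
```

Centralising the allow-list in one table is what makes manual recovery tractable: an operator can read the table, not reverse-engineer scattered flags.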
2) Idempotent processing for at-least-once events
All external callbacks and backend triggers were treated as at-least-once. Every irreversible step was guarded by an idempotency check combined with a transactional state update.
What this achieved
- Duplicate payment events became safe no-ops
- Concurrent workers could not apply the same transition twice
- Retries after partial failure became deterministic
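The idempotency guard can be sketched as a check on an external event identifier before any irreversible step runs. In the sketch below an in-memory `Set` stands in for the persisted marker; in a real deployment the marker must live in the database (as described under transactional state transitions), since an in-process set does not survive restarts or protect concurrent workers:

```typescript
// At-least-once delivery means the same webhook can arrive more than once.
// Guard each irreversible step with a check keyed on the external event id.
// The Set is a stand-in for a persisted, transactional marker.
const processedEvents = new Set<string>();

function handlePaymentEvent(
  eventId: string,
  apply: () => void, // the irreversible step (e.g. release funds)
): "applied" | "duplicate" {
  if (processedEvents.has(eventId)) {
    return "duplicate"; // safe no-op on redelivery
  }
  processedEvents.add(eventId);
  apply(); // runs at most once per event id
  return "applied";
}
```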
3) Transactional state transitions on critical paths
For money-moving workflows, state transitions and “already processed” markers were persisted atomically. This prevented partial updates such as recording a payment without advancing booking state.
Principle
- External systems are untrusted and repeatable
- The database is the single source of truth for irreversible actions
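Persisting the "already processed" marker and the state change together can be sketched as one atomic update: both writes commit or neither does. A `Map` stands in for the database here, and the function names and record shape are illustrative; in production this is a single database transaction (e.g. a Firestore `runTransaction`):

```typescript
interface Booking {
  state: string;
  processedEventIds: string[];
}

const db = new Map<string, Booking>(); // stand-in for the real database

// The duplicate check, the state-precondition check, and both writes
// happen as one atomic step, so there is no window where the payment
// is recorded without the booking state advancing (or vice versa).
function applyPaymentCapture(bookingId: string, eventId: string): boolean {
  const booking = db.get(bookingId);
  if (!booking) throw new Error(`unknown booking ${bookingId}`);
  if (booking.processedEventIds.includes(eventId)) return false; // duplicate: no-op
  if (booking.state !== "payment_held") return false;            // wrong state: reject
  db.set(bookingId, {
    state: "confirmed",
    processedEventIds: [...booking.processedEventIds, eventId],
  });
  return true;
}
```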
4) Separation of critical and non-critical paths
The critical path was limited to:
- payment validity
- booking state progression
- payout eligibility and release conditions
Non-critical side effects (emails, push notifications, receipts) were deliberately decoupled.
Why it matters
- Payment and booking integrity is not coupled to notification success
- Failures remain visible without corrupting core state
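The decoupling can be sketched as a handler where the core write must succeed while side effects are caught and logged. Function names are illustrative, not the production API:

```typescript
// Critical path: the booking write must succeed, or the whole request fails.
// Non-critical side effects run afterwards; their failures are logged and
// surfaced, but never allowed to block or roll back the core write.
async function confirmBooking(
  writeBooking: () => Promise<void>,      // critical: failure propagates
  sendNotification: () => Promise<void>,  // non-critical: best-effort
): Promise<"confirmed"> {
  await writeBooking();

  try {
    await sendNotification();
  } catch (err) {
    console.error("notification failed (non-critical):", err);
  }
  return "confirmed";
}
```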
5) Auditability and recovery
The system was designed for real-world operation, not just correctness in code:
- structured custom logs for traceability
- crash and exception reporting
- release checklists to reduce regression risk
- explicit, auditable manual recovery paths for rare edge cases
Trade-offs and What I Intentionally Did Not Build
To keep the system operable by a small team with predictable costs, I intentionally avoided:
- Exactly-once processing guarantees
  Chose at-least-once delivery with idempotency and monotonic state instead.
- Large centralised services
  Preferred smaller, purpose-specific workflows to reduce blast radius.
- Opaque auto-healing for critical flows
  Used explicit failure states and manual recovery to preserve correctness.
- Heavy internal tooling early
  Expanded operational tooling only as real production needs emerged.
Concrete Incident: Duplicate Payment Event Processing
Early in production, payment provider webhooks were delivered multiple times within seconds due to retries. In one case, two serverless workers processed the same event concurrently.
Although no funds were duplicated, this revealed a risk where booking state transitions could be applied more than once under race conditions.
I resolved this by strengthening idempotency using external event identifiers and transactional guards, ensuring irreversible operations could be applied at most once. After this change, duplicate events became deterministic no-ops and recovery from partial failures was predictable and observable.
Outcome
- Reliable payment and booking workflows under real production load
- Deterministic recovery from retries and partial failures
- Clear operational paths for intervention without state corruption
Tech
Flutter · Firebase / Firestore · Serverless Functions · Stripe · Push Notifications