Production System Design

Production system design focused on reliability, correctness, and failure handling across payments and booking lifecycles.

Overview

This case study documents how I designed production workflows where correctness is non-negotiable: payments, booking confirmation, and cancellation paths with direct financial consequences. The patterns described here come from GigWizard, spanning mobile clients, a serverless backend, a database, and a third-party payment provider—an environment where partial failures, retries, and out-of-order events are expected.

Rather than describing product features, this page focuses on the system design decisions that kept workflows safe under real-world behaviour: idempotent processing, explicit lifecycle states, transactional state transitions, and operational recovery paths.

Anchor Workflows

These workflows formed the correctness boundary of the system:

  • Payment capture and release
    Once money moves, you cannot “retry casually.” Duplicate processing is a critical failure.
  • Booking confirmation lifecycle
    Double-confirmation, skipped states, or stale transitions lead to disputes and broken trust.
  • Cancellations with penalties
    Incorrect timing or state can trigger incorrect refunds or payouts.

System Boundaries

The workflows crossed multiple independently failing components:

  • Mobile clients — intermittent connectivity, retries, duplicate user actions
  • Serverless backend — concurrency, retries, execution time limits
  • Database — transactional updates, read-optimised models, eventual consistency edges
  • Payment provider — asynchronous callbacks and webhooks with at-least-once delivery
  • Email / push notifications — best-effort delivery, non-critical side effects

Payment & Booking Lifecycle Flow

The end-to-end flow runs from payment capture through booking confirmation to cancellation or payout, with each transition gated by the invariants and failure-mode assumptions below.

Failure Modes Designed For

The system was designed assuming the following would occur regularly:

  • Duplicate callbacks / webhooks from the payment provider
  • Concurrent processing of the same event by multiple workers
  • Out-of-order state transitions caused by delayed events
  • Retries after partial failure, including execution retries after timeouts
  • Timeouts after side effects, where core writes succeed but notifications fail

Core Invariants

These invariants defined correctness and were enforced throughout the system:

  • Funds are released at most once
  • A booking cannot be confirmed without a successful payment hold or capture
  • State transitions are monotonic, never moving backwards
  • Irreversible actions are idempotent, making duplicate events safe no-ops
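The monotonicity invariant can be made mechanical by assigning each state a rank and rejecting any transition that does not move strictly forward. The state names below are illustrative, not the production schema:

```typescript
// Sketch: enforcing monotonic state transitions via a numeric ordering.
// State names are hypothetical stand-ins for the real booking lifecycle.
const STATE_ORDER: Record<string, number> = {
  pending: 0,
  paymentHeld: 1,
  confirmed: 2,
  completed: 3,
  cancelled: 3, // terminal alternative: same rank, so completed <-> cancelled is blocked
};

// A transition is valid only if it moves strictly forward in the ordering.
function isMonotonic(from: string, to: string): boolean {
  return STATE_ORDER[to] > STATE_ORDER[from];
}
```

Because the check is a pure comparison, a delayed or out-of-order event that would move a booking backwards is rejected before any write occurs.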

Design Patterns Applied

1) Explicit lifecycle states

Critical workflows were modelled using explicit booking lifecycle states. Each transition was validated against the current state before execution.

Why it matters

  • Prevents implicit state derived from scattered flags
  • Makes invalid transitions unrepresentable
  • Simplifies recovery and manual intervention
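A minimal sketch of this pattern, using an explicit allowed-transitions map validated before any transition executes (state names are illustrative, not the production schema):

```typescript
// Sketch: explicit lifecycle states with validated transitions.
type BookingState = "pending" | "paymentHeld" | "confirmed" | "completed" | "cancelled";

// The only legal moves; everything else is unrepresentable.
const ALLOWED: Record<BookingState, BookingState[]> = {
  pending: ["paymentHeld", "cancelled"],
  paymentHeld: ["confirmed", "cancelled"],
  confirmed: ["completed", "cancelled"],
  completed: [], // terminal
  cancelled: [], // terminal
};

// Validate against the current state before executing the transition.
function transition(current: BookingState, next: BookingState): BookingState {
  if (!ALLOWED[current].includes(next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

Centralising the map means recovery tooling and manual interventions go through the same validation as normal processing.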

2) Idempotent processing for at-least-once events

All external callbacks and backend triggers were treated as at-least-once. Every irreversible step was guarded by an idempotency check combined with a transactional state update.

What this achieved

  • Duplicate payment events became safe no-ops
  • Concurrent workers could not apply the same transition twice
  • Retries after partial failure became deterministic
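The guard can be sketched as follows. The event ID comes from the payment provider; the in-memory set here is a stand-in for the persisted idempotency record (in production the marker and the effect commit together, as covered in the next pattern):

```typescript
// Sketch: idempotent handling of at-least-once event delivery.
const processedEvents = new Set<string>();

// Returns true if this call performed the work, false if it was a duplicate.
function handlePaymentEvent(eventId: string, apply: () => void): boolean {
  if (processedEvents.has(eventId)) {
    return false; // duplicate delivery: safe no-op
  }
  processedEvents.add(eventId); // marker recorded alongside the irreversible step
  apply();
  return true;
}
```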

3) Transactional state transitions on critical paths

For money-moving workflows, state transitions and “already processed” markers were persisted atomically. This prevented partial updates such as recording a payment without advancing booking state.

Principle

  • External systems are untrusted and repeatable
  • The database is the single source of truth for irreversible actions
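A simplified sketch of the atomic write, with a plain object standing in for the database record; in production this would be a single database transaction (e.g. a Firestore transaction):

```typescript
// Sketch: the state transition and the "already processed" marker
// commit together -- either both change or neither does.
interface BookingRecord {
  state: string;
  processedEventIds: string[];
}

function applyEventAtomically(
  record: BookingRecord,
  eventId: string,
  nextState: string
): BookingRecord {
  if (record.processedEventIds.includes(eventId)) {
    return record; // duplicate: no partial update, no state change
  }
  // One commit covers both fields: the payment is recorded AND the
  // booking advances, or nothing happens.
  return {
    state: nextState,
    processedEventIds: [...record.processedEventIds, eventId],
  };
}
```

Returning a new record rather than mutating in place mirrors the transactional read-then-write shape: the check and the update see one consistent snapshot.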

4) Separation of critical and non-critical paths

The critical path was limited to:

  • payment validity
  • booking state progression
  • payout eligibility and release conditions

Non-critical side effects (emails, push notifications, receipts) were deliberately decoupled.

Why it matters

  • Payment and booking integrity is not coupled to notification success
  • Failures remain visible without corrupting core state
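The decoupling can be sketched as a hard boundary in the handler: the critical write must succeed or throw, while the side effect is wrapped so its failure is surfaced but never propagates (`commitState` and `sendEmail` are hypothetical stand-ins):

```typescript
// Sketch: notification failure stays off the critical path.
function confirmBooking(
  commitState: () => void,
  sendEmail: () => void
): { committed: boolean; notified: boolean } {
  commitState(); // critical path: must succeed or the whole handler fails
  let notified = true;
  try {
    sendEmail(); // best-effort: failure is recorded, never rolls back state
  } catch {
    notified = false; // visible for observability; core state intact
  }
  return { committed: true, notified };
}
```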

5) Auditability and recovery

The system was designed for real-world operation, not just correctness in code:

  • structured custom logs for traceability
  • crash and exception reporting
  • release checklists to reduce regression risk
  • explicit, auditable manual recovery paths for rare edge cases

Trade-offs and What I Intentionally Did Not Build

To keep the system operable by a small team with predictable costs, I intentionally avoided:

  • Exactly-once processing guarantees
    Chose at-least-once delivery with idempotency and monotonic state instead.
  • Large centralised services
    Preferred smaller, purpose-specific workflows to reduce blast radius.
  • Opaque auto-healing for critical flows
    Used explicit failure states and manual recovery to preserve correctness.
  • Heavy internal tooling early
    Expanded operational tooling only as real production needs emerged.

Concrete Incident: Duplicate Payment Event Processing

Early in production, payment provider webhooks were delivered multiple times within seconds due to retries. In one case, two serverless workers processed the same event concurrently.

Although no funds were duplicated, this revealed a risk where booking state transitions could be applied more than once under race conditions.

I resolved this by strengthening idempotency using external event identifiers and transactional guards, ensuring irreversible operations could be applied at most once. After this change, duplicate events became deterministic no-ops and recovery from partial failures was predictable and observable.

Outcome

  • Reliable payment and booking workflows under real production load
  • Deterministic recovery from retries and partial failures
  • Clear operational paths for intervention without state corruption

Tech

Flutter · Firebase / Firestore · Serverless Functions · Stripe · Push Notifications