Production System Design
Production system design focused on reliability, correctness, and failure handling across payments and booking lifecycles.
Overview
This case study documents how I designed production workflows where correctness is non-negotiable: payments, booking confirmation, and cancellation paths with direct financial consequences. The patterns described here come from GigWizard, spanning mobile clients, a serverless backend, a database, and a third-party payment provider—an environment where partial failures, retries, and out-of-order events are expected.
Rather than describing product features, this page focuses on the system design decisions that kept workflows safe under real-world behaviour: idempotent processing, explicit lifecycle states, transactional state transitions, and operational recovery paths.
Anchor Workflows
These workflows formed the correctness boundary of the system:
- Payment capture and release
  Once money moves, you cannot "retry casually." Duplicate processing is a critical failure.
- Booking confirmation lifecycle
  Double-confirmation, skipped states, or stale transitions lead to disputes and broken trust.
- Cancellations with penalties
  Incorrect timing or state can trigger incorrect refunds or payouts.
System Boundaries
The workflows crossed multiple independently failing components:
- Mobile clients — intermittent connectivity, retries, duplicate user actions
- Serverless backend — concurrency, retries, execution time limits
- Database — transactional updates, read-optimised models, eventual consistency edges
- Payment provider — asynchronous callbacks and webhooks with at-least-once delivery
- Email / push notifications — best-effort delivery, non-critical side effects
Payment & Booking Lifecycle Flow
Failure Modes Designed For
The system was designed assuming the following would occur regularly:
- Duplicate callbacks / webhooks from the payment provider
- Concurrent processing of the same event by multiple workers
- Out-of-order state transitions caused by delayed events
- Retries after partial failure, including execution retries after timeouts
- Timeouts after side effects, where core writes succeed but notifications fail
Core Invariants
These invariants defined correctness and were enforced throughout the system:
- Funds are released at most once
- A booking cannot be confirmed without a successful payment hold or capture
- State transitions are monotonic, never moving backwards
- Irreversible actions are idempotent, making duplicate events safe no-ops
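The monotonicity invariant can be made mechanical by ranking lifecycle states and rejecting any transition that would lower the rank. A minimal sketch; the state names and their ordering here are illustrative assumptions, not the production schema:

```typescript
// Assign each lifecycle state a rank; a valid transition must strictly
// increase it, so a booking can never move "backwards".
// State names and ordering are illustrative, not the production schema.
const stateRank: Record<string, number> = {
  requested: 0,
  payment_held: 1,
  confirmed: 2,
  completed: 3,
};

// Returns true only for strictly forward-moving transitions.
function isMonotonic(from: string, to: string): boolean {
  return stateRank[to] > stateRank[from];
}
```

Because the check is a pure comparison, it can run inside the same transaction that applies the transition, making the invariant enforceable rather than aspirational.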
Design Patterns Applied
1) Explicit lifecycle states
Critical workflows were modelled using explicit booking lifecycle states. Each transition was validated against the current state before execution.
Why it matters
- Prevents implicit state derived from scattered flags
- Makes invalid transitions unrepresentable
- Simplifies recovery and manual intervention
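An explicit lifecycle can be expressed as an allow-list of transitions, so anything not listed is unrepresentable by construction. The sketch below assumes illustrative state names, not the production schema:

```typescript
type BookingState =
  | "requested"     // client created a booking request
  | "payment_held"  // payment provider authorised a hold
  | "confirmed"     // booking locked in
  | "completed"     // service delivered
  | "cancelled";    // terminal

// Explicit allow-list of transitions; everything not listed is invalid.
const allowed: Record<BookingState, BookingState[]> = {
  requested: ["payment_held", "cancelled"],
  payment_held: ["confirmed", "cancelled"],
  confirmed: ["completed", "cancelled"],
  completed: [],
  cancelled: [],
};

// Validate against the current state before applying any transition.
function transition(current: BookingState, next: BookingState): BookingState {
  if (!allowed[current].includes(next)) {
    throw new Error(`invalid transition ${current} -> ${next}`);
  }
  return next;
}
```

Centralising the allow-list in one table is what makes manual recovery tractable: an operator can read the table, not reverse-engineer scattered flags.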
2) Idempotent processing for at-least-once events
All external callbacks and backend triggers were treated as at-least-once. Every irreversible step was guarded by an idempotency check combined with a transactional state update.
What this achieved
- Duplicate payment events became safe no-ops
- Concurrent workers could not apply the same transition twice
- Retries after partial failure became deterministic
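The idempotency guard can be sketched as a check on an external event identifier before any irreversible step runs. In the sketch below an in-memory `Set` stands in for the persisted marker; in a real deployment the marker must live in the database (as described under transactional state transitions), since an in-process set does not survive restarts or protect concurrent workers:

```typescript
// At-least-once delivery means the same webhook can arrive more than once.
// Guard each irreversible step with a check keyed on the external event id.
// The Set is a stand-in for a persisted, transactional marker.
const processedEvents = new Set<string>();

function handlePaymentEvent(
  eventId: string,
  apply: () => void, // the irreversible step (e.g. release funds)
): "applied" | "duplicate" {
  if (processedEvents.has(eventId)) {
    return "duplicate"; // safe no-op on redelivery
  }
  processedEvents.add(eventId);
  apply(); // runs at most once per event id
  return "applied";
}
```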
3) Transactional state transitions on critical paths
For money-moving workflows, state transitions and “already processed” markers were persisted atomically. This prevented partial updates such as recording a payment without advancing booking state.
Principle
- External systems are untrusted and repeatable
- The database is the single source of truth for irreversible actions
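Persisting the "already processed" marker and the state change together can be sketched as one atomic update: both writes commit or neither does. A `Map` stands in for the database here, and the function names and record shape are illustrative; in production this is a single database transaction (e.g. a Firestore `runTransaction`):

```typescript
interface Booking {
  state: string;
  processedEventIds: string[];
}

const db = new Map<string, Booking>(); // stand-in for the real database

// The duplicate check, the state-precondition check, and both writes
// happen as one atomic step, so there is no window where the payment
// is recorded without the booking state advancing (or vice versa).
function applyPaymentCapture(bookingId: string, eventId: string): boolean {
  const booking = db.get(bookingId);
  if (!booking) throw new Error(`unknown booking ${bookingId}`);
  if (booking.processedEventIds.includes(eventId)) return false; // duplicate: no-op
  if (booking.state !== "payment_held") return false;            // wrong state: reject
  db.set(bookingId, {
    state: "confirmed",
    processedEventIds: [...booking.processedEventIds, eventId],
  });
  return true;
}
```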
4) Separation of critical and non-critical paths
The critical path was limited to:
- payment validity
- booking state progression
- payout eligibility and release conditions
Non-critical side effects (emails, push notifications, receipts) were deliberately decoupled.
Why it matters
- Payment and booking integrity is not coupled to notification success
- Failures remain visible without corrupting core state
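The decoupling can be sketched as a handler where the core write must succeed while side effects are caught and logged. Function names are illustrative, not the production API:

```typescript
// Critical path: the booking write must succeed, or the whole request fails.
// Non-critical side effects run afterwards; their failures are logged and
// surfaced, but never allowed to block or roll back the core write.
async function confirmBooking(
  writeBooking: () => Promise<void>,      // critical: failure propagates
  sendNotification: () => Promise<void>,  // non-critical: best-effort
): Promise<"confirmed"> {
  await writeBooking();

  try {
    await sendNotification();
  } catch (err) {
    console.error("notification failed (non-critical):", err);
  }
  return "confirmed";
}
```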
5) Auditability and recovery
The system was designed for real-world operation, not just correctness in code:
- structured custom logs for traceability
- crash and exception reporting
- release checklists to reduce regression risk
- explicit, auditable manual recovery paths for rare edge cases
Trade-offs and What I Intentionally Did Not Build
To keep the system operable by a small team with predictable costs, I intentionally avoided:
- Exactly-once processing guarantees
  Chose at-least-once delivery with idempotency and monotonic state instead.
- Large centralised services
  Preferred smaller, purpose-specific workflows to reduce blast radius.
- Opaque auto-healing for critical flows
  Used explicit failure states and manual recovery to preserve correctness.
- Heavy internal tooling early
  Expanded operational tooling only as real production needs emerged.
Concrete Incident: Duplicate Payment Event Processing
Early in production, payment provider webhooks were delivered multiple times within seconds due to retries. In one case, two serverless workers processed the same event concurrently.
Although no funds were duplicated, this revealed a risk where booking state transitions could be applied more than once under race conditions.
I resolved this by strengthening idempotency using external event identifiers and transactional guards, ensuring irreversible operations could be applied at most once. After this change, duplicate events became deterministic no-ops and recovery from partial failures was predictable and observable.
Outcome
- Reliable payment and booking workflows under real production load
- Deterministic recovery from retries and partial failures
- Clear operational paths for intervention without state corruption
Tech
Flutter · Firebase / Firestore · Serverless Functions · Stripe · Push Notifications