
Stripe Webhook Ingress — Implementation Plan

Status: done

Implements stripe-webhook-ingress.md. Closes the spec/code divergence noted in Known Gap #1.

Goal

Bring the live code into compliance with the spec by introducing a fast-claim → async-process → sweeper-backstop architecture. Driven by the May 1, 2026 production incident (383 P2028 errors during the monthly billing burst).

Status

  • [x] Item 3 — WebhookEventService.findStuckEvents (shipped 2026-05-01)
  • [x] Item 1 — Refactor live ingress into claim + async-process
  • [x] Item 2 — Export processClaimedEvent helper
  • [x] Item 4 — New sweeper cron
  • [x] Item 5 — Register the cron
  • [x] Item 6 — Tests for items 1, 2, 4
  • [x] Item 7 — Spec follow-up (update Known Gap #1, drop (target) from File map)

Driving citations

Work breakdown

1. Refactor live ingress into claim + async-process

  • Modify apps/api/src/app/api/stripe/webhooks/route.ts
  • Split current withTransaction(...) block at line 74 into two phases:
    • Phase A — claim transaction (small, fast): runs WebhookEventService.findExistingEvent + createEvent or claimEventForProcessing. Returns { webhookEventId, alreadyProcessed }.
    • Phase B — async process via next/after: runs StripeWebhookService.handleEvent + markAsProcessed or markAsFailed, then dispatches side effects.
  • Tighten claim-tx options: timeout: 5000, maxWait: 10000 (claim is single-row work, doesn't need 30s)
  • Tighten process-tx options: timeout: 25000, maxWait: 10000 (5s slack vs. maxDuration: 30; matches global core.ts default Aaron set in c55e88c63)
  • Add P2028 to retry-eligible conditions at current route.ts:88-92 — match 'p2028' and 'unable to start a transaction' substrings. Spec Invariant 1 — transient claim failure should self-heal in-process before falling through to sweeper.
  • Preserve __testPrisma synchronous bypass at route.ts:65-67 so the existing 290+ tests in apps/api/tests/api/stripe/*.api.test.ts continue passing without modification.
  • Spec coverage: Invariants 1, 3, 4, 5; State transitions: (no row) → Claimed; Claimed → Processed/Failed; HTTP contract
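
The retry-eligibility change above can be sketched as a small classifier plus a bounded retry loop. This is a minimal illustration, not the code in route.ts: isRetryableClaimError, claimWithRetry, ClaimResult, and the maxAttempts default are hypothetical names and values. Only the two matched substrings come from the plan.

```typescript
// Substrings named in the plan; matching is case-insensitive.
const RETRYABLE_SUBSTRINGS = ["p2028", "unable to start a transaction"];

// Hypothetical helper: decides whether a claim-phase failure is transient.
function isRetryableClaimError(err: unknown): boolean {
  // Some errors carry a `code` property; plain Errors only carry a message.
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "";
  const message = err instanceof Error ? err.message : String(err);
  const haystack = `${code} ${message}`.toLowerCase();
  return RETRYABLE_SUBSTRINGS.some((s) => haystack.includes(s));
}

// Shape returned by Phase A, as described in the plan.
interface ClaimResult {
  webhookEventId: string;
  alreadyProcessed: boolean;
}

// Runs the claim transaction, retrying only transient failures.
// `maxAttempts` is an assumed knob, not a value from the plan.
async function claimWithRetry(
  runClaimTx: () => Promise<ClaimResult>,
  maxAttempts = 3,
): Promise<ClaimResult> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await runClaimTx();
    } catch (err) {
      lastError = err;
      if (!isRetryableClaimError(err)) throw err; // non-transient: fail fast
    }
  }
  throw lastError; // retries exhausted: surface the error to the caller
}
```

Keeping the classifier separate from the loop makes the substring list easy to unit-test without a database.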

2. Export processClaimedEvent helper from the route

  • In apps/api/src/app/api/stripe/webhooks/route.ts, add an exported helper:
  • export async function processClaimedEvent(webhookEventId: string, event: Stripe.Event): Promise<void>
  • Internally: opens withTransaction({ audit: { source: 'webhook:stripe' } }), runs StripeWebhookService.handleEvent(event, tx) + WebhookEventService.markAsProcessed(webhookEventId, tx), then dispatches side effects (the existing after() body at current route.ts:269-359 becomes a call to this helper).
  • On throw: WebhookEventService.markAsFailed(webhookEventId, error.message, tx) + captureError.
  • Sweeper cron imports from @/src/app/api/stripe/webhooks/route directly. Single source of truth without manufacturing a new service file for one function.
  • Spec coverage: Invariants 1, 5, 9; Cross-system dependencies — handlers preserve their tx parameter shape
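
The control flow described for the helper can be sketched with injected dependencies. Everything here is a hypothetical stand-in: ProcessDeps, Tx, StripeEventLike, and processClaimedEventSketch are not the real service types, and the sketch assumes the failure mark is written in a fresh transaction because the one that threw has already rolled back.

```typescript
// Hypothetical stand-ins for the real transaction and Stripe event types.
interface Tx { readonly id: number }
interface StripeEventLike { id: string; type: string }

// Hypothetical dependency bundle; the real code calls the services directly.
interface ProcessDeps {
  withTransaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
  handleEvent(event: StripeEventLike, tx: Tx): Promise<void>;
  markAsProcessed(webhookEventId: string, tx: Tx): Promise<void>;
  markAsFailed(webhookEventId: string, message: string, tx: Tx): Promise<void>;
  dispatchSideEffects(event: StripeEventLike): Promise<void>;
  captureError(err: unknown): void;
}

async function processClaimedEventSketch(
  deps: ProcessDeps,
  webhookEventId: string,
  event: StripeEventLike,
): Promise<void> {
  try {
    // Handler work and the Processed mark commit or roll back together.
    await deps.withTransaction(async (tx) => {
      await deps.handleEvent(event, tx);
      await deps.markAsProcessed(webhookEventId, tx);
    });
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Assumption: record the failure in a fresh transaction.
    await deps.withTransaction((tx) => deps.markAsFailed(webhookEventId, message, tx));
    deps.captureError(err);
    return;
  }
  try {
    await deps.dispatchSideEffects(event);
  } catch (err) {
    // Side-effect failures are reported but never re-mark the row.
    deps.captureError(err);
  }
}
```

Passing the dependencies in explicitly is only a device to keep the sketch self-contained; it also happens to make the three outcomes (processed, failed, side-effect error) straightforward to test with fakes.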

3. Add findStuckEvents to WebhookEventService — DONE

  • Shipped: apps/api/src/services/shared/WebhookEventService.ts:findStuckEvents
  • Signature: findStuckEvents(source, { staleAfterMs, lookbackMs, limit }, tx) → Promise<StuckWebhookEvent[]>
  • StuckWebhookEvent = { id, eventId, eventType, payload }
  • Tests: apps/api/tests/services/shared/WebhookEventService.unit.test.ts (9 tests)
  • Implementation note: raw SQL with FOR UPDATE SKIP LOCKED:
    SELECT id, event_id AS "eventId", event_type AS "eventType", payload
    FROM webhook_events
    WHERE source = $1
      AND processed = false
      AND created_at > NOW() - ($lookback_ms || ' milliseconds')::interval
      AND (processing_started_at IS NULL
           OR processing_started_at < NOW() - ($stale_ms || ' milliseconds')::interval)
    ORDER BY created_at ASC
    LIMIT $limit
    FOR UPDATE SKIP LOCKED
  • Spec correction: initial draft of this plan included AND created_at < NOW() - $stale, which would have blocked sweeper retries of freshly-failed events for 5 min. Spec Sweeper cron Behavior text (processed=false AND createdAt > NOW() - 24h AND (processingStartedAt IS NULL OR processingStartedAt < NOW() - 5min)) is canonical; the implementation matches the spec, not the original plan SQL.

  • Spec coverage: Invariants 6, 7; State: Stale Claim, Failed; Sweeper cron Behavior
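
To make the selection rule easier to reason about, the WHERE clause can be mirrored as an in-memory predicate. This is an illustrative sketch only: isStuck, WebhookEventRow, and StuckWindow are hypothetical names, and the real query runs in SQL with row locking that this function does not model.

```typescript
// Hypothetical row shape; field names follow the camelCase spec text.
interface WebhookEventRow {
  id: string;
  processed: boolean;
  createdAt: Date;
  processingStartedAt: Date | null;
}

// Mirrors the staleAfterMs / lookbackMs options of findStuckEvents.
interface StuckWindow {
  staleAfterMs: number;
  lookbackMs: number;
}

function isStuck(row: WebhookEventRow, w: StuckWindow, now: Date): boolean {
  if (row.processed) return false;
  const t = now.getTime();
  // Must have been created inside the lookback window.
  if (row.createdAt.getTime() <= t - w.lookbackMs) return false;
  // Never claimed, or claimed long enough ago that the claim is stale.
  return (
    row.processingStartedAt === null ||
    row.processingStartedAt.getTime() < t - w.staleAfterMs
  );
}
```

Note that there is no lower bound on createdAt age: an unprocessed row with a null processingStartedAt qualifies immediately, which is the behavior the spec correction above preserves.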

4. New sweeper cron

  • New file apps/api/src/app/api/cron/sweep-stuck-stripe-webhooks/route.ts
  • Mirror structure of apps/api/src/app/api/cron/process-failed-transfers/route.ts exactly:
    • withErrorHandling wrapper
    • QStash signature verification with CRON_SECRET Bearer fallback
    • runWithAuditContext({ source: 'cron:sweep-stuck-stripe-webhooks' }, ...)
  • Per-iteration:
    1. Call WebhookEventService.findStuckEvents('stripe', {...}, tx) inside a read transaction
    2. For each row: attempt claimEventForProcessing(id, 5) — proceeds only if claim count = 1
    3. Reconstruct Stripe.Event from the stored payload JSON
    4. Call processClaimedEvent(id, event) imported from route.ts
    5. Tally { processed, succeeded, failed, skipped } (skipped = claim count was 0, another worker got it)
  • Batch limit: 50 per run. Process sequentially within a run — parallelism would amplify the same connection-pool pressure we're fixing. Sequential-with-cap stays well under the 5-min cron interval at observed handler latencies.
  • Returns JSON body with the tally.
  • Spec coverage: Contracts → Sweeper cron; Invariant 7 (single in-flight processor)
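
The per-iteration steps above can be sketched as a sequential loop over injected dependencies. SweepDeps, sweepOnce, and processOne are hypothetical: in particular, processOne stands in for a wrapper around processClaimedEvent that reports whether the row ended Processed, which is an assumption about how the sweeper learns the outcome. Auth, audit context, and transactions are omitted.

```typescript
// Row shape from item 3.
interface StuckWebhookEvent {
  id: string;
  eventId: string;
  eventType: string;
  payload: unknown;
}

interface SweepTally {
  processed: number;
  succeeded: number;
  failed: number;
  skipped: number;
}

// Hypothetical dependency bundle for the sketch.
interface SweepDeps {
  findStuckEvents(limit: number): Promise<StuckWebhookEvent[]>;
  // Resolves to the number of rows the claim updated (0 or 1).
  claimEventForProcessing(id: string): Promise<number>;
  // Assumed wrapper: true if the row ended Processed, false if Failed.
  processOne(id: string, event: unknown): Promise<boolean>;
}

const BATCH_LIMIT = 50; // batch cap from the plan

async function sweepOnce(deps: SweepDeps): Promise<SweepTally> {
  const tally: SweepTally = { processed: 0, succeeded: 0, failed: 0, skipped: 0 };
  const rows = await deps.findStuckEvents(BATCH_LIMIT);
  // Sequential on purpose: parallel processing would add pool pressure.
  for (const row of rows) {
    const claimed = await deps.claimEventForProcessing(row.id);
    if (claimed !== 1) {
      tally.skipped++; // another worker got it
      continue;
    }
    // Rebuild the event from the stored payload (JSON text or parsed JSON).
    const event = typeof row.payload === "string" ? JSON.parse(row.payload) : row.payload;
    tally.processed++;
    if (await deps.processOne(row.id, event)) tally.succeeded++;
    else tally.failed++;
  }
  return tally;
}
```

In this sketch processed always equals succeeded + failed, and skipped counts only rows whose claim returned 0.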

5. Register the cron

  • Modify apps/api/vercel.json
  • Append to crons array: { "path": "/api/cron/sweep-stuck-stripe-webhooks", "schedule": "*/5 * * * *" }
  • Leave functions["app/api/stripe/webhooks/route.ts"].maxDuration at 30 — claim is sub-second; after() budget of 30s is more than handlers need
  • Spec coverage: Sweeper cron schedule

6. Tests

  • New unit test apps/api/tests/unit/services/payment/process-claimed-event.unit.test.ts
    • Success path → row goes to Processed, side effects dispatched
    • handleEvent throws → row goes to Failed with errorMessage
    • Side-effect throw doesn't re-mark the row (per spec Known Gap #3)
  • New API test apps/api/tests/api/cron/sweep-stuck-stripe-webhooks.api.test.ts
    • 401 without auth
    • QStash signature happy path
    • Bearer-secret happy path
    • Stuck event (past stale threshold) gets re-processed
    • Fresh row (created <5min ago, no claim yet) is left alone
    • Row past 24h lookback is skipped (spec Invariant 6)
    • LIMIT honored (covers findStuckEvents boundary behavior end-to-end)
  • Existing webhook tests at apps/api/tests/api/stripe/*.api.test.ts continue using __testPrisma synchronous bypass; no changes needed.
  • Manual test: trigger Stripe CLI burst (stripe trigger invoice.paid × N) against deployed staging; verify reconciliation script (apps/api/scripts/stripe.ts reconcile-webhooks --staging) shows 100% delivered.

7. Spec follow-up

  • After merge: update docs/features/stripe-webhook-ingress/spec.md Known Gap #1 to remove the "current implementation does not match this spec" note. Spec and code now agree.
  • Update File map to drop the (target) annotation on the sweeper cron, and update the "Background processor (target)" entry to point at the exported helper in route.ts rather than a separate Processor service file.

Order of merge

Items 1–6 ship as one PR (per Aaron: bundling avoids partial state). Item 7's spec follow-up is either appended to the impl PR or committed immediately after merge.

Rollback

  • Pre-commit hook caught the rogue agent's earlier attempt at this; no risk of silent regression.
  • If the impl PR ships and behaves badly:
    • The sweeper cron is additive — disable by removing the entry from vercel.json crons.
    • The route refactor can be reverted to route.ts from before the PR. Tests will still pass on the old code.
    • webhook_events schema is unchanged, so no migration to undo.

Out of scope

  • connection_limit URL parameter changes (current sizing intentional per c55e88c63)
  • attachDatabasePool migration to PrismaPg adapter (separate effort; known Supavisor leak issue needs investigation first)
  • Per-event-type handler refactoring (move Stripe API calls outside handleEvent's transaction). Real win but bigger scope; deferred to a follow-up.
  • Generalizing the ingress spec to cover Acuity (source='acuity') — same machinery, different per-event semantics. Deferred until an Acuity spec exists for comparison.

Risk register

  • after() runs longer than maxDuration. Bounded by Vercel; if a handler legitimately needs more than 30s, the row stays at processed=false and the sweeper retakes it. No data loss but extra latency. Mitigate by bumping maxDuration in vercel.json if observed.
  • Sweeper races itself. Two concurrent sweeper invocations could both target the same row. FOR UPDATE SKIP LOCKED + atomic claimEventForProcessing prevent double-processing.
  • Side-effect throws after markAsProcessed. Per spec Known Gap #3, side-effect failures don't trigger event re-dispatch. Captured to Sentry. This is intentional but worth alerting on if frequency rises.
  • Test mode skipping the new path. The synchronous __testPrisma bypass means the new claim/after split is not exercised by integration tests. Mitigated by the new unit + API tests in item 6, but worth running a manual staging burst before relying on the new path for the next monthly billing burst.