SnakeBatch v9: Concurrency

Charles Dana · Monce SAS · live mesh telemetry

snakebatch.aws.monce.ai · /architecture · /economics · /dashboard · /grid/v9

1. Concurrency Anatomy — v9 Mesh

The same anatomy AWS publishes, scoped to v9-worker-pipe-*. Y axis is invocations / second, capped at 16.67/s (the L-A1AFA3CF account-wide spawn ceiling, 1000/min). Live values polled from lambda:get_function_concurrency + list_provisioned_concurrency_configs + CloudWatch ConcurrentExecutions.

[Chart: v9 mesh concurrency anatomy. Series: unreserved (other functions), v9 reserved concurrency, v9 provisioned concurrency, live concurrent executions, L-A1AFA3CF ceiling. Refreshes automatically every 30 s.]
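As a rough illustration of how the reserved and provisioned values could be polled, here is a minimal sketch. It assumes a boto3 Lambda client is passed in (for example `boto3.client("lambda")`). The function name `v9-worker-pipe-00` in the usage note is a hypothetical shard name, and pagination of the provisioned-config listing is ignored for brevity. This is not the dashboard's actual poller.

```python
from typing import Any, Dict


def poll_concurrency(lambda_client: Any, function_name: str) -> Dict[str, int]:
    """Return reserved and provisioned concurrency for one function.

    `lambda_client` is expected to behave like boto3.client("lambda").
    """
    # get_function_concurrency omits the key when nothing is reserved.
    reserved = lambda_client.get_function_concurrency(
        FunctionName=function_name
    ).get("ReservedConcurrentExecutions", 0)

    # One entry per alias/version with provisioned concurrency configured.
    configs = lambda_client.list_provisioned_concurrency_configs(
        FunctionName=function_name
    ).get("ProvisionedConcurrencyConfigs", [])
    provisioned = sum(
        cfg.get("AllocatedProvisionedConcurrentExecutions", 0) for cfg in configs
    )

    return {"reserved": reserved, "provisioned": provisioned}
```

Usage would look like `poll_concurrency(boto3.client("lambda"), "v9-worker-pipe-00")`, called once per shard and summed across the mesh.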
Reserved — the carve-out
What: a slice of the account's 1000/min spawn budget guaranteed to v9. Other Monce functions can never use these slots, even when v9 is idle. Cost: $0. Reserving slots doesn't cost anything — it just restricts who can use them. Effect: blast-radius isolation. A misbehaving non-v9 function can't drain v9's slots. State: spawn must still go through cold start when invoked.
Provisioned — the pre-warmed
What: Lambda containers AWS keeps initialized in advance, so invocations skip cold start and the L-A1AFA3CF spawn-rate ceiling for the provisioned fraction. Cost: $0.0000048673/GB-s of reservation, 14× the on-demand compute rate, paid 24/7 while configured (see /economics §5). Effect: sub-100 ms invoke latency, predictable wall clock. State: always a subset of reserved slots.

2. Live Spawn Rate — Last 60 Minutes

Per-minute MAX invocations/second across all 100 v9 shards (10-second resolution: peak of the six 10 s sub-buckets ÷ 10). This is the peak 10-second spawn rate, not a per-minute average: a 50-invocation burst that lands inside one 10-second bucket reads as 5/s (50 ÷ 10), not 0.83/s (50 ÷ 60). Bars turn orange at 70% of ceiling, red at 100%.

[Chart: v9 mesh max invocations/second at 10 s resolution over the last 60 minutes. Markers: ≥70% of ceiling, ≥ceiling, the 16.67/s ceiling line, and your 60-minute peak. Refreshes automatically every 30 s; source: CloudWatch.]
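The per-minute bar value described above can be sketched as follows. This is a minimal illustration, assuming the six 10-second invocation counts for a minute are already available as a list. The function names and the "default" colour label are hypothetical, not taken from the dashboard code.

```python
from typing import Sequence

CEILING_PER_SEC = 1000 / 60  # L-A1AFA3CF: 1000/min, about 16.67/s


def peak_rate_per_sec(bucket_counts: Sequence[int], bucket_seconds: int = 10) -> float:
    """Peak spawn rate for one minute: busiest sub-bucket divided by its width."""
    return max(bucket_counts) / bucket_seconds


def bar_colour(rate: float, ceiling: float = CEILING_PER_SEC) -> str:
    """Orange at 70% of the ceiling, red at or above the ceiling."""
    if rate >= ceiling:
        return "red"
    if rate >= 0.7 * ceiling:
        return "orange"
    return "default"  # hypothetical name for the un-highlighted colour


# A minute where all 50 invocations land in a single 10 s bucket.
minute = [0, 50, 0, 0, 0, 0]
peak = peak_rate_per_sec(minute)   # busiest bucket / 10 s
average = sum(minute) / 60         # what a per-minute average would report
```

Taking the maximum rather than the mean is what keeps short bursts visible: the average above is under 1/s while the peak is several times higher.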

3. What This Tells You About Compute Headroom

| Layer | Limit | Adjustable? | What it means here |
|---|---|---|---|
| Per-shard reserved concurrency | 0–1,000 instantaneous | yes (per fn) | blast-radius cap; one shard cannot drain the account |
| L-B99A9384 concurrent executions | 20,000 account | yes (support) | only matters at very heavy load |
| L-A1AFA3CF scaling rate | 1,000 / minute = 16.67/s | NO | actual ceiling: how fast cold containers can be created |
| SQS messages/s | 3,000 unbatched | yes | not binding for v9 |
| S3 GET / prefix | ~5,500/s | shard prefixes | not binding for v9 |

Translating fires/sec into useful sustained work, with measured per-leaf compute (Snake at 4 GB warm, ~700 ms/leaf):

| Sustained spawn rate | Lambda-seconds / hour | Equivalent constant compute | 5K×5L jobs / hour ceiling |
|---|---|---|---|
| 0.17/s (10/min) | 252 | 0.07 vCPU continuous | ~95 |
| 1.7/s (100/min) | 2,520 | 0.7 vCPU continuous | ~950 |
| 8.3/s (500/min) | 12,600 | 3.5 vCPU continuous | ~4,750 |
| 16.7/s (cap) | 25,200 | ~7 vCPU continuous | ~9,500 |
Honest read: at the L-A1AFA3CF ceiling we get the equivalent of roughly one busy laptop's worth of CPU, sustained, in a region. The "100,000 slots" framing was always about peak instantaneous parallelism, not throughput. For one user training many small jobs the ceiling is generous; for production-grade prediction throughput it's the binding constraint and the only escape is provisioned concurrency — a 14× pay-to-bypass on the provisioned fraction (see /economics §5).
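The conversion behind the table can be sketched in a few lines. One caveat: the per-invocation duration of 0.42 s used here is not stated in the text. It is the value that reproduces the table's Lambda-seconds column (25,200 ÷ 60,000 invocations) and is treated as an assumption. The function names are illustrative.

```python
def lambda_seconds_per_hour(invocations_per_min: float, seconds_per_invocation: float) -> float:
    """Total billed Lambda-seconds accumulated in one hour at a sustained rate."""
    return invocations_per_min * 60 * seconds_per_invocation


def equivalent_vcpus(lambda_seconds: float) -> float:
    """Constant compute that would produce the same Lambda-seconds in one hour."""
    return lambda_seconds / 3600


# ASSUMPTION: 0.42 s per invocation, inferred from the table rather than measured.
ASSUMED_SECONDS_PER_INVOCATION = 0.42

rows = {}
for per_min in (10, 100, 500, 1000):
    total = lambda_seconds_per_hour(per_min, ASSUMED_SECONDS_PER_INVOCATION)
    rows[per_min] = (total, equivalent_vcpus(total))
```

At the 1000/min cap this gives roughly 25,200 Lambda-seconds per hour, or about 7 vCPUs running continuously, which is the "one busy laptop" figure.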

3.1 We are throttled below 16.67/s by a lot — and it isn't AWS

Empirical observation, 2026-05-21: a 1K×25L training run fired exactly 75 leaf invocations (3 leaves × 25 layers) in ~6 seconds of useful work. If those 75 leaves had fired in parallel we'd have seen a 15/s burst — that's 90% of L-A1AFA3CF's 16.67/s ceiling, with zero throttles, hitting the cap on a tiny 1K-row job.

What we actually saw on CloudWatch: 75 invocations spread across a full minute = 1.25/s sustained, peak ~3–5/s in a 10 s window. Even that peak is 3–5× below the AWS ceiling, and the sustained rate is about 13× below it. The leaves are not all firing in parallel; they're queueing behind their own parents.

| Where the chain serializes | Cost |
|---|---|
| Each layer = independent SQS-dispatched root divide | 5–25 root divides, each lands on its own cold shard via hash(jid, layer) |
| Per-layer chain: divide → conquer → leaf | Conquer runs time.sleep(PARENT_POLL_MS) while polling DDB until the leaf acks; divide does the same waiting on conquer. Three cold-start hops billed serially per layer. |
| _pick_least_loaded_shard on every divide | 100-key DDB BatchGetItem, 30–80 ms on the critical path |

The remediation is structural and free: when n ≤ τ, divide should directly enqueue the leaf payload — conquer is doing no useful work in that path, only forwarding indices and burning a Lambda slot on a poll. With that one change, layer chains collapse from 3 hops to 1, and 25 layers can spawn in parallel instead of stretched across a minute.
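The short-circuit described above might look like the sketch below. It is an illustration, not the v9 code: the `enqueue` callback stands in for the SQS send, the message kinds `"leaf"` and `"divide"` are hypothetical, and the binary split used when n > τ is an assumption, since the text does not say how divide partitions.

```python
from typing import Callable, Dict, Sequence


def divide(
    indices: Sequence[int],
    tau: int,
    enqueue: Callable[[str, Dict], None],
) -> None:
    """Dispatch a slice: straight to a leaf when small, else recurse via the queue."""
    n = len(indices)
    if n <= tau:
        # Short-circuit: no conquer hop, no polling parent holding a Lambda slot.
        enqueue("leaf", {"indices": list(indices)})
        return

    # ASSUMPTION: binary split; the real fan-out factor is not given in the text.
    mid = n // 2
    enqueue("divide", {"indices": list(indices[:mid])})
    enqueue("divide", {"indices": list(indices[mid:])})


# Usage with an in-memory list standing in for SQS.
sent = []
divide(list(range(5)), tau=10, enqueue=lambda kind, body: sent.append((kind, body)))
```

With the leaf enqueued directly, the per-layer chain is one hop, and nothing sleeps on a poll while the leaf runs.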

The atom should be a population of one bucket size. The original v4/v5 divide-and-conquer worked great because each leaf was given exactly bucket-many rows to chew on, and the recursion was the only mechanism dictating fan-out. Today's conquer takes the entire n ≤ τ slice and hands it to one leaf, which then re-partitions internally via build_bucket_chain. That works correctness-wise, but it collapses the fan-out from "one Lambda per bucket" back to "one Lambda per ≤τ slice", erasing the whole point of having 100 shards. Restoring bucket-grained leaves — one Lambda per bucket-sized atom, picked up directly off SQS — is what unlocks the mesh.
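A bucket-grained dispatch could be sketched as follows. This is a minimal illustration under stated assumptions: `bucket_size` and the `enqueue` callback are hypothetical stand-ins, and the contiguous slicing shown here is not necessarily how `build_bucket_chain` partitions rows.

```python
from typing import Callable, Dict, List, Sequence


def bucket_atoms(indices: Sequence[int], bucket_size: int) -> List[List[int]]:
    """Split a slice into consecutive atoms of at most bucket_size rows."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [
        list(indices[start:start + bucket_size])
        for start in range(0, len(indices), bucket_size)
    ]


def enqueue_bucket_leaves(
    indices: Sequence[int],
    bucket_size: int,
    enqueue: Callable[[str, Dict], None],
) -> int:
    """Send one leaf message per bucket-sized atom; return how many were sent."""
    atoms = bucket_atoms(indices, bucket_size)
    for atom in atoms:
        enqueue("leaf", {"indices": atom})
    return len(atoms)
```

The point of the design is that fan-out is set by the number of atoms, so a ≤τ slice produces several leaf messages that can land on different shards instead of one Lambda that partitions internally.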

See /economics §5.4 on why the provisioning toggle is a wall-clock product, not a cost product, until chain-serialization is removed: 14× the bill for 2× the speed is a bad trade. Fix the chain first, then the toggle becomes a real lever.

4. Toggling Past the Ceiling — Per-Hour Pricing

v9's wall-clock scaling is empirically O(n^0.77), sublinear, up to ~7.5K rows × 25 layers on synthetic data. Past that, the L-A1AFA3CF on-demand ramp gates spawn rate before the tree finishes its useful fan-out, and the model diverges from the regression: the wall clock isn't compute-bound anymore, it's queueing for warm containers. The fix that costs nothing is the chain-serialization removal in §3.1; the fix that costs some dollars is provisioning a thin slice of bypass capacity, toggled on for the duration of a job and off afterwards.

Provisioned slots bill per second. Toggle on (≈60–90 s warm-up, not billed during spin-up; we charge from READY) and off (instant), with no minimums. So you pay for the wall-clock window the slots are active, nothing more.

4.1 The 2× lever — sized to the ceiling, not the mesh

To raise effective spawn rate from 16.67/s to ~33/s (a 2× lift), we don't need 100 slots; we need enough provisioned capacity to absorb the second 16.67/s of leaf invocations the on-demand pool can't warm. With ~700 ms/leaf at 4 GB:

slots_needed ≈ target_rate × t_leaf
           = 16.67/s × 0.7s
           ≈ 12 slots

That's 12 slots fully shared across the 100-shard mesh — SQS pulls them from whichever shard the next leaf hashes to, so the bypass is global, not pinned. Above the 7.5K×25L break this is what keeps the N^0.77 line from bending.
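The sizing above is an application of Little's law (items in flight = arrival rate × time in system). A minimal sketch, with an illustrative function name and rounding up to whole slots:

```python
import math


def slots_needed(target_rate_per_sec: float, leaf_seconds: float) -> int:
    """Concurrent slots required to absorb a given invocation rate.

    Little's law: in-flight invocations = arrival rate x service time.
    """
    return math.ceil(target_rate_per_sec * leaf_seconds)


EXTRA_RATE = 1000 / 60   # the second 16.67/s that on-demand cannot warm
LEAF_SECONDS = 0.7       # ~700 ms per leaf at 4 GB, as measured in the text

bypass_slots = slots_needed(EXTRA_RATE, LEAF_SECONDS)
```

Because the result scales linearly with leaf duration, any change in per-leaf time (memory size, bucket size) shifts the slot count proportionally.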

4.2 Per-hour cost — toggle on / toggle off

Pricing eu-west-3, 4 GB, 2026-05-21:

| N slots | Reservation only (idle) | At 100% utilization (active on top) |
|---|---|---|
| 1 | $0.070/hr | $0.234/hr |
| 5 | $0.350/hr | $1.169/hr |
| 12 (= 2× bypass) | $0.841/hr | $2.804/hr |
| 25 | $1.752/hr | $5.842/hr |
| 50 | $3.504/hr | $11.683/hr |
| 100 | $7.008/hr | $23.367/hr |

Formula at 4 GB:

idle floor   = N × 4 × 3600 × $0.0000048673  ≈ $0.0701 × N / hr
at 100% util = N × 4 × 3600 × $0.0000162242  ≈ $0.2336 × N / hr

Concrete: flip 12 slots on for the 5-second window of a single training burst → 12 × 4 × 5 × $0.0000162242 = $0.0039. The reservation floor only bites when slots sit idle. Toggle-on-demand sidesteps the 24/7 floor entirely.
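The two formulas and the burst example can be checked with a short sketch. The rates are the ones quoted above for eu-west-3 at 4 GB. The function names are illustrative, and the sketch assumes slots are either fully idle or fully utilized, as in the table.

```python
RESERVATION_PER_GB_S = 0.0000048673   # idle floor, from the formula above
FULL_UTIL_PER_GB_S = 0.0000162242     # reservation plus active compute


def idle_floor_per_hour(slots: int, memory_gb: float = 4.0) -> float:
    """Hourly cost of provisioned slots that sit idle."""
    return slots * memory_gb * 3600 * RESERVATION_PER_GB_S


def full_util_per_hour(slots: int, memory_gb: float = 4.0) -> float:
    """Hourly cost of provisioned slots that are busy 100% of the time."""
    return slots * memory_gb * 3600 * FULL_UTIL_PER_GB_S


def burst_cost(slots: int, seconds: float, memory_gb: float = 4.0) -> float:
    """Cost of keeping slots fully busy for a short window, billed per second."""
    return slots * memory_gb * seconds * FULL_UTIL_PER_GB_S


idle_12 = idle_floor_per_hour(12)
active_12 = full_util_per_hour(12)
burst_12 = burst_cost(12, 5)
```

For 12 slots this reproduces the table row (about $0.841/hr idle and $2.804/hr active) and the $0.0039 figure for a 5-second burst.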

4.3 Why 12, not 100

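The full-mesh framing in /economics §5 ($168/day at 100 shards always-on) is the always-warm scenario, useful for SLA-class predict throughput. For training-burst bypass, you only need 12 slots because the bypass is sized to the spawn rate it has to absorb, not to the mesh: 16.67/s × 0.7 s ≈ 12 concurrent leaves (§4.1), and those slots are shared across all 100 shards.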

4.4 Empirical break — 7.5K×25L on synthetic data

Measured 2026-05-21: training runs follow N^0.77 wall-clock through n=5K, n=7.5K at 25 layers. At n=7.5K, L=25, the run hits the on-demand spawn ceiling on its inner layers and the regression breaks — subsequent layers stretch out behind queue depth instead of finishing in parallel. Before the break we have a clean sublinear curve; after, the wall clock fans into "as many seconds as the ceiling will give us." The 12-slot bypass closes that gap exactly.

The economics in one sentence. Without bypass, v9 is free up to ~7.5K×25L and then capped by AWS. With 12 slots toggled on for the job's lifetime ($0.84/hr idle, $2.80/hr active, prorated by the second), v9 stays on its N^0.77 curve through 1M+ rows. That's the only knob between "free" and "capped." Provisioning more is paying for capacity you can't fill.

See also: /paper for the 1M-row claim that implicitly assumes free L-A1AFA3CF headroom (it does past the bypass); /economics §5 for full-mesh always-warm economics; /architecture §8 for why sharding gives isolation, not throughput.