SnakeBatch v9: Concurrency

Charles Dana · Monce SAS · live mesh telemetry


1. Concurrency Anatomy — v9 Mesh

The same anatomy AWS publishes, scoped to v9-worker-pipe-*. Y axis is invocations / second, capped at 16.67/s (the L-A1AFA3CF account-wide spawn ceiling, 1000/min). Live values polled from lambda:get_function_concurrency + list_provisioned_concurrency_configs + CloudWatch ConcurrentExecutions.

[Chart: concurrency anatomy for the v9 mesh. Series: unreserved (other functions), v9 reserved concurrency, v9 provisioned concurrency, live concurrent executions, L-A1AFA3CF ceiling. Refreshes automatically every 30 s.]
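The polling described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: `poll_concurrency` and its return shape are hypothetical, the function names to poll are supplied by the caller, and the client is assumed to behave like a boto3 Lambda client (only `get_function_concurrency` and `list_provisioned_concurrency_configs` are called). The CloudWatch `ConcurrentExecutions` read is left out.

```python
from typing import Any, Dict, Iterable


def poll_concurrency(lambda_client: Any, function_names: Iterable[str]) -> Dict[str, int]:
    """Sum reserved and allocated provisioned concurrency over the given functions.

    `lambda_client` is assumed to expose the boto3 Lambda client methods used below.
    """
    reserved = 0
    provisioned = 0
    for name in function_names:
        resp = lambda_client.get_function_concurrency(FunctionName=name)
        # The key is absent when no reserved concurrency is configured.
        reserved += resp.get("ReservedConcurrentExecutions", 0)

        marker = None
        while True:
            kwargs = {"FunctionName": name}
            if marker:
                kwargs["Marker"] = marker
            page = lambda_client.list_provisioned_concurrency_configs(**kwargs)
            for cfg in page.get("ProvisionedConcurrencyConfigs", []):
                provisioned += cfg.get("AllocatedProvisionedConcurrentExecutions", 0)
            marker = page.get("NextMarker")
            if not marker:
                break
    return {"reserved": reserved, "provisioned": provisioned}
```

With real credentials this would be called as `poll_concurrency(boto3.client("lambda"), names)`, where `names` is the list of shard function names.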

Reserved — the carve-out
What: a slice of the account's concurrent-execution pool guaranteed to v9. Other Monce functions can never use these slots, even when v9 is idle.
Cost: $0. Reserving slots costs nothing; it only restricts who can use them.
Effect: blast-radius isolation. A misbehaving non-v9 function can't drain v9's slots.
State: not pre-warmed. An invocation still goes through a cold start.

Provisioned — the pre-warmed
What: Lambda containers AWS keeps initialized in advance, so invocations skip cold start and the L-A1AFA3CF spawn-rate ceiling for the provisioned fraction.
Cost: $0.0000048673/GB-s of reservation, 14× the on-demand compute rate, paid 24/7 while configured (see /economics §5).
Effect: sub-100ms invoke latency, predictable wall clock.
State: always a subset of reserved slots.

2. Live Spawn Rate — Last 60 Minutes

Per-minute MAX invocations/second across all 100 v9 shards (10-second resolution: peak of the six 10s sub-buckets ÷ 10). This is the peak 10-second rate, not a per-minute average: a 50-invocation burst that lands inside one 10s bucket reads as 5/s, not the 0.83/s a per-minute average would show. Bars turn orange at 70% of ceiling, red at 100%.

[Chart: v9 mesh max invocations/second at 10 s resolution, last 60 minutes. Markers: ≥70% of ceiling, ≥ceiling, the 16.67/s ceiling, and the 60-minute peak. Refreshes automatically every 30 s; source: CloudWatch.]
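The per-minute bar value and its color rule can be sketched as below. This is an illustration under assumptions: `peak_rate` and `bar_color` are hypothetical names, the input is the six 10-second invocation counts for one minute, and the label "normal" for bars under 70% is made up for the example.

```python
from typing import Sequence

# 1000 spawns per minute, the L-A1AFA3CF figure quoted in this document.
CEILING_PER_S = 1000 / 60


def peak_rate(counts_10s: Sequence[int], bucket_s: float = 10.0) -> float:
    """Peak invocations/second for one minute: the largest 10 s bucket divided by 10."""
    return max(counts_10s) / bucket_s


def bar_color(rate_per_s: float, ceiling: float = CEILING_PER_S) -> str:
    """Orange at 70% of the ceiling or more, red at the ceiling or more."""
    if rate_per_s >= ceiling:
        return "red"
    if rate_per_s >= 0.7 * ceiling:
        return "orange"
    return "normal"  # hypothetical label for bars below the warning threshold


# A minute whose 30 invocations all land in one 10 s bucket:
minute = [0, 30, 0, 0, 0, 0]
peak = peak_rate(minute)      # largest bucket / 10 s
average = sum(minute) / 60.0  # what a per-minute average would report
```

Here the peak is six times the per-minute average, which is the gap the chart is designed to expose.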

3. What This Tells You About Compute Headroom

Layer	Limit	Adjustable?	What it means here
Per-shard reserved concurrency	0–1,000 instantaneous	yes (per fn)	blast-radius cap; one shard cannot drain the account
L-B99A9384 concurrent executions	20,000 account	yes (support)	only matters at very heavy load
L-A1AFA3CF scaling rate	1,000 / minute = 16.67/s	NO	actual ceiling: how fast cold containers can be created
SQS messages/s	3,000 unbatched	yes	not binding for v9
S3 GET / prefix	~5,500/s	yes (shard prefixes)	not binding for v9

Translating fires/sec into useful sustained work, with measured per-leaf compute (Snake at 4 GB warm, ~700 ms/leaf):

Sustained spawn rate	Lambda-seconds / hour	Equivalent constant compute	5K×5L jobs / hour ceiling
0.17/s (10/min)	252	0.07 vCPU continuous	~95
1.7/s (100/min)	2,520	0.7 vCPU continuous	~950
8.3/s (500/min)	12,600	3.5 vCPU continuous	~4,750
16.7/s (cap)	25,200	~7 vCPU continuous	~9,500

Honest read: at the L-A1AFA3CF ceiling we get the equivalent of roughly one busy laptop's worth of CPU, sustained, in a region. The "100,000 slots" framing was always about peak instantaneous parallelism, not throughput. For one user training many small jobs the ceiling is generous; for production-grade prediction throughput it's the binding constraint and the only escape is provisioned concurrency — a 14× pay-to-bypass on the provisioned fraction (see /economics §5).

3.1 We are throttled below 16.67/s by a lot — and it isn't AWS

Empirical observation, 2026-05-21: a 1K×25L training run fired exactly 75 leaf invocations (3 leaves × 25 layers) in ~6 seconds of useful work. If those 75 leaves had fired in parallel inside a 5-second window, we'd have seen a 15/s burst: 90% of L-A1AFA3CF's 16.67/s ceiling, with zero throttles, nearly hitting the cap on a tiny 1K-row job.

What we actually saw on CloudWatch: 75 invocations spread across a full minute = 1.25/s sustained, peak ~3–5/s in a 10s window. That peak is roughly 3–5× below the AWS ceiling (16.67 ÷ 5 ≈ 3.3, 16.67 ÷ 3 ≈ 5.6), and the sustained rate is about 13× below it. The leaves are not all firing in parallel; they're queueing behind their own parents.

Where the chain serializes	Cost
Each layer = independent SQS-dispatched root divide	5–25 root divides, each lands on its own cold shard via hash(jid, layer)
Per-layer chain: `divide → conquer → leaf`	Conquer polls DDB, calling `time.sleep(PARENT_POLL_MS)` between reads, until the leaf acks; divide waits on conquer the same way. Three cold-start hops are billed serially per layer.
`_pick_least_loaded_shard` on every divide	100-key DDB BatchGetItem, 30–80 ms on the critical path

The remediation is structural and free: when n ≤ τ, divide should directly enqueue the leaf payload — conquer is doing no useful work in that path, only forwarding indices and burning a Lambda slot on a poll. With that one change, layer chains collapse from 3 hops to 1, and 25 layers can spawn in parallel instead of stretched across a minute.
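A minimal sketch of that short-circuit, under loud assumptions: the real `divide` handler is not shown in this document, so the message shape, the `enqueue` callback (standing in for an SQS send), and the split-in-half behaviour for large slices are all hypothetical. Only the `n ≤ τ` branch reflects the change described above.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical payload shape: (stage name, row indices).
Message = Tuple[str, List[int]]


def divide(indices: Sequence[int], tau: int, enqueue: Callable[[Message], None]) -> None:
    """Route a slice: small slices go straight to a leaf, large ones are split."""
    if len(indices) <= tau:
        # Short-circuit: enqueue the leaf payload directly. No conquer hop,
        # so no parent sits in a poll loop holding a Lambda slot.
        enqueue(("leaf", list(indices)))
        return
    # Assumed behaviour for large slices: split in half and recurse via the queue.
    mid = len(indices) // 2
    enqueue(("divide", list(indices[:mid])))
    enqueue(("divide", list(indices[mid:])))


# Usage with an in-memory list standing in for the queue:
queue: List[Message] = []
divide(range(10), tau=16, enqueue=queue.append)
```

After the call, `queue` holds a single `leaf` message and nothing addressed to conquer, which is the one-hop path the paragraph argues for.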

The atom should be a population of one bucket size. The original v4/v5 divide-and-conquer worked great because each leaf was given exactly bucket-many rows to chew on, and the recursion was the only mechanism dictating fan-out. Today's conquer takes the entire n ≤ τ slice and hands it to one leaf, which then re-partitions internally via build_bucket_chain. That works correctness-wise, but it collapses the fan-out from "one Lambda per bucket" back to "one Lambda per ≤τ slice", erasing the whole point of having 100 shards. Restoring bucket-grained leaves — one Lambda per bucket-sized atom, picked up directly off SQS — is what unlocks the mesh.
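Bucket-grained fan-out can be illustrated with a small helper. This is a sketch, not the project's code: `bucket_atoms` is a hypothetical name, and it simply cuts a `≤ τ` slice into bucket-sized atoms so that each atom could be enqueued as its own leaf message. It makes no claim about how `build_bucket_chain` partitions rows internally.

```python
from typing import List, Sequence


def bucket_atoms(indices: Sequence[int], bucket_size: int) -> List[List[int]]:
    """Cut a slice into consecutive atoms of at most `bucket_size` rows each."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [
        list(indices[start:start + bucket_size])
        for start in range(0, len(indices), bucket_size)
    ]


# One leaf message per atom instead of one leaf per slice:
atoms = bucket_atoms(range(10), bucket_size=4)
```

A 10-row slice with a bucket size of 4 yields three atoms, so three leaves can run in parallel where the current path runs one.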

See /economics §5.4 on why the provisioning toggle is a wall-clock product, not a cost product, until chain-serialization is removed: 14× the bill for 2× the speed is a bad trade. Fix the chain first, then the toggle becomes a real lever.

4. Toggling Past the Ceiling — Per-Hour Pricing

v9's wall-clock scaling is empirically O(n^0.77) sublinear up to ~7.5K rows × 25 layers on synthetic data. Past that, the L-A1AFA3CF on-demand ramp gates spawn rate before the tree finishes its useful fan-out, and the model diverges from the regression — the wall clock isn't compute-bound anymore, it's queueing for warm containers. The fix that costs nothing is the chain-serialization removal in §3.1; the fix that costs some dollars is provisioning a thin slice of bypass capacity, toggled on for the duration of a job and off afterwards.
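To make the O(n^0.77) figure concrete, the sketch below computes how much longer a run should take when the row count changes, assuming the power law holds over that range (which, per the paragraph above, it does only below the ~7.5K×25L break). `wall_clock_ratio` is a hypothetical helper; the exponent is the one quoted here.

```python
def wall_clock_ratio(n_from: float, n_to: float, exponent: float = 0.77) -> float:
    """Predicted wall-clock multiplier when going from n_from rows to n_to rows."""
    if n_from <= 0 or n_to <= 0:
        raise ValueError("row counts must be positive")
    return (n_to / n_from) ** exponent


# Doubling the rows costs less than double the time under a 0.77 exponent.
doubling = wall_clock_ratio(5_000, 10_000)
```

Under this model, doubling the rows multiplies wall clock by roughly 1.7 rather than 2.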

Provisioned slots bill per second. Toggle on (≈60–90 s warm-up, not billed during spin-up; we charge from READY) and off (instant), with no minimums. So you pay for the wall-clock window the slots are active and nothing more.
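The toggle can be sketched as a context manager that turns provisioned concurrency on, waits for READY, and turns it off when the job finishes. This is an illustration, not the product's implementation: `provisioned_slots`, its polling interval, and its timeout are hypothetical, and the client is assumed to behave like a boto3 Lambda client. Provisioned concurrency is configured on a published version or alias, so a qualifier is required.

```python
import time
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def provisioned_slots(lambda_client: Any, function_name: str, qualifier: str,
                      slots: int, poll_s: float = 5.0,
                      timeout_s: float = 300.0) -> Iterator[None]:
    """Enable `slots` provisioned executions for the duration of the with-block."""
    lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=slots,
    )
    try:
        deadline = time.monotonic() + timeout_s
        while True:
            status = lambda_client.get_provisioned_concurrency_config(
                FunctionName=function_name, Qualifier=qualifier
            )["Status"]
            if status == "READY":
                break
            if status == "FAILED" or time.monotonic() > deadline:
                raise RuntimeError(f"provisioned concurrency not ready: {status}")
            time.sleep(poll_s)  # warm-up is on the order of 60-90 s
        yield
    finally:
        # Toggle off so the idle reservation floor stops accruing.
        lambda_client.delete_provisioned_concurrency_config(
            FunctionName=function_name, Qualifier=qualifier
        )
```

A job would wrap its run in `with provisioned_slots(client, name, alias, 12): ...`. The `finally` clause removes the configuration even when the job raises.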

4.1 The 2× lever — sized to the ceiling, not the mesh

To raise effective spawn rate from 16.67/s to ~33/s (a 2× lift), we don't need 100 slots; we need enough provisioned capacity to absorb the second 16.67/s of leaf invocations the on-demand pool can't warm. With ~700 ms/leaf at 4 GB:

slots_needed ≈ target_rate × t_leaf
             = 16.67/s × 0.7 s
             ≈ 12 slots

That's 12 slots fully shared across the 100-shard mesh — SQS pulls them from whichever shard the next leaf hashes to, so the bypass is global, not pinned. Above the 7.5K×25L break this is what keeps the N^0.77 line from bending.
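The sizing rule above (rate × service time, i.e. Little's law) fits in a few lines. `slots_needed` is a hypothetical helper; rounding up rather than to the nearest integer is a choice made for this sketch so the pool never falls short of the target rate.

```python
import math


def slots_needed(target_rate_per_s: float, t_leaf_s: float) -> int:
    """Concurrent slots required to sustain `target_rate_per_s` leaves of `t_leaf_s` each."""
    return math.ceil(target_rate_per_s * t_leaf_s)


# Covering a second 16.67/s of leaves at ~700 ms per leaf:
bypass_slots = slots_needed(1000 / 60, 0.7)
```

The same function shows how sensitive the answer is to leaf duration: halving `t_leaf_s` halves the slots required for the same extra rate.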

4.2 Per-hour cost — toggle on / toggle off

Pricing eu-west-3, 4 GB, 2026-05-21:

N slots	Reservation only (idle)	At 100% utilization (reservation + active compute)
1	$0.070/hr	$0.234/hr
5	$0.350/hr	$1.168/hr
12 (= 2× bypass)	$0.841/hr	$2.804/hr
25	$1.752/hr	$5.841/hr
50	$3.504/hr	$11.681/hr
100	$7.009/hr	$23.363/hr

Formula at 4 GB:

idle floor   = N × 4 × 3600 × $0.0000048673  ≈ $0.0701 × N / hr
at 100% util = N × 4 × 3600 × $0.0000162242  ≈ $0.2336 × N / hr

Concrete: flip 12 slots on for the 5-second window of a single training burst → 12 × 4 × 5 × $0.0000162242 = $0.0039. The reservation floor only bites when slots sit idle. Toggle-on-demand sidesteps the 24/7 floor entirely.
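The formulas above can be checked with a short calculator. The two rates are the ones quoted in this section (eu-west-3, 2026-05-21); `cost_usd` and the constant names are hypothetical, and the sketch assumes billing is linear in slots, memory, and seconds with no minimum, as stated earlier.

```python
# Rates quoted in this section, in USD per GB-second.
RESERVATION_USD_PER_GB_S = 0.0000048673  # idle floor
ACTIVE_USD_PER_GB_S = 0.0000162242       # at 100% utilization


def cost_usd(slots: int, seconds: float, rate_per_gb_s: float,
             memory_gb: float = 4.0) -> float:
    """Cost of holding `slots` provisioned executions for `seconds` at the given rate."""
    return slots * memory_gb * seconds * rate_per_gb_s


idle_hour = cost_usd(12, 3600, RESERVATION_USD_PER_GB_S)   # 12 slots idle for one hour
active_hour = cost_usd(12, 3600, ACTIVE_USD_PER_GB_S)      # 12 slots fully busy for one hour
burst = cost_usd(12, 5, ACTIVE_USD_PER_GB_S)               # 12 slots for a 5-second burst
tripwire_day = cost_usd(1, 24 * 3600, RESERVATION_USD_PER_GB_S)  # 1 idle slot for a day
```

These reproduce the 12-slot figures quoted in this section: about $0.84 per idle hour, $2.80 per fully utilized hour, and $0.0039 for the 5-second burst.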

4.3 Why 12, not 100

The full-mesh framing in /economics §5 ($168/day at 100 shards always-on) is the always-warm scenario, useful for SLA-class predict throughput. For training-burst bypass, you only need 12 slots because:

The L-A1AFA3CF ceiling is the bottleneck, not aggregate concurrency. You're sizing to cover the ceiling shortfall, not to pre-warm every shard.
Provisioned slots are not pinned to a shard. SQS draws from the provisioned pool wherever the next leaf hashes, so 12 shared slots behave as roughly 17 sustained extra fires/s (12 ÷ 0.7 s) regardless of which shards are hot.
Warm-up takes 60–90s. For an isolated job that fires in <6s, you can't reactively flip on; toggling is a scheduled primitive (turn on at known-busy hours, or pre-warm a 1-slot tripwire continuously — $1.68/day — and burst from there).

4.4 Empirical break — 7.5K×25L on synthetic data

Measured 2026-05-21: training runs follow N^0.77 wall-clock scaling through n=5K and up to n=7.5K at 25 layers. At n=7.5K, L=25, the run hits the on-demand spawn ceiling on its inner layers and the regression breaks: subsequent layers stretch out behind queue depth instead of finishing in parallel. Before the break we have a clean sublinear curve; after it, the wall clock fans into "as many seconds as the ceiling will give us." The 12-slot bypass closes that gap exactly.

The economics in one sentence. Without bypass, v9 is free up to ~7.5K×25L and then capped by AWS. With 12 slots toggled on for the job's lifetime ($0.84/hr idle, $2.80/hr active, prorated by the second), v9 stays on its N^0.77 curve through 1M+ rows. That's the only knob between "free" and "capped." Provisioning more is paying for capacity you can't fill.

See also: /paper for the 1M-row claim, which implicitly assumes free L-A1AFA3CF headroom (true only once the bypass is on); /economics §5 for full-mesh always-warm economics; /architecture §8 for why sharding gives isolation, not throughput.