SnakeBatch v9: Economics of the 100-Shard Mesh

Charles Dana · Monce SAS · May 2026

snakebatch.aws.monce.ai · /paper · /architecture · /concurrency · /math

1. Per-Lambda Cost

v9 uses ONE Lambda binary at 4096 MB, deployed to 100 functions.

v9-worker (4096 MB, ~3 vCPUs):
  invoke  : $0.0000002 per request
  compute : 4 GB × duration × $0.0000166667/GB-s
          = $0.0000667/s

  divide  (~80ms warm)  :  $0.0000053 + $0.0000002 ≈ $0.0000055
  conquer (~150ms warm) :  $0.000010 + $0.0000002  ≈ $0.0000102
  leaf    (~400ms warm) :  $0.0000267 + $0.0000002 ≈ $0.0000269
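
The per-invocation arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes only the rates and warm durations quoted in this section; the names `invocation_cost` and `WARM_DURATIONS_S` are illustrative and not part of SnakeBatch.

```python
# Rates quoted in section 1 (assumed: on-demand Lambda pricing).
INVOKE_USD = 0.0000002        # per request
GB_SECOND_USD = 0.0000166667  # per GB-second
MEMORY_GB = 4                 # 4096 MB worker


def invocation_cost(duration_s: float, memory_gb: float = MEMORY_GB) -> float:
    """Cost in USD of one warm invocation: compute plus the request fee."""
    return memory_gb * duration_s * GB_SECOND_USD + INVOKE_USD


# Warm durations taken from the block above (hypothetical dict name).
WARM_DURATIONS_S = {"divide": 0.080, "conquer": 0.150, "leaf": 0.400}

for role, seconds in WARM_DURATIONS_S.items():
    print(f"{role:8s} ${invocation_cost(seconds):.7f}")
```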

2. Tree Topology & Worker Count

v9 partitions input via binary divide, then hands each ≤τ slice to one leaf. Per layer the tree shape is:

divide nodes ≈ ⌈n/τ⌉ − 1 · conquer nodes = leaves = ⌈n/τ⌉

Snake at the leaf does its own internal IF/ELIF/ELSE partition with build_bucket_chain(bucket=user_b). Snake's partition is class-aware, not size-fixed: a slice of 500 mixed-class rows at bucket=32 emits ~4 buckets (one per class plus catch-all), not 500/32 = 16. So the assembled JSON is dramatically lighter than naive arithmetic suggests.

3. Total Cost Per Training Run

W(n, L) = L · (3 · ⌈n/τ⌉ − 1)

L layers each with ~⌈n/τ⌉ leaves, ~⌈n/τ⌉ conquers, and ~⌈n/τ⌉−1 divides. Production default τ = 1,000 (locked — not a tuning knob). The cost numbers below assume this.
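
The worker count can be checked numerically. The sketch below simply evaluates W(n, L) from the node counts listed above; the function name `worker_count` is an assumption made for illustration.

```python
TAU = 1_000  # production default slice size


def worker_count(n: int, layers: int, tau: int = TAU) -> int:
    """W(n, L) = L * (3 * ceil(n / tau) - 1)."""
    leaves = -(-n // tau)   # integer ceiling of n / tau
    conquers = leaves       # one conquer per leaf
    divides = leaves - 1    # internal nodes of a binary split tree
    return layers * (leaves + conquers + divides)


for n, layers in [(1_000, 5), (5_000, 5), (1_000_000, 5)]:
    print(n, layers, worker_count(n, layers))
```

At τ = 1,000 this gives 10 workers for 1K×5L and 14,995 for 1M×5L.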

The cost rows below are split into compute (the leaf bill, linear in N) and reservation (provisioned-slot bypass needed to keep peak invocations/s under the L-A1AFA3CF ceiling, super-linear past N≈30K). At τ=1000, all rows above 7.5K×25L need bypass to be feasible — see /paper §7 for the derivation.

Scale	L	Wall clock	Slots	$ compute	$ reservation	$ total
1K	5	3.8s	0	$0.0006	—	$0.0006
5K	5	~5s	0	$0.006	—	$0.006
15K	5	~8s	0	$0.017	—	$0.017
150K	5	~80s	9	$0.17	$0.01	$0.18
1M	5	6.5 min	81	$1.13	$0.62	$1.75
1M	1	78s	7	$0.23	$0.01	$0.24
1M	25	33 min	449	$5.65	$17.16	$22.81
10M	5	~38 min	~470	$11.30	$60.10	$71.40

Costs split: leaves are linear in N (the compute column); reservation is super-linear in N because the slot count needed to sustain peak invocations/s grows as O(L · N / log N) at fixed τ (see /paper §7.3) and those slots stay reserved for the whole wall clock, which itself grows with N. Past N≈30K at L=25, the reservation term overtakes compute and dominates total cost.
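
Several reservation figures in the table appear consistent with holding the listed slots for the full wall clock at the provisioned rate quoted in §5.2. The sketch below makes that assumption explicit; it is an inference from the table, not a formula stated in the source, and `reservation_cost` is a hypothetical name.

```python
PROVISIONED_GB_SECOND_USD = 0.0000048673  # rate quoted in section 5.2
MEMORY_GB = 4


def reservation_cost(slots: int, wall_clock_s: float) -> float:
    """Assumed model: every slot stays reserved for the whole job."""
    return slots * MEMORY_GB * wall_clock_s * PROVISIONED_GB_SECOND_USD


# (label, slots, wall clock in seconds, compute $ from the table)
rows = [
    ("150K x 5L", 9, 80, 0.17),
    ("1M x 5L", 81, 390, 1.13),
    ("1M x 1L", 7, 78, 0.23),
]
for label, slots, wall_s, compute in rows:
    reservation = reservation_cost(slots, wall_s)
    print(f"{label}: reservation ${reservation:.2f}, total ${compute + reservation:.2f}")
```

Only these three rows are checked here; the other rows are left unverified.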

Headline. A 1M×5L Snake classifier trains in 6.5 minutes for $1.75 total at τ=1000. Slots toggled on at job start, off at job end — per-second billing. The earlier "$0.05 for 1M×5L" claim was wrong by 35×: it assumed free L-A1AFA3CF headroom, true at 5K (~7/s peak), false at 1M (132/s peak, 8× the ceiling). The fix is either pay the reservation tax above (canonical), or fix parent-poll chain-serialization to dilute the spawn-window per recursion level (/architecture §8).

3.1 Measured: 3 Datasets × 5K Rows (May 2026)

Live benchmark: 3 datasets (binary, 3-class, regression), each trained twice: once on the full 5K rows (perfect-fit), once on 4K with 1K held out. All predicts were forced through the cloud (cloud_threshold = 1). 6 model trains in total, ~24K predict round trips, single CloudWatch window:

Pass	n_train	n_predict	Train (s)	Predict (s)	Quality
binary — perfect-fit	5,000	5,000	4.80	6.09	acc 100.00%
binary — held-out	4,000	1,000	4.01	4.76	acc 100.00%
3-class — perfect-fit	5,000	5,000	4.66	6.26	acc 100.00%
3-class — held-out	4,000	1,000	3.72	4.44	acc 100.00%
regression — perfect-fit	5,000	5,000	24.93	3.74	R² = 1.0000
regression — held-out	4,000	1,000	20.39	4.20	R² = 0.9845

Per-row cost decomposition

CloudWatch delta over the entire bench: $0.0434 for 387 Lambda invocations. Total work performed: 6 trainings × 4–5K rows = 27,000 training rows, plus ~24,000 inference round trips. So:

$0.0434 / (27,000 train rows + 24,000 predict rows) ≈ $0.85 per million row-ops

Trainings dominate compute time (52s of 73s wall clock, 71%); predicts dominate count. If we ascribe cost proportional to wall clock:

train cost ≈ $0.031 / 27K rows = $1.13 per million training rows
predict cost ≈ $0.012 / 24K rows = $0.51 per million predictions (toy models, ~5K-row train)
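
The decomposition can be replayed from the quoted inputs. The sketch below assumes, as the text does, that cost is split in proportion to wall clock; with unrounded intermediates it lands within a few cents per million of the figures above. Variable names are illustrative.

```python
TOTAL_USD = 0.0434       # CloudWatch delta for the whole bench
TRAIN_ROWS = 27_000
PREDICT_ROWS = 24_000
TRAIN_WALL_S = 52        # training share of the 73 s wall clock
TOTAL_WALL_S = 73

per_million_row_ops = TOTAL_USD / (TRAIN_ROWS + PREDICT_ROWS) * 1e6

# Split the bill in proportion to wall clock (the assumption stated above).
train_usd = TOTAL_USD * TRAIN_WALL_S / TOTAL_WALL_S
predict_usd = TOTAL_USD - train_usd

train_per_million = train_usd / TRAIN_ROWS * 1e6
predict_per_million = predict_usd / PREDICT_ROWS * 1e6

print(f"blended : ${per_million_row_ops:.2f} per million row-ops")
print(f"train   : ${train_per_million:.2f} per million training rows")
print(f"predict : ${predict_per_million:.2f} per million predictions")
```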

Cloud predict's per-row cost is not a constant — it scales with the assembled bucket-chain length (training size × layer depth). A second empirical anchor from a Nature consumer running a 22K-row × 5-layer model:

predict cost ≈ $0.0023 / 4737 rows ≈ $0.49 per thousand predictions = ~$485 per million on the Nature-class model

The lever for >>1M-prediction workloads is the local handoff (§7.1): download the model JSON once with m.to_algorithmeai() and run algorithmeai.Snake in-process. algorithmeai per-row inference is sub-millisecond regardless of model size, with no Lambda billing, no S3 fetch, no HTTP RTT. Cloud predict's pricing is for the case where the user can't or doesn't want to ship the model to the caller (multi-tenant, audit trail, central enforcement), not for high-volume batch.

Numbers persist in v9/bench/results_3x5k.json; rerun via v9_smoke_economics.py.

3.2 Earlier Burn Snapshot (May 2026)

72-hour CloudWatch snapshot of the eu-west-3 account, all Lambdas:

Function	Memory	Invocations (72h)	Duration billed (72h)	Est. cost
v9 mesh (100 shards)	4 GB	~14,000	3.4 hours	~$0.21
v6-worker (legacy)	10 GB	7,485,628	5,992 hours	~$914
v6-divide (legacy)	4 GB	172,387	2,391 hours	~$143

What this measures. The v6 numbers come from a single runaway recursion event on May 18–19, 2026 (since contained). The v9 row is steady-state usage across all 100 shards over the same window. The point is the shape: v9 distributes recursion across 100 isolated functions, so a single misbehaving shard cannot drain the account — it hits its own concurrency cap and stops. v6's single function had no such firewall.

4. v6 vs v9 — Why Spread the Mesh?

Property	v6 (one fn)	v9 (100 fns)
Per-shard reserved concurrency	1000 cap (single fn)	1000 × 100 shards (isolation, not aggregation)
Account spawn ceiling (L-A1AFA3CF)	1000/min, shared with all retries	1000/min, sharded by hash; a misbehaving job hits one shard's wall
1K cost	$0.003	$0.0006
1K wall clock	2.4s	3.8s
15K wall clock	7.3s	~8s (parity)
150K wall clock	19.8s	~80s (peak rate gated)
1M×5L wall clock	concurrency-capped, fails	6.5 min with 81-slot bypass
10M×5L wall clock	concurrency-capped, fails	~38 min with ~470-slot bypass

v9 is the version that completes at scale — v6's recursive self-invokes wedged under L-A1AFA3CF tightening. v9 routes around that with sharding (isolation) plus toggled provisioned slots (peak-rate bypass). The mesh is an availability mechanism; the slots are a throughput mechanism; neither alone replaces the other.
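
One way to picture "sharded by hash" is a stable mapping from job id to one of the 100 functions. The sketch below is an illustration only: the source does not say which hash or naming scheme v9 uses, so the SHA-256 choice, `shard_for`, and the `v9-worker-NN` names are all assumptions.

```python
import hashlib

N_SHARDS = 100


def shard_for(job_id: str, n_shards: int = N_SHARDS) -> int:
    """Map a job id to a shard index, stable across processes and hosts."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def function_name(job_id: str) -> str:
    # Hypothetical naming scheme; the real shard names are not given.
    return f"v9-worker-{shard_for(job_id):02d}"


print(function_name("job-123"))
```

Because every invocation of a job resolves to the same function, a runaway job exhausts only that function's concurrency cap.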

5. Three-Bucket Cost Structure

The per-million numbers above are compute-only. The real cost of running v9 has three buckets, and only one of them scales with rows processed:

Bucket	What you pay	When billed
Fixed	~$98/month flat	24/7, identical at 0 jobs or 1M rows
Activation	$0.0117–$0.117/min	only while provisioned-concurrency toggle is on
Usage	$1.13/M train + $0.51/M predict	per row processed (compute itself)

5.1 Fixed Costs — the always-on floor

EC2 t4g.2xlarge (gatherer)   $96.77/month  (on-demand)
                             $65.30/month  (reserved 1y, 32% off)
S3 storage (~5 GB)            $0.12/month
DynamoDB on-demand            $0.06/month  (~50K writes)
CloudWatch log retention      $0.30/month
Route53 + ACM cert            $0.50/month
                             ────────────
                             $97.75/month total fixed

The gatherer is 8 vCPU / 32 GB at 99% idle. Massively overprovisioned for a single user; sized for multi-tenant. At 100K rows/month total throughput, the fixed bucket alone amortizes to $978/M rows — dwarfs every other cost. At 100M rows/month it's $0.98/M and disappears into the noise.
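
The amortization is a single division, sketched here so other volumes can be plugged in. It assumes the $97.75/month total from the block above; `fixed_per_million` is an illustrative name.

```python
FIXED_USD_PER_MONTH = 97.75  # total fixed cost from the block above


def fixed_per_million(rows_per_month: float) -> float:
    """Fixed cost in USD attributed to each million rows processed."""
    return FIXED_USD_PER_MONTH / rows_per_month * 1e6


for rows in (100_000, 300_000, 3_000_000, 100_000_000):
    print(f"{rows:>11,} rows/month -> ${fixed_per_million(rows):,.2f} per million rows")
```

At 100K rows/month this gives $977.50 per million rows, and at 100M rows/month about $0.98.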

5.2 Activation Cost — the provisioned-concurrency toggle

L-A1AFA3CF (Lambda concurrency scaling rate) is a 1000/min account-wide ceiling AWS does not adjust. Cold starts on a chain-serialized invoke graph cap v9 at <5× parallelism for small jobs — the 100-shard mesh is structurally underutilized at the spawn-rate ceiling. Provisioned concurrency bypasses this for the provisioned fraction:

Provisioned reservation rate (eu-west-3 x86):
  $0.0000048673 per GB-second  (paid while READY, not while invoking)

Spin-up:                  ~84s, NOT billed
Tear-down:                ~0.8s, instant
Min billing granularity:  1 second

  4 GB ×   1 shard ×  60s = $0.00117  per minute warm
  4 GB ×  10 shards ×  60s = $0.0117   per minute warm
  4 GB × 100 shards ×  60s = $0.117    per minute warm  ($7.01/hour)

The spin-up tax means the toggle is a session primitive, not a per-job one. You toggle on once, run a stream of jobs against the warm mesh, toggle off when done. AWS bills provisioning for as long as the slots are READY, whether or not anything invokes them, so a reliable toggle-off (EventBridge one-shot + 60s sweep) is mandatory: a leaked toggle on 100 shards burns $168/day.
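
A session-scoped toggle can be sketched as a context manager that always releases what it provisioned. This is an in-process guard under stated assumptions, not the author's toggle: it complements the EventBridge one-shot and sweep rather than replacing them, since a killed process never reaches the cleanup. `warm_mesh`, the shard names, and the alias are hypothetical; the two client methods are the boto3 Lambda calls of the same name, and the qualifier must be a published version or alias.

```python
from contextlib import contextmanager


@contextmanager
def warm_mesh(client, function_names, qualifier, slots_per_shard=1):
    """Provision concurrency on entry and always release it on exit."""
    provisioned = []
    try:
        for name in function_names:
            # Returns immediately; the slots become READY later (the
            # text above quotes roughly 84 s). This sketch does not wait.
            client.put_provisioned_concurrency_config(
                FunctionName=name,
                Qualifier=qualifier,
                ProvisionedConcurrentExecutions=slots_per_shard,
            )
            provisioned.append(name)
        yield provisioned
    finally:
        # Runs on normal exit and on exceptions raised inside the session.
        for name in provisioned:
            client.delete_provisioned_concurrency_config(
                FunctionName=name, Qualifier=qualifier
            )


# Usage sketch (needs boto3 and credentials; names below are hypothetical):
#   import boto3
#   client = boto3.client("lambda", region_name="eu-west-3")
#   shards = [f"v9-worker-{i:02d}" for i in range(100)]
#   with warm_mesh(client, shards, "live"):
#       run_jobs()
```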

5.3 Usage Cost vs Activation Cost — the friction equation

Define friction as the fraction of warm-window time the mesh sat ready but doing nothing useful. 0% friction = perfectly back-to-back jobs. 50% friction = half your warm window was idle.

provisioned $/min  $0.117  (full mesh)
on-demand   $/min  $0.0083 (measured, 5K×5L bench)
                  ───────
ratio              14×     provisioning is 14× the price of plain Lambda

So provisioning only pays for itself when the warm window crams enough back-to-back compute to dilute the reservation tax. Concretely, for a 5K×5L job (one warm session, full 100-shard mesh, 60s budget):

Scenario	Compute cost	Provisioning cost	Per job	Per million train rows
On-demand only (today)	$0.005	—	$0.005	$1.13/M
Toggle, 0% friction (perfect)	$0.005	$0.0117	$0.017	$3.34/M
Toggle, 50% friction (realistic)	$0.005	$0.117	$0.122	$24.40/M
Toggle, 5% friction (heavy session)	$0.005	$1.17	$1.18	$235/M

No session profile beats on-demand on $/job (see the session table in §5.4); 20+ back-to-back jobs through one warm window is where the toggle tax stops shrinking. One job per session is the worst case: you pay the 14× reservation rate for a ~2× wall-clock improvement.

Break-even cheat sheet.
1 job/session → on-demand wins outright (about 22× the $/job for ~2× speed)
5 jobs/session → toggle still about 6.6× on-demand per job
20+ jobs/session → toggle settles near 3.5× on-demand per job; it wins on wall clock only

5.4 The amortization scenario — when warm Lambdas pay off

Provisioning is a session primitive, not a per-job one. The 84s spin-up plus the 14× reservation rate means a single isolated job gets crushed by the toggle. But a session — a stretch of clock time during which you fire many jobs back to back — can amortize the warm cost down to fractions of a cent per job. Concrete numbers, full mesh (100 shards × 4 GB), 60s of spin-up amortized into the budget:

Session shape	Wall clock	Jobs run	Provisioning $	Compute $	$/job	vs on-demand
Spin up, run 1 job	~90 s	1	$0.105	$0.005	$0.110	22× worse
Spin up, run 5 jobs	~120 s	5	$0.140	$0.025	$0.033	6.6× worse
Spin up, run 20 jobs	~210 s	20	$0.245	$0.100	$0.0173	3.5× worse
1-hour fully-saturated session	3,600 s	~600	$7.01	$3.00	$0.0167	3.3× worse
Compute-saturated infinity (theoretical floor)	∞	∞	$/sec ratio	$/sec ratio	$0.005	parity

Provisioned concurrency literally cannot beat on-demand on $/job. The reservation rate is 14× the per-second compute rate, so even a mythical 100% utilized warm window pays exactly the on-demand price plus the reservation tax. Provisioning never wins on cost — it wins on wall clock.
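
The $/job column follows from adding the session's provisioning bill to the per-job compute and dividing by the job count. The sketch below replays the finite rows of the table under that reading; the provisioning dollars are taken from the table as given, and `cost_per_job` is an illustrative name.

```python
COMPUTE_USD_PER_JOB = 0.005  # on-demand cost of one 5K x 5L job


def cost_per_job(provisioning_usd: float, jobs: int) -> float:
    """Session bill (provisioning plus compute) spread over its jobs."""
    return (provisioning_usd + jobs * COMPUTE_USD_PER_JOB) / jobs


# (session shape, provisioning $ from the table, jobs run)
sessions = [
    ("1 job", 0.105, 1),
    ("5 jobs", 0.140, 5),
    ("20 jobs", 0.245, 20),
    ("1 hour saturated", 7.01, 600),
]
for label, provisioning, jobs in sessions:
    per_job = cost_per_job(provisioning, jobs)
    ratio = per_job / COMPUTE_USD_PER_JOB
    print(f"{label:18s} ${per_job:.4f}/job  ({ratio:.2f}x on-demand)")
```

Since the provisioning term is never negative, the ratio cannot drop below 1 for any session shape.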

5.5 So what's it actually for?

Provisioning trades dollars for predictable latency. The 14× tax is the price of skipping cold starts and the L-A1AFA3CF spawn-rate ceiling on the provisioned fraction. Three regimes where that trade is worth taking:

Regime	Why warm helps	Cost-frame
Interactive demo	5K×5L drops from ~6 s to ~2 s wall clock; live audience doesn't see latency	$0.10 per 60s warm window — eat it as a sales cost
SLA predict throughput	Cloud predict at scale needs 100s of leaves spawning together; cold-start tail blows P99	$7/hour during business hours — charge it to the customer's SLA tier
Continuous training pipeline	Hourly model rebuild on fresh data, jobs back-to-back, want each rebuild <15 s	$84/day if always warm during work hours; tolerable if model rebuild is critical-path

For everything else — ad-hoc training, exploratory benches, the "morning bruv let me try this dataset" flow — on-demand is the right default. v9's small-job wall clock (~6 s for 5K×5L) is bad enough to justify the toggle when latency is the product, fine to live with when correctness is the product.

5.6 Honest per-million by usage profile

Profile	Volume	Toggle?	$/M train rows	What dominates
Infrequent	1–5 jobs/month	off	~$5,000/M	EC2 fixed cost — per-row math meaningless
Active	~10K rows/day	off	$1.13/M + $325/M fixed	EC2 still 290× usage
Production	~100K rows/day, spot toggle	50% friction	$24.40/M + $32.50/M fixed	activation & fixed roughly even
Heavy	1M rows/day, session toggle	10% friction	$11.70/M + $3.25/M fixed	activation dominates compute
Industrial	10M rows/day, always warm	0% friction (saturated)	$3.34/M + $0.33/M fixed	compute ≈ activation, fixed gone

The blue dashed line is on-demand — the toggle curve approaches 14× that and stops. Provisioning is a wall-clock product, not a cost product.

The cost class still holds. Even the worst realistic profile (Production, 50% friction) lands at $24.40/M train rows + $98/mo fixed. Vertex AI Tabular at $200/M is 8× more expensive at the same volume. The 1000× advantage in §8 is the headline number, not the worst-case number; honest framing is "4–1000× cheaper depending on profile."

6. EC2 Gatherer Cost (deep dive)

One t4g.2xlarge in eu-west-3, 8 vCPU / 32 GB:

On-demand : $0.1344/hour = $96.77/month
Reserved 1y : $0.0907/hour = $65.30/month  (32% off)

The gatherer hosts the FastAPI app (/v9/train, /v9/status, /v9/model, /grid/v9) and runs the SQS drainer thread. Idle CPU 99%, RAM 0.85 / 30 GB at current load — massively overprovisioned for a single user, perfect for multi-tenant scale.

7. Inference Cost

Two paths:

7.1 Local handoff — `monceai.Snake.to_algorithmeai()`

Download /v9/model/{id}, instantiate algorithmeai.Snake(path), predict locally. $0 per prediction, sub-millisecond per row on commodity hardware, zero network roundtrip.

This is the recommended path for v9 because the model JSON is semantically equivalent to a locally trained Snake. There is no operational reason to keep inference in the cloud.

7.2 Cloud predict — `/v9/predict/{id}`

Not yet implemented. When wired, expected:

v9-predict (1 GB, ~5ms warm):
  invoke  : $0.0000002
  compute : 1 GB × 0.005s × $0.0000166667 = $0.000000083
  Total per Lambda ≈ $0.0000003 per call
  ≈ $0.30 per million predictions

8. Cost vs Provider Pricing

Comparing v9 to closed-source classification APIs at 10K-row training scale:

Provider	10K-row train cost	Wall clock	Inference
SnakeBatch v9	$0.001	~8s	$0 local / $0.30/M cloud
Vertex AI Tabular	$1–3 (sustained)	~20m minimum	per-call billed
SageMaker Autopilot	$5–10	~1h	endpoint $/hr
OpenAI fine-tune classifier	$10+ per 10K	~30m	token-billed

The SnakeBatch number isn't a discount; it's a different cost class. SAT-by-construction means we don't pay for backprop, we don't pay for hyperparameter search, we don't pay for retries. The work is polynomial in n and linear in spot Lambda time. There is no model to "fit" beyond constructing the formula.