SnakeBatch v9: Economics of the 100-Shard Mesh

Charles Dana · Monce SAS · May 2026

snakebatch.aws.monce.ai · /paper · /architecture · /concurrency · /math

1. Per-Lambda Cost

v9 uses ONE Lambda binary at 4096 MB, deployed to 100 functions.

v9-worker (4096 MB, ~3 vCPUs):
  invoke  : $0.0000002 per request
  compute : 4 GB × duration × $0.0000166667/GB-s
          = $0.0000667/s

  divide  (~80ms warm)  :  $0.0000053 + $0.0000002 ≈ $0.0000055
  conquer (~150ms warm) :  $0.000010 + $0.0000002  ≈ $0.0000102
  leaf    (~400ms warm) :  $0.0000267 + $0.0000002 ≈ $0.0000269
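
The per-call figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes only the two rates quoted in this section; the function name `lambda_call_cost` is illustrative and not part of SnakeBatch.

```python
# Rates quoted in section 1 (request fee and per GB-second compute).
INVOKE_USD = 0.0000002
GB_SECOND_USD = 0.0000166667


def lambda_call_cost(memory_gb: float, duration_s: float) -> float:
    """Cost of one warm invocation: request fee plus memory x duration."""
    return INVOKE_USD + memory_gb * duration_s * GB_SECOND_USD


# Warm durations per role, as listed above (milliseconds).
for role, ms in [("divide", 80), ("conquer", 150), ("leaf", 400)]:
    print(f"{role:8s} ${lambda_call_cost(4.0, ms / 1000):.7f}")
```

The request fee is a small fraction of every row, so duration at 4 GB is what drives the per-call price.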

2. Tree Topology & Worker Count

v9 partitions input via binary divide, then hands each ≤τ slice to one leaf. Per layer the tree shape is:

divide nodes ≈ ⌈n/τ⌉ − 1  ·  conquer nodes = leaves = ⌈n/τ⌉

Snake at the leaf does its own internal IF/ELIF/ELSE partition with build_bucket_chain(bucket=user_b). Snake's partition is class-aware, not size-fixed: a slice of 500 mixed-class rows at bucket=32 emits ~4 buckets (one per class plus catch-all), not 500/32 = 16. So the assembled JSON is dramatically lighter than naive arithmetic suggests.

3. Total Cost Per Training Run

W(n, L) = L · (3 · ⌈n/τ⌉ − 1)

L layers each with ~⌈n/τ⌉ leaves, ~⌈n/τ⌉ conquers, and ~⌈n/τ⌉−1 divides. Production default τ = 1,000 (locked — not a tuning knob). The cost numbers below assume this.
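
As a quick check of the worker-count formula, the sketch below counts leaves, conquers and divides per layer and sums them. The function name `workers` is illustrative; the default τ = 1,000 is the production value stated above.

```python
import math


def workers(n: int, layers: int, tau: int = 1000) -> int:
    """W(n, L) = L * (3 * ceil(n / tau) - 1), built from its three terms."""
    leaves = math.ceil(n / tau)   # one leaf per slice of at most tau rows
    conquers = leaves             # one conquer per leaf
    divides = leaves - 1          # internal nodes of a binary tree with that many leaves
    return layers * (leaves + conquers + divides)


print(workers(1_000, 5))       # 1K rows, 5 layers
print(workers(1_000_000, 5))   # 1M rows, 5 layers
```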

The cost rows below are split into compute (the leaf bill, linear in N) and reservation (provisioned-slot bypass needed to keep peak invocations/s under the L-A1AFA3CF ceiling, super-linear past N≈30K). At τ=1000, all rows above 7.5K×25L need bypass to be feasible — see /paper §7 for the derivation.

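| Scale | L | Wall clock | Slots | $ compute | $ reservation | $ total |
|---|---|---|---|---|---|---|
| 1K | 5 | 3.8s | 0 | $0.0006 | | $0.0006 |
| 5K | 5 | ~5s | 0 | $0.006 | | $0.006 |
| 15K | 5 | ~14s | 0 | $0.017 | | $0.017 |
| 150K | 5 | ~80s | 9 | $0.17 | $0.01 | $0.18 |
| 1M | 5 | 6.5 min | 81 | $1.13 | $0.62 | $1.75 |
| 1M | 1 | 78s | 7 | $0.23 | $0.01 | $0.24 |
| 1M | 25 | 33 min | 449 | $5.65 | $17.16 | $22.81 |
| 10M | 5 | ~38 min | ~470 | $11.30 | $60.10 | $71.40 |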

Costs split: leaves are linear in N (the compute column); reservation is super-linear in N because the slot count needed to sustain peak invocations/s grows as O(L · N / log N) at fixed τ (see /paper §7.3). Past N≈30K at L=25, the reservation term overtakes compute and dominates total cost.
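
The reservation column can be approximated as slots × memory × wall clock × reservation rate. This is an inference from the table rather than a formula stated in the source: it reproduces the 150K×5L and 1M×5L rows using the eu-west-3 rate from §5.2, and other rows should be read against the /paper derivation. The function name is illustrative.

```python
# Provisioned reservation rate quoted in section 5.2 (USD per GB-second).
RESERVATION_USD_PER_GB_S = 0.0000048673


def reservation_cost(slots: int, wall_clock_s: float, memory_gb: float = 4.0) -> float:
    """Assumed model: every slot is held READY for the whole job."""
    return slots * memory_gb * wall_clock_s * RESERVATION_USD_PER_GB_S


print(round(reservation_cost(81, 6.5 * 60), 2))  # 1M x 5L row -> 0.62
print(round(reservation_cost(9, 80), 2))         # 150K x 5L row -> 0.01
```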

Headline. A 1M×5L Snake classifier trains in 6.5 minutes for $1.75 total at τ=1000. Slots toggled on at job start, off at job end — per-second billing. The earlier "$0.05 for 1M×5L" claim was wrong by 35×: it assumed free L-A1AFA3CF headroom, true at 5K (~7/s peak), false at 1M (132/s peak, 8× the ceiling). The fix is either pay the reservation tax above (canonical), or fix parent-poll chain-serialization to dilute the spawn-window per recursion level (/architecture §8).
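
The headroom argument is simple arithmetic on the ceiling. A small sketch, assuming the 1000/min figure and the two peak rates quoted in the headline; `ceiling_multiple` is an illustrative name.

```python
CEILING_PER_MIN = 1000  # L-A1AFA3CF account-wide rate quoted in this document


def ceiling_multiple(peak_per_s: float) -> float:
    """Peak invocation rate expressed as a multiple of the per-second ceiling."""
    return peak_per_s / (CEILING_PER_MIN / 60)


for label, peak in [("5K", 7), ("1M", 132)]:
    m = ceiling_multiple(peak)
    print(label, f"{m:.1f}x ceiling", "needs bypass" if m > 1 else "fits on-demand")

# Size of the earlier pricing error: corrected total over the old claim.
print(f"{1.75 / 0.05:.0f}x")
```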

3.1 Measured: 3 Datasets × 5K Rows (May 2026)

Live benchmark: 3 datasets (binary, 3-class, regression), each trained twice: once on the full 5K (perfect-fit), once on 4K with 1K held out. All predicts were forced through the cloud (cloud_threshold = 1). 6 model trains in total, ~30K predict round trips, single CloudWatch window:

| Pass | n_train | n_predict | Train (s) | Predict (s) | Quality |
|---|---|---|---|---|---|
| binary, perfect-fit | 5,000 | 5,000 | 4.80 | 6.09 | acc 100.00% |
| binary, held-out | 4,000 | 1,000 | 4.01 | 4.76 | acc 100.00% |
| 3-class, perfect-fit | 5,000 | 5,000 | 4.66 | 6.26 | acc 100.00% |
| 3-class, held-out | 4,000 | 1,000 | 3.72 | 4.44 | acc 100.00% |
| regression, perfect-fit | 5,000 | 5,000 | 24.93 | 3.74 | R² = 1.0000 |
| regression, held-out | 4,000 | 1,000 | 20.39 | 4.20 | R² = 0.9845 |

Per-row cost decomposition

CloudWatch delta over the entire bench: $0.0434 for 387 Lambda invocations. Total work performed: 6 trainings × 4–5K rows = 27,000 training rows, plus ~24,000 inference round trips. So:

$0.0434 / (27,000 train rows + 24,000 predict rows) ≈ $0.85 per million row-ops

Trainings dominate compute time (52s of 73s wall clock, 71%); predicts dominate count. If we ascribe cost proportional to wall clock:

train cost ≈ $0.031 / 27K rows = $1.13 per million training rows
predict cost ≈ $0.012 / 24K rows = $0.51 per million predictions  (toy models, ~5K-row train)
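
The decomposition above can be replayed from the raw bench numbers. The sketch below uses only figures quoted in this section (the $0.0434 delta, the row counts, and the 52s/73s wall-clock split); because it does not round the intermediate dollar amounts, it lands within a few cents of the $1.13/M and $0.51/M figures rather than exactly on them.

```python
BENCH_USD = 0.0434                 # CloudWatch delta for the whole bench
TRAIN_ROWS, PREDICT_ROWS = 27_000, 24_000
TRAIN_SHARE = 52 / 73              # share of wall clock spent training


def per_million(usd: float, rows: int) -> float:
    """Convert a dollar amount over a row count into USD per million rows."""
    return usd / rows * 1e6


blended = per_million(BENCH_USD, TRAIN_ROWS + PREDICT_ROWS)
train = per_million(BENCH_USD * TRAIN_SHARE, TRAIN_ROWS)
predict = per_million(BENCH_USD * (1 - TRAIN_SHARE), PREDICT_ROWS)

print(f"blended ${blended:.2f}/M  train ${train:.2f}/M  predict ${predict:.2f}/M")
```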

Cloud predict's per-row cost is not a constant — it scales with the assembled bucket-chain length (training size × layer depth). A second empirical anchor from a Nature consumer running a 22K-row × 5-layer model:

predict cost ≈ $0.0023 / 4737 rows ≈ $0.49 per thousand predictions  = ~$485 per million on the Nature-class model

The lever for >>1M-prediction workloads is the local handoff (§7.1): download the model JSON once with m.to_algorithmeai() and run algorithmeai.Snake in-process. algorithmeai per-row inference is sub-millisecond regardless of model size, with no Lambda billing, no S3 fetch, no HTTP RTT. Cloud predict's pricing is for the case where the user can't or doesn't want to ship the model to the caller (multi-tenant, audit trail, central enforcement), not for high-volume batch.

[Figure: per-row cost, SnakeBatch v9 vs market, log axis from $0.01/M to $10/M. Plotted values: v9 train $1.13/M; v9 predict $0.51/M; local algorithmeai $0/M; Vertex Tabular ~$200/M (10K = $2); SageMaker AP ~$1/K (10K = $10).]

Numbers persist in v9/bench/results_3x5k.json; rerun via v9_smoke_economics.py.

3.2 Earlier Burn Snapshot (May 2026)

72-hour CloudWatch snapshot of the eu-west-3 account, all Lambdas:

| Function | Memory | Invocations (72h) | Duration billed (72h) | Est. cost |
|---|---|---|---|---|
| v9 mesh (100 shards) | 4 GB | ~14,000 | 3.4 hours | ~$0.21 |
| v6-worker (legacy) | 10 GB | 7,485,628 | 5,992 hours | ~$914 |
| v6-divide (legacy) | 4 GB | 172,387 | 2,391 hours | ~$143 |

What this measures. The v6 numbers come from a single runaway recursion event on May 18–19, 2026 (since contained). The v9 row is steady-state usage across all 100 shards over the same window. The point is the shape: v9 distributes recursion across 100 isolated functions, so a single misbehaving shard cannot drain the account: it hits its own concurrency cap and stops. v6's single function had no such firewall.

4. v6 vs v9 — Why Spread the Mesh?

| Property | v6 (one fn) | v9 (100 fns) |
|---|---|---|
| Per-shard reserved concurrency | 1000 cap (single fn) | 1000 × 100 shards (isolation, not aggregation) |
| Account spawn ceiling (L-A1AFA3CF) | 1000/min, shared with all retries | 1000/min, sharded by hash; a misbehaving job hits one shard's wall |
| 1K cost | $0.003 | $0.0006 |
| 1K wall clock | 2.4s | 3.8s |
| 15K wall clock | 7.3s | ~8s (parity) |
| 150K wall clock | 19.8s | ~80s (peak rate gated) |
| 1M×5L wall clock | concurrency-capped, fails | 6.5 min with 81-slot bypass |
| 10M×5L wall clock | concurrency-capped, fails | ~38 min with ~470-slot bypass |

v9 is the version that completes at scale — v6's recursive self-invokes wedged under L-A1AFA3CF tightening. v9 routes around that with sharding (isolation) plus toggled provisioned slots (peak-rate bypass). The mesh is an availability mechanism; the slots are a throughput mechanism; neither alone replaces the other.
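
To make the isolation property concrete, here is a minimal sketch of hash-based shard routing. The source only says jobs are "sharded by hash"; the choice of SHA-256, the job identifier as the hash key, and the `v9-worker-NN` naming are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 100  # one Lambda function per shard, as in the v9 mesh


def shard_for(job_id: str) -> int:
    """Map a job id to a stable shard index in [0, NUM_SHARDS)."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def function_name(job_id: str) -> str:
    """Hypothetical function naming scheme; not taken from the source."""
    return f"v9-worker-{shard_for(job_id):02d}"


# Every invocation of the same job resolves to the same function, so a
# runaway job can only exhaust that one function's concurrency cap.
print(function_name("job-1234"), function_name("job-1234"))
```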

5. Three-Bucket Cost Structure

The per-million numbers above are compute-only. The real cost of running v9 has three buckets, and only one of them scales with rows processed:

| Bucket | What you pay | When billed |
|---|---|---|
| Fixed | ~$98/month flat | 24/7, identical at 0 jobs or 1M rows |
| Activation | $0.0117–$0.117/min | only while provisioned-concurrency toggle is on |
| Usage | $1.13/M train + $0.51/M predict | per row processed (compute itself) |

5.1 Fixed Costs — the always-on floor

EC2 t4g.2xlarge (gatherer)   $96.77/month  (on-demand)
                             $65.30/month  (reserved 1y, 32% off)
S3 storage (~5 GB)            $0.12/month
DynamoDB on-demand            $0.06/month  (~50K writes)
CloudWatch log retention      $0.30/month
Route53 + ACM cert            $0.50/month
                             ────────────
                             $97.75/month total fixed

The gatherer is 8 vCPU / 32 GB at 99% idle. Massively overprovisioned for a single user; sized for multi-tenant. At 100K rows/month total throughput, the fixed bucket alone amortizes to $978/M rows — dwarfs every other cost. At 100M rows/month it's $0.98/M and disappears into the noise.
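
The amortization is a one-line division, sketched below with the $97.75/month total from the breakdown above. The function name is illustrative.

```python
FIXED_USD_PER_MONTH = 97.75  # total fixed bucket from the breakdown above


def fixed_per_million(rows_per_month: float) -> float:
    """Fixed monthly cost spread over monthly throughput, in USD per million rows."""
    return FIXED_USD_PER_MONTH / rows_per_month * 1e6


for rows in (100_000, 1_000_000, 100_000_000):
    print(f"{rows:>11,} rows/month -> ${fixed_per_million(rows):,.2f}/M")
```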

5.2 Activation Cost — the provisioned-concurrency toggle

L-A1AFA3CF (Lambda concurrency scaling rate) is a 1000/min account-wide ceiling AWS does not adjust. Cold starts on a chain-serialized invoke graph cap v9 at <5× parallelism for small jobs — the 100-shard mesh is structurally underutilized at the spawn-rate ceiling. Provisioned concurrency bypasses this for the provisioned fraction:

Provisioned reservation rate (eu-west-3 x86):
  $0.0000048673 per GB-second  (paid while READY, not while invoking)

Spin-up:                  ~84s, NOT billed
Tear-down:                ~0.8s, instant
Min billing granularity:  1 second

  4 GB ×   1 shard ×  60s = $0.00117  per minute warm
  4 GB ×  10 shards ×  60s = $0.0117   per minute warm
  4 GB × 100 shards ×  60s = $0.117    per minute warm  ($7.01/hour)

The spin-up tax means the toggle is a session primitive, not a per-job one. You toggle on once, run a stream of jobs against the warm mesh, and toggle off when done. The reservation is billed for every second it stays READY, whether or not anything is invoked, so a reliable toggle-off (EventBridge one-shot + 60s sweep) is mandatory: a leaked toggle on 100 shards burns $168/day.
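
The warm-minute and leaked-toggle figures follow directly from the reservation rate. A minimal sketch, assuming the rate and 4 GB shard size quoted above; `warm_cost` is an illustrative name.

```python
RESERVATION_USD_PER_GB_S = 0.0000048673  # eu-west-3 x86, billed while READY


def warm_cost(shards: int, seconds: float, memory_gb: float = 4.0) -> float:
    """Reservation cost of keeping `shards` functions warm for `seconds`."""
    return memory_gb * shards * seconds * RESERVATION_USD_PER_GB_S


print(f"1 shard,  1 min : ${warm_cost(1, 60):.5f}")
print(f"100 shards, 1 min : ${warm_cost(100, 60):.3f}")
print(f"100 shards, 1 hour: ${warm_cost(100, 3600):.2f}")
print(f"100 shards, leaked for a day: ${warm_cost(100, 86_400):.0f}")
```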

5.3 Usage Cost vs Activation Cost — the friction equation

Define friction as the fraction of warm-window time the mesh sat ready but doing nothing useful. 0% friction = perfectly back-to-back jobs. 50% friction = half your warm window was idle.

provisioned $/min  $0.117  (full mesh)
on-demand   $/min  $0.0083 (measured, 5K×5L bench)
                  ───────
ratio              14×     provisioning is 14× the price of plain Lambda

So provisioning only pays for itself when the warm window crams enough back-to-back compute to dilute the reservation tax. Concretely, for a 5K×5L job (one warm session, full 100-shard mesh, 60s budget):

| Scenario | Compute cost | Provisioning cost | Per job | Per million train rows |
|---|---|---|---|---|
| On-demand only (today) | $0.005 | | $0.005 | $1.13/M |
| Toggle, 0% friction (perfect) | $0.005 | $0.0117 | $0.017 | $3.34/M |
| Toggle, 50% friction (realistic) | $0.005 | $0.117 | $0.122 | $24.40/M |
| Toggle, 5% friction (heavy session) | $0.005 | $1.17 | $1.18 | $235/M |
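
The three toggle rows are consistent with charging the full-mesh rate ($0.117/min) for warm windows of 6 s, 60 s and 600 s per job. Those durations are an inference from the provisioning column, not values stated in the source; the sketch below shows how the per-job and per-million columns follow from them.

```python
FULL_MESH_USD_PER_MIN = 0.117  # 100 shards x 4 GB, from section 5.2
COMPUTE_USD = 0.005            # on-demand compute for one 5K x 5L job
ROWS_PER_JOB = 5_000


def per_job(warm_seconds: float) -> float:
    """Compute plus reservation for the warm time attributed to one job."""
    return COMPUTE_USD + FULL_MESH_USD_PER_MIN * warm_seconds / 60


def per_million_rows(warm_seconds: float) -> float:
    return per_job(warm_seconds) / ROWS_PER_JOB * 1e6


for warm_s in (0, 6, 60, 600):  # 0 s of warm time is the on-demand case
    print(f"{warm_s:>4d}s warm -> ${per_job(warm_s):.4f}/job, "
          f"${per_million_rows(warm_s):.2f}/M rows")
```

The on-demand case prints $1.00/M rather than $1.13/M because the table rounds the per-job compute to $0.005.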

Back-to-back jobs dilute the reservation tax but never erase it: the per-job price falls as the session fills, yet stays above the on-demand price (§5.4). One job per session is the worst case: you pay roughly 14× the compute cost for ~2× the wall-clock improvement.

Break-even cheat sheet.
1 job/session → on-demand wins outright (toggle costs ~22× more per job for ~2× speed)
5 jobs/session → toggle still ~6.6× on-demand per job
20+ jobs/session → toggle ~3.5× on-demand per job; it wins on wall clock only

5.4 The amortization scenario — when warm Lambdas pay off

Provisioning is a session primitive, not a per-job one. The 84s spin-up plus the 14× reservation rate means a single isolated job gets crushed by the toggle. But a session — a stretch of clock time during which you fire many jobs back to back — can amortize the warm cost down to fractions of a cent per job. Concrete numbers, full mesh (100 shards × 4 GB), 60s of spin-up amortized into the budget:

| Session shape | Wall clock | Jobs run | Provisioning $ | Compute $ | $/job | vs on-demand |
|---|---|---|---|---|---|---|
| Spin up, run 1 job | ~90 s | 1 | $0.105 | $0.005 | $0.110 | 22× worse |
| Spin up, run 5 jobs | ~120 s | 5 | $0.140 | $0.025 | $0.033 | 6.6× worse |
| Spin up, run 20 jobs | ~210 s | 20 | $0.245 | $0.100 | $0.0173 | 3.5× worse |
| 1-hour fully-saturated session | 3,600 s | ~600 | $7.01 | $3.00 | $0.0167 | 3.3× worse |
| Compute-saturated infinity (theoretical floor) | | | $/sec ratio | $/sec ratio | $0.005 | parity |

Provisioned concurrency literally cannot beat on-demand on $/job. The reservation rate is 14× the per-second compute rate, so even a mythical 100%-utilized warm window pays exactly the on-demand price plus the reservation tax. Provisioning never wins on cost; it wins on wall clock.
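
The $/job and "vs on-demand" columns are derived from the other columns, which the sketch below makes explicit. It takes the provisioning dollars and job counts from the table as given inputs; `session_cost` is an illustrative name.

```python
ON_DEMAND_USD_PER_JOB = 0.005  # compute cost of one 5K x 5L job


def session_cost(provisioning_usd: float, jobs: int) -> tuple[float, float]:
    """Return ($ per job, multiple of the on-demand price) for one warm session."""
    per_job = (provisioning_usd + jobs * ON_DEMAND_USD_PER_JOB) / jobs
    return per_job, per_job / ON_DEMAND_USD_PER_JOB


# (provisioning $, jobs run) pairs taken from the table rows above.
for prov, jobs in [(0.105, 1), (0.140, 5), (0.245, 20), (7.01, 600)]:
    per_job, multiple = session_cost(prov, jobs)
    print(f"{jobs:>3d} jobs: ${per_job:.4f}/job, {multiple:.1f}x on-demand")
```

Because the provisioning term is always positive, the multiple stays above 1 for any finite session, which is the point the paragraph above makes.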

5.5 So what's it actually for?

Provisioning trades dollars for predictable latency. The 14× tax is the price of skipping cold starts and the L-A1AFA3CF spawn-rate ceiling on the provisioned fraction. Three regimes where that trade is worth taking:

| Regime | Why warm helps | Cost-frame |
|---|---|---|
| Interactive demo | 5K×5L drops from ~6 s to ~2 s wall clock; live audience doesn't see latency | $0.10 per 60s warm window; eat it as a sales cost |
| SLA predict throughput | Cloud predict at scale needs 100s of leaves spawning together; cold-start tail blows P99 | $7/hour during business hours; charge it to the customer's SLA tier |
| Continuous training pipeline | Hourly model rebuild on fresh data, jobs back-to-back, want each rebuild <15 s | $84/day if always warm during work hours; tolerable if model rebuild is critical-path |

For everything else — ad-hoc training, exploratory benches, the "morning bruv let me try this dataset" flow — on-demand is the right default. v9's small-job wall clock (~6 s for 5K×5L) is bad enough to justify the toggle when latency is the product, fine to live with when correctness is the product.

5.6 Honest per-million by usage profile

| Profile | Volume | Toggle? | $/M train rows | What dominates |
|---|---|---|---|---|
| Infrequent | 1–5 jobs/month | off | ~$5,000/M | EC2 fixed cost; per-row math meaningless |
| Active | ~10K rows/day | off | $1.13/M + $325/M fixed | EC2 still 290× usage |
| Production | ~100K rows/day, spot toggle | 50% friction | $24.40/M + $32.50/M fixed | activation & fixed roughly even |
| Heavy | 1M rows/day, session toggle | 10% friction | $11.70/M + $3.25/M fixed | activation dominates compute |
| Industrial | 10M rows/day, always warm | 0% friction (saturated) | $3.34/M + $0.33/M fixed | compute ≈ activation, fixed gone |

[Figure: $/job vs jobs per warm session (full 100-shard mesh). x-axis: jobs run within one warm window (1, 5, 10, 20, 50, 100+); y-axis: $/job from $0.00 to $0.12. Toggle curve (60s budget per job) with labelled points $0.110, $0.033, $0.0205, $0.0173, $0.0070; on-demand line at $0.005/job. Annotation: asymptote 14× on-demand, never crosses parity.]

The blue dashed line is on-demand — the toggle curve approaches 14× that and stops. Provisioning is a wall-clock product, not a cost product.

The cost class still holds. Even the worst realistic profile (Production, 50% friction) lands at $24.40/M train rows + $98/mo fixed. Vertex AI Tabular at $200/M is 8× more expensive at the same volume. The 1000× advantage in §3 is the headline number, not the worst-case number — honest framing is "4–1000× cheaper depending on profile."

6. EC2 Gatherer Cost (deep dive)

One t4g.2xlarge in eu-west-3, 8 vCPU / 32 GB:

On-demand : $0.1344/hour = $96.77/month
Reserved 1y : $0.0907/hour = $65.30/month  (32% off)

The gatherer hosts the FastAPI app (/v9/train, /v9/status, /v9/model, /grid/v9) and runs the SQS drainer thread. Idle CPU 99%, RAM 0.85 / 30 GB at current load — massively overprovisioned for a single user, perfect for multi-tenant scale.

7. Inference Cost

Two paths:

7.1 Local handoff — monceai.Snake.to_algorithmeai()

Download /v9/model/{id}, instantiate algorithmeai.Snake(path), predict locally. $0 per prediction, sub-millisecond per row on commodity hardware, zero network roundtrip.

This is the recommended path for v9 because the model JSON is semantically equivalent to a locally trained Snake. Unless the deployment needs multi-tenancy, an audit trail, or central enforcement (§3.1), there is no operational reason to keep inference in the cloud.

7.2 Cloud predict — /v9/predict/{id}

Not yet implemented. When wired, expected:

v9-predict (1 GB, ~5ms warm):
  invoke  : $0.0000002
  compute : 1 GB × 0.005s × $0.0000166667 = $0.000000083
  Total per Lambda ≈ $0.0000003 per call
  ≈ $0.30 per million predictions

8. Cost vs Provider Pricing

Comparing v9 to closed-source classification APIs at 10K-row training scale:

| Provider | 10K-row train cost | Wall clock | Inference |
|---|---|---|---|
| SnakeBatch v9 | $0.001 | ~8s | $0 local / $0.30/M cloud |
| Vertex AI Tabular | $1–3 (sustained) | ~20m minimum | per-call billed |
| SageMaker Autopilot | $5–10 | ~1h | endpoint $/hr |
| OpenAI fine-tune classifier | $10+ per 10K | ~30m | token-billed |

The SnakeBatch number isn't a discount; it's a different cost class. SAT-by-construction means we don't pay for backprop, we don't pay for hyperparameter search, we don't pay for retries. The work is polynomial in n and linear in spot Lambda time. There is no model to "fit" beyond constructing the formula.

© 2026 Charles Dana · Monce SAS · SnakeBatch v9 · /paper · /architecture · /math