SnakeBatch v9: 100-Shard Mesh with Self-Sustaining Tree Completion

Charles Dana · Monce SAS · May 2026

snakebatch.aws.monce.ai · /architecture · /economics · /concurrency · /math

Abstract

v9 spreads training across 100 identical Lambda functions (v9-worker-pipe-{0..99}). Sharding is for blast-radius isolation, not throughput multiplication: AWS gates on-demand spawn at the account-level L-A1AFA3CF rate (1000/min = 16.67/s), so 100 shards share one budget. Three roles dispatched by SQS message type: divide recursively halves until slice ≤ τ, conquer hands the slice to one leaf, leaf calls algorithmeai.Snake(local_pop, n_layers=1, bucket=user_bucket, noise=0) and emits literals to a single report queue. An EC2 c7g.large gatherer assembles the model from report messages.

The load-bearing invariant: every assembled bucket carries a unique complete-AND condition. Snake's native build_bucket_chain is honored at the leaf, so the v9 model.json is structurally a flat IF/ELIF/ELSE chain that algorithmeai.Snake rehydrates and traverses byte-for-byte identically to a locally-trained model.

1. The Bayesian-Contradiction Bug We Removed

The earlier conquer-then-fan design chunked indices into ⌈n/bucket⌉ slices and fired one leaf per chunk, all sharing the same cond_prefix. After assembly, the layer contained multiple buckets with identical conditions:

layer 0 (broken):
  bkt[0]: cond=[score < 50],   members={32 of B}, lookalikes={0..31: ...}
  bkt[1]: cond=[score < 50],   members={32 of B}, lookalikes={0..31: ...}
  bkt[2]: cond=[score < 50],   members={32 of B}, lookalikes={0..31: ...}
  ...
  bkt[10]: cond=[score ≥ 50, x < -3], members={32 of A}, ...

algorithmeai's traverse_chain walks top-to-bottom and returns the first matching bucket. So a row routed to score < 50 landed in bkt[0], saw 32 lookalikes, and the other 9 buckets carrying 165+ additional B-members became dead weight — their lookalikes never voted.

This produced phantom probability mass: a row with lookalike_tally = {A: 50} (unanimous) returned {A: 0.76, B: 0.24}. The 0.24 of B came from thin-pool fallback — a vote starved by undersampling.

Bayesian contradiction. A bucket's condition is a contract: every voting lookalike must satisfy that AND. Two buckets sharing the same condition compete for the same row, and the loser's votes vanish. The only stable shape is unique complete-AND per bucket.
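
The dead-bucket effect described above can be reproduced with a toy first-match chain. This is a minimal sketch, not algorithmeai's traverse_chain: the bucket layout, the `first_match` helper, and the two predicates are assumptions made only for illustration.

```python
# Toy first-match chain (illustrative; not the algorithmeai implementation).
def first_match(chain, row):
    """Return the index of the first bucket whose conditions all hold for row."""
    for idx, bucket in enumerate(chain):
        if all(cond(row) for cond in bucket["cond"]):
            return idx
    return None

def low(row):    # score < 50
    return row["score"] < 50

def high(row):   # score >= 50
    return row["score"] >= 50

# Broken layer: three buckets share one condition, as in the chunked design.
broken = [
    {"cond": [low], "members": 32},
    {"cond": [low], "members": 32},
    {"cond": [low], "members": 32},
    {"cond": [high], "members": 32},
]

rows = [{"score": s} for s in range(0, 100, 5)]
reached = {first_match(broken, r) for r in rows}
dead = [i for i in range(len(broken)) if i not in reached]

print("reached buckets:", sorted(reached))
print("dead buckets:", dead)
```

Every row with score < 50 stops at the first bucket, so the two duplicates behind it never contribute a vote, whatever members they hold.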

2. The Fix — Conquer Fans One Leaf

The bucket parameter was being misused as a slicing knob at the conquer level, when it is supposed to be Snake's native partition size inside the leaf. The fix:

Conquer stops chunking. It fans exactly one leaf with all n ≤ τ indices.
Leaf calls Snake(local_pop, bucket=user_bucket) and lets Snake's build_bucket_chain do the IF/ELIF/ELSE partition with native per-bucket conditions.
_globalize prepends cond_prefix to each Snake-emitted condition: [*divide_path, *snake_subcondition]. Every assembled bucket gets a unique complete-AND, by construction.
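
The prefixing step can be sketched as follows. The function name `globalize`, the dict-based bucket shape, and the string conditions are hypothetical stand-ins; the real _globalize and Snake's bucket format may differ.

```python
# Hypothetical sketch of the prefixing step; data shapes are assumptions.
def globalize(cond_prefix, snake_buckets):
    """Prepend the divide path to every condition emitted inside one leaf."""
    return [
        {**bucket, "cond": [*cond_prefix, *bucket["cond"]]}
        for bucket in snake_buckets
    ]

# Two leaves under different divide paths, each with its own native partition.
leaf_a = globalize(["score < 50"], [{"cond": ["x < -3"]}, {"cond": ["x >= -3"]}])
leaf_b = globalize(["score >= 50"], [{"cond": ["x < -3"]}, {"cond": ["x >= -3"]}])

layer = leaf_a + leaf_b
conditions = [tuple(b["cond"]) for b in layer]

# The invariant: no two assembled buckets carry the same complete AND.
unique = len(set(conditions)) == len(conditions)
print(conditions)
print("unique complete-AND per bucket:", unique)
```

Because divide paths are disjoint and each leaf's sub-conditions are distinct within that leaf, the concatenated conditions cannot collide in this toy setup.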

3. Result: Confidence Restored to 1.0

Before vs. after on the same 1000-row 3-class Gaussian (well-separated, σ=1, centers ≥6):

Variant	Fit	max_p mean	max_p min	p ≥ 0.99
v9 cloud (broken, chunked)	100%	0.9978	0.5963	989 / 1000
v9 cloud (one-leaf-per-conquer)	100%	1.000	1.000	1000 / 1000
algorithmeai (local twin)	100%	1.000	1.000	1000 / 1000

The tree shrank from 166 leaves/job to 23 leaves/job for the same 5L training, because Snake's native partition is class-aware (4 buckets for 500 mixed-class rows, not 16). Wall time is 6.06 s: each leaf runs slower, but the tree is dramatically simpler.

The v9 model.json is now semantically equivalent to a locally-trained Snake. You can download /v9/model/{id}, load it with Snake(path), and inference is identical to local training. No reconciliation logic, no special-casing, no fallback votes.

4. Architecture in One Picture

Client (monceai SDK)
  ↓ HTTPS
EC2 gatherer (c7g.large) — FastAPI: /v9/train /v9/status /v9/model
  ↓ SQS-style invoke
v9-worker-pipe-{0..99} — one binary, three roles by message:

      role=divide:                     role=conquer:                role=leaf:
      ----------                        -----------                  ---------
      n ≤ τ ?                          fan ONE leaf with all       Snake(local_pop,
        no  →  oppose → left/right      indices ≤ τ.             n_layers=1,
              fire 2 divides            wait DDB ack.                bucket=user_bucket,
        yes →  fire 1 conquer.        emit None.                   noise=0)
      emit None.                                                     SQS report → gatherer

Gatherer SQS drainer thread → assembles layers → model.json
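
The message flow in the picture can be simulated locally with a plain queue. This is a sketch under stated assumptions: `run_tree` is a hypothetical helper, an in-memory deque stands in for SQS, and an even midpoint split stands in for the oppose split, so real slice sizes will differ.

```python
from collections import deque

def run_tree(n, tau):
    """Simulate the divide -> conquer -> leaf flow for n indices and threshold tau."""
    queue = deque([("divide", list(range(n)))])
    counts = {"divide": 0, "conquer": 0, "leaf": 0}
    leaf_sizes = []
    while queue:
        role, idx = queue.popleft()
        counts[role] += 1
        if role == "divide":
            if len(idx) <= tau:
                queue.append(("conquer", idx))       # slice is small enough
            else:
                mid = len(idx) // 2                  # stand-in for the oppose split
                queue.append(("divide", idx[:mid]))
                queue.append(("divide", idx[mid:]))
        elif role == "conquer":
            queue.append(("leaf", idx))              # exactly one leaf, all indices
        else:
            leaf_sizes.append(len(idx))              # a leaf would train Snake here
    return counts, leaf_sizes

counts, leaf_sizes = run_tree(5000, 1000)
print(counts)
print(leaf_sizes)
```

With even halving, 5000 indices at τ = 1000 end up as eight slices of 625, one conquer and one leaf per slice.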

5. What's Different from v6

v6 was one Lambda function (recursive). v9 is one Lambda binary deployed to 100 functions, dispatched by SQS message type. The reason isn't aggregate throughput (L-A1AFA3CF caps that at 1000/min account-wide regardless of how many functions you declare); it's blast-radius isolation. v6's recursive self-invocation under the May-2026 rate-limit tightening burned $1057 in 19 hours retrying into its own throttle. v9's 100 functions cap each shard's spawn at 1% of the budget, so a misbehaving job hits its own shard's wall instead of draining the account. Fan-out parallelism is governed by L-A1AFA3CF (free) plus toggled provisioned slots (paid) — see §7 and /concurrency §4.

The other change: v6 had a tautology shim in the leaf (_tautological_layer) to handle 1-class slices. v9 removed it. Snake handles the 1-class case natively (emits empty clauses + position-keyed lookalikes), and the tautology shim was emitting target-name lookalike keys incompatible with get_lookalikes_fast's int(key) cast. Removing the shim fixed inference at zero cost.
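
The key incompatibility comes down to what survives an int() cast. The sketch below uses made-up lookalike dictionaries; only the int(key) behaviour is taken from the text, and `keys_castable` is a hypothetical helper, not part of algorithmeai.

```python
def keys_castable(lookalikes):
    """Return True if every lookalike key survives an int() cast."""
    try:
        for key in lookalikes:
            int(key)
    except ValueError:
        return False
    return True

# Assumed shapes, for illustration only.
position_keyed = {"0": ["B"], "1": ["B"]}   # keys are positions
target_keyed = {"B": [0, 1]}                # keys are target names

print(keys_castable(position_keyed))
print(keys_castable(target_keyed))
```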

6. Validation — 3 Datasets × 5K rows

Live run, 2026-05-21, all predicts via /v9/predict-sync (cloud_threshold = 1, every row goes through the mesh). n_layers=5, bucket=250, noise=0.25. Two passes per dataset: perfect-fit (train on 5K, predict on the same 5K) and held-out 80/20 (train on 4K, predict on 1K unseen).

Dataset	Task	Pass	Train (s)	Predict (s)	Metric
Binary Gaussian (5K)	classification	perfect-fit (5K→5K)	4.80	6.09	acc 100.00%
Binary Gaussian (5K)	classification	held-out 80/20 (4K→1K)	4.01	4.76	acc 100.00%
3-class numeric+text (5K)	classification	perfect-fit	4.66	6.26	acc 100.00%
3-class numeric+text (5K)	classification	held-out 80/20	3.72	4.44	acc 100.00%
y = 2.5a + b² + N(0,0.5) (5K)	regression	perfect-fit	24.93	3.74	R² = 1.0000, RMSE = 0.00
y = 2.5a + b² + N(0,0.5) (5K)	regression	held-out 80/20	20.39	4.20	R² = 0.9845, RMSE = 0.96, MAE = 0.70

Numbers persist in v9/bench/results_3x5k.json with model_id and JSON path for every model trained, so any of these can be reloaded with algorithmeai.Snake(path) for offline replay.

Perfect fit is by construction — the Dana Theorem guarantees an indicator-to-CNF that's exact on the training set, and v9's unique-complete-AND invariant assembles it without contradiction. Held-out R² = 0.9845 on a continuous target with 4000 training rows and zero feature engineering is the corollary: when the construction is exact, generalization is what the bucketing actually preserves.

7. Scaling and Cost — from 5K Bench to 1M

The 5K bench in §6 demonstrates correctness. This section extrapolates to 1M rows under the empirical wall-clock fit, prices the AWS bypass needed to hold that curve, and gives a closed-form upper bound. Default training config holds τ = 1000 throughout — we do not tune it for cost.

7.1 Empirical wall-clock fit

Six sequential runs at L=25 (synthetic regression, N from 2K to 4K) fit a power law with R² = 0.994:

t(N, L) ≈ (L / 25) · 0.0427 · N^0.777  seconds
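
A direct transcription of the fit makes it easy to evaluate at other sizes. The function name `wall_clock_seconds` is an assumption; the constants are the ones quoted above, and values outside the fitted range are extrapolations.

```python
def wall_clock_seconds(n_rows, n_layers):
    """Empirical power-law fit: t ≈ (L / 25) · 0.0427 · N^0.777 seconds."""
    return (n_layers / 25) * 0.0427 * n_rows ** 0.777

for layers in (1, 5, 25):
    t = wall_clock_seconds(1_000_000, layers)
    print(f"N=1M, L={layers}: {t:.0f} s ({t / 60:.1f} min)")
```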

This is observably the same regime described in /economics §5.4: mesh amortization across overlapping layers, parent-poll cost diluted by depth, leaves dominating per-row time. We treat it as the v9 wall-clock model up to the point where peak invocations/second crosses the AWS spawn ceiling — past that, the run does not slow down, it fails (see §7.3).

7.2 The pass/fail invariant

L-A1AFA3CF is binary, not soft. A 1-minute window at peak rate > 16.67/s without bypass causes AWS to reject the excess invocations; the divide tree's parent-poll waits never resolve, the gatherer sees zero leaves return, and the job wedges. CloudWatch's 10-second resolution shows this; the 1-minute SUM of throttles can read 0 while children downstream are silently dying. Empirically observed at N=7500, L=25, τ=1000: peak 17.40/s on the 15:38Z bar, full mesh stall.

So the feasibility predicate is binary:

FEASIBLE(N, L, τ, N_slots) ≡
    L · ⌈N/τ⌉ / spawn_window(N, τ)  ≤  16.67 + N_slots / t_leaf

with spawn_window(N, τ) ≈ 3.8 s · ⌈log₂(N/τ)⌉     (measured)
     t_leaf                  ≈ 0.7 s                              (4 GB warm)
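
The predicate translates almost line for line into Python. This is a sketch: the function names are assumptions, and clamping the level count to at least 1 for N ≤ τ is an added guard that the predicate itself does not specify.

```python
import math

CEILING = 1000 / 60   # L-A1AFA3CF: 1000 spawns/min, about 16.67/s
T_LEAF = 0.7          # seconds per leaf (4 GB warm), as measured above
LEVEL_S = 3.8         # seconds of spawn window per recursion level

def spawn_window(n, tau):
    """Measured approximation: 3.8 s per level of the divide tree."""
    levels = max(1, math.ceil(math.log2(n / tau)))   # guard for n <= tau (assumption)
    return LEVEL_S * levels

def peak_rate(n, layers, tau):
    """Leaf spawns per second during the spawn window."""
    return layers * math.ceil(n / tau) / spawn_window(n, tau)

def feasible(n, layers, tau, n_slots=0):
    """True if the peak rate stays under the ceiling plus bypass capacity."""
    return peak_rate(n, layers, tau) <= CEILING + n_slots / T_LEAF

print(feasible(5_000, 5, 1000))        # the 5K bench configuration
print(feasible(7_500, 25, 1000))       # the stalled run quoted in §7.2
print(feasible(1_000_000, 5, 1000, n_slots=81))
```

For the N=7500, L=25 run the predicate gives a peak just above the ceiling, in line with the observed stall.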

7.3 Slot complexity

Solving the predicate for the minimum bypass slots needed to keep peak rate under the ceiling at τ=1000:

N_slots(N, L) = max(0, ⌈(L · N / (3800 · log₂(N/1000)) − 16.67) · 0.7⌉)

               = O(L · N / log N)
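
A small helper evaluates the slot requirement numerically. It is a sketch with an assumed name (`min_slots`); it keeps the ceilings from the feasibility predicate in §7.2 rather than the smooth log₂ of the closed form, so results can differ by a slot or two from the closed form at large L.

```python
import math

def min_slots(n, layers, tau=1000, ceiling=1000 / 60, t_leaf=0.7, level_s=3.8):
    """Minimum provisioned slots that keep the peak spawn rate feasible."""
    levels = max(1, math.ceil(math.log2(n / tau)))   # guard for n <= tau (assumption)
    peak = layers * math.ceil(n / tau) / (level_s * levels)
    return max(0, math.ceil((peak - ceiling) * t_leaf))

for layers in (1, 5, 25):
    print(f"N=1M, L={layers}: {min_slots(1_000_000, layers)} slots")
print("N=5K, L=5:", min_slots(5_000, 5), "slots")
```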

Slot demand grows almost linearly in N: the leaf count grows linearly, and the depth term dilutes it only by a factor of log N. This is the single most important fact about v9 economics at τ=1000: holding the N^0.777 wall-clock curve through 1M rows requires bypass capacity that scales nearly with N itself.

7.4 Cost decomposition

Two terms: leaf compute (linear in total leaves × layers) and provisioned-slot reservation (linear in slot count × wall clock). Compute is calibrated against the 5K bench at $1.13/M training rows at L=5 (/economics §3.1). Reservation is billed at the eu-west-3 rate of $0.0000048673/GB-s for 4 GB slots, paid per second the toggle is on:

$compute(N, L)     = $1.13e-6 · (L/5) · N

$reservation(N, L) = N_slots(N, L) · 4 GB · t(N, L) · $0.0000048673

                   = $1.95e-5 · N_slots · t

$total(N, L)       = $compute + $reservation
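
The three cost terms can be evaluated together in a few lines. This is a sketch: the helper names are assumptions, the constants are the ones quoted in this section, and the slot count uses the ceiling form of the §7.2 predicate.

```python
import math

GB_S_RATE = 0.0000048673   # $/GB-s, eu-west-3 rate quoted above
MEM_GB = 4                 # slot memory size

def wall_clock(n, layers):
    """Empirical fit from §7.1, in seconds."""
    return (layers / 25) * 0.0427 * n ** 0.777

def min_slots(n, layers, tau=1000, ceiling=1000 / 60, t_leaf=0.7, level_s=3.8):
    """Slots needed to satisfy the §7.2 predicate."""
    levels = max(1, math.ceil(math.log2(n / tau)))   # guard for n <= tau (assumption)
    peak = layers * math.ceil(n / tau) / (level_s * levels)
    return max(0, math.ceil((peak - ceiling) * t_leaf))

def cost(n, layers):
    """Return (compute, reservation, total) in dollars."""
    compute = 1.13e-6 * (layers / 5) * n
    reservation = min_slots(n, layers) * MEM_GB * wall_clock(n, layers) * GB_S_RATE
    return compute, reservation, compute + reservation

for layers in (1, 5, 25):
    compute, reservation, total = cost(1_000_000, layers)
    print(f"L={layers}: compute ${compute:.2f}, reservation ${reservation:.2f}, total ${total:.2f}")
```

Small differences in the last cent against hand-rounded figures are expected, since the reservation constant is not rounded to $1.95e-5 here.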

7.5 Cost complexity

Substituting the asymptotic forms:

$_compute     = O(L · N)                         (linear leaves × per-leaf bill)
$_reservation = O(L² · N^1.777 / log N)        (slots × wall clock)
$_total       = O(L² · N^1.777 / log N)        for large N

The cost is super-linear in N once bypass kicks in. This is the counterpart to the sub-linear N^0.777 time complexity: holding wall clock sub-linear at fixed τ costs super-linear money. The two together are internally consistent — total Lambda-seconds (compute) bills the integral of work and is linear; reservation bills the wall-clock window times the slot count, and the slot count itself grows nearly with N.
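
The substitution can be written out step by step. This is only a restatement of the asymptotic forms from §7.1, §7.3 and §7.4 in LaTeX; no new quantities are introduced.

```latex
\begin{aligned}
N_{\text{slots}}(N, L) &= O\!\left(\frac{L\,N}{\log N}\right),
\qquad t(N, L) = O\!\left(L\,N^{0.777}\right) \\[4pt]
\$_{\text{reservation}} &\propto N_{\text{slots}} \cdot t
  = O\!\left(\frac{L\,N}{\log N}\right) \cdot O\!\left(L\,N^{0.777}\right)
  = O\!\left(\frac{L^{2}\,N^{1.777}}{\log N}\right) \\[4pt]
\frac{\$_{\text{reservation}}}{\$_{\text{compute}}}
  &= O\!\left(\frac{L^{2}\,N^{1.777} / \log N}{L\,N}\right)
  = O\!\left(\frac{L\,N^{0.777}}{\log N}\right)
\end{aligned}
```

The last ratio grows without bound in N, which is why the reservation term eventually dominates the total.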

The crossover at L=25 is around N≈30K: below it, compute dominates and the toggle is a rounding error; above it, the reservation term takes over and each doubling of N more than doubles total cost.

7.6 The 1M run, priced

Plugging N = 10⁶, τ=1000, t_leaf=0.7s into the equations:

L	peak rate	slots needed	wall clock	$ compute	$ reservation	$ total
1	26.3/s	7	78 s	$0.23	$0.01	$0.24
5	132/s	81	6.5 min	$1.13	$0.62	$1.75
25	658/s	449	33 min	$5.65	$17.16	$22.81

Headline. A 1M×5L Snake classifier trains in 6.5 minutes for $1.75 total, holding τ=1000 throughout. 81 provisioned slots toggled on at job start, off at job end — per-second billing means leaking a few minutes of toggle costs cents. Compute is $1.13 (linear in N), bypass reservation adds $0.62 (super-linear in N).

7.7 Why this matters for the headline

The naive extrapolation "5K cost × 200 = 1M cost" gives $0.05, which is the number that lived in /economics §3 for months. That number is wrong by 35×. It assumes free L-A1AFA3CF headroom, which holds at 5K (peak rate ~7/s, sub-ceiling) and fails at 1M (peak rate 132/s, roughly eight times the 16.67/s ceiling). The compute piece extrapolates linearly; the reservation piece is what makes the run feasible at all, and it is not in the 5K bench because the 5K bench did not need it.

The honest /economics-line-69 number for "1M×5L" at τ=1000 is $1.75, not $0.05. Past 1M, the L²·N^1.777 term dominates and SnakeBatch becomes a paid product on AWS rather than a free one. The lever to bend this back to linear is fixing the parent-poll chain-serialization (see /architecture §8), which dilutes the spawn-window per recursion level and lowers the slot requirement at fixed wall-clock. Until then, the τ=1000 invariant pins us to this curve.

8. Open Work — Real Noise Injection

Snake's noise parameter samples from local_pop, which post-divide is the routed slice. Injecting that as noise inside the leaf doesn't add diversity — routed members already satisfy the path's AND. Worse, if Snake ever picked noise members violating the path, it would reintroduce the Bayesian contradiction. So v9 hard-codes noise=0 at the leaf.

True regularization noise = members of the global population that satisfy the path's AND but were not selected by oppose for routing. That's a gatherer-side or pre-leaf-augment problem, not Snake's internal noise. Reserved for the next iteration.
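
The selection rule for such noise candidates can be sketched as a filter. Everything here is hypothetical, since the feature is reserved for a later iteration: the `noise_candidates` helper, the row shape, and the predicate-based path are assumptions for illustration.

```python
def noise_candidates(global_pop, path, routed_ids):
    """Rows that satisfy the path's AND but were not routed to this leaf."""
    return [
        row for row in global_pop
        if row["id"] not in routed_ids and all(cond(row) for cond in path)
    ]

# Toy population: six rows with one numeric feature.
global_pop = [{"id": i, "score": s} for i, s in enumerate([10, 20, 30, 40, 60, 70])]
path = [lambda row: row["score"] < 50]   # the divide path's AND
routed_ids = {0, 1}                      # rows already routed to the leaf

candidates = noise_candidates(global_pop, path, routed_ids)
print([row["id"] for row in candidates])
```

Rows 4 and 5 are excluded because they violate the path, which is exactly the case that would reintroduce the contradiction if they were allowed to vote.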

9. Layout

v9/
  worker/handler_snake.py    one binary, three roles
  gatherer/v9_train.py       FastAPI: /v9/train /v9/status /v9/model
  gatherer/grid_v9.py        live grid: /grid/v9
  esquisses/                 perfect-fit probes + Bayesian-contradiction writeup