Single binary, three roles, dispatched by event["role"]. The
conquer role is the only blocking node — it parent-polls DDB
until the leaf below it acks. This is also the chain bottleneck observed in
practice (see /economics §5).
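A minimal sketch of the role dispatch, assuming a Lambda-style `handler(event, context)` entry point. The three `run_*` functions are hypothetical placeholders for the real role bodies, which are described in the pseudocode that follows.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def run_divide(event: Event) -> str:
    # Placeholder: the real role splits the slice and fires two children.
    return "divide"


def run_conquer(event: Event) -> str:
    # Placeholder: the real role fires one leaf and waits for its ack.
    return "conquer"


def run_leaf(event: Event) -> str:
    # Placeholder: the real role trains a local Snake and emits buckets.
    return "leaf"


ROLE_HANDLERS: Dict[str, Callable[[Event], str]] = {
    "divide": run_divide,
    "conquer": run_conquer,
    "leaf": run_leaf,
}


def handler(event: Event, context: Any = None) -> str:
    """Route one invocation to its role; reject unknown roles loudly."""
    role = event.get("role")
    if role not in ROLE_HANDLERS:
        raise ValueError(f"unknown role: {role!r}")
    return ROLE_HANDLERS[role](event)
```

Keeping the table explicit means a malformed event fails at the entry point rather than deep inside a role.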
role=divide
    input : {indices, cond_prefix, tau, bucket, s3_key, ...}
    if len(indices) ≤ tau:
        fire ONE conquer with indices, cond_prefix, bucket
        return
    oppose(A, B) on the slice's targets → literal lit
    left  = [i for i in indices if apply_literal(pop[i], lit)]
    right = [i for i in indices if not apply_literal(pop[i], lit)]
    fire divide(left,  cond_prefix + [lit_pos])
    fire divide(right, cond_prefix + [lit_neg])
    return    # <100ms
Divide nodes are cheap. They never block on children. Spot-concurrency holds because each divide occupies a Lambda slot for a few hundred milliseconds, not the full subtree wall time.
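The divide recursion can be simulated in one process to see which conquer payloads a given tau produces. This is a sketch under stated assumptions: a queue stands in for "fire", `pick_literal` is a hypothetical stand-in for Snake's `oppose(A, B)` (which is not reproduced here), and the degenerate-split guard is an addition that the pseudocode does not mention.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

Row = float
Test = Callable[[Row], bool]
Picker = Callable[[Sequence[Row], List[int]], Tuple[str, Test]]


def median_literal(pop: Sequence[Row], idx: List[int]) -> Tuple[str, Test]:
    """Hypothetical splitter: cut the slice at its median value."""
    values = sorted(pop[i] for i in idx)
    cut = values[len(values) // 2]
    return f"x < {cut}", (lambda row, cut=cut: row < cut)


def divide_tree(pop: Sequence[Row], indices: Sequence[int], tau: int,
                pick_literal: Picker = median_literal
                ) -> List[Tuple[List[int], List[str]]]:
    """Return the (indices, cond_prefix) pairs that would reach conquer."""
    pending = deque([(list(indices), [])])
    conquers: List[Tuple[List[int], List[str]]] = []
    while pending:
        idx, prefix = pending.popleft()
        if len(idx) <= tau:
            conquers.append((idx, prefix))
            continue
        name, test = pick_literal(pop, idx)
        left = [i for i in idx if test(pop[i])]
        right = [i for i in idx if not test(pop[i])]
        if not left or not right:
            # Assumption: a literal that separates nothing ends the split,
            # otherwise the same slice would be re-fired forever.
            conquers.append((idx, prefix))
            continue
        pending.append((left, prefix + [name]))
        pending.append((right, prefix + ["not " + name]))
    return conquers
```

Each node only partitions its own slice and enqueues two children, which mirrors why a divide invocation never needs to wait on its subtree.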
role=conquer
    input : {indices, cond_prefix, bucket, s3_key, ...}
    create DDB parent counter (expected=1)
    fire ONE leaf with the full indices, bucket, cond_prefix
    poll DDB until completed_children == 1 (or timeout)
    emit None.
This is the only blocking node. It exists so the divide tree has a clear "all leaves below me have ack'd" semantic. We can later replace this with queue-as-truth completion without changing the leaf or divide.
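The blocking wait in conquer is a poll-with-deadline loop. A minimal sketch, assuming the counter read is injected as a callable (`read_completed` is a hypothetical name; in the real system it would be a DDB read) so the loop can be exercised without AWS:

```python
import time
from typing import Callable


def wait_for_children(read_completed: Callable[[], int],
                      expected: int,
                      timeout_s: float,
                      poll_interval_s: float = 0.05) -> bool:
    """Poll until completed_children reaches `expected` or the deadline passes.

    Returns True on completion and False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_completed() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

The function occupies its caller for the whole wait, which is the slot-holding cost the text attributes to the conquer role.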
role=leaf
    input : {indices, cond_prefix, bucket, s3_key, ...}
    local_pop = [pop[i] for i in indices]
    model = Snake(local_pop, target_index, n_layers=1,
                  bucket=user_bucket, noise=0,
                  datatypes=GLOBAL,    # <-- enforced
                  oppose_profile=GLOBAL,
                  workers=1)
    # Snake's build_bucket_chain emits multiple buckets
    # with NATIVE per-bucket conditions (IF/ELIF/ELSE).
    for b in model.layers[0]:
        new_cond = cond_prefix + (b.condition or [])
        members  = [indices[m] for m in b.members]
        emit({condition: new_cond, clauses: b.clauses,
              members, lookalikes: b.lookalikes,
              origins: b.origins})
Position-keyed lookalikes from Snake ("0", "1", …) ride along
unchanged because they index into members[]. members is
remapped to global indices; lookalike keys stay positional and resolve correctly
on the gatherer side after assembly.
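The remap can be illustrated with plain dicts. This sketch assumes a bucket is a dict with `members` and `lookalikes` keys, and treats lookalike values as opaque, since their shape is not specified here; `global_row_for_key` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def remap_bucket(indices: List[int], bucket: Dict[str, Any]) -> Dict[str, Any]:
    """Translate leaf-local member positions into global row indices.

    Lookalike keys are left untouched: key "k" still means members[k].
    """
    remapped = dict(bucket)
    remapped["members"] = [indices[m] for m in bucket["members"]]
    return remapped


def global_row_for_key(bucket: Dict[str, Any], key: str) -> int:
    """Resolve a positional lookalike key to a global row index."""
    return bucket["members"][int(key)]


leaf_indices = [40, 41, 42, 43]           # global rows handed to this leaf
local_bucket = {
    "members": [2, 0],                    # positions inside leaf_indices
    "lookalikes": {"0": "opaque-a", "1": "opaque-b"},
}
global_bucket = remap_bucket(leaf_indices, local_bucket)
```

Because only the contents of `members` change and not its order or length, every positional key keeps pointing at the same logical row.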
The gatherer:
- target_index is column 0.
- Detects datatypes (N numeric or T text) and enforces them for the whole job; leaves never re-detect.
- Casts values: N→float, T→str.
- Builds pkg = {population, targets, header, datatypes, target_index, oppose_profile}.
- Stages it at s3://snake-batch-monce/jobs/{jid}/pop.json.gz.

Workers cache the package in /tmp on first read, then in process memory
(_pop_cache). Cold leaves pay the S3 read once per warm container; warm
leaves are zero-cost.
SQS drainer thread (4 receivers):
    loop:
        msgs = ReceiveMessage(v9-reports, max=10, wait=2s)
        for m in msgs:
            body = json.loads(m.Body)
            ingest_report(body)    # appends buckets to job["leaves_by_layer"]
        delete_message_batch(...)

ingest_report(body):
    layer = body["layer"]
    job["leaves_by_layer"][layer] += body["buckets"]
    job["covered_by_layer"][layer] |= {m for b in body["buckets"]
                                         for m in b["members"]}
    if all(len(job["covered_by_layer"][l]) == n_rows for l in range(n_layers)):
        assemble_model(job)

assemble_model(job):
    layers = [job["leaves_by_layer"][l] for l in range(n_layers)]
    model.json = {version, header, datatypes, target,
                  targets, population, n_layers, bucket, noise: 0,
                  oppose_profile, layers}
    job["status"] = "done"
The gatherer never builds clauses or runs Snake. It is a pure assembler.
The model JSON conforms exactly to algorithmeai.Snake.to_json() shape.
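The coverage rule that triggers assembly can be exercised in isolation. A minimal sketch, assuming the job is a plain dict; `new_job` and the `"running"` status value are hypothetical, while the `leaves_by_layer` and `covered_by_layer` keys follow the pseudocode above.

```python
from typing import Any, Dict


def new_job(n_rows: int, n_layers: int) -> Dict[str, Any]:
    """Create empty per-layer accumulators for one training job."""
    return {
        "n_rows": n_rows,
        "n_layers": n_layers,
        "leaves_by_layer": [[] for _ in range(n_layers)],
        "covered_by_layer": [set() for _ in range(n_layers)],
        "status": "running",
    }


def ingest_report(job: Dict[str, Any], body: Dict[str, Any]) -> bool:
    """Record one leaf report; return True once every layer covers every row."""
    layer = body["layer"]
    job["leaves_by_layer"][layer].extend(body["buckets"])
    for bucket in body["buckets"]:
        job["covered_by_layer"][layer].update(bucket["members"])

    complete = all(len(covered) == job["n_rows"]
                   for covered in job["covered_by_layer"])
    if complete:
        job["status"] = "done"
    return complete
```

Tracking coverage as a set of row indices makes completion depend on which rows have been seen, not on how many reports arrived.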
| Method | Path | Purpose |
|---|---|---|
| POST | /v9/train | Train; returns model_id + status_url + model_url |
| GET | /v9/status/{id} | Job progress, leaves_per_layer, coverage_per_layer |
| GET | /v9/model/{id} | Final assembled model JSON (algorithmeai-shape) |
| GET | /grid/v9 | Live 10x10 grid showing per-shard activity |
| GET | /paper | This series — the paper |
| GET | /architecture | You are here |
| GET | /economics | Cost model |
| GET | /math | Equations only |
| Property | How v9 enforces it |
|---|---|
| Per-shard isolation | 100 fns. A wedged job hits its own shard's wall, not the account-wide L-A1AFA3CF budget. Throughput still gated by L-A1AFA3CF (see §8.1). |
| Population is staged once | S3 + /tmp cache. Warm leaves are zero-cost on the data side. |
| Datatypes enforced globally | orchestrator detects, propagates, leaves never re-detect. |
| Position-keyed lookalikes | Snake emits "0"..."n-1"; members[] remap is the only translation. |
| Unique complete-AND per bucket | cond_prefix + Snake.condition; one leaf per route. |
| Tautology / 1-class slices | Snake handles natively — no shim, no target-name keys. |
The SDK is local-first. monceai.Snake wraps
algorithmeai.Snake: training fans out to the mesh; inference runs
in-process on a stripped model unless the batch is too big or the mode needs
the population.
- monceai.Snake mirrors algorithmeai.Snake.__init__ exactly: 13 positional/keyword args identical (target_index, n_layers, bucket, noise, vocal, workers, oppose_profile, lookahead, datatypes, …). v9 extras (endpoint, cloud_threshold, tau, model_id) are keyword-only, after a * separator.
- m.targets, m.header, m.layers, m.oppose(A, B), m.apply_clause(X, c): every algorithmeai method is reachable on a cloud-trained instance with no drift to maintain.
- model.json is the exact shape algorithmeai.Snake.to_json() emits. So algorithmeai.Snake(cloud.to_algorithmeai()) reloads it verbatim: train on the cloud, deploy fully offline, zero compatibility shim.
- Snake(rows) → POST /v9/train; Snake("model.json") with "layers" → POST /v9/upload (S3-stages an externally trained model so cloud predict-at-scale works against it); Snake("v9-…") → reconnect.
- Snake.health() hits GET /v9/health, returns {sdk_version, backend_version, compatible, predict_tau, n_shards}. Major-version mismatch flags compatible: false, so it fails loud at startup, not on first predict.

Training is intrinsically distributed: the v9 tree fans out hundreds of leaf Lambdas in parallel, finishing in ~5s for 5K rows where a single-threaded algorithmeai would take similar time but use one CPU. Once the model is assembled, prediction is sub-millisecond per row in algorithmeai; the HTTPS round trip dwarfs the compute. Cloud predict only wins on parallelism above the dispatch threshold (see /math §10), or when the population dictionary is required (audit, lookalikes, augmented, candle).
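The local-versus-cloud choice for prediction reduces to two checks. The sketch below is an illustration, not the SDK's code: the function name, the `"predict"` default mode, and the default threshold value are all assumptions; only the four population-dependent modes and the `cloud_threshold` parameter name come from the text.

```python
# Modes the text lists as needing the population dictionary.
POPULATION_MODES = frozenset({"audit", "lookalikes", "augmented", "candle"})


def choose_predict_path(batch_size: int,
                        mode: str = "predict",
                        cloud_threshold: int = 10_000) -> str:
    """Return "cloud" or "local" for one prediction call.

    cloud_threshold's default here is a placeholder, not the real value.
    """
    if mode in POPULATION_MODES:
        return "cloud"          # stripped local model has no population
    if batch_size > cloud_threshold:
        return "cloud"          # large batch: parallelism beats the round trip
    return "local"              # small batch: in-process is fastest
```

For example, `choose_predict_path(50)` stays local, while `choose_predict_path(50, mode="audit")` goes to the cloud regardless of size.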
v6 was a single recursive Lambda function, capped by AWS at 1000 reserved concurrency for the whole training job. At 150K rows, peak concurrent invocations exceeded the cap and tail latency exploded. The May 18, 2026 incident burned $1057 in 19 hours when v6 retried into its own throttle under AWS's tightened scaling-rate ceiling.
v9's 100 functions isolate recursion: peak concurrent invocations
spread by hash(model_id, path) % 100, so each shard sees ~1% of
the load and a misbehaving job hits its own shard's cap rather than draining
the account. The mesh is an isolation/availability mechanism, not a
throughput multiplier.
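A sketch of the shard selection, assuming a stable digest is used for `hash(model_id, path)`. The actual hash function is not specified here; SHA-256 is a stand-in, chosen because Python's built-in `hash()` on strings is salted per process and would route the same node to different shards on different workers.

```python
import hashlib


def shard_for(model_id: str, path: str, n_shards: int = 100) -> int:
    """Map a tree node to a shard index in [0, n_shards)."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    key = f"{model_id}\x00{path}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_shards
```

Because the path is part of the key, nodes of a single job scatter across shards, so one job's recursion does not concentrate on one function.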
Earlier v9 docs claimed "100 shards × 1000 reserved = 100,000 aggregate concurrent." That number is wrong. Verified against AWS service-quotas on 2026-05-21:
| Lambda quota (eu-west-3) | Value | Adjustable? |
|---|---|---|
| L-B99A9384 Concurrent executions | 20,000 | yes (via support) |
| L-A1AFA3CF Concurrency scaling rate | 1,000 / minute | NO |
| UnreservedConcurrentExecutions | 17,859 | derived |
The scaling rate (L-A1AFA3CF) applies at the account level per region, not per function. AWS does not raise it. So 100 shards do not stack their 1000 caps into 100K — they all draw from the same 1000/min budget for warming new on-demand containers.
What sharding does buy:
What sharding does not buy: throughput. Empirically observed 2026-05-21, the L-A1AFA3CF ceiling is binary, not soft. A 1-minute window at peak rate > 16.67/s causes AWS to reject the excess and the divide tree's parent-poll waits never resolve — the run wedges. At τ=1000, the break is at N≈7.5K×25L (peak 17.40/s observed), and beyond it the only mechanism that keeps v9 feasible is provisioned-concurrency bypass:
    FEASIBLE(N, L, τ, N_slots) ≡
        L · ⌈N/τ⌉ / spawn_window ≤ 16.67 + N_slots / t_leaf
    ⇒ N_slots(N, L) = O(L · N / log N) at fixed τ=1000
So 1M×5L needs 81 slots toggled on for 6.5 minutes ($1.75 total); 1M×25L needs 449 slots for 33 minutes ($22.81). See /paper §7 for the derivation and /concurrency §4 for the live spawn-rate panel that catches a wedge in flight.
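Rearranging the feasibility inequality for N_slots gives N_slots ≥ t_leaf · (L · ⌈N/τ⌉ / spawn_window − 16.67), floored at zero, where 16.67/s is the 1000/min quota divided by 60. The sketch below evaluates that bound. `spawn_window_s` and `t_leaf_s` are not given numerically in this section, so they are left as required inputs; the figures in the usage lines are illustrative placeholders, not measurements.

```python
import math

ON_DEMAND_RATE_PER_S = 1000 / 60   # L-A1AFA3CF: 1,000 per minute ≈ 16.67/s


def peak_spawn_rate(n_rows: int, n_layers: int, tau: int,
                    spawn_window_s: float) -> float:
    """Leaf invocations per second demanded over the spawn window."""
    return n_layers * math.ceil(n_rows / tau) / spawn_window_s


def slots_needed(n_rows: int, n_layers: int, tau: int,
                 spawn_window_s: float, t_leaf_s: float) -> int:
    """Smallest N_slots satisfying the feasibility inequality."""
    excess = peak_spawn_rate(n_rows, n_layers, tau, spawn_window_s) \
        - ON_DEMAND_RATE_PER_S
    return max(0, math.ceil(excess * t_leaf_s))


# Placeholder inputs: 100K rows, 10 layers, 10 s window, 2 s per leaf.
demo_rate = peak_spawn_rate(100_000, 10, 1000, spawn_window_s=10.0)
demo_slots = slots_needed(100_000, 10, 1000, spawn_window_s=10.0, t_leaf_s=2.0)
```

A result of zero means the job fits inside the on-demand scaling rate and needs no provisioned slots.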
The serverless promise is pay only for what you use. The reality for compute-shaped workloads (trees of short-lived invokes that need real parallelism) has four teeth that don't show up on the price sheet:
| What AWS markets | What actually bills you |
|---|---|
| "Run code without thinking about servers" | EC2 gatherer required for SQS draining + assembly — $98/month before any Lambda fires |
| "Massive parallelism, scale to zero" | Account-wide 1000/min spawn ceiling that 100 shards cannot bypass |
| "$0.0000167/GB-s, sub-cent per invoke" | Cold-start chains at depth-3 bill 3× the useful compute as parent-poll slot-wait (measured: 1.71× parallelism on a 5-shard fan) |
| "Provisioned concurrency for predictable latency" | 14× the on-demand compute rate, paid 24/7 unless toggled off — a leaked toggle on 100 shards is $168/day |
v9's design absorbs the first three: gatherer is a known fixed cost, sharding controls blast-radius, the toggle hands the fourth tooth back to the user as a deliberate session primitive. See /economics §5 for the three-bucket cost breakdown and /concurrency for the live spawn-rate budget.
© 2026 Charles Dana · Monce SAS · SnakeBatch v9 · /paper · /economics · /math