SnakeBatch v9: Architecture

Charles Dana · Monce SAS · May 2026

snakebatch.aws.monce.ai · /paper · /economics · /concurrency · /math

1. System Diagram

Single binary, three roles, dispatched by event["role"]. The conquer role is the only blocking node — it parent-polls DDB until the leaf below it acks. This is also the chain bottleneck observed in practice (see /economics §5).
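The dispatch described above can be sketched as follows. This is a minimal illustration, not the v9 entry point: the handler names (`run_divide`, `run_conquer`, `run_leaf`) and their return values are hypothetical placeholders.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]

# Hypothetical role bodies: each would do the work described in section 2.
def run_divide(event: Event) -> str:
    return "divide"

def run_conquer(event: Event) -> str:
    return "conquer"

def run_leaf(event: Event) -> str:
    return "leaf"

ROLES: Dict[str, Callable[[Event], str]] = {
    "divide": run_divide,
    "conquer": run_conquer,
    "leaf": run_leaf,
}

def handler(event: Event, context: Any = None) -> str:
    """Single entry point: route on event["role"], fail loudly on anything else."""
    role = event.get("role")
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return ROLES[role](event)
```

Keeping the table explicit means an unknown or missing role raises immediately instead of falling through to a default branch.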

2. Three Roles, One Binary

2.1 `role=divide`

input : {indices, cond_prefix, tau, bucket, s3_key, ...}

if len(indices) ≤ tau:
    fire ONE conquer with indices, cond_prefix, bucket
    return

oppose(A, B) on the slice's targets → literal lit
left  = [i for i in indices if  apply_literal(pop[i], lit)]
right = [i for i in indices if not apply_literal(pop[i], lit)]

fire divide(left,  cond_prefix + [lit_pos])
fire divide(right, cond_prefix + [lit_neg])
return                                                  # <100ms

Divide nodes are cheap. They never block on children. Spot-concurrency holds because each divide occupies a Lambda slot for a few hundred milliseconds, not the full subtree wall time.
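To see the shape of the recursion, here is a small in-process simulation. It is a sketch under stated assumptions: `pick_literal` is a hypothetical stand-in for `oppose(A, B)` (it splits column 0 at the slice mean), the literal format is invented for the example, and the guard against an empty side is an addition of this sketch. The real divide fires asynchronous Lambda invocations rather than recursing in one process.

```python
from typing import List, Tuple

Literal = Tuple[int, float]  # (column, threshold): hypothetical literal form

def pick_literal(pop, indices) -> Literal:
    # Stand-in for oppose(A, B): split column 0 at the mean of the slice.
    vals = [pop[i][0] for i in indices]
    return (0, sum(vals) / len(vals))

def apply_literal(row, lit: Literal) -> bool:
    col, threshold = lit
    return row[col] <= threshold

def divide(pop, indices: List[int], cond_prefix: list, tau: int, conquered: list) -> None:
    if len(indices) <= tau:
        conquered.append((indices, cond_prefix))  # would fire ONE conquer
        return
    lit = pick_literal(pop, indices)
    left = [i for i in indices if apply_literal(pop[i], lit)]
    right = [i for i in indices if not apply_literal(pop[i], lit)]
    if not left or not right:
        # Degenerate split (assumption of this sketch): stop dividing here.
        conquered.append((indices, cond_prefix))
        return
    divide(pop, left, cond_prefix + [("pos", lit)], tau, conquered)
    divide(pop, right, cond_prefix + [("neg", lit)], tau, conquered)

pop = [[float(v)] for v in range(10)]
slices: list = []
divide(pop, list(range(10)), [], 3, slices)
```

With ten rows and tau=3, the simulation ends with four slices that partition the indices, each carrying the condition prefix accumulated on its path from the root.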

2.2 `role=conquer`

input : {indices, cond_prefix, bucket, s3_key, ...}

create DDB parent counter (expected=1)
fire ONE leaf with the full indices, bucket, cond_prefix
poll DDB until completed_children == 1 (or timeout)
emit None.

This is the only blocking node. It exists so the divide tree has a clear "all leaves below me have ack'd" semantic. We can later replace this with queue-as-truth completion without changing the leaf or divide.
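The parent-poll can be sketched as a bounded wait loop. The DDB read is abstracted behind a caller-supplied `read_completed` callable, and the timeout and interval defaults are hypothetical values, not v9's settings.

```python
import time
from typing import Callable

def wait_for_children(
    read_completed: Callable[[], int],
    expected: int = 1,
    timeout_s: float = 60.0,       # hypothetical default
    interval_s: float = 0.25,      # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a completion counter until it reaches `expected` or the deadline passes.

    Returns True when the children acked in time, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if read_completed() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. The counter is always read at least once, so a zero timeout still reports an already-complete child.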

2.3 `role=leaf`

input : {indices, cond_prefix, bucket, s3_key, ...}

local_pop = [pop[i] for i in indices]
model = Snake(local_pop, target_index, n_layers=1,
              bucket=user_bucket, noise=0,
              datatypes=GLOBAL,                 # <-- enforced
              oppose_profile=GLOBAL,
              workers=1)

# Snake's build_bucket_chain emits multiple buckets
# with NATIVE per-bucket conditions (IF/ELIF/ELSE).
for b in model.layers[0]:
    new_cond  = cond_prefix + (b.condition or [])
    members   = [indices[m] for m in b.members]
    emit({condition: new_cond, clauses: b.clauses,
          members, lookalikes: b.lookalikes,
          origins: b.origins})

Position-keyed lookalikes from Snake ("0", "1", …) ride along unchanged because they index into members[]. members is remapped to global indices; lookalike keys stay positional and resolve correctly on the gatherer side after assembly.
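A small sketch of the remap, assuming a bucket is a plain dict with the fields named in the pseudocode above. The lookalike values are treated as opaque placeholders here, since their inner format is not specified in this section.

```python
def remap_bucket(bucket: dict, indices: list, cond_prefix: list) -> dict:
    """Translate a leaf-local bucket into a report entry with global members."""
    return {
        "condition": cond_prefix + (bucket.get("condition") or []),
        "clauses": bucket["clauses"],
        # Local positions in the leaf slice -> global row indices.
        "members": [indices[m] for m in bucket["members"]],
        # Keys stay positional: "0" still means members[0] after the remap.
        "lookalikes": bucket["lookalikes"],
        "origins": bucket["origins"],
    }

indices = [40, 41, 97]                      # global rows handled by this leaf
local = {
    "condition": None,
    "clauses": [],
    "members": [2, 0],                      # positions within `indices`
    "lookalikes": {"0": ["a"], "1": ["b"]},  # opaque placeholder values
    "origins": [],
}
out = remap_bucket(local, indices, ["lit_a"])

# Resolving a positional key to its global row on the gatherer side.
by_global_row = {out["members"][int(k)]: v for k, v in out["lookalikes"].items()}
```

Only `members` changes. Because the lookalike keys index into `members`, they resolve to global rows without being rewritten.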

3. Population Staging

The gatherer:

Reorders columns so target_index is column 0.
Detects per-column datatype (N numeric or T text), enforces it for the whole job — leaves never re-detect.
Coerces values: N→float, T→str.
Builds pkg = {population, targets, header, datatypes, target_index, oppose_profile}.
Gzips + uploads to S3: s3://snake-batch-monce/jobs/{jid}/pop.json.gz.

Workers cache the package in /tmp on first read, then in process memory (_pop_cache). A cold container pays the S3 read once; every later leaf that lands on that warm container reads from cache and is zero-cost on the data side.

Why datatypes are enforced globally: a divide split can yield a slice where every value of column k looks numeric to the local sniffer, even though the global column is text. If the leaf re-detects, types diverge across leaves and the assembled chain becomes incoherent. v9 propagates the global datatypes vector everywhere.
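The failure mode is easy to reproduce with a toy sniffer. The `detect_type` function below is a hypothetical detector written for this example (a column is numeric only if every value parses as a float); it is not the detector v9 uses.

```python
def detect_type(values) -> str:
    """Return 'N' if every value parses as a float, else 'T'."""
    for v in values:
        try:
            float(v)
        except (TypeError, ValueError):
            return "T"
    return "N"

def coerce(value, dtype: str):
    """Apply the staging rule: N -> float, T -> str."""
    return float(value) if dtype == "N" else str(value)

column = ["12", "7", "A3", "40"]     # globally text: "A3" is not a number

global_type = detect_type(column)     # decided once, on the full column
local_type = detect_type(column[:2])  # what a leaf would see on its slice

# Enforcing the global type keeps every leaf on the same representation.
slice_values = [coerce(v, global_type) for v in column[:2]]
```

Here the full column is text, but the two-row slice looks numeric. A leaf that re-detected would coerce "12" to 12.0 while its sibling kept "A3" as a string.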

4. Assembly — Gatherer Side

SQS drainer thread (4 receivers):
  loop:
    msgs = ReceiveMessage(v9-reports, max=10, wait=2s)
    for m in msgs:
        body = json.loads(m.Body)
        ingest_report(body)        # appends buckets to job["leaves_by_layer"]
    delete_message_batch(...)      # one batched delete per receive, after ingest

ingest_report(body):
    layer = body["layer"]
    job["leaves_by_layer"][layer] += body["buckets"]
    job["covered_by_layer"][layer] |= {m for b in body["buckets"]
                                        for m in b["members"]}
    if all(len(job["covered_by_layer"][l]) == n_rows for l in range(n_layers)):
        assemble_model(job)

assemble_model(job):
    layers = [job["leaves_by_layer"][l] for l in range(n_layers)]
    model.json = {version, header, datatypes, target,
                  targets, population, n_layers, bucket, noise: 0,
                  oppose_profile, layers}
    job["status"] = "done"

The gatherer never builds clauses or runs Snake. It is a pure assembler. The model JSON conforms exactly to the shape emitted by algorithmeai.Snake.to_json().
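The coverage test that triggers assembly can be sketched as a runnable function. The job dict layout follows the pseudocode above; `new_job` and the string status values are assumptions of this sketch, and assembly itself is reduced to flipping the status.

```python
def new_job(n_rows: int, n_layers: int) -> dict:
    return {
        "n_rows": n_rows,
        "n_layers": n_layers,
        "status": "running",
        "leaves_by_layer": {l: [] for l in range(n_layers)},
        "covered_by_layer": {l: set() for l in range(n_layers)},
    }

def ingest_report(job: dict, body: dict) -> str:
    """Record one leaf report; mark the job done once every layer covers all rows."""
    layer = body["layer"]
    job["leaves_by_layer"][layer].extend(body["buckets"])
    for bucket in body["buckets"]:
        job["covered_by_layer"][layer].update(bucket["members"])
    fully_covered = all(
        len(job["covered_by_layer"][l]) == job["n_rows"]
        for l in range(job["n_layers"])
    )
    if fully_covered:
        job["status"] = "done"  # the real gatherer would assemble the model here
    return job["status"]
```

Tracking coverage as a set of row indices means a repeated member does not inflate the count. The sketch does not deduplicate the bucket list itself, so a redelivered report would still append its buckets twice.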

5. Routes

Method	Path	Purpose
POST	`/v9/train`	Train; returns `model_id` + status_url + model_url
GET	`/v9/status/{id}`	Job progress, leaves_per_layer, coverage_per_layer
GET	`/v9/model/{id}`	Final assembled model JSON (algorithmeai-shape)
GET	`/grid/v9`	Live 10x10 grid showing per-shard activity
GET	`/paper`	This series — the paper
GET	`/architecture`	You are here
GET	`/economics`	Cost model
GET	`/math`	Equations only

6. Reliability Properties

Property	How v9 enforces it
Per-shard isolation	100 fns. A wedged job hits its own shard's wall, not the account-wide L-A1AFA3CF budget. Throughput still gated by L-A1AFA3CF (see §8.1).
Population is staged once	S3 + /tmp cache. Warm leaves are zero-cost on the data side.
Datatypes enforced globally	The gatherer detects and propagates; leaves never re-detect.
Position-keyed lookalikes	Snake emits `"0"..."n-1"`; `members[]` remap is the only translation.
Unique complete-AND per bucket	cond_prefix + Snake.condition; one leaf per route.
Tautology / 1-class slices	Snake handles natively — no shim, no target-name keys.

7. monceai SDK ↔ Mesh Dialogue

The SDK is local-first. monceai.Snake wraps algorithmeai.Snake: training fans out to the mesh; inference runs in-process on a stripped model unless the batch is too big or the mode needs the population.

7.1 algorithmeai synergies

Drop-in constructor. monceai.Snake mirrors algorithmeai.Snake.__init__ exactly — 13 positional/keyword args identical (target_index, n_layers, bucket, noise, vocal, workers, oppose_profile, lookahead, datatypes, …). v9 extras (endpoint, cloud_threshold, tau, model_id) are keyword-only, after a * separator.
__getattr__ forwarding. Unknown attributes lazy-load the stripped local Snake and forward through. m.targets, m.header, m.layers, m.oppose(A, B), m.apply_clause(X, c) — every algorithmeai method is reachable on a cloud-trained instance with no drift to maintain.
Byte-equivalent JSON. The mesh's assembled model.json is the exact shape algorithmeai.Snake.to_json() emits. So algorithmeai.Snake(cloud.to_algorithmeai()) reloads it verbatim — train on the cloud, deploy fully offline, zero compatibility shim.
Three constructor paths. Snake(rows) → POST /v9/train; Snake("model.json") with "layers" → POST /v9/upload (S3-stages an externally trained model so cloud predict-at-scale works against it); Snake("v9-…") → reconnect.
Compatibility check. Snake.health() hits GET /v9/health, returns {sdk_version, backend_version, compatible, predict_tau, n_shards}. Major-version mismatch flags compatible: false — fail loud at startup, not on first predict.
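The major-version rule in the compatibility check can be illustrated as below. This is a sketch, assuming dotted version strings such as "9.1.0"; the actual version format and the SDK's internal check are not specified here, and `require_compatible` is a hypothetical helper name.

```python
def is_compatible(sdk_version: str, backend_version: str) -> bool:
    """Compatible when the leading (major) version component matches."""
    return sdk_version.split(".")[0] == backend_version.split(".")[0]

def require_compatible(health: dict) -> None:
    """Fail loudly at startup if the health payload reports an incompatibility."""
    if not health.get("compatible", False):
        raise RuntimeError(
            "SDK {sdk_version} is not compatible with backend {backend_version}".format(
                sdk_version=health.get("sdk_version"),
                backend_version=health.get("backend_version"),
            )
        )

# Example payload shaped like the fields listed above (values are made up).
health = {
    "sdk_version": "9.1.0",
    "backend_version": "9.4.2",
    "compatible": is_compatible("9.1.0", "9.4.2"),
}
require_compatible(health)  # passes: both majors are 9
```

Running the check once at startup turns a version skew into an immediate error rather than a confusing failure on the first predict call.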

7.2 Why cloud-train + local-predict is the default

Training is intrinsically distributed: the v9 tree fans out hundreds of leaf Lambdas in parallel and finishes in ~5s for 5K rows, where single-threaded algorithmeai takes a similar time on one CPU. Once the model is assembled, prediction is sub-millisecond per row in algorithmeai, so the HTTPS round trip dwarfs the compute. Cloud predict only wins on parallelism above the dispatch threshold (see /math §10), or when the population dictionary is required (audit, lookalikes, augmented, candle).

8. Why a Mesh and Not a Tree of Functions

v6 was a single recursive Lambda function, capped by AWS at 1000 reserved concurrency for the whole training job. At 150K rows, peak concurrent invocations exceeded the cap and tail latency exploded. The May 18, 2026 incident burned $1057 in 19 hours when v6 retried into its own throttle under AWS's tightened scaling-rate ceiling.

v9's 100 functions isolate recursion: peak concurrent invocations spread by hash(model_id, path) % 100, so each shard sees ~1% of the load and a misbehaving job hits its own shard's cap rather than draining the account. The mesh is an isolation/availability mechanism, not a throughput multiplier.
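Shard selection by hash(model_id, path) % 100 might look like the sketch below. Python's built-in `hash()` on strings is salted per process, so a router that must agree across processes needs a stable digest; SHA-256 is an assumption of this sketch, as is encoding the tree path as a string of "L"/"R" steps.

```python
import hashlib

N_SHARDS = 100

def shard_for(model_id: str, path: str, n_shards: int = N_SHARDS) -> int:
    """Map (model_id, path) to a shard index that is stable across processes."""
    digest = hashlib.sha256(f"{model_id}|{path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

# Sibling subtrees of one job land on shards chosen independently by the hash.
left_shard = shard_for("v9-demo", "L")
right_shard = shard_for("v9-demo", "R")
```

Because the path is part of the key, a single large job spreads its nodes across shards instead of pinning the whole tree to one function.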

8.1 The honest ceiling

Earlier v9 docs claimed "100 shards × 1000 reserved = 100,000 aggregate concurrent." That number is wrong. Verified against AWS service-quotas on 2026-05-21:

Lambda quota (eu-west-3)	Value	Adjustable?
L-B99A9384 Concurrent executions	20,000	yes (via support)
L-A1AFA3CF Concurrency scaling rate	1,000 / minute	NO
UnreservedConcurrentExecutions	17,859	derived

The scaling rate (L-A1AFA3CF) applies at the account level per region, not per function. AWS does not raise it. So 100 shards do not stack their 1000 caps into 100K — they all draw from the same 1000/min budget for warming new on-demand containers.

What sharding does buy:

Blast-radius isolation. A runaway job throttles inside its own shard instead of the whole account. v6's incident is impossible to repeat at the same severity.
Warm-pool persistence. 100 functions keep more containers warm than 1 function for the same usage pattern, lowering cold-start tail at low load.
Independent rollouts. Per-shard reserved concurrency, per-shard versioning, per-shard provisioning — we can warm 32 shards without warming all 100.

What sharding does not buy: throughput. Empirically observed 2026-05-21, the L-A1AFA3CF ceiling is binary, not soft. A 1-minute window at peak rate > 16.67/s causes AWS to reject the excess and the divide tree's parent-poll waits never resolve — the run wedges. At τ=1000, the break is at N≈7.5K×25L (peak 17.40/s observed), and beyond it the only mechanism that keeps v9 feasible is provisioned-concurrency bypass:

FEASIBLE(N, L, τ, N_slots) ≡
    L · ⌈N/τ⌉ / spawn_window  ≤  16.67 + N_slots / t_leaf

  ⇒  N_slots(N, L) = O(L · N / log N)     at fixed τ=1000

So 1M×5L needs 81 slots toggled on for 6.5 minutes ($1.75 total); 1M×25L needs 449 slots for 33 minutes ($22.81). See /paper §7 for the derivation and /concurrency §4 for the live spawn-rate panel that catches a wedge in flight.
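The feasibility inequality can be evaluated directly. The sketch below takes `spawn_window_s` and `t_leaf_s` as inputs because this section does not state their values; the numbers in the example are hypothetical and do not reproduce the 81-slot or 449-slot figures, which come from the derivation in /paper §7.

```python
import math

SCALING_RATE_PER_S = 1000 / 60  # L-A1AFA3CF: 1,000 per minute, about 16.67/s

def spawn_rate(n_rows: int, n_layers: int, tau: int, spawn_window_s: float) -> float:
    """Left-hand side: L * ceil(N / tau) / spawn_window."""
    return n_layers * math.ceil(n_rows / tau) / spawn_window_s

def feasible(n_rows, n_layers, tau, n_slots, spawn_window_s, t_leaf_s) -> bool:
    """FEASIBLE(N, L, tau, N_slots) as written above."""
    budget = SCALING_RATE_PER_S + n_slots / t_leaf_s
    return spawn_rate(n_rows, n_layers, tau, spawn_window_s) <= budget

def min_slots(n_rows, n_layers, tau, spawn_window_s, t_leaf_s) -> int:
    """Smallest N_slots that satisfies the inequality (0 if on-demand suffices)."""
    excess = spawn_rate(n_rows, n_layers, tau, spawn_window_s) - SCALING_RATE_PER_S
    return max(0, math.ceil(excess * t_leaf_s))

# Hypothetical inputs: 7.5K rows, 25 layers, tau=1000, 10 s window, 2 s leaves.
needed = min_slots(7_500, 25, 1_000, spawn_window_s=10.0, t_leaf_s=2.0)
```

With these made-up inputs the spawn rate is 20/s, which exceeds the 16.67/s budget, so the inequality only holds once a handful of provisioned slots are added.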

8.2 Lambda teases cheap and charges real

The serverless promise is pay only for what you use. The reality for compute-shaped workloads (trees of short-lived invokes that need real parallelism) has four teeth that don't show up on the price sheet:

What AWS markets	What actually bills you
"Run code without thinking about servers"	EC2 gatherer required for SQS draining + assembly — $98/month before any Lambda fires
"Massive parallelism, scale to zero"	Account-wide 1000/min spawn ceiling that 100 shards cannot bypass
"$0.0000167/GB-s, sub-cent per invoke"	Cold-start chains at depth-3 bill 3× the useful compute as parent-poll slot-wait (measured: 1.71× parallelism on a 5-shard fan)
"Provisioned concurrency for predictable latency"	14× the on-demand compute rate, paid 24/7 unless toggled off — a leaked toggle on 100 shards is $168/day

v9's design absorbs the first three: the gatherer is a known fixed cost, and sharding controls blast radius. The toggle hands the fourth tooth back to the user as a deliberate session primitive. See /economics §5 for the three-bucket cost breakdown and /concurrency for the live spawn-rate budget.