SnakeBatch v9: Architecture

Charles Dana · Monce SAS · May 2026

snakebatch.aws.monce.ai · /paper · /economics · /concurrency · /math

1. System Diagram

Client (monceai SDK / curl)
  │  HTTPS · POST /v9/train
  ▼
EC2 Gatherer · snakebatch-web · t4g.2xlarge · 8 vCPU / 32 GB
  app.py · v9_train.py · grid_v9.py · 4-thread SQS drainer
  stages pop.json.gz to S3 · fans 5 root divides via lambda.invoke (Event)
  │  hash(model_id, layer) % 100 → root shard
  ▼
v9-worker-pipe-{00..99} · 100 isolated Lambda functions
  4096 MB · 300 s timeout · x86 · Layer: algorithmeai-snake:5 (Snake v5.4.5 + Cython)
  single binary handler_snake.py · dispatched by event["role"]
    role=divide   recursive halve until n ≤ τ · ~80 ms · fires 2 child divides · non-blocking
    role=conquer  fan ONE leaf with all indices · ~150 ms · DDB parent_pk poll · blocking ← chain bottleneck
    role=leaf     Snake(local_pop, n_layers=1, bucket=b) · ~400 ms warm · emits literals + path → SQS report queue
  │  SQS SendMessage (one per leaf bucket)
  ▼
v9-reports SQS · single queue
  │
  ▼
gatherer._assemble_model → model.json (algorithmeai-shape) → S3

Single binary, three roles, dispatched by event["role"]. The conquer role is the only blocking node — it parent-polls DDB until the leaf below it acks. This is also the chain bottleneck observed in practice (see /economics §5).
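The single-binary dispatch can be pictured as a lookup table keyed by event["role"]. This is a minimal sketch, assuming a standard Lambda handler signature; the three run_* functions and their return values are hypothetical stubs, not the contents of handler_snake.py.

```python
def run_divide(event):
    # Hypothetical stub: a real divide would split indices and fire two children.
    return {"role": "divide"}


def run_conquer(event):
    # Hypothetical stub: a real conquer would fire one leaf and poll for its ack.
    return {"role": "conquer"}


def run_leaf(event):
    # Hypothetical stub: a real leaf would train Snake on its local slice.
    return {"role": "leaf"}


ROLE_HANDLERS = {
    "divide": run_divide,
    "conquer": run_conquer,
    "leaf": run_leaf,
}


def handler(event, context=None):
    """Route one invocation to the function matching event["role"]."""
    role = event.get("role")
    if role not in ROLE_HANDLERS:
        raise ValueError(f"unknown role: {role!r}")
    return ROLE_HANDLERS[role](event)
```

Rejecting unknown roles early keeps a malformed event from silently running the wrong branch.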

2. Three Roles, One Binary

2.1 role=divide

input : {indices, cond_prefix, tau, bucket, s3_key, ...}

if len(indices) ≤ tau:
    fire ONE conquer with indices, cond_prefix, bucket
    return

oppose(A, B) on the slice's targets → literal lit
left  = [i for i in indices if  apply_literal(pop[i], lit)]
right = [i for i in indices if not apply_literal(pop[i], lit)]

fire divide(left,  cond_prefix + [lit_pos])
fire divide(right, cond_prefix + [lit_neg])
return                                                  # <100ms

Divide nodes are cheap. They never block on children. Spot-concurrency holds because each divide occupies a Lambda slot for under 100 ms (~80 ms typical), not the full subtree wall time.
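The shape of the divide recursion can be simulated locally with a work queue standing in for the non-blocking Lambda fan-out. This is a rough sketch under stated assumptions: the population is a flat list of numbers, and a median threshold replaces Snake's oppose(A, B) literal, which is not reproduced here. The names simulate_divide_tree and the tuple form of the condition entries are hypothetical.

```python
from collections import deque


def simulate_divide_tree(pop, tau):
    """Return the (indices, cond_prefix) pairs that would reach a conquer node."""
    pending = deque([(list(range(len(pop))), [])])  # stand-in for fired invokes
    conquers = []
    while pending:
        indices, cond_prefix = pending.popleft()
        if len(indices) <= tau:
            conquers.append((indices, cond_prefix))
            continue
        # Assumption: a median split stands in for the oppose() literal.
        values = sorted(pop[i] for i in indices)
        threshold = values[len(values) // 2]
        left = [i for i in indices if pop[i] < threshold]
        right = [i for i in indices if not pop[i] < threshold]
        if not left or not right:
            # Degenerate split (all values equal): stop so the loop terminates.
            conquers.append((indices, cond_prefix))
            continue
        pending.append((left, cond_prefix + [("<", threshold)]))
        pending.append((right, cond_prefix + [(">=", threshold)]))
    return conquers
```

Each queue entry is handled independently of its children, which mirrors the property that a divide never waits on the subtree it spawned.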

2.2 role=conquer

input : {indices, cond_prefix, bucket, s3_key, ...}

create DDB parent counter (expected=1)
fire ONE leaf with the full indices, bucket, cond_prefix
poll DDB until completed_children == 1 (or timeout)
emit None.

This is the only blocking node. It exists so the divide tree has a clear "all leaves below me have ack'd" semantic. We can later replace this with queue-as-truth completion without changing the leaf or divide.
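The blocking behaviour of conquer amounts to a poll loop with a deadline. The sketch below uses an in-memory counter in place of the DDB parent item so that it runs without AWS; ParentCounter, fire_leaf, and the timeout and interval defaults are assumptions made for illustration.

```python
import threading
import time


class ParentCounter:
    """In-memory stand-in for the DDB parent counter (expected=1)."""

    def __init__(self, expected):
        self.expected = expected
        self.completed_children = 0
        self._lock = threading.Lock()

    def ack(self):
        with self._lock:
            self.completed_children += 1

    def done(self):
        with self._lock:
            return self.completed_children >= self.expected


def conquer(fire_leaf, timeout_s=5.0, poll_interval_s=0.01):
    """Fire one leaf, then block until it acks or the deadline passes."""
    counter = ParentCounter(expected=1)
    fire_leaf(counter)  # the leaf is expected to call counter.ack() when finished
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if counter.done():
            return True
        time.sleep(poll_interval_s)
    return counter.done()
```

The slot stays occupied for the whole loop, which is why this node shows up as the chain bottleneck.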

2.3 role=leaf

input : {indices, cond_prefix, bucket, s3_key, ...}

local_pop = [pop[i] for i in indices]
model = Snake(local_pop, target_index, n_layers=1,
              bucket=user_bucket, noise=0,
              datatypes=GLOBAL,                 # <-- enforced
              oppose_profile=GLOBAL,
              workers=1)

# Snake's build_bucket_chain emits multiple buckets
# with NATIVE per-bucket conditions (IF/ELIF/ELSE).
for b in model.layers[0]:
    new_cond  = cond_prefix + (b.condition or [])
    members   = [indices[m] for m in b.members]
    emit({condition: new_cond, clauses: b.clauses,
          members, lookalikes: b.lookalikes,
          origins: b.origins})

Position-keyed lookalikes from Snake ("0", "1", …) ride along unchanged because they index into members[]. members is remapped to global indices; lookalike keys stay positional and resolve correctly on the gatherer side after assembly.
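A small worked example may help show why only members needs translating. In this sketch the lookalike payloads are treated as opaque values, since their structure is not described here; remap_members and lookalikes_by_global_row are hypothetical helper names.

```python
def remap_members(indices, local_members):
    """Translate leaf-local row positions into global population indices."""
    return [indices[m] for m in local_members]


def lookalikes_by_global_row(members, lookalikes):
    """Resolve positional keys ("0", "1", ...) through the remapped members list."""
    return {members[int(pos)]: payload for pos, payload in lookalikes.items()}


# The leaf received global rows 40, 41, 57, 90 and one bucket holds
# local rows 2, 0, 3 in that order.
indices = [40, 41, 57, 90]
members = remap_members(indices, [2, 0, 3])

# Keys index into members[], so they need no rewriting.
lookalikes = {"0": "payload-a", "1": "payload-b", "2": "payload-c"}
resolved = lookalikes_by_global_row(members, lookalikes)
```

Here members becomes [57, 40, 90], and key "0" resolves to global row 57 without the lookalike dictionary itself being touched.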

3. Population Staging

The gatherer:

  1. Reorders columns so target_index is column 0.
  2. Detects per-column datatype (N numeric or T text), enforces it for the whole job — leaves never re-detect.
  3. Coerces values: N→float, T→str.
  4. Builds pkg = {population, targets, header, datatypes, target_index, oppose_profile}.
  5. Gzips + uploads to S3: s3://snake-batch-monce/jobs/{jid}/pop.json.gz.
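The staging steps above can be sketched as a single function that returns the gzipped bytes ready for upload. Several details are assumptions: the sniffer treats a column as N when every value parses as a float, targets is taken to be the list of column-0 values after reordering, target_index is stored as 0 after the reorder, and oppose_profile is passed through untouched. The S3 upload itself is left out.

```python
import gzip
import json


def sniff_datatype(values):
    """Return "N" if every value parses as a float, else "T" (assumed rule)."""
    try:
        for v in values:
            float(v)
    except (TypeError, ValueError):
        return "T"
    return "N"


def stage_package(header, rows, target_index, oppose_profile=None):
    # 1. Reorder columns so the target is column 0.
    order = [target_index] + [i for i in range(len(header)) if i != target_index]
    header = [header[i] for i in order]
    rows = [[row[i] for i in order] for row in rows]
    # 2. Detect one datatype per column over the whole population.
    datatypes = [sniff_datatype([row[c] for row in rows]) for c in range(len(header))]
    # 3. Coerce values: N -> float, T -> str.
    population = [
        [float(v) if t == "N" else str(v) for v, t in zip(row, datatypes)]
        for row in rows
    ]
    # 4. Build the package (field meanings partly assumed, see lead-in).
    pkg = {
        "population": population,
        "targets": [row[0] for row in population],
        "header": header,
        "datatypes": datatypes,
        "target_index": 0,
        "oppose_profile": oppose_profile,
    }
    # 5. Gzip; the caller would upload these bytes to the job's S3 key.
    return gzip.compress(json.dumps(pkg).encode("utf-8"))
```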

Workers cache the package in /tmp on first read, then in process memory (_pop_cache). Cold leaves pay the S3 read once per warm container; warm leaves are zero-cost.

Why datatypes are enforced globally: a divide split can yield a slice where every value of column k looks numeric to the local sniffer, even though the global column is text. If the leaf re-detects, types diverge across leaves and the assembled chain becomes incoherent. v9 propagates the global datatypes vector everywhere.
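The divergence is easy to reproduce with a toy sniffer. The detection rule below (a column is N when every value parses as a float) is an assumption chosen for illustration, and the column values are made up.

```python
def sniff_datatype(values):
    """Return "N" if every value parses as a float, else "T" (assumed rule)."""
    try:
        for v in values:
            float(v)
    except (TypeError, ValueError):
        return "T"
    return "N"


# A text column of postal-style codes: two entries contain letters.
column = ["75001", "2A004", "13008", "2B100"]
global_type = sniff_datatype(column)

# A divide split hands one leaf only rows 0 and 2, which both look numeric.
slice_indices = [0, 2]
local_type = sniff_datatype([column[i] for i in slice_indices])
```

Here global_type is "T" while local_type is "N", so a leaf that re-detected would coerce the same column differently from its siblings.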

4. Assembly — Gatherer Side

SQS drainer thread (4 receivers):
  loop:
    msgs = ReceiveMessage(v9-reports, max=10, wait=2s)
    for m in msgs:
        body = json.loads(m.Body)
        ingest_report(body)        # appends buckets to job["leaves_by_layer"]
        delete_message_batch(...)

ingest_report(body):
    layer = body["layer"]
    job["leaves_by_layer"][layer] += body["buckets"]
    job["covered_by_layer"][layer] |= {m for b in body["buckets"]
                                        for m in b["members"]}
    if all(len(job["covered_by_layer"][l]) == n_rows for l in range(n_layers)):
        assemble_model(job)

assemble_model(job):
    layers = [job["leaves_by_layer"][l] for l in range(n_layers)]
    model.json = {version, header, datatypes, target,
                  targets, population, n_layers, bucket, noise: 0,
                  oppose_profile, layers}
    job["status"] = "done"

The gatherer never builds clauses or runs Snake. It is a pure assembler. The model JSON conforms exactly to algorithmeai.Snake.to_json() shape.
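The coverage-based completion test can be exercised without SQS. This is a sketch assuming each report carries a layer number and a list of buckets with a members field, as in the pseudocode above; new_job and the returned status string are hypothetical conveniences, and the final model assembly is reduced to a status flip.

```python
def new_job(n_rows, n_layers):
    return {
        "n_rows": n_rows,
        "n_layers": n_layers,
        "leaves_by_layer": {l: [] for l in range(n_layers)},
        "covered_by_layer": {l: set() for l in range(n_layers)},
        "status": "running",
    }


def ingest_report(job, body):
    """Record one leaf report and mark the job done once every row is covered."""
    layer = body["layer"]
    job["leaves_by_layer"][layer].extend(body["buckets"])
    for bucket in body["buckets"]:
        job["covered_by_layer"][layer].update(bucket["members"])
    fully_covered = all(
        len(job["covered_by_layer"][l]) == job["n_rows"]
        for l in range(job["n_layers"])
    )
    if fully_covered:
        job["status"] = "done"  # the real gatherer would assemble model.json here
    return job["status"]
```

Because coverage is tracked as a set, a repeated report does not inflate the count, although this sketch does not deduplicate the bucket lists themselves.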

5. Routes

| Method | Path | Purpose |
|---|---|---|
| POST | /v9/train | Train; returns model_id + status_url + model_url |
| GET | /v9/status/{id} | Job progress, leaves_per_layer, coverage_per_layer |
| GET | /v9/model/{id} | Final assembled model JSON (algorithmeai-shape) |
| GET | /grid/v9 | Live 10x10 grid showing per-shard activity |
| GET | /paper | This series: the paper |
| GET | /architecture | You are here |
| GET | /economics | Cost model |
| GET | /math | Equations only |

6. Reliability Properties

| Property | How v9 enforces it |
|---|---|
| Per-shard isolation | 100 fns. A wedged job hits its own shard's wall, not the account-wide L-A1AFA3CF budget. Throughput still gated by L-A1AFA3CF (see §8.1). |
| Population is staged once | S3 + /tmp cache. Warm leaves are zero-cost on the data side. |
| Datatypes enforced globally | Orchestrator detects, propagates, leaves never re-detect. |
| Position-keyed lookalikes | Snake emits "0"..."n-1"; members[] remap is the only translation. |
| Unique complete-AND per bucket | cond_prefix + Snake.condition; one leaf per route. |
| Tautology / 1-class slices | Snake handles natively: no shim, no target-name keys. |

7. monceai SDK ↔ Mesh Dialogue

The SDK is local-first. monceai.Snake wraps algorithmeai.Snake: training fans out to the mesh; inference runs in-process on a stripped model unless the batch is too big or the mode needs the population.

monceai.Snake (SDK)
  Snake(rows, target_index="label") · cloud_threshold = 5000
  │  POST /v9/train
  ▼
Gatherer (FastAPI · EC2)
  /v9/train · /v9/predict-sync · /v9/upload · /v9/health
  drainer: SQS reports → assembled model.json
  │  spawn leaves (hash mod 100)
  ▼
v9 mesh · 100 Lambda shards
  divide · conquer · leaf · predict_divide · predict_compute
  algorithmeai-snake:5 layer · 4GB · x86
  │
  ▼
model.json + model_stripped.json
  S3 · jobs/{model_id}/

Prediction paths:
  small / pure-predict → Local · algorithmeai.Snake
    GET /v9/model/{id}/stripped → in-process
    get_prediction · get_probability · get_regression
    sub-ms / row · $0
  n ≥ threshold OR population mode → Cloud · /v9/predict-sync
    predict_divide → predict_compute mesh
    get_audit · get_lookalikes · get_augmented · get_candle
    ~1s RTT + n·~1µs / row at saturation
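The local-versus-cloud choice described above reduces to a two-part test: does the method need the population, and is the batch at or above the threshold. The sketch below encodes that rule; choose_backend and the two method sets are hypothetical names, and the real SDK may decide differently in edge cases.

```python
LOCAL_METHODS = {"get_prediction", "get_probability", "get_regression"}
POPULATION_METHODS = {"get_audit", "get_lookalikes", "get_augmented", "get_candle"}


def choose_backend(method, n_rows, cloud_threshold=5000):
    """Return "local" or "cloud" for one prediction call."""
    if method in POPULATION_METHODS:
        return "cloud"  # population mode: the stripped model is not enough
    if method not in LOCAL_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    return "cloud" if n_rows >= cloud_threshold else "local"
```

For instance, choose_backend("get_prediction", 200) stays local, while choose_backend("get_audit", 1) goes to the cloud regardless of batch size.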

7.1 algorithmeai synergies

7.2 Why cloud-train + local-predict is the default

Training is intrinsically distributed: the v9 tree fans out 100s of leaf Lambdas in parallel, finishing in ~5s for 5K rows where a single-threaded algorithmeai would take similar time but use one CPU. Once the model is assembled, prediction is sub-millisecond per row in algorithmeai — the HTTPS round trip dwarfs the compute. Cloud predict only wins on parallelism above the dispatch threshold (see /math §10), or when the population dictionary is required (audit, lookalikes, augmented, candle).


8. Why a Mesh and Not a Tree of Functions

v6 was a single recursive Lambda function, capped by AWS at 1000 reserved concurrency for the whole training job. At 150K rows, peak concurrent invocations exceeded the cap and tail latency exploded. The May 18, 2026 incident burned $1057 in 19 hours when v6 retried into its own throttle under AWS's tightened scaling-rate ceiling.

v9's 100 functions isolate recursion: peak concurrent invocations spread by hash(model_id, path) % 100, so each shard sees ~1% of the load and a misbehaving job hits its own shard's cap rather than draining the account. The mesh is an isolation/availability mechanism, not a throughput multiplier.
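Shard selection can be sketched as a stable hash reduced modulo 100. The actual hash function is not specified here, so SHA-256 is an assumption; it is used because Python's built-in hash() is salted per process for strings and would route the same node to different shards on different hosts. The path encoding is likewise hypothetical.

```python
import hashlib


def shard_for(model_id, path, n_shards=100):
    """Map a (model_id, path) pair to a worker function name, deterministically."""
    key = f"{model_id}:{path}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    shard = int.from_bytes(digest[:8], "big") % n_shards
    return f"v9-worker-pipe-{shard:02d}"
```

Because the mapping depends only on its inputs, any node in the tree can compute where its children live without a lookup service.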

8.1 The honest ceiling

Earlier v9 docs claimed "100 shards × 1000 reserved = 100,000 aggregate concurrent." That number is wrong. Verified against AWS service-quotas on 2026-05-21:

| Lambda quota (eu-west-3) | Value | Adjustable? |
|---|---|---|
| L-B99A9384 Concurrent executions | 20,000 | yes (via support) |
| L-A1AFA3CF Concurrency scaling rate | 1,000 / minute | NO |
| UnreservedConcurrentExecutions | 17,859 | derived |

The scaling rate (L-A1AFA3CF) applies at the account level per region, not per function. AWS does not raise it. So 100 shards do not stack their 1000 caps into 100K — they all draw from the same 1000/min budget for warming new on-demand containers.

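What sharding does buy: isolation and availability. Load spreads by hash across the 100 functions, so a misbehaving job hits its own shard's cap rather than draining the account.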

What sharding does not buy: throughput. Empirically observed 2026-05-21, the L-A1AFA3CF ceiling is binary, not soft. A 1-minute window at peak rate > 16.67/s causes AWS to reject the excess and the divide tree's parent-poll waits never resolve — the run wedges. At τ=1000, the break is at N≈7.5K×25L (peak 17.40/s observed), and beyond it the only mechanism that keeps v9 feasible is provisioned-concurrency bypass:

FEASIBLE(N, L, τ, N_slots) ≡
    L · ⌈N/τ⌉ / spawn_window  ≤  16.67 + N_slots / t_leaf

  ⇒  N_slots(N, L) = O(L · N / log N)     at fixed τ=1000

So 1M×5L needs 81 slots toggled on for 6.5 minutes ($1.75 total); 1M×25L needs 449 slots for 33 minutes ($22.81). See /paper §7 for the derivation and /concurrency §4 for the live spawn-rate panel that catches a wedge in flight.
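The FEASIBLE predicate can be evaluated directly once spawn_window and t_leaf are fixed. The sketch below transcribes the inequality as written; the function names are hypothetical, and the numbers used in the usage note are illustrative inputs, not the measurements behind the slot counts quoted above.

```python
import math

ACCOUNT_SPAWN_RATE_PER_S = 1000 / 60  # L-A1AFA3CF: 1,000 per minute, about 16.67/s


def peak_spawn_rate(n_rows, n_layers, tau, spawn_window_s):
    """Left-hand side: L * ceil(N / tau) leaves spread over the spawn window."""
    return n_layers * math.ceil(n_rows / tau) / spawn_window_s


def feasible(n_rows, n_layers, tau, n_slots, spawn_window_s, t_leaf_s):
    """True when on-demand spawning plus provisioned slots covers the peak rate."""
    demand = peak_spawn_rate(n_rows, n_layers, tau, spawn_window_s)
    return demand <= ACCOUNT_SPAWN_RATE_PER_S + n_slots / t_leaf_s


def min_slots(n_rows, n_layers, tau, spawn_window_s, t_leaf_s):
    """Smallest slot count that satisfies the inequality (0 if none are needed)."""
    deficit = peak_spawn_rate(n_rows, n_layers, tau, spawn_window_s) - ACCOUNT_SPAWN_RATE_PER_S
    return max(0, math.ceil(deficit * t_leaf_s))
```

With illustrative values spawn_window_s=10 and t_leaf_s=0.4 (the latter echoing the ~400 ms warm leaf figure from the system diagram), a 10K×5L job at τ=1000 peaks at 5 spawns/s and needs no slots, while a 100K×25L job peaks at 250 spawns/s and needs 94.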

8.2 Lambda teases cheap and charges real

The serverless promise is pay only for what you use. The reality for compute-shaped workloads (trees of short-lived invokes that need real parallelism) has four teeth that don't show up on the price sheet:

| What AWS markets | What actually bills you |
|---|---|
| "Run code without thinking about servers" | EC2 gatherer required for SQS draining + assembly: $98/month before any Lambda fires |
| "Massive parallelism, scale to zero" | Account-wide 1000/min spawn ceiling that 100 shards cannot bypass |
| "$0.0000167/GB-s, sub-cent per invoke" | Cold-start chains at depth-3 bill 3× the useful compute as parent-poll slot-wait (measured: 1.71× parallelism on a 5-shard fan) |
| "Provisioned concurrency for predictable latency" | 14× the on-demand compute rate, paid 24/7 unless toggled off; a leaked toggle on 100 shards is $168/day |

v9's design absorbs the first three: gatherer is a known fixed cost, sharding controls blast-radius, the toggle hands the fourth tooth back to the user as a deliberate session primitive. See /economics §5 for the three-bucket cost breakdown and /concurrency for the live spawn-rate budget.

© 2026 Charles Dana · Monce SAS · SnakeBatch v9 · /paper · /economics · /math