The Math — SnakeBatch

Charles Dana · Monce SAS · May 2026

One page. Equations only. The way Charles thinks about it.

1. Snake as Indicator-to-CNF

Let P be a finite population, T a discrete target, and 1_C(x) the indicator of class C ⊂ P. The Dana Theorem (2024):

∀ C ⊂ P, ∃ φ_C ∈ CNF : 1_C(x) = ¬φ_C(x)

constructed in polynomial time by:

φ_C = ∧_{f ∉ C} ∨_k oppose(t_k, f)

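where oppose(t_k, f) returns a literal ℓ with ℓ(t_k) = 1 and ℓ(f) = 0 for a member t_k ∈ C. Each clause excludes at least one non-member without falsifying any member. The construction is O(|P|² · m), where m = |features|: at most |P| non-members, each opposed against at most |P| members, with each oppose call scanning up to m features.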
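A minimal sketch of the clause construction, following the description above literally. It assumes literals are equality tests on one discrete feature and that oppose picks the first feature where t and f differ; the names Row, holds, build_cnf and evaluate are illustrative and are not taken from algorithmeai. It checks only that the formula separates C from its complement, whichever sign convention the indicator uses.

```python
from typing import List, Set, Tuple

Row = Tuple[int, ...]      # one population member: discrete feature values
Literal = Tuple[int, int]  # (feature index j, value v), true when x[j] == v


def oppose(t: Row, f: Row) -> Literal:
    """Return a literal that is true on t and false on f."""
    for j, (tv, fv) in enumerate(zip(t, f)):
        if tv != fv:
            return (j, tv)
    raise ValueError("t and f are identical, so no literal separates them")


def holds(lit: Literal, x: Row) -> bool:
    j, v = lit
    return x[j] == v


def build_cnf(members: List[Row], non_members: List[Row]) -> List[Set[Literal]]:
    """One clause per non-member f: an OR of oppose(t_k, f) over all members."""
    return [{oppose(t, f) for t in members} for f in non_members]


def evaluate(cnf: List[Set[Literal]], x: Row) -> bool:
    """AND over clauses of OR over literals."""
    return all(any(holds(lit, x) for lit in clause) for clause in cnf)


members = [(0, 0, 1), (1, 0, 1)]
non_members = [(0, 1, 0), (1, 1, 1)]
cnf = build_cnf(members, non_members)
print([evaluate(cnf, x) for x in members])      # members satisfy every clause
print([evaluate(cnf, x) for x in non_members])  # each non-member falsifies its own clause
```

The double loop makes |C| · |P∖C| oppose calls of at most m comparisons each, which is where the O(|P|² · m) bound comes from in this sketch.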

2. Bucketing Linearizes

Partition P, with n = |P|, into n/b buckets via an IF/ELIF/ELSE chain. Each bucket holds ≤ b members, so SAT construction inside it is O(b²m). Summing over the n/b buckets gives O(n · b · m) per layer, and over L layers:

T_train(n, m, L, b) = O(L · n · b · m)

Linear in n and in m. L = number of layers; b = bucket size (constant; default 250 in algorithmeai, user-controlled in v9).
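To make the linearization concrete, this sketch counts work units with and without bucketing. It is a cost model only, under the assumption that SAT construction on s members costs s² · m units; the function names and the sample values of m and the layer count are illustrative.

```python
def bucketed_work(n: int, b: int, m: int, layers: int) -> int:
    """Work units when the population is split into buckets of at most b members."""
    full, rest = divmod(n, b)
    per_layer = (full * b * b + rest * rest) * m
    return layers * per_layer


def unbucketed_work(n: int, m: int, layers: int) -> int:
    """Work units for one SAT construction over the whole population per layer."""
    return layers * n * n * m


for n in (1_000, 2_000, 4_000):
    print(n, bucketed_work(n, b=250, m=10, layers=5), unbucketed_work(n, m=10, layers=5))
```

Doubling n doubles the bucketed count and quadruples the unbucketed one.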

3. Lookalike Probability

For a query X routed to bucket B at layer l, let negated(X) = {i : clause_i(X) = 0}. The lookalike pool is:

Λ_l(X) = { m ∈ B.members : Λ-cond_m ⊆ negated(X) }

and predicted class probability is:

P(c | X) = ( ∑_{l=1..L} |{m ∈ Λ_l(X) : T(m) = c}| ) / ( ∑_{l=1..L} |Λ_l(X)| )

Continuous targets: each unique y is its own class. Perfect fit on training data holds by construction: every class, down to a singleton row with a unique y, has at least one defining clause.
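The pooled vote can be sketched as follows, assuming each layer's lookalike pool Λ_l(X) has already been resolved to the list of its members' target values. The function name and the sample pools are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, List


def class_probabilities(pools: List[List[Hashable]]) -> Dict[Hashable, float]:
    """P(c | X): per-class lookalike counts summed over layers, divided by total pool size."""
    votes: Counter = Counter()
    for pool in pools:
        votes.update(pool)
    total = sum(votes.values())
    if total == 0:
        return {}  # no lookalikes in any layer
    return {label: count / total for label, count in votes.items()}


# Three layers: pools of size 3, 1 and 0.
pools = [["a", "a", "b"], ["a"], []]
print(class_probabilities(pools))  # {'a': 0.75, 'b': 0.25}
```

Layers contribute in proportion to their pool size, so a layer with an empty pool does not vote.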

4. v9 Tree Topology

Binary divide until slice size ≤ τ; then conquer hands the slice to one leaf. Per layer:

depth = ⌈ log₂(n / τ) ⌉
leaves = conquers = ⌈ n / τ ⌉
divides = ⌈ n / τ ⌉ − 1

Total per training:

W(n, L) = L · (3 · ⌈ n/τ ⌉ − 1)
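A small sketch of these counts in integer arithmetic. The value τ = 1,000 in the example is hypothetical (the note does not state τ), and the function name is illustrative.

```python
def topology(n: int, tau: int, layers: int) -> dict:
    """Per-layer tree shape and total invocation count W(n, L)."""
    if n < 1 or tau < 1:
        raise ValueError("n and tau must be positive")
    leaves = -(-n // tau)              # ceil(n / tau) without floating point
    depth = (leaves - 1).bit_length()  # ceil(log2(leaves)); 0 when one leaf suffices
    divides = leaves - 1
    conquers = leaves
    per_layer = leaves + conquers + divides  # 3 * leaves - 1
    return {
        "depth": depth,
        "leaves": leaves,
        "conquers": conquers,
        "divides": divides,
        "invocations": layers * per_layer,
    }


print(topology(1_000_000, 1_000, 5))
```

For n ≥ τ, ⌈log₂⌈n/τ⌉⌉ equals ⌈log₂(n/τ)⌉, so the integer form matches the depth formula above.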

5. Wall Clock

Divides chain through depth, conquers wait once for their leaf, leaves run in parallel:

T_wall(n, L) ≈ depth · λ_invoke + T_leaf + bus_drain

With λ_invoke ≈ 100ms warm, T_leaf ≈ 0.5s, bus_drain ≈ 1s:

T_wall(10⁷, 5) ≈ 13 · 0.1 + 0.5 + 1 = 2.8s per layer; ~14s is the sum over L = 5 layers (5 · 2.8s), an upper bound if the layers are invoked in parallel
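The per-layer estimate as a function, using the constants quoted above as defaults. The function name is illustrative, and multiplying by the layer count assumes the layers' wall clocks add up.

```python
def wall_clock_per_layer(depth: int, invoke_s: float = 0.1,
                         leaf_s: float = 0.5, drain_s: float = 1.0) -> float:
    """Divide chain (depth invocations) + one leaf run + bus drain, in seconds."""
    return depth * invoke_s + leaf_s + drain_s


per_layer = wall_clock_per_layer(13)
print(round(per_layer, 2))      # 2.8
print(round(5 * per_layer, 2))  # 14.0 if five layers run back to back
```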

6. The Routing Invariant (Bayesian)

Let path(X) be the divide-tree leaf X reaches. Each bucket carries a condition — a conjunction of literals along its path:

cond(B) = ∧_i ℓ_i, X ∈ B ⇔ X satisfies cond(B)

Invariant. For any two assembled buckets B₁, B₂:

cond(B₁) ≠ cond(B₂)

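This is unique complete-AND per bucket. Because every condition is a root-to-leaf path of the same binary divide tree, two distinct conditions disagree on the literal at the split where their paths diverge, so no X satisfies both. Equivalent: the assembled chain is a flat IF/ELIF/ELSE where at most one bucket matches any X. traverse_chain's "first match wins" is then order-independent.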
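A toy check of the invariant, assuming conditions are dictionaries of boolean literals taken from the paths of a two-split tree. first_match is a hypothetical stand-in for traverse_chain, not its actual implementation.

```python
from itertools import permutations, product


def satisfies(cond: dict, x: dict) -> bool:
    """True when x agrees with every literal in the conjunction."""
    return all(x[name] == value for name, value in cond.items())


def first_match(chain, x):
    """Flat IF/ELIF/ELSE: return the first bucket whose condition x satisfies."""
    for bucket, cond in chain:
        if satisfies(cond, x):
            return bucket
    return None


# Paths of a tree that splits on a, then on b inside the a=True branch.
chain = [
    ("B1", {"a": True, "b": True}),
    ("B2", {"a": True, "b": False}),
    ("B3", {"a": False}),
]
queries = [{"a": a, "b": b} for a, b in product([True, False], repeat=2)]

at_most_one = all(sum(satisfies(c, x) for _, c in chain) <= 1 for x in queries)
order_independent = all(
    first_match(list(p), x) == first_match(chain, x)
    for p in permutations(chain)
    for x in queries
)
print(at_most_one, order_independent)  # True True
```

If two buckets carried the same condition, the permutation check would fail, because the winner would depend on chain order.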

7. The Bayesian Contradiction (what we removed)

If cond(B₁) = cond(B₂), both buckets fight for the same X. traverse_chain returns B₁; B₂'s lookalikes never vote. Worse:

For X with P_true(c | X) = 1 and unanimous lookalikes in B₂,
the model returns P_obs(c | X) = |Λ₁(X)| / total
which can be < 1 because B₂'s mass is missing.
The thin pool then hallucinates opposing class probability mass.

Cure. One leaf per route ⇒ Snake's build_bucket_chain is the only source of bucket conditions, and its output respects unique complete-AND by construction.

8. The Noise Trap

Snake's noise samples from local_pop = the routed slice. Adding noise this way still respects cond(B) (the noise members are already on the path). True regularization noise samples from:

N_cond(B) = { p ∈ P_global : p satisfies cond(B), p ∉ routed(B) }

If we sampled any p violating cond(B):

P(c | X) would integrate over members violating cond(B) — a Bayesian contradiction: voters must satisfy the path's AND.

Hence v9: noise = 0 at the leaf, until post-routing global injection lands.
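The candidate set N_cond(B) can be sketched as a filter over the global population. The data layout (ids mapped to boolean feature dictionaries) and the function names are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def satisfies(cond: Dict[str, bool], row: Dict[str, bool]) -> bool:
    """True when the row agrees with every literal of cond(B)."""
    return all(row[name] == value for name, value in cond.items())


def noise_candidates(population: Dict[str, Dict[str, bool]],
                     cond: Dict[str, bool],
                     routed: Set[str]) -> List[str]:
    """Ids of global members that satisfy cond(B) but were not routed to B."""
    return [pid for pid, row in population.items()
            if pid not in routed and satisfies(cond, row)]


population = {
    "p1": {"a": True, "b": True},
    "p2": {"a": True, "b": True},
    "p3": {"a": True, "b": False},  # violates cond(B): must never vote in B
    "p4": {"a": True, "b": True},
}
cond_B = {"a": True, "b": True}
routed_B = {"p1"}
print(noise_candidates(population, cond_B, routed_B))  # ['p2', 'p4']
```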

9. Cost

$(n, L) = L · ( ⌈ n/τ ⌉ · C_leaf + (⌈ n/τ ⌉ − 1) · C_div + ⌈ n/τ ⌉ · C_cnq )

With C_leaf ≈ $2.7×10^-5, C_div ≈ $5.4×10^-6, C_cnq ≈ $10^-5:

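$(10⁷, 5) ≈ 5 · (313 · 2.7×10^-5 + 312 · 5.4×10^-6 + 313 · 10^-5) ≈ 5 · $0.0133 ≈ $0.066

At these unit costs, a total near $0.50 corresponds to about 2,360 leaves per layer rather than 313.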
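The cost expression as a function, with the unit costs quoted above as defaults. The function name is illustrative and τ = 1,000 in the example is a hypothetical value, since the note does not state τ.

```python
def training_cost(n: int, tau: int, layers: int,
                  c_leaf: float = 2.7e-5,
                  c_div: float = 5.4e-6,
                  c_cnq: float = 1e-5) -> float:
    """Dollar cost: per layer, leaves * C_leaf + divides * C_div + conquers * C_cnq."""
    leaves = -(-n // tau)  # ceil(n / tau)
    divides = leaves - 1
    conquers = leaves
    return layers * (leaves * c_leaf + divides * c_div + conquers * c_cnq)


print(round(training_cost(1_000_000, 1_000, 5), 4))
```

Cost is linear in the leaf count, so halving τ roughly doubles the bill at fixed n.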

10. Empirical Time Complexity (3 datasets × 5K, May 2026)

We have two training sizes per dataset (n=4000 held-out, n=5000 perfect-fit). Two points determine a line T(n) = a + b·n, where b is the per-row slope (not the bucket size b of §2). The mesh fan-out and SQS round-trip are baked into a; b is the per-row leaf cost amortized across shards.

Task	T(4000)	T(5000)	a (s)	b (ms / row)	Predict T (1K rows)
binary	4.01s	4.80s	0.85	0.79	4.76s
3-class	3.72s	4.66s	−0.04	0.94	4.44s
regression	20.39s	24.93s	2.23	4.54	4.20s
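The a and b columns follow directly from the two timings in each row. A short sketch that recomputes them (the function name is illustrative):

```python
def two_point_fit(p1, p2):
    """Return (offset a, slope b) of the line through two (n, seconds) points."""
    (n1, t1), (n2, t2) = p1, p2
    slope = (t2 - t1) / (n2 - n1)
    offset = t1 - slope * n1
    return offset, slope


timings = {
    "binary": (4.01, 4.80),
    "3-class": (3.72, 4.66),
    "regression": (20.39, 24.93),
}
for task, (t_4k, t_5k) in timings.items():
    a, b = two_point_fit((4000, t_4k), (5000, t_5k))
    print(f"{task}: a = {a:.2f} s, b = {1000 * b:.2f} ms/row")
```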

With only two points the line fits exactly, so there is no residual to test linearity against; the slopes below are estimates. Reading the slopes:

Classification: ~1ms / training row, dominated by SAT clause construction at the leaf. Holds across binary and 3-class shapes (the 3-class slope is barely higher despite 50% more class buckets per leaf).
Regression: ~4.5ms / training row — ~5× slower. Continuous targets force every unique y to be its own class, so the oppose loop's outer cardinality is O(distinct(y)) ≈ O(n) in the worst case. Linear in n still, but with a much larger constant.
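a (offset): −0.04s to 2.23s in these fits, covering the fixed pipeline cost (HTTPS + drainer warmup + SQS bus drain); the slightly negative 3-class value is two-point noise, not a negative overhead. Below 1K rows, a dominates, which explains why 1K and 5K have similar wall clocks.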

The asymptotic form is consistent with these measurements. §2 predicts T_train(n, m, L, b) = O(L · n · b · m); the measured per-row slopes are within the expected range for the leaf cost model. The mesh fans shards out in parallel rather than running them serially, so the per-row multiplier stays flat as n grows.

Predict slope — depends on the model, not just n

Per-row cloud predict cost is not a constant. It traverses the assembled bucket chain, whose length grows with training size and layer depth. Two empirical points:

Model trained on	Layers	Predict 1K rows	Rows/s (≈ 1 / r_c)
Toy ~800 rows	3	~3.4s	~1400
Nature 22,147 rows	5	~18s	~250

So:

T_predict(n; M) ≈ RTT + n · r_c(M) where r_c(M) scales with assembled-bucket-chain length

The dispatch threshold (§11) is therefore a function of which model the user is predicting against, not just how many rows they ship. The SDK's default conservatively assumes the lighter end of this range.

11. SDK Dispatch Threshold (when local beats cloud)

Let n = batch size at predict time, r_l(M) = local algorithmeai per-row latency on assembled model M, r_c(M) = cloud per-row latency on the same model at the leaf, RTT = HTTPS + Lambda warm + drainer overhead:

T_local(n) ≈ n · r_l(M)
T_cloud(n) ≈ RTT + n · r_c(M)

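Cloud wins when T_cloud(n) < T_local(n), i.e. RTT + n · r_c(M) < n · r_l(M). For r_l(M) > r_c(M) this holds for n > n*, with:

n* = RTT / (r_l(M) − r_c(M))

If r_l(M) ≤ r_c(M), the two lines never cross in this single-stream model.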
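A sketch of the crossover under the two linear models above. The function name and the latencies in the example are hypothetical, chosen only to show the arithmetic; the function returns None when local is at least as fast per row, since the lines then never cross.

```python
from typing import Optional


def crossover_rows(rtt_s: float, r_local_s: float, r_cloud_s: float) -> Optional[float]:
    """Batch size n* above which RTT + n * r_cloud < n * r_local, or None if it never happens."""
    gap = r_local_s - r_cloud_s
    if gap <= 0:
        return None
    return rtt_s / gap


# Hypothetical: 1 s round trip, 5 ms/row locally, 1 ms/row in the cloud.
print(crossover_rows(1.0, 0.005, 0.001))
# Hypothetical: local faster per row, so no crossover exists in this model.
print(crossover_rows(1.0, 0.001, 0.005))  # None
```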

Critically, both r_l(M) and r_c(M) scale together with the same model. Their difference is what governs the crossover, and the difference is small — the cloud isn't "faster per row," it's "more cores at once." A few empirical anchors:

Model	r_l(M)	r_c(M)	RTT	n*
Toy ~800/3L	~0.01ms	~0.7ms	~1s	~1500
Nature 22K/5L	~0.05ms	~4ms	~1s	~250

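In both rows the n* column equals RTT / r_c(M) (1s / 0.7ms ≈ 1,430; 1s / 4ms = 250), the batch size at which the cloud's per-row time matches its fixed RTT. n* is lower on heavier models because the cloud's "n more parallel shards" advantage shrinks: each shard does more work per row, so adding 18 chunks doesn't 18× the throughput, and the slowest leaf still gates wall clock.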

The SDK default cloud_threshold = 500 (since v2.2.1) is the no-information default; 500 rows guarantees the user sees the mesh activate (a demo property) without sandbagging local speed. For production workloads on a known model, override:

Snake(model_id=…, cloud_threshold=N)

where N is sized to the empirical r_l / r_c of that model. Big-train, deep-layer models prefer cloud_threshold closer to 100; toy models prefer 5000+.

Population modes are always cloud. get_audit, get_lookalikes, get_augmented, get_candle read the population dictionary. The SDK ships a stripped model locally (no population); the mesh has the full one. Dispatch ignores n for these modes — cloud unconditionally.
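The dispatch rule described in this section can be sketched as below. This is not the SDK's code: the function name, the "predict" mode label, and the choice of ≥ (rather than >) for the threshold comparison are assumptions; only the four population mode names and the default of 500 come from the text.

```python
POPULATION_MODES = {"get_audit", "get_lookalikes", "get_augmented", "get_candle"}


def choose_backend(mode: str, n_rows: int, cloud_threshold: int = 500) -> str:
    """Return "cloud" or "local" for one call."""
    if mode in POPULATION_MODES:
        # These modes read the population dictionary, which only the mesh holds.
        return "cloud"
    # Assumption: batches at or above the threshold go to the cloud.
    return "cloud" if n_rows >= cloud_threshold else "local"


print(choose_backend("predict", 100))                         # local
print(choose_backend("predict", 500))                         # cloud
print(choose_backend("get_audit", 1))                         # cloud
print(choose_backend("predict", 500, cloud_threshold=5000))   # local
```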

12. Headline

n	L	Wall	Cost	Fit on D
10³	5	3.8s	$0.0006	100%
10⁵	5	~20s	~$0.005	100%
10⁶	5	~30s	~$0.05	100% (predicted)
10⁷	5	~50s	~$0.50	100% (predicted)

Perfect fit on D is not aspirational. It's a theorem of the construction. The Dana Theorem guarantees φ_C exists in poly time; v9 produces it distributed, with unique complete-AND per bucket; algorithmeai inference resolves it identically to a local Snake. The 100% column is what the math says, every time.