Teleox·Research
April 2026 · Research Paper

Derived Data Abundance

How multi-modal embedding decomposition solves the AI training-data crisis.

The training-data crisis is misframed.

The AI industry is hitting the training-data wall. Epoch AI projects that the stock of ~300 trillion tokens of public human text will be exhausted by frontier models between 2026 and 2032. This paper argues the crisis is misframed: the bottleneck is not raw volume but labeled, meaningful, multi-dimensional data.

Derived Data Abundance, the principle of running existing data through N independent embedding models to extract 100x+ the meaningful training signal of the raw corpus, offers a third path past the wall. We demonstrate it through two implemented systems: ClipCannon (video, 7 modalities, 4,044 dimensions) and Context Graph (text, 13 embedders, 78 cross-correlations).
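As a minimal sketch of the per-input mechanics, assuming generic embedder callables, cosine similarity as the cross-correlation, and a shared embedding dimension (none of which the paper pins down at this level):

```python
from itertools import combinations
import numpy as np

def embed_all(text, embedders):
    """Run one input through every embedder; each output is one vector label."""
    return {name: np.asarray(model(text)) for name, model in embedders.items()}

def cross_correlations(embeddings):
    """One scalar per unordered embedder pair, C(N, 2) in total.
    Assumes the vectors share a dimension (e.g. via a common projection)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {(a, b): cosine(embeddings[a], embeddings[b])
            for a, b in combinations(sorted(embeddings), 2)}
```

With Context Graph's 13 embedders, the second step yields exactly the 78 pairwise signals quoted above.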

We introduce meaning compression as a new category — distinct from weight, activation, and data compression — and argue it is the missing primitive for scaling training data beyond the public-text stock. The resulting training method is Teleological Constellation Training (TCT).

The full paper preserves the author's original phrasing, including the initial 91x multiplication figure from early implementations. Current Teleox systems and external reporting use the refined figure of 100x+, scaling toward 50+ embedders.

Eight numbers from the paper.

300T
tokens of public human text — the training stock
2026–32
window when frontier labs exhaust it
13
independent embedders per input (Context Graph)
78
cross-correlations per input (Context Graph)
4,044
embedding dimensions in ClipCannon (video)
12,000
labeled samples extracted from 16 minutes of video
100x+
effective training data from existing corpora
2
implemented systems proving the principle

What changes when meaning is the unit.

01

Meaning compression is a new category.

Weight, activation, and data compression all reduce the bits needed per unit of information. Meaning compression does the opposite: it increases the meaningful signal extracted per byte of raw data. That is the missing primitive for scaling training corpora past the human-text stock.

02

N embedders × one corpus = N-dimensional labels.

Each embedder extracts a different dimension of meaning — visual, semantic, temporal, causal, relational. All are grounded in real observations, not model outputs, so the derived data is immune to the model-collapse failure mode inherent in synthetic generation.
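As one illustration of how those labels could become supervised signal, here is a sketch under the assumption of a cross-modal prediction objective; the objective is our invention, since the paper specifies the labels rather than how a trainer consumes them:

```python
from itertools import permutations

def cross_modal_pairs(embeddings):
    """N embedder views of one real observation yield N * (N - 1) ordered
    (source, target) regression pairs; nothing here is model-generated."""
    return [(embeddings[src], embeddings[tgt])
            for src, tgt in permutations(embeddings, 2)]
```

Because every pair is derived from the same real observation, the loop never trains on a model's own outputs.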

03

Derived Data Abundance scales multiplicatively.

Training signal grows faster than linearly in the number of embedders: each new embedder contributes its own labels plus a cross-correlation with every embedder already applied. 13 today, with a clear engineering path toward 50+. This is the third path past the data wall, distinct from "find more raw data" and "generate synthetic data", and the one that composes with every other axis of scaling.
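The underlying arithmetic, assuming each input yields one label per embedder plus one cross-correlation per unordered embedder pair (an assumption, though it reproduces the paper's 13, 78, and initial 91x figures exactly):

```python
from math import comb

def signals_per_input(n_embedders):
    """Each embedder adds one label; each unordered pair adds one
    cross-correlation."""
    return n_embedders + comb(n_embedders, 2)

assert signals_per_input(13) == 13 + 78 == 91      # the initial 91x figure
assert signals_per_input(50) == 50 + 1225 == 1275  # the 50+ embedder target
```

On that count, moving from 13 to 50 embedders lifts the per-input signal count from 91 to over 1,200.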

Teleological Constellation Training.

Commercialized by Teleox.ai — TCT-trained LoRAs that force deterministic model outputs, built on 100x+ meaning-labeled training data.