There is a number doing the rounds at every observability conference this year. AI workloads, the slide says, produce ten to fifty times the telemetry of a traditional service. The number is usually quoted without sourcing and presented as an inevitability of the new architecture. I spent the last fortnight putting an actual measurement harness around the question, and the answer is more interesting than the slide.

The short version. On the same RAG application, with the same auto-instrumentation, the choice of Python package alone is roughly a four-times swing on bytes per request. The newer of the two libraries puts roughly a kilobyte of metadata on the span and ships the conversation content as separate log records. The older one writes every message, twice, as flat string attributes. And the per-request multiplier, real as it is, gets dwarfed by request rate once you start comparing daily volumes against an actual REST service.

How the test was set up

The workload is a RAG application against a customer-support knowledge base. Each request classifies intent, embeds the user query, pulls the top five matches from Qdrant, re-ranks them, calls a chat model with the assembled context, validates the response, and returns. Five LLM calls per request, one vector query, plus the HTTP entry and exit.

Both runs use auto-instrumentation through opentelemetry-instrument. No spans are written by hand. The collector accepts OTLP and writes one file per (service, signal) pair, filtered on service.name. The harness fires fifty requests, stops the collector, and counts the bytes.

Held constant across both runs: a system prompt of 200 tokens, retrieved context of around 3,000 tokens, completions of 500 tokens for the full-length prompts. 1,536-dimension embeddings, matching text-embedding-3-small. An OpenAI-compatible mock service whose wire shape is identical to a real OpenAI call. The same openai, qdrant-client, FastAPI, and OpenTelemetry SDK versions in both app variants.

One hundred trials per library, varied prompts

Each trial fires ten requests against the RAG app, waits three seconds for the collector's batch processor to flush, and records the byte deltas on the three telemetry files. Each request samples a prompt from a pool of thirty questions ranging from four-token factoids ("Reset admin password?") to one-hundred-token multi-part troubleshooting questions. The mock LLM returns a completion whose length scales with the prompt: short prompts get one to two sentences, medium prompts three to six, long prompts eight to fifteen.
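The per-trial measurement is simple enough to sketch. What follows is a minimal illustration rather than the harness's actual code: the function names, the `fire_requests` callback, and the shape of the `signal_files` mapping are all assumptions made for the example.

```python
import os
import time
from typing import Callable, Dict


def file_size(path: str) -> int:
    """Size in bytes, or 0 if the collector has not created the file yet."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def measure_trial(
    fire_requests: Callable[[], None],
    signal_files: Dict[str, str],
    flush_wait_s: float = 3.0,
) -> Dict[str, int]:
    """Run one trial and return how many bytes each signal file grew by.

    signal_files maps a signal name ("traces", "metrics", "logs") to the
    file the collector writes for it (hypothetical paths).
    """
    before = {sig: file_size(path) for sig, path in signal_files.items()}
    fire_requests()            # e.g. ten requests against the RAG app
    time.sleep(flush_wait_s)   # give the batch processor time to flush
    after = {sig: file_size(path) for sig, path in signal_files.items()}
    return {sig: after[sig] - before[sig] for sig in signal_files}
```

Measuring deltas rather than absolute sizes means the collector can keep running between trials without earlier trials polluting later ones.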

Numbers reported below are means across one hundred trials per library, around one thousand requests per library. Standard deviation, p50, p95, min, and max are computed from the same per-trial samples and reported alongside the means where it matters. The full per-trial JSONL and the aggregated stats are reproducible from the harness in the repo.
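For anyone recomputing the summary figures from the per-trial JSONL, something like the following would do. It is a sketch under assumptions: the function name is mine, and I am assuming a sample standard deviation and an inclusive percentile method, which may differ slightly from what the harness uses.

```python
import statistics
from typing import Dict, List


def summarise(per_trial_kib: List[float]) -> Dict[str, float]:
    """Mean, sample sd, p50, p95, min and max of per-trial samples (needs >= 2 samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(per_trial_kib, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(per_trial_kib),
        "sd": statistics.stdev(per_trial_kib),
        "p50": statistics.median(per_trial_kib),
        "p95": cuts[94],
        "min": min(per_trial_kib),
        "max": max(per_trial_kib),
    }
```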

What varies between trials: question content and length, completion length (scaled to prompt size with a small noise component), token counts, batch boundaries. Held constant: instrumentation versions, the mock LLM's behaviour, the knowledge-base contents, the collector configuration, and the surrounding infrastructure.

OpenInference: the package most teams ship today

OpenInference is the Arize instrumentation that ships by default in Phoenix and most observability vendor templates. It captures the full prompt, completion, token usage, model, temperature, and tool calls as flat attributes on every LLM span.

Bytes per signal · OpenInference run

Mean across 100 trials × 10 requests · prompt content capture on

traces: 123.4 KiB
metrics: 2.5 KiB
logs: 0.06 KiB

Per request: mean 125.96 KiB, sd 37.33 KiB · range 57.81–197.91 KiB.

The main chat completion span dominates. It carries the 200-token system prompt, the 3,000-token retrieved context, and the 500-token response, all serialised to flat span attributes. The embedding span is the second-largest line because the 1,536-float query vector is also captured by default.

Component contribution to per-request bytes · OpenInference

Estimated proportions, rescaled to the measured mean

main chat span: ~78 KiB
embedding span: ~23 KiB
intent + re-rank + validate: ~18 KiB
qdrant span: ~3.7 KiB
http envelope: ~3 KiB

Five LLM spans (the embedding call among them) plus Qdrant plus the HTTP envelope, ≈126 KiB per request, ±37 KiB depending on prompt length.

OpenTelemetry GenAI v2: same workload, different encoding

The OpenTelemetry GenAI semantic conventions are the OTel-authored spec for LLM telemetry, implemented in the opentelemetry-instrumentation-openai-v2 Python package. Structured metadata stays on the span as attributes; message bodies are emitted as events that travel as log records over the OTLP Logs API.

Bytes per signal · OTel GenAI v2 run

Mean across 100 trials × 10 requests · prompt content capture enabled

traces: 10.1 KiB
logs: 18.3 KiB
metrics: 2.5 KiB

Per request: mean 30.89 KiB, sd 3.47 KiB · range 24.97–37.62 KiB.

The pipeline still makes the same five LLM calls, the embedding among them, plus the vector query. What has changed is how much each span weighs once serialised, and where the message bodies live.

Component contribution to per-request bytes · OTel GenAI v2

Same components, fewer bytes per span; message bodies routed to log records

main chat (span + logs): ~17 KiB
intent + re-rank + validate: ~6.8 KiB
embedding span: ~2.7 KiB
qdrant span: ~2.5 KiB
http envelope: ~1.9 KiB

Application logs now flow through OTLP; they were silent in the OpenInference run because that SDK version handled the log exporter differently.

The same workload, on one chart

Both runs share the same steps and inputs. The only thing that differs is the package on the LLM client.

Bytes per request · both libraries

Mean across 100 trials per library · lower is better

openinference: 125.96 KiB
otel genai v2: 30.89 KiB

OpenInference 126 KiB ± 37 · OTel GenAI v2 31 KiB ± 3 · ratio 4.08×.

The biggest spread is on the LLM spans, where the prompt and completion attributes live. Embedding spans diverge too: OpenInference captures the 1,536-float query vector by default, the GenAI conventions do not.

Where the prompt actually lives

Both libraries capture the same information. They put it in different places in the OTLP payload, and those places carry different overhead per byte. I pulled the biggest chat completion span out of one trial and counted the bytes attribute by attribute. The OpenInference span weighs 6,595 bytes against 1,087 bytes for the OTel GenAI v2 span. The breakdown below explains where the gap comes from.

openinference · 6,595 B

all on one span

2,083 B · input.value · full OpenAI request body serialised as a JSON string
1,232 B · llm.input_messages.1.message.content · user message + retrieved context, as a flat string
951 B · output.value · full OpenAI response body serialised as a JSON string
781 B · llm.input_messages.0.message.content · system prompt, as a flat string
602 B · llm.output_messages.0.message.content · assistant completion, as a flat string
115 B · llm.invocation_parameters · parameters as a JSON string
84 B · llm.output_messages.0.message.role · "assistant"
80 B · llm.input_messages.0.message.role · "system"
78 B · llm.input_messages.1.message.role · "user"
~589 B · ten other metadata attributes · mime types, token counts, model, finish_reason, system

otel gen_ai v2 · 1,087 B span + ~1,360 B logs

metadata on the span, messages as separate log records

105 B · gen_ai.response.finish_reasons · ["stop"]
92 B · gen_ai.response.id · the chatcmpl-... identifier
73 B · gen_ai.response.model · model name on the response
72 B · gen_ai.request.model · model name on the request
67 B · gen_ai.usage.output_tokens · integer
66 B · gen_ai.usage.input_tokens · integer
66 B · gen_ai.request.max_tokens · integer
66 B · gen_ai.operation.name · "chat"
63 B · server.address · "mock-llm"
60 B · gen_ai.system · "openai"
53 B · server.port · integer

+ logs, linked by traceId + spanId

~458 B · gen_ai.system.message · system prompt content as a log record body
~337 B · gen_ai.user.message · user prompt + retrieved context
~565 B · gen_ai.choice · the completion

Three things drive the difference.

OpenInference captures the conversation twice

input.value (2,083 B) and output.value (951 B) hold the full OpenAI request and response bodies as serialised JSON strings, identical to what the SDK sent on the wire. On top of those, every individual message is also captured as a flat llm.input_messages.N.message.content attribute. The same conversation is encoded in both shapes because some tools read the flattened version and some read the structured one. The duplication alone costs about three kilobytes per chat span.

Attribute keys repeat on the wire for every message

OpenInference uses fully-qualified flat keys: llm.input_messages.0.message.content, llm.input_messages.1.message.content, llm.output_messages.0.message.content, and so on. Every key is written character-for-character inside an OTLP wrapper for every message. A four-message conversation pays roughly two hundred bytes of pure key-string overhead plus another hundred of wrapping. The GenAI conventions reuse short attribute names (role, content) inside a typed event body, so the wrapper happens once per message rather than once per field per message.

OTel splits message bodies out of the span and into log records

When prompt capture is enabled, OTel GenAI v2 emits the message bodies as log records via the OTLP Logs API. Each log record carries traceId and spanId so a UI can still join them back to the trace. The span itself stays around one kilobyte of metadata. OpenInference keeps every byte of conversation content as attributes on the chat span. The total bytes are not zero either way, but the storage and routing are very different: log-tier ingest in most observability platforms is cheaper per byte than trace-tier, and message bodies can be sampled, redacted, and retained independently of the trace itself.

The same chat completion costs about 6.6 KB on a single OpenInference span versus about 2.4 KB across an OTel span plus three linked log records (1,087 B of span and roughly 1,360 B of logs). The per-trial total ratio in the side-by-side chart (4.08×) is lower than the trace-only ratio (around 12×) because the OTel side ships those log records, which add to its logs file. The full picture is: OpenInference puts most of the AI telemetry on the trace; OTel spreads it across trace, log, and metric. The total is smaller with OTel, but the distribution looks different on the bill.

Why this matters past the byte count

  • Redaction is easier on events. A regex over llm.input_messages.*.message.content hits every message body indiscriminately. Events let you redact by role, by message type, or by content classification.
  • Attribute size limits exist. Collectors and backends truncate attribute values above a threshold; a prompt that overflows gets silently cut, and the trace loses the part the engineer wanted to read.
  • Attribute keys are designed for low-cardinality dimensions an analyst groups by. A unique prompt per request abuses that design even when the spec permits it.
  • Events route. In an OTel pipeline that fans logs and traces to different backends, events-as-logs let you keep conversation history without paying trace-tier ingest for it.
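To make the redaction point concrete, here is a minimal sketch of redacting by message type once bodies live in their own records. The record shape (a dict with `event_name` and a `body` holding `content`) is a simplification I am assuming for illustration, not the exact OTLP log record layout, and `redact_by_event` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List, Set


def redact_by_event(
    records: Iterable[Dict[str, Any]],
    redact_events: Set[str],
    placeholder: str = "[REDACTED]",
) -> List[Dict[str, Any]]:
    """Return copies of the records with content blanked for the chosen event names."""
    out = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the caller's record is left untouched
        if rec.get("event_name") in redact_events:
            body = dict(rec.get("body") or {})
            if "content" in body:
                body["content"] = placeholder
            rec["body"] = body
        out.append(rec)
    return out


# Assumed, simplified records: keep the system prompt, drop what the user typed.
records = [
    {"event_name": "gen_ai.system.message", "body": {"content": "You are a support agent."}},
    {"event_name": "gen_ai.user.message", "body": {"content": "Reset admin password?"}},
]
redacted = redact_by_event(records, {"gen_ai.user.message"})
```

The equivalent on flat attributes would need to know which message index holds which role before it could decide what to blank.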

Reading the conversation back

The encoding choice has a consequence I have not addressed yet. Once the message bodies move from span attributes to log records, the trace is no longer self-contained. The span tells you which model was called, how long it took, what the token usage was, and what finish reason it returned. The actual conversation, the prompt the user sent and the answer they got back, lives somewhere else, joined only by trace ID and span ID.

That is fine, until you want to reconstruct what actually happened on a given chatbot turn. To pull the prompt, the retrieved context, the completion, and the spans that produced them onto one screen, you need a backend that can run a single query across both signals: spans filtered by trace ID, logs filtered by the same trace ID, results merged on span ID and sorted by timestamp. If logs and traces live in separate stores with separate query languages, you are going to spend a lot of time copying trace IDs between two browser tabs.
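The join itself is not complicated when both signals are reachable from one place. A rough sketch, assuming spans and log records have already been fetched as plain dicts; the field names (`trace_id`, `span_id`, `start_ns`, `time_ns`) are hypothetical stand-ins for whatever your backend returns:

```python
from collections import defaultdict
from typing import Any, Dict, List


def reconstruct_turn(
    trace_id: str,
    spans: List[Dict[str, Any]],
    logs: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Merge spans and log records for one trace: group logs by span, order by time."""
    logs_by_span: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for rec in logs:
        if rec["trace_id"] == trace_id:
            logs_by_span[rec["span_id"]].append(rec)

    turn = []
    matching = (s for s in spans if s["trace_id"] == trace_id)
    for span in sorted(matching, key=lambda s: s["start_ns"]):
        records = sorted(logs_by_span.get(span["span_id"], []), key=lambda r: r["time_ns"])
        turn.append({"span": span, "logs": records})
    return turn
```

The hard part is not this function. It is getting `spans` and `logs` out of two different stores with one credential and one query window.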

This is the operational cost on the other side of the byte-level win. The space savings are real, the redaction story is better, the per-signal retention is more honest. But you need an observability platform that treats traces and logs as querying surfaces in the same workspace, rather than two products with the same vendor's logo on top. A siloed backend, where logs go to one product and traces to another with a brittle deep-link between them, turns "show me everything for this chatbot conversation" from a single query into a manual archaeology project. The library choice is reversible inside a sprint. The platform choice usually isn't.

Now turn off prompt capture

To turn prompt content capture off, I added two more services to the Docker Compose file. They reuse the same images as the original rag-openinference and rag-otel containers, but each sets an environment switch that tells the LLM instrumentation to stop putting prompt and completion bodies into its telemetry: the OPENINFERENCE_HIDE_* settings on the OpenInference side, OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=false on the OTel side. Then I re-ran the same harness, fifty trials of ten requests, against all four services.

OpenInference dropped from 129 KiB per request to 18, while OTel only dropped from 31 to 23. The same kind of switch, the same workload, but a seven-times reduction on one side and a cut of roughly a quarter on the other.

Bytes per request · content on vs content off

Mean across 50 trials per library, ten requests per trial · sd shown alongside

openinference: 129.4 ± 43.5
openinference redacted: 18.4 ± 3.2
otel genai v2: 30.8 ± 3.6
otel genai v2 redacted: 23.1 ± 2.9

All values KiB. Reduction from content-on: OpenInference 7.03×, OTel 1.33×. Once both are redacted, OpenInference comes in at 0.80× the size of OTel.
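The three ratios fall straight out of the four means. A few lines to check the arithmetic, with the means copied from the chart and the helper name invented for the example:

```python
# Mean KiB per request, copied from the chart above.
MEAN_KIB = {
    "openinference": 129.4,
    "openinference redacted": 18.4,
    "otel genai v2": 30.8,
    "otel genai v2 redacted": 23.1,
}


def ratio(numerator: str, denominator: str) -> float:
    """Ratio of two configurations' mean bytes per request."""
    return MEAN_KIB[numerator] / MEAN_KIB[denominator]


oi_reduction = ratio("openinference", "openinference redacted")
otel_reduction = ratio("otel genai v2", "otel genai v2 redacted")
redacted_oi_vs_otel = ratio("openinference redacted", "otel genai v2 redacted")

print(f"{oi_reduction:.2f}x {otel_reduction:.2f}x {redacted_oi_vs_otel:.2f}x")
# prints "7.03x 1.33x 0.80x"
```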

The asymmetry

The size of each library's content reduction tracks how much of its bytes were content to begin with. Almost all of OpenInference's volume sits on the trace: a trial produces 1.17 MiB of traces against 12.59 KiB of metrics and 78.85 KiB of logs, roughly 93% trace. When the content attributes are removed, the trace collapses to 92.69 KiB and the per-trial total drops from 1.26 MiB to 184.34 KiB, mostly from that single signal getting much lighter.

OTel spreads the bytes more evenly. Only 94 KiB of its 308 KiB per trial sit on the trace, with 194 KiB living in logs because the GenAI conventions ship each message body as its own log record. Disabling content capture trims those record bodies without removing the records themselves, so the trace barely moves and the total comes in at 230 KiB per trial, a much smaller fall.

Per-trial bytes by signal · all four configurations

Segments are proportional within each row · totals on the right · one trial = ten requests

Legend: traces · metrics · logs

openinference: 1.26 MiB
openinference redacted: 184.3 KiB
otel genai v2: 308.4 KiB
otel genai v2 redacted: 230.6 KiB

OpenInference is mostly traces; OTel is mostly logs, and stays that way even when redacted.

The asymmetry comes from where each library puts the bytes in the first place. OpenInference's content was a much bigger share of a much bigger total, because the system prompt, the retrieved context, and the completion all sit as flat span attributes; the same span also carries input.value and output.value as serialised JSON copies of the same content. When you flip the switch, all of that disappears. OTel never had that surface area to begin with.

The order flips

With content on, OpenInference is roughly four times heavier than OTel. With content off, OpenInference is lighter than OTel, at 18.4 KiB per request against 23.1. The reason OTel doesn't fall further is structural: turning content off empties the message bodies but leaves the records, the histograms, and the per-message routing in place, all of which still cost bytes to ship even when the bodies are empty. That residual structure is OTel's floor under redaction, and it sits above OpenInference's because OpenInference, once you strip the content, has nothing left to ship except a handful of metadata attributes.

For a team that has firmly decided not to capture prompts, OpenInference-redacted is the cheapest deployment available, and the trade is that you also give up the per-signal structure OTel keeps even when redacted. The redacted OTel run still has independent log records per message and per choice, even with empty bodies, which gives you somewhere to put the content back later if you change your mind about a sampled tenant or a specific trace cohort. Once you've stripped content from OpenInference, the structure that would let you flip it back on selectively isn't there to flip.

Variance collapses without content

The means describe the centre of each distribution, but the spread around the centre is the part capacity planning has to budget for, and the spread changes more dramatically with redaction than the mean does. OpenInference with content on has a standard deviation of 43.5 KiB, which is 33.6% of its mean, and the range across five hundred requests spans 57.6 to 244.0 KiB. The size of the prompt is what drives that spread: a short factoid question and a one-sentence completion land at one end, and a one-hundred-token multi-part troubleshooting question with a paragraph-long completion lands at the other. When the content goes, that source of variation goes with it. Standard deviation falls to 3.2 KiB, the range tightens to 13.4 to 23.6 KiB per request, and what's left is mostly fixed envelope.

OTel sits at 11.7% relative variance with content on and 12.4% with content off, almost no movement either way, because the bytes that change with prompt length were never as large a share of the total in the first place. In absolute terms the steadiest of the four configurations is OTel-redacted, with the smallest standard deviation (2.9 KiB) and every request inside a 17.2 to 28.6 KiB envelope.

Per-request byte spread · min to max, with p50 to p95 band and mean tick

Shaded band: p50 to p95 · vertical tick: mean · horizontal scale 0 to 245 KiB

openinference: 57.6 to 244.0
openinference redacted: 13.4 to 23.6
otel genai v2: 24.8 to 37.1
otel genai v2 redacted: 17.2 to 28.6

All values KiB. Relative variance (sd as a percentage of mean): OpenInference content on 33.6%, OpenInference redacted 17.3%, OTel content on 11.7%, OTel redacted 12.4%.

The practical effect is that the redacted OTel run is the easiest of the four to put a forecast against. The default OpenInference run is the hardest, by roughly three times on relative variance and an order of magnitude on the absolute range, which means any capacity model built off its average will be wrong on a quarter of the requests in either direction.

App logs are a side channel

Turning off the GenAI capture flag stops the LLM instrumentation from putting prompt content into the telemetry, but it doesn't touch any other code in the application that happens to do the same job. The application I was measuring has this line near the top of its request handler:

log.info("ask received id=%s q=%s", req_id, body.question[:80])

That statement has nothing to do with the GenAI instrumentation. It flows through the OTLP logs pipeline like any other application log, and the GenAI capture flag has nothing to say about it. Grepping the redacted OTel logs file for known prompt phrases turned up "reset the admin password" sitting in the rag-scope log records, exactly as the application emitted it.

openinference redacted · what's left on the span

content attributes removed; metadata stays

2,083 B · input.value · removed
1,232 B · llm.input_messages.1.message.content · removed
951 B · output.value · removed
781 B · llm.input_messages.0.message.content · removed
602 B · llm.output_messages.0.message.content · removed
115 B · llm.invocation_parameters · stays
~242 B · llm.*_messages.*.message.role · three role attributes still set
~589 B · ten other metadata attributes · mime types, token counts, model, finish_reason, system

otel gen_ai v2 redacted · span and log records

span metadata unchanged; log records still emitted with empty content

~880 B · gen_ai.* span attributes · unchanged from the content-on run

+ logs, still emitted, linked by traceId + spanId

~160 B · gen_ai.system.message · record present, body empty
~160 B · gen_ai.user.message · record present, body empty
~160 B · gen_ai.choice · record present, body empty

+ side channel

application log: "ask received id=... q=..." · prompt content still present, not redacted by the flag

And it isn't just this one log line. The same content can reach the collector through request-tracing middleware that records query strings, through custom span attributes the application sets by hand, through a sidecar that mirrors requests into a queue for audit, or through anywhere else in the codebase that touches the body of the request. The GenAI capture flag has no view of any of that. The only honest way to know whether a "redacted" deployment is actually redacted is to read what's reaching the collector, rather than trusting what the instrumentation documentation says it captures.
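The check I ran by hand with grep is easy to automate. A minimal sketch, assuming the collector's output is line-oriented text such as JSONL; `find_leaks` and the file path in the usage note are hypothetical:

```python
from typing import Dict, Iterable, List


def find_leaks(lines: Iterable[str], phrases: Iterable[str]) -> Dict[str, List[int]]:
    """Map each known prompt phrase to the 1-based line numbers where it appears.

    Matching is case-insensitive; phrases that never appear are omitted.
    """
    needles = [p.lower() for p in phrases]
    hits: Dict[str, List[int]] = {p: [] for p in needles}
    for lineno, line in enumerate(lines, start=1):
        lowered = line.lower()
        for needle in needles:
            if needle in lowered:
                hits[needle].append(lineno)
    return {p: nums for p, nums in hits.items() if nums}
```

Pointed at a logs file, for example `find_leaks(open("logs.jsonl", encoding="utf-8"), ["reset the admin password"])`, an empty result is the thing you want to see. A plain substring match will miss content that has been escaped or encoded on the way out, so treat a clean result as necessary rather than sufficient.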

Pick a quadrant on purpose

Laid out together, the four configurations describe four operating points for the same workload, each with its own trade-off, and they're worth picking deliberately rather than inheriting from whichever template the team copied off a vendor blog when they first wired this up.

openinference · content on (default) · 129.4 KiB / req

Richest for debugging, cheapest to reach (it ships in most observability templates), most expensive at scale. Variance 34% of mean, so capacity planning is the hardest of the four.

openinference · content off (redacted) · 18.4 KiB / req

The lowest floor on bytes. No prompts and no completions visible. The trace is the only signal carrying any LLM context at all; per-message structure does not exist.

otel genai v2 · content on (default) · 30.8 KiB / req

Moderately heavy by total volume, spread across signals so per-tier cost decisions stay tractable. Prompts and completions are addressable as log records, redactable per message, routable to a separate backend.

otel genai v2 · content off (redacted) · 23.1 KiB / req

The steadiest of the four in absolute terms (sd 2.9 KiB, 12% relative variance, 17.2 to 28.6 KiB envelope). Keeps the per-signal structure but no message bodies. A reasonable default for teams running close to a budget who expect to re-enable capture on a sampled basis.

The interesting question is which corner of this matrix actually matches what the team needs to do with the telemetry, and whether the surrounding code (the application logs, the middleware, the custom spans the app sets by hand) honours the same redaction posture as the LLM client does. The four-configuration data set lives in results/trials-4lib.jsonl, with aggregated stats in results/stats-4lib.json, and you can reproduce the whole thing by running ./run-trials.sh 50 10 against the harness on GitHub.

When does OpenInference stop being worth it?

If you've read this far and decided you still want OpenInference for its richer per-span attributes (the OpenInference span carries input.value, output.value, structured prompt templates, and per-message role and content pairs that OTel's GenAI conventions do not), one workable strategy is to keep OpenInference instrumented on every request and to sample only the prompt-content capture. On a sampled span you get OpenInference's full shape at 129.4 KiB; on an unsampled span you get its content-redacted shape at 18.4 KiB. The question is how high you can set that sampling rate before you have spent more bytes per day than you would have just using OTel GenAI v2 with content on for every span.

Per-request bytes vs OpenInference content sample rate

OpenInference instrumented on every request, only prompt-content capture is sampled · OTel GenAI v2 captures content on every span

[Chart: KiB per request (y-axis) against OpenInference content capture rate (x-axis, 0% to 100%). Series: openinference with sampled content; otel genai v2 with content always on. Break-even at 11.2%: OpenInference is cheaper below it, OTel is cheaper above it.]

Reference points along the OpenInference curve: 5% sampling lands at 24.0 KiB per request, 10% at 29.5 KiB, 25% at 46.1 KiB, 50% at 73.9 KiB, 100% at 129.4 KiB.
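The curve is a straight line between the redacted and full per-request means, so the reference points and the break-even can be recomputed in a few lines. This sketch assumes uniform sampling and uses the means from this post as defaults; the function names are mine.

```python
def blended_kib(rate: float, full_kib: float = 129.4, redacted_kib: float = 18.4) -> float:
    """Expected KiB per request when a fraction `rate` of spans carries content."""
    return redacted_kib + rate * (full_kib - redacted_kib)


def break_even_rate(
    target_kib: float = 30.8, full_kib: float = 129.4, redacted_kib: float = 18.4
) -> float:
    """Sampling rate at which the blended cost equals `target_kib` per request."""
    return (target_kib - redacted_kib) / (full_kib - redacted_kib)


for rate in (0.05, 0.10, 0.25, 0.50, 1.00):
    print(f"{rate:.0%} sampling: {blended_kib(rate):.2f} KiB per request")
print(f"break-even against OTel content-on: {break_even_rate():.1%}")
```

Swapping in your own three means is the quickest way to see where the crossing point sits on a different workload.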

The crossing point sits a long way from the centre of the range: capturing prompt content on a quarter of OpenInference spans already costs roughly half again as much per day as full-coverage OTel, a fifty-percent rate costs about 2.4 times as much, and one hundred percent lands on the roughly four-times ratio the earlier sections of this post documented. The line in the chart looks gently sloped on a 0-to-140 KiB y-axis, but relative to the break-even at 30.8 KiB it climbs steeply almost immediately.

If your operational tolerance for prompt content capture sits comfortably below 11%, OpenInference with sampled content is genuinely cheaper per day than full-coverage OTel, and you keep the richer per-span attribute shape on the spans that do carry the content. Above 11%, you are paying for OpenInference's verbosity without getting the coverage benefit, and the same daily byte budget gets you more debuggable spans on OTel with content captured on every trace.

One caveat worth flagging. The 11.2% break-even holds under uniform sampling, where the spans selected for content capture are an unbiased slice of all requests. Most sampling in practice isn't uniform. Teams sample on errors, on high latency, on flagged conversations, on specific tenants, and those requests almost always carry longer prompts than the average request does. The break-even on interestingness-weighted sampling drifts down accordingly, into the 8% to 9% region on this workload, because the spans selected for capture carry more bytes than the population mean. Worth measuring on your own traffic once before fixing a rate against an averaged number.

Are these volumes higher than traditional services?

Per request, yes, by a meaningful margin. But that is not the right unit. The bill cares about bytes per day, and request rates between the two patterns differ by orders of magnitude.

The traditional baseline, same harness

I also instrumented a non-AI control. A FastAPI app running a Postgres full-text search against the same knowledge base, returning the best-matching row. Same OpenTelemetry collector, same auto-instrumentation, same workload shape. Just no LLM in the path.

Bytes per request · traditional vs RAG

Both on the OTel GenAI v2 stack · same harness, same collector

traditional REST: 5.4 KiB
rag (otel): 30.89 KiB

Two spans per request against around fifteen. A factor of roughly 5.7× on bytes. Material, not catastrophic.

Request rate is where the orders of magnitude live

Traditional REST traffic patterns sit at a very different scale from chatbot traffic. A handful of public data points:

That gap is roughly three to six orders of magnitude. Plotted out:

Daily telemetry volume by workload

Per-request bytes from this study · request rates from the references above

internal copilot @ 1 RPS: 2.7 GB/day
heavy chatbot @ 10 RPS: 27 GB/day
traditional API @ 1k RPS: 467 GB/day
rag front-door @ 1k RPS: 2.7 TB/day

A traditional API at 1,000 RPS produces about 467 GB/day from auto-instrumentation alone. A chatbot at 1 RPS, even at 31 KiB per request, produces 2.7 GB/day. The traditional system, despite being 5.7× lighter per request, generates around 170× the daily volume because it serves 1,000× the requests.
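The daily figures are per-request bytes times requests per second times 86,400 seconds. A small sketch of that arithmetic follows. One assumption to flag: the chart values reproduce most closely when the per-request figure is treated as kilobytes of 1,000 bytes, so that is what this hypothetical helper does; strict KiB would come out a few percent higher.

```python
SECONDS_PER_DAY = 86_400


def daily_gb(kb_per_request: float, rps: float) -> float:
    """Daily telemetry volume in decimal GB.

    Assumes a steady request rate and treats the per-request figure as
    kilobytes of 1,000 bytes (an approximation, see the note above).
    """
    return kb_per_request * 1_000 * rps * SECONDS_PER_DAY / 1e9


traditional = daily_gb(5.4, 1_000)   # traditional API at 1k RPS
copilot = daily_gb(30.89, 1)         # internal copilot at 1 RPS
front_door = daily_gb(30.89, 1_000)  # RAG front door at 1k RPS

print(f"traditional {traditional:.0f} GB/day, copilot {copilot:.1f} GB/day, "
      f"front door {front_door / 1_000:.1f} TB/day")
```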

The RAG path overtakes the traditional path on absolute daily volume only when the AI workload replaces or fronts a high-throughput service. An AI front door for product search that handles 1,000 RPS at 31 KiB per request is 2.7 TB per day. An agent fronting the entire support intake at the same rate is the same number. Once the request rate matches a traditional API, the per-request multiplier compounds with it and the bill goes vertical. For workloads where the LLM sits beside a small audience (internal copilots, dev tools, niche assistants), the per-request multiplier is real but does not dominate the org's overall telemetry footprint.

What this means in practice

RAG is not expensive to observe. A handful of operational choices are, and each one moves the volume by a measurable amount.

Pick the instrumentation library deliberately. OpenInference and OTel GenAI v2 capture the same insight at different verbosities. On this workload that is a 4× swing on total bytes and a 12× swing on trace ingest. Pick, configure, re-measure.

Encoding is the loudest knob. Flat attribute encoding pays per-key overhead on every entry. Event encoding shares the schema. The same content costs different bytes.

Drop embedding vectors from traces. A 1,536-float query vector is around 25 KB serialised as a JSON array. OpenInference captures it on the embedding span by default; almost nobody reads it back from the observability tier. Store it in the vector DB instead.

Retune alerts for span count. Fifteen spans per request means fifteen places a status code or timeout can fire. Default alerting policies that worked on the REST app will page on every blip.

Redact through structure, not regex. An attribute named llm.input_messages.0.message.content resists selective redaction. An event with a typed role and content field redacts cleanly.

Sample, do not strip. Stripping prompts to save money leaves nothing for an agent or a reviewer to learn from later. Capture in full at a sample rate that scales inversely with traffic.
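One way to read "scales inversely with traffic" is to hold the number of fully captured requests per second roughly constant. The sketch below does that; the target of 0.5 captures per second and both function names are placeholders of mine, not a recommendation from the measurements.

```python
import random
from typing import Callable


def content_sample_rate(rps: float, target_captures_per_sec: float = 0.5) -> float:
    """Sampling probability that keeps captured requests per second near the target.

    Low-traffic services capture everything; the rate falls as 1/rps above that.
    """
    if rps <= 0:
        return 1.0
    return min(1.0, target_captures_per_sec / rps)


def should_capture(
    rps: float,
    target_captures_per_sec: float = 0.5,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide for one request whether to capture prompt content in full."""
    return rng() < content_sample_rate(rps, target_captures_per_sec)
```

At 1 RPS this captures about half of all requests; at 1,000 RPS it captures about one in two thousand, which keeps the captured volume flat as traffic grows.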

Run your own numbers

None of this needs to be taken on trust. The harness is open source, runs against any OpenTelemetry collector, and writes its raw byte counts into a JSONL file you can pick over yourself. It defaults to fifty requests per app; raise REQUESTS for a bigger sample. The code lives here, and I would be genuinely curious to hear what numbers other people get on their own RAG shapes.

The slide deck claim was ten to fifty times. The measured number, library choice held constant, is closer to six on bytes per request and well below one on bytes per day, until the AI workload starts replacing the things people actually use a lot. The explosion is real, but it is on the per-request line, and most of it sits in the encoding rather than the architecture.