Price Convergence Index Methodology

A robust dispersion statistic for the AI model pricing market, computed weekly across four quality tiers with bootstrap confidence intervals, and published as an open dataset. This page contains the formal definition, the rationale behind every choice, a worked example with last week's actual numbers, the data lineage, the changelog, and the academic references.

View the live PCI tracker /api/pci/series

1. Formal definition

For a tier of N models with positive blended prices p_1, ..., p_N, let L_i = log10(p_i). The PCI for that tier is the biweight midvariance of (L_1, ..., L_N), expressed as a square-root scale (i.e., comparable to a standard deviation).

blended_price(model) = input_price * 0.30 + output_price * 0.70

L_i = log10(blended_price(model_i))

med = median(L)

mad = median(|L - med|)

u_i = (L_i - med) / (9 * mad)

numerator = sum_i (L_i - med)^2 * (1 - u_i^2)^4 for |u_i| < 1

denominator = sum_i (1 - u_i^2)(1 - 5*u_i^2) for |u_i| < 1

biweight_midvariance = N * numerator / (denominator^2)

PCI(tier) = sqrt(biweight_midvariance)

We report the PCI as the square root of the biweight midvariance so its units are comparable to a log10 standard deviation. A PCI of 0.30 means the "typical" deviation of a model in the tier is 10^0.30 ≈ 2x the median price. A PCI of 1.00 means the typical deviation is 10x.

2. The four tiers

Rolling top-10 frontier

The 10 highest-ranked coding models on the live LMC leaderboard at the moment each weekly snapshot was archived. Membership changes whenever a new flagship launches or an old one is overtaken. This is the "what does the current frontier cost" series.

Composition-fixed top-10

The 10 model slugs that hold the top 10 in the most recent week, applied retroactively to every prior week. We do not backfill prices for slugs that did not exist - we simply report a smaller n for those weeks. This is the "how have these specific 10 models repriced over time" series.

Mid-tier (ranks 11 to 30)

The next 20 models below the rolling frontier. This is the "business workhorse" band - mature endpoints that production buyers default to when the frontier is overkill or too expensive.

Commodity (paid coding, < $50/M output)

Every paid coding model with output price below $50 per million tokens. This is the cloud-buyer surface and the largest tier by sample size. Outliers above $50 are excluded so that one experimental endpoint does not dominate the dispersion.

3. Worked example - 2026-w27

The most recent rolling top-10 frontier had n=10 priced models. Their median blended price was $20.75/M, the IQR of log10 prices was 0.472 (so the middle 50 percent spans a 3.0x band), and the sample standard deviation of log10 prices was 0.418.

n = 10

median price = $20.75/M tokens

raw stdev (n-1, log10) = 0.418

IQR (log10) = 0.472 [middle 50% spans 3.0x]

PCI biweight = 0.417

BCa 80% CI = (0.315, 0.556)

The robust PCI (0.417) sits below the raw sample stdev (0.418) because the biweight estimator partially downweights the most extreme model in the tier. The CI width (0.241) reflects how much uncertainty there is at n=10. Since 19 weeks ago the median price has moved from $10.32 to $20.75, a -101.0% drop, while the PCI itself has compressed by 56.9%.

4. Why log space, why biweight

Log space. Token pricing varies across roughly four orders of magnitude, from sub-cent per million for the cheapest distilled open-source models to several hundred dollars per million for premium reasoning specialists. Linear coefficients of variation are dominated by the most expensive observation and produce numbers that move when one frontier model launches a $200/M tier. log10 makes the data approximately symmetric and lets us treat the spread additively.

Biweight midvariance vs sample stdev. Sample stdev has 100 percent Gaussian efficiency but 0 percent breakdown - one experimental flagship priced at 10x the median controls the value. Biweight midvariance retains 87 percent efficiency at the Gaussian (you barely lose information when the data is well-behaved) while tolerating up to 50 percent contamination (you lose almost no information when the data is not).

Biweight midvariance vs MAD. MAD has 50 percent breakdown but only 37 percent efficiency. With n=10 frontier models, that efficiency loss matters. Biweight is the standard Tukey-recommended compromise.

Bessel correction. The raw sample stdev companion uses the n-1 divisor (sample variance), not n (population variance). The previous buried implementation in pricing-history used the population formula, which biases dispersion estimates downward for small samples. The new lib uses n-1.

5. What PCI is not

It is not a price level. Two quarters can both report PCI = 0.30 while having medians 20x apart. Always read the median price column alongside the PCI value.
It is not a forecast. We refuse to publish a long-run floor estimate until we have at least 20 weekly observations - the current 7 weeks cannot identify it.
It is not a price-war detector. If every provider drops their price by 30 percent in lockstep, the median falls but the PCI does not move.
It is not a market-share statistic. PCI weights every model in the tier equally regardless of usage volume. A 1 percent-share cheap model and a 50 percent-share expensive model count the same.
It is not a forecast of any specific provider's pricing strategy. The LOPO decomposition tells you which provider is currently moving the variance, not which way they will move next.

6. Data lineage

Source. Pricing, capabilities, and benchmark data are pulled hourly from a multi-source pipeline of upstream provider APIs and benchmark feeds. Every weekly snapshot is the union of those refreshes for the seven days ending Sunday.

Snapshot cadence. A cron job archives the full model catalog every Sunday at 23:00 UTC into data/weekly-snapshots/[year]-w[week].json. Each snapshot is immutable after capture.

Ranking source. The "top 10" in any week is determined by the live LMC leaderboard at the time the PCI tracker is rendered, not by the deprecated composite_score field stored inside historical snapshots. This guarantees the same top 10 you see on the homepage is the same top 10 used to compute frontier PCI.

Image and video models. Excluded from every tier. Their pricing structures (per-image, per-second-of-video) cannot be expressed in dollars per million tokens.

Free tier. Excluded from dispersion. Free tier shadow-pricing is appropriate for value-style indices but would distort a dispersion measure by anchoring the lower tail at zero.

7. Changelog

v1.0 - April 2026

Initial public release. Biweight midvariance headline, sample stdev companion, IQR(log10) readable measure, BCa bootstrap 80% CIs, four tiers (rolling top-10, composition-fixed, mid-tier, commodity), 30/70 input/output blended price, image/video/embedding excluded, output_price >= $500/M dropped, per-provider LOPO decomposition. No floor predictions until n >= 20 weekly observations.

v0.x - March 2026

Internal precursor lived as a section inside pricing-history using population variance (n divisor), no CIs, no LOPO, broad market tier polluted by image and video models. Deprecated and replaced by v1.0.

8. References

Mosteller, F. and Tukey, J. (1977). Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley. Original biweight midvariance derivation.
Wilcox, R. R. (2017). Introduction to Robust Estimation and Hypothesis Testing (4th ed.). Academic Press. Modern treatment of biweight efficiency and breakdown.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall. BCa bootstrap.
Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17(3): 1217-1241. Moving block bootstrap.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer. BCa theory and practical guidance.

Frequently Asked Questions

Median absolute deviation (MAD) has 50 percent breakdown but only 37 percent Gaussian efficiency, so it wastes information when the data is reasonable. Sample stdev has 100 percent efficiency but 0 percent breakdown, so a single weird flagship destroys it. Biweight midvariance is the standard Tukey-recommended compromise: 87 percent efficiency at the Gaussian and 50 percent breakdown, with a smoothly-weighted contamination cutoff at roughly 4 MADs. For a tier of 10 frontier models where we expect zero or one outlier per week, biweight is the right tool.

For each weekly sample of log10 prices we run 2000 BCa (bias-corrected accelerated) bootstrap iterations, computing the biweight midvariance on each resample. The bias correction z0 is the inverse normal CDF of the proportion of replicates below the observed value. The acceleration is from a jackknife on the sample. We use a deterministic LCG seed so the published intervals are reproducible across builds.

Adjacent weeks of PCI are not independent because the underlying frontier set is mostly stable week to week. Using an iid bootstrap on the slope would understate the standard error. Moving block bootstrap with block size 2 preserves the local autocorrelation structure while still randomizing block order, giving honest CIs that are wider but correct for our small sample.

Real production workloads sit somewhere between RAG (input-heavy) and reasoning (output-heavy). 30/70 leans output-heavy because the median LMC user is a developer building chat or generation features where output cost dominates the bill. We do not claim 30/70 is universal - the methodology page on the LMC ValueScore methodology uses 85/15 for chat-oriented value rankings precisely because that is a different audience. Both are documented and consistent within their respective indices.

We keep all snapshots since the series began. The current count (n=20) reflects how recently we started archiving weekly data. As n grows we will publish an asymptote estimate, additional confidence intervals, and a stationarity test. We are deliberately conservative about adding statistics that the data does not yet support.