The most important number in AI content optimisation comes from a 2024 KDD paper by Aggarwal et al. That number is 31% — the approximate increase in AI citation probability associated with including statistics in your content.
This essay is a close reading of that study: what they actually measured, how they measured it, what the 31% means precisely, and what it means practically for the content you're writing today.
What Aggarwal et al. 2024 studied
The study analysed a large corpus of web content and its citation behaviour across AI systems — specifically measuring which content properties were associated with higher AI citation probability. The research was peer-reviewed and published at KDD (Knowledge Discovery and Data Mining), a top-tier venue in data science and machine learning research.
The research team identified specific content properties and tested each against observed citation behaviour. The goal was to move beyond intuition about what AI systems "prefer" and establish quantifiable correlations between content characteristics and citation outcomes.
"Including statistics in content increased AI citation probability by approximately 31%."
That is the headline finding. But understanding why requires understanding what they were measuring.
What "citation probability" means in this context
Citation, in the Aggarwal et al. framework, means that an AI system references the content as a source when generating an answer. This is distinct from absorption, where an AI system draws on content to shape its generated answer without necessarily listing it as a citation. Both phenomena matter, but the Aggarwal study focuses specifically on citation selection.
Citation probability is the chance that a given piece of content is selected as a source for a query it could plausibly answer. The 31% figure is a relative lift: given a set of content that could answer a query, content with statistics is approximately 31% more likely to be cited than otherwise equivalent content without them.
Evidence tier note. Aggarwal et al. 2024 is peer-reviewed, published at KDD. It is the strongest evidence signal available in our tool set. The 31% figure is directionally robust — though the exact magnitude will vary by query type, content length, and AI platform. Use it as a directional anchor, not a precise conversion rate.
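To make the "directional anchor, not a conversion rate" point concrete, here is a minimal worked example. The baseline citation probability below is a hypothetical assumption for illustration, not a figure from the study:

```python
lift = 0.31  # relative lift associated with statistics (Aggarwal et al. 2024)

baseline = 0.10  # hypothetical baseline citation probability (assumption)
with_stats = baseline * (1 + lift)  # ~0.131

# The absolute gain depends entirely on the baseline: a 31% relative lift
# on a 5% baseline is ~1.6 points; on a 20% baseline it is ~6.2 points.
```

The same relative lift produces very different absolute outcomes across query types and competitive sets, which is why the figure anchors direction rather than predicting a specific result.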
The source attribution compounding effect
The Aggarwal study found a second, closely related signal: named source attribution. Adding an explicit source reference — "Per Smith et al. (2023)…" or "As documented by [Organisation]…" — alongside a statistic increased citation probability by approximately 30% in their analysis.
This is a distinct mechanism from the statistic itself. Statistics signal that claims are quantified; source attribution signals that claims are verifiable. The two signals appear to stack: content with both statistics and named attribution outperforms content with either signal alone.
This is why the Evidence Density Score measures these as separate dimensions — Evidence Richness captures statistics and quotations; the source attribution effect is embedded in the evidence weighting.
The readability finding
A second major finding from Aggarwal et al. 2024 concerns readability. The study found that content readable at Flesch-Kincaid grade 8–10 was associated with meaningfully higher AI extraction rates than content at grade 12 or above.
This finding is counterintuitive to writers trained on academic or professional publishing conventions, where complexity is often associated with authority. For AI extraction, the inverse appears to be true: readable, accessible content is extracted more reliably than dense, complex content at equivalent information quality.
Grade 8–10 is approximately the reading level of well-written newspaper journalism: clear sentences, active voice, accessible vocabulary. Not dumbed down, but efficient. The Readability Analyser surfaces your current grade level and flags content that sits significantly above the grade-10 threshold.
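The grade figures above come from the standard Flesch-Kincaid grade-level formula, which can be sketched as follows. The syllable counter here is a rough vowel-group heuristic (production tools use dictionaries), so treat the output as approximate:

```python
import re

def syllable_count(word: str) -> int:
    """Rough syllable estimate: count vowel groups, adjust for silent e."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1  # crude silent-e adjustment
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(syllable_count(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Short sentences with short words score low; long sentences packed with polysyllabic vocabulary push the grade up, which is exactly the pattern the extraction finding penalises.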
What this means for content you're writing today
The Aggarwal findings translate into three concrete priorities for any piece of content you're producing:
First: include at least three statistical claims per 1,000 words. A statistical claim is any sentence that contains a specific number tied to a measurement — a percentage, a ratio, a count, a study size, a date-stamped finding. "Studies show it helps" is not a statistical claim. "Including statistics increased citation probability by 31% (Aggarwal et al. 2024)" is.
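A quick way to audit a draft against this threshold is a deliberately crude counter: any sentence containing a digit is treated as a candidate statistical claim. This is a sketch for self-auditing, not the Evidence Density Score's actual method, and it will overcount (dates, version numbers) as well as miss spelled-out figures:

```python
import re

def stat_density(text: str) -> float:
    """Sentences containing a digit, per 1,000 words.

    A crude proxy: a real statistical claim needs a number tied to a
    measurement, which this heuristic cannot verify.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(re.findall(r"\w+", text))
    numeric = sum(1 for s in sentences if re.search(r"\d", s))
    return numeric / n_words * 1000 if n_words else 0.0
```

A result below 3.0 suggests the draft is light on quantified claims; a manual pass is still needed to confirm each flagged sentence actually ties its number to a measurement.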
Second: name the source of your statistical claims. Anonymous statistics have lower citation value than attributed ones. "31% higher citation probability" is weaker than "Aggarwal et al. (2024) found approximately 31% higher citation probability." The attribution is part of the evidence signal — not just an ethical citation practice.
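If you want to check drafts for named attribution automatically, a simple pattern match is a reasonable first pass. The patterns below are illustrative assumptions covering two common shapes ("Surname et al. (Year)" and "Per/According to/As documented by [Name]"), not an exhaustive grammar:

```python
import re

# Two common attribution shapes; extend for organisations, reports, DOIs.
ATTRIBUTION = re.compile(
    r"[A-Z][a-z]+ et al\.? \(\d{4}\)"                # "Aggarwal et al. (2024)"
    r"|(?:Per|According to|As documented by) [A-Z]"  # "Per Smith (2023)..."
)

def has_attribution(sentence: str) -> bool:
    """True if the sentence names a source in a recognised pattern."""
    return bool(ATTRIBUTION.search(sentence))
```

Running this over the sentences that `stat_density`-style checks flag as numeric shows which statistics carry a named source and which are anonymous.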
Third: write at grade 8–10. Run your content through the Readability Analyser. If your Flesch-Kincaid grade is above 12, the gap between your current complexity and the extraction sweet spot is measurable. The passive voice rate and sentence length distribution tell you where to make targeted edits.
One number to remember. +31% citation probability from statistics. Peer-reviewed. Published at KDD 2024. The strongest single signal in the Evidence Density Score — and the first thing the "Where to focus first" panel surfaces when evidence richness is deficient.
The limits of this research
The Aggarwal et al. study is the best evidence we have — peer-reviewed, rigorous, published at a top venue. But it has limits worth naming.
The study measures correlation, not causation. Statistics are associated with higher citation probability, but the mechanism is not definitively established. It may be that statistical content tends to be more specific, more verifiable, and more structured — and those properties, rather than the numbers themselves, drive the citation behaviour. The practical implication is the same, but the mechanism matters for edge cases.
The study also reflects AI system behaviour at a specific point in time. AI citation behaviour evolves as models are updated. The 31% figure should be treated as a directional anchor — a reliable signal, not a fixed conversion rate.
And the study primarily captures citation selection — not absorption, not user engagement, not conversion. High citation probability means you're more likely to be referenced. Whether that reference leads to a click, a read, or a purchase is a separate question.
Measure your evidence density now.
The Evidence Density Score applies the Aggarwal et al. findings directly — measuring statistics, quotations, readability, and structure in a single 0–100 score. Peer-reviewed source. No signup.