Zhang et al. 2026 Explained

In 2026, a research team published a preprint analysing 21,143 citations across three major AI platforms — ChatGPT, Google AI Overviews, and Perplexity. They found that high-influence pages (those whose content was visibly absorbed into AI-generated answers) differed from low-influence pages across six measurable structural dimensions. The differences were large. In some cases, enormous.

This dispatch explains what they found, what the preprint status means for how confidently you should act on it, and how the Absorption Analyser applies it.

The study at a glance

Zhang et al. 2026 is a preprint — it has not yet completed peer review. This matters, and we'll address it directly. But the study is large: 21,143 citations, across three major AI platforms, with each citation examined for the structural properties of the source page.

The researchers distinguished between two groups: high-influence pages (those visibly absorbed into AI answers) and low-influence pages (those not absorbed despite being relevant by topic). The analysis identified which structural properties most reliably differentiated the two groups.

The six properties — and the ratios

The findings are striking in their magnitude. High-influence pages were not marginally different from low-influence pages. They were categorically different across multiple dimensions.

High-influence pages were on average 11.44 times longer than low-influence pages.

This is the most immediately visible structural differentiator. High-influence pages are not slightly longer — they are an order of magnitude longer. This reflects the substantive depth that AI systems appear to prefer when selecting content to draw from.

The heading count finding is equally large: high-influence pages had 12.50 times more headings than low-influence pages. This ratio suggests that heading structure is not a cosmetic property but a primary signal. Well-structured content, navigable by heading, provides more parseable, more citable, more extractable units.

Paragraph count showed a similar pattern: high-influence pages had 5.69 times more paragraphs. Content divided into clear, bounded paragraphs is easier for AI systems to extract discrete facts from than content presented as long, uninterrupted blocks.
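As a back-of-the-envelope check, the three ratios can be combined to separate raw counts from density. The sketch below assumes the reported figures are ratios of group averages, which the summary above does not confirm, so treat the derived numbers as rough. The function and variable names are ours, not the study's.

```python
# Ratios quoted above (high-influence pages relative to low-influence pages).
WORD_RATIO = 11.44
HEADING_RATIO = 12.50
PARAGRAPH_RATIO = 5.69


def per_word_ratio(count_ratio: float, word_ratio: float = WORD_RATIO) -> float:
    """Ratio of 'count per word', ASSUMING both inputs are ratios of group means."""
    return count_ratio / word_ratio


headings_per_word = per_word_ratio(HEADING_RATIO)
paragraphs_per_word = per_word_ratio(PARAGRAPH_RATIO)
words_per_paragraph = WORD_RATIO / PARAGRAPH_RATIO

print(f"headings per word:   {headings_per_word:.2f}x")    # 1.09x
print(f"paragraphs per word: {paragraphs_per_word:.2f}x")  # 0.50x
print(f"words per paragraph: {words_per_paragraph:.2f}x")  # 2.01x
```

Under that assumption, headings per word come out roughly equal between the two groups, and high-influence paragraphs are about twice as long. In other words, the heading and paragraph ratios may largely track length rather than a denser structure per word.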

Definitional and comparative language

Beyond structural properties, Zhang et al. found that two language patterns were associated with higher absorption probability: definitional sentences and comparative sentences.

A definitional sentence explicitly states what something is: "Evidence density is a measure of how much verifiable, well-structured content a document contains." These sentences give AI systems a directly extractable, standalone explanation — a high-value extraction target.

Comparative sentences signal analytical depth. Sentences using "compared to", "unlike", "whereas", "by contrast", and similar constructions — content that explicitly situates one thing in relation to another — showed approximately 55% higher absorption probability in the Zhang et al. analysis.
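A rough way to spot both sentence types in your own drafts is a pair of regular expressions. This is a minimal sketch, not the method used by Zhang et al. or by the Absorption Analyser: only the four comparative phrases come from the text above, and the "X is a ..." definitional pattern and the naive sentence splitter are our assumptions.

```python
import re

# Comparative markers listed in the article.
COMPARATIVE = re.compile(r"\b(compared to|unlike|whereas|by contrast)\b", re.IGNORECASE)

# ASSUMED heuristic for definitions: a sentence opening with "X is/are a/an/the ...".
DEFINITIONAL = re.compile(r"^[A-Z][\w\s-]{0,60}?\b(is|are)\s+(a|an|the)\b")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_patterns(text: str) -> dict[str, int]:
    """Count sentences, and how many look definitional or comparative."""
    sentences = split_sentences(text)
    return {
        "sentences": len(sentences),
        "definitional": sum(bool(DEFINITIONAL.search(s)) for s in sentences),
        "comparative": sum(bool(COMPARATIVE.search(s)) for s in sentences),
    }


sample = (
    "Evidence density is a measure of how much verifiable, well-structured "
    "content a document contains. Unlike keyword density metrics, evidence "
    "density measures substance. We published this last week."
)
counts = count_patterns(sample)
print(counts)  # {'sentences': 3, 'definitional': 1, 'comparative': 1}
```

A pattern this simple will miss definitions phrased as "X refers to ..." and will flag some non-definitions, so use the counts as a prompt for review rather than a score.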

Statistics presence — the cross-study corroboration

Statistics presence is the one dimension in the Absorption Analyser corroborated by both Zhang et al. 2026 and Aggarwal et al. 2024 (peer-reviewed). Pages with statistics showed approximately 61% higher absorption probability in the Zhang analysis and 31% higher citation probability in the Aggarwal study.
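Neither study's definition of a "statistic" is given here, so the sketch below uses our own assumed heuristic: a number counts if it carries a percent sign, a multiplier ("times" or "x"), or thousands separators. Bare years and plain integers are deliberately ignored.

```python
import re

# ASSUMED definition of a "statistic"; not taken from either study.
STAT_PATTERN = re.compile(
    r"""
      \d+(?:\.\d+)?\s?%                 # percentages, e.g. 61% or 5.5 %
    | \d+(?:\.\d+)?\s?(?:times|x)\b     # multipliers, e.g. 11.44 times or 3x
    | \d{1,3}(?:,\d{3})+                # large counts, e.g. 21,143
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_statistics(text: str) -> list[str]:
    """Return every substring that matches the assumed statistic pattern."""
    return [m.group(0) for m in STAT_PATTERN.finditer(text)]


sample = (
    "The study analysed 21,143 citations. Pages with statistics showed "
    "approximately 61% higher absorption and were 11.44 times longer."
)
found = find_statistics(sample)
print(found)        # ['21,143', '61%', '11.44 times']
print(bool(found))  # True: the page has at least one statistic
```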

The cross-study corroboration matters. When a preprint finding aligns with a peer-reviewed finding, confidence in the directional signal increases — even if the precise magnitude differs between studies. Statistics presence is labelled Tier 1 in the Absorption Analyser's focus panel precisely because of this corroboration.

Confidence tier note. Statistics presence is the only absorption signal corroborated by peer-reviewed research (Aggarwal et al. 2024). The remaining five absorption signals (word count, heading count, paragraph count, definitional language, comparative language) are directional: they are sourced from the Zhang et al. 2026 preprint only. Act on them as directional guidance, not hard benchmarks.

What "preprint" actually means for how you use this

A preprint is a research paper that has been shared publicly before completing peer review. Peer review is a formal process in which independent experts in the field evaluate the methodology, analysis, and conclusions — and can request revisions, flag errors, or reject the paper.

Preprints are normal in fast-moving research fields. They allow findings to be shared and discussed before the slower peer review process completes. But they carry a meaningful caveat: the methodology has not been independently validated. Findings may change, be qualified, or in rare cases be retracted after peer review.

For our tools, this means:

Zhang et al. 2026 signals are labelled "directional", not "high confidence."

The Absorption Analyser's focus panel puts statistics presence (corroborated by Aggarwal 2024) first, above the purely Zhang-sourced signals, regardless of gap size.

We review Zhang et al.'s peer-review status on a scheduled basis. Next check: August 2026. If peer-reviewed, the tools are upgraded. If retracted or significantly revised, the tools are rebuilt.
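The "Tier 1 first, regardless of gap size" rule amounts to a two-key sort. The sketch below is an illustration of that ordering, not the Absorption Analyser's actual code: the `Signal` type, the numbering of directional signals as tier 2, the 0-to-1 `gap` scale, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    tier: int    # 1 = peer-review corroborated; 2 = directional (ASSUMED numbering)
    gap: float   # HYPOTHETICAL shortfall against a target, 0..1; larger = further behind


def focus_order(signals: list[Signal]) -> list[Signal]:
    """Sort by tier first, then by largest gap, so Tier 1 leads whatever its gap."""
    return sorted(signals, key=lambda s: (s.tier, -s.gap))


# Illustrative values only.
signals = [
    Signal("word count", 2, 0.70),
    Signal("statistics presence", 1, 0.10),
    Signal("heading count", 2, 0.85),
    Signal("comparative language", 2, 0.40),
]

ordered = [s.name for s in focus_order(signals)]
print(ordered)
# ['statistics presence', 'heading count', 'word count', 'comparative language']
```

Statistics presence leads even with the smallest gap, and the directional signals follow in order of how far behind they are.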

What the findings mean practically

If you take the Zhang et al. findings at directional face value, the practical implications are straightforward:

Write longer. Not padded — substantive. High-influence pages are long because they cover topics with depth, not because they repeat themselves. An 800-word article on a topic that warrants 2,000 words is leaving structural signal on the table.

Use more headings. Not decorative headings — structural ones that segment your content into bounded, topic-specific sections. At least one H2 per major idea. H3s for sub-points within that idea. The visualiser makes heading density visible instantly.

Define your terms. Every piece of content has at least one concept worth defining explicitly. "What is X" is not just an SEO question — it's an absorption question. Clear definitional sentences are extraction anchors.

Make comparisons explicit. Don't leave contrasts implied — state them. "Unlike keyword density metrics, evidence density measures…" is more extractable than writing that assumes the reader understands the distinction implicitly.
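The first two recommendations above (length and headings) can be checked with a quick count before you reach for a dedicated tool. This is a minimal sketch using Python's standard `html.parser`; the `StructureCounter` class is ours, it assumes your page is available as an HTML string, and it counts text inside `<script>` or `<style>` as words, so strip those first if they matter.

```python
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts heading tags, <p> tags and whitespace-separated words."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.headings = 0
        self.paragraphs = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.HEADINGS:
            self.headings += 1
        elif tag == "p":
            self.paragraphs += 1

    def handle_data(self, data):
        self.words += len(data.split())


def count_structure(html: str) -> dict[str, int]:
    counter = StructureCounter()
    counter.feed(html)
    counter.close()
    return {
        "words": counter.words,
        "headings": counter.headings,
        "paragraphs": counter.paragraphs,
    }


page = (
    "<h2>What is evidence density?</h2>"
    "<p>Evidence density is a measure.</p>"
    "<h3>Why it matters</h3>"
    "<p>It helps.</p>"
)
structure = count_structure(page)
print(structure)  # {'words': 14, 'headings': 2, 'paragraphs': 2}
```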

Measure your absorption signals.

The Absorption Analyser scores all six Zhang et al. dimensions — with evidence tiers clearly labelled and a prioritised focus panel showing what to fix first.

