AI research methodology: the framework, the common mistakes, and the evidence that separates a defensible answer from a confident one.
When we describe Verdikt's research pipeline to founders, the number that gets the most questions is 40. Why 40 sources? Why not five good ones? Why not 100?
The answer starts with the problem we were actually trying to solve, which is not coverage. It is hallucination risk.
A language model asked to research a startup idea without grounding will produce a confident, well-structured memo full of market data, competitive context, and risk analysis. A meaningful fraction of that content will be fabricated, not because the model is trying to deceive, but because it has learned that confident, specific-sounding answers are the expected output format. If the question requires a number the model does not know, it produces a plausible number. If it requires a company name, it produces a plausible name. The result reads like research and is not.
The solution is retrieval: pull the actual documents, quote the actual sentences, cite the actual sources. When the model has access to a retrieved document that contains the number it needs, it uses the number from the document instead of synthesizing one. The quality of the output is bounded by the quality of the retrieved content.
This is where the source count becomes a design question rather than a marketing one. If you retrieve from five sources, the breadth of your research is constrained to what those five sources cover. For a startup idea in a well-documented market, five sources might be sufficient. For an idea at the intersection of two categories, in a geography outside the major English-language business press, with a regulatory dimension that only appears in one specialized publication, five sources will miss things.
We settled on 40 sources as the minimum viable breadth for covering: the market sizing question (requires government and industry data), the competitive landscape (requires company filings, press, and analyst reports), the regulatory environment (requires jurisdiction-specific sources), the technical feasibility (requires research publications and technical press), and the customer evidence (requires qualitative sources: forums, reviews, job postings, support threads).
Each of those five research dimensions has its own source hierarchy. The market sizing question is best answered by government statistical offices and industry associations, not by Gartner. The competitive landscape is best answered by company careers pages and LinkedIn, not just by Crunchbase. The regulatory picture requires the actual regulation and the agency guidance documents, not a summary.
Forty is also not the ceiling. For ideas with significant international dimension, we extend to additional country-specific sources. For ideas in heavily regulated sectors, we add primary regulatory documents. The 40 is the floor that ensures we do not miss a whole category of evidence.
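The floor-plus-dimensions idea can be sketched as a simple check. This is a minimal illustration, not Verdikt's actual pipeline: the `Source` type, the dimension identifiers, and `coverage_gaps` are hypothetical names invented for this example.

```python
from collections import Counter
from dataclasses import dataclass

# The five research dimensions named above; these identifiers are illustrative.
DIMENSIONS = ("market_sizing", "competitive", "regulatory", "technical", "customer")
SOURCE_FLOOR = 40  # a floor, not a ceiling


@dataclass(frozen=True)
class Source:
    name: str
    dimension: str  # which research dimension this source serves


def coverage_gaps(sources: list[Source], floor: int = SOURCE_FLOOR) -> list[str]:
    """Return human-readable problems; an empty list means the pack passes."""
    problems = []
    if len(sources) < floor:
        problems.append(f"only {len(sources)} sources, floor is {floor}")
    counts = Counter(s.dimension for s in sources)
    for dim in DIMENSIONS:
        if counts[dim] == 0:
            # A whole category of evidence is missing, whatever the total count.
            problems.append(f"no sources for dimension: {dim}")
    return problems
```

The point of the second check is that a pack can clear the count and still fail: 40 sources that never touch customer evidence have the same blind spot as five.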
The other reason source count matters is auditability. Every claim in a Verdikt report links to the source that supports it. If the claim is wrong, you can trace it to its origin. If the source is wrong, you know that the problem is the source, not the reasoning. That traceability is impossible without a named source library. It is the thing that makes the memo defensible rather than confident.
Forty sources is not a feature. It is a consequence of wanting to produce something you can defend in a room where someone is trying to find the hole.
The tier system, in plain terms
Not every source carries the same weight. Verdikt grades every source into one of four tiers before it enters the citation pack.
Tier 1 is primary data published by the original collector: SEC EDGAR filings, the Bureau of Labor Statistics, Federal Reserve Economic Data (FRED), the US Census, Eurostat, the UK’s Office for National Statistics, Japan’s Ministry of Economy, Trade and Industry (METI), Brazil’s INEP, and India’s MOSPI. These sources are slow, ugly, and bulletproof. A claim with Tier 1 backing is one a partner cannot dismantle in five minutes of follow-up.
Tier 2 is named expert publication: SaaStr, OpenView’s Expansion SaaS Benchmarks, ChartMogul’s SaaS Benchmarks Report, First Round Review, NfX, the Andreessen Horowitz (a16z) blog, and major industry trade publications. These sources are less likely to be wrong than secondary aggregators, and they expose their methodology when asked. A claim with Tier 2 backing is one a partner accepts with a follow-up question rather than dismissing.
Tier 3 is named community evidence: Crunchbase, PitchBook, G2, Capterra, ProductHunt, Hacker News, IndieHackers, and Reddit communities like r/SaaS or r/startups. These tell you what builders are doing and where the discussion is, not what the underlying data says. They are weighted lower for any quantitative claim and weighted higher for any "what is the buyer doing on Tuesday" claim.
Tier 4 is unnamed or aggregator output: blog post summaries with no methodology, AI summaries that do not cite primary sources, and "industry experts say" claims. A claim that traces only to Tier 4 is rewritten or dropped.
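The "rewritten or dropped" rule can be expressed as a small triage step. The sketch below assumes each claim already carries the tiers of its supporting sources; `Tier`, `Claim`, and `triage` are hypothetical names for illustration, and how a real grader assigns tiers is out of scope here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    PRIMARY = 1    # original collector
    EXPERT = 2     # named expert publication
    COMMUNITY = 3  # named community evidence
    UNNAMED = 4    # unnamed or aggregator output


@dataclass(frozen=True)
class Claim:
    text: str
    source_tiers: tuple[Tier, ...]  # tiers of every source backing this claim


def best_tier(claim: Claim) -> Tier:
    """Strongest backing available; a claim with no sources counts as Tier 4."""
    return min(claim.source_tiers, default=Tier.UNNAMED)


def triage(claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into (kept, flagged for rewrite or removal)."""
    kept, flagged = [], []
    for claim in claims:
        if best_tier(claim) == Tier.UNNAMED:
            flagged.append(claim)  # traces only to Tier 4
        else:
            kept.append(claim)
    return kept, flagged
```

A claim backed by one Tier 4 source and one Tier 1 source is kept, because the rule targets claims that trace only to Tier 4.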
The tier mix on a typical Verdikt report is roughly 14 Tier 1, 18 Tier 2, and 10 Tier 3 sources, 42 in all. That is what a citation pack built on a 40-source floor actually looks like when you grade it. The brand promise is not "40 random links." It is "40 sources weighted by what they can defend."
Where source breadth catches hallucinations
The most common AI hallucination pattern is not a made-up statistic; it is a real statistic attached to the wrong source. The model says "according to the Census, 25 percent of SMBs use accounting software" and the actual Census tables say something close but different. Source breadth catches this in two ways. First, the same claim appears across multiple unrelated databases, which means if the AI cites the wrong one, a different stage of the pipeline catches the mismatch. Second, the citation library is generated separately from the prose so the prose-vs-source check is a deterministic step, not a vibe.
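One way to make the prose-vs-source check deterministic is to compare the figures in a claim against the passage quoted from its cited source. The sketch below is a deliberately crude version of that idea, not a description of Verdikt's implementation: `unsupported_numbers` is a hypothetical helper, and the sample passage is invented for illustration, not a real Census figure.

```python
import re

# Matches plain integers and decimals. It does not handle thousands
# separators ("1,200") or spelled-out numbers; a real check would need to.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(claim: str, quoted_passage: str) -> list[str]:
    """Numbers that appear in the claim but not in the cited passage."""
    in_source = set(NUMBER.findall(quoted_passage))
    return [n for n in NUMBER.findall(claim) if n not in in_source]


claim = "According to the Census, 25 percent of SMBs use accounting software."
# Hypothetical passage text, made up for this example.
passage = "An estimated 23 percent of small businesses use accounting software."

mismatches = unsupported_numbers(claim, passage)  # ["25"]
```

Because the comparison is string matching rather than a model's judgment, the same inputs always produce the same result, which is what separates a check from an impression.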
Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) has published research on the prevalence of legal-citation hallucinations in commercial AI tools. The same pattern shows up in market-research outputs from any general-purpose AI. The fix is not "better models." The fix is a citation library that is auditable, a tier system that weights sources by what they can support, and a pipeline that separates the claim from its citation.
What this looks like on a Verdikt report cover
A Verdikt verdict ships with the citation count on the cover (typically 35 to 50), the tier mix below it (e.g. "14 Tier 1 · 18 Tier 2 · 10 Tier 3"), and a re-run hook for the three weakest claims. The reader knows, before reading a single section, how dense the citation pack is and where the soft spots are. The soft spots are the parts to ask about. The dense spots are the parts to act on. The structure is the brief in miniature: this is what we know, this is what we measured, this is what we are not sure about. Forty sources is the consequence of wanting to make that distinction visible on the page.
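The cover summary is simple arithmetic over the graded pack, which a short sketch makes concrete. The function names are hypothetical, and the definition of "weakest" used here (worst best-tier first, then fewest supporting sources) is an assumption for illustration; the text does not say how the three weakest claims are ranked.

```python
from collections import Counter


def cover_line(tiers: list[int]) -> str:
    """Citation count plus tier mix, in the format shown on the cover."""
    counts = Counter(tiers)
    mix = " · ".join(f"{counts[t]} Tier {t}" for t in (1, 2, 3) if counts[t])
    return f"{len(tiers)} sources: {mix}"


def weakest_claims(claims: dict[str, list[int]], n: int = 3) -> list[str]:
    """Pick the n claims with the softest backing.

    `claims` maps claim text to the tiers of its supporting sources.
    Assumed ranking: worst best-tier first, then fewest sources.
    """
    def strength(item: tuple[str, list[int]]) -> tuple[int, int]:
        _, tiers = item
        best = min(tiers, default=4)  # no sources is treated like Tier 4
        return (-best, len(tiers))

    ranked = sorted(claims.items(), key=strength)
    return [text for text, _ in ranked[:n]]


pack = [1] * 14 + [2] * 18 + [3] * 10
summary = cover_line(pack)  # "42 sources: 14 Tier 1 · 18 Tier 2 · 10 Tier 3"
```

Run on the example mix from the cover, the count comes out at 42, inside the typical 35 to 50 range.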