Search strategy and database coverage

This page is the Cochrane-equivalent search-protocol publication for the Heavy Metal Index. It documents which academic and grey-literature databases are queried, how candidate papers are deduplicated against the existing corpus, the scoring rubric that determines whether a paper is auto-fetched for ingest, and the inclusion / exclusion criteria. A reader auditing the defensibility of the Index (a regulator, brand-legal expert, journalist, or competing standards body) should be able to land here and replicate the search.

The page describes the working pipeline as of the date in the frontmatter. The history of the pipeline is in git, which is the authoritative version record. Substantive changes are noted in log.md.

Databases queried

The Index runs a continuous, multi-database literature search via the paper-lookup tool, which wraps the public REST APIs of ten complementary databases. Coverage is intentionally redundant: a paper that is missed by one source is usually caught by another, and the cross-database dedupe step (below) is the protection against double-counting.

| Database | Scope | Role in the Index | Access |
|---|---|---|---|
| PubMed | Biomedical and life-sciences literature, MEDLINE-indexed | Primary source for toxicology, epidemiology, exposure assessment | NCBI E-utilities API |
| PMC (PubMed Central) | Open-access biomedical full text | Used for full-text extraction on a paper-by-paper basis when PMC carries the version of record | NCBI E-utilities API |
| OpenAlex | General scholarly works including books, datasets, software | Primary discovery surface for non-MEDLINE journals (Food Chemistry, Science of the Total Environment, EFSA Journal, etc.) | OpenAlex API |
| Crossref | DOI registration metadata, ~140M records | Required: DOI dedupe and canonical metadata recovery | Crossref REST API (polite pool) |
| Semantic Scholar | Cross-disciplinary citation graph, ~200M papers | Secondary discovery; citation-graph queries for forward and backward citation chains on high-priority papers | Semantic Scholar Academic Graph API |
| Unpaywall | Open-access status lookup keyed by DOI | Required: open-access URL discovery for auto-fetch eligibility | Unpaywall API |
| bioRxiv | Biology preprints | Used when a topic involves emerging methods or recent surveys not yet through peer review | bioRxiv API |
| medRxiv | Health-sciences preprints | Same role as bioRxiv for clinical / epidemiology topics | medRxiv API |
| arXiv | Physics, math, computer science, and quantitative biology preprints | Rarely used for the Index’s food-and-supply-chain scope; reserved for methods papers (ICP-MS, XRF, sample-prep) | arXiv API |
| CORE | Aggregator of open-access repositories | Reserved for topics where the OA URL is needed but Unpaywall does not resolve | CORE API |

The polite-pool identifier for Crossref and Unpaywall is the operator email (karen@paleofoundation.com). The user-agent string sent on PDF fetches is HeavyMetalIndex/0.1 (https://heavymetalindex.com; karen@paleofoundation.com).
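As a rough illustration of how these identifiers could be attached to outgoing requests, the sketch below builds a Crossref works URL with a mailto parameter (which Crossref uses for polite-pool routing) and an Unpaywall v2 URL with its email parameter. The helper names and the placeholder DOI are assumptions made for this example; they are not taken from the paper-lookup tool.

```javascript
// Illustrative only: helper names are hypothetical, not paper-lookup's own code.
const OPERATOR_EMAIL = "karen@paleofoundation.com";
const USER_AGENT =
  "HeavyMetalIndex/0.1 (https://heavymetalindex.com; karen@paleofoundation.com)";

// Crossref routes requests that carry a `mailto` parameter to its polite pool.
// The DOI is placed in the path with its slash intact.
function crossrefWorkUrl(doi) {
  const url = new URL(`https://api.crossref.org/works/${encodeURI(doi)}`);
  url.searchParams.set("mailto", OPERATOR_EMAIL);
  return url.toString();
}

// Unpaywall's v2 endpoint takes the caller's address in an `email` parameter.
function unpaywallUrl(doi) {
  const url = new URL(`https://api.unpaywall.org/v2/${encodeURI(doi)}`);
  url.searchParams.set("email", OPERATOR_EMAIL);
  return url.toString();
}

// Headers for a PDF fetch, e.g. fetch(pdfUrl, { headers: pdfFetchHeaders() }).
function pdfFetchHeaders() {
  return { "User-Agent": USER_AGENT };
}
```

Calling crossrefWorkUrl("10.1000/xyz123") with that placeholder DOI yields a URL on api.crossref.org whose mailto parameter decodes to the operator email.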

Search topics

Search topics are derived in three ways.

Gap-driven (autonomous). A nightly daemon scans the wiki for ingredient × metal cells with status: pending or fewer than 5 contributing sources, derives a topic from the cell (ingredient name + metal + matrix + year window), and runs the search. The daemon-driven flow is OpenAlex-only and lives in tools/wishlist/. Audit logs are written to data/evidence/autonomy/wishlist-*.csv.
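The gap-selection rule can be sketched roughly as follows. This is a minimal illustration rather than the code in tools/wishlist/: the cell field names (id, ingredient, metal, matrix, status, sourceCount) and the 10-year default window are assumptions made for the example.

```javascript
// Hypothetical cell shape; the real wiki frontmatter fields may differ.
const MIN_SOURCES = 5;

// A cell qualifies when it is pending or has fewer than 5 contributing sources.
function needsSearch(cell) {
  return cell.status === "pending" || cell.sourceCount < MIN_SOURCES;
}

// Derive a topic from ingredient name + metal + matrix + year window.
// The 10-year window is an assumed default, not a documented value.
function deriveTopic(cell, now = new Date(), windowYears = 10) {
  const toYear = now.getUTCFullYear();
  return {
    cellId: cell.id,
    query: [cell.ingredient, cell.metal, cell.matrix].filter(Boolean).join(" "),
    fromYear: toYear - windowYears,
    toYear,
  };
}

// One topic per qualifying cell, in the order the cells were scanned.
function buildNightlyTopics(cells, now = new Date()) {
  return cells.filter(needsSearch).map((cell) => deriveTopic(cell, now));
}
```

Passing the clock in as a parameter keeps the derivation repeatable, so a given set of cells always produces the same topics for a given date.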

Topic-driven (operator-triggered). The operator can run /discover topic="<free text>" to fan out across all 10 databases for an explicit topic. The skill is defined in .claude/skills/discover/SKILL.md and audit-logs every run to data/evidence/discovery/discover-<date>.csv.

Cell-driven (cross-cut). The operator can run /discover gap=<cell-id> to target a specific ingredient × metal cell listed in data/evidence/cells-needing-synthesis.csv. The skill reads the cell, synthesizes a topic, and runs the multi-database search with the same downstream pipeline.

Dedupe protocol

Every candidate paper from every database is checked against the existing corpus before being scored or fetched. The dedupe index is rebuilt from wiki/sources/ and data/evidence/autonomy/manual-fetch-tracker.csv at the start of every discovery run (tools/discovery/build-corpus-index.mjs).

The dedupe rules, applied in order:

  1. DOI exact match (case-insensitive). If a candidate’s DOI matches an existing source page, the candidate is flagged in_corpus and not scored.
  2. Normalised title match (lowercase, alphanumeric-only, whitespace-collapsed). Catches DOI-less preprints and version-of-record duplicates that share a title.
  3. Cite-key match against the manual-fetch tracker. Catches papers already queued for ingest but not yet promoted to source pages. Flagged queued_for_ingest.

Only candidates that pass all three checks proceed to scoring. The audit log records the dedupe status of every candidate so a reviewer can verify that no duplicate was promoted.
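The three rules can be sketched as an ordered check. This is an illustration under stated assumptions, not the code in tools/discovery/build-corpus-index.mjs: the index shape, the function names, and the "novel" status label are hypothetical. The text above does not name the flag for a title match, so the sketch assumes in_corpus. It also assumes non-alphanumeric runs become a single space before comparison.

```javascript
// Rule 2 normalisation: lowercase, alphanumeric-only, whitespace-collapsed.
// Assumption: punctuation is replaced by a space rather than removed outright.
function normaliseTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, " ")
    .trim();
}

// Hypothetical index shape: three sets built from the on-disk corpus state.
function buildIndex({ dois = [], titles = [], queuedCiteKeys = [] }) {
  return {
    dois: new Set(dois.map((d) => d.toLowerCase())),
    titles: new Set(titles.map(normaliseTitle)),
    queuedCiteKeys: new Set(queuedCiteKeys),
  };
}

// Apply the rules in order; the first match decides the status.
function dedupeStatus(candidate, index) {
  // 1. DOI exact match, case-insensitive.
  if (candidate.doi && index.dois.has(candidate.doi.toLowerCase())) {
    return "in_corpus";
  }
  // 2. Normalised title match (flag name assumed, see lead-in).
  if (candidate.title && index.titles.has(normaliseTitle(candidate.title))) {
    return "in_corpus";
  }
  // 3. Cite-key match against the manual-fetch tracker.
  if (candidate.citeKey && index.queuedCiteKeys.has(candidate.citeKey)) {
    return "queued_for_ingest";
  }
  return "novel"; // hypothetical label for candidates that proceed to scoring
}
```

Keeping the status as a single return value makes it straightforward to write one dedupe column per candidate into the audit log.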

Scoring rubric for auto-fetch

For each novel candidate, the discovery pipeline computes a recommend_score:

| Signal | Weight |
|---|---|
| Title contains both a metal term and a food / matrix term | +3 |
| Abstract contains both a metal term and a food / matrix term (and title scored 0) | +2 |
| Open-access URL available (via Unpaywall or CORE) | +1 |
| Publication year ≥ 2020 | +1 |
| Journal in the A-tier list (Food Chemistry, EHP, JAFC, STOTEN, FAC, Nature Food, Lancet, NEJM, JAMA, BMJ, EFSA Journal) | +1 |
| Title or abstract contains a negative keyword (biosensor, perovskite, transgenic, AMR, case report, etc.) | −3 |

Default auto-fetch threshold: score ≥ 4 AND open-access URL available. Candidates below threshold or without an OA URL are logged with their score and reason but are not fetched.
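The rubric and threshold can be expressed as a small scoring function. The sketch below is a hedged illustration, not the code in tools/wishlist/find-candidates.mjs. The metal and matrix term lists are short hypothetical samples, the candidate field names are assumed, and whole-word, case-insensitive matching is an assumption about how terms are detected.

```javascript
// Illustrative term lists; the versioned lists in the discovery tools are longer.
const METAL_TERMS = ["lead", "cadmium", "arsenic", "mercury"];
const MATRIX_TERMS = ["rice", "cocoa", "infant formula", "spinach"]; // hypothetical
const NEGATIVE_KEYWORDS = ["biosensor", "perovskite", "transgenic", "AMR", "case report"];
const A_TIER_JOURNALS = [
  "Food Chemistry", "EHP", "JAFC", "STOTEN", "FAC", "Nature Food",
  "Lancet", "NEJM", "JAMA", "BMJ", "EFSA Journal",
].map((j) => j.toLowerCase());

function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Whole-word, case-insensitive match, so "misleading" does not count as "lead".
function hasAny(text, terms) {
  return terms.some((t) => new RegExp(`\\b${escapeRegex(t)}\\b`, "i").test(text));
}

function recommendScore(c) {
  const title = c.title || "";
  const abstract = c.abstract || "";
  const onTopic = (text) => hasAny(text, METAL_TERMS) && hasAny(text, MATRIX_TERMS);

  let score = 0;
  if (onTopic(title)) score += 3;
  else if (onTopic(abstract)) score += 2; // only when the title scored 0
  if (c.oaUrl) score += 1;
  if (c.year >= 2020) score += 1;
  if (A_TIER_JOURNALS.includes((c.journal || "").toLowerCase())) score += 1;
  if (hasAny(`${title} ${abstract}`, NEGATIVE_KEYWORDS)) score -= 3;
  return score;
}

// Default threshold: score >= 4 AND an open-access URL is available.
function shouldAutoFetch(c) {
  return recommendScore(c) >= 4 && Boolean(c.oaUrl);
}
```

Under these assumptions, a 2022 Food Chemistry paper titled "Cadmium in rice" with an open-access URL scores 3 + 1 + 1 + 1 = 6 and is fetched. The same paper with "biosensor" in the title drops to 3 and is only logged.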

The A-tier journal list and the negative-keyword list are versioned with the discovery tools at tools/wishlist/find-candidates.mjs and .claude/skills/discover/SKILL.md. Changes to either are tracked in git.

Inclusion and exclusion criteria

Included in the wiki corpus:

  • Peer-reviewed primary research on the occurrence, exposure, toxicology, mitigation, or regulation of lead, cadmium, inorganic and total arsenic, methylmercury and total mercury, nickel, aluminium, chromium, hexavalent chromium, tin, antimony, or uranium in food, ingredients, supply-chain inputs (soil, water, fertilisers, equipment, packaging), or directly comparable matrices.
  • Government and intergovernmental scientific opinions, risk assessments, regulatory guidance, action levels, total-diet surveys, and Total Diet Study programmes from FDA, EPA, EFSA, JECFA, WHO, Codex Alimentarius, ATSDR, OEHHA, FSANZ, China NHC, Mexico SSA, and equivalent national bodies.
  • Systematic reviews, meta-analyses, and review chapters on the included scope.
  • Industry, NGO, and consumer-organisation testing reports (Consumer Reports, Healthy Babies Bright Futures, Environmental Working Group, and similar) at category and ingredient level only; brand-level rankings are not reproduced (see HMTc firewall).

Excluded from the wiki corpus:

  • Clinical metallomics in medicine outside the dietary exposure pathway.
  • Occupational exposure in industrial settings.
  • Environmental exposure outside the food system (a substantial portion of which lives at WikiBiome).
  • Brand-by-brand contamination tables, certificates of analysis, and internal lab data (these belong in the private brand-intelligence build, not the public Index).
  • Single-case clinical reports without aggregate occurrence or exposure data.
  • News commentary used as primary evidence (admitted only as a lead to verify against a primary source).

Quality assessment (evidence tiers)

Every source page is assigned an evidence_tier (A / B / C) in its frontmatter. Synthesis claims rely primarily on A-tier sources; B-tier and C-tier sources are admitted with explicit attribution.

  • A-tier. Peer-reviewed primary research and systematic reviews; government and intergovernmental scientific opinions and risk assessments; total-diet surveys and recurring monitoring programmes.
  • B-tier. Industry white papers; NGO and consumer-organisation reports with disclosed methodology; reputable trade publications.
  • C-tier. News coverage, blog material, conference abstracts without full text, and other sources used as leads rather than evidence in their own right.

The fine-grained Evidence Fitness layer (EF-1 through EF-X) used inside the structured evidence register is documented separately at methodology.md § Evidence Fitness.

Reproducibility

A reader who wants to reproduce a specific discovery run can:

  1. Read the audit log at data/evidence/discovery/discover-<date>.csv for the run’s full per-candidate decision trail.
  2. Re-run the same topic via the /discover skill (.claude/skills/discover/SKILL.md). The dedupe index and scoring rubric are deterministic given the same corpus state.
  3. For the daemon-driven gap-fill flow, the equivalent logs are at data/evidence/autonomy/wishlist-*.csv and the gap-driven query construction is documented in tools/wishlist/build-targeted-queries.mjs.

The complete corpus dedupe index is rebuilt from on-disk state on every run (data/evidence/discovery/corpus-index.json) so no historical version is required to verify a current dedupe decision.

What this page does not do

This page documents the search strategy. It does not document:

  • The synthesis workflow (how source pages become ingredient and product page claims). See CLAUDE.md § Part 9.
  • The routing layer (how source frontmatter fans out to product, ingredient, metal, and regulation destination pages). See CLAUDE.md § Part 5b and the live audit at data/evidence/product_source_routing_audit.csv.
  • The HMTc standards-setting methodology (how literature percentiles are pooled and certification thresholds chosen). That work is published separately in the Heavy Metal Tested & Certified program manual; the wiki / HMTc firewall is documented at editorial-standards.

For the full corpus coverage and gap surface, see coverage.