Raw Markdown Corpus Pilot
This generated pilot indexes the marker/PyTorch markdown extraction layer without promoting every paper into curated source pages. It is the bridge between the 23,260 raw markdown documents and the public wiki source layer.
Counts
- Raw markdown documents scanned: 23,260.
- Candidate markdown files screened: 2,739.
- Machine-tagged food/heavy-metal candidates: 2,707.
- Unique candidate FM records after de-duplication: 2,596.
- Pilot records emitted: 150.
- Pilot records already promoted to curated source pages: 6.
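The counts above can be read as a narrowing funnel. The sketch below computes how much of each stage survives into the next. It assumes each count is a subset of the one before it, which this page does not state explicitly, and the stage names are illustrative labels rather than fields from the summary file.

```python
# Counts copied from the list above; stage names are illustrative labels only.
FUNNEL = [
    ("raw_markdown_documents", 23260),
    ("candidate_files_screened", 2739),
    ("machine_tagged_candidates", 2707),
    ("unique_records_after_dedup", 2596),
    ("pilot_records_emitted", 150),
    ("pilot_records_promoted", 6),
]


def retention(funnel):
    """Return (stage, count, share_of_previous_stage) for every stage after the first."""
    rows = []
    for (_, prev_count), (name, count) in zip(funnel, funnel[1:]):
        rows.append((name, count, count / prev_count))
    return rows


if __name__ == "__main__":
    for name, count, share in retention(FUNNEL):
        print(f"{name}: {count} ({share:.1%} of previous stage)")
```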
How ChatGPT Should Use This Layer
- Search or filter the corpus catalog to find potentially relevant papers.
- Treat corpus tags as machine-extracted candidates, not final claims.
- Promote only load-bearing sources into wiki/sources/.
- Put finished-product values on product pages and ingredient-only values on ingredient pages.
- Preserve metal species, units, basis, matrix, geography, method, and review state before using values for HMTc standards logic.
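The last rule in the list above can be checked mechanically before a value is used. The following is a minimal sketch, assuming each record is a dictionary. The field names (`metal_species`, `units`, `basis`, `matrix`, `geography`, `method`, `review_state`) are hypothetical stand-ins for the seven attributes named in that rule and may not match the actual catalog schema.

```python
# Hypothetical field names, one per attribute the guidance says to preserve.
REQUIRED_FIELDS = (
    "metal_species",
    "units",
    "basis",
    "matrix",
    "geography",
    "method",
    "review_state",
)


def missing_context(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]


def usable_for_standards(record):
    """True only when every required context field is present and non-empty."""
    return not missing_context(record)
```

A record that fails this check would stay a machine-extracted candidate rather than feed HMTc standards logic.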
Pilot Indexes
- By metal: Al, Cd, Cr, MeHg, Ni, Pb, Sn, U, iAs, tAs, tHg.
- By product row: baby-cereals-dry-non-rice, baby-cereals-dry-rice-based, fruit-purees, infant-formula-powder-non-soy, infant-formula-powder-soy-based, plant-milks-non-soy-non-rice, plant-milks-rice-based, plant-milks-soy-based, root-vegetable-purees.
- Promotion queue: promotion-queue.
Data Files
- data/corpus/markdown-corpus-pilot-catalog.ndjson
- data/corpus/markdown-corpus-pilot-summary.json
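One way to search the catalog is to stream it line by line, since NDJSON stores one JSON object per line. The sketch below assumes each record carries a `metals` list. That field name is a guess for illustration and should be checked against the real file.

```python
import json


def iter_records(lines):
    """Yield one parsed JSON object per non-blank line of an NDJSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_metal(records, metal):
    """Keep records whose (hypothetical) 'metals' list contains the given metal tag."""
    return [r for r in records if metal in r.get("metals", [])]


# Example usage against the catalog file (path taken from the Data Files list):
#
#     with open("data/corpus/markdown-corpus-pilot-catalog.ndjson", encoding="utf-8") as fh:
#         cadmium_candidates = filter_by_metal(iter_records(fh), "Cd")
```

Results from such a filter are still machine-tagged candidates and need review before promotion.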