Raw Markdown Corpus Pilot

This generated pilot indexes the marker/PyTorch markdown extraction layer without promoting every paper into curated source pages. It is the bridge between the 23,260 raw markdown documents and the public wiki source layer.

Counts

  • Raw markdown documents scanned: 23,260.
  • Candidate markdown files screened: 2,739.
  • Machine-tagged food/heavy-metal candidates: 2,707.
  • Unique candidate FM records after de-duplication: 2,596.
  • Pilot records emitted: 150.
  • Pilot records already promoted to curated source pages: 6.
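
To make the funnel above easier to read, the stage-to-stage ratios can be worked out with a short sketch. It assumes each count is a subset of the count before it, which the labels suggest but this page does not state outright. The dictionary keys and the `share` helper are illustrative names, not part of the pilot's data files.

```python
# Counts copied from the list above; the key names are illustrative only.
counts = {
    "raw_scanned": 23260,
    "candidates_screened": 2739,
    "machine_tagged": 2707,
    "unique_after_dedup": 2596,
    "pilot_emitted": 150,
    "pilot_promoted": 6,
}


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Assumption: every stage is drawn from the stage before it.
funnel = {
    "screened_of_raw_pct": share(counts["candidates_screened"], counts["raw_scanned"]),
    "tagged_of_screened_pct": share(counts["machine_tagged"], counts["candidates_screened"]),
    "unique_of_tagged_pct": share(counts["unique_after_dedup"], counts["machine_tagged"]),
    "pilot_of_unique_pct": share(counts["pilot_emitted"], counts["unique_after_dedup"]),
    "promoted_of_pilot_pct": share(counts["pilot_promoted"], counts["pilot_emitted"]),
}

# Records dropped by de-duplication.
duplicates_removed = counts["machine_tagged"] - counts["unique_after_dedup"]

for name, value in funnel.items():
    print(f"{name}: {value}")
print(f"duplicates_removed: {duplicates_removed}")
```

Under that assumption, roughly one in eight scanned documents reaches screening, and only a small fraction of the pilot records has been promoted so far.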

How ChatGPT Should Use This Layer

  1. Search or filter the corpus catalog to find potentially relevant papers.
  2. Treat corpus tags as machine-extracted candidates, not final claims.
  3. Promote only load-bearing sources into wiki/sources/.
  4. Put finished-product values on product pages and ingredient-only values on ingredient pages.
  5. Preserve metal species, units, basis, matrix, geography, method, and review state before using values for HMTc standards logic.
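
The steps above can be sketched as a small filter over NDJSON catalog records. This is a minimal illustration, not the pilot's actual tooling: the catalog schema is not documented on this page, so every field name (`id`, `tags`, `metal_species`, `units`, `basis`, `matrix`, `geography`, `method`, `review_state`) is an assumption, and the two sample records are made-up placeholders rather than corpus data.

```python
import io
import json

# Hypothetical field names for the context that step 5 asks to preserve.
REQUIRED_CONTEXT = (
    "metal_species", "units", "basis", "matrix",
    "geography", "method", "review_state",
)


def read_ndjson(handle):
    """Yield one parsed record per non-empty line of an NDJSON stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def find_candidates(records, tag):
    """Step 1: keep records whose machine-extracted tags include `tag`."""
    return [r for r in records if tag in r.get("tags", [])]


def missing_context(record):
    """Step 5: list the required context fields that are absent or empty."""
    return [k for k in REQUIRED_CONTEXT if record.get(k) in (None, "", [])]


# Placeholder records for illustration only; not taken from the corpus.
sample = io.StringIO(
    '{"id": "demo-1", "tags": ["lead"], "metal_species": "Pb", '
    '"units": "mg/kg", "basis": "dry weight", "matrix": "rice", '
    '"geography": "placeholder", "method": "ICP-MS", '
    '"review_state": "machine-tagged"}\n'
    '{"id": "demo-2", "tags": ["lead"], "metal_species": "Pb", '
    '"units": "mg/kg"}\n'
)

records = list(read_ndjson(sample))
candidates = find_candidates(records, "lead")

# Tags are candidates, not claims (step 2): a record is only considered
# for promotion once every required context field is present.
ready = [r["id"] for r in candidates if not missing_context(r)]
blocked = {r["id"]: missing_context(r) for r in candidates if missing_context(r)}

print("ready for review:", ready)
print("blocked:", blocked)
```

A record that passes this check is still only a candidate for promotion; deciding whether a source is load-bearing remains a manual review step.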

Pilot Indexes

Data Files

  • data/corpus/markdown-corpus-pilot-catalog.ndjson
  • data/corpus/markdown-corpus-pilot-summary.json
