P4 Batch 3 — Ingest Report
Date: 2026-05-12
Handles processed: 200 (food-matrix-filtered P4 tier, year-descending; first use of food-matrix filter)
Net new source pages: 2
Duplicates created and removed: 8
False positives: ~190
Summary
Batch 3 was the first batch using the food-matrix-filtered P4 handle list (2,506 → 2,492 handles after excluding already-ingested handles). Yield improved modestly over unfiltered batches but remains low because:
- Handle overlap with earlier batches: Group 1 received 6 handles already ingested in P4 batch 1 (reksten2020, reksten2021, uzomah2021, adelusi2024, taylor2025, porwollik2026). All 6 created duplicate pages that were identified and removed post-batch. Root cause: the manifest’s already_ingested flag is populated for only 4 handles; this field is not maintained. Fix applied: future batches will cross-reference wiki/sources/*.md frontmatter at list-generation time to exclude already-ingested handles.
- Large filesystem gap: Approximately 120 of the 200 handles in this batch are absent from raw/markdown/. The FM_12022213–FM_12048xxx range is largely missing; these handles correspond to the raw 2/ directory of PDFs that have not yet been Marker-converted. This is a structural gap, not a sorting issue. Karen should run Marker conversion on raw 2/ before these handles can be processed.
- OCR year artifacts persist at list top: Handles labeled 2027–2029 by the manifest are OCR artifacts and are mostly false positives (cocaine-induced PRES, hafnium oxide materials science, spinach hydroponic stress, etc.). These 5 handles are included but yield 0 pages.
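The OCR year artifacts could be screened out before a batch is assembled, since a publication year later than the report date cannot be genuine. The following is a minimal sketch under assumptions: the `(handle, year)` pair shape, the function name, and the sample handles are all hypothetical and may not match the real manifest.

```python
from datetime import date


def split_by_plausible_year(entries, report_date):
    """Separate manifest entries whose year lies after the report date.

    `entries` is a list of (handle, year) pairs. This shape is assumed
    for illustration; the real manifest schema is not shown in the report.
    """
    plausible, suspect = [], []
    for handle, year in entries:
        # A publication year later than the report year cannot be real,
        # so treat the entry as a likely OCR artifact.
        if year > report_date.year:
            suspect.append(handle)
        else:
            plausible.append(handle)
    return plausible, suspect


# Placeholder handles, not real manifest rows.
entries = [("FM_A", 2029), ("FM_B", 2027), ("FM_C", 2026), ("FM_D", 2025)]
plausible, suspect = split_by_plausible_year(entries, date(2026, 5, 12))
print("suspect:", suspect)
print("plausible:", plausible)
```

Flagged handles would still need a manual look, because the text says these entries are mostly, not entirely, false positives.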
Net New Source Pages
lee2025-knhanes-mercury-cadmium-arsenic-obesity (FM_11785309)
PLoS ONE 2025. KNHANES 2008–2012 biomonitoring cross-section, n=6,609 Korean adults. Mean serum Hg 3.8 µg/L, serum Cd 1.0 µg/L, urinary As 111.7 µg/L. High Hg (≥5.1 µg/L) associated with overweight/obesity (OR=1.57); combined high-Hg+high-sodium OR=3.61. Urinary As is total, unspeciated arsenic and is not an iAs data point. Evidence tier: A. Metals: tHg, Cd, tAs. Matrices: biomonitoring, dietary exposure. Jurisdictions: KR.
jermilova2025-mackenzie-mercury-fish-bayesian (near FM_11844342; exact handle as processed by group 3)
Integrated Environmental Assessment and Management 2025. Bayesian network risk models for Hg in the Mackenzie River Basin (Canada), 2005–2020. Five freshwater species: lake whitefish mean 0.332 µg/g tHg, northern pike mean 0.938 µg/g tHg. 17–30% of commercial catch predicted to exceed the 0.5 µg/g Canadian commercial sale guideline depending on region and species. Health Canada pTWI thresholds: 3.3 µg/kg bw/week (adult males), 1.4 (women of childbearing age), 0.7 (children). Evidence tier: A. Metals: tHg. Matrices: freshwater-fish. Jurisdictions: CA.
Duplicates Found and Removed (8)
All eight were created by group 1 or group 2 agents processing handles already ingested in prior batches:
| Removed (duplicate) | Kept (original) | Handle | Prior batch |
|---|---|---|---|
| reksten2020-angola-fish | reksten2020-angola-fish-metals | FM_7278876 | P4 batch 1 |
| reksten2021-bay-bengal-fish | reksten2021-bay-bengal-fish-metals | FM_8160839 | P4 batch 1 |
| uzomah2021-nigeria-fish-review | uzomah2021-nigeria-fish-contaminants | FM_8465269 | P4 batch 1 |
| adelusi2024-dairy-cattle-feeds-sa | adelusi2024-dairy-feed-south-africa | FM_11167146 | P4 batch 1 |
| taylor2025-seafood-review | taylor2025-seafood-benefits-contaminants | FM_12071223 | P4 batch 1 |
| porwollik2026-rhodiola-supplements | porwollik2026-rhodiola-supplements-us-market | FM_12810810 | P4 batch 1 |
| guimaraes2025-tapajós-mercury-fish-systematic-review | auzier-guimaraes2025-mercury-tapajos-fish | FM_11741060 | P4 batch 2 |
| naz2025-punjnad-fish-trace-elements-pakistan | naz2025-trace-elements-punjnad-fish | FM_11761173 | P4 batch 2 |
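Duplicates like those in the table share a single FM handle across two page slugs, so a post-batch check can group pages by handle. This is a sketch only: the `page_handles` mapping (slug to handle) is an assumed input, and how it would be built from the source pages is not specified in this report.

```python
from collections import defaultdict


def find_duplicate_pages(page_handles):
    """Return {handle: [slugs]} for handles that back more than one page.

    `page_handles` maps page slug -> FM handle (assumed input shape).
    """
    by_handle = defaultdict(list)
    for slug, handle in page_handles.items():
        by_handle[handle].append(slug)
    # Keep only handles that appear on two or more pages.
    return {h: sorted(slugs) for h, slugs in by_handle.items() if len(slugs) > 1}


# State before dedup, using one row from the table plus one unique page.
pages = {
    "reksten2020-angola-fish": "FM_7278876",
    "reksten2020-angola-fish-metals": "FM_7278876",
    "lee2025-knhanes-mercury-cadmium-arsenic-obesity": "FM_11785309",
}
dupes = find_duplicate_pages(pages)
print(dupes)
```

The check reports which slugs collide but does not decide which page to keep; in this batch the earlier page was kept in every case.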
Filesystem Gap Finding
The FM_12022213–FM_12048xxx range is almost entirely absent from raw/markdown/. Group 2 reported 32 missing handles, group 3 reported 37, and group 4 reported 49. These handles appear in the triage manifest, but the corresponding PDF-to-markdown conversion has not been completed. This range likely corresponds to the raw 2/ directory (an untracked directory containing ~175 handles’ worth of PDFs). Marker conversion is needed before these handles can be processed.
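A pre-flight check could surface this gap before handles are assigned to groups. The sketch below assumes one converted file per handle, named `<handle>.md`; the actual naming scheme under raw/markdown/ is not stated here, and the function name and handles are placeholders.

```python
import tempfile
from pathlib import Path


def find_missing_handles(handles, markdown_dir):
    """Return handles with no converted markdown file in `markdown_dir`.

    Assumes files are named `<handle>.md`; adjust to the real layout.
    """
    root = Path(markdown_dir)
    return [h for h in handles if not (root / f"{h}.md").is_file()]


# Demo against a throwaway directory with placeholder handles.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "FM_PRESENT.md").write_text("converted text", encoding="utf-8")
    missing = find_missing_handles(["FM_PRESENT", "FM_ABSENT"], tmp)

print(missing)
```

Running such a check at list-generation time would let missing handles be deferred until conversion is done instead of being counted as false positives.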
Process Fix Applied
For batch 4 and all subsequent batches, handle lists are generated by cross-referencing against wiki/sources/*.md frontmatter to exclude already-ingested FM handles. The updated food-matrix-filtered list excludes all 141 FM handles already present in source pages, yielding 2,492 processable handles.
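The cross-reference step might look roughly like the sketch below. It is not the script actually used: the `handle:` frontmatter key in the demo is an assumption, so the code matches any `FM_<digits>` token inside the leading `---` block, and all function names are hypothetical.

```python
import re
from pathlib import Path

HANDLE_RE = re.compile(r"\bFM_\d+\b")


def frontmatter_handles(text):
    """Return FM handles found in the leading '---' frontmatter block."""
    if not text.startswith("---"):
        return set()
    parts = text.split("---", 2)  # ['', frontmatter, body]
    if len(parts) < 3:
        return set()  # opening delimiter without a closing one
    return set(HANDLE_RE.findall(parts[1]))


def ingested_handles(sources_dir):
    """Collect handles from every *.md page in `sources_dir`."""
    handles = set()
    for page in Path(sources_dir).glob("*.md"):
        handles |= frontmatter_handles(page.read_text(encoding="utf-8"))
    return handles


def exclude_ingested(candidates, ingested):
    """Keep candidate handles, in order, that are not already ingested."""
    return [h for h in candidates if h not in ingested]


# Demo on an in-memory page; the frontmatter key name is assumed.
page = "---\nhandle: FM_7278876\n---\nBody mentions FM_999 but is ignored.\n"
done = frontmatter_handles(page)
remaining = exclude_ingested(["FM_7278876", "FM_0000001"], done)
print(remaining)
```

Reading the pages directly avoids relying on the manifest's already_ingested flag, which the summary notes is not maintained.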
Batch Commits
- 4bc2370 (group 1): 6 pages (all duplicates), 24 FP (20 missing handles)
- 467af7d (group 2): 3 pages (2 duplicates, 1 unique), 47 FP (32 missing handles)
- 2961cee (group 3): 1 page (unique), 49 FP (37 missing handles)
- a5c59dc (group 4): 0 pages, 50 FP (49 missing handles)
- b0f167a (dedup): 8 duplicate source pages removed
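The per-group page counts in these commits can be tallied against the header figures. The short check below uses only numbers stated in this report; the dictionary layout is illustrative.

```python
# Pages created and duplicates among them, per group, as listed in the commits.
groups = {
    1: {"pages": 6, "duplicates": 6},
    2: {"pages": 3, "duplicates": 2},
    3: {"pages": 1, "duplicates": 0},
    4: {"pages": 0, "duplicates": 0},
}

pages_created = sum(g["pages"] for g in groups.values())
duplicates = sum(g["duplicates"] for g in groups.values())
net_new = pages_created - duplicates

print(pages_created, duplicates, net_new)  # 10 8 2
```

The totals match the header: 8 duplicates removed and 2 net new source pages.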