P4 Batch 3 — Ingest Report

Date: 2026-05-12
Handles processed: 200 (food-matrix-filtered P4 tier, year-descending; first use of food-matrix filter)
Net new source pages: 2
Duplicates created and removed: 8
False positives: ~190

Summary

Batch 3 was the first batch using the food-matrix-filtered P4 handle list (2,506 → 2,492 handles after excluding already-ingested handles). Yield improved modestly over unfiltered batches but remains low because:

  1. Handle overlap with earlier batches: Group 1 received 6 handles already ingested in P4 batch 1 (reksten2020, reksten2021, uzomah2021, adelusi2024, taylor2025, porwollik2026), and group 2 received 2 handles already ingested in P4 batch 2. All 8 created duplicate pages that were identified and removed post-batch. Root cause: the manifest’s already_ingested flag is populated for only 4 handles because the field is not maintained. Fix applied: future batches will cross-reference wiki/sources/*.md frontmatter at list-generation time to exclude already-ingested handles.

  2. Large filesystem gap: Approximately 120 of the 200 handles in this batch are absent from raw/markdown/. The FM_12022213–FM_12048xxx range is largely missing — these handles correspond to the raw 2/ directory of PDFs that have not yet been Marker-converted. This is a structural gap, not a sorting issue. Karen should run Marker conversion on raw 2/ before these handles can be processed.

  3. OCR year artifacts persist at list top: Handles labeled 2027–2029 by the manifest are OCR artifacts and are mostly false positives (cocaine-induced PRES, hafnium oxide materials science, spinach hydroponic stress, etc.). These 5 handles are included but yield 0 pages.
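The OCR year artifacts in item 3 could be screened out before a batch is assigned. The following is a minimal sketch, assuming the manifest exposes a numeric year per handle; the function name and the sample rows are hypothetical placeholders, not real manifest entries. It treats any year later than the report date's year as suspect, so genuine 2026 handles such as porwollik2026 are kept.

```python
from datetime import date


def split_by_plausible_year(entries, as_of):
    """Split (handle, year) pairs into plausible and suspect lists.

    A year later than as_of.year cannot be a real publication year yet,
    so it is treated as a likely OCR artifact.
    """
    plausible, suspect = [], []
    for handle, year in entries:
        if year > as_of.year:
            suspect.append((handle, year))
        else:
            plausible.append((handle, year))
    return plausible, suspect


# Hypothetical manifest rows; the handles below are placeholders.
rows = [
    ("FM_0000001", 2029),
    ("FM_0000002", 2027),
    ("FM_0000003", 2025),
    ("FM_0000004", 2026),
]
plausible, suspect = split_by_plausible_year(rows, as_of=date(2026, 5, 12))
print(suspect)  # [('FM_0000001', 2029), ('FM_0000002', 2027)]
```

Suspect handles could be routed to a manual check rather than dropped outright, since an OCR error in the year does not by itself prove the paper is off-topic.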

Net New Source Pages

lee2025-knhanes-mercury-cadmium-arsenic-obesity (FM_11785309)
PLoS ONE 2025. KNHANES 2008–2012 biomonitoring cross-section, n=6,609 Korean adults. Mean serum Hg 3.8 µg/L, serum Cd 1.0 µg/L, urinary As 111.7 µg/L. High Hg (≥5.1 µg/L) associated with overweight/obesity (OR=1.57); combined high-Hg+high-sodium OR=3.61. Urinary As is total, unspeciated arsenic, so it is not an iAs data point.
Evidence tier: A. Metals: tHg, Cd, tAs. Matrices: biomonitoring, dietary exposure. Jurisdictions: KR.

jermilova2025-mackenzie-mercury-fish-bayesian (FM_11844342 area; actual handle from group 3)
Integrated Environmental Assessment and Management 2025. Bayesian network risk models for Hg in the Mackenzie River Basin (Canada), 2005–2020. Five freshwater species: lake whitefish mean 0.332 µg/g tHg, northern pike mean 0.938 µg/g tHg. Depending on region and species, 17–30% of commercial catch is predicted to exceed the 0.5 µg/g Canadian commercial sale guideline. Health Canada pTWI thresholds: 3.3 µg/kg bw/week (adult males), 1.4 (women of childbearing age), 0.7 (children).
Evidence tier: A. Metals: tHg. Matrices: freshwater-fish. Jurisdictions: CA.

Duplicates Found and Removed (8)

All eight were created by group 1 or group 2 agents processing handles already ingested in prior batches:

| Removed (duplicate) | Kept (original) | Handle | Prior batch |
| --- | --- | --- | --- |
| reksten2020-angola-fish | reksten2020-angola-fish-metals | FM_7278876 | P4 batch 1 |
| reksten2021-bay-bengal-fish | reksten2021-bay-bengal-fish-metals | FM_8160839 | P4 batch 1 |
| uzomah2021-nigeria-fish-review | uzomah2021-nigeria-fish-contaminants | FM_8465269 | P4 batch 1 |
| adelusi2024-dairy-cattle-feeds-sa | adelusi2024-dairy-feed-south-africa | FM_11167146 | P4 batch 1 |
| taylor2025-seafood-review | taylor2025-seafood-benefits-contaminants | FM_12071223 | P4 batch 1 |
| porwollik2026-rhodiola-supplements | porwollik2026-rhodiola-supplements-us-market | FM_12810810 | P4 batch 1 |
| guimaraes2025-tapajós-mercury-fish-systematic-review | auzier-guimaraes2025-mercury-tapajos-fish | FM_11741060 | P4 batch 2 |
| naz2025-punjnad-fish-trace-elements-pakistan | naz2025-trace-elements-punjnad-fish | FM_11761173 | P4 batch 2 |

Filesystem Gap Finding

The FM_12022213–FM_12048xxx range is almost entirely absent from raw/markdown/. Group 2 reported 32 missing handles, group 3 reported 37, and group 4 reported 49. These handles appear in the triage manifest, but the corresponding PDF-to-markdown conversion has not been completed. This range likely corresponds to the raw 2/ directory (an untracked directory containing ~175 handles’ worth of PDFs). Marker conversion is needed before these handles can be processed.
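A pre-flight check could surface this gap before handles are assigned to groups. The sketch below assumes that each converted file under raw/markdown/ carries its FM handle somewhere in the file name; that naming convention is an assumption and is not confirmed by this report. The function name is hypothetical.

```python
import re
from pathlib import Path

HANDLE_RE = re.compile(r"FM_\d+")


def find_missing_handles(handles, markdown_dir):
    """Return the handles that have no matching file under markdown_dir.

    Assumption: a converted file is recognised by an FM handle appearing
    in its file name (for example FM_7278876.md).
    """
    present = set()
    for path in Path(markdown_dir).rglob("*"):
        if path.is_file():
            present.update(HANDLE_RE.findall(path.name))
    # Preserve the order of the input list so the report matches the batch order.
    return [h for h in handles if h not in present]
```

Calling `find_missing_handles(batch_handles, "raw/markdown")` before a batch starts would return the handles still waiting on Marker conversion, which could then be held back instead of being counted as false positives.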

Process Fix Applied

For batch 4 and all subsequent batches, handle lists are generated by cross-referencing against wiki/sources/*.md frontmatter to exclude already-ingested FM handles. The updated food-matrix-filtered list excludes all 141 FM handles already present in source pages, yielding 2,492 processable handles.
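The cross-reference step might look like the sketch below. It assumes each source page opens with a frontmatter block delimited by `---` lines and that the FM handle appears somewhere inside that block; the frontmatter key name is not given in this report, so the code matches the `FM_<digits>` pattern rather than a specific key. All function names are hypothetical.

```python
import re
from pathlib import Path

HANDLE_RE = re.compile(r"FM_\d+")


def read_frontmatter(text):
    """Return the text between a leading '---' line and the next '---' line.

    Returns an empty string when the page has no complete frontmatter block.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return ""


def ingested_handles(sources_dir):
    """Collect every FM handle found in the frontmatter of sources_dir/*.md."""
    found = set()
    for page in Path(sources_dir).glob("*.md"):
        frontmatter = read_frontmatter(page.read_text(encoding="utf-8"))
        found.update(HANDLE_RE.findall(frontmatter))
    return found


def filter_candidates(candidates, sources_dir):
    """Drop candidates that already have a source page, keeping list order."""
    seen = ingested_handles(sources_dir)
    return [h for h in candidates if h not in seen]
```

Reading only the frontmatter, rather than the whole page, avoids excluding a handle merely because another page mentions it in its body text.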

Batch Commits

  • 4bc2370 — group 1: 6 pages (all duplicates), 24 FP (20 missing handles)
  • 467af7d — group 2: 3 pages (2 duplicates, 1 unique), 47 FP (32 missing handles)
  • 2961cee — group 3: 1 page (unique), 49 FP (37 missing handles)
  • a5c59dc — group 4: 0 pages, 50 FP (49 missing handles)
  • b0f167a — dedup: 8 duplicate source pages removed
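As a cross-check, the per-group figures in the commit list can be tallied against the header totals. The sketch below copies the numbers from the commit list. It also makes one labelled assumption: group 1's 20 missing handles are counted in addition to its 24 FP, whereas for groups 2 to 4 the missing handles are already included in the FP count. That reading is not stated in the report, but it is the one under which every group sums to 50 handles.

```python
# Per-group figures copied from the commit list above.
groups = {
    "group 1": {"pages": 6, "duplicates": 6, "fp": 24, "missing": 20},
    "group 2": {"pages": 3, "duplicates": 2, "fp": 47, "missing": 32},
    "group 3": {"pages": 1, "duplicates": 0, "fp": 49, "missing": 37},
    "group 4": {"pages": 0, "duplicates": 0, "fp": 50, "missing": 49},
}


def total(field):
    """Sum one field across all groups."""
    return sum(g[field] for g in groups.values())


pages = total("pages")            # 10 pages created
duplicates = total("duplicates")  # 8 duplicates removed
net_new = pages - duplicates      # 2 net new source pages
missing = total("missing")        # 138 missing handles reported
fp_listed = total("fp")           # 170 FP as listed per group

# Assumption: group 1's missing handles are not part of its FP count.
fp_with_group1_missing = fp_listed + groups["group 1"]["missing"]  # 190
handles_accounted = pages + fp_with_group1_missing                 # 200

print(net_new, duplicates, fp_with_group1_missing, handles_accounted)
```

Under that assumption the tally reproduces the header values: 2 net new pages, 8 duplicates, about 190 false positives, and 200 handles processed. Groups 2 to 4 account for 118 of the missing handles, in line with the roughly 120 cited in the summary, and group 1's 20 bring the reported total to 138.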