Archive Methodology
This page documents how the archive was constructed — from identifying documents to extracting their contents to assessing quality. The goal is reproducibility: anyone with the same tools and sources should be able to reconstruct what we have and verify it against the provenance records.
Document Identification
The starting point was Hamilton’s own publication record, traced through three academic databases:
- DBLP (dblp.org/pid/72/4181.html) — computer science bibliography
- IEEE Xplore (ieeexplore.ieee.org/author/37086475658) — IEEE publications
- ACM Digital Library (dl.acm.org/profile/81502669990) — ACM publications
From these, we built the initial bibliography. Citations within Hamilton’s own papers led to additional documents: the MIT R-700 series, NASA Special Publications, and the GSOP documents her team produced during Apollo and Skylab.
The scope expanded in concentric circles:
- Hamilton’s own publications — journal articles, conference papers, the 2019 Draper retrospective
- Documents Hamilton authored at MIT — GSOP sections for Colossus and Skylark
- Institutional accounts of the software effort — Johnson & Giller (1971), Hall (1977)
- NASA program-level documents — SP-287, the Managing the Moon Program oral history
- Modern technical analyses — Averill’s AGC hardware analysis, NASA retrospectives
- Physical archives — the Smithsonian NASM.1986.0158 collection (documented but not digitized)
The Compendium tracks every identified document regardless of acquisition status.
PDF Acquisition
Documents were acquired from three categories of sources, each with different legal and access characteristics.
NASA Technical Reports Server (NTRS)
The primary source for government documents. NTRS provides direct PDF downloads for most reports via a predictable URL pattern:
ntrs.nasa.gov/api/citations/{NTRS_ID}/downloads/{NTRS_ID}.pdf

Some older documents are citation-only on the current NTRS platform (notably Johnson & Giller 1971, NTRS ID 19750067792). For these, alternative sources were used.
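Given that pattern, a download URL can be assembled mechanically. The helper below is a minimal sketch of that step; the function name is ours, not part of any NTRS client:

```python
NTRS_BASE = "https://ntrs.nasa.gov/api/citations"

def ntrs_pdf_url(ntrs_id: str) -> str:
    """Build the direct-download URL for an NTRS citation ID."""
    return f"{NTRS_BASE}/{ntrs_id}/downloads/{ntrs_id}.pdf"

# The Johnson & Giller citation ID mentioned above:
print(ntrs_pdf_url("19750067792"))
# -> https://ntrs.nasa.gov/api/citations/19750067792/downloads/19750067792.pdf
```

Note that for citation-only records like this one, the URL resolves to no PDF, which is exactly why alternative sources were needed.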
Open Access Author Copies
Hamilton’s 2008 USL paper is available as an author’s copy from Hamilton Technologies, Inc. (htius.com/Articles/r12ham.pdf). The 2019 retrospective was published through Draper Labs’ “Hack the Moon” public outreach site. A Wayback Machine snapshot was recorded for the author’s copy to guard against link rot.
Community Archives
The Virtual AGC project at ibiblio.org maintains mirrors of Apollo-era documents and was the source for one Skylark GSOP section not available on NTRS. MIT’s Digital Apollo archive provided the primary copy of the Johnson & Giller volume.
The Averill (2022) paper was obtained directly from arXiv as an open-access preprint.
Extraction Pipeline
Once a PDF is acquired, it goes through a structured extraction process.
Step 1: Metadata Capture
PDF metadata is extracted and stored as metadata.json in the document’s extracted directory. This captures producer application, creation date, page count, and other embedded properties that help assess document provenance.
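The normalization step can be sketched as a small mapping from raw PDF info-dictionary keys to a flat record. The field names and the metadata.json shape below are illustrative assumptions, not the archive’s actual schema:

```python
import json

def normalize_metadata(raw: dict, page_count: int) -> dict:
    """Map raw PDF info-dictionary keys to a flat metadata record."""
    return {
        "producer": raw.get("/Producer"),
        "creator": raw.get("/Creator"),
        "creation_date": raw.get("/CreationDate"),
        "page_count": page_count,
    }

# Example using a producer string documented in the quality tiers below:
meta = normalize_metadata({"/Producer": "Adobe InDesign CS2"}, page_count=12)
print(json.dumps(meta, indent=2))
```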
Step 2: Structure Analysis
Document structure (headings, sections, page dimensions) is analyzed and stored as structure.json. This informs the extraction approach — a clean digital PDF from InDesign needs different handling than a 1972 typewritten scan.
Step 3: Text Extraction
For digital-origin PDFs (Hamilton 2019, Hamilton-Hackler 2008, Averill 2022), direct text extraction produces clean results with minimal post-processing.
For scanned documents (the 1972 GSOP sections, Johnson & Giller 1971, NASA SP-287), OCR is required. The OCR pipeline produces markdown text, but accuracy varies significantly by source quality.
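The “minimal post-processing” for direct extraction can be sketched as two regex passes, rejoining words hyphenated across line breaks and collapsing lone newlines inside paragraphs. The rules are illustrative, not the pipeline’s actual cleanup:

```python
import re

def tidy_extracted_text(text: str) -> str:
    """Rejoin hyphenated line breaks, then turn lone newlines into spaces."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # "sched-\nuling" -> "scheduling"
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # keep blank lines as paragraph breaks
    return text

print(tidy_extracted_text("priority sched-\nuling on the\nAGC"))
# -> priority scheduling on the AGC
```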
Step 4: Image Extraction
Figures, diagrams, and, in the case of scanned documents, full page images are extracted to an images/ subdirectory. This preserves visual content that text extraction cannot capture — particularly important for the GSOP memory allocation tables and the USL paper’s FMap/TMap diagrams.
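The write step can be sketched as follows, using the naming pattern from the directory layout below. The function name and arguments are ours; the document slug stands in for the per-document name:

```python
from pathlib import Path

def save_page_image(extract_dir: Path, slug: str, page: int, idx: int, data: bytes) -> Path:
    """Write one extracted image into images/ using the layout's naming pattern."""
    images = extract_dir / "images"
    images.mkdir(parents=True, exist_ok=True)
    out = images / f"{slug}_page_{page}_img_{idx}.png"
    out.write_bytes(data)
    return out
```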
Step 5: Table Extraction
Structured tables are extracted to JSON format in a tables/ subdirectory. This is especially relevant for the GSOP documents, which are primarily composed of memory allocation tables.
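A minimal sketch of the serialization step, assuming a simple header-plus-rows JSON shape (the schema and the column names are our illustration, not the archive’s exact format):

```python
import json

def table_record(page: int, header: list[str], rows: list[list[str]]) -> str:
    """Serialize one extracted table as JSON for the tables/ subdirectory."""
    return json.dumps({"page": page, "header": header, "rows": rows}, indent=2)

record = table_record(
    page=12,
    header=["Symbol", "Octal address", "Purpose"],
    rows=[["EXAMPLE1", "0000", "placeholder row"]],
)
```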
Directory Layout
Each document follows a consistent extracted directory structure:
extracted/{source-dir}/
  full-text.md       # or (unknown).md -- primary text extraction
  metadata.json      # PDF metadata
  structure.json     # Document structure analysis
  images/            # Extracted figures and page images
    (unknown)_page_{N}_img_{M}.png
  tables/            # Structured table data
    table_{N}_page_{M}.json

Quality Assessment
Extraction quality varies substantially across the archive. Each document’s NOTES.md records known issues.
Quality Tiers
Clean digital PDFs — Text extraction is reliable. Minor issues with formatting artifacts (bullet characters, pull quotes, citation superscripts).
- Hamilton 2019 (OpenOffice/Google Docs origin)
- Hamilton-Hackler 2008 (Adobe InDesign CS2)
- Averill 2022 (pdfTeX/LaTeX)
Scanned documents with usable OCR — Text is readable but contains errors. Tables and mathematical notation are particularly prone to OCR artifacts.
- Colossus GSOP sections (1972 typewritten)
- Skylark GSOP Sections 2 and 7
- Johnson & Giller 1971 (337 pages, Acrobat Capture from 2001)
- Hall 1977
- NASA SP-287
Low-confidence OCR — Extracted text should be treated as unreliable without manual verification.
- Skylark GSOP Section 4 (47% confidence, 14.1 MB scanned document)
Not extracted — Binary formats or pending processing.
- Hamilton 2004 MAPLD (PowerPoint .ppt)
- Managing the Moon Program (extraction pending)
Common OCR Issues
Across the scanned 1970s documents, recurring problems include:
- Word-run errors — spaces dropped between words, producing concatenated strings
- Character confusion — 0/O, 1/l, and similar ambiguities in typewritten text
- Table structure loss — column alignment destroyed during extraction
- Name garbling — proper names misrecognized (e.g., “Gflruth” for “Gilruth”)
- Mathematical notation — subscripts, superscripts, and special symbols misread
- Page furniture — headers, footers, and page numbers appearing inline with body text
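A cleanup pass addressing a few of these classes might look like the sketch below. The substitution rules and the name-corrections table are illustrative examples, not the archive’s actual ruleset:

```python
import re

NAME_FIXES = {"Gflruth": "Gilruth"}  # known garbled proper names

def clean_ocr_line(line: str) -> str:
    """Apply a few of the correction classes above to one OCR'd line."""
    line = re.sub(r"\b0(?=[a-z])", "O", line)   # 0/O confusion: "0rbit" -> "Orbit"
    for bad, good in NAME_FIXES.items():        # name garbling
        line = line.replace(bad, good)
    line = re.sub(r"\s+\d{1,3}$", "", line)     # trailing page-number furniture
    return line

print(clean_ocr_line("Dr. Gflruth approved the 0rbit plan  42"))
# -> Dr. Gilruth approved the Orbit plan
```

Rules like these are inherently lossy heuristics, which is why the NOTES.md caveats matter more than any automated fix.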
The NOTES.md Convention
Every source directory contains a NOTES.md file that serves as the structured analysis record for that document. This is the heart of the archive’s knowledge layer.
Each NOTES.md follows a consistent template:
# Author (Year) -- Transcription Notes
## Citation
Full bibliographic citation, source URLs, legal status, file details.

## Relationship to Hamilton's Body of Work
How this document connects to Hamilton's core themes: error prevention, Apollo flight software, priority scheduling, USL.

## Key Concepts
### Technical Content Summary
Numbered list of the document's substantive contributions.

### Key Equations / Algorithms / Diagrams
Notable visual and formal content.

## Insights for Modern Application
What contemporary engineers can learn from this document.

## Cross-References
How this document connects to others in the archive.

## Transcription Notes
### Source Quality
Assessment of the PDF's origin and condition.

### Known Issues
Numbered list of extraction problems and caveats.

This structure ensures that every document in the archive has been read, analyzed, and connected to the broader collection — not just stored as a file.
The Compendium as Living Catalog
The Compendium serves as the master catalog. It tracks every document the archive knows about. The acquisition pipeline moves documents through three stages:
- Seeking — Identified through bibliography research but no source found yet.
- Located — A source has been identified (URL, ISBN, archive reference) but the document has not yet been downloaded and verified.
- In Archive — PDF downloaded, SHA-256 recorded in COLLECTION.md, NOTES.md written, extraction performed.
All identified publications have completed this pipeline. The Compendium is the first thing updated when a new document is identified and the last thing updated when extraction is complete.
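The hash-at-download step can be sketched as a streaming SHA-256 computation; the chunk size is an arbitrary choice:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 16) -> str:
    """Stream a file and return the SHA-256 hex digest recorded in COLLECTION.md."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Re-running the same function later against the archived file verifies the download against the provenance record.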
The Code Review Agent
The archive includes a code review agent (agents/margaret-hamilton.md) that encodes Hamilton’s engineering methodology as a structured review process. The agent is grounded in her published work and applies her principles to modern code:
- Development Before the Fact — prevention over detection
- Interface Error Taxonomy — the six categories (ambiguous, incomplete, inconsistent, wrong, unnecessary, over-specified) that account for 75% of software defects
- Priority-Based Recovery — the 1202/1201 principle of graceful degradation under load
- Asynchronous Thinking — no operation should block indefinitely; shared state must be protected
- End-to-End System Thinking — every component is part of a larger system
The agent follows a four-phase review process: interface analysis, failure mode enumeration, recovery architecture assessment, and prevention opportunity identification. It produces structured findings classified by Hamilton’s taxonomy.
This is not a style guide or linter. It asks: “Could this design have prevented the error from being possible in the first place?”
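A structured finding from such a review might be shaped like the record below. The field names and the example content are our illustration, not the agent’s actual output format:

```python
# One hypothetical finding, classified by Hamilton's interface-error taxonomy.
finding = {
    "phase": "interface analysis",       # one of the four review phases
    "category": "ambiguous",             # one of the six taxonomy categories
    "location": "queue/consumer.py:42",  # hypothetical file under review
    "observation": "Return value is unspecified when the queue is empty.",
    "prevention": "Make emptiness explicit in the type so the error cannot occur.",
}
```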
Tools and Process
The archive was assembled using:
- Claude Code — primary research assistant for document analysis, NOTES.md authoring, extraction pipeline coordination, and site generation
- PDF extraction tools — metadata extraction, text extraction (direct and OCR), image extraction, table extraction to JSON
- SHA-256 verification — every file hashed at download time, recorded in COLLECTION.md
- Wayback Machine — archival snapshots recorded for author-hosted copies to guard against link rot
- The Virtual AGC project (ibiblio.org) — community preservation effort that maintains mirrors of Apollo-era documentation
- NASA Technical Reports Server — primary acquisition source for government documents
- Starlight — documentation site framework for presenting the archive
What the Tools Cannot Do
Automated extraction does not replace reading. Every document in the archive was read and analyzed by a human or an attentive language model before its NOTES.md was written. The cross-references, the “Relationship to Hamilton’s Body of Work” sections, and the “Insights for Modern Application” sections represent analytical work that no extraction pipeline produces on its own.
The 47% OCR confidence on Skylark Section 4 is an honest number. The archive records what it knows and what it does not.