Archive Methodology
This page documents how the archive was constructed — from identifying documents to extracting their contents to assessing quality. The goal is reproducibility: anyone with the same tools and sources should be able to reconstruct what we have and verify it against the provenance records.
Document Identification
The starting point was Hamilton’s own publication record, traced through three academic databases:
- DBLP (dblp.org/pid/72/4181.html) — computer science bibliography
- IEEE Xplore (ieeexplore.ieee.org/author/37086475658) — IEEE publications
- ACM Digital Library (dl.acm.org/profile/81502669990) — ACM publications
From these, we built the initial bibliography. Citations within Hamilton’s own papers led to additional documents: the MIT R-700 series, NASA Special Publications, and the GSOP documents her team produced during Apollo and Skylab.
The scope expanded in concentric circles:
- Hamilton’s own publications — journal articles, conference papers, the 2019 Draper retrospective
- Documents Hamilton authored at MIT — GSOP sections for Colossus and Skylark
- Institutional accounts of the software effort — Johnson & Giller (1971), Hall (1977)
- NASA program-level documents — SP-287, the Managing the Moon Program oral history
- Modern technical analyses — Averill’s AGC hardware analysis, NASA retrospectives
- Physical archives — the Smithsonian NASM.1986.0158 collection (documented but not digitized)
The Compendium tracks every identified document regardless of acquisition status.
PDF Acquisition
Documents were acquired from three categories of sources, each with different legal and access characteristics.
NASA Technical Reports Server (NTRS)
The primary source for government documents. NTRS provides direct PDF downloads for most reports via a predictable URL pattern:
ntrs.nasa.gov/api/citations/{NTRS_ID}/downloads/{NTRS_ID}.pdf

Some older documents are citation-only on the current NTRS platform (notably Johnson & Giller 1971, NTRS ID 19750067792). For these, alternative sources were used.
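Given that pattern, a download URL can be assembled mechanically. The helper below is a minimal sketch of that step; the function name is ours, not part of any NTRS client:

```python
NTRS_BASE = "https://ntrs.nasa.gov/api/citations"

def ntrs_pdf_url(ntrs_id: str) -> str:
    """Build the direct-download URL for an NTRS citation ID."""
    return f"{NTRS_BASE}/{ntrs_id}/downloads/{ntrs_id}.pdf"

# The Johnson & Giller citation ID mentioned above:
print(ntrs_pdf_url("19750067792"))
# -> https://ntrs.nasa.gov/api/citations/19750067792/downloads/19750067792.pdf
```

Note that for citation-only records like this one, the URL resolves to no PDF, which is exactly why alternative sources were needed.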
Open Access Author Copies
Hamilton’s 2008 USL paper is available as an author’s copy from Hamilton Technologies, Inc. (htius.com/Articles/r12ham.pdf). The 2019 retrospective was published through Draper Labs’ “Hack the Moon” public outreach site. A Wayback Machine snapshot was recorded for the author’s copy to guard against link rot.
Community Archives
The Virtual AGC project at ibiblio.org maintains mirrors of Apollo-era documents and was the source for one Skylark GSOP section not available on NTRS. MIT’s Digital Apollo archive provided the primary copy of the Johnson & Giller volume.
The Averill (2022) paper was obtained directly from arXiv as an open-access preprint.
Extraction Pipeline
Once a PDF is acquired, it goes through a structured extraction process.
Step 1: Metadata Capture
PDF metadata is extracted and stored as metadata.json in the document’s extracted directory. This captures producer application, creation date, page count, and other embedded properties that help assess document provenance.
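The normalization step can be sketched as a small mapping from raw PDF info-dictionary keys to a flat record. The field names and the metadata.json shape below are illustrative assumptions, not the archive’s actual schema:

```python
import json

def normalize_metadata(raw: dict, page_count: int) -> dict:
    """Map raw PDF info-dictionary keys to a flat metadata record."""
    return {
        "producer": raw.get("/Producer"),
        "creator": raw.get("/Creator"),
        "creation_date": raw.get("/CreationDate"),
        "page_count": page_count,
    }

# Example using a producer string documented in the quality tiers below:
meta = normalize_metadata({"/Producer": "Adobe InDesign CS2"}, page_count=12)
print(json.dumps(meta, indent=2))
```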
Step 2: Structure Analysis
Document structure (headings, sections, page dimensions) is analyzed and stored as structure.json. This informs the extraction approach — a clean digital PDF from InDesign needs different handling than a 1972 typewritten scan.
Step 3: Text Extraction
For digital-origin PDFs (Hamilton 2019, Hamilton-Hackler 2008, Averill 2022), direct text extraction produces clean results with minimal post-processing.
For scanned documents (the 1972 GSOP sections, Johnson & Giller 1971, NASA SP-287), OCR is required. The OCR pipeline produces markdown text, but accuracy varies significantly by source quality.
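The “minimal post-processing” for direct extraction can be sketched as two regex passes, rejoining words hyphenated across line breaks and collapsing lone newlines inside paragraphs. The rules are illustrative, not the pipeline’s actual cleanup:

```python
import re

def tidy_extracted_text(text: str) -> str:
    """Rejoin hyphenated line breaks, then turn lone newlines into spaces."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # "sched-\nuling" -> "scheduling"
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # keep blank lines as paragraph breaks
    return text

print(tidy_extracted_text("priority sched-\nuling on the\nAGC"))
# -> priority scheduling on the AGC
```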
Step 4: Image Extraction
Figures, diagrams, and, in the case of scanned documents, full page images are extracted to an images/ subdirectory. This preserves visual content that text extraction cannot capture — particularly important for the GSOP memory allocation tables and the USL paper’s FMap/TMap diagrams.
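The write step can be sketched as follows, using the naming pattern from the directory layout below. The function name and arguments are ours; the document slug stands in for the per-document name:

```python
from pathlib import Path

def save_page_image(extract_dir: Path, slug: str, page: int, idx: int, data: bytes) -> Path:
    """Write one extracted image into images/ using the layout's naming pattern."""
    images = extract_dir / "images"
    images.mkdir(parents=True, exist_ok=True)
    out = images / f"{slug}_page_{page}_img_{idx}.png"
    out.write_bytes(data)
    return out
```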
Step 5: Table Extraction
Structured tables are extracted to JSON format in a tables/ subdirectory. This is especially relevant for the GSOP documents, which are primarily composed of memory allocation tables.
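A minimal sketch of the serialization step, assuming a simple header-plus-rows JSON shape (the schema and the column names are our illustration, not the archive’s exact format):

```python
import json

def table_record(page: int, header: list[str], rows: list[list[str]]) -> str:
    """Serialize one extracted table as JSON for the tables/ subdirectory."""
    return json.dumps({"page": page, "header": header, "rows": rows}, indent=2)

record = table_record(
    page=12,
    header=["Symbol", "Octal address", "Purpose"],
    rows=[["EXAMPLE1", "0000", "placeholder row"]],
)
```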
Directory Layout
Each document follows a consistent extracted directory structure:
extracted/{source-dir}/
  full-text.md       # or (unknown).md -- primary text extraction
  metadata.json      # PDF metadata
  structure.json     # Document structure analysis
  images/            # Extracted figures and page images
    (unknown)_page_{N}_img_{M}.png
  tables/            # Structured table data
    table_{N}_page_{M}.json

Quality Assessment
Extraction quality varies substantially across the archive. Each document’s NOTES.md records known issues.
Quality Tiers
Clean digital PDFs — Text extraction is reliable. Minor issues with formatting artifacts (bullet characters, pull quotes, citation superscripts).
- Hamilton 2019 (OpenOffice/Google Docs origin)
- Hamilton-Hackler 2008 (Adobe InDesign CS2)
- Averill 2022 (pdfTeX/LaTeX)
Scanned documents with usable OCR — Text is readable but contains errors. Tables and mathematical notation are particularly prone to OCR artifacts.
- Colossus GSOP sections (1972 typewritten)
- Skylark GSOP Sections 2 and 7
- Johnson & Giller 1971 (337 pages, Acrobat Capture from 2001)
- Hall 1977
- NASA SP-287
Low-confidence OCR — Extracted text should be treated as unreliable without manual verification.
- Skylark GSOP Section 4 (47% confidence, 14.1 MB scanned document)
Not extracted — Binary formats or pending processing.
- Hamilton 2004 MAPLD (PowerPoint .ppt)
- Managing the Moon Program (extraction pending)
Common OCR Issues
Across the scanned 1970s documents, recurring problems include:
- Word-run errors — spaces dropped between words, producing concatenated strings
- Character confusion — 0/O, 1/l, and similar ambiguities in typewritten text
- Table structure loss — column alignment destroyed during extraction
- Name garbling — proper names misrecognized (e.g., “Gflruth” for “Gilruth”)
- Mathematical notation — subscripts, superscripts, and special symbols misread
- Page furniture — headers, footers, and page numbers appearing inline with body text
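A cleanup pass addressing a few of these classes might look like the sketch below. The substitution rules and the name-corrections table are illustrative examples, not the archive’s actual ruleset:

```python
import re

NAME_FIXES = {"Gflruth": "Gilruth"}  # known garbled proper names

def clean_ocr_line(line: str) -> str:
    """Apply a few of the correction classes above to one OCR'd line."""
    line = re.sub(r"\b0(?=[a-z])", "O", line)   # 0/O confusion: "0rbit" -> "Orbit"
    for bad, good in NAME_FIXES.items():        # name garbling
        line = line.replace(bad, good)
    line = re.sub(r"\s+\d{1,3}$", "", line)     # trailing page-number furniture
    return line

print(clean_ocr_line("Dr. Gflruth approved the 0rbit plan  42"))
# -> Dr. Gilruth approved the Orbit plan
```

Rules like these are inherently lossy heuristics, which is why the NOTES.md caveats matter more than any automated fix.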
The NOTES.md Convention
Every source directory contains a NOTES.md file that serves as the structured analysis record for that document. This is the heart of the archive’s knowledge layer.
Each NOTES.md follows a consistent template:
# Author (Year) -- Transcription Notes
## Citation
Full bibliographic citation, source URLs, legal status, file details.

## Relationship to Hamilton's Body of Work
How this document connects to Hamilton's core themes: error prevention, Apollo flight software, priority scheduling, USL.

## Key Concepts
### Technical Content Summary
Numbered list of the document's substantive contributions.

### Key Equations / Algorithms / Diagrams
Notable visual and formal content.

## Insights for Modern Application
What contemporary engineers can learn from this document.

## Cross-References
How this document connects to others in the archive.

## Transcription Notes
### Source Quality
Assessment of the PDF's origin and condition.

### Known Issues
Numbered list of extraction problems and caveats.

This structure ensures that every document in the archive has been read, analyzed, and connected to the broader collection — not just stored as a file.
The Compendium as Living Catalog
The Compendium serves as the master catalog. It tracks every document the archive knows about. The acquisition pipeline moves documents through three stages:
- Seeking — Identified through bibliography research but no source found yet.
- Located — A source has been identified (URL, ISBN, archive reference) but the document has not yet been downloaded and verified.
- In Archive — PDF downloaded, SHA-256 recorded in COLLECTION.md, NOTES.md written, extraction performed.
All identified publications have completed this pipeline. The Compendium is the first thing updated when a new document is identified and the last thing updated when extraction is complete.
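The hash-at-download step can be sketched as a streaming SHA-256 computation; the chunk size is an arbitrary choice:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 16) -> str:
    """Stream a file and return the SHA-256 hex digest recorded in COLLECTION.md."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Re-running the same function later against the archived file verifies the download against the provenance record.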
The Code Review Agent
The archive includes a code review agent (agents/margaret-hamilton.md) that encodes Hamilton’s engineering methodology as a structured review process. The agent is grounded in her published work and applies her principles to modern code:
- Development Before the Fact — prevention over detection
- Interface Error Taxonomy — the six categories (ambiguous, incomplete, inconsistent, wrong, unnecessary, over-specified) that account for 75% of software defects
- Priority-Based Recovery — the 1202/1201 principle of graceful degradation under load
- Asynchronous Thinking — no operation should block indefinitely; shared state must be protected
- End-to-End System Thinking — every component is part of a larger system
The agent follows a four-phase review process: interface analysis, failure mode enumeration, recovery architecture assessment, and prevention opportunity identification. It produces structured findings classified by Hamilton’s taxonomy.
This is not a style guide or linter. It asks: “Could this design have prevented the error from being possible in the first place?”
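A structured finding from such a review might be shaped like the record below. The field names and the example content are our illustration, not the agent’s actual output format:

```python
# One hypothetical finding, classified by Hamilton's interface-error taxonomy.
finding = {
    "phase": "interface analysis",       # one of the four review phases
    "category": "ambiguous",             # one of the six taxonomy categories
    "location": "queue/consumer.py:42",  # hypothetical file under review
    "observation": "Return value is unspecified when the queue is empty.",
    "prevention": "Make emptiness explicit in the type so the error cannot occur.",
}
```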
Tools and Process
The archive was assembled using:
- Claude Code — primary research assistant for document analysis, NOTES.md authoring, extraction pipeline coordination, and site generation
- PDF extraction tools — metadata extraction, text extraction (direct and OCR), image extraction, table extraction to JSON
- SHA-256 verification — every file hashed at download time, recorded in COLLECTION.md
- Wayback Machine — archival snapshots recorded for author-hosted copies to guard against link rot
- The Virtual AGC project (ibiblio.org) — community preservation effort that maintains mirrors of Apollo-era documentation
- NASA Technical Reports Server — primary acquisition source for government documents
- Starlight — documentation site framework for presenting the archive
What the Tools Cannot Do
Automated extraction does not replace reading. Every document in the archive was read and analyzed by a human or an attentive language model before its NOTES.md was written. The cross-references, the “Relationship to Hamilton’s Body of Work” sections, and the “Insights for Modern Application” sections represent analytical work that no extraction pipeline produces on its own.
The 47% OCR confidence on Skylark Section 4 is an honest number. The archive records what it knows and what it does not.