NASA Project Exploration & eXtraction is a specialized search system designed to simplify access to NASA’s research and technology portfolio. The system transforms complex, heterogeneous datasets into a unified, searchable knowledge base, significantly improving the discoverability of scientific and technical work.
Technical Evolution & Performance #
The project implemented a retrieval hierarchy that progresses from simple text matching to a sophisticated hybrid system. The Hybrid architecture delivers the strongest performance by merging traditional keyword precision with AI-driven conceptual depth.
| Model Stage | Architecture Focus | Effectiveness (MAP) |
|---|---|---|
| Base | Simplistic lexical matching (Baseline) | 0.462 |
| Tuned | Custom schema & field boosting | 0.483 |
| Advanced | Technical synonymy & stop-word retention | 0.499 |
| Semantic | AI-powered vector embeddings (OpenAI) | 0.510 |
| Hybrid | RRF Fusion of Lexical + Semantic | 0.560 |
The Result: The Hybrid approach emerged as the most robust solution overall, achieving a 0.560 Mean Average Precision (MAP) by successfully “filling in the gaps” where keywords alone were insufficient.
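The fusion step behind the Hybrid model can be illustrated with standard Reciprocal Rank Fusion, which scores each document by summing 1/(k + rank) across the lexical and semantic result lists. This is a minimal sketch of the general RRF algorithm, not the project's actual code; the document IDs and the choice of k = 60 (the constant from the original RRF formulation) are illustrative assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked result lists via Reciprocal Rank Fusion (RRF).

    rankings: iterable of ranked doc-id lists (e.g. a lexical run
    and a semantic run). Returns doc ids sorted by fused score.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1/(k + rank); k damps the
            # influence of top-of-list outliers.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical runs: a doc ranked well by both engines rises to the top.
lexical = ["proj_42", "proj_17", "proj_03"]
semantic = ["proj_17", "proj_99", "proj_42"]
print(rrf_fuse([lexical, semantic]))  # proj_17 first: strong in both lists
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the keyword engine and the vector engine, which is what makes it a natural glue for a lexical-plus-semantic setup.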
The Architecture #
Behind the scenes, a robust pipeline handles the unique challenges of specialized technical data.

- Intelligent ETL Pipeline: A custom Python-based pipeline was built to scrape project narratives directly from the NASA research portal.
- High-Coverage Extraction: The scraping operation achieved a 98.4% success rate, extracting detailed descriptions and maturity ratings for over 16,000 projects.
- Data Normalization: The system implements automated cleaning, including ISO 8601 date standardization and multi-strategy matching for technology taxonomies.
- Dual-Engine Storage: The architecture utilizes Apache Solr for indexing and MongoDB for persistent metadata storage.
- Semantic Understanding: High-dimensional OpenAI embeddings (3072-dim) are utilized to capture conceptual relevance even when exact terminology differs.
Evaluation Pipeline #
To ensure high-quality retrieval, a rigorous evaluation framework was implemented based on the standard Text Retrieval Conference (TREC) framework.

- TREC Pooling Methodology: To create a manageable ground truth from a massive collection, the system utilized a pooling strategy where the top 100 results from each retrieval model were merged to form a unified Judgment Pool.
- Hybrid Relevance Assessment: A two-stage model combined AI efficiency with human expertise.
- LLM-as-a-Judge: GPT-4o mini initially screened the judgment pool, providing graded relevance scores and textual justifications.
- Human Adjudication: “Uncertain” cases (scores between 40 and 60) were flagged for manual review, where human expertise resolved 48 borderline document-topic pairs to ensure the credibility of the final judgments.
- Standard Metrics: Performance was benchmarked using standard IR metrics, including MAP, P@10, and nDCG.
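The pooling and triage steps above can be sketched in a few lines: the top results from each run are merged into one judgment pool, and LLM relevance scores in the uncertain band are routed to human review. The run names, score dictionary, and function names are hypothetical; only the pool-depth and 40-60 triage band come from the description above.

```python
def build_judgment_pool(runs, depth=100):
    """Merge the top-`depth` results from each retrieval run into one pool."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

def triage(llm_scores, low=40, high=60):
    """Split graded LLM scores into auto-accepted judgments and a
    human-review queue for borderline cases."""
    auto, review = {}, []
    for doc_id, score in llm_scores.items():
        if low <= score <= high:
            review.append(doc_id)  # borderline: flag for manual adjudication
        else:
            auto[doc_id] = score
    return auto, review

# Hypothetical mini-example with pool depth 2 per run.
runs = {"lexical": ["d1", "d2", "d3"], "semantic": ["d2", "d4"]}
pool = build_judgment_pool(runs, depth=2)      # {"d1", "d2", "d4"}
auto, review = triage({"d1": 85, "d2": 50, "d4": 10})
print(sorted(pool), review)                    # ['d1', 'd2', 'd4'] ['d2']
```

Pooling keeps the annotation workload bounded (each run contributes at most `depth` documents per topic), while the triage band concentrates scarce human effort on exactly the pairs where the LLM judge is least reliable.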
Documentation #
Detailed information regarding the system’s architecture, methodology, and performance evaluations can be explored through the technical academic report and the source code.