COMPLETED | Case study

PhenoEval AI

Biomedical NLP evaluation platform for HPO phenotype recognition tools

PhenoEval AI was a UNSW capstone project built for an external genomics research client to benchmark phenotype concept recognition tools against a manually annotated HPO gold corpus. The system compared model outputs using precision, recall, F1, semantic similarity, and error-analysis workflows to support better phenotype extraction from clinical case-report text.

Biomedical NLP, HPO, PhenoBERT, PhenoGPT, PhenoTagger, Model Evaluation, Semantic Similarity, Error Analysis, Python, scikit-learn, pandas, React, Flask, PostgreSQL

Problem

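Phenotype mentions in biomedical text are noisy, context-dependent, and semantically varied: the same concept can appear under different wording, and the correct reading of a mention often depends on the surrounding text. Systematic evaluation shows where each model is strong, where extraction fails, and how far its output can be relied on in downstream clinical text workflows.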

What I Built

As the Data Scientist & Analyst on the capstone team, I worked on an evaluation-focused platform for benchmarking PhenoTagger, PhenoBERT, and PhenoGPT against a gold corpus of HPO annotations. The system was designed to load annotated case-report data, compare predictions against ground truth, calculate evaluation metrics, and surface error patterns through dashboards and reports.

Client / Research Context

The project was delivered as a UNSW capstone with an external academic genomics client from the University of Sydney. The research context focused on Human Phenotype Ontology recognition for clinical and genomic case-report text, with particular relevance to bone-related phenotype studies and downstream genome interpretation.

UNSW COMP9900 capstone project
External University of Sydney genomics research client
Human Phenotype Ontology concept recognition
Clinical case-report text
Bone-related phenotype research context
Gold corpus evaluation workflow

Architecture

PhenoTagger / PhenoBERT / PhenoGPT model runners
Gold corpus loading and preprocessing
HPO annotation normalization
Prediction-to-ground-truth comparator
Precision, recall, and F1 metric pipeline
Semantic similarity scoring
False-positive / false-negative error analysis
React evaluation dashboard
Flask REST API
PostgreSQL / SQLite results storage
Experiment tracking and reproducible reports
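
To illustrate the normalization and shared-format components listed above, here is a minimal sketch of how heterogeneous tool output might be brought into one record type. It is not the project's actual code: the `Annotation` record, the `to_annotation` helper, and the choice to zero-pad short identifiers are all assumptions made for illustration. The only fact relied on is that canonical HPO identifiers have the form `HP:` followed by seven digits.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Accepts variants such as "hp_0001250" or "HP:1250" (assumed input variants).
_HPO_ID = re.compile(r"^\s*HP[:_]?(\d{1,7})\s*$", re.IGNORECASE)


@dataclass(frozen=True)
class Annotation:
    """Hypothetical shared record for both gold and predicted annotations."""
    doc_id: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    hpo_id: str  # canonical form, e.g. "HP:0001250"


def normalize_hpo_id(raw: str) -> Optional[str]:
    """Return the canonical 'HP:' + seven-digit form, or None if not HPO-like."""
    match = _HPO_ID.match(raw)
    if match is None:
        return None
    # Assumption: short identifiers are zero-stripped and can be padded back.
    return f"HP:{int(match.group(1)):07d}"


def to_annotation(doc_id: str, start: int, end: int, raw_hpo_id: str) -> Optional[Annotation]:
    """Build a normalized record, rejecting bad identifiers and empty spans."""
    hpo_id = normalize_hpo_id(raw_hpo_id)
    if hpo_id is None or start < 0 or end <= start:
        return None
    return Annotation(doc_id, start, end, hpo_id)
```

Returning `None` for unusable rows, rather than raising, lets a model runner count and log rejected predictions without aborting a whole benchmark run; that is a design preference in this sketch, not a documented project decision.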

Core Features

Gold corpus ingestion
Baseline model evaluation
Multi-model output comparison
Precision / recall / F1 reporting
Semantic similarity analysis
False-positive and false-negative review
Boundary and synonym error inspection
Dashboard-ready evaluation summaries
Fine-tuning preparation workflow
Client-ready reporting
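
The boundary-error inspection listed above can be sketched as a small classifier that labels each prediction against the gold spans. This is an illustrative sketch only: the `Span` record, the category names (`exact`, `boundary`, `wrong_concept`, `spurious`), and the overlap rule are assumptions, not the taxonomy the project used, and the HPO identifiers in the usage lines are placeholders.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Span:
    doc_id: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    hpo_id: str


def overlaps(a: Span, b: Span) -> bool:
    """True when both spans are in the same document and share a character."""
    return a.doc_id == b.doc_id and a.start < b.end and b.start < a.end


def classify_prediction(pred: Span, gold: Sequence[Span]) -> str:
    """Label one prediction using hypothetical error categories."""
    candidates = [g for g in gold if overlaps(pred, g)]
    if not candidates:
        return "spurious"        # no gold mention in this region
    for g in candidates:
        if g.hpo_id == pred.hpo_id and (g.start, g.end) == (pred.start, pred.end):
            return "exact"       # same concept, same offsets
    if any(g.hpo_id == pred.hpo_id for g in candidates):
        return "boundary"        # same concept, different span edges
    return "wrong_concept"       # overlapping region, different HPO term


gold = [Span("case-1", 10, 25, "HP:0000939")]
print(classify_prediction(Span("case-1", 10, 25, "HP:0000939"), gold))  # exact
print(classify_prediction(Span("case-1", 14, 25, "HP:0000939"), gold))  # boundary
```

Gold spans that no prediction overlaps would be counted separately as false negatives; synonym errors need ontology or lexicon information and are not covered by this sketch.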

Evaluation Workflow

PhenoEval separated model execution from evaluation so each tool could be benchmarked through a shared comparison layer. Model outputs were normalized, matched against HPO gold annotations, scored using standard metrics, and reviewed for recurring error patterns.

Load annotated case reports
Normalize HPO labels and spans
Run phenotype recognition tools
Standardize model output format
Compare predictions with gold labels
Calculate precision, recall, and F1
Compute semantic similarity for near matches
Classify model errors
Generate dashboard/report outputs
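
The comparison and scoring steps above can be sketched with plain set arithmetic. This is a minimal sketch under stated assumptions, not the project's metric pipeline: it assumes micro-averaged, document-level matching on `(doc_id, hpo_id)` pairs, whereas the actual matching key (for example, whether spans were included) is not stated here. The function name and the placeholder identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumed matching key: one entry per (document, HPO term) pair.
Pair = Tuple[str, str]


def precision_recall_f1(gold: Set[Pair], predicted: Set[Pair]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 over exact pair matches."""
    tp = len(gold & predicted)   # predicted and present in gold
    fp = len(predicted - gold)   # predicted but absent from gold
    fn = len(gold - predicted)   # in gold but never predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


gold = {("case-1", "HP:0000939"), ("case-1", "HP:0002757"), ("case-2", "HP:0001250")}
pred = {("case-1", "HP:0000939"), ("case-2", "HP:0001250"), ("case-2", "HP:0000118")}
scores = precision_recall_f1(gold, pred)
print(scores["tp"], scores["fp"], scores["fn"])  # 2 1 1
```

With two true positives, one false positive, and one false negative, precision, recall, and F1 all equal 2/3 in this example. Running the same function once per tool against the same gold set is what makes the scores directly comparable across models.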

Validation and Research Constraints

No clinical deployment scope
No new clinical annotation creation
Evaluation limited to available gold corpus
Measured model performance depends on annotation consistency in the gold corpus
False positives and false negatives reviewed separately
Semantic similarity used to capture near-miss predictions
Fine-tuning treated as optional extension if time allowed
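
To show how a near-miss prediction might earn partial credit, here is a sketch of one common ontology-based measure: the Jaccard overlap of two terms' ancestor sets. Which similarity measure the project actually used is not stated here, so this is a stand-in for illustration. The hierarchy below is a hypothetical toy graph with made-up `T:` identifiers, not real HPO terms.

```python
from typing import Dict, Set

# Hypothetical toy hierarchy (child -> parents); NOT real HPO content.
TOY_PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:skeletal": {"T:root"},
    "T:bone_density": {"T:skeletal"},
    "T:low_bone_density": {"T:bone_density"},
    "T:high_bone_density": {"T:bone_density"},
    "T:neuro": {"T:root"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus everything reachable via parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, set()))
    return seen


def jaccard_similarity(a: str, b: str, parents: Dict[str, Set[str]]) -> float:
    """Shared ancestors divided by all ancestors; 1.0 means identical terms."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Sibling terms share 3 of 5 ancestors in total.
print(jaccard_similarity("T:low_bone_density", "T:high_bone_density", TOY_PARENTS))  # 0.6
# Terms in different branches share only the root.
print(jaccard_similarity("T:low_bone_density", "T:neuro", TOY_PARENTS))  # 0.2
```

A score like this can separate a prediction that lands on a sibling or parent concept from one that is entirely unrelated, which exact-match precision and recall treat identically as errors.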

Tech Stack

Biomedical NLP, HPO, PhenoTagger, PhenoBERT, PhenoGPT, Transformers, PyTorch, TensorFlow, scikit-learn, pandas, spaCy, Python, React, Flask, PostgreSQL/SQLite, FAISS/Pinecone-ready vector layer, GitHub Actions

Current Status

Completed as a UNSW capstone project. The delivered scope covered the evaluation workflow, the benchmark design, and client-facing reporting.

Next Steps

Add stronger visual error-analysis dashboards
Package model runners behind a unified API
Add more semantic similarity explainability
Extend evaluation to additional phenotype tools
Document fine-tuning experiments and result deltas
Add architecture diagrams and sample annotated cases