PhenoEval AI
Biomedical NLP evaluation platform for HPO phenotype recognition tools
PhenoEval AI was a UNSW capstone project built for an external genomics research client to benchmark phenotype concept recognition tools against a manually annotated HPO gold corpus. The system compared model outputs using precision, recall, F1, semantic similarity, and error-analysis workflows to support better phenotype extraction from clinical case-report text.
Problem
Phenotype mentions in biomedical text are noisy, context-dependent, and semantically varied. Systematic evaluation shows where each model is strong, where extraction fails, and how reliable its output is for downstream clinical text workflows.
What I Built
As the Data Scientist & Analyst on the capstone team, I worked on an evaluation-focused platform for benchmarking PhenoTagger, PhenoBERT, and PhenoGPT against a gold corpus of HPO annotations. The system was designed to load annotated case-report data, compare predictions against ground truth, calculate evaluation metrics, and surface error patterns through dashboards and reports.
Client / Research Context
The project was delivered as a UNSW capstone with an external academic genomics client from the University of Sydney. The research context focused on Human Phenotype Ontology recognition for clinical and genomic case-report text, with particular relevance to bone-related phenotype studies and downstream genome interpretation.
- UNSW COMP9900 capstone project
- External University of Sydney genomics research client
- Human Phenotype Ontology concept recognition
- Clinical case-report text
- Bone-related phenotype research context
- Gold corpus evaluation workflow
Architecture
- PhenoTagger / PhenoBERT / PhenoGPT model runners
- Gold corpus loading and preprocessing
- HPO annotation normalization
- Prediction-to-ground-truth comparator
- Precision, recall, and F1 metric pipeline
- Semantic similarity scoring
- False-positive / false-negative error analysis
- React evaluation dashboard
- Flask REST API
- PostgreSQL / SQLite results storage
- Experiment tracking and reproducible reports
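As an illustration of the HPO annotation normalization step, the sketch below maps identifier variants onto the canonical "HP:" plus seven digits form and wraps each annotation in a shared record. The names normalize_hpo_id, Annotation, and make_annotation are hypothetical, and the set of accepted variants (lowercase prefix, underscore separator, surrounding whitespace) is an assumption rather than a description of what each tool actually emits.

```python
import re
from dataclasses import dataclass

# Canonical HPO term IDs look like "HP:0001250" ("HP:" plus seven digits).
# Assumption: tools may emit a lowercase prefix or "_" instead of ":".
_HPO_ID = re.compile(r"^\s*HP[:_](\d{7})\s*$", re.IGNORECASE)


def normalize_hpo_id(raw: str) -> str:
    """Return the canonical 'HP:NNNNNNN' form or raise ValueError."""
    match = _HPO_ID.match(raw)
    if match is None:
        raise ValueError(f"not an HPO identifier: {raw!r}")
    return f"HP:{match.group(1)}"


@dataclass(frozen=True)
class Annotation:
    """One phenotype mention in a shared format (hypothetical schema)."""
    doc_id: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    hpo_id: str


def make_annotation(doc_id: str, start: int, end: int, raw_id: str) -> Annotation:
    """Build a record with a validated span and a normalized HPO ID."""
    if not 0 <= start < end:
        raise ValueError(f"invalid span: {start}-{end}")
    return Annotation(doc_id, start, end, normalize_hpo_id(raw_id))


if __name__ == "__main__":
    print(make_annotation("case_01", 10, 25, "hp_0001250"))
```

Because the record is frozen and hashable, gold and predicted annotations in this form can be placed in sets and compared directly.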
Core Features
- Gold corpus ingestion
- Baseline model evaluation
- Multi-model output comparison
- Precision / recall / F1 reporting
- Semantic similarity analysis
- False-positive and false-negative review
- Boundary and synonym error inspection
- Dashboard-ready evaluation summaries
- Fine-tuning preparation workflow
- Client-ready reporting
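The false-positive, false-negative, and boundary review listed above could be driven by a simple span classifier such as the sketch below. The category names, the Span type, and the overlap rule are illustrative assumptions; in particular, "wrong_concept" (right location, different HPO ID) is a label introduced here, not one taken from the project.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A mention as character offsets plus an HPO ID (hypothetical schema)."""
    start: int   # inclusive
    end: int     # exclusive
    hpo_id: str


def overlaps(a: Span, b: Span) -> bool:
    """True when the two character ranges share at least one position."""
    return a.start < b.end and b.start < a.end


def classify_prediction(pred: Span, gold: List[Span]) -> str:
    """Label one prediction relative to the gold spans of the same document."""
    touching = [g for g in gold if overlaps(pred, g)]
    if any(g == pred for g in touching):
        return "exact"            # same offsets, same HPO ID
    if any(g.hpo_id == pred.hpo_id for g in touching):
        return "boundary"         # right concept, different offsets
    if touching:
        return "wrong_concept"    # overlapping text, different HPO ID
    return "false_positive"       # no gold mention at this location


def missed_gold(preds: List[Span], gold: List[Span]) -> List[Span]:
    """Gold spans that no prediction overlaps (false negatives)."""
    return [g for g in gold if not any(overlaps(p, g) for p in preds)]


if __name__ == "__main__":
    # Placeholder IDs and offsets chosen only for illustration.
    gold = [Span(10, 25, "HP:0000002"), Span(40, 52, "HP:0000003")]
    preds = [Span(12, 25, "HP:0000002"), Span(60, 70, "HP:0000004")]
    for p in preds:
        print(p, classify_prediction(p, gold))
    print("missed:", missed_gold(preds, gold))
```

Keeping false positives and false negatives in separate outputs matches the separate review of the two error types.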
Evaluation Workflow
PhenoEval separated model execution from evaluation so each tool could be benchmarked through a shared comparison layer. Model outputs were normalized, matched against HPO gold annotations, scored using standard metrics, and reviewed for recurring error patterns.
- Load annotated case reports
- Normalize HPO labels and spans
- Run phenotype recognition tools
- Standardize model output format
- Compare predictions with gold labels
- Calculate precision, recall, and F1
- Compute semantic similarity for near matches
- Classify model errors
- Generate dashboard/report outputs
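The scoring step can be sketched as a set comparison. This minimal example assumes document-level matching, where each annotation is reduced to a (doc_id, hpo_id) pair. The function name prf1 and the sample data are hypothetical, and span-level matching would need a stricter key.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (doc_id, hpo_id); assumed document-level matching


def prf1(gold: Set[Pair], pred: Set[Pair]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 over annotation pairs."""
    tp = len(gold & pred)   # predicted and present in gold
    fp = len(pred - gold)   # predicted but absent from gold
    fn = len(gold - pred)   # present in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Placeholder IDs chosen only for illustration.
    gold = {("doc1", "HP:0000001"), ("doc1", "HP:0000002"), ("doc2", "HP:0000003")}
    pred = {("doc1", "HP:0000001"), ("doc2", "HP:0000003"), ("doc2", "HP:0000004")}
    print(prf1(gold, pred))
```

With two true positives, one false positive, and one false negative, precision, recall, and F1 all equal 2/3 in this example. Returning zero when a denominator is empty is a convention chosen here so that a model with no predictions does not raise a division error.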
Validation and Research Constraints
- No clinical deployment scope
- No new clinical annotation creation
- Evaluation limited to available gold corpus
- Measured model performance depends on the consistency of the gold annotations
- False positives and false negatives reviewed separately
- Semantic similarity used to capture near-miss predictions
- Fine-tuning treated as an optional extension, time permitting
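One way to score near-miss predictions is to compare the ontology ancestors of the predicted and gold terms. The sketch below uses Jaccard overlap of ancestor sets on a toy hierarchy. Both the measure and the toy terms (prefixed "T:" to make clear they are not real HPO IDs) are assumptions made for illustration, and the project's own similarity method may differ.

```python
from typing import Dict, List, Set

# Toy is-a hierarchy: child -> parents. These are NOT real HPO terms.
TOY_PARENTS: Dict[str, List[str]] = {
    "T:root": [],
    "T:skeletal": ["T:root"],
    "T:bone_density": ["T:skeletal"],
    "T:low_bone_density": ["T:bone_density"],
    "T:fracture": ["T:skeletal"],
    "T:cardiac": ["T:root"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def jaccard_similarity(a: str, b: str, parents: Dict[str, List[str]]) -> float:
    """Shared ancestors divided by all ancestors of either term."""
    set_a, set_b = ancestors(a, parents), ancestors(b, parents)
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # A parent-level prediction scores higher than an unrelated branch.
    print(jaccard_similarity("T:low_bone_density", "T:bone_density", TOY_PARENTS))
    print(jaccard_similarity("T:low_bone_density", "T:cardiac", TOY_PARENTS))
```

Under this measure a prediction one level above the gold term scores 0.75, while a term from an unrelated branch scores 0.2, so a threshold between the two could separate near misses from outright errors.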
Tech Stack
Biomedical NLP, HPO, PhenoTagger, PhenoBERT, PhenoGPT, Transformers, PyTorch, TensorFlow, scikit-learn, pandas, spaCy, Python, React, Flask, PostgreSQL/SQLite, FAISS/Pinecone-ready vector layer, GitHub Actions
Current Status
Completed as a UNSW capstone project, covering the evaluation workflow, benchmark design, and client-facing reporting.
Next Steps
- Add stronger visual error-analysis dashboards
- Package model runners behind a unified API
- Make semantic similarity scores more explainable
- Extend evaluation to additional phenotype tools
- Document fine-tuning experiments and result deltas
- Add architecture diagrams and sample annotated cases