Case study (completed)

PhenoEval AI

Biomedical NLP evaluation platform for HPO phenotype recognition tools

PhenoEval AI was a UNSW capstone project built for an external genomics research client to benchmark phenotype concept recognition tools against a manually annotated HPO gold corpus. The system compared model outputs using precision, recall, F1, semantic similarity, and error-analysis workflows to support better phenotype extraction from clinical case-report text.

Biomedical NLP, HPO, PhenoBERT, PhenoGPT, PhenoTagger, Model Evaluation, Semantic Similarity, Error Analysis, Python, scikit-learn, pandas, React, Flask, PostgreSQL

Problem

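Phenotype mentions in biomedical text are noisy, context-dependent, and semantically varied, so a single headline score says little about how a recognition tool actually behaves. Systematic evaluation shows where each model is strong, where extraction fails, and how far its output can be relied on in downstream clinical text workflows.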

What I Built

As the Data Scientist & Analyst on the capstone team, I worked on an evaluation-focused platform for benchmarking PhenoTagger, PhenoBERT, and PhenoGPT against a gold corpus of HPO annotations. The system was designed to load annotated case-report data, compare predictions against ground truth, calculate evaluation metrics, and surface error patterns through dashboards and reports.

Client / Research Context

The project was delivered as a UNSW capstone with an external academic genomics client from the University of Sydney. The research context focused on Human Phenotype Ontology recognition for clinical and genomic case-report text, with particular relevance to bone-related phenotype studies and downstream genome interpretation.

  • UNSW COMP9900 capstone project
  • External University of Sydney genomics research client
  • Human Phenotype Ontology concept recognition
  • Clinical case-report text
  • Bone-related phenotype research context
  • Gold corpus evaluation workflow

Architecture

  • PhenoTagger / PhenoBERT / PhenoGPT model runners
  • Gold corpus loading and preprocessing
  • HPO annotation normalization
  • Prediction-to-ground-truth comparator
  • Precision, recall, and F1 metric pipeline
  • Semantic similarity scoring
  • False-positive / false-negative error analysis
  • React evaluation dashboard
  • Flask REST API
  • PostgreSQL / SQLite results storage
  • Experiment tracking and reproducible reports
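
As an illustration of the normalization step, the sketch below shows one way a shared annotation record and an HPO identifier normalizer could look. The names `Annotation` and `normalize_hpo_id` are hypothetical, and the input variants handled here (lowercase prefix, underscore separator, missing zero padding) are assumptions about what tool outputs might contain, not a description of the project's actual code. The only fixed fact relied on is that canonical HPO identifiers have the form `HP:` followed by seven digits.

```python
import re
from dataclasses import dataclass

# Accepts an optional "HP" prefix with ":" or "_" and 1 to 7 digits.
# The accepted variants are assumptions made for this sketch.
_HPO_DIGITS = re.compile(r"^(?:HP[:_]?)?(\d{1,7})$", re.IGNORECASE)


@dataclass(frozen=True)
class Annotation:
    """Hypothetical shared record for gold and predicted annotations."""
    doc_id: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    hpo_id: str  # canonical form, e.g. "HP:0001250"


def normalize_hpo_id(raw: str) -> str:
    """Return the canonical 'HP:' + seven-digit form, or raise ValueError."""
    match = _HPO_DIGITS.match(raw.strip())
    if match is None:
        raise ValueError(f"not an HPO identifier: {raw!r}")
    return f"HP:{int(match.group(1)):07d}"


record = Annotation("case1", 10, 28, normalize_hpo_id("hp_0002757"))
```

Normalizing every tool's output into one record type before comparison keeps the comparator independent of each runner's native format.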

Core Features

  • Gold corpus ingestion
  • Baseline model evaluation
  • Multi-model output comparison
  • Precision / recall / F1 reporting
  • Semantic similarity analysis
  • False-positive and false-negative review
  • Boundary and synonym error inspection
  • Dashboard-ready evaluation summaries
  • Fine-tuning preparation workflow
  • Client-ready reporting
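
The false-positive, false-negative, and boundary review listed above can be sketched as a small span classifier. This is a minimal illustration under stated assumptions: the `Span` type, the `classify_errors` function, and the category names (including `concept` for an overlapping span with a different HPO ID) are hypothetical, and the greedy one-to-one matching is a simplification, not the project's documented method.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    hpo_id: str


def overlaps(a: Span, b: Span) -> bool:
    """True when the two character ranges share at least one position."""
    return a.start < b.end and b.start < a.end


def classify_errors(gold, pred):
    """Sort predictions for one document into match and error categories."""
    gold_left = set(gold)
    report = {key: [] for key in (
        "exact", "boundary", "concept", "false_positive", "false_negative")}
    unmatched = []
    # First pass: exact matches on span and HPO ID.
    for p in sorted(set(pred)):
        if p in gold_left:
            gold_left.discard(p)
            report["exact"].append(p)
        else:
            unmatched.append(p)
    # Second pass: overlapping spans, preferring a gold span with the same ID.
    for p in unmatched:
        candidates = sorted(g for g in gold_left if overlaps(p, g))
        same_id = [g for g in candidates if g.hpo_id == p.hpo_id]
        if same_id:
            gold_left.discard(same_id[0])
            report["boundary"].append(p)
        elif candidates:
            gold_left.discard(candidates[0])
            report["concept"].append(p)
        else:
            report["false_positive"].append(p)
    # Gold spans that no prediction touched are misses.
    report["false_negative"] = sorted(gold_left)
    return report


# Illustrative offsets and identifiers only.
gold = [Span(10, 28, "HP:0002757"), Span(40, 52, "HP:0000939"),
        Span(60, 70, "HP:0001250")]
pred = [Span(10, 28, "HP:0002757"), Span(43, 52, "HP:0000939"),
        Span(80, 90, "HP:0004322")]
report = classify_errors(gold, pred)
```

Keeping boundary errors apart from plain false positives matters for review: a prediction with the right concept but a slightly different span is a different kind of failure from a spurious one.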

Evaluation Workflow

PhenoEval separated model execution from evaluation so each tool could be benchmarked through a shared comparison layer. Model outputs were normalized, matched against HPO gold annotations, scored using standard metrics, and reviewed for recurring error patterns.

  • Load annotated case reports
  • Normalize HPO labels and spans
  • Run phenotype recognition tools
  • Standardize model output format
  • Compare predictions with gold labels
  • Calculate precision, recall, and F1
  • Compute semantic similarity for near matches
  • Classify model errors
  • Generate dashboard/report outputs
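
The comparison and scoring steps above can be sketched as follows. This is a minimal example that assumes document-level matching on sets of HPO IDs with micro-averaging across documents; the source does not state which matching level or averaging scheme the project used, and `micro_prf` is a hypothetical name.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over per-document ID sets.

    gold and pred map a document ID to the set of HPO IDs annotated or
    predicted for that document.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        gold_ids = gold.get(doc_id, set())
        pred_ids = pred.get(doc_id, set())
        tp += len(gold_ids & pred_ids)   # predicted and in the gold set
        fp += len(pred_ids - gold_ids)   # predicted but not in the gold set
        fn += len(gold_ids - pred_ids)   # in the gold set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "tp": tp, "fp": fp, "fn": fn}


# Illustrative identifiers only, not results from the project.
gold = {"case1": {"HP:0002757", "HP:0000939"}, "case2": {"HP:0001250"}}
pred = {"case1": {"HP:0002757", "HP:0004322"}, "case2": {"HP:0001250"}}
scores = micro_prf(gold, pred)
```

In this toy input there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3. Because every tool is scored through the same function, differences between models reflect their output and not differences in how they were scored.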

Validation and Research Constraints

  • No clinical deployment scope
  • No new clinical annotation creation
  • Evaluation limited to available gold corpus
  • Model performance depends on annotation consistency
  • False positives and false negatives reviewed separately
  • Semantic similarity used to capture near-miss predictions
  • Fine-tuning treated as an optional extension, time permitting
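
To show how semantic similarity can give partial credit to near-miss predictions, the sketch below scores two ontology terms by the Jaccard overlap of their ancestor sets. The source does not say which similarity measure the project used, so this is one common ontology-based choice offered as an assumption. The `parents` map is a toy hierarchy with made-up term names, not real HPO structure.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors in the hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard_similarity(a, b, parents):
    """Shared ancestors divided by all ancestors of either term."""
    anc_a = ancestors(a, parents)
    anc_b = ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Toy hierarchy (child -> list of parents); names are invented for illustration.
parents = {
    "bone_abnormality": ["root"],
    "fracture": ["bone_abnormality"],
    "recurrent_fracture": ["fracture"],
    "seizure": ["root"],
}

near_miss = jaccard_similarity("recurrent_fracture", "fracture", parents)
unrelated = jaccard_similarity("recurrent_fracture", "seizure", parents)
```

Here the parent-child pair scores 0.75, while the pair that shares only the root scores 0.2, so a prediction that lands on a closely related term is distinguished from an unrelated one even though both count as errors under exact matching.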

Tech Stack

Biomedical NLP, HPO, PhenoTagger, PhenoBERT, PhenoGPT, Transformers, PyTorch, TensorFlow, scikit-learn, pandas, spaCy, Python, React, Flask, PostgreSQL/SQLite, FAISS/Pinecone-ready vector layer, GitHub Actions

Current Status

Completed as a UNSW capstone project. The delivered scope covered the evaluation workflow, benchmark design, and client-facing reporting.

Next Steps

  • Add stronger visual error-analysis dashboards
  • Package model runners behind a unified API
  • Add more semantic similarity explainability
  • Extend evaluation to additional phenotype tools
  • Document fine-tuning experiments and result deltas
  • Add architecture diagrams and sample annotated cases