MedVidBench Leaderboard
Interactive leaderboard for evaluating Video-Language Models on the MedVidBench benchmark - 8 medical video understanding tasks across 8 surgical datasets.
Paper: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding | Project: yuhaosu.github.io/MedGRPO | Dataset: huggingface.co/datasets/UIIAmerica/MedVidBench | GitHub: github.com/YuhaoSu/MedGRPO
Current Rankings
Submit Your Model Results
Upload your model's predictions on the MedVidBench test set (6,245 samples) to be added to the leaderboard. Only predictions are required; no ground truth is needed.
Requirements
- Run inference on the full test set (download from HuggingFace; a loading sketch follows this list)
- Upload predictions JSON in the format below (NO ground truth needed)
- Provide model info (name, organization)
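As a rough illustration of the first two requirements, the sketch below loads the test split and writes a predictions file in the format described in the next section. The column names (`video`, `question`) and `run_my_model` are placeholders, not the official schema or API; check the HuggingFace dataset card for the actual field names.

```python
# Sketch only: column names ("video", "question") and run_my_model are
# placeholders; consult the MedVidBench dataset card for the real schema.
import json
from datasets import load_dataset

test_set = load_dataset("UIIAmerica/MedVidBench", split="test")  # 6,245 samples

predictions = []
for sample in test_set:
    # run_my_model stands in for your own inference code.
    answer = run_my_model(video=sample["video"], question=sample["question"])
    predictions.append({
        "id": sample["id"],            # keep the identifier exactly as provided
        "qa_type": sample["qa_type"],  # task type from the test data
        "prediction": answer,          # your model's text output
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```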
Expected File Format
Your predictions JSON should contain 6,245 samples with this structure:
[
{
"id": "video_id&&start&&end&&fps",
"qa_type": "tal",
"prediction": "Your model's answer here"
},
{
"id": "another_video&&0&&10&&1.0",
"qa_type": "video_summary",
"prediction": "The surgeon performs..."
}
]
Required fields:
- id: Sample identifier (matches the test data from the HuggingFace dataset)
- qa_type: Task type (tal / stg / next_action / dense_captioning / video_summary / region_caption / skill_assessment / cvs_assessment)
- prediction: Your model's answer (text output)
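The id packs several values with `&&` separators, as in the examples above. A hypothetical helper for splitting it (the exact semantics of start/end/fps beyond the pattern shown are an assumption):

```python
# Hypothetical helper: splits the "video_id&&start&&end&&fps" pattern shown
# above; the precise meaning of start/end/fps is an assumption here.
def parse_sample_id(sample_id: str) -> dict:
    video_id, start, end, fps = sample_id.split("&&")
    return {"video_id": video_id, "start": float(start), "end": float(end), "fps": float(fps)}

print(parse_sample_id("another_video&&0&&10&&1.0"))
# {'video_id': 'another_video', 'start': 0.0, 'end': 10.0, 'fps': 1.0}
```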
Important:
- Submit predictions only (no ground truth needed)
- Must include all 6,245 test samples
- The file can be a list or a dict (dict values will be extracted)
- Do NOT include ground truth fields (the server handles these securely)
A minimal format check is sketched below.
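This pre-submission check assumes the list format described above; adapt the loading step if you use the dict format.

```python
# Minimal pre-submission check for the predictions file described above.
import json

ALLOWED_QA_TYPES = {
    "tal", "stg", "next_action", "dense_captioning",
    "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
}

def check_predictions(path: str, expected_count: int = 6245) -> None:
    with open(path) as f:
        preds = json.load(f)
    if isinstance(preds, dict):
        preds = list(preds.values())           # dict values are the samples
    assert len(preds) == expected_count, f"expected {expected_count} samples, got {len(preds)}"
    for i, p in enumerate(preds):
        missing = {"id", "qa_type", "prediction"} - p.keys()
        assert not missing, f"sample {i} is missing fields: {missing}"
        assert p["qa_type"] in ALLOWED_QA_TYPES, f"sample {i} has unknown qa_type {p['qa_type']!r}"
    print("Predictions file looks valid.")

check_predictions("predictions.json")
```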
Evaluation Process
After upload, the system will:
- Validate your predictions file format
- Merge your predictions with server-side ground truth (private)
- Run evaluation for all 8 tasks across 10 metrics
- Add to leaderboard if successful
Evaluation takes ~5-10 minutes (including the LLM judge pass for caption quality assessment).
Security: Ground truth data is stored privately and never exposed to users.
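For intuition only, here is a sketch of the merge step described above: predictions are joined to the private ground truth by sample id before the per-task metrics run. This is not the actual server code.

```python
# Illustration only, not the actual server code: predictions are joined to
# the private ground truth by sample id before evaluation.
import json

def merge_with_ground_truth(pred_path: str, gt_path: str) -> list:
    with open(pred_path) as f:
        preds = json.load(f)
    if isinstance(preds, dict):
        preds = list(preds.values())            # dict submissions: use the values
    preds_by_id = {p["id"]: p["prediction"] for p in preds}

    with open(gt_path) as f:
        ground_truth = json.load(f)             # private file, never exposed

    merged = []
    for gt in ground_truth:
        if gt["id"] not in preds_by_id:
            raise ValueError(f"missing prediction for sample {gt['id']}")
        merged.append({**gt, "prediction": preds_by_id[gt["id"]]})
    return merged
```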
MedVidBench Benchmark Tasks
The benchmark evaluates models across 8 diverse tasks spanning video, segment, and frame-level understanding:
| Task | qa_type | Metrics | Description |
|------|---------|---------|-------------|
| Temporal Action Localization (TAL) | tal | TAG_mIoU@0.3, TAG_mIoU@0.5 | Identify and temporally localize surgical actions in video |
| Spatiotemporal Grounding (STG) | stg | STG_mIoU | Localize objects in both space (bounding box) and time (temporal span) |
| Next Action Prediction (NAP) | next_action | NAP_acc | Predict the next surgical step given the current video context |
| Dense Video Captioning (DVC) | dense_captioning | DVC_llm, DVC_F1 | Generate captions for multiple events with temporal localization |
| Video Summary (VS) | video_summary | VS_llm | Generate a comprehensive summary of the surgical procedure |
| Region Caption (RC) | region_caption | RC_llm | Describe specific spatial regions in surgical frames |
| Skill Assessment (SA) | skill_assessment | SA_acc | Evaluate surgeon skill level (novice/intermediate/expert) |
| CVS Assessment | cvs_assessment | CVS_acc | Assess Critical View of Safety (CVS) criteria in laparoscopic cholecystectomy |
Evaluation Metrics
- TAL (Temporal Action Localization): TAG_mIoU@0.3 / TAG_mIoU@0.5 - temporal localization quality at IoU thresholds 0.3 and 0.5 (a minimal IoU sketch follows this list)
- STG (Spatiotemporal Grounding): mIoU - mean Intersection over Union (spatial + temporal)
- Next Action: Accuracy - Classification accuracy
- DVC (Dense Video Captioning): LLM Judge (DVC_llm) - GPT-4.1/Gemini rubric scoring averaged over the 5 aspects below, plus DVC_F1
- VS (Video Summary): LLM Judge (VS_llm) - GPT-4.1/Gemini rubric scoring averaged over the 5 aspects below
- RC (Region Caption): LLM Judge (RC_llm) - GPT-4.1/Gemini rubric scoring averaged over the 5 aspects below
- Skill Assessment: Accuracy - Surgical skill level classification (JIGSAWS)
- CVS Assessment: Accuracy - Critical View of Safety criteria scoring (Cholec80_CVS)
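For the localization-style metrics, the sketch below shows temporal IoU over [start, end] segments (in seconds) and one common reading of the "@threshold" aggregate. The official evaluation scripts in the MedGRPO repository are authoritative and may aggregate differently.

```python
# Sketch of temporal IoU for [start, end] segments (seconds). The exact
# aggregation behind TAG_mIoU@0.3 / 0.5 is defined by the official scripts;
# the thresholded average below is one common reading.
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two [start, end] segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def score_at_threshold(pairs, threshold: float) -> float:
    """Fraction of (pred, gt) pairs whose temporal IoU meets the threshold."""
    hits = [temporal_iou(p, g) >= threshold for p, g in pairs]
    return sum(hits) / len(hits) if hits else 0.0

# One well-localized and one poorly localized prediction at threshold 0.5:
pairs = [((2.0, 8.0), (3.0, 9.0)), ((0.0, 1.0), (5.0, 6.0))]
print(round(temporal_iou(*pairs[0]), 3))   # 0.714
print(score_at_threshold(pairs, 0.5))      # 0.5
```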
LLM Judge Details
Caption tasks (DVC, VS, RC) use GPT-4.1 or Gemini-Pro with rubric-based scoring (1-5 scale) across 5 key aspects:
- R2: Relevance & Medical Terminology
- R4: Actionable Surgical Actions
- R5: Comprehensive Detail Level
- R7: Anatomical & Instrument Precision
- R8: Clinical Context & Coherence
The final score is the average across these 5 aspects.
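A minimal sketch of that aggregation, assuming the judge returns one integer score (1-5) per aspect; the judge prompt and response parsing are not shown here.

```python
# Minimal aggregation sketch, assuming one integer 1-5 score per rubric aspect.
RUBRIC_ASPECTS = ("R2", "R4", "R5", "R7", "R8")

def aggregate_judge_scores(scores: dict) -> float:
    """Average the five rubric aspects into a single 1-5 caption score."""
    return sum(scores[a] for a in RUBRIC_ASPECTS) / len(RUBRIC_ASPECTS)

print(aggregate_judge_scores({"R2": 4, "R4": 3, "R5": 5, "R7": 4, "R8": 4}))  # 4.0
```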
Test Set Statistics
- Total samples: 6,245
- Source datasets: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- Video frames: ~103,742
- Task distribution:
- TAL: ~800 samples
- STG: ~900 samples
- Next Action: ~700 samples
- DVC: ~800 samples
- VS: ~900 samples
- RC: ~1000 samples
- Skill Assessment: ~600 samples
- CVS Assessment: ~545 samples
About MedVidBench
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding. It was introduced in the MedGRPO paper (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding).
Key Features
- 8 diverse tasks covering multiple levels of video understanding
- 8 source datasets from various surgical procedures
- 6,245 test samples with high-quality annotations
- Automatic evaluation with standardized metrics
- LLM-based judging for caption quality assessment
Paper
@article{su2025medgrpo,
  title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  journal={arXiv preprint arXiv:2512.06581},
  year={2025}
}
Links
- Paper: https://arxiv.org/abs/2512.06581
- Project Page: https://yuhaosu.github.io/MedGRPO/
- Dataset: https://huggingface.co/datasets/UIIAmerica/MedVidBench
- GitHub: https://github.com/YuhaoSu/MedGRPO
- Leaderboard: https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard
Dataset
The MedVidBench benchmark includes:
- 21,060 training samples
- 6,245 test samples
- Multi-modal annotations (video, text, temporal spans, bounding boxes)
- 8 source datasets covering various medical procedures
License
- Dataset: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
- Leaderboard Code: Apache 2.0
- Evaluation Scripts: MIT
Contact
For questions or issues:
- Open an issue on GitHub
- Visit the project page
- Email: Contact via GitHub