MedVidBench Leaderboard
Interactive leaderboard for evaluating Video-Language Models on the MedVidBench benchmark - 8 medical video understanding tasks across 8 surgical datasets.
Paper: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding | Project: yuhaosu.github.io/MedGRPO | Dataset: huggingface.co/datasets/UIIAmerica/MedVidBench | GitHub: github.com/YuhaoSu/MedGRPO
Current Rankings
Submit Your Model Results
Upload your model's predictions on the MedVidBench test set (6,245 samples) to be added to the leaderboard. Only predictions are required; no ground truth is needed.
Requirements
- Run inference on the full test set (download from HuggingFace; a loading sketch follows this list)
- Upload predictions JSON in the format below (NO ground truth needed)
- Provide model info (name, organization)
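As a rough illustration of the first two requirements, the sketch below loads the test split and writes a predictions file in the format described in the next section. The column names (`video`, `question`) and `run_my_model` are placeholders, not the official schema or API; check the HuggingFace dataset card for the actual field names.

```python
# Sketch only: column names ("video", "question") and run_my_model are
# placeholders; consult the MedVidBench dataset card for the real schema.
import json
from datasets import load_dataset

test_set = load_dataset("UIIAmerica/MedVidBench", split="test")  # 6,245 samples

predictions = []
for sample in test_set:
    # run_my_model stands in for your own inference code.
    answer = run_my_model(video=sample["video"], question=sample["question"])
    predictions.append({
        "id": sample["id"],            # keep the identifier exactly as provided
        "qa_type": sample["qa_type"],  # task type from the test data
        "prediction": answer,          # your model's text output
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```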
Expected File Format
Your predictions JSON should contain 6,245 samples with this structure:
[
{
"id": "video_id&&start&&end&&fps",
"qa_type": "tal",
"prediction": "Your model's answer here"
},
{
"id": "another_video&&0&&10&&1.0",
"qa_type": "video_summary",
"prediction": "The surgeon performs..."
}
]
Required fields:
- id: Sample identifier (matches the test data from the HuggingFace dataset)
- qa_type: Task type (tal / stg / next_action / dense_captioning / video_summary / region_caption / skill_assessment / cvs_assessment)
- prediction: Your model's answer (text output)
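The id packs several values with `&&` separators, as in the examples above. A hypothetical helper for splitting it (the exact semantics of start/end/fps beyond the pattern shown are an assumption):

```python
# Hypothetical helper: splits the "video_id&&start&&end&&fps" pattern shown
# above; the precise meaning of start/end/fps is an assumption here.
def parse_sample_id(sample_id: str) -> dict:
    video_id, start, end, fps = sample_id.split("&&")
    return {"video_id": video_id, "start": float(start), "end": float(end), "fps": float(fps)}

print(parse_sample_id("another_video&&0&&10&&1.0"))
# {'video_id': 'another_video', 'start': 0.0, 'end': 10.0, 'fps': 1.0}
```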
Important:
- Submit predictions only (no ground truth needed)
- Must include all 6,245 test samples
- The file can be a list or a dict (dict values will be extracted)
- Do NOT include ground truth fields (the server handles these securely)
A minimal format check is sketched below.
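This pre-submission check assumes the list format described above; adapt the loading step if you use the dict format.

```python
# Minimal pre-submission check for the predictions file described above.
import json

ALLOWED_QA_TYPES = {
    "tal", "stg", "next_action", "dense_captioning",
    "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
}

def check_predictions(path: str, expected_count: int = 6245) -> None:
    with open(path) as f:
        preds = json.load(f)
    if isinstance(preds, dict):
        preds = list(preds.values())           # dict values are the samples
    assert len(preds) == expected_count, f"expected {expected_count} samples, got {len(preds)}"
    for i, p in enumerate(preds):
        missing = {"id", "qa_type", "prediction"} - p.keys()
        assert not missing, f"sample {i} is missing fields: {missing}"
        assert p["qa_type"] in ALLOWED_QA_TYPES, f"sample {i} has unknown qa_type {p['qa_type']!r}"
    print("Predictions file looks valid.")

check_predictions("predictions.json")
```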
Evaluation Process
After upload, the system will:
- Validate your predictions file format
- Merge your predictions with server-side ground truth (private)
- Run evaluation for all 8 tasks across 10 metrics
- Add to leaderboard if successful
Evaluation takes ~5-10 minutes (including the LLM judge pass for caption quality assessment).
Security: Ground truth data is stored privately and never exposed to users.
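For intuition only, here is a sketch of the merge step described above: predictions are joined to the private ground truth by sample id before the per-task metrics run. This is not the actual server code.

```python
# Illustration only, not the actual server code: predictions are joined to
# the private ground truth by sample id before evaluation.
import json

def merge_with_ground_truth(pred_path: str, gt_path: str) -> list:
    with open(pred_path) as f:
        preds = json.load(f)
    if isinstance(preds, dict):
        preds = list(preds.values())            # dict submissions: use the values
    preds_by_id = {p["id"]: p["prediction"] for p in preds}

    with open(gt_path) as f:
        ground_truth = json.load(f)             # private file, never exposed

    merged = []
    for gt in ground_truth:
        if gt["id"] not in preds_by_id:
            raise ValueError(f"missing prediction for sample {gt['id']}")
        merged.append({**gt, "prediction": preds_by_id[gt["id"]]})
    return merged
```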
MedVidBench Benchmark Tasks
The benchmark evaluates models across 8 diverse tasks spanning video, segment, and frame-level understanding:
| Task | qa_type | Metrics | Description |
|------|---------|---------|-------------|
| Temporal Action Localization (TAL) | tal | TAG_mIoU@0.3, TAG_mIoU@0.5 | Identify and temporally localize surgical actions in video |
| Spatiotemporal Grounding (STG) | stg | STG_mIoU | Localize objects in both space (bounding box) and time (temporal span) |
| Next Action Prediction (NAP) | next_action | NAP_acc | Predict the next surgical step given the current video context |
| Dense Video Captioning (DVC) | dense_captioning | DVC_llm, DVC_F1 | Generate captions for multiple events with temporal localization |
| Video Summary (VS) | video_summary | VS_llm | Generate a comprehensive summary of the surgical procedure |
| Region Caption (RC) | region_caption | RC_llm | Describe specific spatial regions in surgical frames |
| Skill Assessment (SA) | skill_assessment | SA_acc | Evaluate surgeon skill level (novice/intermediate/expert) |
| CVS Assessment | cvs_assessment | CVS_acc | Assess Critical View of Safety (CVS) criteria in laparoscopic cholecystectomy |
Evaluation Metrics
- TAL (Temporal Action Localization): TAG_mIoU@0.3 / TAG_mIoU@0.5 - temporal localization quality at IoU thresholds 0.3 and 0.5 (a minimal IoU sketch follows this list)
- STG (Spatiotemporal Grounding): mIoU - mean Intersection over Union (spatial + temporal)
- Next Action: Accuracy - Classification accuracy
- DVC (Dense Video Captioning): LLM Judge (DVC_llm) - GPT-4.1/Gemini rubric scoring averaged over the 5 aspects below, plus DVC_F1
- VS (Video Summary): LLM Judge (VS_llm) - GPT-4.1/Gemini rubric scoring averaged over the 5 aspects below
- RC (Region Caption): LLM Judge (RC_llm) - GPT-4.1/Gemini rubric scoring averaged over the 5 aspects below
- Skill Assessment: Accuracy - Surgical skill level classification (JIGSAWS)
- CVS Assessment: Accuracy - Critical View of Safety criteria scoring (Cholec80_CVS)
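For the localization-style metrics, the sketch below shows temporal IoU over [start, end] segments (in seconds) and one common reading of the "@threshold" aggregate. The official evaluation scripts in the MedGRPO repository are authoritative and may aggregate differently.

```python
# Sketch of temporal IoU for [start, end] segments (seconds). The exact
# aggregation behind TAG_mIoU@0.3 / 0.5 is defined by the official scripts;
# the thresholded average below is one common reading.
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two [start, end] segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def score_at_threshold(pairs, threshold: float) -> float:
    """Fraction of (pred, gt) pairs whose temporal IoU meets the threshold."""
    hits = [temporal_iou(p, g) >= threshold for p, g in pairs]
    return sum(hits) / len(hits) if hits else 0.0

# One well-localized and one poorly localized prediction at threshold 0.5:
pairs = [((2.0, 8.0), (3.0, 9.0)), ((0.0, 1.0), (5.0, 6.0))]
print(round(temporal_iou(*pairs[0]), 3))   # 0.714
print(score_at_threshold(pairs, 0.5))      # 0.5
```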
LLM Judge Details
Caption tasks (DVC, VS, RC) use GPT-4.1 or Gemini-Pro with rubric-based scoring (1-5 scale) across 5 key aspects:
- R2: Relevance & Medical Terminology
- R4: Actionable Surgical Actions
- R5: Comprehensive Detail Level
- R7: Anatomical & Instrument Precision
- R8: Clinical Context & Coherence
The final score is the average across these 5 aspects.
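A minimal sketch of that aggregation, assuming the judge returns one integer score (1-5) per aspect; the judge prompt and response parsing are not shown here.

```python
# Minimal aggregation sketch, assuming one integer 1-5 score per rubric aspect.
RUBRIC_ASPECTS = ("R2", "R4", "R5", "R7", "R8")

def aggregate_judge_scores(scores: dict) -> float:
    """Average the five rubric aspects into a single 1-5 caption score."""
    return sum(scores[a] for a in RUBRIC_ASPECTS) / len(RUBRIC_ASPECTS)

print(aggregate_judge_scores({"R2": 4, "R4": 3, "R5": 5, "R7": 4, "R8": 4}))  # 4.0
```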
Test Set Statistics
- Total samples: 6,245
- Source datasets: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- Video frames: ~103,742
- Task distribution:
- TAL: ~800 samples
- STG: ~900 samples
- Next Action: ~700 samples
- DVC: ~800 samples
- VS: ~900 samples
- RC: ~1000 samples
- Skill Assessment: ~600 samples
- CVS Assessment: ~545 samples
About MedVidBench
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding. It was introduced in the MedGRPO paper (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding).
Key Features
- 8 diverse tasks covering multiple levels of video understanding
- 8 source datasets from various surgical procedures
- 6,245 test samples with high-quality annotations
- Automatic evaluation with standardized metrics
- LLM-based judging for caption quality assessment
Paper
@article{su2025medgrpo,
  title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  journal={arXiv preprint arXiv:2512.06581},
  year={2025}
}
Links
- Paper: https://arxiv.org/abs/2512.06581
- Project Page: https://yuhaosu.github.io/MedGRPO/
- Dataset: https://huggingface.co/datasets/UIIAmerica/MedVidBench
- GitHub: https://github.com/YuhaoSu/MedGRPO
- Leaderboard: https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard
Dataset
The MedVidBench benchmark includes:
- 21,060 training samples
- 6,245 test samples
- Multi-modal annotations (video, text, temporal spans, bounding boxes)
- 8 source datasets covering various medical procedures
License
- Dataset: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
- Leaderboard Code: Apache 2.0
- Evaluation Scripts: MIT
Contact
For questions or issues:
- Open an issue on GitHub
- Visit the project page
- Email: Contact via GitHub