๐Ÿฅ MedVidBench Leaderboard

Interactive leaderboard for evaluating Video-Language Models on the MedVidBench benchmark, which spans 8 medical video understanding tasks across 8 surgical datasets.

📄 Paper: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
🌐 Project: yuhaosu.github.io/MedGRPO
💾 Dataset: huggingface.co/datasets/UIIAmerica/MedVidBench
💻 GitHub: github.com/YuhaoSu/MedGRPO

Current Rankings

The leaderboard ranks all submitted models by their performance on 10 metrics covering the 8 medical video understanding tasks.
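As an illustration of how such a ranking can be computed, here is a minimal sketch that averages a model's per-metric scores and sorts descending. The aggregation scheme and all names are assumptions for illustration; the benchmark defines the actual scoring.

```python
# Hypothetical sketch: rank models by the mean of their per-metric scores.
# Metric names below are a subset of the leaderboard's 10 metrics; the
# averaging scheme is an illustrative assumption, not the official one.

def rank_models(results: dict) -> list:
    """Sort (model, mean_score) pairs by mean score, highest first."""
    scored = [
        (model, sum(metrics.values()) / len(metrics))
        for model, metrics in results.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

results = {
    "model-a": {"DVC_llm": 0.42, "VS_llm": 0.55, "RC_llm": 0.61},
    "model-b": {"DVC_llm": 0.50, "VS_llm": 0.48, "RC_llm": 0.40},
}
print(rank_models(results))
```

In practice the real leaderboard would read scores from submitted result files rather than an in-memory dict.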

Note: Models whose caption metrics (DVC_llm, VS_llm, RC_llm) are all 0.0 can be re-evaluated with the LLM judge using the section below.

Leaderboard Rankings


🤖 Run LLM Judge Evaluation

If a model was submitted with --skip-llm-judge (so its caption metrics are 0.0), you can run the LLM judge evaluation here. This computes the DVC_llm, VS_llm, and RC_llm scores using GPT-4.1/Gemini.

✅ Background Execution: The evaluation runs in the background, so you can close the browser and come back later.

Note: This feature is only available when ALL three caption metrics (DVC_llm, VS_llm, RC_llm) are 0.0.