FusionBench

A Comprehensive Benchmark for QA under Ambiguity and Heterogeneity.

About FusionBench

FusionBench is a comprehensive benchmark designed to evaluate the question answering (QA) capabilities of large language models (LLMs) under both ambiguity and heterogeneity. It contains 7,179 QA pairs, with approximately 15.8K text snippets, 137.7K table entries, and 198.1K triples overall.

⚙ This website is modified from Bird-Bench.
Leaderboard - Evidence-aware QA with Multiple Answers
Date Model AR AP F1^A ER EP F1^E EAR EAP F1^EA
Mar 25, 2025 Gemini-2.5-Pro w/ R 0.810 0.815 0.812 0.882 0.851 0.866 0.782 0.783 0.783
Mar 25, 2025 Gemini-2.5-Pro 0.750 0.823 0.785 0.785 0.897 0.837 0.704 0.747 0.725
Jan 22, 2025 DeepSeek-R1 w/ R 0.741 0.812 0.775 0.830 0.929 0.877 0.706 0.768 0.736
Apr 18, 2024 LLaMA-3-70B-Instruct 0.730 0.793 0.760 0.834 0.917 0.874 0.689 0.752 0.719
Mar 25, 2025 Gemini-2.5-Flash 0.750 0.738 0.744 0.834 0.895 0.863 0.728 0.716 0.722
Aug 08, 2025 Qwen-Plus w/ R 0.719 0.780 0.748 0.872 0.947 0.908 0.691 0.745 0.717
Nov 6, 2024 DeepSeek-V3 0.720 0.771 0.745 0.855 0.939 0.895 0.688 0.737 0.712
Aug 08, 2025 Qwen-Plus 0.702 0.797 0.746 0.816 0.925 0.867 0.674 0.754 0.712
Jul 18, 2024 GPT-4o-Mini 0.686 0.796 0.737 0.783 0.933 0.851 0.650 0.765 0.703
Mar 13, 2024 GPT-4o 0.669 0.814 0.734 0.761 0.926 0.835 0.636 0.766 0.695
Jun 6, 2024 GLM-4-Air 0.660 0.770 0.711 0.775 0.927 0.844 0.625 0.726 0.672
Aug 07, 2025 Qwen3-8B 0.612 0.782 0.687 0.767 0.966 0.855 0.598 0.766 0.672
Aug 07, 2025 Qwen3-32B 0.626 0.771 0.691 0.789 0.975 0.872 0.612 0.753 0.675
Jun 6, 2024 GLM-4-Plus 0.609 0.799 0.691 0.711 0.952 0.814 0.579 0.761 0.657
Apr 18, 2024 LLaMA-3-8B-Instruct 0.608 0.705 0.653 0.727 0.830 0.775 0.546 0.634 0.587
Jun 6, 2024 GLM-4-9B 0.542 0.746 0.628 0.498 0.857 0.630 0.472 0.649 0.547
Leaderboard - Evidence-aware QA with Unique Answer
Date Model AR AP F1^A ER EP F1^E EAR EAP F1^EA
Jan 21, 2026 Qwen-Plus w/ R 0.838 0.626 0.717 0.902 0.864 0.883 0.779 0.585 0.668
Jan 21, 2026 DeepSeek-R1 w/ R 0.801 0.619 0.698 0.879 0.837 0.857 0.747 0.582 0.655
Jan 21, 2026 GLM-4-Plus 0.829 0.629 0.715 0.871 0.825 0.847 0.747 0.569 0.646
Jan 21, 2026 Qwen-Plus 0.839 0.574 0.682 0.925 0.813 0.865 0.785 0.536 0.637
Jan 21, 2026 GPT-4o 0.845 0.624 0.718 0.860 0.828 0.844 0.742 0.553 0.634
Jan 21, 2026 Gemini-2.5-Pro w/ R 0.821 0.598 0.692 0.839 0.768 0.802 0.746 0.540 0.626
Jan 21, 2026 Gemini-2.5-Pro 0.860 0.600 0.707 0.871 0.793 0.830 0.764 0.526 0.623
Jan 21, 2026 Gemini-2.5-Flash 0.821 0.587 0.681 0.827 0.758 0.791 0.729 0.523 0.609
Jan 21, 2026 DeepSeek-V3 0.828 0.556 0.665 0.878 0.789 0.831 0.753 0.510 0.608
Jan 21, 2026 Qwen3-32B 0.864 0.534 0.660 0.900 0.798 0.846 0.790 0.493 0.607
Jan 21, 2026 Qwen3-8B 0.831 0.586 0.687 0.808 0.748 0.777 0.706 0.503 0.587
Jan 21, 2026 LLaMA-3-70B-Instruct 0.763 0.614 0.680 0.814 0.729 0.769 0.650 0.521 0.578
Jan 21, 2026 GPT-4o-Mini 0.756 0.532 0.625 0.803 0.737 0.769 0.640 0.460 0.535
Jan 21, 2026 GLM-4-Air 0.791 0.543 0.644 0.746 0.686 0.715 0.642 0.447 0.527
Jan 21, 2026 GLM-4-9B 0.738 0.472 0.576 0.624 0.501 0.556 0.503 0.303 0.378
Jan 21, 2026 LLaMA-3-8B-Instruct 0.535 0.288 0.374 0.598 0.351 0.442 0.399 0.198 0.265
Leaderboard - RAG-based QA
Date Model AR AP F1^A ER EP F1^E EAR EAP F1^EA
Mar 25, 2025 Gemini-2.5-Pro w/ R 0.738 0.798 0.767 0.722 0.802 0.760 0.662 0.693 0.677
Mar 25, 2025 Gemini-2.5-Pro 0.724 0.753 0.738 0.722 0.790 0.754 0.649 0.657 0.653
Mar 25, 2025 Gemini-2.5-Flash 0.682 0.777 0.726 0.654 0.743 0.696 0.598 0.632 0.614
Mar 25, 2025 DeepSeek-V3 0.669 0.761 0.712 0.674 0.784 0.725 0.580 0.654 0.615
Jan 22, 2025 DeepSeek-R1 w/ R 0.638 0.763 0.695 0.660 0.803 0.725 0.578 0.682 0.626
Apr 18, 2024 LLaMA-3-70B-Instruct 0.665 0.744 0.702 0.649 0.741 0.692 0.572 0.633 0.601
Mar 13, 2024 GPT-4o 0.607 0.799 0.690 0.593 0.755 0.664 0.529 0.666 0.590
Jul 18, 2024 GPT-4o-Mini 0.604 0.772 0.678 0.597 0.781 0.677 0.525 0.674 0.590
Aug 08, 2025 Qwen-Plus 0.624 0.778 0.693 0.656 0.808 0.724 0.562 0.680 0.615
Aug 08, 2025 Qwen-Plus w/ R 0.584 0.801 0.676 0.618 0.846 0.714 0.537 0.724 0.617
Aug 07, 2025 Qwen3-32B 0.570 0.801 0.666 0.597 0.827 0.693 0.510 0.711 0.594
Aug 07, 2025 Qwen3-8B 0.581 0.785 0.668 0.605 0.803 0.690 0.515 0.689 0.589
Jun 6, 2024 GLM-4-Air 0.599 0.759 0.670 0.607 0.770 0.679 0.517 0.637 0.571
Jun 6, 2024 GLM-4-Plus 0.552 0.798 0.653 0.554 0.805 0.656 0.481 0.685 0.565
Jun 6, 2024 GLM-4-9B 0.488 0.752 0.592 0.480 0.755 0.587 0.403 0.614 0.487
Apr 18, 2024 LLaMA-3-8B-Instruct 0.567 0.649 0.605 0.549 0.633 0.588 0.454 0.522 0.486
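
This page does not define the metric columns. As a reading aid, the sketch below assumes that each R/P pair (AR/AP, ER/EP, EAR/EAP) is a recall and a precision, and that the matching F1 column is their harmonic mean. That interpretation is an assumption rather than something stated here, although it reproduces the reported values to three decimals for the row checked. The helper name `f1_score` and the dictionary layout are illustrative only.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; returns 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Values copied from the "Gemini-2.5-Pro w/ R" row of the RAG-based QA leaderboard.
row = {"AR": 0.738, "AP": 0.798, "ER": 0.722, "EP": 0.802, "EAR": 0.662, "EAP": 0.693}
reported = {"F1^A": 0.767, "F1^E": 0.760, "F1^EA": 0.677}

# Assumed pairing of recall/precision columns with their F1 column.
computed = {
    "F1^A": f1_score(row["AR"], row["AP"]),
    "F1^E": f1_score(row["ER"], row["EP"]),
    "F1^EA": f1_score(row["EAR"], row["EAP"]),
}

# True for each column if the recomputed value agrees with the table
# up to the rounding of three-decimal inputs.
matches = {name: abs(computed[name] - reported[name]) < 1e-3 for name in reported}
```

The tolerance of 1e-3 allows for the fact that the inputs are themselves rounded to three decimals, so recomputed values can differ slightly from the published ones.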