About FusionBench
FusionBench is a comprehensive and challenging benchmark designed to evaluate the Question Answering (QA) capabilities of large language models (LLMs) under both ambiguity and heterogeneity. It comprises 7,179 QA pairs, together with approximately 15.8K text snippets, 137.7K table entries, and 198.1K triples overall.
⚙ This website is adapted from Bird-Bench.
Leaderboard - Evidence-aware QA with Multiple Answers
| Date | Model | AR | AP | F1^A | ER | EP | F1^E | EAR | EAP | F1^EA |
|---|---|---|---|---|---|---|---|---|---|---|
| Mar 25, 2025 | Gemini-2.5-Pro w/ R | 0.810 | 0.815 | 0.812 | 0.882 | 0.851 | 0.866 | 0.782 | 0.783 | 0.783 |
| Mar 25, 2025 | Gemini-2.5-Pro | 0.750 | 0.823 | 0.785 | 0.785 | 0.897 | 0.837 | 0.704 | 0.747 | 0.725 |
| Jan 22, 2025 | DeepSeek-R1 w/ R | 0.741 | 0.812 | 0.775 | 0.830 | 0.929 | 0.877 | 0.706 | 0.768 | 0.736 |
| Apr 18, 2024 | LLaMA-3-70B-Instruct | 0.730 | 0.793 | 0.760 | 0.834 | 0.917 | 0.874 | 0.689 | 0.752 | 0.719 |
| Mar 25, 2025 | Gemini-2.5-Flash | 0.750 | 0.738 | 0.744 | 0.834 | 0.895 | 0.863 | 0.728 | 0.716 | 0.722 |
| Aug 08, 2025 | Qwen-Plus w/ R | 0.719 | 0.780 | 0.748 | 0.872 | 0.947 | 0.908 | 0.691 | 0.745 | 0.717 |
| Nov 6, 2024 | DeepSeek-V3 | 0.720 | 0.771 | 0.745 | 0.855 | 0.939 | 0.895 | 0.688 | 0.737 | 0.712 |
| Aug 08, 2025 | Qwen-Plus | 0.702 | 0.797 | 0.746 | 0.816 | 0.925 | 0.867 | 0.674 | 0.754 | 0.712 |
| July 18, 2024 | GPT-4o-Mini | 0.686 | 0.796 | 0.737 | 0.783 | 0.933 | 0.851 | 0.650 | 0.765 | 0.703 |
| May 13, 2024 | GPT-4o | 0.669 | 0.814 | 0.734 | 0.761 | 0.926 | 0.835 | 0.636 | 0.766 | 0.695 |
| Jun 6, 2024 | GLM-4-Air | 0.660 | 0.770 | 0.711 | 0.775 | 0.927 | 0.844 | 0.625 | 0.726 | 0.672 |
| Aug 07, 2025 | Qwen3-8B | 0.612 | 0.782 | 0.687 | 0.767 | 0.966 | 0.855 | 0.598 | 0.766 | 0.672 |
| Aug 07, 2025 | Qwen3-32B | 0.626 | 0.771 | 0.691 | 0.789 | 0.975 | 0.872 | 0.612 | 0.753 | 0.675 |
| Jun 6, 2024 | GLM-4-Plus | 0.609 | 0.799 | 0.691 | 0.711 | 0.952 | 0.814 | 0.579 | 0.761 | 0.657 |
| Apr 18, 2024 | LLaMA-3-8B-Instruct | 0.608 | 0.705 | 0.653 | 0.727 | 0.830 | 0.775 | 0.546 | 0.634 | 0.587 |
| Jun 6, 2024 | GLM-4-9B | 0.542 | 0.746 | 0.628 | 0.498 | 0.857 | 0.630 | 0.472 | 0.649 | 0.547 |
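In all three leaderboards, AR/AP, ER/EP, and EAR/EAP appear to denote recall and precision for the predicted answers, the retrieved evidence, and the joint evidence-answer pairs, respectively, with F1^A, F1^E, and F1^EA the corresponding harmonic means; the reported numbers are consistent with this reading. A minimal sketch, assuming that interpretation (the example row is Gemini-2.5-Pro from the table above):

```python
def f1(recall: float, precision: float) -> float:
    """Standard F1: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Gemini-2.5-Pro, multiple-answer leaderboard (values taken from the table above).
print(round(f1(0.750, 0.823), 3))  # F1^A  -> 0.785
print(round(f1(0.785, 0.897), 3))  # F1^E  -> 0.837
print(round(f1(0.704, 0.747), 3))  # F1^EA -> 0.725
```

Because the tabulated recall and precision are themselves rounded to three decimals, recomputing F1 this way can differ from the listed value in the last digit for some rows.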
Leaderboard - Evidence-aware QA with Unique Answer
| Date | Model | AR | AP | F1^A | ER | EP | F1^E | EAR | EAP | F1^EA |
|---|---|---|---|---|---|---|---|---|---|---|
| Jan 21, 2026 | Qwen-Plus w/ R | 0.838 | 0.626 | 0.717 | 0.902 | 0.864 | 0.883 | 0.779 | 0.585 | 0.668 |
| Jan 21, 2026 | DeepSeek-R1 w/ R | 0.801 | 0.619 | 0.698 | 0.879 | 0.837 | 0.857 | 0.747 | 0.582 | 0.655 |
| Jan 21, 2026 | GLM-4-Plus | 0.829 | 0.629 | 0.715 | 0.871 | 0.825 | 0.847 | 0.747 | 0.569 | 0.646 |
| Jan 21, 2026 | Qwen-Plus | 0.839 | 0.574 | 0.682 | 0.925 | 0.813 | 0.865 | 0.785 | 0.536 | 0.637 |
| Jan 21, 2026 | GPT-4o | 0.845 | 0.624 | 0.718 | 0.860 | 0.828 | 0.844 | 0.742 | 0.553 | 0.634 |
| Jan 21, 2026 | Gemini-2.5-Pro w/ R | 0.821 | 0.598 | 0.692 | 0.839 | 0.768 | 0.802 | 0.746 | 0.540 | 0.626 |
| Jan 21, 2026 | Gemini-2.5-Pro | 0.860 | 0.600 | 0.707 | 0.871 | 0.793 | 0.830 | 0.764 | 0.526 | 0.623 |
| Jan 21, 2026 | Gemini-2.5-Flash | 0.821 | 0.587 | 0.681 | 0.827 | 0.758 | 0.791 | 0.729 | 0.523 | 0.609 |
| Jan 21, 2026 | DeepSeek-V3 | 0.828 | 0.556 | 0.665 | 0.878 | 0.789 | 0.831 | 0.753 | 0.510 | 0.608 |
| Jan 21, 2026 | Qwen3-32B | 0.864 | 0.534 | 0.660 | 0.900 | 0.798 | 0.846 | 0.790 | 0.493 | 0.607 |
| Jan 21, 2026 | Qwen3-8B | 0.831 | 0.586 | 0.687 | 0.808 | 0.748 | 0.777 | 0.706 | 0.503 | 0.587 |
| Jan 21, 2026 | LLaMA-3-70B-Instruct | 0.763 | 0.614 | 0.680 | 0.814 | 0.729 | 0.769 | 0.650 | 0.521 | 0.578 |
| Jan 21, 2026 | GPT-4o-Mini | 0.756 | 0.532 | 0.625 | 0.803 | 0.737 | 0.769 | 0.640 | 0.460 | 0.535 |
| Jan 21, 2026 | GLM-4-Air | 0.791 | 0.543 | 0.644 | 0.746 | 0.686 | 0.715 | 0.642 | 0.447 | 0.527 |
| Jan 21, 2026 | GLM-4-9B | 0.738 | 0.472 | 0.576 | 0.624 | 0.501 | 0.556 | 0.503 | 0.303 | 0.378 |
| Jan 21, 2026 | LLaMA-3-8B-Instruct | 0.535 | 0.288 | 0.374 | 0.598 | 0.351 | 0.442 | 0.399 | 0.198 | 0.265 |
Leaderboard - RAG-based QA
| Date | Model | AR | AP | F1^A | ER | EP | F1^E | EAR | EAP | F1^EA |
|---|---|---|---|---|---|---|---|---|---|---|
| Mar 25, 2025 | Gemini-2.5-Pro w/ R | 0.738 | 0.798 | 0.767 | 0.722 | 0.802 | 0.760 | 0.662 | 0.693 | 0.677 |
| Mar 25, 2025 | Gemini-2.5-Pro | 0.724 | 0.753 | 0.738 | 0.722 | 0.790 | 0.754 | 0.649 | 0.657 | 0.653 |
| Mar 25, 2025 | Gemini-2.5-Flash | 0.682 | 0.777 | 0.726 | 0.654 | 0.743 | 0.696 | 0.598 | 0.632 | 0.614 |
| Mar 25, 2025 | DeepSeek-V3 | 0.669 | 0.761 | 0.712 | 0.674 | 0.784 | 0.725 | 0.580 | 0.654 | 0.615 |
| Jan 22, 2025 | DeepSeek-R1 w/ R | 0.638 | 0.763 | 0.695 | 0.660 | 0.803 | 0.725 | 0.578 | 0.682 | 0.626 |
| Apr 18, 2024 | LLaMA-3-70B-Instruct | 0.665 | 0.744 | 0.702 | 0.649 | 0.741 | 0.692 | 0.572 | 0.633 | 0.601 |
| May 13, 2024 | GPT-4o | 0.607 | 0.799 | 0.690 | 0.593 | 0.755 | 0.664 | 0.529 | 0.666 | 0.590 |
| July 18, 2024 | GPT-4o-Mini | 0.604 | 0.772 | 0.678 | 0.597 | 0.781 | 0.677 | 0.525 | 0.674 | 0.590 |
| Aug 08, 2025 | Qwen-Plus | 0.624 | 0.778 | 0.693 | 0.656 | 0.808 | 0.724 | 0.562 | 0.680 | 0.615 |
| Aug 08, 2025 | Qwen-Plus w/ R | 0.584 | 0.801 | 0.676 | 0.618 | 0.846 | 0.714 | 0.537 | 0.724 | 0.617 |
| Aug 07, 2025 | Qwen3-32B | 0.570 | 0.801 | 0.666 | 0.597 | 0.827 | 0.693 | 0.510 | 0.711 | 0.594 |
| Aug 07, 2025 | Qwen3-8B | 0.581 | 0.785 | 0.668 | 0.605 | 0.803 | 0.690 | 0.515 | 0.689 | 0.589 |
| Jun 6, 2024 | GLM-4-Air | 0.599 | 0.759 | 0.670 | 0.607 | 0.770 | 0.679 | 0.517 | 0.637 | 0.571 |
| Jun 6, 2024 | GLM-4-Plus | 0.552 | 0.798 | 0.653 | 0.554 | 0.805 | 0.656 | 0.481 | 0.685 | 0.565 |
| Jun 6, 2024 | GLM-4-9B | 0.488 | 0.752 | 0.592 | 0.480 | 0.755 | 0.587 | 0.403 | 0.614 | 0.487 |
| Apr 18, 2024 | LLaMA-3-8B-Instruct | 0.567 | 0.649 | 0.605 | 0.549 | 0.633 | 0.588 | 0.454 | 0.522 | 0.486 |