
[Bug/Assistance] DBbench task evaluation results do not match the leaderboard #89

Open
SummerXIATIAN opened this issue Dec 22, 2023 · 1 comment
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@SummerXIATIAN

I ran the dbbench-std task with 5 workers. The open-source models are all from Huggingface and deployed with fastchat.

Model                 Actual score   Leaderboard score
gpt-3.5-turbo-0613    37.667         15.00
llama2-13b-chat       25.00          4.50
chatglm3-6b           34.99          -
chatglm2-6b           -              13.67

Result of gpt-3.5-turbo-0613
<overall.json>
{
    "total": 300,
    "validation": {
        "running": 0.0,
        "completed": 0.58,
        "agent context limit": 0.0,
        "agent validation failed": 0.37333333333333335,
        "agent invalid action": 0.0,
        "task limit reached": 0.04666666666666667,
        "unknown": 0.0,
        "task error": 0.0,
        "average_history_length": 7.64,
        "max_history_length": 34,
        "min_history_length": 4
    },
    "custom": {
        "other_accuracy": 0.23529411764705882,
        "counting_accuracy": 0.11764705882352941,
        "comparison_accuracy": 0.17647058823529413,
        "ranking_accuracy": 0.23529411764705882,
        "aggregation-SUM_accuracy": 0.125,
        "aggregation-MIN_accuracy": 0.25,
        "aggregation-MAX_accuracy": 0.0,
        "aggregation-AVG_accuracy": 0.5,
        "SELECT_accuracy": 0.2,
        "INSERT_accuracy": 0.24,
        "UPDATE_accuracy": 0.69,
        "overall_cat_accuracy": 0.37666666666666665
    }
}

I'd like to ask whether the final DB score is the overall_cat_accuracy value in overall.json (according to the paper it is the average of Select/Insert/Update, so it should be this one). The other scores also don't match up.
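For reference, a quick arithmetic check against the overall.json values above (my own sketch, not part of the AgentBench output) is consistent with overall_cat_accuracy being the plain mean of the three category accuracies:

```python
# Sanity check: average the SELECT/INSERT/UPDATE accuracies from overall.json above.
select_acc, insert_acc, update_acc = 0.2, 0.24, 0.69
overall_cat = (select_acc + insert_acc + update_acc) / 3
print(overall_cat)  # 0.37666..., matching "overall_cat_accuracy"
```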

I've also found similar score mismatches on other tasks; for example, the KG score is always 0 #69

Is this kind of score mismatch in the evaluation normal, and what might be causing it?

@SummerXIATIAN added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Dec 22, 2023
@zhc7
Collaborator

zhc7 commented Dec 23, 2023

Hi, @SummerXIATIAN

I'd like to ask whether the final DB score is the overall_cat_accuracy value in overall.json (according to the paper it is the average of Select/Insert/Update, so it should be this one). The other scores also don't match up.

Yes.

for example, the KG score is always 0

This can have several causes, and we can't tell which one from here. You could start by checking whether you can correctly connect to the server used by the KG task.
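As a rough illustration of such a check (the host and port below are placeholders, not the actual KG service address; substitute the values from your task configuration):

```python
# Minimal connectivity check (sketch; HOST/PORT are hypothetical values).
import socket

HOST, PORT = "localhost", 9200  # replace with your KG task server address
try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Reached {HOST}:{PORT}")
except OSError as e:
    print(f"Cannot reach {HOST}:{PORT}: {e}")
```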

Is this kind of score mismatch in the evaluation normal, and what might be causing it?

Sorry, the data on llmbench.ai has not been updated yet; please treat the numbers on the repository homepage or in the paper as the reference. For example, gpt-3.5-turbo's accuracy on DB is 36.7; measuring 37.6 in practice is normal, since the OpenAI API does not guarantee stable behavior. For the llama-2-13b-chat numbers, there are a few possible causes: 1. a different model version was used; 2. the prompt differs, which may come from the fastchat version or the way it is used.

I've also found similar score mismatches on other tasks

If you've confirmed it is none of the issues above, please describe the situation in detail so that we can analyze and check further.
