I ran the dbbench-std task with 5 workers. The open-source models are all from Huggingface, deployed with fastchat.
Screenshot of the gpt-3.5-turbo-0613 result (`overall.json`):

```json
{
  "total": 300,
  "validation": {
    "running": 0.0,
    "completed": 0.58,
    "agent context limit": 0.0,
    "agent validation failed": 0.37333333333333335,
    "agent invalid action": 0.0,
    "task limit reached": 0.04666666666666667,
    "unknown": 0.0,
    "task error": 0.0,
    "average_history_length": 7.64,
    "max_history_length": 34,
    "min_history_length": 4
  },
  "custom": {
    "other_accuracy": 0.23529411764705882,
    "counting_accuracy": 0.11764705882352941,
    "comparison_accuracy": 0.17647058823529413,
    "ranking_accuracy": 0.23529411764705882,
    "aggregation-SUM_accuracy": 0.125,
    "aggregation-MIN_accuracy": 0.25,
    "aggregation-MAX_accuracy": 0.0,
    "aggregation-AVG_accuracy": 0.5,
    "SELECT_accuracy": 0.2,
    "INSERT_accuracy": 0.24,
    "UPDATE_accuracy": 0.69,
    "overall_cat_accuracy": 0.37666666666666665
  }
}
```
I'd like to ask: is the final DB score the `overall_cat_accuracy` value in `overall.json`? (According to the paper it is the average of the SELECT/INSERT/UPDATE accuracies, which matches this field.) The other scores also don't line up with the published numbers.

I've also found similar mismatches in other tasks, e.g. the kg score is always 0 (#69).

Is this kind of mismatch between evaluated and published scores normal, and what might cause it?
Hi, @SummerXIATIAN
Yes, it is.
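If it helps, this relation can be verified with a few lines. The values are copied from the `overall.json` above, and the averaging rule (unweighted mean of the three category accuracies) is the one described in the paper:

```python
# Per the paper, DB's final score is the unweighted mean of the
# SELECT/INSERT/UPDATE accuracies reported under "custom" in overall.json.
custom = {
    "SELECT_accuracy": 0.2,
    "INSERT_accuracy": 0.24,
    "UPDATE_accuracy": 0.69,
}

overall_cat = sum(custom.values()) / len(custom)
print(overall_cat)  # ≈ 0.3767, matching the reported overall_cat_accuracy
```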
> e.g. the kg score is always 0

This can have several causes, and we can't determine which one without more information. One thing worth checking is whether you can correctly connect to the server used by the kg task.
Apologies, the data on llmbench.ai has not been updated yet; please treat the numbers on the repository homepage or in the paper as the reference. For example, gpt-3.5-turbo's accuracy on DB is 36.7; measuring 37.6 in practice is normal, since the OpenAI API does not guarantee stable outputs. For the llama-2-13b-chat numbers, there are a few possible causes: 1. a different model version was used; 2. the prompt differs, which may come from the fastchat version or the way it is invoked.
> I've also found similar mismatches in other tasks

If you have confirmed it is none of the issues above, please describe the situation in detail so we can analyze and investigate further.