Comprehensive NLP Evaluation System
The xAST evaluation benchmark ensures security testing tools are no longer a "black box".
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
The production toolkit for LLMs. Observability, prompt management and evaluations.
(Windows/Linux) Local web UI for fine-tuning neural network models (currently LLMs only), written in Python with a Gradio interface.
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
An open-source visual programming environment for battle-testing prompts to LLMs.
Official repo for the paper "PHUDGE: Phi-3 as Scalable Judge". Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and much more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.
Unlock the potential of AI-driven solutions and delve into the world of Large Language Models. Explore cutting-edge concepts, real-world applications, and best practices to build powerful systems with these state-of-the-art models.
FuzzBench - Fuzzer benchmarking as a service.
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
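The core metric behind automatic evaluators of this kind is a pairwise win rate: a judge model compares each candidate output against a baseline output for the same instruction, and the fraction of wins (counting ties as half) is reported. A minimal sketch of that aggregation step, with the judge verdicts already collected (the `verdicts` list and its labels are illustrative, not the evaluator's actual output format):

```python
from collections import Counter

def win_rate(verdicts):
    """Compute a pairwise win rate from judge verdicts.

    `verdicts` is a list of strings, one per instruction, each either
    "candidate", "baseline", or "tie" (labels are illustrative).
    Ties count as half a win for the candidate.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    wins = counts["candidate"] + 0.5 * counts["tie"]
    return wins / total

# Example: 3 wins, 1 loss, 1 tie -> 0.7 win rate
print(win_rate(["candidate", "candidate", "baseline", "tie", "candidate"]))
```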
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
The RAG Experiment Accelerator is a versatile tool designed to expedite and simplify experiments and evaluations using Azure Cognitive Search and the RAG pattern.
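A typical experiment in this setting first measures how often the retrieval step surfaces the document that actually contains the answer, before any generation quality is judged. A minimal sketch of a recall@k loop, assuming a `search(query, top_k)` wrapper around whatever retriever is under test (the wrapper and the labelled data are assumptions, not part of the tool's API):

```python
def recall_at_k(labelled_queries, search, k=5):
    """Fraction of queries whose gold document id appears in the top-k results.

    `labelled_queries` is a list of (query, gold_doc_id) pairs;
    `search(query, top_k)` returns a ranked list of document ids.
    Both are placeholders for whatever retriever/dataset is being evaluated.
    """
    hits = 0
    for query, gold_doc_id in labelled_queries:
        retrieved = search(query, top_k=k)
        if gold_doc_id in retrieved[:k]:
            hits += 1
    return hits / len(labelled_queries)

# Example with a toy in-memory "retriever"
corpus = {"d1": "pricing tiers", "d2": "data retention policy", "d3": "SLA terms"}

def toy_search(query, top_k=5):
    # rank documents by naive word overlap with the query
    scored = sorted(corpus, key=lambda d: -len(set(query.split()) & set(corpus[d].split())))
    return scored[:top_k]

print(recall_at_k([("what is the data retention policy", "d2")], toy_search, k=2))
```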
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern.
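In the LLM-as-a-judge pattern, a strong model is prompted with the question, the application's answer, and a grading rubric, and asked to return a score that can be aggregated across a test set. A minimal sketch using the OpenAI Python SDK as the judge backend (the rubric, model choice, and 1-5 scale are illustrative assumptions, not taken from these notebooks):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an answer produced by another system.
Question: {question}
Answer: {answer}
Rate the answer from 1 (unusable) to 5 (fully correct and helpful).
Reply with the number only."""

def judge_score(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to score one (question, answer) pair on a 1-5 scale."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Average the scores over a small test set to get one quality number per run
test_set = [("What is 2 + 2?", "4")]
scores = [judge_score(q, a) for q, a in test_set]
print(sum(scores) / len(scores))
```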
A Python tool to evaluate the performance of VLMs in the medical domain.