Comprehensive NLP Evaluation System
The xAST evaluation benchmark ensures security testing tools are no longer a "black box".
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
The production toolkit for LLMs. Observability, prompt management and evaluations.
(Windows/Linux) Local web UI for fine-tuning neural network models (currently LLMs only), written in Python with a Gradio interface.
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
An open-source visual programming environment for battle-testing prompts to LLMs.
Official repo for the paper "PHUDGE: Phi-3 as Scalable Judge". Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and much more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.
Unlock the potential of AI-driven solutions and delve into the world of Large Language Models. Explore cutting-edge concepts, real-world applications, and best practices to build powerful systems with these state-of-the-art models.
FuzzBench - Fuzzer benchmarking as a service.
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
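The core metric behind automatic evaluators of this kind is a pairwise win rate: a judge model compares each candidate output against a baseline output for the same instruction, and the fraction of wins (counting ties as half) is reported. A minimal sketch of that aggregation step, with the judge verdicts already collected (the `verdicts` list and its labels are illustrative, not the evaluator's actual output format):

```python
from collections import Counter

def win_rate(verdicts):
    """Compute a pairwise win rate from judge verdicts.

    `verdicts` is a list of strings, one per instruction, each either
    "candidate", "baseline", or "tie" (labels are illustrative).
    Ties count as half a win for the candidate.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    wins = counts["candidate"] + 0.5 * counts["tie"]
    return wins / total

# Example: 3 wins, 1 loss, 1 tie -> 0.7 win rate
print(win_rate(["candidate", "candidate", "baseline", "tie", "candidate"]))
```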
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
The RAG Experiment Accelerator is a versatile tool designed to expedite and simplify experiments and evaluations using Azure Cognitive Search and the RAG pattern.
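A typical experiment in this setting first measures how often the retrieval step surfaces the document that actually contains the answer, before any generation quality is judged. A minimal sketch of a recall@k loop, assuming a `search(query, top_k)` wrapper around whatever retriever is under test (the wrapper and the labelled data are assumptions, not part of the tool's API):

```python
def recall_at_k(labelled_queries, search, k=5):
    """Fraction of queries whose gold document id appears in the top-k results.

    `labelled_queries` is a list of (query, gold_doc_id) pairs;
    `search(query, top_k)` returns a ranked list of document ids.
    Both are placeholders for whatever retriever/dataset is being evaluated.
    """
    hits = 0
    for query, gold_doc_id in labelled_queries:
        retrieved = search(query, top_k=k)
        if gold_doc_id in retrieved[:k]:
            hits += 1
    return hits / len(labelled_queries)

# Example with a toy in-memory "retriever"
corpus = {"d1": "pricing tiers", "d2": "data retention policy", "d3": "SLA terms"}

def toy_search(query, top_k=5):
    # rank documents by naive word overlap with the query
    scored = sorted(corpus, key=lambda d: -len(set(query.split()) & set(corpus[d].split())))
    return scored[:top_k]

print(recall_at_k([("what is the data retention policy", "d2")], toy_search, k=2))
```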
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern.
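In the LLM-as-a-judge pattern, a strong model is prompted with the question, the application's answer, and a grading rubric, and asked to return a score that can be aggregated across a test set. A minimal sketch using the OpenAI Python SDK as the judge backend (the rubric, model choice, and 1-5 scale are illustrative assumptions, not taken from these notebooks):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an answer produced by another system.
Question: {question}
Answer: {answer}
Rate the answer from 1 (unusable) to 5 (fully correct and helpful).
Reply with the number only."""

def judge_score(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to score one (question, answer) pair on a 1-5 scale."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Average the scores over a small test set to get one quality number per run
test_set = [("What is 2 + 2?", "4")]
scores = [judge_score(q, a) for q, a in test_set]
print(sum(scores) / len(scores))
```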
A Python tool to evaluate the performance of VLMs in the medical domain.