bug: Different embedding token usage in Langfuse than in OpenAI #1871

Open
michalwelna0 opened this issue Apr 26, 2024 · 5 comments

Labels: question (Further information is requested), 🐞❔ unconfirmed bug

Comments

@michalwelna0
Describe the bug

I wanted to use LlamaParse to parse a set of documents (PDF/Doc/Docx) and index them so that I could ask custom questions against those documents. I created a dedicated (fresh) OpenAI API key because I wanted to monitor token usage and compare Langfuse with OpenAI. After performing a bunch of tests in which I simply parse documents with LlamaParse and run the indexing step with LlamaIndex, I encountered a mismatch between the embedding model's token usage in Langfuse and in OpenAI.

Model used: text-embedding-ada-002(-v2)
Langfuse token count: 32750
OpenAI token count: 33198
Difference: 448 tokens

Tests on the same set of documents (OpenAI - Langfuse - difference):
33249 - 32800 - 449

Tests on other documents:
40779 - 40328 - 451
40646 - 40327 - 319

OpenAI always counted more tokens than Langfuse. As you can see, when running tests on the same documents a pattern showed up: around 450 tokens were missing on the Langfuse side.

Have you ever experienced a similar issue? Or is this expected behaviour?
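
(For cross-checking independently of both dashboards: a minimal sketch, not part of the original test, that counts tokens locally with tiktoken, assuming the cl100k_base encoding used by text-embedding-ada-002.)

import tiktoken

# text-embedding-ada-002 uses the cl100k_base encoding
enc = tiktoken.get_encoding("cl100k_base")

def count_embedding_tokens(chunks: list[str]) -> int:
    # sum the token counts of all text chunks sent to the embeddings endpoint
    return sum(len(enc.encode(chunk)) for chunk in chunks)

# hypothetical chunks; in the real test these would be the node texts being embedded
print(count_embedding_tokens(["first document chunk", "second document chunk"]))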

To reproduce

from pathlib import Path
import os

from llama_parse import LlamaParse
from langfuse import Langfuse
from langfuse.llama_index import LlamaIndexCallbackHandler
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.postprocessor import SimilarityPostprocessor

os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_HOST"] = ""
os.environ["OPENAI_API_KEY"] = ""
os.environ["LLAMA_CLOUD_API_KEY"] = ""

langfuse = Langfuse()

# register the Langfuse LlamaIndex callback handler so spans and
# token usage are captured (referenced below as langfuse_callback_handler)
langfuse_callback_handler = LlamaIndexCallbackHandler()
Settings.callback_manager = CallbackManager([langfuse_callback_handler])

folder_path = Path("path to local documents")
num_workers = len(list(folder_path.iterdir()))
parser = LlamaParse(
    result_type="markdown",
    verbose=True,
    language="en",
    num_workers=num_workers,  # should be the number of documents, limit 10
)


def index_docs(docs, trace):
    span = trace.span(
        name="index-docs",
    )

    Settings.llm = OpenAI(model="gpt-3.5-turbo")
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model="gpt-3.5-turbo"), num_workers=num_workers
    )

    langfuse_callback_handler.set_root(trace)

    nodes = node_parser.get_nodes_from_documents(documents=docs)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes=nodes)

    index = VectorStoreIndex(nodes=base_nodes + objects)

    engine = index.as_query_engine(
        similarity_top_k=15,
        node_postprocessors=[
            SimilarityPostprocessor(similarity_cutoff=0.4)
        ],
        verbose=True,
    )
    langfuse_callback_handler.set_root(None)
    span.end()
    return engine

def run():
    trace = langfuse.trace(
        name="Test trace",
        tags=["TEST"],
    )
    documents = parser.load_data([str(path) for path in folder_path.iterdir()])
    engine = index_docs(docs=documents, trace=trace)

run()

Additional information

I tried both Langfuse approaches, decorators (@observe()) and the low-level SDK. Both gave me the same result: the embedding token count did not match OpenAI's. A sketch of the decorator variant follows below.
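
(For reference, a minimal sketch of the decorator variant, simplified; it assumes the langfuse-python v2 decorator integration, where langfuse_context exposes a trace-scoped LlamaIndex handler. index_docs_decorated is a hypothetical name for illustration.)

from langfuse.decorators import observe, langfuse_context
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager

@observe()  # creates a trace for this function call
def index_docs_decorated(docs):
    # attach the trace-scoped LlamaIndex handler from the decorator context
    handler = langfuse_context.get_current_llama_index_handler()
    Settings.callback_manager = CallbackManager([handler])
    # ... same node-parsing / indexing steps as in the snippet above ...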

The reason I raise this issue is that I would like to use Langfuse (monitoring, token counting, price calculation, etc.) as a reliable source, and I want to be sure that it calculates the cost of token usage correctly, so that I can estimate the cost of each trace I execute.

I verified that LlamaParse is not responsible for the embedding token usage (I ran a test of the LlamaParse phase alone and monitored OpenAI token usage).

@marcklingen
Member

Do you run on Langfuse Cloud? If so, could you provide a trace_id where this issue occurred? Checking the logs for this request end-to-end would help debug the problem and identify its source.

@marcklingen
Member

Do your embedding generations include inputs/outputs? By default, Langfuse takes the token counts reported by LlamaIndex and does not attempt to tokenize them at the API level, as storing all embedded documents in Langfuse is usually not necessary.

https://github.com/langfuse/langfuse-python/blob/e77183bc0f69df1803fc33481e06e2fab83ec419/langfuse/llama_index/llama_index.py#L416
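
(For what it's worth, one way to pinpoint where the counts diverge is to count on the LlamaIndex side as well, e.g. with LlamaIndex's TokenCountingHandler; a sketch, assuming the tiktoken tokenizer for text-embedding-ada-002.)

import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# count embedding tokens with the same tokenizer the model uses
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("text-embedding-ada-002").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# ... run the indexing step, then compare against the OpenAI dashboard
print("embedding tokens:", token_counter.total_embedding_token_count)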

@michalwelna0
Author

Hi @marcklingen, no, we do not run it on Langfuse Cloud; we self-host it in our environment, so you will not be able to check it.

The embedding generations include inputs only; I did not see any output tokens from OpenAI.
So (if I understand correctly) Langfuse only takes the token counts calculated by LlamaIndex and does not calculate them itself? If so, this may be on the LlamaIndex side.

@marcklingen
Member

Langfuse does both, but for LlamaIndex we try to get the counts via LlamaIndex and simply ingest them into Langfuse. If you have logs, you could check whether this event included token counts, to pinpoint the problem. If no token counts are provided and a known model is used (e.g., the OpenAI models you are using), then Langfuse tokenizes within the ingestion API.
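
(As an illustration, with the low-level v2 Python SDK the usage of a generation can also be reported explicitly at ingestion time, in which case Langfuse uses the provided numbers instead of tokenizing; a sketch with a hypothetical trace name and an example count.)

from langfuse import Langfuse

langfuse = Langfuse()
trace = langfuse.trace(name="embedding-debug")  # hypothetical trace name

# report exact client-side counts; Langfuse then skips its own tokenization
trace.generation(
    name="OpenAIEmbedding",
    model="text-embedding-ada-002",
    usage={"input": 40775, "total": 40775, "unit": "TOKENS"},  # example value
)
langfuse.flush()  # ensure the event batch is sent before the script exits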

@michalwelna0
Author

To visualize the problem: I ran the code snippet provided above on a sample of 5 documents (PDFs/Docx). OpenAI usage reported 41119 tokens used by the embedding model (text-embedding-ada-002-v2), while the Langfuse trace shows an OpenAIEmbedding generation with 40775 tokens used. Here are some screenshots of the trace in the Langfuse UI.

Full trace view:
[screenshot]

The same view, scrolled down:
[screenshot]

Opened EmbeddingGeneration in Langfuse:
[screenshot]

The difference between OpenAI and Langfuse is 344 tokens.
