Aligning LlamaIndex Metadata Structure with Underlying Database Capabilities to Support Arrays of Objects #659

Open
TYRONEMICHAEL opened this issue Mar 21, 2024 · 1 comment

@TYRONEMICHAEL
Contributor

Description:

Issue Summary:

We use LlamaIndex as an interface to several vector databases, including ChromaDb. ChromaDb supports a flexible metadata structure that allows for arrays of objects, which enables richer metadata associations, but LlamaIndex's metadata handling does not. The current Record<string, any> type definition for metadata in LlamaIndex restricts us to a single flat key-value structure, so we cannot fully use the underlying databases' capabilities, in particular ChromaDb's ability to handle arrays of objects within metadata.

ChromaDb's Metadata Capabilities:

ChromaDb allows for a diverse range of metadata structures, as demonstrated by the following usage pattern:

await collection.upsert({
  ids: ["id1", "id2", "id3"],
  embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
  // One flat metadata object per id, matched by position.
  metadatas: [
    { "chapter": "3", "verse": "16" },
    { "chapter": "3", "verse": "5" },
    { "chapter": "29", "verse": "11" }
  ],
  documents: ["doc1", "doc2", "doc3"]
});

This flexibility in metadata structure allows users to associate multiple related attributes with a single document, enhancing the expressiveness and utility of the metadata.

Proposed Enhancement for LlamaIndex:

To bridge this gap and align LlamaIndex more closely with the capabilities of ChromaDb and potentially other databases, I propose we consider extending the metadata type definition in LlamaIndex to Record<string, any>[]. This adjustment would permit an array of metadata objects, each maintaining a flat structure, thereby respecting the underlying databases' constraints while offering enhanced flexibility and expressiveness in metadata definition.
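As a rough sketch of the proposed shape, the type change could look like the following. The names FlatMetadata, MetadataList, and isFlatMetadata are hypothetical and used only for illustration; they are not part of LlamaIndex, and the flatness check (string, number, or boolean values only) is an assumption about what the underlying databases accept.

```typescript
// Hypothetical names for illustration; none of these exist in LlamaIndex.
type FlatMetadata = Record<string, any>; // current shape: a single object
type MetadataList = FlatMetadata[];      // proposed shape: an array of flat objects

// Returns true when every value is a string, number, or boolean,
// i.e. the object has no nested objects or arrays (assumed constraint).
function isFlatMetadata(meta: FlatMetadata): boolean {
  return Object.keys(meta).every((key) => {
    const t = typeof meta[key];
    return t === "string" || t === "number" || t === "boolean";
  });
}

// Several flat metadata objects carried together under the proposed type.
const verses: MetadataList = [
  { chapter: "3", verse: "16" },
  { chapter: "3", verse: "5" },
];

// Each entry stays flat even though the container is an array.
const allFlat = verses.every(isFlatMetadata);
```

Keeping each entry flat is what lets the array form stay within the constraints mentioned above; a nested value such as { ref: { chapter: "3" } } would fail the check.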

Potential Benefits:

  • Enhanced Metadata Expressiveness: Allows for more complex and nuanced metadata associations, akin to what is already possible in ChromaDb.
  • Increased Flexibility and Usability: Makes LlamaIndex more adaptable for a variety of use cases where complex metadata is essential.
  • Alignment with Underlying Databases: Ensures that LlamaIndex can fully leverage the features and capabilities of the databases it interfaces with, like ChromaDb.

Seeking Input and Suggestions:

I am keen to hear the community's thoughts on this proposal, any potential challenges it might pose, and how it might be implemented effectively. Suggestions for alternative approaches that could resolve the issue are also highly welcome.

@marcusschiesser
Collaborator

Thanks for your suggestion @TYRONEMICHAEL.

I think for a change like that, we need to consider at least the following:

  1. Can the data generated by the TS version be used with the Python version of LlamaIndex?
  2. Can the data generated by the Python version be used by the TS version?
  3. Does it break the existing usage?
  4. Do other vector DBs benefit from the change?

About 1. and 2.: I just took a look at the Python code. It is also using the first entry of the metadatas array (see https://github.com/run-llama/llama_index/blob/337936b013843fbc7aece81117140106803715ef/llama-index-integrations/vector_stores/llama-index-vector-stores-chroma/llama_index/vector_stores/chroma/base.py#L336), so we have to take that into account.

About 3.: Instead of Record<string, any>[], we could probably use Record<string, any>[] | Record<string, any>.
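A minimal sketch of how such a union could stay backward compatible, assuming callers normalize the value before use. MetadataObject, MetadataInput, toMetadataList, and firstMetadata are hypothetical names, not existing LlamaIndex APIs; firstMetadata only illustrates the "first entry" behaviour mentioned for points 1 and 2.

```typescript
// Hypothetical names for illustration; not part of LlamaIndex.
type MetadataObject = Record<string, any>;
type MetadataInput = MetadataObject[] | MetadataObject;

// Accepts either form and always returns an array, so existing
// single-object callers keep working unchanged.
function toMetadataList(input: MetadataInput): MetadataObject[] {
  return Array.isArray(input) ? input : [input];
}

// Returns the first metadata object, or an empty object for an empty array.
function firstMetadata(input: MetadataInput): MetadataObject {
  return toMetadataList(input)[0] ?? {};
}

// Existing usage: a single object.
const legacy = toMetadataList({ chapter: "3", verse: "16" });

// Proposed usage: an array of objects.
const listed = toMetadataList([
  { chapter: "3", verse: "16" },
  { chapter: "29", verse: "11" },
]);
```

Normalizing at the boundary would keep the rest of the code working with one shape, but whether that fits the existing vector store integrations is an open question.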

Regarding 4. it would be great to hear the thoughts of users of other Vector DBs.
