Add near text search with multiple target vectors #4955

Draft
wants to merge 40 commits into main

Conversation

dirkkul (Contributor) commented May 16, 2024

What's being changed:

Adds support for searching across multiple target vectors in a single query. This should cover all vector-related searches, but not aggregate queries. It can be tested with the Python client 4.7.0a0.

Review checklist

  • Documentation has been updated, if necessary. Link to changed documentation:
  • Chaos pipeline run or not necessary. Link to pipeline:
  • All new code is covered by tests where it is reasonable.
  • Performance tests have been run or not necessary.

@dirkkul dirkkul marked this pull request as draft May 16, 2024 22:53
@hadfield

This will be an excellent addition! I would like to suggest the ability to configure scoring/ranking. For example, in a nearText case, results could be sorted by the minimum average distance under a distance metric (such as cosine), with some weighting included. With 3 vectors, the weights might be [0.4, 0.3, 0.3] to weight the first vector more heavily. Depending on the distance metric, some normalization may be needed, especially if the vectors come from different embedding models.
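
To make the suggestion concrete, here is a minimal sketch of weighted ranking over several target vectors. It is plain Python and not the Weaviate API: the function names, the in-memory `objects` layout, and the choice of min-max normalization are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def min_max_normalize(values):
    """Rescale values to [0, 1]; one possible normalization, assumed here."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_by_weighted_distance(query_vectors, objects, weights, normalize=True):
    """Rank objects by the weighted sum of per-target distances (smallest first).

    query_vectors: dict target name -> query vector (hypothetical layout)
    objects:       dict object id -> dict target name -> stored vector
    weights:       dict target name -> weight, e.g. 0.4 / 0.3 / 0.3
    """
    ids = list(objects)
    combined = {obj_id: 0.0 for obj_id in ids}
    for target, weight in weights.items():
        distances = [
            cosine_distance(query_vectors[target], objects[obj_id][target])
            for obj_id in ids
        ]
        if normalize:
            # Put each target's distances on a comparable scale before mixing.
            distances = min_max_normalize(distances)
        for obj_id, distance in zip(ids, distances):
            combined[obj_id] += weight * distance
    return sorted(combined.items(), key=lambda item: item[1])
```

Normalizing per target before weighting is one way to keep an embedding model with a wider distance spread from dominating the combined score; other schemes (for example rank-based fusion) would fit the same interface.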

hadfield commented May 26, 2024

Related to this, but a different usage scenario: a query that spans multiple collections and involves more than one vector.

Given a data model like:

Document (Collection), Topic (Collection), Image (Collection)

Document:
content (Vector, Text Embedding)

Topic:
description (Vector, Text Embedding)
multiModalDescription (Vector, MultiModal Embedding)

Image:
content (Vector, MultiModal Embedding)

Query:

  1. Document content: nearText("cute kittens")
  2. Matching Documents provide vectors to find nearby Topics based on closeness in the Text Embedding space
  3. Matching Topics provide vectors to find nearby Images based on closeness in the MultiModal Embedding space

So the Topics collection has two vectors and serves to "join" the two embedding spaces, allowing queries to traverse from one embedding space to the other. One scenario where this arises is when there is an existing dataset for "documents" and an existing dataset for "images", and you want to query across them without having to modify the current data (or the processes that maintain it).
I briefly discussed this use-case with @bobvanluijt a few months back at an event in NYC.
Hopefully I articulated what I mean, but let me know if clarifications are needed, or if I'm on the wrong track.
If this use case is completely separate, I guess a separate issue could be added?
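
The three query steps above could be sketched roughly as follows. This is a brute-force, in-memory illustration and not Weaviate code: the `nearest` and `traverse` helpers are hypothetical, collections are assumed to be lists of dicts keyed by the vector names from the data model, and the nearText step is assumed to have already turned the query text into a vector in the Text Embedding space.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vector, collection, vector_name, k):
    """Return the k objects whose named vector is closest to query_vector."""
    return sorted(
        collection,
        key=lambda obj: cosine_distance(query_vector, obj[vector_name]),
    )[:k]


def traverse(query_text_vector, documents, topics, images, k=1):
    """Document -> Topic -> Image traversal across two embedding spaces."""
    # Step 1: documents near the (already embedded) query text.
    matched_docs = nearest(query_text_vector, documents, "content", k)

    # Step 2: topics near each matched document, in the text embedding space.
    matched_topics = []
    for doc in matched_docs:
        for topic in nearest(doc["content"], topics, "description", k):
            if topic not in matched_topics:
                matched_topics.append(topic)

    # Step 3: images near each matched topic, in the multimodal space.
    matched_images = []
    for topic in matched_topics:
        for image in nearest(
            topic["multiModalDescription"], images, "content", k
        ):
            if image not in matched_images:
                matched_images.append(image)
    return matched_images
```

The Topic objects are the only place where both spaces meet, which is why the Document and Image data can stay unmodified in this sketch.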

hadfield commented May 28, 2024

For the parallel N-vector query case, is there a concept of optimizing the ordering, such that the vector with the fewest nearby results acts as a gating factor on the others? In document search, if you were querying for "happy" AND "aardvark", you would search for "aardvark" first, which presumably is less frequent and helps filter the "happy" results. The situation with vectors is not exactly the same, but I thought a similar process might help.
In a query I would use this for, one of the vectors would have something like 1000x as many nearby vectors as the others, so it could be bad performance-wise to enumerate them all only to intersect them with the other, much smaller sets.
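
A rough sketch of the gating idea, in plain Python. Everything here is hypothetical: `gated_intersection`, the `fetch_candidates` and `matches` callbacks, and in particular the assumption that an estimate of the number of nearby results per vector is available cheaply, which may not hold for an approximate index.

```python
def gated_intersection(candidate_counts, fetch_candidates, matches):
    """Intersect per-vector result sets, enumerating only the smallest one.

    candidate_counts: dict target name -> estimated number of nearby results
    fetch_candidates: callable(target) -> set of ids near the query vector
    matches:          callable(target, obj_id) -> bool, a per-object check
                      that obj_id is also near the query for that target
    """
    # Most selective target first, like searching "aardvark" before "happy".
    order = sorted(candidate_counts, key=candidate_counts.get)
    most_selective, remaining = order[0], order[1:]

    # Only the smallest candidate set is enumerated in full.
    survivors = fetch_candidates(most_selective)

    # Larger sets are never enumerated; survivors are checked one by one.
    for target in remaining:
        survivors = {
            obj_id for obj_id in survivors if matches(target, obj_id)
        }
        if not survivors:
            break
    return survivors
```

With one target having roughly 1000x as many nearby vectors as the others, this performs one small enumeration plus a handful of per-object checks instead of materializing the large set.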

sonarcloud bot commented Jun 6, 2024

Quality Gate failed

Failed conditions
1 Security Hotspot
3.3% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

7 participants