-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: js embedding registry #1308
feat: js embedding registry #1308
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial questions
for (const functionEntry of functions.values()) { | ||
const sourceColumn = columns[functionEntry.sourceColumn]; | ||
const destColumn = functionEntry.vectorColumn ?? "vector"; | ||
if (sourceColumn === undefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just for my own knowledge but how do you decide between sourceColumn === undefined
and sourceColumn === undefined || sourceColumn === null
in these validation checks? (i.e. how do you know when you only need to check for one and not both)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so null
usually has to be explicitly set, while undefined
is just the absence of a value. So as long as we never explicitly start setting values to null, we generally only need to check for undefined
.
const newTable = new ArrowTable(columns); | ||
return alignTable(newTable, schema); | ||
} | ||
|
||
/** Helper function to apply embeddings to an input table */ | ||
async function applyEmbeddings<T>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the relationship between applyEmbeddings
and appendVectorColumn
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I want to eventually get rid of applyEmbeddings
in favor of appendVectorColumn
to consolidate functionality & better align with the python implementation. It's essentially applyEmbeddingsFromSchema
while the current applyEmbeddings
is applyEmbeddingsFromEmbeddingFunctions
@@ -65,6 +66,8 @@ export interface CreateTableOptions { | |||
* The available options are described at https://lancedb.github.io/lancedb/guides/storage/ | |||
*/ | |||
storageOptions?: Record<string, string>; | |||
schema?: Schema; | |||
embeddingFunction?: EmbeddingFunctionConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: maybe use the name embeddingConfig
instead of embeddingFunction
since there can be multiple functions?
Actually, does this support multiple embedding functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current implementation does not, but it should be pretty easy to add that in a follow up.
|
||
/** | ||
* Should the source column be excluded from the resulting table | ||
* vectorField is used in combination with `LanceSchema` to provide a declarative data model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the JS equivalent to python's func.VectorField. I'm open to suggestions on improving the docs here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a user override this? I'm wondering if it would be more appropriate as a private free function, so it's hidden from the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In JS, you can override anything, anywhere if you try hard enough.
So we can't explicitly forbid a user from overriding this, but not declaring it as abstract
generally means it's not meant to be overridden.
ndims(): number | undefined { | ||
return undefined; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this just be abstract?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's abstract, it requires the user to provide an implementation, even if the function can return undefined
. This just provides a default implementation so users don't need to implement it if they dont want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got some initial comments. There are a some tests I'd like to see, but this is looking well designed so far.
|
||
/** | ||
* Should the source column be excluded from the resulting table | ||
* vectorField is used in combination with `LanceSchema` to provide a declarative data model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a user override this? I'm wondering if it would be more appropriate as a private free function, so it's hidden from the users.
} | ||
abstract computeSourceEmbeddings( | ||
data: T[], | ||
): Promise<number[][] | Float32Array[] | Float64Array[]>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could allow Float16Array[]
here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the JS native Float32Array
and Float64Array
, not the arrow arrays. There is no native Float16Array
in JS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see.
* ``` | ||
*/ | ||
export function LanceSchema( | ||
fields: Record<string, [DataType, Map<string, EmbeddingFunction>] | DataType>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Record
preserve order? I worry we can't give the column ordering users would expect. Column order does matter in our schemas right now, I believe. I could be wrong about this, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String keys are ordered by insertion order, so this should always be deterministic.
@@ -420,6 +424,111 @@ describe("when dealing with versioning", () => { | |||
}); | |||
}); | |||
|
|||
describe("embedding functions", () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One behavior I'm interested: What happens if you create a table with a custom embedding, then another process opens that table without having first registered that custom embedding. Does it error on open, or not until you add / query data? And does the error message make sense? This seems like it would be worth a unit test to confirm we have consistent behavior on this over time. Probably could use the reset to test this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to the python api, this'll throw when adding, not on open
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good now. Thanks for the additional tests :)
Co-authored-by: Will Jones <willjones127@gmail.com>
todo: