Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: js embedding registry #1308

Merged
merged 18 commits into from
May 29, 2024

Conversation

universalmind303
Copy link
Contributor

@universalmind303 universalmind303 commented May 15, 2024

todo:

  • add more tests
  • add more examples

@github-actions github-actions bot added the enhancement New feature or request label May 15, 2024
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial questions

for (const functionEntry of functions.values()) {
const sourceColumn = columns[functionEntry.sourceColumn];
const destColumn = functionEntry.vectorColumn ?? "vector";
if (sourceColumn === undefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for my own knowledge but how do you decide between sourceColumn === undefined and sourceColumn === undefined || sourceColumn === null in these validation checks? (i.e. how do you know when you only need to check for one and not both)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so null usually has to be explicitly set, while undefined is just the absence of a value. So as long as we never explicitly start setting values to null, we generally only need to check for undefined.

nodejs/lancedb/arrow.ts Outdated Show resolved Hide resolved
const newTable = new ArrowTable(columns);
return alignTable(newTable, schema);
}

/** Helper function to apply embeddings to an input table */
async function applyEmbeddings<T>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the relationship between applyEmbeddings and appendVectorColumn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I want to eventually get rid of applyEmbeddings in favor of appendVectorColumn to consolidate functionality & better align with the python implementation. It's essentially applyEmbeddingsFromSchema while the current applyEmbeddings is applyEmbeddingsFromEmbeddingFunctions

nodejs/lancedb/connection.ts Show resolved Hide resolved
@@ -65,6 +66,8 @@ export interface CreateTableOptions {
* The available options are described at https://lancedb.github.io/lancedb/guides/storage/
*/
storageOptions?: Record<string, string>;
schema?: Schema;
embeddingFunction?: EmbeddingFunctionConfig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: maybe use the name embeddingConfig instead of embeddingFunction since there can be multiple functions?

Actually, does this support multiple embedding functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current implementation does not, but it should be pretty easy to add that in a follow up.


/**
* Should the source column be excluded from the resulting table
* vectorField is used in combination with `LanceSchema` to provide a declarative data model
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the JS equivalent to python's func.VectorField. I'm open to suggestions on improving the docs here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a user override this? I'm wondering if it would be more appropriate as a private free function, so it's hidden from the users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JS, you can override anything, anywhere if you try hard enough.

So we can't explicitly forbid a user from overriding this, but not declaring it as abstract generally means it's not meant to be overridden.

Comment on lines +124 to +126
ndims(): number | undefined {
return undefined;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be abstract?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's abstract, it requires the user to provide an implementation, even if the function can return undefined. This just provides a default implementation so users don't need to implement it if they dont want.

nodejs/lancedb/embedding/registry.ts Outdated Show resolved Hide resolved
nodejs/lancedb/embedding/registry.ts Show resolved Hide resolved
nodejs/lancedb/embedding/registry.ts Show resolved Hide resolved
@universalmind303 universalmind303 marked this pull request as ready for review May 22, 2024 00:27
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got some initial comments. There are a some tests I'd like to see, but this is looking well designed so far.

nodejs/__test__/registry.test.ts Outdated Show resolved Hide resolved

/**
* Should the source column be excluded from the resulting table
* vectorField is used in combination with `LanceSchema` to provide a declarative data model
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a user override this? I'm wondering if it would be more appropriate as a private free function, so it's hidden from the users.

}
abstract computeSourceEmbeddings(
data: T[],
): Promise<number[][] | Float32Array[] | Float64Array[]>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could allow Float16Array[] here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the JS native Float32Array and Float64Array, not the arrow arrays. There is no native Float16Array in JS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see.

* ```
*/
export function LanceSchema(
fields: Record<string, [DataType, Map<string, EmbeddingFunction>] | DataType>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Record preserve order? I worry we can't give the column ordering users would expect. Column order does matter in our schemas right now, I believe. I could be wrong about this, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String keys are ordered by insertion order, so this should always be deterministic.

@@ -420,6 +424,111 @@ describe("when dealing with versioning", () => {
});
});

describe("embedding functions", () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One behavior I'm interested: What happens if you create a table with a custom embedding, then another process opens that table without having first registered that custom embedding. Does it error on open, or not until you add / query data? And does the error message make sense? This seems like it would be worth a unit test to confirm we have consistent behavior on this over time. Probably could use the reset to test this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to the python api, this'll throw when adding, not on open

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good now. Thanks for the additional tests :)

nodejs/__test__/table.test.ts Outdated Show resolved Hide resolved
Co-authored-by: Will Jones <willjones127@gmail.com>
@universalmind303 universalmind303 merged commit dbea3a7 into lancedb:main May 29, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants