sql: implement pgvector datatype and evaluation #124292

Open
wants to merge 1 commit into
base: master
from

Conversation

jordanlewis
Member

@jordanlewis jordanlewis commented May 16, 2024

This commit adds the pgvector datatype and associated evaluation operators and functions. It doesn't include index acceleration.

Functionality included:

  • CREATE EXTENSION vector
  • vector datatype with optional length, storage and retrieval in non-indexed table columns
  • Equality and inequality operators
  • <-> operator - L2 distance
  • <#> operator - (negative) inner product
  • <=> operator - cosine distance
  • l1_distance builtin
  • l2_distance builtin
  • cosine_distance builtin
  • inner_product builtin
  • vector_dims builtin
  • vector_norm builtin
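
For readers unfamiliar with the operators listed above, here is a minimal sketch of what three of the distance measures compute. The `Vector` type, `ErrDims`, and the function names are hypothetical stand-ins for illustration, not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Vector is a hypothetical stand-in for the PR's vector type.
type Vector []float32

// ErrDims is returned when two operands have different dimensions.
var ErrDims = errors.New("different vector dimensions")

// L2Distance is the Euclidean distance, the meaning of the <-> operator.
func L2Distance(a, b Vector) (float64, error) {
	if len(a) != len(b) {
		return 0, ErrDims
	}
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum), nil
}

// L1Distance is the taxicab distance, as in the l1_distance builtin.
func L1Distance(a, b Vector) (float64, error) {
	if len(a) != len(b) {
		return 0, ErrDims
	}
	var sum float64
	for i := range a {
		sum += math.Abs(float64(a[i]) - float64(b[i]))
	}
	return sum, nil
}

// NegInnerProduct is the negated dot product, the meaning of the <#> operator.
func NegInnerProduct(a, b Vector) (float64, error) {
	if len(a) != len(b) {
		return 0, ErrDims
	}
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return -sum, nil
}

func main() {
	a, b := Vector{1, 2, 3}, Vector{4, 6, 3}
	l2, _ := L2Distance(a, b)
	l1, _ := L1Distance(a, b)
	nip, _ := NegInnerProduct(a, b)
	fmt.Println(l2, l1, nip)
}
```

The inner product is negated so that, like the other distances, a smaller value means a closer match when results are sorted in ascending order.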

Updates #121432
Epic: None

Release note (sql change): implement pgvector encoding, decoding, and operators, without index acceleration.

@jordanlewis jordanlewis requested review from a team as code owners May 16, 2024 19:23
@jordanlewis jordanlewis requested review from nkodali and yuzefovich and removed request for a team May 16, 2024 19:23

blathers-crl bot commented May 16, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

It looks like your PR touches SQL parser code but doesn't add or edit parser tests. Please make sure you add or edit parser tests if you edit the parser.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@jordanlewis jordanlewis marked this pull request as draft May 16, 2024 19:23
@cockroach-teamcity
Member

This change is Reviewable

@jordanlewis jordanlewis force-pushed the pgvector-datatype-and-eval branch 3 times, most recently from 3342745 to 8b12d14 Compare May 17, 2024 02:52
@jordanlewis jordanlewis marked this pull request as ready for review May 17, 2024 03:08
@jordanlewis jordanlewis force-pushed the pgvector-datatype-and-eval branch 2 times, most recently from 2877549 to 8b018a1 Compare May 17, 2024 03:59
@jordanlewis jordanlewis requested a review from a team as a code owner May 17, 2024 03:59
@jordanlewis jordanlewis requested review from rharding6373 and removed request for a team May 17, 2024 03:59
@jordanlewis jordanlewis force-pushed the pgvector-datatype-and-eval branch 2 times, most recently from da48dc0 to 959612f Compare May 17, 2024 19:18
@jordanlewis jordanlewis requested a review from a team as a code owner May 17, 2024 19:18
@jordanlewis jordanlewis requested review from nameisbhaskar and vidit-bhat and removed request for a team May 17, 2024 19:18
@jordanlewis jordanlewis force-pushed the pgvector-datatype-and-eval branch 3 times, most recently from fa290d7 to 7661327 Compare May 17, 2024 23:54
Collaborator

@DrewKimball DrewKimball left a comment

This is so cool!

We should also do some mixed-version testing to make sure the new type is correctly disallowed until a cluster is upgraded (example).

Reviewed 54 of 68 files at r1, 21 of 21 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)


pkg/sql/opt/ops/scalar.opt line 538 at r2 (raw file):

    Left ScalarExpr
    Right ScalarExpr
}

Any particular reason why we aren't also doing the l1 distance <+>? pg_vector seems to support it: https://github.com/pgvector/pgvector/blob/258eaf58fdaff1843617ff59ea855e0768243fe9/README.md?plain=1#L142-L147


pkg/sql/parser/sql.y line 14277 at r2 (raw file):

  }

const_vector:

Can we add some parser tests?


pkg/sql/scanner/scan.go line 338 at r2 (raw file):

				return
			}
			return

Here and below, we're advancing the position of the scanner without setting the token ID. Should we backtrack if it isn't a vector operator?
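
To illustrate the concern, here is a rough sketch of a lookahead that only consumes characters once a full vector operator has been recognized. The token names and the `scanLess` helper are hypothetical, and the real scanner handles more `<`-prefixed tokens than shown here.

```go
package main

import "fmt"

// Token is a hypothetical token kind; the real scanner uses its own IDs.
type Token int

const (
	Less            Token = iota // <
	LessEquals                   // <=
	Distance                     // <->
	NegInnerProduct              // <#>
	CosDistance                  // <=>
)

// scanLess is called with pos pointing just past a '<'. It returns the
// token found and the new position. When the lookahead does not complete
// a vector operator, the extra characters are left unconsumed.
func scanLess(input string, pos int) (Token, int) {
	peek := func(i int) byte {
		if i < len(input) {
			return input[i]
		}
		return 0
	}
	switch peek(pos) {
	case '-':
		if peek(pos+1) == '>' {
			return Distance, pos + 2
		}
		// "<-" alone is '<' followed by a unary minus: do not advance.
		return Less, pos
	case '#':
		if peek(pos+1) == '>' {
			return NegInnerProduct, pos + 2
		}
		return Less, pos
	case '=':
		if peek(pos+1) == '>' {
			return CosDistance, pos + 2
		}
		return LessEquals, pos + 1
	}
	return Less, pos
}

func main() {
	for _, s := range []string{"<->", "<-1", "<#>", "<=>", "<=1"} {
		tok, next := scanLess(s, 1)
		fmt.Println(s, tok, next)
	}
}
```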


pkg/sql/sem/builtins/pgvector_builtins.go line 41 at r2 (raw file):

			},
			ReturnType: tree.FixedReturnType(types.Float),
			Fn: func(_ context.Context, evalCtx *eval.Context, args tree.Datums) (tree.Datum, error) {

These seem like good candidates to be defined with a SQL body using the corresponding operator. They're immutable, so would be inlined. What do you think?

Note: we'd also have to set CalledOnNullInput: true to freely allow inlining.


pkg/sql/sem/cast/cast.go line 254 at r2 (raw file):

	}

	if srcFamily == types.ArrayFamily && tgtFamily == types.PGVectorFamily {

Should there also be an implicit cast from vector to real[] or double[]?

https://github.com/pgvector/pgvector/blob/258eaf58fdaff1843617ff59ea855e0768243fe9/sql/vector.sql#L157-L158


pkg/sql/sem/eval/binary_op.go line 1236 at r2 (raw file):

	ctx context.Context, _ *tree.DistanceVectorOp, left, right tree.Datum,
) (tree.Datum, error) {
	v := tree.MustBeDPGVector(left)

I think this will result in an internal error if passed a NULL argument. Same for the builtin functions. We should also make sure to test that case.


pkg/sql/sem/tree/datum.go line 6673 at r2 (raw file):

			width := int(typ.Width())
			if width > 0 && len(in.T) != width {
				return nil, pgerror.Newf(pgcode.StringDataLengthMismatch,

[nit] pg_vector uses DataException.
https://github.com/pgvector/pgvector/blob/258eaf58fdaff1843617ff59ea855e0768243fe9/src/vector.c#L77-L79


pkg/sql/types/types.go line 491 at r2 (raw file):

	// PGVector is the type representing a PGVector object.
	PGVector = &T{

Should we be adding a case to the types.T.Equivalent() method for PGVector since the widths of compatible vectors must be the same? (I think leaving it as-is could be justified as well)


pkg/sql/types/types.go line 1321 at r2 (raw file):

//		COLLATEDSTRING: max # of characters
//		BIT           : max # of bits
//	  VECTOR        : # of dimensions

[nit] formatting


pkg/util/vector/vector.go line 26 at r2 (raw file):

// MaxDim is the maximum number of dimensions a vector can have.
const MaxDim = 16000

Should we make this configurable? Even if only as a crdb_internal setting or env variable.


pkg/util/vector/vector.go line 55 at r2 (raw file):

		}

		val, err := strconv.ParseFloat(part, 32)

Should we also check for invalid floats? e.g. NaN, +/-Inf.
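
A sketch of the kind of check being asked for, assuming the text form `[x1,x2,...]`. Only `MaxDim` and the `strconv.ParseFloat(part, 32)` call come from the diff; `ParseVector` and its error messages are hypothetical. Note that `ParseFloat` accepts "NaN" and "Inf" without returning an error, so they need an explicit test.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// MaxDim is the maximum number of dimensions a vector can have.
const MaxDim = 16000

// ParseVector is a hypothetical parser for the text form "[1,2,3]".
func ParseVector(s string) ([]float32, error) {
	s = strings.TrimSpace(s)
	if len(s) < 2 || s[0] != '[' || s[len(s)-1] != ']' {
		return nil, fmt.Errorf("malformed vector literal: %q", s)
	}
	parts := strings.Split(s[1:len(s)-1], ",")
	if len(parts) > MaxDim {
		return nil, fmt.Errorf("vector cannot have more than %d dimensions", MaxDim)
	}
	out := make([]float32, 0, len(parts))
	for _, part := range parts {
		// An empty part (including the "[]" case) fails here.
		val, err := strconv.ParseFloat(strings.TrimSpace(part), 32)
		if err != nil {
			return nil, fmt.Errorf("invalid input syntax for type vector: %q", s)
		}
		// ParseFloat returns these values with a nil error.
		if math.IsNaN(val) {
			return nil, fmt.Errorf("NaN not allowed in vector")
		}
		if math.IsInf(val, 0) {
			return nil, fmt.Errorf("infinite value not allowed in vector")
		}
		out = append(out, float32(val))
	}
	return out, nil
}

func main() {
	fmt.Println(ParseVector("[1, 2.5, -3]"))
	fmt.Println(ParseVector("[1,NaN]"))
}
```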


pkg/util/vector/vector.go line 200 at r2 (raw file):

		norm += float64(t[i]) * float64(t[i])
	}
	return math.Sqrt(norm)

We may want to validate the intermediate result here and for the l2 distance - we've run into flakes in aggregate functions that use Sqrt with floats when the true result is zero but precision errors result in a small negative value instead.

Also, are we missing range checks? This seems like it could overflow, as do some of the other functions.
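
For concreteness, a sketch of what such guards could look like. It assumes accumulation in single precision, which is where overflow is realistic; `CheckedL2Distance`, `safeSqrt`, and `ErrOutOfRange` are hypothetical names. A plain sum of squares cannot go negative, so the clamp matters only for formulas that subtract intermediate terms.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// ErrOutOfRange is a hypothetical error for overflowed results.
var ErrOutOfRange = errors.New("value out of range: overflow")

// safeSqrt treats a negative input as zero, on the assumption that a
// negative value can only come from floating-point rounding error.
func safeSqrt(x float64) float64 {
	if x < 0 {
		return 0
	}
	return math.Sqrt(x)
}

// CheckedL2Distance accumulates in float32 and reports overflow instead
// of silently returning +Inf or NaN.
func CheckedL2Distance(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("different vector dimensions")
	}
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	if math.IsInf(float64(sum), 0) || math.IsNaN(float64(sum)) {
		return 0, ErrOutOfRange
	}
	return safeSqrt(float64(sum)), nil
}

func main() {
	fmt.Println(CheckedL2Distance([]float32{3, 0}, []float32{0, 4}))
	fmt.Println(CheckedL2Distance([]float32{3e38}, []float32{-3e38}))
}
```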


pkg/sql/logictest/testdata/logic_test/vector line 1 at r2 (raw file):

# LogicTest: !local-mixed-23.2

Can we do a test that attempts to pass a NULL value to some of the vector operators?


pkg/sql/logictest/testdata/logic_test/vector line 8 at r2 (raw file):

5.196152422706632

statement error dimensions for type vector must be at least 1

[nit] We try to also validate the error code in logic tests with pgcode <code>
Ex:

statement error pgcode 42P13 pq: no language specified


pkg/sql/logictest/testdata/logic_test/vector line 25 at r2 (raw file):


query T rowsort
SELECT * FROM v

It seems strange that we would allow this - aren't basically all vector operations impossible if the widths of the vectors are different?


pkg/sql/rowenc/keyside/keyside_test.go line 249 at r3 (raw file):

		return hasKeyEncoding(typ.ArrayContents())
	}
	return !colinfo.MustBeValueEncoded(typ)

I think this change misses decimals, which also do not round trip in all cases.

Member Author

@jordanlewis jordanlewis left a comment

Thanks for the review! I'll just respond to a few of your astute questions for now before making modifications to the PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)


pkg/sql/opt/ops/scalar.opt line 538 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Any particular reason why we aren't also doing the l1 distance <+>? pg_vector seems to support it: https://github.com/pgvector/pgvector/blob/258eaf58fdaff1843617ff59ea855e0768243fe9/README.md?plain=1#L142-L147

This was added in 0.7.0, and I started the patch when pgvector was still at 0.6.0. I was thinking of adding it in a follow-on, but we could definitely do it now too.


pkg/util/vector/vector.go line 200 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

We may want to validate the intermediate result here and for the l2 distance - we've run into flakes in aggregate functions that use Sqrt with floats when the true result is zero but precision errors result in a small negative value instead.

Also, are we missing range checks? This seems like it could overflow, as do some of the other functions.

I tried to follow the pg_vector implementation, but if you think we need to add things like this I'm open to it. https://github.com/pgvector/pgvector/blob/master/src/vector.c


pkg/sql/logictest/testdata/logic_test/vector line 25 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

It seems strange that we would allow this - aren't basically all vector operations impossible if the widths of the vectors are different?

I actually just added this comparison today when responding to test failures, specifically because stats collection was panicking because it needs to run comparisons on the datums in a table column, and it's possible in pg_vector (for some reason) to have a vector column without a dimensions modifier. I checked the pg_vector source code and it also permits this comparison, so I figured I'd just throw it in here.

Thoughts?
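
One plausible ordering that stays total across mixed dimensions is to compare the shared prefix element by element and break ties by length. The sketch below assumes that rule; `Compare` is a hypothetical name, and this is not confirmed to match either code base.

```go
package main

import (
	"fmt"
	"sort"
)

// Compare orders two vectors by their common prefix, then by length.
// It returns -1, 0, or 1. The ordering rule is an assumption.
func Compare(a, b []float32) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}

func main() {
	vs := [][]float32{{1, 2}, {2}, {1}}
	sort.Slice(vs, func(i, j int) bool { return Compare(vs[i], vs[j]) < 0 })
	fmt.Println(vs)
}
```

With a total order like this, code that only needs to sort or deduplicate datums (such as stats collection) never has to reject a pair of vectors, even when the distance operators would.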


pkg/sql/rowenc/keyside/keyside_test.go line 249 at r3 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think this change misses decimals, which also do not round trip in all cases.

Good point, this wasn't failing any tests but I think you're right, will fix.

Collaborator

@DrewKimball DrewKimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)


pkg/sql/opt/ops/scalar.opt line 538 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

This was added in 0.7.0, and I started the patch when pgvector was still at 0.6.0. I was thinking of adding it in a follow-on, but we could definitely do it now too.

I think it'd be fine to leave it for a future PR, just curious.


pkg/util/vector/vector.go line 200 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I tried to follow the pg_vector implementation, but if you think we need to add things like this I'm open to it. https://github.com/pgvector/pgvector/blob/master/src/vector.c

I think it's ok to defer both requests (maybe indefinitely) until a test fails because of them, considering pg_vector does the exact same thing. But [nit] it might be nice to leave a TODO as a hint for the future.


pkg/sql/logictest/testdata/logic_test/vector line 25 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I actually just added this comparison today when responding to test failures, specifically because stats collection was panicking because it needs to run comparisons on the datums in a table column, and it's possible in pg_vector (for some reason) to have a vector column without a dimensions modifier. I checked the pg_vector source code and it also permits this comparison, so I figured I'd just throw it in here.

Thoughts?

Did you end up checking against a postgres instance with pg_vector installed? I think this function might prevent table columns from being defined without an explicit dimension. At any rate, there aren't any pg_vector tests for this case, which makes it seem likely to be an oversight if they do allow it.

Member Author

@jordanlewis jordanlewis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)


pkg/sql/parser/sql.y line 14277 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Can we add some parser tests?

Yeah, I wasn't sure about needing to do this because all of this is implicitly tested by the logic tests and so on. The parser tests seem to be more used for new top level syntax, rather than new expression, type or operator syntax. Anyway, added.


pkg/sql/scanner/scan.go line 338 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Here and below, we're advancing the position of the scanner without setting the token ID. Should we backtrack if it isn't a vector operator?

Yeah, good point. Fixed and added tests. The <- case threw me for a bit, but it scans because it could mean less than negative.


pkg/sql/sem/builtins/pgvector_builtins.go line 41 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

These seem like good candidates to be defined with a SQL body using the corresponding operator. They're immutable, so would be inlined. What do you think?

Note: we'd also have to set CalledOnNullInput: true to freely allow inlining.

I don't really see the advantage besides code reuse, but if you feel strongly we can do it. I think a disadvantage is that the reader has to remember the weird names of the operators (<=> in this case) to validate that the code makes sense, whereas in the current implementation it's clear that cosine_distance calling CosDistance is the right thing.


pkg/sql/sem/cast/cast.go line 254 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Should there also be an implicit cast from vector to real[] or double[]?

https://github.com/pgvector/pgvector/blob/258eaf58fdaff1843617ff59ea855e0768243fe9/sql/vector.sql#L157-L158

Done.


pkg/sql/sem/eval/binary_op.go line 1236 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think this will result in an internal error if passed a NULL argument. Same for the builtin functions. We should also make sure to test that case.

I think these should be fine because of logic that nullifies binary operators and builtins before they're called. I added tests, good point.


pkg/sql/sem/tree/datum.go line 6673 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] pg_vector uses DataException.
https://github.com/pgvector/pgvector/blob/258eaf58fdaff1843617ff59ea855e0768243fe9/src/vector.c#L77-L79

Done.


pkg/sql/types/types.go line 491 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Should we be adding a case to the types.T.Equivalent() method for PGVector since the widths of compatible vectors must be the same? (I think leaving it as-is could be justified as well)

Hmm, but at least according to the pg_vector source code it is okay to compare vectors with different widths...


pkg/sql/types/types.go line 1321 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] formatting

Done.


pkg/util/vector/vector.go line 26 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Should we make this configurable? Even if only as a crdb_internal setting or env variable.

Hmm... maybe? I don't see the use case. I think we should aim to stay completely compatible.


pkg/util/vector/vector.go line 55 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Should we also check for invalid floats? e.g. NaN, +/-Inf.

Done.


pkg/util/vector/vector.go line 200 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think it's ok to defer both requests (maybe indefinitely) until a test fails because of them, considering pg_vector does the exact same thing. But [nit] it might be nice to leave a TODO as a hint for the future.

Done.


pkg/sql/logictest/testdata/logic_test/vector line 1 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Can we do a test that attempts to pass a NULL value to some of the vector operators?

Done.


pkg/sql/logictest/testdata/logic_test/vector line 8 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] We try to also validate the error code in logic tests with pgcode <code>
Ex:

statement error pgcode 42P13 pq: no language specified

Done.


pkg/sql/logictest/testdata/logic_test/vector line 25 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Did you end up checking against a postgres instance with pg_vector installed? I think this function might prevent table columns from being defined without an explicit dimension. At any rate, there aren't any pg_vector tests for this case, which makes it seem likely to be an oversight if they do allow it.

I did, it is permitted :/

jordan=# create table t(v vector);
CREATE TABLE
jordan=# insert into t values('[1]'),('[1,2]');
INSERT 0 2
jordan=# select * from t;
   v
-------
 [1]
 [1,2]
(2 rows)


blathers-crl bot commented May 18, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Release note (sql change): implement pgvector encoding, decoding, and
operators, without index acceleration.
Collaborator

@DrewKimball DrewKimball left a comment

This is great work! :lgtm: once you add a case to unsupported_types.go. I think Yahor will be taking a look as well.

Reviewed 14 of 14 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)


pkg/sql/sem/builtins/pgvector_builtins.go line 41 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I don't really see the advantage besides code reuse, but if you feel strongly we can do it. I think a disadvantage is that the reader has to remember the weird names of the operators (<=> in this case) to validate that the code makes sense, whereas in the current implementation it's clear that cosine_distance calling CosDistance is the right thing.

My thinking was that it'd be nicer for the optimizer and vectorized engine (when we add a vectorized implementation) to only have to consider the operators. This is something we could change later, though, so I'd be alright with leaving it this way for now if you prefer. I guess it would also be pretty simple to just add a couple norm rules that convert the builtin function calls to the corresponding operators.


pkg/sql/sem/eval/binary_op.go line 1236 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think these should be fine because of logic that nullifies binary operators and builtins before they're called. I added tests, good point.

TIL: BinOp also has a CalledOnNullInput field.
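
A toy model of the NULL handling discussed in this thread, for anyone following along. `Datum`, `BinOp`, and `Eval` here are simplified hypothetical types, not the actual definitions in the tree and eval packages. The point is that the evaluator short-circuits to NULL before the operator body runs, unless the operator opts in.

```go
package main

import "fmt"

// Datum is a simplified value; nil stands for SQL NULL in this sketch.
type Datum interface{}

// BinOp is a simplified binary operator definition.
type BinOp struct {
	// CalledOnNullInput means Fn is invoked even when an argument is NULL.
	CalledOnNullInput bool
	Fn                func(left, right Datum) (Datum, error)
}

// Eval returns NULL without calling Fn when either input is NULL and the
// operator has not opted in to seeing NULL arguments.
func Eval(op BinOp, left, right Datum) (Datum, error) {
	if !op.CalledOnNullInput && (left == nil || right == nil) {
		return nil, nil
	}
	return op.Fn(left, right)
}

func main() {
	add := BinOp{Fn: func(l, r Datum) (Datum, error) {
		// These assertions would panic on nil, but Eval never lets nil in.
		return l.(float64) + r.(float64), nil
	}}
	fmt.Println(Eval(add, 1.5, 2.0))
	fmt.Println(Eval(add, nil, 2.0))
}
```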


pkg/sql/types/types.go line 491 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Hmm, but at least according to the pg_vector source code it is okay to compare vectors with different widths...

Right, no need for a change then.


pkg/sql/logictest/testdata/logic_test/mixed_version_pgvector line 1 at r4 (raw file):

# LogicTest: cockroach-go-testserver-23.2

Thanks for adding this. It reminds me - we also need a case for VECTOR in unsupported_types.go.


pkg/sql/logictest/testdata/logic_test/vector line 25 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I did, it is permitted :/

jordan=# create table t(v vector);
CREATE TABLE
jordan=# insert into t values('[1]'),('[1,2]');
INSERT 0 2
jordan=# select * from t;
   v
-------
 [1]
 [1,2]
(2 rows)

Ah well, that's a little sad. Thanks for checking.


pkg/sql/parser/testdata/select_exprs line 2068 at r4 (raw file):

SELECT (("[1,2]") <-> ("[3,4]")) -- fully parenthesized
SELECT "[1,2]" <-> "[3,4]" -- literals removed
SELECT _ <-> _ -- identifiers removed

Do you know why the vectors are considered identifiers instead of literals? Just because of the double quotes?

Contributor

@bdarnell bdarnell left a comment

pgvector 0.7 is brand new, but I still think it's worth getting some of the new functionality in for this initial version. In particular the new halfvec type is significant (it's necessary to support openai's larger embedding model without truncation) and we should make sure we have the right structure for multiple vector types.

One thing that pgvector doesn't do AFAICT but seems useful is if each vector had a bit to indicate whether it is normalized, so that cosine_distance could automatically turn into the faster inner_product when we don't need to normalize again. Would that be too much of a deviation from pgvector?
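
A sketch of the idea, assuming a per-vector flag that is set only when the vector was produced by an explicit normalization step. The `Vec` struct and its `Normalized` field are hypothetical and are not part of pgvector or this PR. For unit vectors, cosine distance reduces to one minus the inner product.

```go
package main

import (
	"fmt"
	"math"
)

// Vec pairs the elements with a hypothetical "already unit length" flag.
type Vec struct {
	Elems      []float32
	Normalized bool
}

// dot assumes equal lengths; callers must check dimensions first.
func dot(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += float64(a[i]) * float64(b[i])
	}
	return sum
}

// Normalize returns a unit-length copy and sets the flag. A zero vector
// cannot be normalized, so it is returned unflagged.
func Normalize(elems []float32) Vec {
	norm := math.Sqrt(dot(elems, elems))
	if norm == 0 {
		return Vec{Elems: elems}
	}
	out := make([]float32, len(elems))
	for i, x := range elems {
		out[i] = float32(float64(x) / norm)
	}
	return Vec{Elems: out, Normalized: true}
}

// CosineDistance skips the norm computation when both sides are flagged.
// The slow path yields NaN for a zero vector in this sketch.
func CosineDistance(a, b Vec) float64 {
	ip := dot(a.Elems, b.Elems)
	if a.Normalized && b.Normalized {
		return 1 - ip
	}
	return 1 - ip/math.Sqrt(dot(a.Elems, a.Elems)*dot(b.Elems, b.Elems))
}

func main() {
	u := Normalize([]float32{3, 4})
	w := Vec{Elems: []float32{6, 8}}
	fmt.Println(CosineDistance(u, w), CosineDistance(u, Normalize(w.Elems)))
}
```

The fast path is only as accurate as the stored unit vectors, so the two paths can differ by float32 rounding error.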

I was thinking it would be best to treat vectors as binary blobs as much as possible, and only view them as arrays when needed. That would minimize decoding overhead on the assumption that eventually we're not computing inner products with a Go for loop but passing the blob off to some simd-optimized implementation.
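
As a rough illustration of the blob approach, the sketch below computes an inner product directly over encoded bytes without first materializing a `[]float32`. The layout (consecutive little-endian IEEE 754 float32 values, no header) is an assumption for illustration, not the PR's actual encoding, and `Encode` and `InnerProductBlob` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// Encode packs the elements as consecutive little-endian float32 values.
func Encode(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, x := range v {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(x))
	}
	return buf
}

// InnerProductBlob reads elements straight out of the two blobs. A SIMD
// kernel could take the same byte slices in place of this loop.
func InnerProductBlob(a, b []byte) (float64, error) {
	if len(a) != len(b) || len(a)%4 != 0 {
		return 0, errors.New("mismatched or malformed vector blobs")
	}
	var sum float64
	for i := 0; i < len(a); i += 4 {
		x := math.Float32frombits(binary.LittleEndian.Uint32(a[i:]))
		y := math.Float32frombits(binary.LittleEndian.Uint32(b[i:]))
		sum += float64(x) * float64(y)
	}
	return sum, nil
}

func main() {
	a := Encode([]float32{1, 2, 3})
	b := Encode([]float32{4, 5, 6})
	fmt.Println(InnerProductBlob(a, b))
}
```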

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, @vidit-bhat, and @yuzefovich)

Member

@yuzefovich yuzefovich left a comment

Great stuff! I only have some minor nits (and I'll keep on learning about the pgvector implementation).

Reviewed 51 of 68 files at r1, 12 of 21 files at r2, 2 of 4 files at r3, 14 of 14 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @jordanlewis, @nameisbhaskar, @nkodali, @rharding6373, and @vidit-bhat)


pkg/sql/parser/sql.y line 34 at r4 (raw file):

    "github.com/cockroachdb/cockroach/pkg/geo/geopb"
    "github.com/cockroachdb/cockroach/pkg/geo/geopb"

nit: this import is duplicated with the line above.


pkg/sql/randgen/datum.go line 297 at r4 (raw file):

	case types.TSQueryFamily:
		return tree.NewDTSQuery(tsearch.RandomTSQuery(rng))
	case types.PGVectorFamily:

nit: are there some interesting vectors we could add to randInterestingDatums?

Relatedly, is PGVector a scalar type? In other words, should it be included in types.ScalarTypes?


pkg/sql/rowenc/encoded_datum.go line 333 at r4 (raw file):

		// Note that at time of this writing we don't support arrays of JSON
		// (tracked via #23468) nor of TSQuery / TSVector / PGVector types (tracked by
		// #90886), so technically we don't need to do a recursive call here,

nit: arrays of PGVector type are tracked by #121432.


pkg/sql/sem/cast/cast_map.go line 75 at r4 (raw file):

		oid.T_text:    {MaxContext: ContextAssignment, origin: ContextOriginAutomaticIOConversion, Volatility: volatility.Immutable},
	},
	oidext.T_pgvector: {

Just checking: updates to castMap were generated via cast_map_gen.sh?


pkg/util/vector/vector.go line 74 at r4 (raw file):

// String implements the fmt.Stringer interface.
func (v T) String() string {
	strs := make([]string, len(v))

nit: should we use strings.Builder?
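
Something along these lines, as a sketch: it writes each element straight into one buffer, which avoids the intermediate `[]string`. The `'g'` float format used here is an assumption and may not match the output format the PR needs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// T mirrors the vector type from the diff.
type T []float32

// String implements the fmt.Stringer interface.
func (v T) String() string {
	var sb strings.Builder
	sb.WriteByte('[')
	for i, x := range v {
		if i > 0 {
			sb.WriteByte(',')
		}
		// Shortest representation that round-trips as a float32.
		sb.WriteString(strconv.FormatFloat(float64(x), 'g', -1, 32))
	}
	sb.WriteByte(']')
	return sb.String()
}

func main() {
	fmt.Println(T{1, 2.5, -3})
}
```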


pkg/util/vector/vector.go line 83 at r4 (raw file):

// Size returns the size of the vector in bytes.
func (v T) Size() uintptr {
	return uintptr(len(v)) * 4

nit: should we do s/len/cap/ to be more precise about memory usage? Also perhaps include 24 bytes for the slice overhead?
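
A sketch of that suggestion, using `unsafe.Sizeof` for the slice header in place of a hard-coded 24, since the header size depends on the platform.

```go
package main

import (
	"fmt"
	"unsafe"
)

// T mirrors the vector type from the diff.
type T []float32

// Size returns the memory footprint in bytes: the slice header plus the
// full backing array, counted by capacity.
func (v T) Size() uintptr {
	return unsafe.Sizeof(v) + uintptr(cap(v))*unsafe.Sizeof(float32(0))
}

func main() {
	v := make(T, 2, 8)
	fmt.Println(v.Size())
}
```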


pkg/util/vector/vector.go line 175 at r4 (raw file):

		normB += t2[i] * t2[i]
	}
	// Use sqrt(a * b) over sqrt(a) * sqrt(b)

nit: I see that this comment comes from the pgvector source code, but I find it confusing since it doesn't actually match the formula.


5 participants