Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide Optimized Reindexing #517

Open
jvanderen1 opened this issue Jan 5, 2024 · 0 comments
Open

Provide Optimized Reindexing #517

jvanderen1 opened this issue Jan 5, 2024 · 0 comments

Comments

@jvanderen1
Copy link

jvanderen1 commented Jan 5, 2024

I was wondering if it would be possible to optimize how PGSync updates Elasticsearch records. I put 2 different thoughts I had into these dropdown menu's. Please let me know what you think.

Only Update When Specified Columns Changed
I notice that when an unspecified column updates, it fires a re-index.

Example

Given the following schema for table my_table

# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
#  id                          | integer                     | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
#  foo                         | text                        |
#  bar                         | text                        |
# ---------------------------------------------------------------------------------------------------------

And a schema.json defined as:

[
  {
    "database": "default",
    "index": "my_index",
    "nodes": {
      "table": "my_table",
      "columns": [
        "foo"
      ]
    }
  }
]

What is happening?

my_index is getting updated when column bar updates.

What do I expect to happen instead?

my_index would not be updated when bar updates because it is not specified in schema.json.

Batch Updates Together

I notice that when a record updates multiple times, it gets re-indexed the same number of times it was updated.

Example

Given the following schema for table my_table

# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
#  id                          | integer                     | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
#  foo                         | text                        |
# ---------------------------------------------------------------------------------------------------------

And a schema.json defined as:

[
  {
    "database": "default",
    "index": "my_index",
    "nodes": {
      "table": "my_table",
      "columns": [
        "foo"
      ]
    }
  }
]

And I perform the following query:

UPDATE my_table
SET foo = "hello"
WHERE id = 1;

Followed by another query (before pgsync runs again):

UPDATE my_table
SET foo = "world"
WHERE id = 1;

What is happening?

2 insert requests are sent to my_index the next time pgsync runs.

What do I expect to happen instead?

1 insert request is sent to my_index the next time pgsync runs. Because we are performing full document updates in the index, there is no reason to send the same document to Elasticsearch twice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant