Provide Optimized Reindexing #517

jvanderen1 · 2024-01-05T22:50:41Z

I was wondering if it would be possible to optimize how PGSync updates Elasticsearch records. I put 2 different thoughts I had into these dropdown menu's. Please let me know what you think.

Only Update When Specified Columns Changed

I notice that when an unspecified column updates, it fires a re-index.

Example

Given the following schema for table my_table

# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
#  id                          | integer                     | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
#  foo                         | text                        |
#  bar                         | text                        |
# ---------------------------------------------------------------------------------------------------------

And a schema.json defined as:

[
  {
    "database": "default",
    "index": "my_index",
    "nodes": {
      "table": "my_table",
      "columns": [
        "foo"
      ]
    }
  }
]

What is happening?

my_index is getting updated when column bar updates.

What do I expect to happen instead?

my_index would not be updated when bar updates because it is not specified in schema.json.

Batch Updates Together

I notice that when a record updates multiple times, it gets re-indexed the same number of times it was updated.

Example

Given the following schema for table my_table

# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
#  id                          | integer                     | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
#  foo                         | text                        |
# ---------------------------------------------------------------------------------------------------------

And a schema.json defined as:

[
  {
    "database": "default",
    "index": "my_index",
    "nodes": {
      "table": "my_table",
      "columns": [
        "foo"
      ]
    }
  }
]

And I perform the following query:

UPDATE my_table
SET foo = "hello"
WHERE id = 1;

Followed by another query (before pgsync runs again):

UPDATE my_table
SET foo = "world"
WHERE id = 1;

What is happening?

2 insert requests are sent to my_index the next time pgsync runs.

What do I expect to happen instead?

1 insert request is sent to my_index the next time pgsync runs. Because we are performing full document updates in the index, there is no reason to send the same document to Elasticsearch twice.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide Optimized Reindexing #517

Provide Optimized Reindexing #517

jvanderen1 commented Jan 5, 2024 •

edited

Example

What is happening?

What do I expect to happen instead?

Example

What is happening?

What do I expect to happen instead?

Provide Optimized Reindexing #517

Provide Optimized Reindexing #517

Comments

jvanderen1 commented Jan 5, 2024 • edited

Example

What is happening?

What do I expect to happen instead?

Example

What is happening?

What do I expect to happen instead?

jvanderen1 commented Jan 5, 2024 •

edited