
Fix: Take batch_size into account when filling snapshots for INCREMENTAL_BY_UNIQUE_KEY models #2616

Merged

Conversation

@erindru (Contributor) commented May 15, 2024

This PR makes the batch_index available when filling snapshots, so that EvaluationStrategy implementations can take it into account when making decisions.

Prior to this, for kinds like INCREMENTAL_BY_UNIQUE_KEY:

  • If you had a model that defines intervals (say, @daily) and runs them in multiple batches (say, batch_size=1, which generates a batch per interval),
  • then when running those batches, SQLMesh would treat them all as "clear table -> insert data" rather than just the first one.

After this change, for snapshots with no existing intervals at plan time, SQLMesh can check the index of the batch and execute the "clear table" logic only for the first batch, and the "merge into existing table" logic for subsequent batches.
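A minimal sketch of that decision, with hypothetical names (the real logic lives in SQLMesh's EvaluationStrategy implementations):

# Hypothetical sketch of the batch-aware decision described above; the
# function and parameter names are illustrative, not SQLMesh's actual API.
def choose_insert_behavior(has_existing_intervals: bool, batch_index: int) -> str:
    """Decide how to write one batch of an INCREMENTAL_BY_UNIQUE_KEY snapshot."""
    if not has_existing_intervals and batch_index == 0:
        # Only the very first batch of a brand-new snapshot may clear the table.
        return "clear table -> insert data"
    # Later batches must merge, or they would wipe out earlier batches' rows.
    return "merge into existing table"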

Fixes #2609 (INCREMENTAL_BY_UNIQUE_KEY models not taking batch_size into account)

@erindru marked this pull request as draft May 15, 2024 22:33
@@ -96,6 +101,20 @@ def dialect(self) -> str:
def current_catalog_type(self) -> str:
return self.engine_adapter.current_catalog_type

@property
def supports_merge(self) -> bool:
Contributor Author

I found that this needed to be checked in multiple places, so I made it a property of the TestContext.

In addition, it has become more complex: it's no longer "the engine either supports merge or it doesn't"; it depends on both the engine and the catalog.

Member

Interesting. What prompted you to change this? Did you run into any issues? Or did you notice that merge is not being used?

Contributor Author (@erindru, May 19, 2024)

I noticed that some other tests, like test_merge(), weren't actually testing all the combinations because they were being skipped where dialect=spark or dialect=trino.

Spark and Trino do support MERGE, just not on all catalogs, so I improved the check.

But I wanted to use the same logic in the test_batch_size_on_incremental_by_unique_key_model test (because, unless I'm missing something, INCREMENTAL_BY_UNIQUE_KEY with batches only works on engines that support MERGE). So rather than duplicate the checks, I made them a property of the TestContext fixture so they could be used in both places.

Arguably, this kind of thing could also live on the EngineAdapter.
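As a rough illustration of the shape this takes (the class layout and the exact catalog rule here are assumptions for the sketch, not the actual TestContext code):

# Illustrative only: a capability check that depends on both engine and catalog.
class TestContext:
    def __init__(self, dialect: str, current_catalog_type: str):
        self.dialect = dialect
        self.current_catalog_type = current_catalog_type  # e.g. "hive", "iceberg"

    @property
    def supports_merge(self) -> bool:
        if self.dialect in ("spark", "trino"):
            # Assumed rule for this sketch: MERGE works on these engines only
            # for catalogs whose table format supports row-level writes.
            return self.current_catalog_type != "hive"
        return True  # assume the other tested engines support MERGE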

@@ -376,6 +395,116 @@ def non_temp_tables(self) -> t.List[str]:
return [x for x in self.tables if not x.startswith("__temp") and not x.startswith("temp")]


class ProjectCreator(PydanticModel):
Contributor Author

I had trouble creating a test case to reproduce the issue because it touched so many concepts.

What I really wanted was to be able to set up a small, isolated SQLMesh project to expose the issue from the user's perspective ("sqlmesh plan throws an error when I use a model defined like so"), so I wrote this small fixture to create a minimal project I could run against the current engine adapter.

I realise there is some overlap with sqlmesh.cli.example_project.init_example_project, but I didn't need the example models and they just slowed things down.
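Roughly the idea, as a sketch (the paths, model text, and Context usage are illustrative, not the fixture's actual implementation):

# Sketch: write a minimal project to disk, then plan/run it like a user would.
from pathlib import Path

from sqlmesh import Context

def make_minimal_project(root: Path) -> Context:
    (root / "models").mkdir(parents=True)
    (root / "config.yaml").write_text("model_defaults:\n  dialect: duckdb\n")
    (root / "models" / "test_model.sql").write_text(
        "MODEL (\n"
        "  name test_schema.test_model,\n"
        "  kind INCREMENTAL_BY_UNIQUE_KEY (unique_key id),\n"
        "  start '2024-01-01',\n"
        "  cron '@daily',\n"
        "  batch_size 1\n"
        ");\n"
        "SELECT 1 AS id, @start_ds AS ds\n"
    )
    return Context(paths=root)  # then e.g. context.plan(auto_apply=True)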

schema_name: str
_context: t.Optional[Context] = PrivateAttr(default=None)

def add_seed(self, model_name: str, columns: t.Dict[str, str], rows: t.List[t.Dict[str, str]]):
Contributor

Check out core/dialect.py::pandas_to_sql.

Or just use create_view in the engine_adapter, which accepts a pandas DataFrame.

Contributor Author

So the goal was to produce a project from a user's perspective, based on the files in the filesystem that a user would create.

A user doesn't create a seed as a pandas DataFrame (unless they're using Python models, I guess); they create a .csv file in the seeds/ directory and expose it in a SQL model using kind SEED.

However, I've implemented the pandas_to_sql version because in this case the seed data is just a vehicle to expose an issue in the INCREMENTAL_BY_UNIQUE_KEY model.
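For reference, the user-facing seed workflow being mimicked looks roughly like this (file names and contents are illustrative):

# Illustrative: the seed workflow a user would follow, written out as a helper.
from pathlib import Path

def add_seed(root: Path) -> None:
    (root / "seeds").mkdir(exist_ok=True)
    (root / "seeds" / "seed_data.csv").write_text("id,item_id\n1,1\n2,2\n")
    # A SEED-kind model exposes the csv to the rest of the project.
    (root / "models" / "seed_model.sql").write_text(
        "MODEL (\n"
        "  name test_schema.seed_model,\n"
        "  kind SEED (path '../seeds/seed_data.csv')\n"
        ");\n"
    )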

Contributor Author

I ended up removing this entirely in favour of adding the core methods to TestContext instead

@erindru force-pushed the issue-2609-incremental-model-batch-size branch from 24b08f2 to aba4fb5 May 16, 2024 04:54
…lement batch_index parameter in Airflow scheduler
@erindru force-pushed the issue-2609-incremental-model-batch-size branch from aba4fb5 to e40130a May 16, 2024 05:02
@erindru (Contributor Author) commented May 16, 2024

@izeigerman I had a go at implementing your suggestions on the Airflow scheduler code, but I ran out of time to add a test case that proves I implemented them correctly.

I'm back on Monday and can pick this up then; otherwise, feel free to finish it off if this bug is blocking someone.

@erindru marked this pull request as ready for review May 16, 2024 12:01
@erindru marked this pull request as draft May 16, 2024 12:02
@@ -37,6 +37,7 @@ class EvaluateCommandPayload(PydanticModel):
end: TimeLike
execution_time: TimeLike
deployability_index: DeployabilityIndex
batch_index: int = 0
Member

[Nit] I don't think we need a default value here, to make sure that all of the command's attributes are set explicitly upstream.

Contributor Author

Fair call, I've removed the default.
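The effect, sketched with a stripped-down payload (fields reduced to the one under discussion, and plain pydantic standing in for SQLMesh's PydanticModel):

# Without a default, validation rejects any upstream caller that forgets to
# pass batch_index explicitly.
from pydantic import BaseModel, ValidationError

class EvaluateCommandPayload(BaseModel):
    batch_index: int  # no default -> must always be set upstream

try:
    EvaluateCommandPayload()
except ValidationError as err:
    print(err)  # reports that batch_index is a required field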

@@ -20,6 +20,9 @@
]
)

# Don't load Airflow example DAGs because they cause visual pollution
docker_compose["x-airflow-common"]["environment"]["AIRFLOW__CORE__LOAD_EXAMPLES"] = "false"
Member

Contributor Author

Do you mean add an ENV entry to Dockerfile.template to set it?

The problem is that it gets overridden again the second the containers are instantiated, because it's explicitly defined in the upstream docker-compose.yml file.

@@ -1,7 +1,7 @@
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://airflow-postgres-1:5432/metastore_db</value>
<value>jdbc:postgresql://postgres:5432/metastore_db</value>
Member

Just curious: why did you have to change the hostname?

Contributor Author

I was having trouble with the new docker compose vs. the legacy docker-compose (which I think is Docker Compose V2 vs. V1):

  • The name airflow-postgres-1 is just the container name, autogenerated by docker-compose. Since the service is defined as postgres in docker-compose.yml, the network name that should be referenced is postgres.
  • Something seems to have changed in the networking that docker compose sets up in newer versions: the name airflow-postgres-1 was no longer reachable, but postgres was, so I couldn't get the tests to run without this change (see the sketch below).
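A sketch of the naming rule at play, in the same dict-style compose structure as the diff above (the hive-metastore service and image names here are hypothetical):

# The *service* key is the stable DNS name on the compose network; generated
# container names like "airflow-postgres-1" (project + service + index) are
# not guaranteed to resolve across Compose versions.
docker_compose = {
    "services": {
        "postgres": {"image": "postgres"},  # reachable as host "postgres"
        "hive-metastore": {  # hypothetical service consuming the database
            "image": "example/hive-metastore",
            "environment": {
                "ConnectionURL": "jdbc:postgresql://postgres:5432/metastore_db",
            },
        },
    },
}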

self.target = target


def test_generate_plan_application_dag__batch_index_populated(mocker: MockerFixture, make_snapshot):
Contributor Author

I couldn't find any existing tests for SnapshotDagGenerator, so I added this file.

@erindru marked this pull request as ready for review May 20, 2024 05:24
@izeigerman (Member) left a comment

This looks great, thank you!

@izeigerman merged commit 8ac8a0a into TobikoData:main May 20, 2024
12 checks passed
@erindru deleted the issue-2609-incremental-model-batch-size branch May 20, 2024 19:31