feat(pyspark): provide a mode option to manage both batch and streaming connections #9131

chloeh13q · 2024-05-06T19:37:58Z

Description of changes

Provide a mode option to manage both batch and streaming connections

chloeh13q · 2024-05-09T19:31:24Z

Along the lines of #8584, it's worth noting that PySpark in streaming mode requires all read and write paths to be directories rather than files. I think Impala has a similar requirement.
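To make the directory requirement concrete, here is a rough sketch of the kind of check it implies. The helper `check_streaming_path` is hypothetical and is not part of ibis or PySpark; it only illustrates rejecting a single file when the connection is in streaming mode.

```python
from pathlib import Path


def check_streaming_path(path: str, mode: str) -> Path:
    """Hypothetical helper: reject single-file paths in streaming mode."""
    p = Path(path)
    # Only streaming mode needs a directory; batch mode accepts either.
    if mode == "streaming" and p.is_file():
        raise ValueError(
            f"streaming mode expects a directory, but {p} is a file"
        )
    return p
```

In batch mode the same call returns the path unchanged, so callers would not need to branch on the mode themselves.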

chloeh13q · 2024-05-14T04:08:33Z

xref: ibis-project/testing-data#9

jcrist

I think a mode="streaming"/mode="batch" kwarg to connect makes sense. However, is it possible to implement this while keeping everything within a single Backend class?

Our current magical ibis.<backend>.connect method isn't really set up to handle things the way you're doing here (as you've found by having to reimplement it).
It looks like a lot of the methods in the streaming backend are pretty similar to the batch implementations, so uniting them and branching when necessary feels doable. If needed, we can always create a small abstraction layer internal to the Backend class.
For the new read_kafka/to_kafka methods, raising NotImplementedError (or something) when in batch mode seems fine.

We can always split the implementations later on if it grows unmaintainable, but for now it looks to me to be simpler to keep everything within one class.
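A minimal sketch of the single-class shape described above, for illustration only. `SketchBackend` and its method bodies are hypothetical stand-ins, not the actual ibis PySpark backend; only the `mode` kwarg and the `read_kafka` name come from this thread.

```python
from __future__ import annotations

VALID_MODES = ("batch", "streaming")


class SketchBackend:
    """Hypothetical backend that handles both modes in one class."""

    def __init__(self) -> None:
        self.mode = "batch"

    def connect(self, mode: str = "batch") -> SketchBackend:
        # Validate the mode once, at connection time.
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
        self.mode = mode
        return self

    def read_kafka(self, table_name: str) -> str:
        # Streaming-only method: refuse in batch mode rather than
        # living in a separate backend class.
        if self.mode == "batch":
            raise NotImplementedError(
                "read_kafka is only supported in streaming mode"
            )
        return f"streaming source for {table_name}"  # placeholder result

    def read_parquet(self, path: str) -> str:
        # Shared method: branch only where the two modes differ.
        if self.mode == "streaming":
            raise NotImplementedError(
                "streaming mode does not support direct registration "
                "of parquet files"
            )
        return f"batch table from {path}"  # placeholder result
```

Keeping the branch inside each method means the existing `connect` entry point stays the only way to create a backend, and the mode is just state on the instance.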

ibis/backends/pyspark/tests/conftest.py

ibis/backends/pyspark/tests/test_streaming/conftest.py

ibis/backends/pyspark/__init__.py

chloeh13q · 2024-05-29T17:42:56Z

ibis/backends/pyspark/__init__.py

+        if self.mode == "streaming":
+            raise NotImplementedError(
+                "Pyspark in streaming mode does not support direct registration of parquet files. "
+                "Please use `read_parquet_directory` instead."


We don't technically have read_parquet_directory implemented yet, but I figured that may be okay because it's going in in a separate PR. I can delete this if needed.

jcrist

One small fixup, otherwise this looks good to me!

ibis/backends/pyspark/tests/test_import_export.py


chloeh13q mentioned this pull request May 9, 2024

[EPIC] add Spark streaming support #8868

Open

1 task

chloeh13q force-pushed the feat/spark-streaming-connect branch 3 times, most recently from 066c2b3 to c942d98 Compare May 14, 2024 04:08

chloeh13q marked this pull request as ready for review May 14, 2024 04:09

jcrist reviewed May 24, 2024


chloeh13q force-pushed the feat/spark-streaming-connect branch 2 times, most recently from ffa61d4 to db25855 Compare May 28, 2024 18:02

chloeh13q requested a review from jcrist May 28, 2024 18:46

jcrist reviewed May 28, 2024


ibis/backends/pyspark/tests/conftest.py (outdated, resolved)

ibis/backends/pyspark/tests/test_streaming/conftest.py (outdated, resolved)

ibis/backends/pyspark/__init__.py (outdated, resolved)

chloeh13q requested a review from jcrist May 28, 2024 22:21

chloeh13q force-pushed the feat/spark-streaming-connect branch from 6764823 to 04fd4d0 Compare May 28, 2024 22:34

chloeh13q commented May 29, 2024


jcrist reviewed May 29, 2024


ibis/backends/pyspark/tests/test_import_export.py (outdated, resolved)

Chloe He added 12 commits May 29, 2024 14:18

feat(pyspark): provide a mode option to manage both batch and streaming connections

fec4905

feat(pyspark): implement read_kafka method

fcd5b5e

feat(pyspark): implement read and write methods and unit tests

4136d5b

refactor: move test data to ibis-testing-data, fix a small bug

00cfff3

refactor(pyspark): rename read and write methods to reflect directory level read/write and update ibis-test-data directory structure

47364a0

test(pyspark): add tests for read and write in pyspark streaming

84134f0

refactor(pyflink): consolidate methods into a single Backend class

0f87204

fix a few minor things

9ab6a7a

address comments

3bcb6a9

add unit tests

808d430

code cleanup (remove a function that is not used)

586bb25

rewrite a xfailed test with pytest.raises

f0e7e09

chloeh13q force-pushed the feat/spark-streaming-connect branch from 3f137bb to f0e7e09 Compare May 29, 2024 21:18

jcrist approved these changes May 29, 2024


jcrist enabled auto-merge (squash) May 29, 2024 21:21

jcrist merged commit e425ad5 into ibis-project:main May 29, 2024
73 checks passed

chloeh13q deleted the feat/spark-streaming-connect branch May 30, 2024 14:41
