Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(pyspark): provide a mode option to manage both batch and streaming connections #9131

Merged
merged 12 commits into from
May 29, 2024

Conversation

chloeh13q
Copy link
Contributor

@chloeh13q chloeh13q commented May 6, 2024

Description of changes

Provide a mode option to manage both batch and streaming connections

@chloeh13q
Copy link
Contributor Author

Along the lines of #8584, it's worth pointing out the fact that pyspark in streaming mode requires all read and write paths to be directories rather than files. I think Impala has something similar

@chloeh13q chloeh13q force-pushed the feat/spark-streaming-connect branch 3 times, most recently from 066c2b3 to c942d98 Compare May 14, 2024 04:08
@chloeh13q
Copy link
Contributor Author

xref: ibis-project/testing-data#9

@chloeh13q chloeh13q marked this pull request as ready for review May 14, 2024 04:09
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a mode="streaming"/mode="batch" kwarg to connect makes sense. However, is it possible to implement this while keeping everything within a single Backend class?

  • Our current magical ibis.<backend>.connect method isn't really setup to handle things the way you're doing here (as you've found by having to reimplement it).
  • It looks like a lot of the methods in the streaming backend are pretty similar to the batch implementations - uniting them and branching when necessary feels doable. If necessary, we can always create a small abstraction layer internal to the Backend class if needed.
  • For the new read_kafka/to_kafka methods, raising NotImplementedError (or something) when in batch mode seems fine.

We can always split the implementations later on if it grows unmaintainable, but for now it looks to me to be simpler to keep everything within one class.

@chloeh13q chloeh13q force-pushed the feat/spark-streaming-connect branch 2 times, most recently from ffa61d4 to db25855 Compare May 28, 2024 18:02
@chloeh13q chloeh13q requested a review from jcrist May 28, 2024 18:46
ibis/backends/pyspark/tests/conftest.py Outdated Show resolved Hide resolved
ibis/backends/pyspark/tests/test_streaming/conftest.py Outdated Show resolved Hide resolved
ibis/backends/pyspark/__init__.py Outdated Show resolved Hide resolved
@chloeh13q chloeh13q requested a review from jcrist May 28, 2024 22:21
@chloeh13q chloeh13q force-pushed the feat/spark-streaming-connect branch from 6764823 to 04fd4d0 Compare May 28, 2024 22:34
if self.mode == "streaming":
raise NotImplementedError(
"Pyspark in streaming mode does not support direction registration of parquet files. "
"Please use `read_parquet_directory` instead."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't technically have read_parquet_directory implemented yet but I figured that it may be okay because it's going in in a separate PR? But I can delete this if needed.

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small fixup, otherwise this looks good to me!

ibis/backends/pyspark/tests/test_import_export.py Outdated Show resolved Hide resolved
@chloeh13q chloeh13q force-pushed the feat/spark-streaming-connect branch from 3f137bb to f0e7e09 Compare May 29, 2024 21:18
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@jcrist jcrist enabled auto-merge (squash) May 29, 2024 21:21
@jcrist jcrist merged commit e425ad5 into ibis-project:main May 29, 2024
73 checks passed
@chloeh13q chloeh13q deleted the feat/spark-streaming-connect branch May 30, 2024 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants