-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(pyspark): provide a mode option to manage both batch and streaming connections #9131
feat(pyspark): provide a mode option to manage both batch and streaming connections #9131
Conversation
Along the lines of #8584, it's worth pointing out the fact that pyspark in streaming mode requires all read and write paths to be directories rather than files. I think Impala has something similar |
066c2b3
to
c942d98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a mode="streaming"
/mode="batch"
kwarg to connect
makes sense. However, is it possible to implement this while keeping everything within a single Backend
class?
- Our current magical
ibis.<backend>.connect
method isn't really setup to handle things the way you're doing here (as you've found by having to reimplement it). - It looks like a lot of the methods in the streaming backend are pretty similar to the batch implementations - uniting them and branching when necessary feels doable. If necessary, we can always create a small abstraction layer internal to the
Backend
class if needed. - For the new
read_kafka
/to_kafka
methods, raisingNotImplementedError
(or something) when in batch mode seems fine.
We can always split the implementations later on if it grows unmaintainable, but for now it looks to me to be simpler to keep everything within one class.
ffa61d4
to
db25855
Compare
6764823
to
04fd4d0
Compare
if self.mode == "streaming": | ||
raise NotImplementedError( | ||
"Pyspark in streaming mode does not support direction registration of parquet files. " | ||
"Please use `read_parquet_directory` instead." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't technically have read_parquet_directory
implemented yet but I figured that it may be okay because it's going in in a separate PR? But I can delete this if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small fixup, otherwise this looks good to me!
… level read/write and update ibis-test-data directory structure
3f137bb
to
f0e7e09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description of changes
Provide a mode option to manage both batch and streaming connections