Proposal: A simple approach towards Stateless UDAF #16767
Comments
Yes, but that doesn't mean it has to physically process all the input data in one call, which can be fatal when there is a lot of it (maybe millions of rows?). I think this is where the implementor of a stateless UDAF comes in to make the tradeoff, i.e. which data structure to buffer the data in in the middle, or whether to compact / pre-compute intermediate results ahead of time, depending on the semantics of the aggregator. In other words, in my mind, the implementor of a stateless UDAF should implement these interfaces to achieve pipelined computation.
It is still much simpler than Stateful UDAF because it does not need to
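To make the buffering idea above concrete, here is a minimal sketch. The names (`accumulate`, `finish`) and the class shape are illustrative assumptions, not RisingWave's actual API: the implementor chooses the buffer structure, receives input chunk by chunk, and defers all real computation to the final call.

```python
# Hypothetical sketch of a "stateless" UDAF that still receives input in
# chunks: it only buffers rows internally and runs the whole algorithm
# once in finish(). The method names are illustrative, not a real API.
class BufferedMedian:
    def __init__(self):
        self._buf = []  # implementor chooses the buffer data structure

    def accumulate(self, chunk):
        # Called once per input chunk; no incremental math here,
        # we merely collect the rows (could also pre-compact them).
        self._buf.extend(chunk)

    def finish(self):
        # The non-incremental algorithm runs once over all buffered input.
        data = sorted(self._buf)
        n = len(data)
        mid = n // 2
        return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2
```

The point is that the engine can still stream chunks into the operator; only the final computation is monolithic.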
If the performance of passing whole input at once is acceptable, I guess
Wait, I don't get this. Why 2 copies? I think only one copy for
Keep in mind that the stateless UDAF needs to handle the cases that cannot be incremental. This might be more common than you imagine. Many streaming jobs were migrated from batch, where complicated algorithms are freely used and were not designed for incremental computation at all. For a textbook example: combined with a session window, a UDAF accepts tens of events (browsing history, clicks, impressions, etc.), executes some statistical regression, and then outputs a score for the likelihood of scam. For those algorithms that can be made incremental, risingwavelabs/arrow-udf#23 will solve it.
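As a made-up illustration of the session-window example above (the function and its scoring formula are inventions for this sketch, not anything from the issue), a scoring function may need the whole session at once, because it iterates over all events to fit a least-squares line and so cannot be expressed as a per-row incremental update:

```python
# Illustrative only: a score that requires the *whole* session of events,
# since it fits a least-squares slope over all of them at once.
def scam_score(timestamps, clicks):
    n = len(timestamps)
    mean_t = sum(timestamps) / n
    mean_c = sum(clicks) / n
    # slope of click activity over time via ordinary least squares
    cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(timestamps, clicks))
    var = sum((t - mean_t) ** 2 for t in timestamps)
    slope = cov / var if var else 0.0
    # squash to a (made-up) likelihood in [0, 1)
    return abs(slope) / (1 + abs(slope))
```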
https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html Spark UDAF is also such an “incremental” API. So I’m confused about what “cannot be incremental” and “stateless” really mean. 🤔
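For context on what "incremental API" means here: Spark's `Aggregator` is shaped around zero/reduce/merge/finish callbacks. The following is a minimal Python analogue of that shape, written for this discussion; it is not Spark code:

```python
# A minimal Python analogue of the incremental shape of Spark's Aggregator
# (zero / reduce / merge / finish). Illustrative only, not Spark's API.
class AvgAggregator:
    def zero(self):
        return (0.0, 0)                    # initial state: (sum, count)

    def reduce(self, state, value):
        s, n = state
        return (s + value, n + 1)          # fold in one input row

    def merge(self, a, b):
        return (a[0] + b[0], a[1] + b[1])  # combine two partial states

    def finish(self, state):
        s, n = state
        return s / n if n else None        # final result from the state
```

An algorithm fits this API only if a small intermediate state can absorb rows one at a time, which is exactly what "cannot be incremental" algorithms lack.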
I just discussed this with @st1page, and we clarified some points:
Well, I think some algorithms indeed cannot be incremental, e.g., if you need to iterate over the whole input multiple times. In this case, all computation is performed in

Ohhh, I finally understand: the API @fuyufjh proposed is equivalent to only having

So there are 3 cases:
Yeah, of course. Rather than "isn't always", I tend to say "is almost never" for user-defined functions.
I understand @st1page's concern, but that would be putting the cart before the horse. The users need to bring a non-incremental algorithm here, and this is the requirement. If the only concern is about the data amount, we can actually add some limitation, or use
Again, keep in mind the algorithm is not incremental. You are basically saying that users should manually create a buffer and put rows into it for each
Besides, this idea doesn't conflict with risingwavelabs/arrow-udf#23. I'd like to offer both to users and see their reactions.
Yes, but I'm curious how users of batch frameworks like Spark do it, because that seems to be the only way from the UDAF API.
Is your feature request related to a problem? Please describe.
Sharing some ideas about UDAF from our discussion with @wangrunji0408.
Stateless UDAF and Stateful UDAF are very different in both implementation and use cases.
Describe the solution you'd like
We don’t really need to offer a framework for stateless UDAF. Instead, we can reuse the interface of UDF and accept an ARRAY as the input argument. For example, supposing the user has provided such a UDF:
or
Then they can not only use it as a normal UDF but also as a UDAF in agg or window agg.
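As a hypothetical illustration of the idea (the function name, signature, and SQL below are assumptions for this sketch, not the exact proposal), the user writes an ordinary UDF over a whole array, and the engine could also invoke it as a UDAF by passing all rows of a group or window as one array:

```python
# Hypothetical example: an ordinary scalar UDF whose argument is an ARRAY.
# Because it consumes the whole array at once, the engine could also call
# it as a UDAF, feeding it all rows of a group/window as a single array.
def p90(values: list[float]) -> float:
    """Return the 90th-percentile element of the input array."""
    if not values:
        raise ValueError("empty input")
    data = sorted(values)
    idx = min(int(0.9 * len(data)), len(data) - 1)
    return data[idx]

# As a normal UDF (assumed SQL):  SELECT p90(arr_column) FROM t;
# As a UDAF (the proposal):       SELECT p90(x) FROM t GROUP BY k;
```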
A trivial approach is using array_agg() and then calling the UDF with the result array. However, in this way, both the result array and the input rows are persisted, resulting in 2 copies of data in the state table. Besides, the array_agg() is written by users, which looks very unfriendly. To make it more efficient and to make the syntax above just work, we need a special code path to handle these use cases.
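To make the two syntaxes concrete (the UDF, table, and column names below are made up for illustration): the trivial rewrite forces the user to spell out the aggregation into an array themselves, while the proposed code path would accept the direct call and buffer internally, persisting only one copy:

```python
# Hypothetical SQL, with made-up names, contrasting the two approaches.
trivial = "SELECT my_udf(array_agg(x)) FROM t GROUP BY k"
# persists both the input rows and the aggregated array (2 copies)

proposed = "SELECT my_udf(x) FROM t GROUP BY k"
# engine recognizes the array-argument UDF and buffers internally (1 copy)
```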
Describe alternatives you've considered
No response
Additional context
No response