Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: add test for impure function correlation behavior #9014

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Apr 18, 2024

Related to #8921,
trying to write down exactly what the expected behavior is.

I figure we can use this PR to hash out exaclty what we want the semantics to be, and then the other discussions might be easier because our goal is written down precisely somewhere. Please let me know if you agree or disagree with this behavior, or if there are other tests we should add.

Need to fix the UDF test case in a followup. Also wasn't sure where to put these tests, I put them in their own little file but if you point me elsewhere Iwill move them.

@NickCrews
Copy link
Contributor Author

oooh, these CI failures are revealing differences in the backend behaviors... wheeee into the 🐰 🕳️ we go!

@NickCrews
Copy link
Contributor Author

I can think of two reasons a user might care about the semantics here, and I hope we can support both of their needs:

  1. impureness/correlatedness. If a function is impure, they may want it to be executed once, or many times, depending on if they want them to be correlated. See this example
  2. performance. If some computation is slow, they only want it to happen a single time.

So I think this means that we as ibis authors can't assume what the goals of the user is, and how many times they want an expression executed. Therefore, we shouldn't do any clever rewrites or mergings of selects. I think we need to keep a more 1:1 correspondence between what the user writes and the SQL we produce: Every time the user does a .select(), .mutate(), etc, (except for simple column renamings, and maybe a few other cases) that leads to exactly one more with select .... as ... in the generated SQL.

IDK, what do you think of this train of reasoning? I think I'm fairly convinced that those two use cases are the requirements for success, but perhaps there is a different/better way of accomplishing that goal

@kszucs
Copy link
Member

kszucs commented Apr 22, 2024

@NickCrews please rebase to test with #9023 change included

@cpcloud
Copy link
Member

cpcloud commented Apr 22, 2024

@NickCrews My only objection would be that

performance. If some computation is slow, they only want it to happen a single time.

Is not something we can enforce even if we never merge any select statements. This kind of guarantee is at the level of the query engine.

@NickCrews
Copy link
Contributor Author

@cpcloud yup you are right with guarantees, "suggesting" to the backend is the best we can do.

What do you think in general of my proposal of "one CTE per .select"? I'm not sure if you're skeptical of the whole thing or just the performance claims... Thanks!

@NickCrews
Copy link
Contributor Author

I'm trying to decide what is higher priority:

An implementation that is faster 90% of the time, but does clever things and therefore isn't able to be tuned by the user that 10% of the time they need it

Vs

An implementation that is a bit slower in the majority of cases, but is always fine tunable to get the perf you need in the edge cases.

@NickCrews NickCrews force-pushed the test-correlation branch 2 times, most recently from 7700941 to d14384c Compare April 23, 2024 19:21
@NickCrews
Copy link
Contributor Author

slowly going through the backends and adding the correct marks for each kind of failure...

@NickCrews NickCrews force-pushed the test-correlation branch 3 times, most recently from 53c6941 to ee0ae0c Compare April 23, 2024 21:33
Need to fix the a few broken cases.

Related to ibis-project#8921,
trying to write down exactly what the expected behavior is.
@NickCrews
Copy link
Contributor Author

@cpcloud @kszucs I think this is ready for review whenever you get the chance! I think this is the groundwork for defining the current state, and after we get this in then we can start talking about what we think ideal behavior should be, and how to get there.

@NickCrews NickCrews requested review from cpcloud and kszucs May 7, 2024 01:48
@NickCrews
Copy link
Contributor Author

Anything I can do here to help move this forward?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants