-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Ibis slows down considerably for wider tables #9111
Comments
I've benchmarked Ibis 8 vs. 9 on 55 Ibis expressions which are part of a Data Warehouse built using https://github.com/binste/dbt-ibis. I've measured:
Here some pseudo code to illustrate import ibis
t = ibis.table(...)
# --------------------------
# START: Execution of Ibis code
# --------------------------
t = t.group_by(...).select(...)
# --------------------------
# END: Execution of Ibis code
# --------------------------
# --------------------------
# START: Convert Ibis expression to SQL
# --------------------------
ibis.to_sql(t)
# --------------------------
# END: Convert Ibis expression to SQL
# -------------------------- The great news is that the compilation to SQL got significantly faster with the move to SQLGlot which is super nice! :) The execution of Ibis code on the other hand got a bit slower with one expression taking significantly longer with 11 seconds. I've profiled that expression and most time is spent in the following statements:
Hope this helps! |
@binste Thanks for digging into this. Interesting results! Can you share the query that's now taking 11 seconds with Ibis 9.0? |
Regarding drop and relocate, they're both implemented using a somewhat naive approach:
I suspect that for
We may be able to take a similar approach for relocate by using a data structure more optimized for the operation. I think something that has fast ordered inserts (perhaps |
Unfortunately, I can't as I'd need to mask column names and code logic for IP reasons. It's 2 input tables with each around 50 columns and 1 table with ~10 columns and then various operations on top of it. But happy to test out any PRs if there is a wheel file available! |
Naive benchmark here, but for a quick test: # drop_test.py
import ibis
import time
from contextlib import contextmanager
@contextmanager
def tictoc(num_cols):
tic = time.time()
yield
toc = time.time()
print(f"| {num_cols} | {toc - tic} seconds")
print(f"{ibis.__version__=}")
for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
t = ibis.table(name="t", schema=[(f"a{i}", "int") for i in range(num_cols)])
with tictoc(num_cols):
t.drop("a8") 🐚 python drop_test.py
ibis.__version__='8.0.0'
| 10 | 0.18016910552978516 seconds
| 20 | 0.0016529560089111328 seconds
| 50 | 0.0037081241607666016 seconds
| 100 | 0.007429361343383789 seconds
| 200 | 0.013902902603149414 seconds
| 500 | 0.03390693664550781 seconds
| 1000 | 0.06650948524475098 seconds 🐚 python drop_test.py
ibis.__version__='9.0.0'
| 10 | 0.002298593521118164 seconds
| 20 | 0.005956888198852539 seconds
| 50 | 0.027918338775634766 seconds
| 100 | 0.09690213203430176 seconds
| 200 | 0.36721324920654297 seconds
| 500 | 2.301510810852051 seconds
| 1000 | 9.317416191101074 seconds |
What happened?
In a web application, I'm creating rather complicated Ibis expressions in the backend. The execution of the final expression takes < 1 second but creating the expression itself took up to 30 seconds. Took me a while to figure out what's going on but I think it's because Ibis gets considerably slower for wider tables. I don't need all columns in my app and so I was now able to improve performance by just pruning away right from the beginning what I don't need.
However, in case you see any improvement potential, below some example code to demonstrate it. If this is just inherent to Ibis, what do you think about a note in the documentation? I only found #6832 which is somewhat related.
Setup
Drop and relocate
In more complex expressions, having multiple of these drop statements can quickly sum up:
Same for relocate:
select
andmutate
do not have this issue:Selectors
I've also noticed that Ibis selectors can be much slower than using a pure Python implementation:
What version of ibis are you using?
9.0.0
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: