Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Ibis slows down considerably for wider tables #9111

Open
1 task done
binste opened this issue May 3, 2024 · 5 comments
Open
1 task done

perf: Ibis slows down considerably for wider tables #9111

binste opened this issue May 3, 2024 · 5 comments
Labels
bug Incorrect behavior inside of ibis

Comments

@binste
Copy link
Contributor

binste commented May 3, 2024

What happened?

In a web application, I'm creating rather complicated Ibis expressions in the backend. The execution of the final expression takes < 1 second but creating the expression itself took up to 30 seconds. Took me a while to figure out what's going on but I think it's because Ibis gets considerably slower for wider tables. I don't need all columns in my app and so I was now able to improve performance by just pruning away right from the beginning what I don't need.

However, in case you see any improvement potential, below some example code to demonstrate it. If this is just inherent to Ibis, what do you think about a note in the documentation? I only found #6832 which is somewhat related.

Setup

import ibis
import ibis.selectors as s
from ibis import deferred as d_

t = ibis.table(name="base1", schema={f"col{i}": "int64" for i in range(200)})

Drop and relocate

In more complex expressions, having multiple of these drop statements can quickly sum up:

t1 = t.drop("col0", "col1", "col2")  # 500ms

Same for relocate:

t1 = t.relocate("col100", before="col2")  # 1,700ms -> 1.7 seconds

select and mutate do not have this issue:

t1 = t.select("col0", "col1", "col2")  # 3ms
t1 = t.mutate(colnew=ibis.literal(1))  # 20ms

Selectors

I've also noticed that Ibis selectors can be much slower than using a pure Python implementation:

t.mutate(s.across(s.c(*[f"col{i}" for i in range(20)]), d_.cast("str")))  # 580ms
t.mutate(**{f"col{i}": d_[f"col{i}"] for i in range(20)})  # 70ms

What version of ibis are you using?

9.0.0

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@binste binste added the bug Incorrect behavior inside of ibis label May 3, 2024
@binste
Copy link
Contributor Author

binste commented May 3, 2024

I've benchmarked Ibis 8 vs. 9 on 55 Ibis expressions which are part of a Data Warehouse built using https://github.com/binste/dbt-ibis. I've measured:

  • Execution of Ibis code: The time it takes to execute the Ibis code to get to the final expression
  • Convert Ibis expression to SQL: The time it takes to compile the final Ibis expression to SQL

Here some pseudo code to illustrate

import ibis

t = ibis.table(...)

# --------------------------
# START: Execution of Ibis code
# --------------------------
t = t.group_by(...).select(...)

# --------------------------
# END: Execution of Ibis code
# --------------------------

# --------------------------
# START: Convert Ibis expression to SQL
# --------------------------
ibis.to_sql(t)
# --------------------------
# END: Convert Ibis expression to SQL
# --------------------------

image

image

The great news is that the compilation to SQL got significantly faster with the move to SQLGlot which is super nice! :) The execution of Ibis code on the other hand got a bit slower with one expression taking significantly longer with 11 seconds. I've profiled that expression and most time is spent in the following statements:

  • Table.relocate ibis/expr/types/relations.py:4279
    • Code here looks like this: .relocate("valid_to", after="valid_from")
  • Join.wrapper ibis/expr/types/joins.py:188
  • Table.drop ibis/expr/types/relations.py:2329
  • Table.rename ibis/expr/types/relations.py:2140

Hope this helps!

@cpcloud
Copy link
Member

cpcloud commented May 3, 2024

@binste Thanks for digging into this. Interesting results!

Can you share the query that's now taking 11 seconds with Ibis 9.0?

@cpcloud
Copy link
Member

cpcloud commented May 3, 2024

Regarding drop and relocate, they're both implemented using a somewhat naive approach:

  • drop: project every column except the input columns
  • relocate: basically a reprojection that loops through every column

I suspect that for drop we can turn it into a special operation that we should at least be able to make the operation scale with the number of dropped columns instead of the total number of columns. A place to start might be:

  1. create a special Drop relational operation
  2. implement the column drop in select merging by looping over the drop list and removing each element of the drop list from the Select fields. Select fields are dicts, so that should get us to O(droplist) instead of O(allcolumns).

We may be able to take a similar approach for relocate by using a data structure more optimized for the operation. I think something that has fast ordered inserts (perhaps sortedcontainers has a sorted map container?) might be a good place to start.

@binste
Copy link
Contributor Author

binste commented May 3, 2024

Unfortunately, I can't as I'd need to mask column names and code logic for IP reasons. It's 2 input tables with each around 50 columns and 1 table with ~10 columns and then various operations on top of it. But happy to test out any PRs if there is a wheel file available!

@gforsyth
Copy link
Member

gforsyth commented May 3, 2024

Naive benchmark here, but for a quick test:

# drop_test.py
import ibis
import time
from contextlib import contextmanager


@contextmanager
def tictoc(num_cols):
    tic = time.time()
    yield
    toc = time.time()
    print(f"| {num_cols} | {toc - tic} seconds")


print(f"{ibis.__version__=}")


for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
    t = ibis.table(name="t", schema=[(f"a{i}", "int") for i in range(num_cols)])

    with tictoc(num_cols):
        t.drop("a8")
🐚 python drop_test.py
ibis.__version__='8.0.0'
| 10 | 0.18016910552978516 seconds
| 20 | 0.0016529560089111328 seconds
| 50 | 0.0037081241607666016 seconds
| 100 | 0.007429361343383789 seconds
| 200 | 0.013902902603149414 seconds
| 500 | 0.03390693664550781 seconds
| 1000 | 0.06650948524475098 seconds
🐚 python drop_test.py
ibis.__version__='9.0.0'
| 10 | 0.002298593521118164 seconds
| 20 | 0.005956888198852539 seconds
| 50 | 0.027918338775634766 seconds
| 100 | 0.09690213203430176 seconds
| 200 | 0.36721324920654297 seconds
| 500 | 2.301510810852051 seconds
| 1000 | 9.317416191101074 seconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Status: backlog
Development

No branches or pull requests

3 participants