Modify the pandas analyzer code to always respect the sample size #12097

pdet · 2024-05-16T18:44:54Z

The pandas analyzer ignored the sample size limit for null values when sniffing data types from object-type columns.
For an object column where most (or all) values are null, the whole column would be sniffed.

Now the null values are no longer skipped, and if the column was only sampled (not fully scanned) and only nulls were found, we upgrade the type from null to varchar.
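To make the described behavior concrete, here is a minimal pure-Python sketch of the sampling rule. It is not the actual analyzer code (which lives in DuckDB's C++ sources); the function name `sniff_object_column`, the type-name mapping, and the evenly spaced sampling step are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# Hypothetical mapping from Python value types to SQL-like type names.
_TYPE_NAMES = {int: "BIGINT", float: "DOUBLE", str: "VARCHAR"}


def sniff_object_column(values: Sequence[object], sample_size: int) -> str:
    """Illustrative only: infer a type name from roughly `sample_size` rows."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    n = len(values)
    # Visit roughly every (n // sample_size)-th row; never use a zero step.
    step = max(1, n // sample_size)
    scanned_everything = step == 1
    found: Optional[str] = None
    for i in range(0, n, step):
        value = values[i]
        if value is None:
            # A null still uses up a sample slot; we do not keep searching
            # past the sample for a non-null value.
            continue
        name = _TYPE_NAMES.get(type(value), "VARCHAR")
        if found is None:
            found = name
        elif found != name:
            found = "VARCHAR"  # mixed types fall back to VARCHAR
    if found is not None:
        return found
    # Only nulls were seen: keep NULL if every row was checked,
    # otherwise upgrade to VARCHAR because unseen rows may hold values.
    return "NULL" if scanned_everything else "VARCHAR"
```

Under these assumptions, a 1000-row all-null column sampled at 10 rows yields `VARCHAR`, while a 5-row all-null column (fully scanned) stays `NULL`. A column whose only non-null value is an integer outside the sample also yields `VARCHAR`, which is the trade-off raised later in this thread.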

This speeds up the scan of the NYC Taxi dataset from a dataframe by almost 2 orders of magnitude (about 67x).

Old Time: 1.72
New Time: 0.025545666925609112

Sample benchmark:

wget "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2016-01.parquet" -O "tripdata.parquet"

import pandas
import duckdb

# Load the NYC Taxi trips into a pandas DataFrame
df = pandas.read_parquet("tripdata.parquet")
con = duckdb.connect()
# DuckDB resolves `df` from the local Python scope (replacement scan)
sql = """select passenger_count, avg(tip_amount) as tip_amount from df where trip_distance < 5 group by passenger_count order by passenger_count"""
result = con.execute(sql).df()

scripts/regression_test_python.py

Tishj · 2024-05-16T19:25:22Z

Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method

pdet · 2024-05-16T20:00:27Z

> Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method

This should still work, no?

Tishj · 2024-05-16T20:06:09Z

> > Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method
>
> This should still work, no?

Can you explain how?
This PR pretty much entirely undoes #9811 which is the PR that was made to address this issue

If the analyzer can only find nulls at the given offset, the resulting type is null
That reintroduces the problem of #6669, does it not?

pdet · 2024-05-16T20:32:20Z

> > > Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method
>
> > This should still work, no?
>
> Can you explain how?
> This PR pretty much entirely undoes #9811 which is the PR that was made to address this issue
>
> If the analyzer can only find nulls at the given offset, the resulting type is null
> That reintroduces the problem of #6669, does it not?

Not really, if the values sniffed are all null and we did not sniff the full dataframe, it now defaults to a varchar.

Tishj · 2024-05-17T08:44:25Z

That just means that this does not break the referenced issue because the columns happen to be VARCHAR; that's not equivalent.

I'm fine merging this, and I agree that this regression is a problem, but we do have to be aware of the behavior we're throwing away here and the problems that this will inevitably cause.

Mytherin · 2024-05-17T08:49:23Z

Maybe there's a faster way of checking for NULL values instead of calling get_item for every row (which is very slow)?

Mytherin · 2024-05-17T08:53:05Z

e.g. maybe we could use https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html - although ideally we wouldn't do that for the entire DataFrame range

Mytherin · 2024-05-17T08:57:00Z

How about we do that as a fallback - if the result type is SQLNULL (i.e. we only found null values), we call isnull or similar to find any non-NULL values in the column (if there are any) and take their type instead.

pdet · 2024-05-17T09:18:10Z

> How about we do that as a fallback - if the result type is SQLNULL (i.e. we only found null values), we call isnull or similar to find any non-NULL values in the column (if there are any) and take their type instead.

Sure, I'm happy to try/benchmark that :-)

pdet · 2024-05-17T10:47:21Z

@Mytherin I've modified the code to essentially use __getitem__(first_valid_index()).
The time on the benchmark when executing a query on top of the NYC Taxi dataframe is now: 0.22
So one order of magnitude faster than main, but one order of magnitude slower than just defaulting to varchar.
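As an illustration of the `__getitem__(first_valid_index())` approach, here is a small pandas sketch. The helper name `first_non_null` is hypothetical, and the real change is made inside DuckDB's analyzer rather than in Python user code; this only shows what the two pandas calls return.

```python
import pandas as pd


def first_non_null(column: pd.Series):
    """Return the first non-null value of `column`, or None if all values are null."""
    # first_valid_index() returns the index label of the first non-null
    # entry, or None when the column holds only nulls.
    label = column.first_valid_index()
    if label is None:
        return None
    # Equivalent to column.__getitem__(label); assumes unique index labels.
    return column[label]


col = pd.Series([None, None, "abc", None], dtype=object)
value = first_non_null(col)
```

Here `value` is the string `"abc"`, so a fallback built this way could take the type of that value instead of defaulting to varchar. For a column containing only nulls the helper returns `None`.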

One thing to notice is that the query being run does not require the columns that are object types. Maybe one thing we should consider doing is performing projection pushdown on the dataframe.

Mytherin · 2024-05-17T12:25:26Z

> One thing to notice is that the query that is running does not require the columns that are object types. Maybe one thing we should consider doing is performing projection pushdown on the dataframe.

The problem is that what we really need in that case is some callback on when we need the type of the column, then we could lazily figure out the type only when required. That's not really supported infrastructure wise in the system right now - but would definitely be interesting to add.
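A rough sketch of what such lazy resolution could look like, purely as an illustration: the class `LazyColumnType` and its `get` method are hypothetical and do not correspond to anything in DuckDB, which, as noted above, does not currently support this.

```python
from typing import Callable, Optional


class LazyColumnType:
    """Hypothetical wrapper that defers type analysis until first requested."""

    def __init__(self, analyze: Callable[[], str]) -> None:
        self._analyze = analyze
        self._resolved: Optional[str] = None
        self.calls = 0  # number of times the (expensive) analysis actually ran

    def get(self) -> str:
        # Run the analysis only on first access, then reuse the cached result.
        if self._resolved is None:
            self.calls += 1
            self._resolved = self._analyze()
        return self._resolved


# Two columns; only the one a query references gets analyzed.
columns = {
    "passenger_count": LazyColumnType(lambda: "BIGINT"),
    "store_and_fwd_flag": LazyColumnType(lambda: "VARCHAR"),
}
needed = columns["passenger_count"].get()
needed_again = columns["passenger_count"].get()
```

In this sketch the object column `store_and_fwd_flag` is never analyzed because nothing asks for its type, which is the effect the projection pushdown idea is after.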

pdet · 2024-05-21T16:22:17Z

@Mytherin The CI here is failing because the previous implementation has a bug on the benchmark I've added.

Mytherin · 2024-05-21T16:29:53Z

Ah yeah, good point. LGTM in that case

Merge pull request duckdb/duckdb#12152 from carlopi/allow_community_extensions
Merge pull request duckdb/duckdb#12097 from pdet/pandas_object_analyzer

pdet added 2 commits May 16, 2024 20:39

Modify the pandas analyzer code to always respect the sample size

b213401

Add regression test

1b817af

Tishj reviewed May 16, 2024

scripts/regression_test_python.py (outdated, resolved)

Accidental comment of methods

4f1650c

duckdb-draftbot marked this pull request as draft May 16, 2024 20:19

pdet marked this pull request as ready for review May 16, 2024 21:08

Modify fallback to still identify the first non null column

808e256

duckdb-draftbot marked this pull request as draft May 17, 2024 10:43

pdet marked this pull request as ready for review May 17, 2024 10:47

make sure we are not /0

92c97a5

duckdb-draftbot marked this pull request as draft May 17, 2024 12:33

Mytherin marked this pull request as ready for review May 17, 2024 12:41

Mytherin added the CI Failure label May 17, 2024

Mytherin merged commit e09a044 into duckdb:main May 21, 2024
42 of 43 checks passed
