Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify the pandas analyzer code to always respect the sample size #12097

Merged
merged 5 commits into from
May 21, 2024

Conversation

pdet
Copy link
Contributor

@pdet pdet commented May 16, 2024

The Pandas Analyzer code would ignore the sample size limit for null values when sniffing data types from object type columns.
In the case where we have an object column where most (or all) values are null, the whole column would be sniffed.

Now the null values are not skipped, and if we sampled the file, we upgrade the type from null to varchar, if only nulls were found.

This improves the scan of the NYC Taxi dataset from a dataframe by 2 orders of magnitude.

Old Time: 1.72
New Time: 0.025545666925609112

Sample benchmark:

wget "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2016-01.parquet" -O "tripdata.parquet"
import pandas 
import duckdb

df = pandas.read_parquet("tripdata.parquet")
con = duckdb.connect()
sql = f""" select passenger_count, avg(tip_amount) as tip_amount from df where trip_distance < 5 group by passenger_count order by passenger_count"""
result =  con.execute(sql).df()

@Tishj
Copy link
Contributor

Tishj commented May 16, 2024

Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method

@pdet
Copy link
Contributor Author

pdet commented May 16, 2024

Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method

This should still work, no?

@Tishj
Copy link
Contributor

Tishj commented May 16, 2024

Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method

This should still work, no?

Can you explain how?
This PR pretty much entirely undoes #9811 which is the PR that was made to address this issue

If the analyzer can only find nulls at the given offset, the resulting type is null
That reintroduces the problem of #6669, does it not?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 16, 2024 20:19
@pdet
Copy link
Contributor Author

pdet commented May 16, 2024

Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method

This should still work, no?

Can you explain how? This PR pretty much entirely undoes #9811 which is the PR that was made to address this issue

If the analyzer can only find nulls at the given offset, the resulting type is null That reintroduces the problem of #6669, does it not?

Not really, if the values sniffed are all null and we did not sniff the full dataframe, it now defaults to a varchar.

@pdet pdet marked this pull request as ready for review May 16, 2024 21:08
@Tishj
Copy link
Contributor

Tishj commented May 17, 2024

That just means that this does not break the referenced issue because the columns happen to be VARCHAR, that's not equivalent

I'm fine merging this, I agree that this regression is a problem, but we do have to be aware of the behavior we're throwing away here and the problems that will inevitably cause

@Mytherin
Copy link
Collaborator

Maybe there's a faster way of checking for NULL values instead of calling get_item for every row (which is very slow)?

@Mytherin
Copy link
Collaborator

e.g. maybe we could use https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html - although ideally we wouldn't do that for the entire DataFrame range

@Mytherin
Copy link
Collaborator

How about we do that as a fallback - if the result type is SQLNULL (i.e. we only found null values), we call isnull or similar to find any non-NULL values in the column (if there are any) and take their type instead.

@pdet
Copy link
Contributor Author

pdet commented May 17, 2024

How about we do that as a fallback - if the result type is SQLNULL (i.e. we only found null values), we call isnull or similar to find any non-NULL values in the column (if there are any) and take their type instead.

Sure, I'm happy to try/benchmark that :-)

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 17, 2024 10:43
@pdet
Copy link
Contributor Author

pdet commented May 17, 2024

@Mytherin I've modifed the code to essentially use __getitem__(first_valid_index()).
The time on the benchmark when executing a query on top of the NYC Taxi dataframe is now: 0.22
So one order of magnitude faster than main, but one order of magnitude slower than just defaulting to varchar.

One thing to notice is that the query that is running does not require the columns that are object types. Maybe one thing we should consider doing is performing projection pushdown and the dataframe.

@pdet pdet marked this pull request as ready for review May 17, 2024 10:47
@Mytherin
Copy link
Collaborator

One thing to notice is that the query that is running does not require the columns that are object types. Maybe one thing we should consider doing is performing projection pushdown and the dataframe.

The problem is that what we really need in that case is some callback on when we need the type of the column, then we could lazily figure out the type only when required. That's not really supported infrastructure wise in the system right now - but would definitely be interesting to add.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 17, 2024 12:33
@Mytherin Mytherin marked this pull request as ready for review May 17, 2024 12:41
@pdet
Copy link
Contributor Author

pdet commented May 21, 2024

@Mytherin The CI here is failing because the previous implementations has a bug on the benchmark I've added

@Mytherin Mytherin merged commit e09a044 into duckdb:main May 21, 2024
42 of 43 checks passed
@Mytherin
Copy link
Collaborator

Ah yeah, good point. LGTM in that case

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 21, 2024
Merge pull request duckdb/duckdb#12152 from carlopi/allow_community_extensions
Merge pull request duckdb/duckdb#12097 from pdet/pandas_object_analyzer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants