bug: .sql() sometimes execute the query in order to get the schema #9162

Open
chloeh13q opened this issue May 8, 2024 · 2 comments · May be fixed by #9290
Labels: bug (Incorrect behavior inside of ibis), flink (Issues or PRs related to Flink), pyspark (The Apache PySpark backend)

Comments

@chloeh13q
Contributor

What happened?

The .sql() method calls _get_schema_using_query() under the hood, which uses the query to infer the schema when no schema is passed as an argument. The implementation of _get_schema_using_query() differs across backends, but for the most part, if the backend provides a way to analyze the query, that's what we use. If the backend doesn't, we often create a new view/table, execute the query, and then drop the view/table so that the method has no side effects. In some backends, however, we just execute the query. Examples: Flink, PySpark.
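To illustrate the create-a-view, read-the-schema, drop-the-view pattern described above, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in backend. The helper name `schema_via_temp_view` and the probe view's naming scheme are hypothetical; this is not the ibis implementation of `_get_schema_using_query()`.

```python
import sqlite3
import uuid


def schema_via_temp_view(con: sqlite3.Connection, query: str) -> list[tuple[str, str]]:
    """Return (column name, declared type) pairs for `query` via a throwaway view."""
    # Hypothetical naming scheme: a random suffix avoids clashing with user objects.
    view = f"_schema_probe_{uuid.uuid4().hex}"
    con.execute(f"CREATE TEMP VIEW {view} AS {query}")
    try:
        # PRAGMA table_info also works on views; each row is
        # (cid, name, type, notnull, dflt_value, pk).
        rows = con.execute(f"PRAGMA table_info({view})").fetchall()
    finally:
        # Drop the view so the probe leaves no side effect behind.
        con.execute(f"DROP VIEW {view}")
    return [(row[1], row[2]) for row in rows]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, label TEXT)")
schema = schema_via_temp_view(con, "SELECT id, label FROM t WHERE id > 10")
```

The `try`/`finally` is what keeps the probe free of side effects: the view is dropped even if reading its metadata fails.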

What version of ibis are you using?

main

@chloeh13q chloeh13q added the bug Incorrect behavior inside of ibis label May 8, 2024
@gforsyth gforsyth added pyspark The Apache PySpark backend flink Issues or PRs related to Flink labels May 10, 2024
@gforsyth
Member

It's unclear if any of the backends are actually executing the queries (which would be a bug), or if they're returning a deferred object with schema information for us to extract.

Probably the first step here is to add a unit test that runs a .sql() call that would be expensive to actually execute and then checks that it returns in a "short" amount of time.
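A rough sketch of what such a check could look like, again using `sqlite3` as a stand-in rather than a real ibis backend. The query, the helper names `column_names_via_view` and `timed`, and the one-second budget are all assumptions chosen for illustration.

```python
import sqlite3
import time

# Assumed workload: a recursive CTE that would generate 50 million rows if run.
EXPENSIVE_QUERY = """
WITH RECURSIVE n(i) AS (
    SELECT 1
    UNION ALL
    SELECT i + 1 FROM n WHERE i < 50000000
)
SELECT i, i * 2 AS doubled FROM n
"""


def column_names_via_view(con, query):
    # Hypothetical schema probe: define a view, read its columns, drop it.
    con.execute(f"CREATE TEMP VIEW _probe AS {query}")
    try:
        rows = con.execute("PRAGMA table_info(_probe)").fetchall()
    finally:
        con.execute("DROP VIEW _probe")
    return [row[1] for row in rows]


def timed(fn, *args):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


con = sqlite3.connect(":memory:")
names, elapsed = timed(column_names_via_view, con, EXPENSIVE_QUERY)
```

A test would then assert that `elapsed` stays under a generous budget. Wall-clock thresholds can be flaky on loaded CI machines, so the budget should sit far below the cost of actually running the query and far above the cost of schema inference.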

@cpcloud
Member

cpcloud commented Jun 2, 2024

Investigated this, and here are the results:

| backend | cheap? | description |
| --- | --- | --- |
| bigquery | ✔️ | uses `dry_run=True` which doesn't execute the query |
| clickhouse | ✔️ | creates a view then gets the view's schema |
| datafusion | ✔️ | creates a view then gets the view's schema |
| druid | ✔️ | doesn't support `.sql` |
| duckdb | ✔️ | calls `DESCRIBE` on the query, which doesn't execute the query |
| exasol | ✔️ | creates a view then gets the view's schema |
| flink | ✔️ | `sql_query` doesn't execute the query, it constructs a Java object using a private constructor that only sets its inputs as attributes on a class. |
| impala | ✔️ | creates a view then gets the view's schema |
| mssql | ✔️ | uses a special function `sp_describe_first_result_set` which gets metadata using static analysis |
| mysql | | EXECUTES |
| oracle | ✔️ | creates a view then gets the view's schema |
| polars | | uses `eager=None` which can be `True` at init-time (not sure what that means), so will address this in a PR to explicitly be `False`. |
| postgres | ✔️ | creates a view then gets the view's schema |
| pyspark | ✔️ | we use `SparkSession.sql` to get type information, and this method doesn't execute the query before returning |
| risingwave | ✔️ | creates a view then gets the view's schema |
| snowflake | ✔️ | executes a `LIMIT 0` version of the query and returns the `DESCRIBE RESULT` output |
| sqlite | ✔️ | creates a view then gets the view's schema |
| trino | ✔️ | creates a prepared statement from a query and gets the `DESCRIBE OUTPUT` output; prepared statements do not execute the query |
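As a rough illustration of the `LIMIT 0` strategy listed for Snowflake, the sketch below wraps a query and reads column names from the DB-API cursor metadata. It uses `sqlite3` as a stand-in, so it does not show Snowflake's `DESCRIBE RESULT`; the helper name `column_names_via_limit_zero` is hypothetical, and whether a `LIMIT 0` wrapper is actually cheap depends on the engine's planner.

```python
import sqlite3


def column_names_via_limit_zero(con: sqlite3.Connection, query: str) -> list[str]:
    # Wrap the query so the engine can stop before producing any rows.
    cur = con.execute(f"SELECT * FROM ({query}) AS q LIMIT 0")
    # DB-API cursors expose result metadata in `description`; the first
    # item of each entry is the column name.
    return [col[0] for col in cur.description]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, ts TEXT)")
names = column_names_via_limit_zero(
    con, "SELECT user_id, count(*) AS n FROM events GROUP BY user_id"
)
```

Unlike the view-based approach, this creates no temporary objects, but it relies on the engine short-circuiting the wrapped query.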

PR incoming for polars and mysql.
