bug: `.sql()` sometimes execute the query in order to get the schema #9162

chloeh13q · 2024-05-08T22:07:32Z

What happened?

The .sql() method calls _get_schema_using_query() underneath the hood, which uses the query to get the schema if a schema is not passed as an argument. The implementation of _get_schema_using_query() differs across backends, but for the most part, if the backend provides a way to analyze the query, that's what we use. If the backend doesn't, a lot of times we create a new view/table, execute the query, and then drop the view/table so that the method has no side effect. In some backends, however, we're just executing the query. Examples: Flink, PySpark.

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

gforsyth · 2024-05-10T17:50:48Z

It's unclear if any of the backends are actually executing the queries (which would be a bug), or if they're returning a deferred object with schema information for us to extract.

Probably the first step here is to add a unit-test that tries to run a .sql() call that would be expensive to actually execute and then ensure that it returns in a "short" amount of time.

cpcloud · 2024-06-02T13:39:12Z

Investigated this, and here are the results:

backend	cheap?	description
bigquery	✔️	uses `dry_run=True` which doesn't execute the query
clickhouse	✔️	creates a view then gets the view's schema
datafusion	✔️	creates a view then gets the view's schema
druid	✔️	doesn't support `.sql`
duckdb	✔️	calls `DESCRIBE` on the query, which doesn't execute the query
exasol	✔️	creates a view then gets the view's schema
flink	✔️	`sql_query` doesn't execute the query, it constructs a Java object using a private constructor that only sets its inputs as attributes on a class.
impala	✔️	creates a view then gets the view's schema
mssql	✔️	uses a special function `sp_describe_first_result_set` which gets metadata using static analysis
mysql	⛔	EXECUTES
oracle	✔️	creates a view then gets the view's schema
polars	❓	uses `eager=None` which can be `True` at init-time (not sure what that means), so will address this in a PR to explicitly be `False`.
postgres	✔️	creates a view then gets the view's schema
pyspark	✔️	we use `SparkSession.sql` to get type information, and this method doesn't execute the query before returning
risingwave	✔️	creates a view then gets the view's schema
snowflake	✔️	executes a `LIMIT 0` version of the query and returns the `DESCRIBE RESULT` output
sqlite	✔️	creates a view then gets the view's schema
trino	✔️	creates a prepared statement from a query and gets the `DESCRIBE OUTPUT` output; prepared statements do not execute the query

PR incoming for polars and mysql.

chloeh13q added the bug Incorrect behavior inside of ibis label May 8, 2024

gforsyth added pyspark The Apache PySpark backend flink Issues or PRs related to Flink labels May 10, 2024

cpcloud linked a pull request Jun 2, 2024 that will close this issue

fix(polars,mysql): avoid execution of query in _get_schema_using_query #9290

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: `.sql()` sometimes execute the query in order to get the schema #9162

bug: `.sql()` sometimes execute the query in order to get the schema #9162

chloeh13q commented May 8, 2024

gforsyth commented May 10, 2024

cpcloud commented Jun 2, 2024

bug: .sql() sometimes execute the query in order to get the schema #9162

bug: .sql() sometimes execute the query in order to get the schema #9162

Comments

chloeh13q commented May 8, 2024

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

gforsyth commented May 10, 2024

cpcloud commented Jun 2, 2024

bug: `.sql()` sometimes execute the query in order to get the schema #9162

bug: `.sql()` sometimes execute the query in order to get the schema #9162