Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: read_parquet and similar methods silently overwrite tables #9199

Open
1 task done
deepyaman opened this issue May 15, 2024 · 2 comments
Open
1 task done

bug: read_parquet and similar methods silently overwrite tables #9199

deepyaman opened this issue May 15, 2024 · 2 comments
Labels
bug Incorrect behavior inside of ibis io Issues related to input and/or output

Comments

@deepyaman
Copy link
Contributor

What happened?

Here's an example derived from a user issue:

>>> import ibis
>>> t = ibis.examples.penguins.fetch()
>>> t.to_parquet("p1.parquet")
>>> t.rename("2_{name}").to_parquet("p2.parquet")
>>> con = ibis.duckdb.connect()
>>> p1 = con.read_parquet("p1.parquet", table_name="p")
>>> con.tables
Tables
------
- p
>>> p2 = con.read_parquet("p2.parquet", table_name="p")
>>> con.tables  # This won't look good...
Tables
------
- p
>>> p1.join(p2, p1["species"] == p2["2_species"]).execute()  # Ibis compiled the expression, but it's not valid DuckDB, since the table definition got overridden
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/joins.py", line 190, in wrapper
    return method(self._finish(), *args, **kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
    return self._find_backend(use_default=True).execute(
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
    return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t2" does not have a column named "species"
>>> p1.join(p2, p1["2_species"] == p2["2_species"]).execute()  # Ibis won't be happy, because it still thinks p1 is the table that isn't renamed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 775, in __getitem__
    values = self.bind(args)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 227, in bind
    values.extend(bind(self, arg))
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 98, in bind
    yield ops.Field(table, value).to_expr()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/bases.py", line 72, in __call__
    return cls.__create__(*args, **kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/grounds.py", line 120, in __create__
    return super().__create__(**kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/operations/relations.py", line 86, in __init__
    raise IbisTypeError(
ibis.common.exceptions.IbisTypeError: Column '2_species' is not found in table. Existing columns: 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'.
>>> p1["species"].execute()  # Even simple stuff will fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
    return self._find_backend(use_default=True).execute(
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
    return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t0" does not have a column named "species"

Shouldn't read_parquet, etc. just have an overwrite option? If you don't specify that, it should not let you clobber an existing table.

What version of ibis are you using?

9.0.0 (initially reported on 8.0.0)

What backend(s) are you using, if any?

DuckDB, but can be replicated on other backends

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@deepyaman deepyaman added the bug Incorrect behavior inside of ibis label May 15, 2024
@gforsyth
Copy link
Member

This does seem like a potential footgun.

I think with DuckDB, at least, read parquet wouldn't overwrite an existing table, but read_parquet is creating views and we're likely doing a create or replace.

I think an overwrite kwarg is a good idea

@gforsyth
Copy link
Member

Just to spell this out concretely:

There should be an overwrite kwarg to read_parquet (and read_csv, read_json, etc...) that defaults to False.

In terms of creating tables / views, we should default to CREATE VIEW name AS ... and only throw in a CREATE OR REPLACE if overwrite=True

@gforsyth gforsyth added the io Issues related to input and/or output label May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis io Issues related to input and/or output
Projects
Status: backlog
Development

No branches or pull requests

2 participants