Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ddl): add no_views arg to list_tables() #8864

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mfatihaktas
Copy link
Member

Description of changes

First step towards addressing #8382.

This

  • Adds no_views arg to list_tables() for clickhouse, Flink and pyspark backends.

Asking for early feedback. If the change is reasonable, will proceed to make the same change for the remaining backends.

@mfatihaktas mfatihaktas self-assigned this Apr 2, 2024
@mfatihaktas mfatihaktas added feature Features or general enhancements ddl Issues related to creating or altering data definitions labels Apr 2, 2024
@mfatihaktas mfatihaktas force-pushed the feat/ddl/table-and-views branch 2 times, most recently from 10ee2e2 to 012aa23 Compare April 2, 2024 18:36
self,
like: str | None = None,
database: str | None = None,
no_views: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally in this case I prefer the following pattern:

Suggested change
no_views: bool = False,
include_views: bool = True,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, makes sense, I will switch to include_views.

@cpcloud
Copy link
Member

cpcloud commented Apr 2, 2024

Before we go down this path, let's consider whether the general approach is going to work for all the varieties of table like things one may want.

There are tables, views, materialized views, streams, and probably others.

At some point someone will ask to be able to list only table like objects of type X.

I think adding a flag for each of those to list_tables is not a good idea because it will lead to an API with a large number of parameters and a complicated implementation.

Instead, how about we have a list_foo method for each kind of object, and introduce a new method, list_rels (or list_relations) listing the union of all tabular objects.

We could also not implement the latter until someone asks for it.

@mfatihaktas
Copy link
Member Author

Instead, how about we have a list_foo method for each kind of object, and introduce a new method, list_rels (or list_relations) listing the union of all tabular objects.
We could also not implement the latter until someone asks for it.

Thanks @cpcloud , I also think that exposing list_foo() methods that do exactly what their names suggest would be cleaner and easier to maintain. Just to note, this will be a breaking change as list_tables() list views together with the tables.

I will proceed with this approach then.

@gforsyth
Copy link
Member

gforsyth commented Apr 3, 2024

I'm not opposed to explicit list_ methods but I think we need to consider the user with less SQL experience and think about how we want to handle certain scenarios.

The one that immediately comes to mind is

  • user does a read_csv with DuckDB, even passes in a table_name kwarg
  • user calls list_tables and there is nothing there because read_csv creates a view

@mfatihaktas
Copy link
Member Author

The one that immediately comes to mind is
user does a read_csv with DuckDB, even passes in a table_name kwarg
user calls list_tables and there is nothing there because read_csv creates a view

As another example to this, when create_table() is called with an in-memory object, it creates a view rather than a table. Flink requires a source connector for tables and we chose to create a view instead of the alternative where we would first write the obj into a temporary file and then create a table pointing to the file.

@cpcloud
Copy link
Member

cpcloud commented Apr 3, 2024

The one that immediately comes to mind is
user does a read_csv with DuckDB, even passes in a table_name kwarg
user calls list_tables and there is nothing there because read_csv creates a view

As another example to this, when create_table() is called with an in-memory object, it creates a view rather than a table. Flink requires a source connector for tables and we chose to create a view instead of the alternative where we would first write the obj into a temporary file and then create a table pointing to the file.

This isn't true in general, just for the Flink backend. All other backends will create physical persistent storage when create_table is called.

@cpcloud
Copy link
Member

cpcloud commented Apr 3, 2024

I'm not opposed to explicit list_ methods but I think we need to consider the user with less SQL experience and think about how we want to handle certain scenarios.

The one that immediately comes to mind is

  • user does a read_csv with DuckDB, even passes in a table_name kwarg
  • user calls list_tables and there is nothing there because read_csv creates a view

IMO this isn't a huge loss, but would require some documentation in list_tables and in any of the read_foo methods that create views.

Another option would be to materialize the table, but I don't know if we want to do that. Repeated use of large CSVs will be faster, but the initial function call will scale with the CSV file rather than the number of sample lines like it does now.

@jcrist
Copy link
Member

jcrist commented Apr 3, 2024

Another API option would be to have an include_types/exclude_types (or include/exclude or some other name) that filters table-like things based on type.

con.list_tables()  # defaults to listing all table-like things within the current hierarchy

con.list_tables(include=["table"])  # only list tables
con.list_tables(include=["table", "view"])  # only list tables and views

I like this better because:

  • We don't have a proliferation of methods that some backends won't be able to support (not all backends have views, not all backends have materialized views) - if a backend doesn't support a table category it can just ignore that filter.
  • To users writing queries using ibis's expression api, everything tabular is a table - we make no distinction at the expression level between the ways a table might be represented in the backend. That's backend-specific details.
  • We can keep the current behavior of seeing all table-like things that might be referenced, which helps avoid the situation Gil referenced above where a con.read_csv call might create a view that then isn't visible in con.list_tables.

@cpcloud
Copy link
Member

cpcloud commented Apr 3, 2024

I'm not a big fan of APIs that encode flags with dynamic values like strings when there's an API to encode that information in a statically knowable and searchable way like a method name.

Strings-as-flags are also not easily tab-completable, and if you want encode what values are supported (and perhaps show that somewhere) you have to build something for that. Encoding these APIs as methods doesn't have these problems.

We don't have a proliferation of methods that some backends won't be able to support (not all backends have views, not all backends have materialized views) - if a backend doesn't support a table category it can just ignore that filter.

I don't think ignoring the flag is any better, if someone asks for "materialized_views" (which is rather annoying to type) and the backend doesn't support it the backend returns an empty list which is ambiguous about whether the backend supports them but doesn't have any, or just doesn't support them. Alternatively, we now have to encode which backends support what, which has just as much proliferation as adding methods (or a base class impl) that raise (or encoding support as an ABC).

To users writing queries using ibis's expression api, everything tabular is a table - we make no distinction at the expression level between the ways a table might be represented in the backend. That's backend-specific details.

It's true that accessing the list of materialized views (say) is backend-specific, however, that's the goal here: to expose this information to users.

I proposed a solution for this already (list_rels/list_relations), but it is indeed a breaking change.

@jcrist
Copy link
Member

jcrist commented Apr 3, 2024

What I'm getting at is that for most use cases in ibis the user shouldn't care whether something is a view/table/materialized-view/dataframe/..., they're all treated the same by the expression api (and represented by a ir.Table object). In our backend-generic API we call all tabular things tables.

When executing an unbound expression, all that matters is that there is a corresponding table-like-thing with a given name and schema for each unbound table - one backend might use a view for mytable, while another might use a table - for querying purposes there's no reason to differentiate the two.

Given this, I'd argue that we'd want the best named and most intuitive API to return all table-like-things, rather than just database tables. I'm not a huge fan of list_relations/list_rels - we don't have a user-facing Relation object and there's no reason to expect a user to be familiar with that term. Since we call all of these tables elsewhere, using the backend-generic list_tables term here made sense to me, but if you'd like to reserve that for true database tables then how about con.list?

@mfatihaktas
Copy link
Member Author

As a side question:

BasePandasBackend, which is used for pandas and dask backends, has create_view() defined as

def create_view(
        self,
        name: str,
        obj: ir.Table,
        *,
        database: str | None = None,
        overwrite: bool = False,
    ) -> ir.Table:
        return self.create_table(
            name, obj=obj, temp=None, database=database, overwrite=overwrite
        )

Given that the notion of virtual table (view) is non-existent for these backends, should we remove create_view() to make the API explicit as part of this effort/issue?

@cpcloud
Copy link
Member

cpcloud commented Apr 3, 2024

con.list seems fine. Just to be clear con.list would be the method for listing all tabular objects.

Agree that list_rels complicates things with a new noun that maybe unnecessarily unfamiliar.

@mfatihaktas mfatihaktas force-pushed the feat/ddl/table-and-views branch 9 times, most recently from 451a9f7 to 08198f7 Compare April 9, 2024 18:40
@mfatihaktas mfatihaktas marked this pull request as ready for review April 11, 2024 14:47
@mfatihaktas
Copy link
Member Author

Thank you all for the discussion. The first pass is ready for review now:

  • Made list_tables() and list_views() explicit.
    • For some of the backends, I could not find a direct way to distinguish views from the tables. Added TODO comments to decide for those during the reviews.
  • Added list() for listing both tables and views.

@ncclementi
Copy link
Contributor

@mfatihaktas What's the status of this one, is this blocked or does it need a review?
If you need a review would you mind solving the conflicts first and then we can ping someone

@mfatihaktas
Copy link
Member Author

@mfatihaktas What's the status of this one, is this blocked or does it need a review? If you need a review would you mind solving the conflicts first and then we can ping someone

Thanks for checking on this. Regarding the status, I marked this PR as ready-for-review some time ago. I am busy with another project this week, but will try to resolve the conflicts before EOD on Friday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ddl Issues related to creating or altering data definitions feature Features or general enhancements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants