Table IO Managers should capture column schemas with appropriate metadata tag #21954

Open
slopp opened this issue May 17, 2024 · 3 comments

slopp commented May 17, 2024

What's the use case?

Many of the IO managers add dataframe_columns metadata to an asset materialization, e.g. https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-duckdb-pandas/dagster_duckdb_pandas/duckdb_pandas_type_handler.py#L66

In #20424, we standardized on dagster/column_schema as the metadata key name for this type of information, and that metadata is now used in the UI and in asset checks, e.g.:

https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/definitions/asset_check_factories/schema_change_checks.py#L58

We should consider updating the metadata key that the IO managers emit, though this would be a breaking change.

Ideas of implementation

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@sryza sryza removed their assignment May 20, 2024

sryza commented May 20, 2024

Something for us to think about here is that there can be a difference between:

  • The column types for the in-memory dataframe object
  • The column types in the database.

E.g., in memory, a pandas string column will often have dtype "object", while in the database it might have type "varchar".

To be consistent with the rest of the product, I think we'd want the main "Columns" section for the asset to show the column types as they appear in the database.
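To make the gap concrete, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not how any Dagster IO manager works: Python type names stand in for dataframe dtypes, SQLite stands in for the database, and the table and column names are made up.

```python
import sqlite3

# In-memory rows; Python type names stand in for dataframe dtypes here.
rows = [{"name": "ada", "visits": 3}, {"name": "lin", "visits": 7}]
in_memory_types = {col: type(val).__name__ for col, val in rows[0].items()}

# Hypothetical table; SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name VARCHAR, visits BIGINT)")
conn.executemany("INSERT INTO users VALUES (:name, :visits)", rows)

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
db_types = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(users)")}
conn.close()

# The same columns carry different type names on each side.
for col in in_memory_types:
    print(f"{col}: in memory={in_memory_types[col]}, in database={db_types[col]}")
```

Whichever side is recorded under the schema metadata key decides what the "Columns" section shows, which is why the choice matters.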


jamiedemaria commented May 20, 2024

> E.g., in memory, a pandas string column will often have dtype "object", while in the database it might have type "varchar".

good point.

> the main "Columns" section for the asset

this section being the one populated by dagster/column_schema, correct? if so, we should just be able to add that metadata in. i think it might require adding a DB query to get the schema of the table though, depending on how each of the different DBs' connector objects works and what they return on a successful create/update. i don't see issues with adding that though
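A rough sketch of what that follow-up query could look like, assuming SQLite as a stand-in for the real connectors. The function name `write_and_describe` and the plain-dict metadata shape are hypothetical; only the `dagster/column_schema` key comes from this thread.

```python
import sqlite3
from typing import Any

COLUMN_SCHEMA_KEY = "dagster/column_schema"  # key name taken from this issue


def write_and_describe(
    conn: sqlite3.Connection,
    table: str,
    columns: list[tuple[str, str]],
    rows: list[tuple],
) -> dict[str, Any]:
    """Hypothetical handler step: write the table, then ask the DB for its schema."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    col_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

    # The write itself returns no schema, so issue a second query for it.
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    schema = [{"name": row[1], "type": row[2]} for row in info]
    return {COLUMN_SCHEMA_KEY: schema}


conn = sqlite3.connect(":memory:")
metadata = write_and_describe(
    conn,
    "orders",
    [("id", "INTEGER"), ("status", "VARCHAR")],
    [(1, "open"), (2, "shipped")],
)
conn.close()
print(metadata)
```

In a real IO manager the list would presumably be wrapped in Dagster's table schema metadata type rather than a plain list of dicts, and each database would need its own equivalent of the `PRAGMA` query.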


sryza commented May 20, 2024

> this section being the one populated by dagster/column_schema, correct?

exactly
