Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expanding nested star selects gives an error in Bigquery syntax #3484

Closed
Bladieblah opened this issue May 15, 2024 · 11 comments · Fixed by #3531
Closed

Expanding nested star selects gives an error in Bigquery syntax #3484

Bladieblah opened this issue May 15, 2024 · 11 comments · Fixed by #3531
Assignees

Comments

@Bladieblah
Copy link

Before you file an issue

  • Make sure you specify the "read" dialect eg. parse_one(sql, read="spark"): bigquery dialect
  • Make sure you specify the "write" dialect eg. ast.sql(dialect="duckdb"): Not applicable
  • Check if the issue still exists on main: Couldn't find anything related

Fully reproducible code snippet

sql_string = """
SELECT
  foo.bar.*
FROM
  foo
"""

import sqlglot
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.schema import ensure_schema

ast = sqlglot.parse_one(sql_string, dialect="bigquery")
schema = ensure_schema(None, dialect="bigquery")

qualified = qualify_columns(
  ast,
  schema,
  expand_alias_refs=True,
  expand_stars=True,
  infer_schema=None,
)

The qualify function throws an error:
sqlglot.errors.OptimizeError: Unknown table: bar

Official Documentation
Bigquery docs

@VaggelisD VaggelisD self-assigned this May 15, 2024
@z3z1ma
Copy link
Contributor

z3z1ma commented May 17, 2024

Also consider applicability to nested structs

SELECT
    user.id,
    struct(
        user.id,
        user.name,
        user.email,
        user.title,
        user.harness_department_c,
        user.role_level_c,
        user.current_role_start_date_c,
        user.geo_region_c,
        user.geo_subregion_c,
        user.sales_segment_c,
        struct(manager.id, manager.name, manager.email) AS manager,
        struct(se.id, se.name, se.email) AS se,
        struct(tdr.id, tdr.name, tdr.email) AS tdr,
        struct(tds.id, tds.name, tds.email) AS tds
    )::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
    ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
    ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
    ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
    ON user.tds_c = tds.id

select a.emp.manager.* from this.table as a

@georgesittas
Copy link
Collaborator

Interesting, so it completely flattens structs. FYI @VaggelisD, an interesting edge case here is when two struct fields on different levels have the same names. BigQuery will give them names such as lvl1.foo, so if it's the only dialect that supports this struct/star explosion we could try to create matching aliases for the expanded projections.

@Bladieblah
Copy link
Author

There are some quirky edge cases for sure, and they're not all documented properly unfortunately (from a quick look). For the example you mentioned:

select
  a, b, c.* from 
(select
  5 as a,
  6 as b,
  (select as struct 
    7 as a,
    8 as b
  ) as c
)

gives an unnested table with fields
(a, b, a_1, b_1)

whereas selecting as a struct

select as struct 
  a, b, c.* from 
(select
  5 as a,
  6 as b,
  (select as struct 
    7 as a,
    8 as b
  ) as c
)

leads to an unnested table with field names
(a, b, _field_3, _field_4)

@georgesittas
Copy link
Collaborator

ok I see, I'm not sure if it's worth preserving all of that then. Perhaps our usual _col_i naming scheme will suffice?

@Bladieblah
Copy link
Author

I agree, fully replicating the logic won't be necessary as with proper aliasing these edge cases will not occur

@georgesittas
Copy link
Collaborator

Yep 👍. FYI this is a WIP on our end, should have a fix soon.

@Bladieblah
Copy link
Author

Thanks for picking this up so quickly, I really appreciate it!

@z3z1ma
Copy link
Contributor

z3z1ma commented May 22, 2024

Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!

Another consideration that causes a schema warning in sqlmesh disabling optimization:

A table is implicitly a struct when referenced directly in a projection.

select
  users.*,
  activity -- a table is implicitly a struct when referenced directly
from users
left join activity
  using (user_id)

So if activity is:

CREATE TABLE activity (
  id string,
  user_id string,
  event string,
  ts timestamp
);

The referencing it directly, the output type is struct<id string, user_id string, event string, ts timestamp> as you'd guess. Not sure if this is related or not to the same codepath. Otherwise we can break this out.

cc @georgesittas

@georgesittas
Copy link
Collaborator

So I think the implicit unnesting is probably trickier, and my hunch is that Vaggelis' PR won't solve it. Though I'd double check to be sure.

Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!

I think so! Can you share a small example to make sure I understood this correctly?

@z3z1ma
Copy link
Contributor

z3z1ma commented May 23, 2024

-- Given this is your upstream model called `user`
SELECT
    user.id,
    struct(
        user.id,
        user.name,
        user.email,
        user.title,
        user.harness_department_c,
        user.role_level_c,
        user.current_role_start_date_c,
        user.geo_region_c,
        user.geo_subregion_c,
        user.sales_segment_c,
        struct(manager.id, manager.name, manager.email) AS manager,
        struct(se.id, se.name, se.email) AS se,
        struct(tdr.id, tdr.name, tdr.email) AS tdr,
        struct(tds.id, tds.name, tds.email) AS tds
    )::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
    ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
    ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
    ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
    ON user.tds_c = tds.id;
    
    
-- This will cause the optimizer to fail / warn, even if upstream schema was known fully
SELECT emp.* FROM user;

-- Ditto
SELECT emp.manager.* FROM user;

You can repro it with any struct * expansion in a downstream model

@georgesittas
Copy link
Collaborator

Yep, these should work after we fix this issue :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants