Expanding nested star selects gives an error in Bigquery syntax #3484

Bladieblah · 2024-05-15T14:29:25Z

Before you file an issue

Make sure you specify the "read" dialect eg. parse_one(sql, read="spark"): bigquery dialect
Make sure you specify the "write" dialect eg. ast.sql(dialect="duckdb"): Not applicable
Check if the issue still exists on main: Couldn't find anything related

Fully reproducible code snippet

sql_string = """
SELECT
  foo.bar.*
FROM
  foo
"""

import sqlglot
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.schema import ensure_schema

ast = sqlglot.parse_one(sql_string, dialect="bigquery")
schema = ensure_schema(None, dialect="bigquery")

qualified = qualify_columns(
  ast,
  schema,
  expand_alias_refs=True,
  expand_stars=True,
  infer_schema=None,
)

The qualify function throws an error:
sqlglot.errors.OptimizeError: Unknown table: bar

Official Documentation
Bigquery docs

The text was updated successfully, but these errors were encountered:

z3z1ma · 2024-05-17T05:34:34Z

Also consider applicability to nested structs

SELECT
    user.id,
    struct(
        user.id,
        user.name,
        user.email,
        user.title,
        user.harness_department_c,
        user.role_level_c,
        user.current_role_start_date_c,
        user.geo_region_c,
        user.geo_subregion_c,
        user.sales_segment_c,
        struct(manager.id, manager.name, manager.email) AS manager,
        struct(se.id, se.name, se.email) AS se,
        struct(tdr.id, tdr.name, tdr.email) AS tdr,
        struct(tds.id, tds.name, tds.email) AS tds
    )::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
    ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
    ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
    ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
    ON user.tds_c = tds.id

select a.emp.manager.* from this.table as a

georgesittas · 2024-05-17T13:06:30Z

Interesting, so it completely flattens structs. FYI @VaggelisD, an interesting edge case here is when two struct fields on different levels have the same names. BigQuery will give them names such as lvl1.foo, so if it's the only dialect that supports this struct/star explosion we could try to create matching aliases for the expanded projections.

Bladieblah · 2024-05-17T21:49:51Z

There are some quirky edge cases for sure, and they're not all documented properly unfortunately (from a quick look). For the example you mentioned:

select
  a, b, c.* from 
(select
  5 as a,
  6 as b,
  (select as struct 
    7 as a,
    8 as b
  ) as c
)

gives an unnested table with fields
(a, b, a_1, b_1)

whereas selecting as a struct

select as struct 
  a, b, c.* from 
(select
  5 as a,
  6 as b,
  (select as struct 
    7 as a,
    8 as b
  ) as c
)

leads to an unnested table with field names
(a, b, _field_3, _field_4)

georgesittas · 2024-05-17T23:54:12Z

ok I see, I'm not sure if it's worth preserving all of that then. Perhaps our usual _col_i naming scheme will suffice?

Bladieblah · 2024-05-22T12:54:10Z

I agree, fully replicating the logic won't be necessary as with proper aliasing these edge cases will not occur

georgesittas · 2024-05-22T12:58:03Z

Yep 👍. FYI this is a WIP on our end, should have a fix soon.

Bladieblah · 2024-05-22T16:19:44Z

Thanks for picking this up so quickly, I really appreciate it!

z3z1ma · 2024-05-22T16:52:37Z

Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!

Another consideration that causes a schema warning in sqlmesh disabling optimization:

A table is implicitly a struct when referenced directly in a projection.

select
  users.*,
  activity -- a table is implicitly a struct when referenced directly
from users
left join activity
  using (user_id)

So if activity is:

CREATE TABLE activity (
  id string,
  user_id string,
  event string,
  ts timestamp
);

The referencing it directly, the output type is struct<id string, user_id string, event string, ts timestamp> as you'd guess. Not sure if this is related or not to the same codepath. Otherwise we can break this out.

cc @georgesittas

georgesittas · 2024-05-22T17:20:45Z

So I think the implicit unnesting is probably trickier, and my hunch is that Vaggelis' PR won't solve it. Though I'd double check to be sure.

Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!

I think so! Can you share a small example to make sure I understood this correctly?

z3z1ma · 2024-05-23T21:36:54Z

-- Given this is your upstream model called `user`
SELECT
    user.id,
    struct(
        user.id,
        user.name,
        user.email,
        user.title,
        user.harness_department_c,
        user.role_level_c,
        user.current_role_start_date_c,
        user.geo_region_c,
        user.geo_subregion_c,
        user.sales_segment_c,
        struct(manager.id, manager.name, manager.email) AS manager,
        struct(se.id, se.name, se.email) AS se,
        struct(tdr.id, tdr.name, tdr.email) AS tdr,
        struct(tds.id, tds.name, tds.email) AS tds
    )::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
    ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
    ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
    ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
    ON user.tds_c = tds.id;
    
    
-- This will cause the optimizer to fail / warn, even if upstream schema was known fully
SELECT emp.* FROM user;

-- Ditto
SELECT emp.manager.* FROM user;

You can repro it with any struct * expansion in a downstream model

georgesittas · 2024-05-23T21:38:25Z

Yep, these should work after we fix this issue :)

VaggelisD self-assigned this May 15, 2024

VaggelisD mentioned this issue May 22, 2024

feat(optimizer)!: Expand stars on BigQuery's tbl.struct_col.* selections #3531

Merged

georgesittas closed this as completed in #3531 May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expanding nested star selects gives an error in Bigquery syntax #3484

Expanding nested star selects gives an error in Bigquery syntax #3484

Bladieblah commented May 15, 2024

z3z1ma commented May 17, 2024 •

edited

georgesittas commented May 17, 2024

Bladieblah commented May 17, 2024

georgesittas commented May 17, 2024

Bladieblah commented May 22, 2024

georgesittas commented May 22, 2024

Bladieblah commented May 22, 2024

z3z1ma commented May 22, 2024

georgesittas commented May 22, 2024

z3z1ma commented May 23, 2024

georgesittas commented May 23, 2024

Expanding nested star selects gives an error in Bigquery syntax #3484

Expanding nested star selects gives an error in Bigquery syntax #3484

Comments

Bladieblah commented May 15, 2024

z3z1ma commented May 17, 2024 • edited

georgesittas commented May 17, 2024

Bladieblah commented May 17, 2024

georgesittas commented May 17, 2024

Bladieblah commented May 22, 2024

georgesittas commented May 22, 2024

Bladieblah commented May 22, 2024

z3z1ma commented May 22, 2024

georgesittas commented May 22, 2024

z3z1ma commented May 23, 2024

georgesittas commented May 23, 2024

z3z1ma commented May 17, 2024 •

edited