Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After using projection, the same query statement has different values ​​and cannot be hit. #63928

Open
liubaochuan opened this issue May 16, 2024 · 8 comments
Labels
question Question?

Comments

@liubaochuan
Copy link

liubaochuan commented May 16, 2024

Make sure to check documentation https://clickhouse.com/docs/en/ first. If the question is concise and probably has a short answer, asking it in community Slack is probably the fastest way to find the answer. For more complicated questions, consider asking them on StackOverflow with "clickhouse" tag https://stackoverflow.com/questions/tagged/clickhouse

If you still prefer GitHub issues, remove all this text and ask your question here.

clickhouse version: 23.11.1.2002

A:
sql explain:
explain PLAN select SEQ_ID from ds2.CW_SNAP_FEATURE_Face where SEQ_ID > 10000 order by SEQ_ID asc limit 1;
explain result:
Expression (Projection)
Limit (preliminary LIMIT (without OFFSET))
Sorting (Sorting for ORDER BY)
Expression (Before ORDER BY)
Filter (WHERE)
ReadFromMergeTree (ds2.CW_SNAP_FEATURE_Face)

B:
sql explain:
explain PLAN select SEQ_ID from ds2.CW_SNAP_FEATURE_Face where SEQ_ID > 1000000 order by SEQ_ID asc limit 1;
explain result:
Expression (Projection)
Limit (preliminary LIMIT (without OFFSET))
Sorting (Sorting for ORDER BY)
Filter
ReadFromMergeTree (prj_seq_id_group_id)

Projection:
ALTER TABLE ds2.CW_SNAP_FEATURE_Face
ADD PROJECTION prj_seq_id_group_id
(
SELECT *
ORDER BY SEQ_ID
);

The different values ​​of SEQ_ID cause A to fail to hit the projection。

@liubaochuan liubaochuan added the question Question? label May 16, 2024
@konsta-danyliuk
Copy link

Hi @liubaochuan,
Would you mind sharing the DDL for the ds2.CW_SNAP_FEATURE_Face table?

@liubaochuan
Copy link
Author

liubaochuan commented May 16, 2024

@konsta-danyliuk

CREATE TABLE ds2.CW_SNAP_FEATURE_Face
(

`SEQ_ID` Int64 DEFAULT toUnixTimestamp64Nano(now64()),

`FEATURE_ID` Int64,

`GROUP_ID_INT32` Int32,

`FEATURE` String,

`SHORT_FEATURE` String,

`IMAGE_URL` String,

`QUALITY_SCORE` String,

`QUALITY_SCORE_INT64` Int64,

`CAPTURE_TIME` Int64,

`CREATE_TIME` Int64,

`ASSOCIATED_OBJECT_ID` Int64,

`ASSOCIATED_OBJECT_IDS` String,

`LABEL` Int64,

`TRACKID` Int64,

`TRACK_BTIME` Int64,

`TRACK_ETIME` Int64,

`IS_SPARSE` Int8 DEFAULT '0',

`EXT_DATA` String,

`ATTRIBUTE` String,

INDEX CREATETIME_INDEX CREATE_TIME TYPE minmax GRANULARITY 8192,

INDEX LABEL_INDEX LABEL TYPE minmax GRANULARITY 8192,

INDEX SEQ_ID_INDEX1 SEQ_ID TYPE minmax GRANULARITY 1,

PROJECTION prj_seq_id_group_id
(
    SELECT *
    ORDER BY SEQ_ID
)

)
ENGINE = MergeTree
PRIMARY KEY (GROUP_ID_INT32,
SEQ_ID)
ORDER BY (GROUP_ID_INT32,
SEQ_ID)
SETTINGS index_granularity = 8192;

@konsta-danyliuk
Copy link

Hi @liubaochuan,
In general, the SQL engine uses its logic to decide which execution path to take in each case, depending on the query conditions and data. I'm also thinking whether the type of the SEQ_ID column Int64 DEFAULT toUnixTimestamp64Nano(now64()) has an impact here.

  • Do you see a performance difference between queries A and B?
  • I'm curious what would be the results of the EXPLAIN PIPELINE and EXPLAIN indexes=1 statements for both queries.

@liubaochuan
Copy link
Author

@konsta-danyliuk

EXPLAIN PIPELINE Result:
A:
(Expression)
ExpressionTransform
(Limit)
Limit
(Sorting)
MergingSortedTransform 72 → 1
MergeSortingTransform × 72
LimitsCheckingTransform × 72
PartialSortingTransform × 72
(Expression)
ExpressionTransform × 72
(Filter)
FilterTransform × 72
(ReadFromMergeTree)
MergeTreeSelect(pool: ReadPool, algorithm: Thread) × 72 0 → 1

B:
(Expression)
ExpressionTransform
(Limit)
Limit
(Sorting)
MergingSortedTransform 85 → 1
(Filter)
FilterTransform × 85
(ReadFromMergeTree)
MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder) × 85 0 → 1

B query is 20 times faster than A query。

@konsta-danyliuk
Copy link

Wow! x20 sounds interesting.
It seems you only ran EXPLAIN PIPELINE, would you share EXPLAIN indexes=1results as well? I'm curious what is the difference in the data these queries read.

@liubaochuan
Copy link
Author

@konsta-danyliuk

EXPLAIN indexes=1 Result:
A:
Expression (Projection)
Limit (preliminary LIMIT (without OFFSET))
Sorting (Sorting for ORDER BY)
Expression (Before ORDER BY)
Filter (WHERE)
ReadFromMergeTree (ds2.CW_SNAP_FEATURE_Face)
Indexes:
PrimaryKey
Keys:
SEQ_ID
Condition: (SEQ_ID in [10001, +Inf))
Parts: 14/14
Granules: 82901/82901
Skip
Name: SEQ_ID_INDEX1
Description: minmax GRANULARITY 1
Parts: 14/14
Granules: 82901/82901

B:
Expression (Projection)
Limit (preliminary LIMIT (without OFFSET))
Sorting (Sorting for ORDER BY)
Filter
ReadFromMergeTree (prj_seq_id_group_id)
Indexes:
PrimaryKey
Keys:
SEQ_ID
Condition: (SEQ_ID in [1000001, +Inf))
Parts: 14/14
Granules: 82782/82901

@liubaochuan
Copy link
Author

@konsta-danyliuk
Is there any solution to this problem?

@konsta-danyliuk
Copy link

Hi @liubaochuan.

  • What does your data look like? Are there many records where SEQ_ID is less than 10000?
  • Could you also let me know how many records return both queries? I would be interested in the following part.
X rows in set. Elapsed: XXXX sec. Processed XXXXX thousand rows, XXX MB
Peak memory usage: XXXXX MiB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Question?
Projects
None yet
Development

No branches or pull requests

2 participants