
[Bug]: "Unique rows (HashSet)" has a bug and drops records #3908

Closed
gertwieland opened this issue May 4, 2024 · 5 comments
Labels: bug, Hop Gui, P1 Critical Issue

gertwieland commented May 4, 2024
Apache Hop version?

2.8

Java version?

openjdk version "11.0.21" 2023-10-17

Operating system

Windows

What happened?

"Unique rows (HashSet)" seems to drop records even if they only appear once.
Steps to reproduce:

1. Generate 60k records, then add a sequence and one column with random fake data.
2. Calculate a SHA-256 checksum over the row. Since the input includes the sequence number from 1 to 60k, the checksums must all be unique.
3. Still, "Unique rows (HashSet)" considers one row a duplicate and returns only 59,999 records.
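The uniqueness claim in step 2 can be sanity-checked with a small sketch (the field name and fake-data generator here are placeholders, not the attached pipeline): distinct inputs give distinct SHA-256 digests, so any dropped row must come from the dedup transform itself, not from actual duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Set<String> digests = new HashSet<>();
        int n = 60_000;
        for (int i = 1; i <= n; i++) {
            // The sequence number makes every input string unique,
            // so every SHA-256 digest is unique as well.
            String row = "fake-data-" + i;
            byte[] d = md.digest(row.getBytes(StandardCharsets.UTF_8));
            digests.add(Arrays.toString(d));
        }
        // All 60,000 digests are distinct.
        System.out.println("unique digests: " + digests.size());
    }
}
```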

Test pipeline attached
Unique_Hash_Faulty.zip

[screenshot of the test pipeline]

Issue Priority

Priority: 3

Issue Component

Component: Hop Gui


DAJGIT commented May 5, 2024

I could reproduce this case after several runs.
While trying to catch the duplicate record, I found this option: "Compare using stored row values".
[screenshot of the transform option]
It seems this option solves the case.
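To illustrate the difference between the two modes (a minimal sketch with plain Java collections, not Hop's actual transform code): the strings "Aa" and "BB" are a well-known `String.hashCode()` collision, so a dedup that stores only hashes wrongly drops one of the two distinct rows, while comparing stored row values keeps both.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupModes {
    public static void main(String[] args) {
        // "Aa" and "BB" are a classic String.hashCode() collision (both 2112),
        // so these two distinct one-field rows share a deepHashCode.
        Object[] row1 = {"Aa"};
        Object[] row2 = {"BB"};

        // Hash-only dedup: only the int hash is stored, so the second
        // (distinct!) row is wrongly rejected as a duplicate.
        Set<Integer> hashes = new HashSet<>();
        hashes.add(Arrays.deepHashCode(row1));
        boolean hashKeepsRow2 = hashes.add(Arrays.deepHashCode(row2));
        System.out.println("hash-only keeps row2: " + hashKeepsRow2);    // false

        // Value comparison: row contents are stored, so equals() resolves
        // the hash collision and both rows survive.
        Set<List<Object>> values = new HashSet<>();
        values.add(Arrays.asList(row1));
        boolean valueKeepsRow2 = values.add(Arrays.asList(row2));
        System.out.println("value-based keeps row2: " + valueKeepsRow2); // true
    }
}
```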


DAJGIT commented May 6, 2024

Kept digging for a reproduction path, and here it is:
[screenshot of the reproduction pipeline]

Unique_Hash_Faulty_Sample.zip

@hansva hansva added P1 Critical Issue and removed P3 Nice to have labels May 6, 2024

hansva commented May 6, 2024

.take-issue

@github-actions github-actions bot added this to the 2.9 milestone May 6, 2024
@hansva hansva modified the milestones: 2.9, 2.10 May 20, 2024

hansva commented Jun 5, 2024

I have taken a look, and it is indeed a hash collision. We use Arrays.deepHashCode to calculate a hash per row, and since that is a 32-bit value, collisions become likely after only about 75K rows. I'll turn on "Compare using stored row values" as the default option.
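For intuition (a back-of-the-envelope estimate, not Hop code): `Arrays.deepHashCode` returns a 32-bit `int`, and the standard birthday bound puts the probability of at least one collision among n rows at roughly `1 - exp(-n(n-1) / (2 * 2^32))`, which is close to a coin flip at 75K rows.

```java
public class BirthdayBound {
    public static void main(String[] args) {
        // Birthday-bound estimate: probability of at least one collision
        // among n uniformly distributed 32-bit hashes is approximately
        //   p = 1 - exp(-n * (n - 1) / (2 * 2^32))
        long space = 1L << 32;
        long n = 75_000;
        double p = 1.0 - Math.exp(-(double) n * (n - 1) / (2.0 * space));
        // Comes out around 0.48, i.e. roughly a coin flip at 75K rows.
        System.out.printf("p(collision at %d rows) ~ %.2f%n", n, p);
    }
}
```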

hansva added a commit to hansva/hop that referenced this issue Jun 5, 2024

hansva commented Jun 5, 2024

I made that option the default and added a warning to the docs about possible collisions.

@hansva hansva closed this as completed in 3cb682a Jun 6, 2024
hansva added a commit that referenced this issue Jun 6, 2024
Use compare using values by default in Unique rows hashset #3908