
[Bug]: "Unique rows (HashSet)" has a bug and drops records #3908

Closed
gertwieland opened this issue May 4, 2024 · 5 comments
Labels: bug, Hop Gui, P1 Critical Issue

gertwieland commented May 4, 2024
Apache Hop version?

2.8

Java version?

openjdk version "11.0.21" 2023-10-17

Operating system

Windows

What happened?

"Unique rows (HashSet)" seems to drop records even if they only appear once.
Steps to reproduce:

1. Generate 60k records, then add a sequence and one column with random fake data.
2. Calculate a SHA-256 checksum over the row. Since the input includes the sequence number from 1 to 60k, the checksums must all be unique.
3. Still, "Unique rows (HashSet)" considers one row a duplicate and returns only 59,999 records.
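The uniqueness claim in step 2 can be sanity-checked with a small sketch (the field name and fake-data generator here are placeholders, not the attached pipeline): distinct inputs give distinct SHA-256 digests, so any dropped row must come from the dedup transform itself, not from actual duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Set<String> digests = new HashSet<>();
        int n = 60_000;
        for (int i = 1; i <= n; i++) {
            // The sequence number makes every input string unique,
            // so every SHA-256 digest is unique as well.
            String row = "fake-data-" + i;
            byte[] d = md.digest(row.getBytes(StandardCharsets.UTF_8));
            digests.add(Arrays.toString(d));
        }
        // All 60,000 digests are distinct.
        System.out.println("unique digests: " + digests.size());
    }
}
```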

Test pipeline attached
Unique_Hash_Faulty.zip

[screenshot of the test pipeline]

Issue Priority

Priority: 3

Issue Component

Component: Hop Gui


DAJGIT commented May 5, 2024

I could reproduce this case after several runs.
While trying to catch the duplicate record, I found this option: "Compare using stored row values".
[screenshot of the transform option]
It seems this option solves the case.
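To illustrate the difference between the two modes (a minimal sketch with plain Java collections, not Hop's actual transform code): the strings "Aa" and "BB" are a well-known `String.hashCode()` collision, so a dedup that stores only hashes wrongly drops one of the two distinct rows, while comparing stored row values keeps both.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupModes {
    public static void main(String[] args) {
        // "Aa" and "BB" are a classic String.hashCode() collision (both 2112),
        // so these two distinct one-field rows share a deepHashCode.
        Object[] row1 = {"Aa"};
        Object[] row2 = {"BB"};

        // Hash-only dedup: only the int hash is stored, so the second
        // (distinct!) row is wrongly rejected as a duplicate.
        Set<Integer> hashes = new HashSet<>();
        hashes.add(Arrays.deepHashCode(row1));
        boolean hashKeepsRow2 = hashes.add(Arrays.deepHashCode(row2));
        System.out.println("hash-only keeps row2: " + hashKeepsRow2);    // false

        // Value comparison: row contents are stored, so equals() resolves
        // the hash collision and both rows survive.
        Set<List<Object>> values = new HashSet<>();
        values.add(Arrays.asList(row1));
        boolean valueKeepsRow2 = values.add(Arrays.asList(row2));
        System.out.println("value-based keeps row2: " + valueKeepsRow2); // true
    }
}
```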


DAJGIT commented May 6, 2024

Kept digging for a reproduction path, and here it is:
[screenshot of the reproduction pipeline]

Unique_Hash_Faulty_Sample.zip

@hansva hansva added P1 Critical Issue and removed P3 Nice to have labels May 6, 2024

hansva commented May 6, 2024

.take-issue

@github-actions github-actions bot added this to the 2.9 milestone May 6, 2024
@hansva hansva modified the milestones: 2.9, 2.10 May 20, 2024

hansva commented Jun 5, 2024

I have taken a look, and it is indeed a hash collision. We use Arrays.deepHashCode to calculate a hash per row, and since that is a 32-bit value, collisions become likely after only about 75K rows. I'll turn on "Compare using stored row values" as the default option.
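For intuition (a back-of-the-envelope estimate, not Hop code): `Arrays.deepHashCode` returns a 32-bit `int`, and the standard birthday bound puts the probability of at least one collision among n rows at roughly `1 - exp(-n(n-1) / (2 * 2^32))`, which is close to a coin flip at 75K rows.

```java
public class BirthdayBound {
    public static void main(String[] args) {
        // Birthday-bound estimate: probability of at least one collision
        // among n uniformly distributed 32-bit hashes is approximately
        //   p = 1 - exp(-n * (n - 1) / (2 * 2^32))
        long space = 1L << 32;
        long n = 75_000;
        double p = 1.0 - Math.exp(-(double) n * (n - 1) / (2.0 * space));
        // Comes out around 0.48, i.e. roughly a coin flip at 75K rows.
        System.out.printf("p(collision at %d rows) ~ %.2f%n", n, p);
    }
}
```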

hansva added a commit to hansva/hop that referenced this issue Jun 5, 2024

hansva commented Jun 5, 2024

I made that option the default and added a warning to the docs about possible collisions.

@hansva hansva closed this as completed in 3cb682a Jun 6, 2024
hansva added a commit that referenced this issue Jun 6, 2024
Use compare using values by default in Unique rows hashset #3908