Data Import BUG #838

Open
ralphkretzschmar opened this issue Mar 6, 2024 · 9 comments

Comments

@ralphkretzschmar

Dear Team,
When I try to import test data, the import flags accounts as duplicates even though they are not actually duplicates.

Example data (CSV content):
Organisation names:
Müller
Muller
Möller
Moller
When I try to import these organisation names as accounts, the import identifies "Müller" as "Muller", which produces false duplicates.

So I can't import such data because of the false duplicates.

Best regards
Ralph

@ralphkretzschmar
Author

After digging deeper, I realized that the cause was the configuration of the MySQL DB.
Sorry for the false report.

Best regards
Ralph

@ralphkretzschmar
Author

Unfortunately, it's not only the DB. I thought I would have to convert the DB to another collation such as "utf8mb4_bin", but that has negative side effects.
I guess the comparison of account names has to be done at the code level?

@urban-thinking
Contributor

Hiya Ralph.

As we state in https://blog.crm-now.de/doc/berliCRM/installation/Installation_berlicrm.html, the database collation must be utf8_unicode_ci.

The second part of your question is not clear enough to answer. Can you try to reword it?

Regards
Emilio

@urban-thinking
Contributor

Ahh ... I talked to a colleague who understood your question. Let me give an AI answer ;-)

The utf8_unicode_ci collation in MySQL is a case-insensitive collation that supports the UTF-8 character set. It treats accented characters as equivalent to their non-accented counterparts. This behavior is by design and is intended to facilitate searches and comparisons where differences in accents or case should be ignored.

In the case of "müller" and "muller", the utf8_unicode_ci collation treats them as equivalent because it ignores the difference in the accent on the letter 'u'. This can be beneficial in many situations, such as when searching for names or words where accents might be inconsistently used or omitted.
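To make the effect concrete, here is a minimal Python sketch that approximates an accent- and case-insensitive comparison by stripping combining marks and case-folding. This is only an illustration of the idea, not MySQL's actual collation algorithm, and the helper name `fold_accents_and_case` is made up for this example.

```python
import unicodedata


def fold_accents_and_case(text: str) -> str:
    """Build a rough accent- and case-insensitive key.

    Assumption: this only approximates what an accent-insensitive
    collation does; it is NOT how MySQL implements utf8_unicode_ci.
    """
    # Split characters such as "ü" into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


names = ["Müller", "Muller", "Möller", "Moller"]

# Group the names from the report by their folded key.
groups: dict[str, list[str]] = {}
for name in names:
    groups.setdefault(fold_accents_and_case(name), []).append(name)

for key, members in groups.items():
    print(key, members)
```

Under this approximation, the four names from the report fall into two groups, which mirrors the false duplicates seen during import.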

If you want accent sensitivity in your searches, you would need to use a different collation that supports that, such as utf8mb4_bin, which is case-sensitive and accent-sensitive. However, it's worth noting that using accent-insensitive collations like utf8mb3_unicode_ci or utf8mb4_unicode_ci is often preferred for applications where users might input data inconsistently.
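As a rough illustration of what a binary comparison means, the Python sketch below compares the UTF-8 bytes of two strings. It is an analogy for a `_bin` collation rather than MySQL code, and `binary_equal` is a hypothetical helper name.

```python
def binary_equal(a: str, b: str) -> bool:
    """Compare two strings by their UTF-8 bytes.

    Assumption: this mimics the spirit of a binary collation
    (byte-for-byte equality); it does not call MySQL.
    """
    return a.encode("utf-8") == b.encode("utf-8")


# Accent-sensitive: "ü" and "u" encode to different bytes.
print(binary_equal("Müller", "Muller"))

# Case-sensitive: "M" and "m" encode to different bytes.
print(binary_equal("Müller", "müller"))

# Side effect: the same visible name in decomposed form
# ("u" + combining diaeresis) is also a different byte sequence.
print(binary_equal("Müller", "Mu\u0308ller"))
```

The last comparison hints at one kind of side effect of byte-level matching: inputs that look identical to a user can still differ at the byte level.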

Regards Emilio

@ralphkretzschmar
Author

Hi Emilio,

thank you for your really fast update :) (it makes sense to me)
I think going with utf8mb4_unicode_ci is fine because of the search functions etc.

Do you know where in the code I should look to implement a more granular duplicate check in the data import function?

What I am trying to implement is a check whether the account name I am importing already exists (a 100% identical match).
Then I could import accounts like "Muller GmbH" and "Müller GmbH", because they would be treated as different accounts, and still keep the other benefits for inconsistent data input.
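A minimal sketch of such an exact-match check, written in Python purely for illustration (berliCRM's import code is not shown in this thread, and every name below is hypothetical). It assumes that "100% identical" means identical after trimming surrounding whitespace and normalizing to Unicode NFC; both of those steps are assumptions, not something the thread specifies.

```python
import unicodedata


def exact_key(name: str) -> str:
    """Key for exact matching: keeps case and accents.

    Assumption: trimming whitespace and NFC normalization are
    acceptable; drop them for a strict byte-for-byte check.
    """
    return unicodedata.normalize("NFC", name.strip())


def split_import(existing: list[str], incoming: list[str]):
    """Separate incoming names into new accounts and exact duplicates."""
    seen = {exact_key(name) for name in existing}
    to_import, duplicates = [], []
    for name in incoming:
        key = exact_key(name)
        if key in seen:
            duplicates.append(name)
        else:
            # Remember the name so repeats within the file are caught too.
            seen.add(key)
            to_import.append(name)
    return to_import, duplicates


existing_accounts = ["Muller GmbH"]
csv_names = ["Müller GmbH", "Muller GmbH", "Möller GmbH", "Müller GmbH"]

to_import, duplicates = split_import(existing_accounts, csv_names)
print(to_import)
print(duplicates)
```

With this check, "Müller GmbH" is imported alongside the existing "Muller GmbH", while exact repeats are still rejected.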

I could share my code afterwards; it could be interesting for German admin users.

Best regards
Ralph

@Archibald111
Contributor

utf8mb4_unicode_ci is not OK; that is the source of your umlaut problem.

@ralphkretzschmar
Author

Hi Frank,

And which one should I use?

Best regards,
Ralph

@Archibald111
Contributor

As Emilio wrote: utf8_unicode_ci.

@AlexKay85
Contributor

Hi Ralph,

'utf8_unicode_ci' will not help you with this issue; it treats umlauts the same way 'utf8mb4_unicode_ci' does.

What you'd need is either a binary collation like 'utf8_bin' or a typecast to binary for every comparison.
We do not support binary collations; they were not tested at all and probably wouldn't work very well.
Unfortunately, it's not easy to fix this at the code level either. There are too many places where it would have to be done, and it also opens another can of worms.
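To illustrate the "typecast to binary for every comparison" idea without a MySQL server, here is a small Python sketch using SQLite as a stand-in. The `accent_ci` collation and the `accounts` table are made up for this example, and the `CAST(... AS BLOB)` comparison is only an analogy for MySQL's `BINARY` operator or `CAST(... AS BINARY)`. It is not berliCRM code.

```python
import sqlite3
import unicodedata


def fold(text: str) -> str:
    """Rough accent- and case-insensitive key (an approximation only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()


def accent_ci(a: str, b: str) -> int:
    """Collation callback: returns -1, 0, or 1 based on the folded keys."""
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("accent_ci", accent_ci)
# Hypothetical table whose column uses the accent-insensitive collation.
conn.execute("CREATE TABLE accounts (accountname TEXT COLLATE accent_ci)")
conn.execute("INSERT INTO accounts VALUES (?)", ("Müller",))


def count(sql: str, value: str) -> int:
    return conn.execute(sql, (value,)).fetchone()[0]


# Default comparison: the column's collation decides equality.
COLLATED = "SELECT COUNT(*) FROM accounts WHERE accountname = ?"
# Cast both sides to raw bytes so the collation no longer applies.
BINARY = (
    "SELECT COUNT(*) FROM accounts "
    "WHERE CAST(accountname AS BLOB) = CAST(? AS BLOB)"
)

loose_match = count(COLLATED, "Muller")
exact_miss = count(BINARY, "Muller")
exact_hit = count(BINARY, "Müller")
print(loose_match, exact_miss, exact_hit)
```

The sketch also shows why this is laborious in a real application: every query that compares names would need the cast applied individually.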

Best Regards,
Alex


4 participants