Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlite3.OperationalError: Could not decode to UTF-8 column 'data' with text #44

Open
Stratos-Crimson opened this issue May 18, 2023 · 51 comments
Assignees
Labels
bug Something isn't working

Comments

@Stratos-Crimson
Copy link

Hello,
I suspect that this is due to some of the database not being in UTF-8.
I got around this on another program that also exports Whatsapp messages by adding the following code:
db.text_factory = lambda b: b.decode(errors = 'ignore')
after this code which was already there,
db = sqlite3.connect(file_path)
But when I tried to do this in this program,
I couldn't find It in the extract python files.
I instead found,
this....

if os.path.isfile(contact_db):
    with sqlite3.connect(contact_db) as db:
        contacts(db, data)
if os.path.isfile(msg_db):
    with sqlite3.connect(msg_db) as db:
        messages(db, data)
        media(db, data, media_folder)
        vcard(db, data)
    create_html(data, output_folder)

Which is beyond my understanding.

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

You can place the line as follow:

if os.path.isfile(contact_db):
    with sqlite3.connect(contact_db) as db:
        db.text_factory = lambda b: b.decode(errors = 'ignore') # here
        contacts(db, data)
if os.path.isfile(msg_db):
    with sqlite3.connect(msg_db) as db:
        db.text_factory = lambda b: b.decode(errors = 'ignore') # and here
        messages(db, data)
        media(db, data, media_folder)
        vcard(db, data)
    create_html(data, output_folder)

Also, could you provide me a stack trace related to this error and the version of the exporter you are using?

@KnugiHK KnugiHK added the bug Something isn't working label May 18, 2023
@Stratos-Crimson
Copy link
Author

Oh thanks, sorry for not providing It before.

Traceback (most recent call last):)
  File "/home/user/.local/bin/wtsexporter", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.9/site-packages/Whatsapp_Chat_Exporter/__main__.py", line 240 in main
    messages(db, data)
  File "/home/user/.local/lib/python3.9/site-packages/Whatsapp_Chat_Exporter/extract.py", line 380 in messages
  content = c.fetchone()
sqlite3.OperationalError: Could not decode to UTF-8 column 'data' with text '{placeholder}'

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

Oh thanks, sorry for not providing It before.

Traceback (most recent call last):)
  File "/home/user/.local/bin/wtsexporter", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.9/site-packages/Whatsapp_Chat_Exporter/__main__.py", line 240 in main
    messages(db, data)
  File "/home/user/.local/lib/python3.9/site-packages/Whatsapp_Chat_Exporter/extract.py", line 380 in messages
  content = c.fetchone()
sqlite3.OperationalError: Could not decode to UTF-8 column 'data' with text '{placeholder}'

No problem.

Do you know what's the language and what encoding the message is used?

@Stratos-Crimson
Copy link
Author

Stratos-Crimson commented May 18, 2023

I am using the latest one.
When I tried to use It before your update, It was giving me that message table error but now that you have updated It, that's gone.
This is literally the terminal output of the text:
Keep it up�����������������������������������
I think that there are some emojis in this text. Like thumbs up, y'know? That would make sense but since I don't remember this message, I can't say for sure.

@Stratos-Crimson
Copy link
Author

I have just run a grep command on all of the messages and after seeing the messages I think that this is most likely this emoji "💪" times 4.

@Stratos-Crimson
Copy link
Author

Stratos-Crimson commented May 18, 2023

Adding this line didn't help.
But I only added It to the extract command and since the error is specifying the data column, this could be because of that.
Any thoughts?

You can place the line as follow:

if os.path.isfile(contact_db):
    with sqlite3.connect(contact_db) as db:
        db.text_factory = lambda b: b.decode(errors = 'ignore') # here
        contacts(db, data)
if os.path.isfile(msg_db):
    with sqlite3.connect(msg_db) as db:
        db.text_factory = lambda b: b.decode(errors = 'ignore') # and here
        messages(db, data)
        media(db, data, media_folder)
        vcard(db, data)
    create_html(data, output_folder)

Also, could you provide me a stack trace related to this error and the version of the exporter you are using?

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

I tried to reproduce it but failed, the message shown in the HTML correctly.
image

You can use a SQLite browser to extract that message out and convert it into HEX. Doing so may enable me to reproduce the problem.

@Stratos-Crimson
Copy link
Author

I have tried to search for It using sqlitebrowser in the message table and data column but I couldn't find It.
Can I find out which row It's erroring out on?

@Stratos-Crimson
Copy link
Author

Wait, I think that I have found It!
I just searched for the question mark shard and found 3 messages.
There is one which just has BLOB written on It.
And when I opened It, It showed me the Keep It up message and a bunch of dots.
It is also saying that the type of data that's currently in the cell is Binary.

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

I have tried to search for It using sqlitebrowser in the message table and data column but I couldn't find It. Can I find out which row It's erroring out on?

Replace

        i += 1 # around line 377 of extract.py?
        if i % 1000 == 0:
            print(f"Gathering messages...({i}/{total_row_number})", end="\r")
        content = c.fetchone()

with

        i += 1
        print(content["_id"])
        content = c.fetchone()

When you run that, the next row the printed _id should be the row that causing the problem.

@Stratos-Crimson
Copy link
Author

This is the HEX and ASCII:
0000 4b 65 65 70 20 69 74 20 75 70 f0 9f 92 90 f0 9f Keep it up......
0010 92 90 f0 9f 92 90 f0 9f 8c b9 f0 9f 8c b9 f2 a3 ................
0020 b0 bd f0 b7 a0 bd f0 b7 a0 bd ed b2 9e .............

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

This is the HEX and ASCII: 0000 4b 65 65 70 20 69 74 20 75 70 f0 9f 92 90 f0 9f Keep it up...... 0010 92 90 f0 9f 92 90 f0 9f 8c b9 f0 9f 8c b9 f2 a3 ................ 0020 b0 bd f0 b7 a0 bd f0 b7 a0 bd ed b2 9e .............

Also, can you figure out the actual message shown in WhatsApp? Btw, just a reminder, do not post it publicly if it is a sensitive message.

@Stratos-Crimson
Copy link
Author

You see, I have lost the Whatsapp encryption key and was only able to get the key file and this was before I deleted all of the Whatsapp backups.
Do you see anything sensitive in that message? I don't know HEX.

@Stratos-Crimson
Copy link
Author

Stratos-Crimson commented May 18, 2023

Hey, I am worried now, what is that HEX? 😰

@Stratos-Crimson
Copy link
Author

Hey so I did this:

echo 4b656570206974207570f09f9290f09f9290f09f9290f09f8cb9f09f8cb9f2a3b0bdf0b7a0bdf0b7a0bdedb29e | xxd -r -p

And got this:

Keep it up💐💐💐🌹🌹򣰽𷠽𷠽

@Stratos-Crimson Stratos-Crimson changed the title sqlite3.OperationalError: Could not decode to UTF-8 column 'data' with text 'there is some text here and shards of question marks' sqlite3.OperationalError: Could not decode to UTF-8 column 'data' with text ' Keep it up�����������������������������������' May 18, 2023
@Stratos-Crimson
Copy link
Author

Is there a way to ignore this error and let It run?

@KnugiHK KnugiHK self-assigned this May 18, 2023
@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

Hey, I am worried now, what is that HEX? cold_sweat

I don't mean to scare you. I have yet to look into the hex before you post the result tbh.

Is there a way to ignore this error and let It run?

The easiest way is use a try except block and skip it but it is not an ideal solution.

I am still trying to reproduce it because the exporter failed after inserting a row with your binary content, but I got a different error than yours.

Traceback (most recent call last):
  ...
  File "...\Whatsapp_Chat_Exporter\extract.py", line 371, in messages
    if "\r\n" in msg:
TypeError: a bytes-like object is required, not 'str'

Does my screenshot below have any difference from yours?
image

@Stratos-Crimson
Copy link
Author

Yes there is!
The first row is missing a 9f at the end.
And the 2nd row is missing an a3 at the end.

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

Oh. They are just being hidden😂. Here you are:
image

@Stratos-Crimson
Copy link
Author

Stratos-Crimson commented May 18, 2023

Hah, there is no difference than.
Did you try changing It from binary mode to text or RTL text?

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

Hah, there is no difference than. Did you try changing It from binary mode to text or RTL text?

They can't be displayed in text or RTL mode. Try this and post the output:

SELECT quote(text_data) from message WHERE _id=<the message id>;

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

If you are using Linux/WSL:

$ sqlite3 msgstore.db
> SELECT quote(text_data) from message WHERE _id=<the message id>;

@Stratos-Crimson
Copy link
Author

'Keep it up💐💐💐🌹🌹򣰽𷠽𷠽���'
Thanks. I got this.

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

'Keep it upbouquetbouquetbouquetroserose򣰽𷠽𷠽���' Thanks. I got this.

Hmmm, you are supposed to get a hex string. Something like this:
image

@Stratos-Crimson
Copy link
Author

I didn't get a hex. What version of sqlite3 command line are you using?
I am on 3.34.1.

@Stratos-Crimson
Copy link
Author

Stratos-Crimson commented May 18, 2023

Why is there no id in your command?

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

I didn't get a hex. What version of sqlite3 command line are you using? I am on 3.34.1.

3.40.1

Why is there no id in your command?

Look closer and you will find that it is just redacted😂. It should be an integer anyway.

@Stratos-Crimson
Copy link
Author

Oh. Why would you redact that?!
I have just tried It with version 3.42.0 and It's still not giving me the hex.

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

Oh. Why would you redact that?! I have just tried It with version 3.42.0 and It's still not giving me the hex.

I guess that's mean we have different data in that cell.

@Stratos-Crimson
Copy link
Author

Why are getting a hex anyway, the cell should contain the message, not the hex, right?

@Stratos-Crimson
Copy link
Author

Can you put the raw data into the cell?

@KnugiHK
Copy link
Owner

KnugiHK commented May 18, 2023

Why are getting a hex anyway, the cell should contain the message, not the hex, right?

Yeah, I am trying to understand what raw data you got and how I can put the raw data into the cell. Hex is just one of the possible solutions, but we proved that we have different data inside the cell already by using hex. quote() function does not necessarily output hex but also just a quoted text.

What about this:
sqlite3 -line msgstore.db 'SELECT quote(text_data) from message WHERE _id=<id>;' | hexdump -C

@Stratos-Crimson
Copy link
Author

Stratos-Crimson commented May 18, 2023

But quote(text_data) = is also going to be here so the hex won't match. Would It?

@Stratos-Crimson
Copy link
Author

And can you please write that python except block for now?

@Stratos-Crimson
Copy link
Author

It will also have the quotes in the hex right? Wouldn't that be modifying the hex?

@KnugiHK KnugiHK changed the title sqlite3.OperationalError: Could not decode to UTF-8 column 'data' with text ' Keep it up�����������������������������������' sqlite3.OperationalError: Could not decode to UTF-8 column 'data' with text May 19, 2023
KnugiHK added a commit that referenced this issue May 19, 2023
@KnugiHK
Copy link
Owner

KnugiHK commented May 19, 2023

Implemented the workaround for the problem in 9ac8839.

@KnugiHK
Copy link
Owner

KnugiHK commented May 19, 2023

It will also have the quotes in the hex right? Wouldn't that be modifying the hex?

It doesn't necessarily be hex.

@Stratos-Crimson
Copy link
Author

Does the workaround skip It or does It try to extract It?

Implemented the workaround for the problem in 9ac8839.

It will also have the quotes in the hex right? Wouldn't that be modifying the hex?

It doesn't necessarily be hex.

Did you put in the hex in the cell or the text?

@KnugiHK
Copy link
Owner

KnugiHK commented May 19, 2023

Does the workaround skip It or does It try to extract It?

The workaround skip the message.

Did you put in the hex in the cell or the text?

I put the hex in the hex area of the cell.

@Stratos-Crimson
Copy link
Author

Could It be that It didn't get translated? Why don't you translate It externally and put It in as text?

Does the workaround skip It or does It try to extract It?

The workaround skip the message.

Did you put in the hex in the cell or the text?

I put the hex in the hex area of the cell.

@Stratos-Crimson
Copy link
Author

This is the output:
00000000 71 75 6f 74 65 28 74 65 78 74 5f 64 61 74 61 29 |quote(text_data)|
00000010 20 3d 20 27 4b 65 65 70 20 69 74 20 75 70 f0 9f | = 'Keep it up..|
00000020 92 90 f0 9f 92 90 f0 9f 92 90 f0 9f 8c b9 f0 9f |................|
00000030 8c b9 f2 a3 b0 bd f0 b7 a0 bd f0 b7 a0 bd ed b2 |................|
00000040 9e 27 0a |.'.|
00000043

@Stratos-Crimson
Copy link
Author

This is the plain hex,
71756f746528746578745f6461746129203d20274b656570206974207570f09f9290f09f9290f09f9290f09f8cb9f09f8cb9f2a3b0bdf0b7a0bdf0b7a0bdedb29e270a

@KnugiHK
Copy link
Owner

KnugiHK commented Jun 1, 2023

If you don't mind you can send me your database. It will the the easiest way to do debugging.

@Stratos-Crimson
Copy link
Author

No can do.

@KnugiHK
Copy link
Owner

KnugiHK commented Jun 7, 2023

You can say Hi to me in the Matrix room and send the database to me through Matrix or send me an email with the link to the database at hello [at] knugi.com. You can find my PGP key at https://keyserver1.pgp.com.

@Stratos-Crimson
Copy link
Author

Isn't there a way to extract a specific row from the database?

@KnugiHK
Copy link
Owner

KnugiHK commented Jun 8, 2023

Can you send me an INSERT statement to that row?

@Stratos-Crimson
Copy link
Author

Pardon my ignorance but how do I do that?

@KnugiHK
Copy link
Owner

KnugiHK commented Jun 11, 2023

Use a supported browser to do it. Select the whole row and copy it as SQL.
image

@Stratos-Crimson
Copy link
Author

Mailed you.

KnugiHK added a commit that referenced this issue Jun 15, 2023
@KnugiHK
Copy link
Owner

KnugiHK commented Jun 25, 2023

The workaround is released in 0.9.5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants