fix issues related to encoding #7791

URenko · 2024-04-22T13:50:14Z

Fix the way to disable encoding by introducing a new type Raw to solve #7456 and #6098.
See #7456 and the changes on the document for details.
The behavior of None was not changed to keep backward compatibility.

And fix vfs cache encoding described in #7760
The problem is when we create and write cache, we use the "OS" encoded ("OS" encoding is hardcoded for different operating systems, not specified by users).
But when we copy it to the remote (backend), we use local.Fs, which accept a name already encoded in "Standard" way.
So we need to send a "Standard" encoded name.

However, the "minimum setup" mentioned in #7760 has not been resolved. See my comment in it for details.
But it's an edge case.

Checklist

I have read the contribution guidelines.
I have added tests for all changes in this PR if appropriate.
I have added documentation for the changes if appropriate.
All commit messages are in house style.
I'm done, this Pull Request is ready for review :-)

ncw · 2024-04-22T16:20:12Z

The behavior of None was not changed to keep backward compatibility.

I think I'd prefer None to do what people expect and not do any encoding. What was your reasoning here?

URenko · 2024-04-23T10:32:03Z

The behavior of None was not changed to keep backward compatibility.

I think I'd prefer None to do what people expect and not do any encoding. What was your reasoning here?

Backward compatibility.
Since encoding has been introduced for a long time. There must be many people already misused None.
If we change the behavior and they upgrade it without reading the changelog, we will break their storage. For example, they will need to resync tons of files (this is what happened to me when encoding was introduced. I don't what it to happen again to others : ).
And If they read the changelog, they will change to Raw and reorganize things.

URenko · 2024-05-05T12:37:15Z

@ncw any progress or further concern? :-)

fix rclone#7824

URenko · 2024-05-09T19:03:56Z

fixed #7824

Specifically, it causes statements like rclone copy . to spontaneously miss if . expands to a path with a Full Width replacement character.

iseahound · 2024-05-20T21:47:29Z

I think you need to fix encoding once and for all. The main problem is that: | → ｜ → ‛｜ → ‛‛｜ → ‛‛‛｜ and so on where the encoding of one character as two characters instantly breaks the reversibility of the encoding. The example would be when a file system contains ‛｜. Was it originally encoded as ｜ or ‛｜? (Note this can be formally stated as every surjection has a right inverse )

Here are some suggestions:

Add the names of the storage systems. For example, it'd be best to be able to do --local-encoding=drive or --drive-encoding=Windows.
Use a better escape character. Why not use the Unicode private use area as a rclone exclusive character, and prefix all escape sequences with that? �‛｜This preserves the current usage of repeating ‛‛‛‛‛‛s. Admittedly, this does kick the can down the line, but anyone using private use characters is probably a super user of some sort and can read the documentation.
Document the replacement logic better. Currently as I understand it, the source and destination remotes both have a "stack of invalid characters". Should the two stacks be equal, no encoding is done. If the stacks are unequal, then encoding/decoding is done.

iseahound · 2024-05-20T22:24:22Z

Also, instead of using RAW and having None as well, maybe adding --destination-encoding=source would be much clearer. If your destination remote is drive for example, having --drive-encoding=source would copy the files as-is since the two stacks of invalid character encodings cancel each other out.

URenko · 2024-05-21T08:24:11Z

Hi @iseahound, thank you for your suggestions.

But first, let me clarify some of your misunderstandings.

The main problem is that: | → ｜ → ‛｜ → ‛‛｜ → ‛‛‛｜ and so on where the encoding of one character as two characters instantly breaks the reversibility of the encoding. The example would be when a file system contains ‛｜. Was it originally encoded as ｜ or ‛｜? (Note this can be formally stated as every surjection has a right inverse )

The encoding of one character as multiple characters is inevitable.
For example, if you've been exposed to some programming languages, you'll know that the \ → \\ → \\\\.
It is not possible to construct a bijection between a single character in an incomplete character set and a single character in a complete character set, only between a single and a single/multiple character.
As for your problem, a filename ‛｜ (I mean if you directly look at the backend) in a backend means ｜.

Why not use the Unicode private use area as a rclone exclusive character, and prefix all escape sequences with that? �‛｜This preserves the current usage of repeating ‛‛‛‛‛‛s. Admittedly, this does kick the can down the line, but anyone using private use characters is probably a super user of some sort and can read the documentation.

Repeating character ‛ is not the source of the problem.
Firstly, ‛(U+201B, not the character you can directly type) is extremely uncommon, and I bet no one's original files will have that character before they touch rclone.
Secondly, repeating ‛ usually occurs because people misuse rclone's encoding system.

For example, if one has a 你好｜世界？ in the local Linux filesystem, he copies it to Onedrive with rclone (he will have 你好‛｜世界‛？ on Onedrive), and then he copies it from Onedrive to a local Windows filesystem with Onedrive (he will have 你好‛｜世界‛？ on the local Windows filesystem), and if he copies it from the local Windows filesystem to Onedrive with rclone, he will have 你好‛‛｜世界‛‛？ on Onedrive.
The correct way is to use rclone from beginning to end, including creating, investigating, copying, and deleting. Then he will always only see 你好｜世界？.

Should the two stacks be equal, no encoding is done. If the stacks are unequal, then encoding/decoding is done.

Seems correct but not exactly right.
There is also an intermediate layer of standard encoding between the source backend and the target backend, so the way to make this short-circuit happen is to specify both the source and target encoding as standard. see #7456 .

URenko · 2024-05-21T08:53:11Z

Yes if we could go back to 2019 I would definitely choose to develop another coding system from scratch, but it has been 5 years.
Remember also that None is the default encoding for many backends. (It led to my nightmare 5 years ago.)

In addition to a couple of incorrect implementations that I fixed, there are two main design issues (but not implementation issues) with the current coding system. They don't pose a real problem if the user can use them properly and fixing them is destructive.

Use full-width characters which are common in Chinese and Japanese, as escape characters (|→｜, ?→？...).
It is not easy for uses to use rclone from beginning to end.
So at the very least, users are likely to create files on their own (e.g. 你好｜世界？).
Then rclone thinks it's encoded, i.e. it means 你好|世界? (which always happens unless encoding is Raw or use the short-circuit trick).
Use encoding by default.
If we do not use encoding by default, we will not run into the problem mentioned above.
If we could redesign the coding system. I'd recommend no default encoding and that the encoding be used as a layer of backend (like Hasher).
And when people run into encoding issues with uploading (which should be rare), they will look at the docs and then realize that there is a backend that can solve the problem.

But for now I'd like to keep backward compatibility as much as possible. And, at least for sure, we need a way to actually disable coding (Raw)

iseahound · 2024-05-21T12:35:22Z

There seems to be problems with orphaned files with escaped characters in the filename:

If source and destination encoding is both windows this works fine,
If source and destination encoding is both None then deletion cannot happen.

test.bat (Save as UTF-8)

@echo off
chcp 65001

mkdir a
mkdir b
echo "hello world" > b/‛｜.txt
rclone --config="" sync a/ b/ --local-encoding=None
pause

In other words, even if the two stacks "cancel each other out" there exists a difference in behavior. Using your standard --local-encoding None,Slash,Ctl,Del,Dot` seems to fix it for some reason...

URenko · 2024-05-21T14:52:16Z

Yes, there is an intermediate layer of "standard" encoding, so the short-circuit trick is to use "standard" for both source and destination.

But I think you have found another bug. 🙃

iseahound · 2024-05-21T16:01:28Z

I have a simple setup of 2 cloud services and one local drive 🤣

URenko force-pushed the master branch 3 times, most recently from 1863b81 to 9d74a49 Compare April 22, 2024 14:16

nielash mentioned this pull request Apr 23, 2024

bisync: another day, another set of integration test fixes #7795

Open

5 tasks

URenko force-pushed the master branch from 9d74a49 to af1a956 Compare April 23, 2024 10:24

URenko added 3 commits April 23, 2024 11:49

lib/encoder: add EncodeRaw

81baa5f

docs: correct description of encoding None and add Raw.

aaaf6fb

vfs: fix cache encoding with special characters

5468395

URenko force-pushed the master branch from af1a956 to 5468395 Compare April 23, 2024 11:52

URenko mentioned this pull request May 6, 2024

Windows: Destination Encoding characters are improperly escaped #7824

Open

local: fix encoding of root path

aa34627

fix rclone#7824

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix issues related to encoding #7791

fix issues related to encoding #7791

URenko commented Apr 22, 2024 •

edited

ncw commented Apr 22, 2024

URenko commented Apr 23, 2024

URenko commented May 5, 2024

URenko commented May 9, 2024

iseahound commented May 20, 2024

iseahound commented May 20, 2024

URenko commented May 21, 2024

URenko commented May 21, 2024

iseahound commented May 21, 2024 •

edited

URenko commented May 21, 2024

iseahound commented May 21, 2024

fix issues related to encoding #7791

Are you sure you want to change the base?

fix issues related to encoding #7791

Conversation

URenko commented Apr 22, 2024 • edited

Checklist

ncw commented Apr 22, 2024

URenko commented Apr 23, 2024

URenko commented May 5, 2024

URenko commented May 9, 2024

iseahound commented May 20, 2024

iseahound commented May 20, 2024

URenko commented May 21, 2024

URenko commented May 21, 2024

iseahound commented May 21, 2024 • edited

URenko commented May 21, 2024

iseahound commented May 21, 2024

URenko commented Apr 22, 2024 •

edited

iseahound commented May 21, 2024 •

edited