Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issues related to encoding #7791

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

fix issues related to encoding #7791

wants to merge 4 commits into from

Conversation

URenko
Copy link
Contributor

@URenko URenko commented Apr 22, 2024

Fix the way to disable encoding by introducing a new type Raw to solve #7456 and #6098.
See #7456 and the changes on the document for details.
The behavior of None was not changed to keep backward compatibility.

And fix vfs cache encoding described in #7760
The problem is when we create and write cache, we use the "OS" encoded ("OS" encoding is hardcoded for different operating systems, not specified by users).
But when we copy it to the remote (backend), we use local.Fs, which accept a name already encoded in "Standard" way.
So we need to send a "Standard" encoded name.

However, the "minimum setup" mentioned in #7760 has not been resolved. See my comment in it for details.
But it's an edge case.

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

@URenko URenko force-pushed the master branch 3 times, most recently from 1863b81 to 9d74a49 Compare April 22, 2024 14:16
@ncw
Copy link
Member

ncw commented Apr 22, 2024

The behavior of None was not changed to keep backward compatibility.

I think I'd prefer None to do what people expect and not do any encoding. What was your reasoning here?

@URenko
Copy link
Contributor Author

URenko commented Apr 23, 2024

The behavior of None was not changed to keep backward compatibility.

I think I'd prefer None to do what people expect and not do any encoding. What was your reasoning here?

Backward compatibility.
Since encoding has been introduced for a long time. There must be many people already misused None.
If we change the behavior and they upgrade it without reading the changelog, we will break their storage. For example, they will need to resync tons of files (this is what happened to me when encoding was introduced. I don't what it to happen again to others : ).
And If they read the changelog, they will change to Raw and reorganize things.

@URenko
Copy link
Contributor Author

URenko commented May 5, 2024

@ncw any progress or further concern? :-)

@URenko
Copy link
Contributor Author

URenko commented May 9, 2024

fixed #7824

Specifically, it causes statements like rclone copy . to spontaneously miss if . expands to a path with a Full Width replacement character.

@iseahound
Copy link

I think you need to fix encoding once and for all. The main problem is that: |‛|‛‛|‛‛‛| and so on where the encoding of one character as two characters instantly breaks the reversibility of the encoding. The example would be when a file system contains ‛|. Was it originally encoded as or ‛|? (Note this can be formally stated as every surjection has a right inverse )

Here are some suggestions:

  1. Add the names of the storage systems. For example, it'd be best to be able to do --local-encoding=drive or --drive-encoding=Windows.
  2. Use a better escape character. Why not use the Unicode private use area as a rclone exclusive character, and prefix all escape sequences with that? �‛|This preserves the current usage of repeating ‛‛‛‛‛‛s. Admittedly, this does kick the can down the line, but anyone using private use characters is probably a super user of some sort and can read the documentation.
  3. Document the replacement logic better. Currently as I understand it, the source and destination remotes both have a "stack of invalid characters". Should the two stacks be equal, no encoding is done. If the stacks are unequal, then encoding/decoding is done.

@iseahound
Copy link

  1. Also, instead of using RAW and having None as well, maybe adding --destination-encoding=source would be much clearer. If your destination remote is drive for example, having --drive-encoding=source would copy the files as-is since the two stacks of invalid character encodings cancel each other out.

@URenko
Copy link
Contributor Author

URenko commented May 21, 2024

Hi @iseahound, thank you for your suggestions.

But first, let me clarify some of your misunderstandings.

The main problem is that: |‛|‛‛|‛‛‛| and so on where the encoding of one character as two characters instantly breaks the reversibility of the encoding. The example would be when a file system contains ‛|. Was it originally encoded as or ‛|? (Note this can be formally stated as every surjection has a right inverse )

The encoding of one character as multiple characters is inevitable.
For example, if you've been exposed to some programming languages, you'll know that the \\\\\\\.
It is not possible to construct a bijection between a single character in an incomplete character set and a single character in a complete character set, only between a single and a single/multiple character.
As for your problem, a filename ‛| (I mean if you directly look at the backend) in a backend means .

Why not use the Unicode private use area as a rclone exclusive character, and prefix all escape sequences with that? �‛|This preserves the current usage of repeating ‛‛‛‛‛‛s. Admittedly, this does kick the can down the line, but anyone using private use characters is probably a super user of some sort and can read the documentation.

Repeating character is not the source of the problem.
Firstly, (U+201B, not the character you can directly type) is extremely uncommon, and I bet no one's original files will have that character before they touch rclone.
Secondly, repeating usually occurs because people misuse rclone's encoding system.

For example, if one has a 你好|世界? in the local Linux filesystem, he copies it to Onedrive with rclone (he will have 你好‛|世界‛? on Onedrive), and then he copies it from Onedrive to a local Windows filesystem with Onedrive (he will have 你好‛|世界‛? on the local Windows filesystem), and if he copies it from the local Windows filesystem to Onedrive with rclone, he will have 你好‛‛|世界‛‛? on Onedrive.
The correct way is to use rclone from beginning to end, including creating, investigating, copying, and deleting. Then he will always only see 你好|世界?.

Should the two stacks be equal, no encoding is done. If the stacks are unequal, then encoding/decoding is done.

Seems correct but not exactly right.
There is also an intermediate layer of standard encoding between the source backend and the target backend, so the way to make this short-circuit happen is to specify both the source and target encoding as standard. see #7456 .

@URenko
Copy link
Contributor Author

URenko commented May 21, 2024

Yes if we could go back to 2019 I would definitely choose to develop another coding system from scratch, but it has been 5 years.
Remember also that None is the default encoding for many backends. (It led to my nightmare 5 years ago.)

In addition to a couple of incorrect implementations that I fixed, there are two main design issues (but not implementation issues) with the current coding system. They don't pose a real problem if the user can use them properly and fixing them is destructive.

  • Use full-width characters which are common in Chinese and Japanese, as escape characters (|, ?...).
    It is not easy for uses to use rclone from beginning to end.
    So at the very least, users are likely to create files on their own (e.g. 你好|世界?).
    Then rclone thinks it's encoded, i.e. it means 你好|世界? (which always happens unless encoding is Raw or use the short-circuit trick).
  • Use encoding by default.
    If we do not use encoding by default, we will not run into the problem mentioned above.
    If we could redesign the coding system. I'd recommend no default encoding and that the encoding be used as a layer of backend (like Hasher).
    And when people run into encoding issues with uploading (which should be rare), they will look at the docs and then realize that there is a backend that can solve the problem.

But for now I'd like to keep backward compatibility as much as possible. And, at least for sure, we need a way to actually disable coding (Raw)

@iseahound
Copy link

iseahound commented May 21, 2024

There seems to be problems with orphaned files with escaped characters in the filename:

If source and destination encoding is both windows this works fine,
If source and destination encoding is both None then deletion cannot happen.

test.bat (Save as UTF-8)

@echo off
chcp 65001

mkdir a
mkdir b
echo "hello world" > b/‛|.txt
rclone --config="" sync a/ b/ --local-encoding=None
pause

In other words, even if the two stacks "cancel each other out" there exists a difference in behavior. Using your standard --local-encoding None,Slash,Ctl,Del,Dot` seems to fix it for some reason...

@URenko
Copy link
Contributor Author

URenko commented May 21, 2024

Yes, there is an intermediate layer of "standard" encoding, so the short-circuit trick is to use "standard" for both source and destination.

But I think you have found another bug. 🙃

@iseahound
Copy link

I have a simple setup of 2 cloud services and one local drive 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants