
Unicode and zero width characters #26

Open
gerald-brandt opened this issue Oct 29, 2020 · 8 comments

Comments

@gerald-brandt

gerald-brandt commented Oct 29, 2020

With zero-width characters being drawn as a question mark, I'm wondering how to display something like the images attached. In this case, the zero-width character should place a dot over the last symbol (image 1), but instead it displays a single-column-wide question mark (image 2).
[image_1]
[image_2]

There is also extra spacing in image 2 that shouldn't be there.

Is there a way to get the string displayed properly?

This is the string:
"\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\x9a\xe0\xa4\x82\x0a"

@magiblot
Owner

Hi Gerald!

Thanks for the question.

My top priority is to prevent the terminal display from becoming garbled. When I added Unicode support, since I was not sure how to deal with these combining characters, I decided to turn them into a question mark, as explained in README.md#Text display rules.

So this is not supported at the moment, but I may be able to figure something out.

@gerald-brandt
Author

Combining characters overlay the previous character, hence the zero width.

Where in the code would I find this?

What about the space between the glyphs that shouldn't be there? Is this because of the column-style layout?

@magiblot
Owner

Hi Gerald,

What about the space between the glyphs that shouldn't be there? Is this because of the column-style layout?

I don't know what may be causing this. There is nothing special I do to add space between glyphs. This may be caused by the terminal application itself.

Where in the code would I find this?

Okay. So I created a data struct representing the contents of a displayed cell, called TScreenCell (include/tvision/scrncell.h) and a data struct storing the text of a cell in UTF-8, called TCellChar (same header).

The screen is represented by a grid of TScreenCells. If a character is two columns wide, then this corresponds to two consecutive cells in the grid.

The current limitation is that TCellChar allows for just one character, when in reality a cell could hold a sequence of UTF-8 codepoints of arbitrary length.

The functions in the TText namespace (include/tvision/ttext.h) deal with text processing and TScreenCell initialization. The functions that rely directly on text width are TText::eat and TText::next. In fact, TText::eat is where zero-width characters are replaced with a question mark.

So, in order to support combining characters, the following has to change:

  • The data structures in scrncell.h. This will involve undoing some optimizations such as triviality. Also much of BufferedDisplay (source/platform/buffdisp.cpp), which manipulates cells directly.
  • The logic in ttext.h functions.

Cheers.

@bormant

bormant commented Oct 29, 2020

These may be related:

  • http://www.unicode.org/reports/tr29/, section "3 Grapheme Cluster Boundaries"
  • https://en.wikipedia.org/wiki/Combining_character

@gerald-brandt
Author

gerald-brandt commented Oct 29, 2020

It definitely is grapheme-based. The combining character is part of the grapheme. It sounds like the current implementation is codepoint-based, which is almost always the wrong way to do it, but so, so easy. It would be nice if it were as simple as changing a TCellChar into a string.

If utf8proc http://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html could be brought in to do all the unicode handling behind the scenes, it would probably simplify things.

magiblot added a commit that referenced this issue Oct 31, 2020
This turned out to be a lot easier than I anticipated. Zero-width characters simply get combined into the previous cell in TText::eat, or are not shown at all. The zero width joiner is always discarded so that it won't merge characters together, changing the width of a whole string.

In order to be able to combine several characters in the same screen cell, the size of TCellChar has been raised to 8 bytes. This allows a minimum of one combining character, which should be fine for most non-degenerate use cases. But since we use UTF-8, several codepoints may fit in there. It also preserves the assumption that TCellChar can be cast into a primitive type, so there's not much code that has to be changed elsewhere.

The only breaking change is TText::eat, where the 'width' and 'bytes' parameters are now indexes so that we can look back at the previous cell when we find a zero-width character. But this actually makes things simpler for the invokers of TText::eat. So it's a win-win change.

See #26.
magiblot added a commit that referenced this issue Oct 31, 2020
After some testing I have found Wikipedia articles where 8 bytes were not enough to fit all the diacritics in one cell. So I raised it to 12 bytes, where at least 2 combining characters can fit in the worst case. This should be enough for most real-world, natural language use cases. I don't care about zalgo.

12 bytes and alignas(4) is still a sweet spot for performance where most operations (including comparison) can be carried out in registers. It also preserves sizeof(TScreenCell) == 16, although that struct is likely to become larger in the future if true color support is added.

See #26.
magiblot added a commit to magiblot/turbo that referenced this issue Oct 31, 2020
@magiblot
Owner

magiblot commented Oct 31, 2020

Thank you everyone for your suggestions.

I tried replacing TCellChar with std::string, and it was a disaster. Turbo Vision likes to keep intermediary screen buffers, and has to move them around several times before data is printed to screen. So in a single screen flush, the TCellChar constructor can be invoked millions of times. For this reason the current implementation relies strongly on TCellChar, TScreenCell and related structs being small and trivial, so that they occupy contiguous memory locations and can be copied with memcpy.

You could argue that I'm coupling the system with an implementation detail, or doing premature optimization. But the truth is that representing each cell with an individual string is not a good solution to this problem. I'm pretty sure not even GUI applications store text this way.

Does Turbo Vision need to delegate Unicode processing to an external library? Actually, it doesn't. Turbo Vision is not a text editing component. What it needs to know is how text is displayed on the terminal, and this is platform-dependent, while the Unicode standard is not. So it doesn't help me at all to know that "👨‍👩‍👧‍👦" is a grapheme cluster if the terminal will display it differently:

Screenshot_20201031_172043

Even if it's true that an arbitrary number of codepoints can fit in a single cell, I realized that:

  • In real-world use cases of natural language, you rarely ever need more than two or three combining characters together.
  • Common cases where lots of combining characters are used are:
    • Emojis, which as the picture above shows, are usually not grouped together by terminal applications.
    • Zalgo text, which I don't care about.

So what I did was:

  • Resize TCellChar from 4 to 12 bytes, making it capable of holding several codepoints encoded in UTF-8.
  • Change the logic in TText::eat so that zero-width characters are combined with the previous cell. If the TCellChar in the cell is full and no more text fits in it, nothing fatal happens: the character is simply discarded and won't be printed on screen.
  • Always discard the ZERO WIDTH JOINER character, which causes emojis to get combined on a few terminals (e.g. Kitty). This ensures text is displayed in a predictable way.

This preserves the already present assumptions, the most important of which is that the width of a string is the sum of the width of its characters. The performance impact of this feature is also minimal, because TCellChar is still trivial and is 4-byte-aligned.

No changes are required in the source code of Turbo Vision applications, except those using TText::eat or TText::next directly (the only one I am aware of being Turbo, which I maintain myself).

Screenshot_20201031_175358

Screenshot_20201031_180341

Terminals which do not respect the result of wcwidth will suffer from screen garbling. This is the case with Hangul Jamo:

"ᅥ ᅦ ᅧ ᅨ ᅩ ᅪ ᅫ ᅬ ᅭ ᅮ ᅯ ᅰ ᅱ ᅲ ᅳ ᅴ ᅵ ᅶ ᅷ ᅸ ᅹ ᅺ ᅻ ᅼ ᅽ ᅾ ᅿ ᆀ ᆁ ᆂ ᆃ ᆄ

wcwidth for each of these characters is 0, so I'd expect them to combine with the space before them, but many terminals (Konsole, GNOME Terminal...) display them as standalone characters. Xterm and Alacritty satisfy my expectations.

Should Turbo Vision use an external Unicode library to determine that these characters have a width of 1? Tilde is another application with good Unicode support. It treats these characters as one column wide instead of zero. Guess what, it suffers from screen garbling on Xterm and Alacritty. So you can see how difficult it is to get this right.

I suggest you upgrade to the latest commit and try again. The Turbo text editor has also been updated.

At this point, the most improvable thing is string iteration with TText::next and TText::prev, which is still codepoint-based. So when navigating text with arrow keys, you will see the cursor stop at every combining character. But this doesn't worry me as much.

Cheers.

@unxed
Contributor

unxed commented Oct 31, 2020

It definitely is grapheme-based. The combining character is part of the grapheme. It sounds like the current implementation is codepoint-based, which is almost always the wrong way to do it, but so, so easy. It would be nice if it were as simple as changing a TCellChar into a string.

Users don't care about graphemes and code points. Users do care about their experience. They just want to have all letters/signs required by their language working :)

Perhaps the limit on the number of code points per screen cell may matter in the future, if real-world problems arise that can only be solved by allowing many code points per cell. But history shows that looking too far into the future is not always the best option. Microsoft decided to look into the future by choosing UTF-16 as the standard for their Windows API, and now they live with the most awkward Unicode representation of all.

magiblot added a commit that referenced this issue Nov 2, 2020
The solution to this is to return a boolean from TText::eat that indicates whether it should be invoked again or not. This makes it simpler to use in some cases and more complex in others.

Related: #26.
magiblot added a commit to magiblot/turbo that referenced this issue Nov 2, 2020
@gerald-brandt
Author

This looks good in my quick tests. Thanks for the work!
