Converting mixed urdu and english text RTF to HTML causes messed up characters #9758

GOTO10-DW · 2024-05-14T11:22:56Z

Explain the problem.
Converting from RTF to HTML produces garbled characters. I'm not sure whether this is the same problem as #9683.
I use this command line:
pandoc.exe input.rtf --metadata title=" " -f rtf -t html -s -o output.html

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1031{\fonttbl{\f0\fswiss\fprq2\fcharset0 Calibri;}{\f1\fswiss\fprq2\fcharset178 Calibri;}{\f2\fnil\fcharset0 Arial;}}
{\*\generator Riched20 10.0.14393}\viewkind4\uc1 
\pard\rtlpar\widctlpar\qr\f0\fs22\line\f1\rtlch\lang1025\'c7\'e4\'d1\'cc\u1740? \'ca\'aa\'e4\'98 \'8a\u1740?\'e4\'98 \'c7\'e3\'c8\'d1 \'98\'ff \'e3\'d8\'c7\'c8\'de \'c0\'e6\'c7 \'d3\'ff \'c8\'cc\'e1\u1740? \'98\u1740? \'81\u1740?\'cf\'c7\'e6\'c7\'d1 \'98\'e6 \'c8\'9a\'aa\'c7\'e4\'ff \'e3\u1740?\'9f \u1740?\'e6\'d1\'81\u1740?\'e4 \u1740?\'e6\'e4\u1740?\'e4 \'98\'ff \'e3\'e3\'c7\'e1\'982022\f0\ltrch\lang1031  \~\f1\rtlch\lang1025\'e3\u1740?\'9f \'81\u1740?\'8d\'aa\'ff \'d1\'c0 \f0\ltrch\lang1031\par
\par

\pard\rtlpar\qr\f2\fs24\par

\pard\ltrpar\par
 Brussels (dpa) - European Union countries fell behind in 2022 on\par
expanding wind power generation, a study by the energy think tank\par
Ember found.\par
\par
}


Pandoc version?
Pandoc 3.2 on Windows Server 2016


jgm · 2024-05-14T15:36:35Z

I assume you got the "unsupported code page" warning? This is the same issue as #9683. We can't really support all the legacy code pages; maybe there's a way to convert your document to unicode prior to passing it to pandoc?

GOTO10-DW · 2024-05-15T06:02:16Z

I got no warning when I converted the document. If the text is not mixed, the conversion runs fine.

jgm · 2024-05-15T06:17:36Z

OK, I jumped to conclusions. Actually it's \ansicpg1252 (code page 1252), which we support, so the problem lies elsewhere...

jgm · 2024-05-15T06:30:43Z

Hm, cp1252 just has latin characters: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

There's \fcharset178 which would probably tell us the meaning of the bytes of character data, if we had the proper lookup table. RTF spec just says "Specifies the character set of a font in the font table. Values for N are defined by Windows header files, and in the file RTFDEFS.H accompanying this document." but I can't find RTFDEFS.H.
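As a rough illustration of what such a lookup could look like, here is a minimal Python sketch. It rests on an assumption that is not stated in this thread: that the `\fcharset` values follow the Windows `*_CHARSET` constants, where 178 is `ARABIC_CHARSET` and corresponds to code page 1256. The names `CHARSET_TO_CODEC` and `decode_hex_escapes` are hypothetical. The working example posted later in this thread is consistent with the assumption, since it pairs `\u1705` with `\'98` and `\u1575` with `\'c7`, which is exactly what cp1256 gives.

```python
import re

# Assumption: \fcharsetN uses the Windows *_CHARSET constants.
# Only a few entries are listed; this is not a complete table.
CHARSET_TO_CODEC = {
    177: "cp1255",  # HEBREW_CHARSET
    178: "cp1256",  # ARABIC_CHARSET
    204: "cp1251",  # RUSSIAN_CHARSET
}

HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(escapes: str, fcharset: int) -> str:
    """Decode a run of RTF \\'xx escapes using the font's charset."""
    raw = bytes(int(h, 16) for h in HEX_ESCAPE.findall(escapes))
    return raw.decode(CHARSET_TO_CODEC[fcharset])


# First four escapes of the Urdu paragraph in the failing document.
word = decode_hex_escapes(r"\'c7\'e4\'d1\'cc", 178)
print(word)
```

Decoding the same four bytes as cp1252 instead yields Latin letters, which would be consistent with the garbled output reported above.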

jgm · 2024-05-15T06:31:20Z

In any case this might be out of scope, if it requires large lookup tables corresponding to fonts (see discussion of the other linked issue).

GOTO10-DW · 2024-05-15T07:27:33Z

Thanks for looking into my problem. I can give you a working example if that helps. I'm not very familiar with RTF.

{\rtf1\ansi\deflang1031\ftnbj\uc1\deff0
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \froman Times New Roman;}{\f2 \fnil Century Gothic;}{\f3 \fmodern Courier New;}{\f4 \fswiss Arial;}{\f5 \froman \fcharset0 arial;}{\f6 \froman \fcharset178 arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\f1\fs20\cf0\cb1\chcbpat1\ulc0 Normal;}{\cs1\cf0\cb1\chcbpat1\ulc0 Default Paragraph Font;}}
{\*\revtbl{Unknown;}}
\paperw12240\paperh15840\margl1080\margr1080\margt1080\margb1080\headery720\footery720\htmautsp1\nogrowautofit\deftab720\formshade\fet4\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn12240\pghsxn15840\marglsxn1080\margrsxn1080\margtsxn1080\margbsxn1080\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs20\sb135\sa270\ql\sbauto1\saauto1\hich\f6\dbch\f6\loch\f6\fs24\rtlch\u1576 \'c8\u1726 \'aa\u1575 \'c7\u1585 \'d1\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1740 \'3f\u1672 \'8f\u1740 
\'3f\u1575 \'c7\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1575 \'c7\u1608 \'e6\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1740 \'3f\u1608 \'e6\u1586 \'d2\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1575 
\'c7\u1740 \'3f\u1580 \'cc\u1606 \'e4\u1587 \'d3\u1740 \'3f\u1608 \'e6\u1722 \'9f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1705 \'98\u1746 \'ff\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1591 \'d8\u1575 
\'c7\u1576 \'c8\u1602 \'de\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1740 \'3f\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1574 \'c6\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1578 
\'ca\u1601 \'dd\u1578 \'ca\u1740 \'3f\u1588 \'d4\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1575 \'c7\u1604 \'e1\u1740 \'3f\u1575 \'c7\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1580 
\'cc\u1585 \'d1\u1575 \'c7\u1574 \'c6\u1605 \'e3\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1662 \'81\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1711 \'90\u1575 \'c7\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch 
 \hich\f6\dbch\f6\loch\f6\rtlch\u1585  }

jgm · 2024-05-15T15:11:08Z

The one that works contains unicode escapes to back up the single-byte font characters; that's why it works. I think there might be programs that will unicodify an existing RTF document -- maybe Word can do this? You could look into it.
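If no existing tool is at hand, the idea of adding Unicode escapes in front of the single-byte characters can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not a full RTF parser and not something pandoc provides: `unicodify` and `CHARSET_CODECS` are hypothetical names, the charset-to-code-page mapping is assumed from the Windows `*_CHARSET` constants (178 as cp1256), and the document is assumed to use the default `\uc1`, so each `\uN` is followed by exactly one fallback character.

```python
import re

# Assumption: Windows *_CHARSET constants mapped to Python codecs (partial).
CHARSET_CODECS = {177: "cp1255", 178: "cp1256", 204: "cp1251"}

# Font table entries such as {\f1\fswiss\fprq2\fcharset178 Calibri;}
FONT_DEF = re.compile(r"\{\\f(\d+)[^{}]*?\\fcharset(\d+)")
# Tokens we care about: escaped specials, braces, \uN, \'xx, \fN.
TOKEN = re.compile(r"\\[\\{}]|[{}]|\\u-?\d+ ?|\\'([0-9a-fA-F]{2})|\\f(\d+)")
HEX = re.compile(r"\\'[0-9a-fA-F]{2}")


def unicodify(rtf: str) -> str:
    """Prefix \\'xx escapes in non-Latin fonts with a matching \\uN escape."""
    fonts = {
        int(f): CHARSET_CODECS[int(cs)]
        for f, cs in FONT_DEF.findall(rtf)
        if int(cs) in CHARSET_CODECS
    }
    out, pos, font, stack = [], 0, None, []
    while (m := TOKEN.search(rtf, pos)):
        out.append(rtf[pos:m.start()])  # copy untouched text
        tok, pos = m.group(0), m.end()
        if tok == "{":
            stack.append(font)  # font changes are scoped to the group
        elif tok == "}":
            font = stack.pop() if stack else font
        elif tok.startswith("\\u"):
            # Existing Unicode escape: keep it, and keep a \'xx fallback as is.
            fallback = HEX.match(rtf, pos)
            if fallback:
                tok += fallback.group(0)
                pos = fallback.end()
        elif m.group(2) is not None:
            font = int(m.group(2))  # \fN switches the current font
        elif m.group(1) is not None and font in fonts:
            ch = bytes.fromhex(m.group(1)).decode(fonts[font], errors="replace")
            n = ord(ch)
            if n > 32767:  # \uN takes a signed 16-bit value
                n -= 65536
            tok = f"\\u{n}{tok}"  # original \'xx becomes the fallback
        out.append(tok)
    out.append(rtf[pos:])
    return "".join(out)
```

RTF files are normally 7-bit ASCII, so reading the input with `encoding="latin-1"`, passing the text through `unicodify`, and writing the result back should leave everything else untouched; the rewritten file can then be given to the same pandoc command as before.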

GOTO10-DW added the bug label May 14, 2024

GOTO10-DW changed the title ~~Converting mixed arabic and english text RTF to HTML causes messed up characters~~ Converting mixed urdu and english text RTF to HTML causes messed up characters May 14, 2024

jgm closed this as completed May 14, 2024

jgm reopened this May 15, 2024
