Converting mixed urdu and english text RTF to HTML causes messed up characters #9758

Open
GOTO10-DW opened this issue May 14, 2024 · 7 comments

@GOTO10-DW

Explain the problem.
Converting from RTF to HTML produces garbled characters. I'm not sure whether this is related to #9683.
I use this command line:
pandoc.exe input.rtf --metadata title=" " -f rtf -t html -s -o output.html

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1031{\fonttbl{\f0\fswiss\fprq2\fcharset0 Calibri;}{\f1\fswiss\fprq2\fcharset178 Calibri;}{\f2\fnil\fcharset0 Arial;}}
{\*\generator Riched20 10.0.14393}\viewkind4\uc1 
\pard\rtlpar\widctlpar\qr\f0\fs22\line\f1\rtlch\lang1025\'c7\'e4\'d1\'cc\u1740? \'ca\'aa\'e4\'98 \'8a\u1740?\'e4\'98 \'c7\'e3\'c8\'d1 \'98\'ff \'e3\'d8\'c7\'c8\'de \'c0\'e6\'c7 \'d3\'ff \'c8\'cc\'e1\u1740? \'98\u1740? \'81\u1740?\'cf\'c7\'e6\'c7\'d1 \'98\'e6 \'c8\'9a\'aa\'c7\'e4\'ff \'e3\u1740?\'9f \u1740?\'e6\'d1\'81\u1740?\'e4 \u1740?\'e6\'e4\u1740?\'e4 \'98\'ff \'e3\'e3\'c7\'e1\'982022\f0\ltrch\lang1031  \~\f1\rtlch\lang1025\'e3\u1740?\'9f \'81\u1740?\'8d\'aa\'ff \'d1\'c0 \f0\ltrch\lang1031\par
\par

\pard\rtlpar\qr\f2\fs24\par

\pard\ltrpar\par
 Brussels (dpa) - European Union countries fell behind in 2022 on\par
expanding wind power generation, a study by the energy think tank\par
Ember found.\par
\par
} 

[Screenshots: RTF (Input), HTML (Output)]

Pandoc version?
Pandoc 3.2 on Windows Server 2016

@GOTO10-DW GOTO10-DW added the bug label May 14, 2024
@GOTO10-DW GOTO10-DW changed the title from "Converting mixed arabic and english text RTF to HTML causes messed up characters" to "Converting mixed urdu and english text RTF to HTML causes messed up characters" May 14, 2024
@jgm
Owner

jgm commented May 14, 2024

I assume you got the "unsupported code page" warning? This is the same issue as #9683. We can't really support all the legacy code pages; maybe there's a way to convert your document to unicode prior to passing it to pandoc?

@jgm jgm closed this as completed May 14, 2024
@GOTO10-DW
Author

I got no warning when I converted the document. If there is no mixed text, the conversion runs fine.

@jgm jgm reopened this May 15, 2024
@jgm
Owner

jgm commented May 15, 2024

OK, I jumped to conclusions. Actually it's \ansicpg1252, which we support, so the problem lies elsewhere...

@jgm
Owner

jgm commented May 15, 2024

Hm, cp1252 just has latin characters: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

There's \fcharset178 which would probably tell us the meaning of the bytes of character data, if we had the proper lookup table. RTF spec just says "Specifies the character set of a font in the font table. Values for N are defined by Windows header files, and in the file RTFDEFS.H accompanying this document." but I can't find RTFDEFS.H.
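For illustration, here is a small sketch of what such a lookup would involve. It assumes that \fcharset178 is the Windows ARABIC_CHARSET value and corresponds to code page 1256. That mapping is not stated anywhere in this thread, although it agrees with the \uN / \'xx pairs in the working sample posted later (for example \u1576 \'c8). The table and function names are hypothetical, and this is Python for demonstration only, not pandoc's reader code.

```python
# Assumed mapping from RTF \fcharsetN values to Python codec names.
# Only the two values that appear in this issue are listed.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI_CHARSET (Western European)
    178: "cp1256",  # ARABIC_CHARSET (assumption, see lead-in)
}


def decode_hex_escapes(hex_bytes, fcharset):
    """Decode the bytes from a run of \\'xx escapes using the font's charset."""
    codec = FCHARSET_TO_CODEC[fcharset]
    raw = bytes(int(h, 16) for h in hex_bytes)
    return raw.decode(codec)


# First four escapes of the failing document: \'c7\'e4\'d1\'cc
escapes = ["c7", "e4", "d1", "cc"]

# Interpreted with the font's own charset, the bytes are Arabic-script letters.
as_arabic = decode_hex_escapes(escapes, 178)

# Interpreted with the document-wide \ansicpg1252, the same bytes become
# accented Latin letters, which is the kind of garbling reported above.
as_latin = decode_hex_escapes(escapes, 0)

print([hex(ord(c)) for c in as_arabic])
print(as_latin)
```

The point of the sketch is only that the meaning of a \'xx byte depends on the charset of the font in effect, not on \ansicpg alone. A complete table would need an entry for every \fcharset value.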

@jgm
Owner

jgm commented May 15, 2024

In any case this might be out of scope, if it requires large lookup tables corresponding to fonts (see discussion of the other linked issue).

@GOTO10-DW
Author

Thanks for looking into my problem. I can give you a working example if this helps. I'm not very familiar with RTF.

{\rtf1\ansi\deflang1031\ftnbj\uc1\deff0
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \froman Times New Roman;}{\f2 \fnil Century Gothic;}{\f3 \fmodern Courier New;}{\f4 \fswiss Arial;}{\f5 \froman \fcharset0 arial;}{\f6 \froman \fcharset178 arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\f1\fs20\cf0\cb1\chcbpat1\ulc0 Normal;}{\cs1\cf0\cb1\chcbpat1\ulc0 Default Paragraph Font;}}
{\*\revtbl{Unknown;}}
\paperw12240\paperh15840\margl1080\margr1080\margt1080\margb1080\headery720\footery720\htmautsp1\nogrowautofit\deftab720\formshade\fet4\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn12240\pghsxn15840\marglsxn1080\margrsxn1080\margtsxn1080\margbsxn1080\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs20\sb135\sa270\ql\sbauto1\saauto1\hich\f6\dbch\f6\loch\f6\fs24\rtlch\u1576 \'c8\u1726 \'aa\u1575 \'c7\u1585 \'d1\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1740 \'3f\u1672 \'8f\u1740 
\'3f\u1575 \'c7\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1575 \'c7\u1608 \'e6\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1740 \'3f\u1608 \'e6\u1586 \'d2\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1575 
\'c7\u1740 \'3f\u1580 \'cc\u1606 \'e4\u1587 \'d3\u1740 \'3f\u1608 \'e6\u1722 \'9f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1705 \'98\u1746 \'ff\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1591 \'d8\u1575 
\'c7\u1576 \'c8\u1602 \'de\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1740 \'3f\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1574 \'c6\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1578 
\'ca\u1601 \'dd\u1578 \'ca\u1740 \'3f\u1588 \'d4\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1575 \'c7\u1604 \'e1\u1740 \'3f\u1575 \'c7\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1580 
\'cc\u1585 \'d1\u1575 \'c7\u1574 \'c6\u1605 \'e3\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1662 \'81\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1711 \'90\u1575 \'c7\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch 
 \hich\f6\dbch\f6\loch\f6\rtlch\u1585  }

image

@jgm
Owner

jgm commented May 15, 2024

The one that works contains unicode escapes to back up the single-byte font characters; that's why it works. I think there might be programs that will unicodify an existing RTF document -- maybe Word can do this? You could look into it.
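If no existing tool is at hand, a rough preprocessing step along these lines might be enough to "unicodify" a file before passing it to pandoc. This is a minimal sketch under several assumptions: \fcharset178 means code page 1256, the document uses \uc1 (as both samples here do), group scoping with braces can be ignored when tracking the current font, and the input does not already carry \uN escapes whose fallback is a \'xx byte (the working sample does, so it should not be run through this). The function name `unicodify_rtf` and the mapping table are hypothetical.

```python
import re

# Assumed \fcharsetN -> codec mapping; extend as needed.
CHARSET_TO_CODEC = {0: "cp1252", 178: "cp1256"}

# Font table entries such as {\f1\fswiss\fprq2\fcharset178 Calibri;}
FONT_DEF = re.compile(r"\\f(\d+)[^;{}]*?\\fcharset(\d+)")

# Either a font switch (\fN) or a single-byte hex escape (\'xx).
TOKEN = re.compile(r"\\f(\d+)|\\'([0-9a-fA-F]{2})")


def unicodify_rtf(rtf: str) -> str:
    """Rewrite \\'xx escapes in non-cp1252 fonts as \\uN? escapes."""
    # Map font number -> charset number, read from the font table.
    fonts = {int(f): int(cs) for f, cs in FONT_DEF.findall(rtf)}
    current_font = None

    def replace(match):
        nonlocal current_font
        if match.group(1) is not None:
            # Font switch: remember it and leave the control word as is.
            current_font = int(match.group(1))
            return match.group(0)
        codec = CHARSET_TO_CODEC.get(fonts.get(current_font))
        if codec is None or codec == "cp1252":
            # Unknown charset, or already covered by \ansicpg1252.
            return match.group(0)
        try:
            char = bytes([int(match.group(2), 16)]).decode(codec)
        except UnicodeDecodeError:
            return match.group(0)
        code = ord(char)
        if code > 32767:
            code -= 65536  # \uN takes a signed 16-bit value
        # "?" is the one-character fallback expected under \uc1.
        return f"\\u{code}?"

    return TOKEN.sub(replace, rtf)


if __name__ == "__main__":
    sample = (
        r"{\rtf1\ansi\ansicpg1252{\fonttbl{\f0\fcharset0 Calibri;}"
        r"{\f1\fcharset178 Calibri;}}\f1\'c7\'e4\f0 caf\'e9}"
    )
    print(unicodify_rtf(sample))
```

The converted text could then be written to a new file and given to the same pandoc command as before. A real RTF parser would track font changes per group rather than linearly, so treat this as a starting point and check the result.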
