Making Unicode PDF bookmarks with TeXML

I've added a new feature to TeXML: the content of the "pdf" element is converted to UTF-16BE and written out as escape sequences. This is useful for producing PDF strings, in particular PDF bookmark strings.
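To make the conversion concrete, here is a minimal Python sketch of what such an encoding looks like. It is not TeXML's actual implementation; the function name pdf_escape is made up for illustration, and the three-digit octal format is inferred from the generated TeX output shown in this post.

```python
def pdf_escape(text: str) -> str:
    """Encode text as UTF-16BE and write each byte as a backslash
    followed by three octal digits (hypothetical helper, not TeXML code)."""
    data = text.encode("utf-16-be")
    return "".join("\\%03o" % byte for byte in data)


title = "Заголовок (Title)"
escaped = pdf_escape(title)
print(escaped)
```

Every character, including plain ASCII such as the space and the parentheses, takes two bytes in UTF-16BE, so each one becomes two escapes.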

Problem

Let's start with an example that doesn't quite work. If the text used only Latin letters, there would be no problem. But part of the text is in Russian, and that causes trouble.

<TeXML>
  <cmd name="documentclass" nl2="1">
    <parm>article</parm>
  </cmd>
  <cmd name="usepackage" nl2="1">
    <opt>T2A</opt>
    <parm>fontenc</parm>
  </cmd>
  <cmd name="usepackage" nl2="1">
    <opt>koi8-r</opt>
    <parm>inputenc</parm>
  </cmd>
  <cmd name="usepackage">
    <parm>hyperref</parm>
  </cmd>
  <env name="document">
    <cmd name="section">
      <parm>&#1047;&#1072;&#1075;&#1086;
&#1083;&#1086;&#1074;&#1086;&#1082; (Title)</parm>
    </cmd>
    &#1058;&#1077;&#1082;&#1089;&#1090; (Text)
  </env>
</TeXML>

Let's convert it to LaTeX using the following command:

$ texml -e koi8-r file1.xml file1.tex

The result is as expected:

\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[koi8-r]{inputenc}
\usepackage{hyperref}
\begin{document}
\section{Заголовок (Title)} Текст (Text)
\end{document}

Unfortunately, it doesn't compile cleanly:

$ pdflatex file1.tex 
...
Package hyperref Warning: Glyph not defined in PD1 encoding,
(hyperref)                removing `\CYRZ' on input line 6.
...

As a result, there are no Russian letters in the bookmarks.

Solution

Let's modify the TeXML file a bit. The changes are the "unicode" option for hyperref and the new optional argument of "section".

<TeXML>
  <cmd name="documentclass" nl2="1">
    <parm>article</parm>
  </cmd>
  <cmd name="usepackage" nl2="1">
    <opt>T2A</opt>
    <parm>fontenc</parm>
  </cmd>
  <cmd name="usepackage" nl2="1">
    <opt>koi8-r</opt>
    <parm>inputenc</parm>
  </cmd>
  <cmd name="usepackage">
    <opt>unicode</opt>
    <parm>hyperref</parm>
  </cmd>
  <env name="document">
    <cmd name="section">
      <opt>
        <cmd name="texorpdfstring">
          <parm></parm>
          <parm><pdf>&#1047;&#1072;&#1075;&#1086;
&#1083;&#1086;&#1074;&#1086;&#1082; (Title)</pdf></parm>
        </cmd>
      </opt>
      <parm>&#1047;&#1072;&#1075;&#1086;
&#1083;&#1086;&#1074;&#1086;&#1082; (Title)</parm>
    </cmd>
    &#1058;&#1077;&#1082;&#1089;&#1090; (Text)
  </env>
</TeXML>

This file produces the following TeX file. The hyperref line and the optional argument of \section are what changed.

\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[koi8-r]{inputenc}
\usepackage[unicode]{hyperref}
\begin{document}
\section[ \texorpdfstring{}{\004\027\004\060\004\063\004\076
\004\073\004\076\004\062\004\076\004\072\000\040\000\050\000
\124\000\151\000\164\000\154\000\145\000\051}
]{Заголовок (Title)} Текст (Text)
\end{document}

With the "unicode" option, hyperref marks PDF strings as Unicode strings. The square brackets after "section" provide an alternative section name for the TOC and the bookmarks. The command "texorpdfstring" expands to its first argument in TeX mode and to its second argument in PDF mode. As I don't need content for TeX, I left the first argument empty. (Warning: I'm probably wrong here. What about the TOC?) The second argument is a Unicode version of the title.
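To check what a generated escape string actually contains, it can be decoded back into text. The following Python sketch does that under the assumption that the string consists only of three-digit octal escapes, as in the output above; the helper name pdf_unescape is made up for illustration and is not part of TeXML or hyperref.

```python
import re


def pdf_unescape(escaped: str) -> str:
    """Turn a run of three-digit octal escapes back into text,
    reading the bytes as UTF-16BE (hypothetical helper)."""
    # Line breaks between escapes are simply skipped by the pattern.
    octal_bytes = re.findall(r"\\([0-3][0-7]{2})", escaped)
    data = bytes(int(digits, 8) for digits in octal_bytes)
    return data.decode("utf-16-be")


fragment = r"\004\027\004\060\004\063\004\076"
print(pdf_unescape(fragment))  # prints "Заго"
```

The fragment is the first four characters of the bookmark string from the generated TeX file.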

Compile with pdflatex -- and see the bookmark!

Categories: TeX TeXML