docs/reference/glk-normalization.sgml

<?xml version="1.0"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
]>
<refentry id="chimara-A-Note-on-Unicode-Case-Folding-and-Normalization">
<refmeta>
<refentrytitle>A Note on Unicode Case-Folding and Normalization</refentrytitle>
<manvolnum>3</manvolnum>
<refmiscinfo>CHIMARA Library</refmiscinfo>
</refmeta>
<refnamediv>
<refname>A Note on Unicode Case-Folding and Normalization</refname>
<refpurpose>How to handle line input</refpurpose>
</refnamediv>
<refsect1>
<title>Description</title>
<para>
With all of these Unicode transformations hovering about, an author might reasonably ask what the right way to handle line input is.
Our recommendation is: call glk_buffer_to_lower_case_uni(), then glk_buffer_canon_normalize_uni(), and then parse the result.
The parser should, of course, match against strings (such as dictionary words) that have been put through the same two calls.
</para>
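<para>
The recommendation above can be sketched as follows. This is only an illustration in Python, not Glk code: <literal>str.lower()</literal> and <literal>unicodedata.normalize("NFC", ...)</literal> are assumed here to stand in for glk_buffer_to_lower_case_uni() and glk_buffer_canon_normalize_uni(), and the names <literal>prepare</literal>, <literal>DICTIONARY</literal> and <literal>known_words</literal> are hypothetical.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata


def prepare(text):
    """Lower-case the text, then normalize it to composed form (NFC)."""
    # Assumption: str.lower() stands in for glk_buffer_to_lower_case_uni().
    lowered = text.lower()
    # Assumption: NFC stands in for glk_buffer_canon_normalize_uni().
    return unicodedata.normalize("NFC", lowered)


# Hypothetical dictionary: its words go through the same process as the input.
DICTIONARY = {prepare(word) for word in ["caf\u00e9", "open", "door"]}


def known_words(line):
    """Return the words of an input line that appear in DICTIONARY."""
    return [word for word in prepare(line).split() if word in DICTIONARY]


# The player typed "CAFE" followed by U+0301 (COMBINING ACUTE ACCENT):
# an upper-case, decomposed spelling of the dictionary word.
typed = "Open CAFE\u0301 door"
print(known_words(typed))
```
]]></programlisting>
<para>
Because the input and the dictionary words pass through the same function, the upper-case, decomposed spelling still matches the composed, lower-case dictionary entry.
</para>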
<para>
The Unicode spec (section 3.13) gives a different, three-step process: decomposition, case-folding, and decomposition again.
Our recommendation departs from it through a series of practical compromises:
</para>
<itemizedlist>
  <listitem><para>
    The initial decomposition is only necessary because of a historical error in the Unicode spec: character U+0345 (COMBINING GREEK YPOGEGRAMMENI) behaves inconsistently.
    We ignore this case and skip this step.
  </para></listitem>
  <listitem><para>
    Case-folding is a slightly different operation from lower-casing.
    (Case-folding splits some combined characters, so that, for example, <quote>&szlig;</quote> can match both <quote>ss</quote> and <quote>SS</quote>.)
    However, Glk does not currently offer a case-folding function.
    We substitute glk_buffer_to_lower_case_uni().
  </para></listitem>
  <listitem><para>
    I'm not sure why the Unicode spec recommends decomposition (glk_buffer_canon_decompose_uni(), Normalization Form D) rather than composition (glk_buffer_canon_normalize_uni(), Normalization Form C).
    However, composed characters are the norm in source code, and therefore in compiled Inform game files.
    If we specified decomposition, the compiler would have to do extra work; also, the standard Inform dictionary table (with its fixed word length) would store fewer useful characters, because each decomposed accented letter takes up more than one character position.
    Therefore, we substitute glk_buffer_canon_normalize_uni().
  </para></listitem>
</itemizedlist>
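<para>
The practical effect of these compromises can be seen in a small comparison. Again this is a hedged sketch in Python rather than Glk code: <literal>unicode_caseless</literal> and <literal>recommended</literal> are hypothetical names, <literal>str.casefold()</literal> is assumed to stand in for the case-folding function that Glk lacks, and Python's Unicode tables may differ from a given Glk library's in edge cases.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata


def unicode_caseless(text):
    """The three-step process: decomposition, case-folding, decomposition again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def recommended(text):
    """The two-step recommendation: lower-case, then compose (NFC)."""
    return unicodedata.normalize("NFC", text.lower())


# Lower-casing leaves U+00DF (sharp s) alone, while case-folding splits it.
print("\u00df".lower() == "ss")     # False
print("\u00df".casefold() == "ss")  # True

# So the two spellings match under the three-step process only.
print(unicode_caseless("Stra\u00dfe") == unicode_caseless("STRASSE"))  # True
print(recommended("Stra\u00dfe") == recommended("STRASSE"))            # False

# Composed output is shorter: each accented letter is one code point, not two.
word = "D\u00c9J\u00c0"
print(len(recommended(word)), len(unicode_caseless(word)))  # 4 6
```
]]></programlisting>
<para>
The last line illustrates the dictionary argument: with a fixed word length, the composed form leaves room for more letters of the word than the decomposed form does.
</para>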
<note><para>
  We may revisit these recommendations in future versions of the spec.
</para></note>
</refsect1>
</refentry>