Uunf normalizes Unicode text. It supports all Unicode
normalization forms. The module is independent from any IO mechanism or
Unicode text data structure and it can process text without a
complete in-memory representation of the data.
Release 0.9.3 — Unicode version 7.0.0 — Daniel Bünzli <daniel.buenzl email@example.com>
val is_scalar_value : int -> bool
is_scalar_value n is true iff n is a Unicode scalar value.
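For illustration, the test can be written directly from the Unicode definition of a scalar value: any code point in U+0000..U+D7FF or U+E000..U+10FFFF, that is, any code point except the surrogates. The following is a minimal sketch under that definition, not necessarily how Uunf implements it; the name is_scalar_value_sketch is hypothetical.

```ocaml
(* Sketch only: mirrors the Unicode definition of a scalar value.
   [is_scalar_value_sketch] is a hypothetical name, not part of Uunf. *)
let is_scalar_value_sketch (n : int) : bool =
  (0x0000 <= n && n <= 0xD7FF) || (0xE000 <= n && n <= 0x10FFFF)
```

A check like this is useful before handing integers of unknown origin to a normalizer in a `Uchar value.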
type form = [ `NFC | `NFD | `NFKC | `NFKD ]
The type for normalization forms.
`NFD: normalization form D, canonical decomposition.
`NFC: normalization form C, canonical decomposition followed by canonical composition (recommended for the Web).
`NFKD: normalization form KD, compatibility decomposition.
`NFKC: normalization form KC, compatibility decomposition followed by canonical composition.
val create : [< form ] -> t
create nf is a Unicode text normalizer for the normal form nf.
val form : t -> form
form n is the normalization form of n.
val add : t -> [ `Await | `End | `Uchar of uchar ] -> [ `Await | `Uchar of uchar ]
add n v is:
`Uchar u if u is the next character in the normalized sequence. The client must then call add with `Await until `Await is returned.
`Await when the normalizer is ready to add a new `Uchar or `End.
For v, use `Uchar u to add a new character to the sequence to normalize and `End to signal the end of sequence. After adding one of these two values, always call add with `Await until `Await is returned.
Raises Invalid_argument if `Uchar or `End is added directly after an `Uchar was returned by the normalizer, or if an `Uchar is added after `End was added.
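The protocol above can be captured once in a small driver. The sketch below is written against a step function with the same shape as a partially applied Uunf.add n, so that it can be read and run without the library; the names input, output, drain, normalize_list and toy_step are hypothetical, and toy_step is only a stand-in filter (it delays each character by one step and changes nothing), not a normalizer. It assumes uchar is int, as in the `Uchar 0x00E9 example of this page.

```ocaml
(* Types mirroring the argument and result of Uunf.add (uchar assumed int). *)
type input = [ `Await | `End | `Uchar of int ]
type output = [ `Await | `Uchar of int ]

(* [drain step acc v] gives [v] to [step], then keeps answering `Await
   until the filter itself returns `Await. Output is collected in reverse. *)
let rec drain (step : input -> output) acc v =
  match step v with
  | `Uchar u -> drain step (u :: acc) `Await
  | `Await -> acc

(* [normalize_list step us] adds every character of [us], then `End. *)
let normalize_list (step : input -> output) (us : int list) : int list =
  let acc = List.fold_left (fun acc u -> drain step acc (`Uchar u)) [] us in
  List.rev (drain step acc `End)

(* Hypothetical stand-in for [Uunf.add n]: holds back the last character
   added and releases it when the next one (or `End) arrives. *)
let toy_step () : input -> output =
  let held = ref None and out = Queue.create () in
  let flush () = match !held with
    | Some h -> Queue.add h out; held := None
    | None -> ()
  in
  fun v ->
    (match v with
     | `Uchar u -> flush (); held := Some u
     | `End -> flush ()
     | `Await -> ());
    if Queue.is_empty out then `Await else `Uchar (Queue.pop out)
```

With the library available, the same driver would be called as normalize_list (Uunf.add (Uunf.create `NFC)) us. Because drain always answers `Await until it gets `Await back, the error conditions described above cannot be triggered through it, provided a fresh or reset normalizer is used for each sequence.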
val reset : t -> unit
reset n resets the normalizer to a state equivalent to the state of Uunf.create (Uunf.form n).
val copy : t -> t
copy n is a copy of n in its current state. Subsequent calls to Uunf.add on n do not affect the copy.
These properties are used internally to implement the normalizers.
They are not needed to use the module but are exposed as they may
be useful to implement other algorithms.
val unicode_version : string
unicode_version is the Unicode version supported by the module.
val ccc : uchar -> int
ccc u is u's canonical combining class value.
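As an example of another algorithm that can be built on this property, the canonical ordering step of normalization stably sorts each maximal run of characters with a non-zero combining class, leaving characters of class 0 in place. The sketch below takes the ccc function as an argument so that it stands alone; canonical_order is a hypothetical name, not part of Uunf, and uchar is assumed to be int.

```ocaml
(* Sketch only. [canonical_order ~ccc us] stably sorts every maximal run
   of characters whose combining class is non-zero; characters with
   class 0 are left where they are. *)
let canonical_order ~(ccc : int -> int) (us : int list) : int list =
  let by_ccc a b = compare (ccc a) (ccc b) in
  (* [run] is the current run of marks, reversed; [acc] is the output so
     far, reversed. [flush] sorts the run and pushes it onto [acc]. *)
  let flush run acc =
    List.rev_append (List.stable_sort by_ccc (List.rev run)) acc
  in
  let rec go run acc = function
    | [] -> List.rev (flush run acc)
    | u :: rest when ccc u = 0 -> go [] (u :: flush run acc) rest
    | u :: rest -> go (u :: run) acc rest
  in
  go [] [] us
```

With the library available this would be called as canonical_order ~ccc:Uunf.ccc us. The stable sort matters: marks with equal combining class must keep their relative order.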
val decomp : uchar -> int array
decomp u is u's decomposition mapping. If the empty array is returned, u decomposes to itself.
The first number in the array contains additional information; it cannot be used as a Uunf.uchar. Use Uunf.d_uchar on the number to get the actual character and Uunf.d_compatibility to find out if this is a compatibility decomposition. All other elements of the array are guaranteed to be of type Uunf.uchar.
Warning. Do not mutate the array.
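val d_uchar : int -> uchar
d_uchar i is the actual character encoded in the first element i of a decomposition array. See Uunf.decomp.
val d_compatibility : int -> bool
d_compatibility i is true iff the first element i of a decomposition array marks a compatibility decomposition. See Uunf.decomp.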
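The three functions are meant to be used together. The sketch below turns the raw array into a plain value without mutating it; decode_decomp is a hypothetical helper, and the two decoding functions are passed in as arguments so that the block stands alone. It assumes uchar is int.

```ocaml
(* Sketch only. [decode_decomp] is a hypothetical helper; pass it
   Uunf.d_uchar and Uunf.d_compatibility together with the result of
   Uunf.decomp. Returns [None] when the character decomposes to itself,
   otherwise [Some (is_compatibility, characters)]. *)
let decode_decomp
    ~(d_uchar : int -> int)
    ~(d_compatibility : int -> bool)
    (d : int array) : (bool * int list) option =
  if Array.length d = 0 then None
  else
    (* Array.mapi builds a fresh array, so [d] itself is never mutated.
       Only the first element carries extra information to strip. *)
    let chars = Array.mapi (fun i x -> if i = 0 then d_uchar x else x) d in
    Some (d_compatibility d.(0), Array.to_list chars)
```

With the library available, a call would look like decode_decomp ~d_uchar:Uunf.d_uchar ~d_compatibility:Uunf.d_compatibility (Uunf.decomp u).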
val composite : uchar -> uchar -> uchar option
composite u1 u2 is the primary composite canonically equivalent to the sequence <u1,u2>, if any.
A Uunf normalizer consumes only a small, bounded amount of memory on ordinary, meaningful text. However, on legal but degenerate text, like a starter followed by a very long run of combining marks, it will have to buffer all the marks (a workaround is to first convert your input to the stream-safe text format).
A normalizer is a stateful filter that inputs a sequence of characters and outputs an equivalent sequence in the requested normal form.
The function Uunf.create returns a new normalizer for a given normal form:
  let nfd = Uunf.create `NFD;;
To add characters to the sequence to normalize, call Uunf.add on nfd with `Uchar _. To end the sequence, call Uunf.add on nfd with `End. The normalized sequence of characters is returned, character by character, by the successive calls to Uunf.add.
The client and the normalizer must wait on each other to limit internal buffering: each time the client adds to the sequence by calling Uunf.add with `Uchar or `End it must continue to call Uunf.add with `Await until the normalizer returns `Await. In practice this leads to the following kind of control flow:
  let rec add acc v = match Uunf.add nfd v with
  | `Uchar u -> add (u :: acc) `Await
  | `Await -> acc
For example, to normalize the character U+00E9 (é) with nfd to a list of characters we can write:
  let e_acute_nfd = List.rev (add (add [] (`Uchar 0x00E9)) `End)
The next section has more examples.
utf_8_normalize nf s is the UTF-8 encoded normal form nf of the UTF-8 encoded string s. This example uses Uutf to fold over the characters of s and to encode the normalized sequence in a standard OCaml buffer.
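let utf_8_normalize nf s =
  (* Normalization can expand the text; reserve some extra room up front. *)
  let b = Buffer.create (String.length s * 3) in
  let n = Uunf.create nf in
  (* Give v to the normalizer and encode everything it returns into b. *)
  let rec add v = match Uunf.add n v with
  | `Uchar u -> Uutf.Buffer.add_utf_8 b u; add `Await
  | `Await -> ()
  in
  (* Malformed input is replaced by the replacement character Uutf.u_rep. *)
  let add_uchar _ _ = function
  | `Malformed _ -> add (`Uchar Uutf.u_rep)
  | `Uchar _ as u -> add u
  in
  Uutf.String.fold_utf_8 add_uchar () s; add `End; Buffer.contents b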