Utext

Unicode text for OCaml.

Utext provides a type for processing Unicode text.

See also Uucp and Pvec, and consult a minimal Unicode introduction. TODO: write a minimal Utext-specific introduction.

006ebe4 — Unicode version %%UNICODE_VERSION%% — homepage

References

The Unicode FAQ.
The Unicode Consortium. The Unicode Standard. (latest version)

Utext

unicode_version is the Unicode version supported by Utext.

The type for Unicode text, a persistent vector of Unicode characters.

empty is Pvec.empty, the empty Unicode text.

v ~len u is Pvec.v ~len u.

init ~len f is Pvec.init ~len f.

of_uchar u is Pvec.singleton u.

str s is Unicode text from the valid UTF-8 encoded bytes s.
Raises Invalid_argument if s is invalid UTF-8. Use Utext.of_utf_8 and Utext.try_of_utf_8 to deal with untrusted input.

strf fmt ... is Format.kasprintf (fun s -> str s) fmt ...
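The str and strf contracts can be illustrated with a stdlib-only sketch. It is not Utext's implementation: a list of Uchar.t stands in for Utext.t (an assumption, since the real type is a Pvec.t), and it relies on String.is_valid_utf_8 and String.get_utf_8_uchar, which need OCaml 4.14 or later.

```ocaml
(* Stand-in for [str]: decode trusted UTF-8 and raise on invalid input.
   The [Uchar.t list] result type is an assumption made for this sketch. *)
let str s =
  if not (String.is_valid_utf_8 s) then invalid_arg "invalid UTF-8" else
  let rec loop acc i =
    if i >= String.length s then List.rev acc else
    let d = String.get_utf_8_uchar s i in
    loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
  in
  loop [] 0

(* Format to a string first, then hand the result to [str]. *)
let strf fmt = Format.kasprintf (fun s -> str s) fmt
```

Because str raises on malformed bytes, it only suits literals and other trusted input.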

Predicates and comparison

is_empty t is true if t is empty. This is equal to Pvec.is_empty.

equal t0 t1 is true if the elements in each vector are equal. Warning. The test is textually meaningless unless t0 and t1 are known to be in a particular form, see e.g. Utext.canonical_caseless_key or normal forms.

FIXME. Should we provide a fool-proof equality that always compares in `NFD or `NFC? The problem is that since we are using raw Pvec.t we cannot cache.

compare t0 t1 is the per element lexicographical order between t0 and t1. Warning. The comparison is textually meaningless.
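To see why the per-element tests are textually meaningless, consider two canonically equivalent spellings of "é": the precomposed U+00E9 and the sequence <U+0065, U+0301>. The sketch below uses plain lists of Uchar.t as a stand-in for Utext.t (an assumption) and List.equal, which needs OCaml 4.12 or later.

```ocaml
(* Two canonically equivalent spellings of the same text. *)
let precomposed = [ Uchar.of_int 0x00E9 ]
let decomposed = [ Uchar.of_int 0x0065; Uchar.of_int 0x0301 ]

(* Element-wise tests, in the spirit of [equal] and [compare]. *)
let elementwise_equal a b = List.equal Uchar.equal a b
let elementwise_compare a b = List.compare Uchar.compare a b

(* The texts read the same, yet the element-wise test says they differ. *)
let same = elementwise_equal precomposed decomposed
```

Here same is false, which is why both texts should first be brought to a common normal form or key.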

Case mapping and folding

For more information about case see the Unicode case mapping FAQ and the case mapping charts. Note that these algorithms are insensitive to language and context and may produce sub-par results for some users.

lowercase t is t lowercased according to Unicode's default case conversion.

uppercase t is t uppercased according to Unicode's default case conversion.

capitalized t is t capitalized: if the first character of t is cased it is mapped to its title case mapping; otherwise t is left unchanged.

uncapitalized t is t uncapitalized: if the first character of t is cased it is mapped to its lowercase case mapping; otherwise t is left unchanged.

Case insensitive equality

Testing the equality of two Unicode texts in a case insensitive manner requires a fair amount of data massaging that includes normalization and case folding. These results should be cached if many comparisons have to be made on the same text. The following functions return keys for a given text that can be used to test equality against other keys. Do not compare keys generated by different functions; the comparison would be meaningless. See also Utext.identifier_caseless_key.

casefold t is t casefolded according to Unicode's default casefold. This can be used to implement various forms of caseless equalities. equal (casefolded t0) (casefolded t1) determines default case equality (TUS D144) of t0 and t1. Warning. In general this notion is not good enough; use one of the following functions.

canonical_caseless_key t is a key such that equal (canonical_caseless_key t0) (canonical_caseless_key t1) determines canonical caseless equality (TUS D145) of t0 and t1.

compatibility_caseless_key t is a key such that equal (compatibility_caseless_key t0) (compatibility_caseless_key t1) determines compatibility caseless equality (TUS D146) of t0 and t1.

Unicode identifiers

is_identifier t is true iff t is a Default Unicode identifier, more precisely this is UAX31-R1.

identifier_caseless_key t is a key such that equal (identifier_caseless_key t0) (identifier_caseless_key t1) determines identifier caseless equality (TUS D147) of t0 and t1.

Breaking lines and paragraphs

These functions break text like a simple readline function would. If you are looking for line breaks to lay out text, see line break segmentation.

The type for specifying newlines.

`ASCII newlines occur after a CR (U+000D), LF (U+000A) or CRLF (<U+000D, U+000A>).
`NLF newlines occur after the Unicode newline function, this is `ASCII along with NEL (U+0085).
`Readline newlines are determined as for a Unicode readline function (R4), this is `NLF along with FF (U+000C), LS (U+2028) or PS (U+2029).

lines ~drop_empty ~newline t breaks t into subtexts separated by newlines determined according to newline (defaults to `Readline). Separators are not part of the result and are lost. If drop_empty is true (defaults to false), lines that are empty are dropped.
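The newline kinds can be sketched on a plain list of code points. This is a stdlib-only illustration, not Utext's code: the int list representation and the treatment of a trailing newline (it yields a final empty line) are assumptions.

```ocaml
type newline = [ `ASCII | `NLF | `Readline ]

(* Is code point [c] a single-character newline for [nl]? *)
let is_newline (nl : newline) c =
  match c with
  | 0x000A | 0x000D -> true
  | 0x0085 -> nl <> `ASCII
  | 0x000C | 0x2028 | 0x2029 -> nl = `Readline
  | _ -> false

(* Break a list of code points into lines; separators are dropped. *)
let lines ?(drop_empty = false) ?(newline = `Readline) cps =
  let keep line acc =
    if drop_empty && line = [] then acc else List.rev line :: acc
  in
  let rec loop line acc = function
    | [] -> List.rev (keep line acc)
    (* CRLF counts as one newline, so it is matched before a lone CR. *)
    | 0x000D :: 0x000A :: rest -> loop [] (keep line acc) rest
    | c :: rest when is_newline newline c -> loop [] (keep line acc) rest
    | c :: rest -> loop (c :: line) acc rest
  in
  loop [] [] cps
```

For example, lines [0x61; 0x0D; 0x0A; 0x62] yields the two lines [0x61] and [0x62], with the CRLF pair consumed as a single separator.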

paragraphs ~drop_empty ~newline t breaks t into subtexts separated either by two consecutive newlines (determined as `NLF or LS (U+2028)) or a single PS (U+2029). Separators are not part of the result and are lost. If drop_empty is true (defaults to false), paragraphs that are empty are dropped.

Normalization

The type for normalization forms.

`NFD normalization form D, canonical decomposition.
`NFC normalization form C, canonical decomposition followed by canonical composition.
`NFKD normalization form KD, compatibility decomposition.
`NFKC normalization form KC, compatibility decomposition, followed by canonical composition.

normalized nf t is t normalized to nf.

is_normalized nf t is true iff t is in normalization form nf.

Segmentation

The type for boundaries.

`Grapheme_cluster determines extended grapheme cluster boundaries according to UAX 29 (corresponds, for most scripts, to user-perceived characters).
`Word determines word boundaries according to UAX 29.
`Sentence determines sentence boundaries according to UAX 29.
`Line_break determines mandatory line breaks and line break opportunities according to UAX 14.

segments b t are the segments of text t delimited by two boundaries of type b.

segment_count b t is Pvec.length (segments b t).

Boundary positions

The type for positions. The positions of a vector v of length l range over [0;l]. They are the slits before each element and after the last one. They are labelled from left to right by increasing number. The ith index is between positions i and i+1.

positions  0   1   2   3   4    l-1    l
           +---+---+---+---+     +-----+
  indices  | 0 | 1 | 2 | 3 | ... | l-1 |
           +---+---+---+---+     +-----+

boundaries b t are the positions of boundaries b in t.
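The relation between positions and segments can be sketched with arrays: a segment is the slice between two consecutive boundary positions. This is a hypothetical helper, not part of Utext, and it assumes the position list is increasing and includes the start and end positions.

```ocaml
(* Slice [a] between consecutive boundary positions. The element at
   index i lies between positions i and i+1, so positions p0 and p1
   delimit the p1 - p0 elements starting at index p0. *)
let segments_of_boundaries a positions =
  let rec loop acc = function
    | p0 :: (p1 :: _ as rest) -> loop (Array.sub a p0 (p1 - p0) :: acc) rest
    | _ -> List.rev acc
  in
  loop [] positions
```

With a vector of length 5 and boundaries at positions 0, 2 and 5, this yields one segment of two elements followed by one of three.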

boundaries_mandatory is like Utext.boundaries but returns the mandatory status of a boundary if the kind of boundary sports that notion (or always true if not).

Escaping and unescaping

escaped t is t, except that characters whose general category is Control, U+0022 and U+005C are escaped according to OCaml's lexical conventions for strings, with:

Any U+0008 ('\b') escaped to the sequence <U+005C, U+0062> ("\\b")
Any U+0009 ('\t') escaped to the sequence <U+005C, U+0074> ("\\t")
Any U+000A ('\n') escaped to the sequence <U+005C, U+006E> ("\\n")
Any U+000D ('\r') escaped to the sequence <U+005C, U+0072> ("\\r")
Any U+0022 ('\"') escaped to the sequence <U+005C, U+0022> ("\\\"")
Any U+005C ('\\') escaped to the sequence <U+005C, U+005C> ("\\\\")
Any other of these characters is escaped by a hexadecimal "\u{H+}" escape, with H an uppercase hexadecimal digit.

Note. As far as OCaml is concerned \u{H+} escapes are only supported from 4.06 on.
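The escaping rules above can be sketched with the stdlib alone. This is not Utext's implementation: the input is a list of Uchar.t and the output a UTF-8 string (both assumptions), and the number of digits emitted in \u{H+} escapes (no zero padding here) is also an assumption. The general category Control covers U+0000 to U+001F and U+007F to U+009F.

```ocaml
(* Add the escaped form of [u] to buffer [b]. *)
let escape_uchar b u =
  match Uchar.to_int u with
  | 0x0008 -> Buffer.add_string b "\\b"
  | 0x0009 -> Buffer.add_string b "\\t"
  | 0x000A -> Buffer.add_string b "\\n"
  | 0x000D -> Buffer.add_string b "\\r"
  | 0x0022 -> Buffer.add_string b "\\\""
  | 0x005C -> Buffer.add_string b "\\\\"
  | c when c <= 0x001F || (0x007F <= c && c <= 0x009F) ->
      (* Remaining Control characters get an uppercase hex escape. *)
      Buffer.add_string b (Printf.sprintf "\\u{%X}" c)
  | _ -> Buffer.add_utf_8_uchar b u

(* Escape a list of characters to a UTF-8 encoded string. *)
let escaped_utf_8 us =
  let b = Buffer.create 16 in
  List.iter (escape_uchar b) us;
  Buffer.contents b
```

Characters outside the escaped set, including non-ASCII ones, pass through unchanged.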

unescaped s unescapes what Utext.escaped did and any other valid \u{H+} escape. The hexadecimal digits H of Unicode hex escapes, at most six, can be upper, lower, or mixed case. Any escape that is truncated or not defined by Utext.escaped makes the function return Error idx, with idx the start index of the offending escape.

The invariant unescaped (escaped t) = Ok t holds.

Decoding and encoding

encoding_guess s is the encoding guessed for s coupled with true iff there's an initial BOM.

Best-effort decoding

Warning. The following are best-effort decodes in which any UTF-X decoding error is replaced by at least one replacement character Uchar.u_rep.

of_utf_8 ~first ~last s is the Unicode text that results from best-effort UTF-8 decoding the bytes of s that exist in the range [first;last]. first defaults to 0 and last to length s - 1.
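Best-effort decoding can be sketched with the stdlib decoder (OCaml 4.14 or later). The function name decode_utf_8_lossy and the Uchar.t list result are assumptions made for this illustration, and the stdlib replacement character is Uchar.rep (U+FFFD). How many replacement characters a given malformed sequence produces follows the stdlib decoder here and may differ from Utext.

```ocaml
(* Decode the bytes of [s] in the range [first;last], replacing each
   decoding error by U+FFFD. *)
let decode_utf_8_lossy ?(first = 0) ?last s =
  let last = match last with None -> String.length s - 1 | Some l -> l in
  (* Restrict decoding to the requested byte range. *)
  let s = String.sub s first (last - first + 1) in
  let rec loop acc i =
    if i >= String.length s then List.rev acc else
    let d = String.get_utf_8_uchar s i in
    (* On error [utf_decode_uchar] is [Uchar.rep] and the length is the
       number of bytes to skip. *)
    loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
  in
  loop [] 0
```

Unlike a strict decoder this never fails on malformed bytes, which makes it suitable for untrusted input when losing the offending bytes is acceptable.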

of_utf_16le ~first ~last s is like Utext.of_utf_8 but decodes UTF-16LE.

of_utf_16be ~first ~last s is like Utext.of_utf_8 but decodes UTF-16BE.

Decoding with error handling

The type for decode result. This is:

Ok t if no decoding error occurred.
Error (t, err, restart) if a decoding error occurred. t is the text decoded until the error, err the byte index where the decode error occurred and restart a valid byte index where a new best-effort decode could be restarted (if any).

try_of_utf_8 is like Utext.of_utf_8 except in case of error Error _ is returned as described in decode_result.
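The shape of the error result can be sketched with the stdlib decoder (OCaml 4.14 or later). The decode_result type below, the use of int option for the restart index, and the name try_decode_utf_8 are assumptions made for this illustration, not Utext's definitions.

```ocaml
(* Assumed result shape: text decoded so far, error index, restart index. *)
type decode_result = (Uchar.t list, Uchar.t list * int * int option) result

(* Decode [s] as UTF-8, stopping at the first decoding error. *)
let try_decode_utf_8 s : decode_result =
  let len = String.length s in
  let rec loop acc i =
    if i >= len then Ok (List.rev acc) else
    let d = String.get_utf_8_uchar s i in
    let next = i + Uchar.utf_decode_length d in
    if Uchar.utf_decode_is_valid d
    then loop (Uchar.utf_decode_uchar d :: acc) next
    else Error (List.rev acc, i, if next < len then Some next else None)
  in
  loop [] 0
```

A caller can report the error at the returned index and, if a restart index is given, resume with a best-effort decode from there.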

try_of_utf_16be is like Utext.try_of_utf_8 but decodes UTF-16BE.

Encoding

Warning. All these functions raise Invalid_argument if the result cannot fit in the limits of Sys.max_string_length.

to_utf_8 t is the UTF-8 encoding of t.

to_utf_16le t is the UTF-16LE encoding of t.

to_utf_16be t is the UTF-16BE encoding of t.

buffer_add_utf_8 b t adds the UTF-8 encoding of t to b.

buffer_add_utf_16le b t adds the UTF-16LE encoding of t to b.

buffer_add_utf_16be b t adds the UTF-16BE encoding of t to b.
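The encoders can be approximated with the stdlib Buffer functions (OCaml 4.06 or later). This sketch works on a list of Uchar.t rather than a Utext.t, which is an assumption; the helper name encode is hypothetical.

```ocaml
(* Encode a list of characters with the given buffer adder. *)
let encode add us =
  let b = Buffer.create 16 in
  List.iter (add b) us;
  Buffer.contents b

let to_utf_8 = encode Buffer.add_utf_8_uchar
let to_utf_16le = encode Buffer.add_utf_16le_uchar
let to_utf_16be = encode Buffer.add_utf_16be_uchar
```

Characters beyond U+FFFF take four bytes in UTF-8 and a surrogate pair in both UTF-16 encodings.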

Pretty-printing

pp ppf t prints the UTF-8 encoding of t instructing the ppf to use a length of 1 for each grapheme cluster of t.

pp_text ppf t is like Utext.pp except each line break is hinted to the formatter, see Uuseg_string.pp_utf_8_text for details.

pp_lines ppf t is like Utext.pp except only mandatory line breaks are hinted to the formatter, see Uuseg_string.pp_utf_8_lines for details.

dump_uchars ppf t formats t as a sequence of OCaml Uchar.t values using only US-ASCII encoded characters according to the Unicode notational convention for code points.
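The Unicode notational convention writes a code point as U+ followed by at least four uppercase hexadecimal digits. A stdlib-only sketch of such a dump follows; the single-space separator and the function names are assumptions, and the exact output of dump_uchars may differ.

```ocaml
(* Format one character as U+XXXX, with at least four hex digits. *)
let pp_uchar ppf u = Format.fprintf ppf "U+%04X" (Uchar.to_int u)

(* Format a list of characters separated by single spaces. *)
let dump ppf us =
  let pp_sep ppf () = Format.pp_print_char ppf ' ' in
  Format.pp_print_list ~pp_sep pp_uchar ppf us
```

The output is pure US-ASCII whatever the input, which makes it safe to log on terminals with unknown encodings.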

pp_toplevel ppf t formats t using Utext.escaped and Utext.pp in a manner suitable for the toplevel to represent a Utext.t value.

Warning. Before OCaml 4.06 the result might not be cut and pastable as \u{H+} escapes are not supported.
