Module Utext

module Utext: sig .. end
Unicode text for OCaml.

Utext provides a type for processing Unicode text.

See also Uucp and Pvec, and consult a minimal Unicode introduction. TODO make a minimal Utext-specific intro.

006ebe4 — Unicode version %%UNICODE_VERSION%% — homepage



val unicode_version : string
unicode_version is the Unicode version supported by Utext.
type t = Uchar.t Pvec.t 
The type for Unicode text, a persistent vector of Unicode characters.
val empty : t
empty is Pvec.empty, the empty Unicode text.
val v : len:int -> Uchar.t -> t
v ~len u is Pvec.v ~len u.
val init : len:int -> (int -> Uchar.t) -> t
init ~len f is Pvec.init ~len f.
val of_uchar : Uchar.t -> t
of_uchar u is Pvec.singleton u.
val str : string -> t
str s is Unicode text from the valid UTF-8 encoded bytes s.
Raises Invalid_argument if s is invalid UTF-8. Use Utext.of_utf_8 or Utext.try_of_utf_8 to deal with untrusted input.
val strf : ('a, Format.formatter, unit, t) Pervasives.format4 -> 'a
strf fmt ... is Format.kasprintf (fun s -> str s) fmt ...

Predicates and comparison

See also Pvec's predicates and comparisons.

val is_empty : t -> bool
is_empty t is true if t is empty. This is equal to Pvec.is_empty.
val equal : t -> t -> bool
equal t0 t1 is true if the elements in each vector are equal. Warning. The test is textually meaningless unless t0 and t1 are known to be in a particular form, see e.g. Utext.canonical_caseless_key or normal forms.

FIXME. Should we provide a fool-proof equality that always compares in `NFD or `NFC ? Problem is that since we are using raw Pvec.t we cannot cache.

val compare : t -> t -> int
compare t0 t1 is the per element lexicographical order between t0 and t1. Warning. The comparison is textually meaningless.
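
The warnings above can be illustrated without Utext itself. The following is a minimal sketch that assumes a plain list of Uchar.t as a stand-in for Utext.t (the real type is Uchar.t Pvec.t); equal_uchars is a hypothetical helper, not part of the library. It shows two encodings of the same text, "é", that a per-element test reports as different.

```ocaml
(* Stand-in for Utext.t: a plain list of Uchar.t. *)
let precomposed = [ Uchar.of_int 0x00E9 ]
(* U+0065 followed by U+0301 (combining acute accent). *)
let decomposed = [ Uchar.of_int 0x0065; Uchar.of_int 0x0301 ]

(* Hypothetical per-element equality, analogous to what equal is
   documented to do on vectors. *)
let equal_uchars a b =
  List.length a = List.length b && List.for_all2 Uchar.equal a b

let () =
  (* The two texts are canonically equivalent but not equal per element. *)
  Printf.printf "per-element equal: %b\n" (equal_uchars precomposed decomposed)
```

Normalizing both texts to the same form before comparing is what makes the test textually meaningful.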

Case mapping and folding

For more information about case see the Unicode case mapping FAQ and the case mapping charts. Note that these algorithms are insensitive to language and context and may produce sub-par results for some users.

val lowercased : t -> t
lowercased t is t lowercased according to Unicode's default case conversion.
val uppercased : t -> t
uppercased t is t uppercased according to Unicode's default case conversion.
val capitalized : t -> t
capitalized t is t capitalized: if the first character of t is cased it is mapped to its title case mapping; otherwise t is left unchanged.
val uncapitalized : t -> t
uncapitalized t is t uncapitalized: if the first character of t is cased it is mapped to its lowercase case mapping; otherwise t is left unchanged.

Case insensitive equality

Testing the equality of two Unicode texts in a case insensitive manner requires a fair amount of data massaging that includes normalization and case folding. These results should be cached if many comparisons have to be made on the same text. The following functions return keys for a given text that can be used to test equality against other keys. Do not compare keys generated by different functions; the comparison would be meaningless. See also Utext.identifier_caseless_key.

val casefolded : t -> t
casefolded t is t casefolded according to Unicode's default casefold. This can be used to implement various forms of caseless equalities. equal (casefolded t0) (casefolded t1) determines default case equality (TUS D144) of t0 and t1. Warning. In general this notion is not good enough; use one of the following functions.
val canonical_caseless_key : t -> t
canonical_caseless_key t is a key such that equal (canonical_caseless_key t0) (canonical_caseless_key t1) determines canonical caseless equality (TUS D145) of t0 and t1.
val compatibility_caseless_key : t -> t
compatibility_caseless_key t is a key such that equal (compatibility_caseless_key t0) (compatibility_caseless_key t1) determines compatibility caseless equality (TUS D146) of t0 and t1.

Unicode identifiers

For more information see UAX 31 Unicode Identifier and Pattern Syntax.

val is_identifier : t -> bool
is_identifier t is true iff t is a Default Unicode identifier, more precisely this is UAX31-R1.
val identifier_caseless_key : t -> t
identifier_caseless_key t is a key such that equal (identifier_caseless_key t0) (identifier_caseless_key t1) determines identifier caseless equality (TUS D147) of t0 and t1.

Breaking lines and paragraphs

These functions break text like a simple readline function would. If you are looking for line breaks to layout text, see line break segmentation.

type newline = [ `ASCII | `NLF | `Readline ] 
The type for specifying newlines.
val lines : ?drop_empty:bool -> ?newline:newline -> t -> t Pvec.t
lines ~drop_empty ~newline t breaks t into subtexts separated by newlines determined according to newline (defaults to `Readline). Separators are not part of the result and are lost. If drop_empty is true (defaults to false), lines that are empty are dropped.
val paragraphs : ?drop_empty:bool -> t -> t Pvec.t
paragraphs ~drop_empty t breaks t into subtexts separated either by two consecutive newlines (determined as `NLF or LS (U+2028)) or a single PS (U+2029). Separators are not part of the result and are lost. If drop_empty is true (defaults to false), paragraphs that are empty are dropped.
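
To make the role of separators and of drop_empty concrete, here is a rough sketch over a list of Uchar.t used as a stand-in for Utext.t. It assumes a single separator, U+000A, which is a simplification and does not implement any of the documented newline modes; split_lines is a hypothetical name.

```ocaml
(* Simplifying assumption: the only separator is U+000A (LF). *)
let lf = Uchar.of_int 0x000A

(* Split a list of characters at each LF. Separators are not kept.
   With ~drop_empty:true, empty lines are removed from the result. *)
let split_lines ?(drop_empty = false) us =
  let flush line acc =
    let line = List.rev line in
    if drop_empty && line = [] then acc else line :: acc
  in
  let rec loop line acc = function
    | [] -> List.rev (flush line acc)
    | u :: us when Uchar.equal u lf -> loop [] (flush line acc) us
    | u :: us -> loop (u :: line) acc us
  in
  loop [] [] us
```

With the input a, LF, LF, b this yields three lines (the middle one empty) by default and two lines with ~drop_empty:true.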


Normalization

For more information on normalization consult a short introduction, the UAX #15 Unicode Normalization Forms and normalization charts.

type normalization = [ `NFC | `NFD | `NFKC | `NFKD ] 
The type for normalization forms.
val normalized : normalization -> t -> t
normalized nf t is t normalized to nf.
val is_normalized : normalization -> t -> bool
is_normalized nf t is true iff t is in normalization form nf.


Segmentation

For more information consult the UAX #29 Unicode Text Segmentation, the UAX #14 Unicode Line Breaking Algorithm and the web based ICU break utility.

type boundary = [ `Grapheme_cluster | `Line_break | `Sentence | `Word ] 
The type for boundaries.
val segments : boundary -> t -> t Pvec.t
segments b t are the segments of text t delimited by two boundaries of type b.
val segment_count : boundary -> t -> int
segment_count b t is Pvec.length (segments b t).

Boundary positions

type pos = int 
The type for positions. The positions of a vector v of length l range over [0;l]. They are the slits before each element and after the last one. They are labelled from left to right by increasing number. The ith index is between positions i and i+1.
positions  0   1   2   3   4    l-1    l
           +---+---+---+---+     +-----+
  indices  | 0 | 1 | 2 | 3 | ... | l-1 |
           +---+---+---+---+     +-----+
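
The relation between positions and indices pictured above can be sketched with a plain array as a stand-in for a vector. The helpers positions, between and cut are hypothetical names introduced for illustration only; they are not part of Utext or Pvec.

```ocaml
(* Positions range over [0;l]: a vector of length l has l + 1 positions. *)
let positions a = List.init (Array.length a + 1) (fun p -> p)

(* Elements lying between positions p0 and p1, with 0 <= p0 <= p1 <= l.
   Position p0 sits just before index p0, position p1 just after index p1 - 1. *)
let between a p0 p1 = Array.sub a p0 (p1 - p0)

(* Cut a vector at an increasing list of positions, for example positions
   such as those a boundary function returns. *)
let cut a ps =
  let rec loop acc = function
    | p0 :: (p1 :: _ as rest) -> loop (between a p0 p1 :: acc) rest
    | _ -> List.rev acc
  in
  loop [] ps
```

Because positions are slits rather than elements, two consecutive positions always delimit a well-defined, possibly empty, range of indices.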

val boundaries : boundary -> t -> pos Pvec.t
boundaries b t are the positions of boundaries b in t.
val boundaries_mandatory : boundary -> t -> (pos * bool) Pvec.t
boundaries_mandatory is like Utext.boundaries but returns the mandatory status of a boundary if the kind of boundary sports that notion (or always true if not).

Escaping and unescaping

val escaped : t -> t
escaped t is t except that characters whose general category is Control, as well as U+0022 and U+005C, are escaped according to OCaml's lexical conventions for strings.

Note. As far as OCaml is concerned \u{H+} escapes are only supported from 4.06 on.

val unescaped : t -> (t, int) Pervasives.result
unescaped t unescapes what Utext.escaped did and any other valid \u{H+} escape. The hexadecimal digits H of Unicode hex escapes, at most six, can be upper, lower, or mixed case. Any escape that is truncated or not defined by Utext.escaped makes the function return Error idx, with idx the start index of the offending escape.

The invariant unescaped (escaped t) = Ok t holds.

Decoding and encoding

val encoding_guess : string -> [ `UTF_16BE | `UTF_16LE | `UTF_8 ] * bool
encoding_guess s is the encoding guessed for s coupled with true iff there's an initial BOM.
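
The BOM half of such a guess can be sketched with the standard library alone. This is only an illustration of BOM sniffing under the usual byte sequences (EF BB BF, FE FF, FF FE); it says nothing about how encoding_guess decides when no BOM is present, and bom_encoding is a hypothetical name.

```ocaml
(* Returns the encoding suggested by an initial BOM, if there is one. *)
let bom_encoding s =
  let starts_with p =
    String.length s >= String.length p
    && String.sub s 0 (String.length p) = p
  in
  if starts_with "\xEF\xBB\xBF" then Some `UTF_8
  else if starts_with "\xFE\xFF" then Some `UTF_16BE
  else if starts_with "\xFF\xFE" then Some `UTF_16LE
  else None
```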

Best-effort decoding

Warning. The following are best-effort decodes in which any UTF-X decoding error is replaced by at least one replacement character Uchar.u_rep.

val of_utf_8 : ?first:int -> ?last:int -> string -> t
of_utf_8 ~first ~last s is the Unicode text that results from best-effort UTF-8 decoding of the bytes of s in the range [first;last]. first defaults to 0 and last to length s - 1.
val of_utf_16le : ?first:int -> ?last:int -> string -> t
of_utf_16le ~first ~last s is like Utext.of_utf_8 but decodes UTF-16LE.
val of_utf_16be : ?first:int -> ?last:int -> string -> t
of_utf_16be ~first ~last s is like Utext.of_utf_8 but decodes UTF-16BE.
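
For readers who want to see what best-effort decoding amounts to, here is a minimal sketch using only the OCaml standard library (it assumes OCaml 4.14 or later for String.get_utf_8_uchar). It returns a list of Uchar.t as a stand-in for Utext.t and uses the standard library's Uchar.rep as the replacement character; uchars_of_utf_8 is a hypothetical name and this is not how Utext is implemented.

```ocaml
(* Best-effort UTF-8 decoding of the bytes of s in the range [first;last].
   Invalid byte sequences are replaced by Uchar.rep (U+FFFD). *)
let uchars_of_utf_8 ?(first = 0) ?last s =
  let last = match last with None -> String.length s - 1 | Some l -> l in
  (* Restrict to the range so that a sequence truncated at last is an error. *)
  let sub = String.sub s first (last - first + 1) in
  let rec loop acc i =
    if i >= String.length sub then List.rev acc else
    let d = String.get_utf_8_uchar sub i in
    (* On an invalid decode, utf_decode_uchar is Uchar.rep and
       utf_decode_length is the number of bytes to skip. *)
    loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
  in
  loop [] 0
```

For example, decoding the bytes "a\xFFb" yields U+0061, U+FFFD, U+0062.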

Decoding with error handling

type decode = (t, t * int * int option) Pervasives.result 
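The type for decode results: either Ok with the decoded text, or Error with a value of type t * int * int option describing the decoding error.
val try_of_utf_8 : ?first:int -> ?last:int -> string -> decode
try_of_utf_8 is like Utext.of_utf_8 except that in case of error Error _ is returned as described in Utext.decode.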
val try_of_utf_16le : ?first:int -> ?last:int -> string -> decode
try_of_utf_16le is like Utext.try_of_utf_8 but decodes UTF-16LE.
val try_of_utf_16be : ?first:int -> ?last:int -> string -> decode
try_of_utf_16be is like Utext.try_of_utf_8 but decodes UTF-16BE.
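
The difference with best-effort decoding can be sketched as follows, again with the standard library only (OCaml 4.14 or later). Note the assumptions: the error value here is a hypothetical pair (characters decoded so far, index of the first invalid byte), which is simpler than and not the same as the payload of Utext.decode, and try_uchars_of_utf_8 is a made-up name.

```ocaml
(* Strict UTF-8 decoding: stops at the first invalid byte sequence. *)
let try_uchars_of_utf_8 s =
  let rec loop acc i =
    if i >= String.length s then Ok (List.rev acc) else
    let d = String.get_utf_8_uchar s i in
    if Uchar.utf_decode_is_valid d
    then loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
    else Error (List.rev acc, i) (* hypothetical error shape *)
  in
  loop [] 0
```

Returning the partial result together with an index lets the caller report the error location or resume decoding with a different strategy.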


Encoding

Warning. All these functions raise Invalid_argument if the result cannot fit in the limits of Sys.max_string_length.

val to_utf_8 : t -> string
to_utf_8 t is the UTF-8 encoding of t.
val to_utf_16le : t -> string
to_utf_16le t is the UTF-16LE encoding of t.
val to_utf_16be : t -> string
to_utf_16be t is the UTF-16BE encoding of t.
val buffer_add_utf_8 : Buffer.t -> t -> unit
buffer_add_utf_8 b t adds the UTF-8 encoding of t to b.
val buffer_add_utf_16le : Buffer.t -> t -> unit
buffer_add_utf_16le b t adds the UTF-16LE encoding of t to b.
val buffer_add_utf_16be : Buffer.t -> t -> unit
buffer_add_utf_16be b t adds the UTF-16BE encoding of t to b.
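
As an illustration of what these encoders compute, the sketch below encodes a list of Uchar.t (a stand-in for Utext.t) using the standard library's Buffer functions, available since OCaml 4.06. The names encode, utf_8_of_uchars, utf_16le_of_uchars and utf_16be_of_uchars are hypothetical and only mirror the shape of the functions documented above.

```ocaml
(* Encode each character into a buffer with the given adder. *)
let encode add us =
  let b = Buffer.create 16 in
  List.iter (add b) us;
  Buffer.contents b

let utf_8_of_uchars us = encode Buffer.add_utf_8_uchar us
let utf_16le_of_uchars us = encode Buffer.add_utf_16le_uchar us
let utf_16be_of_uchars us = encode Buffer.add_utf_16be_uchar us
```

Going through a Buffer.t, as the buffer_add_* functions do, avoids building intermediate strings when several texts are written to the same output.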


Pretty-printing

val pp : Format.formatter -> t -> unit
pp ppf t prints the UTF-8 encoding of t instructing the ppf to use a length of 1 for each grapheme cluster of t.
val pp_text : Format.formatter -> t -> unit
pp_text ppf t is like Utext.pp except each line break is hinted to the formatter, see Uuseg_string.pp_utf_8_text for details.
val pp_lines : Format.formatter -> t -> unit
pp_lines ppf t is like Utext.pp except only mandatory line breaks are hinted to the formatter, see Uuseg_string.pp_utf_8_lines for details.
val pp_uchars : Format.formatter -> t -> unit
pp_uchars ppf t formats t as a sequence of OCaml Uchar.t values using only US-ASCII encoded characters according to the Unicode notational convention for code points.
val pp_toplevel : Format.formatter -> t -> unit
pp_toplevel ppf t formats t using Utext.escaped and Utext.pp in a manner suitable for the toplevel to represent a Utext.t value.

Warning. Before OCaml 4.06 the result might not be cut and pastable as \u{H+} escapes are not supported.

val pp_toplevel_pvec : Format.formatter -> t Pvec.t -> unit
pp_toplevel_pvec ppf ts formats ts using Utext.pp_toplevel.