module Utext : sig .. end
Utext provides a type for processing Unicode text.
See also Uucp and Pvec and consult a
minimal Unicode introduction.
TODO make a minimal Utext-specific intro.
006ebe4 — Unicode version %%UNICODE_VERSION%% — homepage
val unicode_version : string
unicode_version is the Unicode version supported by Utext.
type t = Uchar.t Pvec.t
val empty : t
empty is Pvec.empty, the empty Unicode text.
val v : len:int -> Uchar.t -> t
v ~len u is Pvec.v ~len u.
val init : len:int -> (int -> Uchar.t) -> t
init ~len f is Pvec.init ~len f.
val of_uchar : Uchar.t -> t
of_uchar u is Pvec.singleton u.
val str : string -> t
str s is Unicode text from the valid UTF-8 encoded bytes s.
Raises Invalid_argument if s is invalid UTF-8; use
Utext.of_utf_8 and Utext.try_of_utf_8 to deal with untrusted input.
val strf : ('a, Format.formatter, unit, t) Pervasives.format4 -> 'a
strf fmt ... is Format.kasprintf (fun s -> str s) fmt ...
See also Pvec's predicates and comparisons.
val is_empty : t -> bool
is_empty t is true if t is empty; this is equal to
Pvec.is_empty.
val equal : t -> t -> bool
equal t0 t1 is true if the elements in each vector are
equal. Warning. The test is textually meaningless unless
t0 and t1 are known to be in a particular form, see e.g.
Utext.canonical_caseless_key or normal forms.
FIXME. Should we provide a fool-proof equality that
always compares in `NFD or `NFC ? Problem is that
since we are using raw Pvec.t we cannot cache.
val compare : t -> t -> int
compare t0 t1 is the per element lexicographical order between
t0 and t1. Warning. The comparison is textually
meaningless.
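To see why per element tests are textually meaningless, here is a minimal stdlib-only sketch. It does not use Utext: plain Uchar.t lists stand in for Utext.t, and equal_elements and compare_elements are hypothetical names that mimic what equal and compare are described to do. It assumes OCaml 4.12 or later for List.equal and List.compare.

```ocaml
(* Two code point sequences for the same user-perceived text "é". *)
let precomposed = [ Uchar.of_int 0x00E9 ]                     (* U+00E9 *)
let decomposed = [ Uchar.of_int 0x0065; Uchar.of_int 0x0301 ] (* e, combining acute *)

(* Per element equality, in the spirit of [equal] (hypothetical name). *)
let equal_elements t0 t1 = List.equal Uchar.equal t0 t1

(* Per element lexicographic order, in the spirit of [compare]
   (hypothetical name). *)
let compare_elements t0 t1 = List.compare Uchar.compare t0 t1

(* The two sequences are canonically equivalent, yet they are neither
   equal nor ordered in any textually useful way. *)
let same = equal_elements precomposed decomposed
let order = compare_elements precomposed decomposed
```

Here same is false and order is positive, simply because U+00E9 is greater than U+0065. Normalizing both texts to the same form before comparing removes this particular discrepancy.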
For more information about case see the
Unicode
case mapping FAQ and the
case mapping charts. Note
that these algorithms are insensitive to language and context and
may produce sub-par results for some users.
val lowercased : t -> t
lowercased t is t lowercased according to Unicode's default case
conversion.
val uppercased : t -> t
uppercased t is t uppercased according to Unicode's default case
conversion.
val capitalized : t -> t
capitalized t is t capitalized: if the first character of t
is cased it is mapped to its
title case mapping; otherwise t is
left unchanged.
val uncapitalized : t -> t
uncapitalized t is t uncapitalized: if the first character of
t is cased it is mapped to its
lowercase case mapping; otherwise t
is left unchanged.
Testing the equality of two Unicode texts in a case insensitive
manner requires a fair amount of data massaging that includes
normalization and case folding. These results
should be cached if many comparisons have to be made on the same
text. The following functions return keys for a given text that
can be used to test equality against other keys. Do not test
keys generated by different functions against each other; the
comparison would be meaningless. See also Utext.identifier_caseless_key.
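The caching advice can be sketched as follows. This is only an illustration of the pattern, not Utext code: key_of is a hypothetical stand-in that folds ASCII letters only and is not a Unicode caseless key, and strings stand in for Utext.t values. In real code the key function would be one of the key functions of this module.

```ocaml
(* Stand-in key function: ASCII only, NOT a Unicode caseless key. *)
let key_of (s : string) : string = String.lowercase_ascii s

(* Cache from text to key so that each text is processed only once. *)
let cache : (string, string) Hashtbl.t = Hashtbl.create 16

let cached_key s =
  match Hashtbl.find_opt cache s with
  | Some k -> k
  | None ->
      let k = key_of s in
      Hashtbl.add cache s k;
      k

(* Both keys come from the same key function, so comparing them is sound. *)
let caseless_equal s0 s1 = String.equal (cached_key s0) (cached_key s1)
```

Keeping one cache per key function is a simple way to avoid mixing keys produced by different functions.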
val casefolded : t -> t
casefolded t is t casefolded according to Unicode's default
casefold. This can be used to implement various forms of caseless
equalities. equal (casefolded t0) (casefolded t1) determines
default caseless equality
(TUS
D144) of t0 and t1. Warning. In general this notion is
not good enough; use one of the following functions.
val canonical_caseless_key : t -> t
canonical_caseless_key t is a key such that
equal (canonical_caseless_key t0) (canonical_caseless_key t1)
determines canonical caseless
equality (TUS
D145) of t0 and t1.
val compatibility_caseless_key : t -> t
compatibility_caseless_key t is a key such that
equal (compatibility_caseless_key t0) (compatibility_caseless_key t1)
determines compatibility caseless
equality (TUS
D146) of t0 and t1.
For more information see UAX 31
Unicode Identifier and Pattern Syntax.
val is_identifier : t -> bool
val identifier_caseless_key : t -> t
identifier_caseless_key t is a key such that
equal (identifier_caseless_key t0) (identifier_caseless_key t1)
determines identifier caseless
equality (TUS
D147) of t0 and t1.
These functions break text like a simple readline function
would. If you are looking for line breaks to lay out text, see
line break segmentation.
type newline = [ `ASCII | `NLF | `Readline ]
`ASCII newlines occur after a CR (U+000D), LF (U+000A) or
CRLF (<U+000D, U+000A>).
`NLF newlines occur after the
Unicode newline function, this
is `ASCII along with NEL (U+0085).
`Readline newlines are determined as for a
Unicode readline function (R4),
this is `NLF along with FF (U+000C), LS (U+2028) or
PS (U+2029).
val lines : ?drop_empty:bool -> ?newline:newline -> t -> t Pvec.t
lines ~drop_empty ~newline t breaks t into subtexts separated
by newlines determined according to newline (defaults to
`Readline). Separators are not part of the result and are lost. If
drop_empty is true (defaults to false), empty lines are dropped.
val paragraphs : ?drop_empty:bool -> t -> t Pvec.t
paragraphs ~drop_empty t breaks t into subtexts separated either
by two consecutive newlines (determined as `NLF or
LS (U+2028)) or a single PS (U+2029). Separators are not part of
the result and are lost. If drop_empty is true (defaults to
false), empty paragraphs are dropped.
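The newline rules above can be sketched with the standard library alone. This is not the Utext implementation: Uchar.t lists stand in for Utext.t and t Pvec.t, of_ascii is a hypothetical test helper, and the handling of a trailing newline (it yields a final empty line unless drop_empty is true) is an assumption of this sketch.

```ocaml
type newline = [ `ASCII | `NLF | `Readline ]

(* [is_newline nl u] is true if the single code point [u] ends a line
   under the newline kind [nl]. CRLF is handled separately in [lines]. *)
let is_newline (nl : newline) u =
  match Uchar.to_int u, nl with
  | (0x000A | 0x000D), _ -> true
  | 0x0085, (`NLF | `Readline) -> true
  | (0x000C | 0x2028 | 0x2029), `Readline -> true
  | _ -> false

let lines ?(drop_empty = false) ?(newline = `Readline) (t : Uchar.t list) =
  (* Push the current line (accumulated in reverse) on the result. *)
  let keep line acc =
    if drop_empty && line = [] then acc else List.rev line :: acc
  in
  let rec loop acc line = function
    | [] -> List.rev (keep line acc)
    | cr :: lf :: rest
      when Uchar.to_int cr = 0x000D && Uchar.to_int lf = 0x000A ->
        (* CRLF counts as a single separator. *)
        loop (keep line acc) [] rest
    | u :: rest when is_newline newline u -> loop (keep line acc) [] rest
    | u :: rest -> loop acc (u :: line) rest
  in
  loop [] [] t

(* Hypothetical helper: code points of an ASCII string. *)
let of_ascii s = List.map Uchar.of_char (List.of_seq (String.to_seq s))
```

For instance, lines (of_ascii "a\r\nb\nc") yields the three lines a, b and c, while with ~newline:`ASCII a form feed is kept as ordinary content.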
For more information on normalization consult a short
introduction, the
UAX #15 Unicode
Normalization Forms and
normalization
charts.
type normalization = [ `NFC | `NFD | `NFKC | `NFKD ]
`NFD
normalization form D, canonical decomposition.
`NFC
normalization form C, canonical decomposition followed by
canonical composition.
`NFKD
normalization form KD, compatibility decomposition.
`NFKC
normalization form KC, compatibility decomposition,
followed by canonical composition.
val normalized : normalization -> t -> t
normalized nf t is t normalized to nf.
val is_normalized : normalization -> t -> bool
is_normalized nf t is true iff t is in normalization form nf.
For more information consult the
UAX #29 Unicode Text
Segmentation, the UAX #14
Unicode Line Breaking Algorithm and the web based
ICU break utility.
type boundary = [ `Grapheme_cluster | `Line_break | `Sentence | `Word ]
`Grapheme_cluster determines
extended grapheme cluster boundaries according to UAX 29
(corresponds, for most scripts, to user-perceived characters).
`Word determines word boundaries according to UAX 29.
`Sentence determines sentence boundaries according to UAX 29.
`Line_break determines mandatory line breaks and
line break opportunities according to UAX 14.
val segments : boundary -> t -> t Pvec.t
segments b t are the segments of text t delimited by two
boundaries of type b.
val segment_count : boundary -> t -> int
segment_count b t is Pvec.length (segments b t).
type pos = int
The positions of a vector v of length l
range over [0;l]. They are the slits before each element and after
the last one. They are labelled from left to right by increasing number.
The ith index is between positions i and i+1.
positions  0   1   2   3   4    l-1    l
           +---+---+---+---+     +-----+
  indices  | 0 | 1 | 2 | 3 | ... | l-1 |
           +---+---+---+---+     +-----+
val boundaries : boundary -> t -> pos Pvec.t
boundaries b t are the positions of boundaries b in
t.
val boundaries_mandatory : boundary -> t -> (pos * bool) Pvec.t
boundaries_mandatory is like Utext.boundaries but returns
the mandatory status of a boundary if the kind of boundary
sports that notion (or always true if not).
val escaped : t -> t
escaped t is t except characters whose general category is
Control, U+0022 or U+005C which are escaped according to OCaml's
lexical conventions for strings with:
U+0008 ('\b') escaped to the sequence <U+005C, U+0062> ("\\b")
U+0009 ('\t') escaped to the sequence <U+005C, U+0074> ("\\t")
U+000A ('\n') escaped to the sequence <U+005C, U+006E> ("\\n")
U+000D ('\r') escaped to the sequence <U+005C, U+0072> ("\\r")
U+0022 ('\"') escaped to the sequence <U+005C, U+0022> ("\\\"")
U+005C ('\\') escaped to the sequence <U+005C, U+005C> ("\\\\")
Any other character to escape is represented by a "\u{H+}" escape
with H a capital hexadecimal number.
Note. As far as OCaml is concerned \u{H+} escapes are only
supported from 4.06 on.
val unescaped : t -> (t, int) Pervasives.result
unescaped t unescapes what Utext.escaped did and any other valid
\u{H+} escape. The hexadecimal digits H of Unicode hex escapes, at
most six, can be upper, lower, or mixed case. Any escape that is
truncated or not defined by Utext.escaped makes the function return
Error idx with idx the start index of the offending escape.
The invariant unescaped (escaped t) = Ok t holds.
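The escaping rules can be sketched with the standard library as follows. This is not the Utext implementation: a Uchar.t list stands in for Utext.t, the result is a UTF-8 string rather than a Utext.t, and padding the hexadecimal escape to at least four digits is an assumption of this sketch (the text only requires capital hexadecimal digits). It relies on the fact that the general category Control covers U+0000..U+001F and U+007F..U+009F, and needs OCaml 4.06 or later for Buffer.add_utf_8_uchar.

```ocaml
(* General category Control (Cc): U+0000..U+001F and U+007F..U+009F. *)
let is_control u =
  let c = Uchar.to_int u in
  c <= 0x001F || (0x007F <= c && c <= 0x009F)

(* Adds the escaped form of [u] to buffer [b]. *)
let add_escaped b u =
  match Uchar.to_int u with
  | 0x0008 -> Buffer.add_string b "\\b"
  | 0x0009 -> Buffer.add_string b "\\t"
  | 0x000A -> Buffer.add_string b "\\n"
  | 0x000D -> Buffer.add_string b "\\r"
  | 0x0022 -> Buffer.add_string b "\\\""
  | 0x005C -> Buffer.add_string b "\\\\"
  | c when is_control u ->
      (* Remaining controls use a \u{H+} escape; the four digit padding
         is an assumption of this sketch. *)
      Buffer.add_string b (Printf.sprintf "\\u{%04X}" c)
  | _ -> Buffer.add_utf_8_uchar b u

(* [escaped t] is the UTF-8 encoded escaped form of [t]. *)
let escaped (t : Uchar.t list) : string =
  let b = Buffer.create 64 in
  List.iter (add_escaped b) t;
  Buffer.contents b
```

A matching unescaping function would have to reject truncated or unknown escapes and report the start index of the offending escape, as described above; it is left out of this sketch.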
val encoding_guess : string -> [ `UTF_16BE | `UTF_16LE | `UTF_8 ] * bool
Warning. The following are best-effort decodes in which any UTF-X
decoding error is replaced by at least one replacement character
Uchar.u_rep.
val of_utf_8 : ?first:int -> ?last:int -> string -> t
of_utf_8 ~first ~last s is the Unicode text that results from
best-effort UTF-8 decoding the bytes of s that exist in the
range [first;last]. first defaults to 0 and last to
length s - 1.
val of_utf_16le : ?first:int -> ?last:int -> string -> t
val of_utf_16be : ?first:int -> ?last:int -> string -> t
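A best-effort UTF-8 decode in the sense described above can be sketched with the standard library decoder. This is not the Utext implementation: it returns a Uchar.t list instead of a Utext.t, uses the stdlib's Uchar.rep (U+FFFD) as the replacement character, and assumes OCaml 4.14 or later for String.get_utf_8_uchar. How many replacement characters a malformed sequence yields follows the stdlib decoder here and may differ from Utext.

```ocaml
(* [of_utf_8 ~first ~last s] best-effort decodes the bytes of [s] that
   exist in the range [first;last]. *)
let of_utf_8 ?(first = 0) ?last (s : string) : Uchar.t list =
  let max_idx = String.length s - 1 in
  let first = max 0 first in
  let last = match last with None -> max_idx | Some l -> min l max_idx in
  if first > last then [] else
  (* Work on the sub-range only, so no byte outside of it is read. *)
  let s = String.sub s first (last - first + 1) in
  let rec loop acc i =
    if i >= String.length s then List.rev acc else
    let d = String.get_utf_8_uchar s i in
    (* On a decode error [utf_decode_uchar] is Uchar.rep and
       [utf_decode_length] is the number of bytes to skip. *)
    loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
  in
  loop [] 0
```

A try_ variant in the same style would stop at the first decode for which Uchar.utf_decode_is_valid is false and return the text decoded so far along with the byte index of the error.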
type decode = (t, t * int * int option) Pervasives.result
Ok t if no decoding error occurred.
Error (t, err, restart) if a decoding error occurred. t is
the text decoded until the error, err the byte index where
the decode error occurred and restart a valid byte index where
a new best-effort decode could be restarted (if any).
val try_of_utf_8 : ?first:int -> ?last:int -> string -> decode
try_of_utf_8 is like Utext.of_utf_8 except in case of error
Error _ is returned as described in Utext.decode.
val try_of_utf_16le : ?first:int -> ?last:int -> string -> decode
val try_of_utf_16be : ?first:int -> ?last:int -> string -> decode
Warning. All these functions raise Invalid_argument if the
result cannot fit in the limits of Sys.max_string_length.
val to_utf_8 : t -> string
to_utf_8 t is the UTF-8 encoding of t.
val to_utf_16le : t -> string
to_utf_16le t is the UTF-16LE encoding of t.
val to_utf_16be : t -> string
to_utf_16be t is the UTF-16BE encoding of t.
val buffer_add_utf_8 : Buffer.t -> t -> unit
buffer_add_utf_8 b t adds the UTF-8 encoding of t to b.
val buffer_add_utf_16le : Buffer.t -> t -> unit
buffer_add_utf_16le b t adds the UTF-16LE encoding of t to b.
val buffer_add_utf_16be : Buffer.t -> t -> unit
buffer_add_utf_16be b t adds the UTF-16BE encoding of t to b.
val pp : Format.formatter -> t -> unit
pp ppf t prints the UTF-8 encoding of t, instructing ppf
to use a length of 1 for each grapheme cluster of t.
val pp_text : Format.formatter -> t -> unit
pp_text ppf t is like Utext.pp except each line break is hinted
to the formatter, see Uuseg_string.pp_utf_8_text for details.
val pp_lines : Format.formatter -> t -> unit
pp_lines ppf t is like Utext.pp except only mandatory line breaks
are hinted to the formatter, see Uuseg_string.pp_utf_8_lines for
details.
val pp_uchars : Format.formatter -> t -> unit
pp_uchars ppf t formats t as a sequence of OCaml Uchar.t values
using only US-ASCII encoded characters according to the Unicode
notational convention for code points.
val pp_toplevel : Format.formatter -> t -> unit
pp_toplevel ppf t formats t using Utext.escaped and Utext.pp in a manner
suitable for the toplevel to represent a Utext.t value.
Warning. Before OCaml 4.06 the result might not be cut and pastable
as \u{H+} escapes are not supported.
val pp_toplevel_pvec : Format.formatter -> t Pvec.t -> unit