`module Utext : sig .. end`

Unicode text for OCaml.

`Utext` provides a type for processing Unicode text.

See also `Uucp` and `Pvec` and consult a minimal Unicode introduction.

TODO: write a minimal `Utext`-specific introduction.

*006ebe4, Unicode version %%UNICODE_VERSION%%, homepage*

### References

- The Unicode FAQ.
- The Unicode Consortium. *The Unicode Standard*. (latest version)

`val unicode_version : string`

`unicode_version` is the Unicode version supported by `Utext`.

`type t = Uchar.t Pvec.t`

The type for Unicode text, a persistent vector of Unicode characters.

`val empty : t`

`empty` is `Pvec.empty`, the empty Unicode text.

`val v : len:int -> Uchar.t -> t`

`v ~len u` is `Pvec.v ~len u`.

`val init : len:int -> (int -> Uchar.t) -> t`

`init ~len f` is `Pvec.init ~len f`.

`val of_uchar : Uchar.t -> t`

`of_uchar u` is `Pvec.singleton u`.

`val str : string -> t`

`str s` is the Unicode text from the UTF-8 encoded string `s`. Raises `Invalid_argument` if `s` is invalid UTF-8; use `Utext.of_utf_8` and `Utext.try_of_utf_8` to deal with untrusted input.

`val strf : ('a, Format.formatter, unit, t) Pervasives.format4 -> 'a`

`strf fmt ...` is `Format.kasprintf (fun s -> str s) fmt ...`.

See also `Pvec`'s predicates and comparisons.

`val is_empty : t -> bool`

`is_empty t` is `true` if `t` is empty; this is equal to `Pvec.is_empty`.

`val equal : t -> t -> bool`

`equal t0 t1` is `true` if the elements in each vector are equal. The comparison is purely per element: it is only meaningful as a text equality if `t0` and `t1` are known to be in a particular form, see e.g. `Utext.canonical_caseless_key` or normal forms.

**FIXME.** Should we provide a fool-proof equality that always compares in `` `NFD `` or `` `NFC ``? The problem is that since we are using a raw `Pvec.t` we cannot cache.

`val compare : t -> t -> int`

`compare t0 t1` is the per element lexicographical order between `t0` and `t1`.
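As an illustration of why per element comparison is not a full text equality, here is a minimal sketch on plain `Uchar.t list` values using only the standard library. `compare_text` is a hypothetical stand-in, not `Utext.compare`; in particular, ordering a strict prefix before the longer text is an assumption of the sketch.

```ocaml
(* Hypothetical stand-in for a per element lexicographical comparison.
   Texts are modelled as Uchar.t lists rather than Uchar.t Pvec.t. *)
let rec compare_text (t0 : Uchar.t list) (t1 : Uchar.t list) : int =
  match t0, t1 with
  | [], [] -> 0
  | [], _ -> -1 (* assumption: a strict prefix sorts first *)
  | _, [] -> 1
  | u0 :: r0, u1 :: r1 ->
      let c = Uchar.compare u0 u1 in
      if c <> 0 then c else compare_text r0 r1

(* U+00E9 and <U+0065, U+0301> are canonically equivalent but differ
   element by element, so a per element comparison tells them apart. *)
let precomposed = [ Uchar.of_int 0x00E9 ]
let decomposed = [ Uchar.of_int 0x0065; Uchar.of_int 0x0301 ]
let differ = compare_text precomposed decomposed <> 0
```

Here `differ` is `true`, which is why texts should first be brought to a common form before being compared.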

For more information about case see the Unicode case mapping FAQ and the case mapping charts. Note that these algorithms are insensitive to language and context and may produce sub-par results for some users.

`val lowercased : t -> t`

`lowercased t` is `t` lowercased according to Unicode's default case conversion.

`val uppercased : t -> t`

`uppercased t` is `t` uppercased according to Unicode's default case conversion.

`val capitalized : t -> t`

`capitalized t` is `t` capitalized: if the first character of `t` is cased it is mapped to its title case mapping; otherwise `t` is left unchanged.

`val uncapitalized : t -> t`

`uncapitalized t` is `t` uncapitalized: if the first character of `t` is cased it is mapped to its lowercase case mapping; otherwise `t` is left unchanged.

Testing the equality of two Unicode texts in a case insensitive manner requires a fair amount of data massaging that includes normalization and case folding. These results should be cached if many comparisons have to be made on the same text. The following functions return keys for a given text that can be used to test equality against other keys. **Do not** compare keys generated by different functions; the comparison would be meaningless. See also `Utext.identifier_caseless_key`.

`val casefolded : t -> t`

`casefolded t` is `t` casefolded according to Unicode's default casefold. This can be used to implement various forms of caseless equalities. `equal (casefolded t0) (casefolded t1)` determines default case equality (TUS D144) of `t0` and `t1`.

`val canonical_caseless_key : t -> t`

`canonical_caseless_key t` is a key such that `equal (canonical_caseless_key t0) (canonical_caseless_key t1)` determines canonical caseless equality (TUS D145) of `t0` and `t1`.

`val compatibility_caseless_key : t -> t`

`compatibility_caseless_key t` is a key such that `equal (compatibility_caseless_key t0) (compatibility_caseless_key t1)` determines compatibility caseless equality (TUS D146) of `t0` and `t1`.

For more information see UAX 31 Unicode Identifier and Pattern Syntax.

`val is_identifier : t -> bool`

`val identifier_caseless_key : t -> t`

`identifier_caseless_key t` is a key such that `equal (identifier_caseless_key t0) (identifier_caseless_key t1)` determines identifier caseless equality (TUS D147) of `t0` and `t1`.

These functions break text like a simple `readline` function would. If you are looking for line breaks to lay out text, see line break segmentation.

``type newline = [ `ASCII | `NLF | `Readline ]``

The type for specifying newlines.

- `` `ASCII `` newlines occur after a CR (U+000D), LF (U+000A) or CRLF (`<U+000D, U+000A>`).
- `` `NLF `` newlines occur after the *Unicode newline function*; this is `` `ASCII `` along with NEL (U+0085).
- `` `Readline `` newlines are determined as for a *Unicode readline function* (R4); this is `` `NLF `` along with FF (U+000C), LS (U+2028) or PS (U+2029).
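The three conventions above nest, which a small classifier makes explicit. The following is a minimal sketch using only the standard library on `Uchar.t list` values; `is_newline` and `split_lines` are hypothetical names, not part of `Utext`, and keeping a trailing empty line when the input ends with a newline is a choice of the sketch.

```ocaml
(* Hypothetical stand-alone sketch, not Utext's implementation. *)
type newline = [ `ASCII | `NLF | `Readline ]

(* Is code point [u] a newline character under convention [nl]? *)
let is_newline (nl : newline) (u : Uchar.t) : bool =
  match Uchar.to_int u with
  | 0x000A | 0x000D -> true                     (* LF, CR *)
  | 0x0085 -> nl <> `ASCII                      (* NEL *)
  | 0x000C | 0x2028 | 0x2029 -> nl = `Readline  (* FF, LS, PS *)
  | _ -> false

(* Split [us] into lines. A CR immediately followed by LF counts as a
   single newline. Separators are dropped from the result. *)
let split_lines ?(newline = `Readline) (us : Uchar.t list) =
  let rec go acc line = function
    | [] -> List.rev (List.rev line :: acc)
    | cr :: lf :: rest
      when Uchar.to_int cr = 0x000D && Uchar.to_int lf = 0x000A ->
        go (List.rev line :: acc) [] rest
    | u :: rest when is_newline newline u ->
        go (List.rev line :: acc) [] rest
    | u :: rest -> go acc (u :: line) rest
  in
  go [] [] us
```

For example, the text `<U+0061, U+000D, U+000A, U+0062, U+0085, U+0063>` splits into two lines with `` ~newline:`ASCII `` (NEL is ordinary content there) and into three lines with the default.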

`val lines : ?drop_empty:bool -> ?newline:newline -> t -> t Pvec.t`

`lines ~drop_empty ~newline t` breaks `t` into subtexts separated by newlines determined according to `newline` (defaults to `` `Readline ``). Separators are not part of the result and are lost. If `drop_empty` is `true` (defaults to `false`), empty lines are dropped.

`val paragraphs : ?drop_empty:bool -> t -> t Pvec.t`

`paragraphs ~drop_empty t` breaks `t` into subtexts separated either by two consecutive newlines (determined as `` `NLF `` or LS (U+2028)) or a single PS (U+2029). Separators are not part of the result and are lost. If `drop_empty` is `true` (defaults to `false`), empty paragraphs are dropped.

For more information on normalization consult a short introduction, the UAX #15 Unicode Normalization Forms and normalization charts.

``type normalization = [ `NFC | `NFD | `NFKC | `NFKD ]``

The type for normalization forms.

- `` `NFD `` normalization form D, canonical decomposition.
- `` `NFC `` normalization form C, canonical decomposition followed by canonical composition.
- `` `NFKD `` normalization form KD, compatibility decomposition.
- `` `NFKC `` normalization form KC, compatibility decomposition followed by canonical composition.

`val normalized : normalization -> t -> t`

`normalized nf t` is `t` normalized to `nf`.

`val is_normalized : normalization -> t -> bool`

`is_normalized nf t` is `true` iff `t` is in normalization form `nf`.

For more information consult the UAX #29 Unicode Text Segmentation, the UAX #14 Unicode Line Breaking Algorithm and the web based ICU break utility.

``type boundary = [ `Grapheme_cluster | `Line_break | `Sentence | `Word ]``

The type for boundaries.

- `` `Grapheme_cluster `` determines extended grapheme cluster boundaries according to UAX 29 (corresponds, for most scripts, to user-perceived characters).
- `` `Word `` determines word boundaries according to UAX 29.
- `` `Sentence `` determines sentence boundaries according to UAX 29.
- `` `Line_break `` determines mandatory line breaks and line break opportunities according to UAX 14.

`val segments : boundary -> t -> t Pvec.t`

`segments b t` are the segments of text `t` delimited by two boundaries of type `b`.

`val segment_count : boundary -> t -> int`

`segment_count b t` is `Pvec.length (segments b t)`.

`type pos = int`

The type for positions. The positions of a vector `v` of length `l` range over [`0`;`l`]. They are the slits before each element and after the last one. They are labelled from left to right by increasing number. The `i`th index is between positions `i` and `i+1`.

    positions  0   1   2   3   4    l-1    l
               +---+---+---+---+     +-----+
      indices  | 0 | 1 | 2 | 3 | ... | l-1 |
               +---+---+---+---+     +-----+
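Since positions are slits between elements, two consecutive positions `p0` and `p1` delimit the elements with indices `p0` to `p1 - 1`. The sketch below shows this on plain arrays rather than `Pvec.t` values; `cut_at_positions` is a hypothetical helper, not a `Utext` function.

```ocaml
(* Hypothetical helper: cut [a] into the slices delimited by each pair of
   consecutive positions in [ps], which must be increasing and within
   [0;Array.length a]. The slice between positions p0 and p1 holds the
   elements at indices p0 .. p1 - 1. *)
let cut_at_positions (a : 'a array) (ps : int list) : 'a array list =
  let rec go = function
    | p0 :: (p1 :: _ as rest) -> Array.sub a p0 (p1 - p0) :: go rest
    | _ -> []
  in
  go ps

(* Positions 0, 1 and 4 of a four element array delimit two slices. *)
let slices = cut_at_positions [| 'a'; 'b'; 'c'; 'd' |] [ 0; 1; 4 ]
```

Here `slices` is `[ [|'a'|]; [|'b'; 'c'; 'd'|] ]`: position `4` is valid even though index `4` is not.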

`val boundaries : boundary -> t -> pos Pvec.t`

`boundaries b t` are the positions of boundaries `b` in `t`.

`val boundaries_mandatory : boundary -> t -> (pos * bool) Pvec.t`

`boundaries_mandatory` is like `Utext.boundaries` but returns the mandatory status of a boundary if the kind of boundary sports that notion (or always `true` if not).

`val escaped : t -> t`

`escaped t` is `t` with the characters whose general category is `Control`, U+0022 and U+005C escaped according to OCaml's lexical conventions for strings:

- Any U+0008 (`'\b'`) is escaped to the sequence <U+005C, U+0062> (`"\\b"`)
- Any U+0009 (`'\t'`) is escaped to the sequence <U+005C, U+0074> (`"\\t"`)
- Any U+000A (`'\n'`) is escaped to the sequence <U+005C, U+006E> (`"\\n"`)
- Any U+000D (`'\r'`) is escaped to the sequence <U+005C, U+0072> (`"\\r"`)
- Any U+0022 (`'\"'`) is escaped to the sequence <U+005C, U+0022> (`"\\\""`)
- Any U+005C (`'\\'`) is escaped to the sequence <U+005C, U+005C> (`"\\\\"`)
- Any other character is escaped by a *hexadecimal* `"\u{H+}"` escape with `H` a capital hexadecimal digit.

**Note.** As far as OCaml is concerned `\u{H+}` escapes are only supported from 4.06 on.
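The escaping rules listed above can be sketched with the standard library alone. The code below works on UTF-8 encoded strings rather than `Utext.t` values and needs OCaml 4.14 or later for `String.get_utf_8_uchar`; `is_control` and `escape_utf_8` are hypothetical names, and the zero padding to four hexadecimal digits is a choice of the sketch, not something the text above specifies.

```ocaml
(* Hypothetical sketch, not Utext's implementation. *)

(* General category Control (Cc) is exactly U+0000-U+001F and
   U+007F-U+009F. *)
let is_control (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  c <= 0x1F || (0x7F <= c && c <= 0x9F)

(* Escape the UTF-8 encoded string [s]; other characters are copied. *)
let escape_utf_8 (s : string) : string =
  let b = Buffer.create (String.length s) in
  let rec loop i =
    if i >= String.length s then Buffer.contents b else
    let d = String.get_utf_8_uchar s i in
    let u = Uchar.utf_decode_uchar d in
    begin match Uchar.to_int u with
    | 0x08 -> Buffer.add_string b "\\b"
    | 0x09 -> Buffer.add_string b "\\t"
    | 0x0A -> Buffer.add_string b "\\n"
    | 0x0D -> Buffer.add_string b "\\r"
    | 0x22 -> Buffer.add_string b "\\\""
    | 0x5C -> Buffer.add_string b "\\\\"
    | c when is_control u ->
        (* Capital hexadecimal digits; the padding is an assumption. *)
        Buffer.add_string b (Printf.sprintf "\\u{%04X}" c)
    | _ -> Buffer.add_utf_8_uchar b u
    end;
    loop (i + Uchar.utf_decode_length d)
  in
  loop 0
```

With this sketch a tab becomes the two characters `\t`, while NEL (U+0085), which has no short escape, becomes `\u{0085}`.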

`val unescaped : t -> (t, int) Pervasives.result`

`unescaped t` unescapes what `Utext.escaped` did and any other valid `\u{H+}` escape. The hexadecimal digits `H` of Unicode hex escapes, at most six, can be upper, lower, or mixed case. Any truncated escape, or any escape not defined by `Utext.escaped`, makes the function return an `Error idx` with `idx` the start index of the offending escape. The invariant `unescaped (escaped t) = Ok t` holds.

``val encoding_guess : string -> [ `UTF_16BE | `UTF_16LE | `UTF_8 ] * bool``

**Warning.** The following are best-effort decodes in which any UTF-X decoding error is replaced by at least one replacement character `Uchar.u_rep`.

`val of_utf_8 : ?first:int -> ?last:int -> string -> t`

`of_utf_8 ~first ~last s` is the Unicode text that results from best-effort UTF-8 decoding the bytes of `s` that exist in the range [`first`;`last`]. `first` defaults to `0` and `last` to `length s - 1`.

`val of_utf_16le : ?first:int -> ?last:int -> string -> t`

`val of_utf_16be : ?first:int -> ?last:int -> string -> t`
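To illustrate what a best-effort decode of a byte range looks like, here is a minimal sketch built on the standard library's UTF-8 decoder (OCaml 4.14 or later). `decode_utf_8` is a hypothetical name returning a `Uchar.t list`, not `Utext.of_utf_8`; how many replacement characters a given malformed sequence yields follows the standard library here and may differ from `Utext`.

```ocaml
(* Hypothetical sketch, not Utext's implementation. *)
let decode_utf_8 ?(first = 0) ?last (s : string) : Uchar.t list =
  let last = match last with None -> String.length s - 1 | Some l -> l in
  let len = last - first + 1 in
  if len <= 0 then [] else
  (* Restrict decoding to the bytes in [first;last]. String.sub raises
     Invalid_argument if that range is not within [s]. *)
  let sub = String.sub s first len in
  let rec loop acc i =
    if i >= len then List.rev acc else
    let d = String.get_utf_8_uchar sub i in
    (* On malformed input the decode yields Uchar.rep (U+FFFD) and a
       length of at least one byte, so the loop always progresses. *)
    loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
  in
  loop [] first |> fun l -> l
```

Note that `loop` is started at `first` only when `first` is `0`; to keep the sketch correct for any `first`, the recursion indexes into `sub`, so a caller passing a non-zero `first` should rely on the variant below, which starts at `0`.

```ocaml
(* Same sketch with the loop started at offset 0 of the extracted range. *)
let decode_utf_8_range ?(first = 0) ?last (s : string) : Uchar.t list =
  let last = match last with None -> String.length s - 1 | Some l -> l in
  let len = last - first + 1 in
  if len <= 0 then [] else
  let sub = String.sub s first len in
  let rec loop acc i =
    if i >= len then List.rev acc else
    let d = String.get_utf_8_uchar sub i in
    loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
  in
  loop [] 0
```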

`type decode = (t, t * int * int option) Pervasives.result`

The type for decode results. This is:

- `Ok t` if no decoding error occurred.
- `Error (t, err, restart)` if a decoding error occurred. `t` is the text decoded until the error, `err` the byte index where the decode error occurred and `restart` a valid byte index where a new best-effort decode could be restarted (if any).
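A caller typically pattern matches on the three possible shapes of such a result. The sketch below uses a local stand-in type so that it runs with the standard library alone: `text`, `decode` (redefined locally) and `describe` are hypothetical, with text modelled as a `Uchar.t list`.

```ocaml
(* Local stand-ins mirroring the shape of the decode result type. *)
type text = Uchar.t list
type decode = (text, text * int * int option) result

(* Summarise a decode result, distinguishing whether a restart byte
   index is available. *)
let describe : decode -> string = function
  | Ok t -> Printf.sprintf "decoded %d characters" (List.length t)
  | Error (t, err, Some restart) ->
      Printf.sprintf
        "error at byte %d after %d characters, restart at byte %d"
        err (List.length t) restart
  | Error (t, err, None) ->
      Printf.sprintf "error at byte %d after %d characters, no restart point"
        err (List.length t)
```

The `Some restart` case is the one that lets a caller keep the text decoded so far and resume a best-effort decode further along the input.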

`val try_of_utf_8 : ?first:int -> ?last:int -> string -> decode`

`try_of_utf_8` is like `Utext.of_utf_8` except in case of error `Error _` is returned as described in `Utext.decode`.

`val try_of_utf_16le : ?first:int -> ?last:int -> string -> decode`

`val try_of_utf_16be : ?first:int -> ?last:int -> string -> decode`

**Warning.** All these functions raise `Invalid_argument` if the result cannot fit in the limits of `Sys.max_string_length`.

`val to_utf_8 : t -> string`

`to_utf_8 t` is the UTF-8 encoding of `t`.

`val to_utf_16le : t -> string`

`to_utf_16le t` is the UTF-16LE encoding of `t`.

`val to_utf_16be : t -> string`

`to_utf_16be t` is the UTF-16BE encoding of `t`.

`val buffer_add_utf_8 : Buffer.t -> t -> unit`

`buffer_add_utf_8 b t` adds the UTF-8 encoding of `t` to `b`.

`val buffer_add_utf_16le : Buffer.t -> t -> unit`

`buffer_add_utf_16le b t` adds the UTF-16LE encoding of `t` to `b`.

`val buffer_add_utf_16be : Buffer.t -> t -> unit`

`buffer_add_utf_16be b t` adds the UTF-16BE encoding of `t` to `b`.

`val pp : Format.formatter -> t -> unit`

`pp ppf t` prints the UTF-8 encoding of `t`, instructing `ppf` to use a length of `1` for each grapheme cluster of `t`.

`val pp_text : Format.formatter -> t -> unit`

`pp_text ppf t` is like `Utext.pp` except line breaks are hinted to the formatter, see `Uuseg_string.pp_utf_8_text` for details.

`val pp_lines : Format.formatter -> t -> unit`

`pp_lines ppf t` is like `Utext.pp` except only mandatory line breaks are hinted to the formatter, see `Uuseg_string.pp_utf_8_lines` for details.

`val pp_uchars : Format.formatter -> t -> unit`

`pp_uchars ppf t` formats `t` as a sequence of OCaml `Uchar.t` values using only US-ASCII encoded characters according to the Unicode notational convention for code points.

`val pp_toplevel : Format.formatter -> t -> unit`

`pp_toplevel ppf t` formats `t` using `Utext.escaped` and `Utext.pp` in a manner suitable for the toplevel to represent a `Utext.t` value.

**Warning.** Before OCaml 4.06 the result might not be cut and pastable as `\u{H+}` escapes are not supported.

`val pp_toplevel_pvec : Format.formatter -> t Pvec.t -> unit`