module Utext : sig .. end
Utext
provides a type for processing Unicode text.
See also Uucp
and Pvec
and consult a
minimal unicode introduction.
TODO make a minimal Utext specific minimal intro.
006ebe4 — Unicode version %%UNICODE_VERSION%% — homepage
val unicode_version : string
unicode_version is the Unicode version supported by Utext.
type t = Uchar.t Pvec.t
val empty : t
empty is Pvec.empty, the empty Unicode text.
val v : len:int -> Uchar.t -> t
v ~len u is Pvec.v ~len u.
val init : len:int -> (int -> Uchar.t) -> t
init ~len f is Pvec.init ~len f.
val of_uchar : Uchar.t -> t
of_uchar u is Pvec.singleton u.
val str : string -> t
str s is Unicode text from the valid UTF-8 encoded bytes s.
Raises Invalid_argument if s is invalid UTF-8; use Utext.of_utf_8 and Utext.try_of_utf_8 to deal with untrusted input.
val strf : ('a, Format.formatter, unit, t) Pervasives.format4 -> 'a
strf fmt ... is Format.kasprintf (fun s -> str s) fmt ...
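The relation between str and strf can be sketched with the standard library alone. In this sketch the text type is a hypothetical stand-in (a validated UTF-8 string, not a Uchar.t Pvec.t), and String.is_valid_utf_8 assumes OCaml 4.14 or later; only the shape of the definitions mirrors the descriptions above.

```ocaml
(* Hypothetical stand-in for Utext.t: a string known to be valid UTF-8. *)
type text = Text of string

(* Stand-in for str: rejects invalid UTF-8 with Invalid_argument. *)
let str s =
  if String.is_valid_utf_8 s then Text s
  else invalid_arg "str: invalid UTF-8"

(* strf formats to a string, then hands the result to str. *)
let strf fmt = Format.kasprintf (fun s -> str s) fmt

let greeting = strf "%d-%s" 4 "ok"
```

Because strf goes through str, a format that produces invalid UTF-8 bytes would raise in this model as well.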
See also Pvec
's predicates and comparisons.
val is_empty : t -> bool
is_empty t
is true
if t
is empty, this is equal to
Pvec.is_empty.
val equal : t -> t -> bool
equal t0 t1
is true
if the elements in each vector are
equal. Warning. The test is textually meaningless unless
t0
and t1
are known to be in a particular form, see e.g.
Utext.canonical_caseless_key
or normal forms.
FIXME. Should we provide a fool-proof equality that
always compares in `NFD
or `NFC
? Problem is that
since we are using raw Pvec.t we cannot cache.
val compare : t -> t -> int
compare t0 t1
is the per element lexicographical order between
t0
and t1
. Warning. The comparison is textually
meaningless.
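To see why per-element equality and comparison are textually meaningless, here is a minimal model that represents a text as a list of code points (an assumption made for brevity; Utext.t is a Uchar.t Pvec.t). The names below are local to the sketch and List.equal and List.compare assume OCaml 4.12 or later.

```ocaml
(* Model: a text is a list of code points. *)
let equal t0 t1 = List.equal Int.equal t0 t1
let compare t0 t1 = List.compare Int.compare t0 t1

(* U+00E9, precomposed e with acute accent. *)
let e_acute_precomposed = [0x00E9]

(* U+0065 followed by U+0301 (combining acute accent): canonically
   equivalent to the text above, but a different element sequence. *)
let e_acute_decomposed = [0x0065; 0x0301]

(* Per-element equality does not see the canonical equivalence. *)
let same = equal e_acute_precomposed e_acute_decomposed

(* Per-element order is code point order: Z (U+005A) sorts before
   a (U+0061), and z (U+007A) before U+00E9. *)
let z_upper_before_a = compare [0x005A] [0x0061] < 0
let z_before_e_acute = compare [0x007A] [0x00E9] < 0
```

Normalizing both texts to the same form before calling equal is what makes the first test meaningful.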
For more information about case see the
Unicode
case mapping FAQ and the
case mapping charts. Note
that these algorithms are insensitive to language and context and
may produce sub-par results for some users.
val lowercased : t -> t
lowercased t is t lowercased according to Unicode's default case conversion.
val uppercased : t -> t
uppercased t is t uppercased according to Unicode's default case conversion.
val capitalized : t -> t
capitalized t is t capitalized: if the first character of t is cased it is mapped to its title case mapping; otherwise t is left unchanged.
val uncapitalized : t -> t
uncapitalized t is t uncapitalized: if the first character of t is cased it is mapped to its lowercase mapping; otherwise t is left unchanged.
Testing the equality of two Unicode texts in a case insensitive
manner requires a fair amount of data massaging that includes
normalization and case folding. These results
should be cached if many comparisons have to be made on the same
text. The following functions return keys for a given text that
can be used to test equality against other keys. Do not compare
keys generated by different functions; the comparison would be
meaningless. See also Utext.identifier_caseless_key.
val casefolded : t -> t
casefolded t is t casefolded according to Unicode's default casefold. This can be used to implement various forms of caseless equalities. equal (casefolded t0) (casefolded t1) determines default case equality (TUS D144) of t0 and t1. Warning. In general this notion is not good enough; use one of the following functions.
val canonical_caseless_key : t -> t
canonical_caseless_key t is a key such that equal (canonical_caseless_key t0) (canonical_caseless_key t1) determines canonical caseless equality (TUS D145) of t0 and t1.
val compatibility_caseless_key : t -> t
compatibility_caseless_key t is a key such that equal (compatibility_caseless_key t0) (compatibility_caseless_key t1) determines compatibility caseless equality (TUS D146) of t0 and t1.
For more information see UAX 31
Unicode Identifier and Pattern Syntax.
val is_identifier : t -> bool
val identifier_caseless_key : t -> t
identifier_caseless_key t
is a key such that
equal (identifier_caseless_key t0) (identifier_caseless_key t1)
determines identifier caseless
equality (TUS
D147) of t0
and t1
.
These functions break text like a simple readline
function
would. If you are looking for line breaks to layout text, see
line break segmentation.
type newline = [ `ASCII | `NLF | `Readline ]
`ASCII newlines occur after a CR (U+000D), LF (U+000A) or CRLF (<U+000D, U+000A>).
`NLF newlines occur after the Unicode newline function, this is `ASCII along with NEL (U+0085).
`Readline newlines are determined as for a Unicode readline function (R4), this is `NLF along with FF (U+000C), LS (U+2028) or PS (U+2029).
val lines : ?drop_empty:bool -> ?newline:newline -> t -> t Pvec.t
lines ~drop_empty ~newline t breaks t into subtexts separated by newlines determined according to newline (defaults to `Readline). Separators are not part of the result and are lost. If drop_empty is true (defaults to false) drops lines that are empty.
val paragraphs : ?drop_empty:bool -> t -> t Pvec.t
paragraphs ~drop_empty t breaks t into subtexts separated either by two consecutive newlines (determined as `NLF or LS (U+2028)) or a single PS (U+2029). Separators are not part of the result and are lost. If drop_empty is true (defaults to false) drops paragraphs that are empty.
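The three newline kinds and the drop_empty flag can be illustrated with a small model of lines. This is a sketch, not Utext's implementation: texts are modelled as lists of code points, and newlines are treated strictly as separators, so a trailing newline yields a final empty line. Whether Utext.lines behaves that way on trailing newlines is an assumption.

```ocaml
(* Which code points count as a newline for each kind. *)
let is_newline = function
  | `ASCII -> (fun u -> u = 0x000A || u = 0x000D)
  | `NLF -> (fun u -> u = 0x000A || u = 0x000D || u = 0x0085)
  | `Readline ->
      (fun u -> u = 0x000A || u = 0x000D || u = 0x0085 ||
                u = 0x000C || u = 0x2028 || u = 0x2029)

(* Model of lines on a list of code points. *)
let lines ?(drop_empty = false) ?(newline = `Readline) t =
  let is_nl = is_newline newline in
  (* Push the current line (accumulated in reverse) unless dropped. *)
  let flush line acc =
    if drop_empty && line = [] then acc else List.rev line :: acc
  in
  let rec loop line acc = function
    | [] -> List.rev (flush line acc)
    (* CRLF counts as a single newline in every kind. *)
    | 0x000D :: 0x000A :: rest -> loop [] (flush line acc) rest
    | u :: rest when is_nl u -> loop [] (flush line acc) rest
    | u :: rest -> loop (u :: line) acc rest
  in
  loop [] [] t

(* a CRLF b LF LF c *)
let sample = [0x61; 0x0D; 0x0A; 0x62; 0x0A; 0x0A; 0x63]
let all_lines = lines ~newline:`ASCII sample
let non_empty = lines ~drop_empty:true ~newline:`ASCII sample
```

In this model NEL (U+0085) splits a text under `NLF and `Readline but is ordinary content under `ASCII.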
For more information on normalization consult a short
introduction, the
UAX #15 Unicode
Normalization Forms and
normalization
charts.
type normalization = [ `NFC | `NFD | `NFKC | `NFKD ]
`NFD normalization form D, canonical decomposition.
`NFC normalization form C, canonical decomposition followed by canonical composition.
`NFKD normalization form KD, compatibility decomposition.
`NFKC normalization form KC, compatibility decomposition, followed by canonical composition.
val normalized : normalization -> t -> t
normalized nf t is t normalized to nf.
val is_normalized : normalization -> t -> bool
is_normalized nf t is true iff t is in normalization form nf.
For more information consult the
UAX #29 Unicode Text
Segmentation, the UAX #14
Unicode Line Breaking Algorithm and the web based
ICU break utility.
type boundary = [ `Grapheme_cluster | `Line_break | `Sentence | `Word ]
`Grapheme_cluster determines extended grapheme clusters boundaries according to UAX 29 (corresponds, for most scripts, to user-perceived characters).
`Word determines word boundaries according to UAX 29.
`Sentence determines sentence boundaries according to UAX 29.
`Line_break determines mandatory line breaks and line break opportunities according to UAX 14.
val segments : boundary -> t -> t Pvec.t
segments b t are the segments of text t delimited by two boundaries of type b.
val segment_count : boundary -> t -> int
segment_count b t is Pvec.length (segments b t).
type pos = int
Positions of a text v of length l range over [0;l]. They are the slits before each element and after the last one. They are labelled from left to right by increasing number. The ith index is between positions i and i+1.

positions  0   1   2   3   4    l-1    l
           +---+---+---+---+     +-----+
  indices  | 0 | 1 | 2 | 3 | ... | l-1 |
           +---+---+---+---+     +-----+
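The difference between positions and indices, and how a list of boundary positions delimits segments, can be sketched on plain arrays. The function names below are hypothetical helpers for this illustration and are not part of Utext or Pvec.

```ocaml
(* Elements between two positions, with
   0 <= start <= stop <= Array.length v. *)
let between v ~start ~stop = Array.sub v start (stop - start)

(* Cut v at each pair of consecutive boundary positions. *)
let segments_of_boundaries v bounds =
  let rec loop acc = function
    | p0 :: (p1 :: _ as rest) ->
        loop (between v ~start:p0 ~stop:p1 :: acc) rest
    | _ -> List.rev acc
  in
  loop [] bounds

let v = [| 'a'; 'b'; 'c'; 'd' |]

(* The element at index 2 lies between positions 2 and 3. *)
let third = between v ~start:2 ~stop:3

(* Positions 0 and 4 (the length) are both valid, unlike index 4. *)
let segs = segments_of_boundaries v [0; 1; 4]
```

A text of length l therefore has l + 1 positions, and n boundary positions delimit n - 1 segments in this model.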
val boundaries : boundary -> t -> pos Pvec.t
boundaries b t are the positions of boundaries b in t.
val boundaries_mandatory : boundary -> t -> (pos * bool) Pvec.t
boundaries_mandatory is like Utext.boundaries but returns the mandatory status of a boundary if the kind of boundary sports that notion (or always true if not).
val escaped : t -> t
escaped t is t except characters whose general category is Control, U+0022 or U+005C which are escaped according to OCaml's lexical conventions for strings with:
U+0008 ('\b') escaped to the sequence <U+005C, U+0062> ("\\b")
U+0009 ('\t') escaped to the sequence <U+005C, U+0074> ("\\t")
U+000A ('\n') escaped to the sequence <U+005C, U+006E> ("\\n")
U+000D ('\r') escaped to the sequence <U+005C, U+0072> ("\\r")
U+0022 ('\"') escaped to the sequence <U+005C, U+0022> ("\\\"")
U+005C ('\\') escaped to the sequence <U+005C, U+005C> ("\\\\")
Any other escaped character is substituted by a "\u{H+}" escape with H a capital hexadecimal number.
Note. As far as OCaml is concerned \u{H+}
escapes are only
supported from 4.06 on.
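The escaping rules can be sketched as follows. This is a model, not Utext's code: it takes a list of Uchar.t and returns a UTF-8 string rather than a Utext.t, it uses the fact that the Control general category covers U+0000 to U+001F and U+007F to U+009F, and the four-digit zero padding of the hexadecimal escape is an assumption (the description above only says capital hexadecimal). Buffer.add_utf_8_uchar needs OCaml 4.06 or later.

```ocaml
(* General category Control: C0 controls, DEL and C1 controls. *)
let is_control c = c <= 0x1F || (0x7F <= c && c <= 0x9F)

(* Model of escaped on a list of Uchar.t, producing a UTF-8 string. *)
let escaped (t : Uchar.t list) : string =
  let b = Buffer.create 16 in
  let add u =
    match Uchar.to_int u with
    | 0x0008 -> Buffer.add_string b "\\b"
    | 0x0009 -> Buffer.add_string b "\\t"
    | 0x000A -> Buffer.add_string b "\\n"
    | 0x000D -> Buffer.add_string b "\\r"
    | 0x0022 -> Buffer.add_string b "\\\""
    | 0x005C -> Buffer.add_string b "\\\\"
    (* Remaining controls: hex escape; the padding is an assumption. *)
    | c when is_control c ->
        Buffer.add_string b (Printf.sprintf "\\u{%04X}" c)
    (* Everything else is kept as is. *)
    | _ -> Buffer.add_utf_8_uchar b u
  in
  List.iter add t;
  Buffer.contents b

(* a, TAB, quotation mark, backslash, NUL, U+00E9 *)
let sample = List.map Uchar.of_int [0x61; 0x09; 0x22; 0x5C; 0x00; 0xE9]
let sample_escaped = escaped sample
```

Non-control characters outside ASCII, such as U+00E9 here, pass through unescaped in this model, which matches the description's restriction to Control characters, U+0022 and U+005C.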
val unescaped : t -> (t, int) Pervasives.result
unescaped t unescapes what Utext.escaped did and any other valid \u{H+} escape. The, at most six, hexadecimal digits H of Unicode hex escapes can be upper, lower, or mixed case. Any escape that is truncated or not defined by Utext.escaped makes the function return an Error idx with idx the start index of the offending escape.
The invariant unescaped (escaped t) = Ok t holds.
val encoding_guess : string -> [ `UTF_16BE | `UTF_16LE | `UTF_8 ] * bool
Warning. The following are best-effort decodes in which any UTF-X
decoding error is replaced by at least one replacement character
Uchar.u_rep.
val of_utf_8 : ?first:int -> ?last:int -> string -> t
of_utf_8 ~first ~last s is the Unicode text that results from best-effort UTF-8 decoding the bytes of s that exist in the range [first;last]. first defaults to 0 and last to length s - 1.
val of_utf_16le : ?first:int -> ?last:int -> string -> t
val of_utf_16be : ?first:int -> ?last:int -> string -> t
type decode = (t, t * int * int option) Pervasives.result
Ok t if no decoding error occurred.
Error (t, err, restart) if a decoding error occurred. t is the text decoded until the error, err the byte index where the decode error occurred and restart a valid byte index where a new best-effort decode could be restarted (if any).
val try_of_utf_8 : ?first:int -> ?last:int -> string -> decode
try_of_utf_8 is like Utext.of_utf_8 except in case of error Error _ is returned as described in Utext.decode.
val try_of_utf_16le : ?first:int -> ?last:int -> string -> decode
val try_of_utf_16be : ?first:int -> ?last:int -> string -> decode
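The shape of the decode result can be illustrated with the standard library's UTF-8 decoder (String.get_utf_8_uchar, OCaml 4.14 or later). This is a sketch under assumptions: the text is modelled as a list of Uchar.t, the first and last arguments are omitted, and the restart index is taken to be the byte following the bytes consumed by the failed decode, which may differ from what Utext reports.

```ocaml
(* Model of the decode type, with a Uchar.t list standing in for t. *)
type decode = (Uchar.t list, Uchar.t list * int * int option) result

(* Model of try_of_utf_8 over the whole string. *)
let try_of_utf_8 (s : string) : decode =
  let len = String.length s in
  let rec loop acc i =
    if i >= len then Ok (List.rev acc) else
    let d = String.get_utf_8_uchar s i in
    let next = i + Uchar.utf_decode_length d in
    if Uchar.utf_decode_is_valid d
    then loop (Uchar.utf_decode_uchar d :: acc) next
    else
      (* Assumed restart point: right after the offending bytes. *)
      let restart = if next < len then Some next else None in
      Error (List.rev acc, i, restart)
  in
  loop [] 0

let u = Uchar.of_int

let ok = try_of_utf_8 "a\xC3\xA9"       (* a followed by U+00E9 *)
let bad_middle = try_of_utf_8 "a\xFFb"  (* 0xFF never occurs in UTF-8 *)
let bad_end = try_of_utf_8 "a\xFF"
```

A caller can resume a best-effort decode from the restart index when it is Some, or stop when it is None.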
Warning. All these functions raise Invalid_argument
if the
result cannot fit in the limits of Sys.max_string_length
.
val to_utf_8 : t -> string
to_utf_8 t is the UTF-8 encoding of t.
val to_utf_16le : t -> string
to_utf_16le t is the UTF-16LE encoding of t.
val to_utf_16be : t -> string
to_utf_16be t is the UTF-16BE encoding of t.
val buffer_add_utf_8 : Buffer.t -> t -> unit
buffer_add_utf_8 b t adds the UTF-8 encoding of t to b.
val buffer_add_utf_16le : Buffer.t -> t -> unit
buffer_add_utf_16le b t adds the UTF-16LE encoding of t to b.
val buffer_add_utf_16be : Buffer.t -> t -> unit
buffer_add_utf_16be b t adds the UTF-16BE encoding of t to b.
val pp : Format.formatter -> t -> unit
pp ppf t prints the UTF-8 encoding of t instructing the ppf to use a length of 1 for each grapheme cluster of t.
val pp_text : Format.formatter -> t -> unit
pp_text ppf t is like Utext.pp except each line break is hinted to the formatter, see Uuseg_string.pp_utf_8_text for details.
val pp_lines : Format.formatter -> t -> unit
pp_lines ppf t is like Utext.pp except only mandatory line breaks are hinted to the formatter, see Uuseg_string.pp_utf_8_lines for details.
val pp_uchars : Format.formatter -> t -> unit
pp_uchars ppf t formats t as a sequence of OCaml Uchar.t values using only US-ASCII encoded characters according to the Unicode notational convention for code points.
val pp_toplevel : Format.formatter -> t -> unit
pp_toplevel ppf t
formats t
using Utext.escaped
and Utext.pp
in a manner
suitable for the toplevel to represent a Utext.t
value.
Warning. Before OCaml 4.06 the result might not be cut and pastable
as \u{H+}
escapes are not supported.
val pp_toplevel_pvec : Format.formatter -> t Pvec.t -> unit