Unicode text segmentation.
Uuseg segments Unicode text. It implements the locale
independent Unicode text segmentation algorithms to detect
grapheme cluster, word and sentence boundaries and the Unicode
line breaking algorithm to detect line break opportunities.
The module is independent from any IO mechanism or Unicode text data structure and it can process text without a complete in-memory representation.
The supported Unicode version is determined by the unicode_version value.
Warning. Version 12.0.0 of UAX #29 grapheme cluster and word segmentation is not strictly conformant with respect to emojis; see this issue for details.
v12.0.0, Unicode version 12.0.0
val unicode_version : string
unicode_version is the Unicode version supported by Uuseg.
type custom
The type for custom segmenters. See custom.
type boundary =
[ `Custom of custom
| `Grapheme_cluster
| `Line_break
| `Sentence
| `Word ]
The type for boundaries.
`Grapheme_cluster determines extended grapheme cluster boundaries according to UAX #29 (corresponds, for most scripts, to user-perceived characters).
`Word determines word boundaries according to UAX #29.
`Sentence determines sentence boundaries according to UAX #29.
`Line_break determines mandatory line breaks and line break opportunities according to UAX #14.
val pp_boundary :
Stdlib.Format.formatter -> boundary -> unit
pp_boundary ppf b prints an unspecified representation of b on ppf.
type t
The type for Unicode text segmenters.
type ret =
[ `Await | `Boundary | `End | `Uchar of Stdlib.Uchar.t ]
The type for segmenter results. See add.
val create :
[< boundary ] -> t
create b is a Unicode text segmenter for boundaries of type b.
val boundary :
t -> boundary
boundary s is the type of boundaries detected by s.
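As a small illustration of create, boundary and pp_boundary used together, here is a minimal sketch. The helper names describe and is_word are hypothetical and not part of Uuseg. Since the printed representation is unspecified, the sketch pattern matches on the boundary value when it needs to make a decision.

```ocaml
(* Hypothetical helpers, not part of Uuseg. *)

(* Render the boundary type of a segmenter with Uuseg.pp_boundary.
   The exact text is unspecified, so it should not be parsed. *)
let describe (seg : Uuseg.t) : string =
  Format.asprintf "%a" Uuseg.pp_boundary (Uuseg.boundary seg)

(* Inspect the boundary type by pattern matching instead. *)
let is_word (seg : Uuseg.t) : bool =
  match Uuseg.boundary seg with `Word -> true | _ -> false

let () =
  let seg = Uuseg.create `Word in
  Format.printf "%s (word segmenter: %b)@." (describe seg) (is_word seg)
```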
val add :
t -> [ `Await | `End | `Uchar of Stdlib.Uchar.t ] -> ret
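add s v is:
`Boundary if there is a boundary at that point in the sequence of characters. The client must then call add with `Await until `Await or `End is returned.
`Uchar u if u is the next character in the sequence. The client must then call add with `Await until `Await or `End is returned.
`Await when the segmenter is ready to add a new `Uchar or `End.
`End when `End was added and all `Boundary and `Uchar were output.
For v use `Uchar u to add a new character to the sequence to segment and `End to signal the end of sequence. After adding one of these two values always call add with `Await until `Await or `End is returned.
Raises Invalid_argument if `Uchar or `End is added while the last add did not return `Await or if a `Uchar or `End is added after an `End was already added.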
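The protocol above can be sketched on a concrete task: counting extended grapheme clusters in a UTF-8 string. This is only an illustration under stated assumptions: it decodes with the standard library's String.get_utf_8_uchar, which assumes OCaml 4.14 or later, and grapheme_count is a hypothetical helper name, not a Uuseg function.

```ocaml
(* Sketch: count extended grapheme clusters in a UTF-8 string.
   Assumes OCaml >= 4.14 for String.get_utf_8_uchar. [grapheme_count]
   is a hypothetical helper, not part of Uuseg. *)
let grapheme_count (s : string) : int =
  let seg = Uuseg.create `Grapheme_cluster in
  (* [fresh] is true when the next character output starts a new cluster.
     After each `Uchar or `End we keep adding `Await until the segmenter
     answers `Await (ready for input) or `End (sequence finished). *)
  let rec drain (count, fresh) v = match Uuseg.add seg v with
  | `Boundary -> drain (count, true) `Await
  | `Uchar _ -> drain ((if fresh then count + 1 else count), false) `Await
  | `Await | `End -> (count, fresh)
  in
  let rec loop acc i =
    if i >= String.length s then fst (drain acc `End) else
    let d = String.get_utf_8_uchar s i in
    (* Malformed bytes decode to U+FFFD (Uchar.rep). *)
    let acc = drain acc (`Uchar (Uchar.utf_decode_uchar d)) in
    loop acc (i + Uchar.utf_decode_length d)
  in
  loop (0, true) 0
```

For instance, grapheme_count "e\u{301}" feeds two scalar values (a base letter followed by a combining accent) that form a single user-perceived character.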
val mandatory :
t -> bool
mandatory s is true if the last `Boundary returned by add was mandatory. This function only makes sense for
`Custom segmenters that sport that notion. For
other segmenters or if no
`Boundary was returned so far,
val copy :
t -> t
copy s is a copy of s in its current state. Subsequent adds on s do not affect the copy.
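The independence of a copy can be illustrated with a short sketch. The helpers uchar, feed, chars and copy_demo are hypothetical names introduced here. The sketch only counts the characters output by each segmenter, because when exactly a segmenter releases a buffered character is not specified.

```ocaml
(* Hypothetical helpers, not part of Uuseg. *)
let uchar c = `Uchar (Uchar.of_char c)

(* Add [v], then keep adding `Await until the segmenter returns `Await
   or `End. Returns every value returned on the way, in order. *)
let feed seg v =
  let rec go acc r = match r with
  | `Boundary | `Uchar _ -> go (r :: acc) (Uuseg.add seg `Await)
  | `Await | `End -> List.rev (r :: acc)
  in
  go [] (Uuseg.add seg v)

(* Number of characters among returned values. *)
let chars rets =
  List.length (List.filter (function `Uchar _ -> true | _ -> false) rets)

(* Returns the total number of characters output by the copy and by
   the original segmenter. *)
let copy_demo () =
  let s = Uuseg.create `Word in
  let before = chars (feed s (uchar 'a')) in
  let s' = Uuseg.copy s in                  (* snapshot after "a" *)
  let after_b = chars (feed s (uchar 'b')) in  (* only [s] sees "b" *)
  let at_end = chars (feed s `End) in
  let copy_end = chars (feed s' `End) in    (* the copy ends after "a" *)
  (before + copy_end, before + after_b + at_end)
```

The additions are sequenced with let bindings on purpose: OCaml does not specify the evaluation order of the operands of +, and the order of calls to Uuseg.add matters.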
val pp_ret :
Stdlib.Format.formatter -> [< ret ] -> unit
pp_ret ppf v prints an unspecified representation of v on ppf.
val custom :
?mandatory:('a -> bool) ->
name:string ->
create:(unit -> 'a) ->
copy:('a -> 'a) ->
add:('a -> [ `Await | `End | `Uchar of Stdlib.Uchar.t ] -> ret) ->
unit -> custom
custom ~mandatory ~name ~create ~copy ~add () is a custom segmenter.
name is a name to identify the segmenter.
create is called when the segmenter is created; it should return a custom segmenter value.
copy is called with the segmenter value whenever the segmenter is copied. It should return a copy of the segmenter value.
mandatory is called with the segmenter value to define the result of the Uuseg.mandatory function. Defaults always returns
add is called with the segmenter value to define the result of the Uuseg.add value. The returned value should respect the semantics of Uuseg.add. Use the functions err_exp_await and err_ended to raise Uuseg.add's error cases.
val err_exp_await :
[< ret ] -> 'a
err_exp_await fnd should be used by custom segmenters when the client tries to add a `Uchar or `End while the last returned value was not an `Await.
val err_ended :
[< ret ] -> 'a
err_ended () should be used by custom segmenters when the client tries to add a `Uchar or `End after `End was already added.
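To make the contract of custom segmenters more concrete, here is a minimal sketch of one that reports a boundary after each U+002C COMMA. Everything in it (the state record, comma, add, after_comma and the run helper) is hypothetical and only illustrates one way to respect the semantics of Uuseg.add; it is not how the built-in segmenters are implemented.

```ocaml
(* Sketch of a custom segmenter: boundary after each U+002C COMMA.
   All names below are hypothetical. *)
type state =
  { mutable queue : Uuseg.ret list; (* output still owed to the client *)
    mutable ready : bool;   (* no add yet, or last add returned `Await *)
    mutable ended : bool }  (* `End was added *)

let comma = Uchar.of_char ','

let add st (v : [ `Await | `End | `Uchar of Uchar.t ]) : Uuseg.ret =
  match v with
  | `Await ->
      begin match st.queue with
      | r :: rest -> st.queue <- rest; r
      | [] -> if st.ended then `End else (st.ready <- true; `Await)
      end
  | (`Uchar _ | `End) when st.ended -> Uuseg.err_ended v
  | (`Uchar _ | `End) when not st.ready -> Uuseg.err_exp_await v
  | `Uchar u ->
      st.ready <- false;
      (* Output the character now, the boundary on the next `Await. *)
      if Uchar.equal u comma then st.queue <- [`Boundary];
      `Uchar u
  | `End -> st.ended <- true; `End

let after_comma : Uuseg.custom =
  Uuseg.custom
    ~mandatory:(fun _ -> false) (* these boundaries are never mandatory *)
    ~name:"after-comma"
    ~create:(fun () -> { queue = []; ready = true; ended = false })
    ~copy:(fun st -> { st with queue = st.queue }) (* fresh record *)
    ~add ()

(* Segment a list of ASCII characters with any segmenter. *)
let run seg cs =
  let rec drain acc v = match Uuseg.add seg v with
  | `Uchar u -> drain (`U (Uchar.to_char u) :: acc) `Await
  | `Boundary -> drain (`B :: acc) `Await
  | `Await | `End -> acc
  in
  let feed acc c = drain acc (`Uchar (Uchar.of_char c)) in
  List.rev (drain (List.fold_left feed [] cs) `End)

let demo = run (Uuseg.create (`Custom after_comma)) ['a'; ','; 'b']
```

The ready flag exists so that adding a `Uchar or `End right after a returned `Uchar or `Boundary is rejected through err_exp_await, as Uuseg.add requires.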
The `Grapheme_cluster segmenter will always consume only a small bounded amount of memory on any text. Other segmenters will also do so on non-degenerate text, but it's possible to feed them with input that will make them buffer an arbitrary amount of characters.
A segmenter is a stateful filter that inputs a sequence of characters and outputs the same sequence except characters are interleaved with `Boundary values whenever the segmenter detects a boundary.
Uuseg.create returns a new segmenter for a given boundary type:
let words = Uuseg.create `Word
To add characters to the sequence to segment, call Uuseg.add on words with `Uchar _. To end the sequence call Uuseg.add on words with `End. The segmented sequence of characters is returned character by character, interleaved with `Boundary values at the appropriate places, by the successive calls to Uuseg.add.
The client and the segmenter must wait on each other to limit internal buffering: each time the client adds to the sequence by calling Uuseg.add with `Uchar or `End it must continue to call Uuseg.add with `Await until the segmenter returns `Await or `End. In practice this leads to the following kind of control flow:
let rec add acc v = match Uuseg.add words v with
| `Uchar u -> add (`Uchar u :: acc) `Await
| `Boundary -> add (`B :: acc) `Await
| `Await | `End -> acc
For example to segment the sequence <U+0041, U+0020, U+0042> ("a b") to a list of characters interleaved with `B values on word boundaries we can write:
let uchar u = `Uchar (Uchar.of_int u)
let seq = [uchar 0x0041; uchar 0x0020; uchar 0x0042]
let seq_words = List.rev (add (List.fold_left add [] seq) `End)
utf_8_segments seg s is the list of UTF-8 encoded seg segments of the UTF-8 encoded string s. This example uses Uutf to fold over the characters of s and to encode the characters in a standard
OCaml buffer. Note that this function can be derived directly from
let utf_8_segments seg s =
  let b = Buffer.create 42 in
  (* Move the buffered segment, if any, to the accumulator. *)
  let flush_segment acc =
    let segment = Buffer.contents b in
    Buffer.clear b; if segment = "" then acc else segment :: acc
  in
  let seg = Uuseg.create (seg :> Uuseg.boundary) in
  let rec add acc v = match Uuseg.add seg v with
  | `Uchar u -> Uutf.Buffer.add_utf_8 b u; add acc `Await
  | `Boundary -> add (flush_segment acc) `Await
  | `Await | `End -> acc
  in
  (* Malformed input is replaced by U+FFFD. *)
  let uchar acc _ = function
  | `Uchar _ as u -> add acc u
  | `Malformed _ -> add acc (`Uchar Uutf.u_rep)
  in
  List.rev (flush_segment (add (Uutf.String.fold_utf_8 uchar [] s) `End))
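As a further illustration, the same control flow can be combined with Uuseg.mandatory to list line break opportunities. This is a sketch only: line_breaks is a hypothetical name, and positions are counted in characters output so far, so a break reported at position n sits just before the character of index n.

```ocaml
(* Sketch: list the line break opportunities of a sequence of
   characters as (position, mandatory) pairs. [line_breaks] is a
   hypothetical helper, not part of Uuseg. *)
let line_breaks (chars : Uchar.t list) : (int * bool) list =
  let seg = Uuseg.create `Line_break in
  (* [pos] counts the characters output so far. Uuseg.mandatory is
     queried right after a `Boundary is returned. *)
  let rec drain (pos, acc) v = match Uuseg.add seg v with
  | `Uchar _ -> drain (pos + 1, acc) `Await
  | `Boundary -> drain (pos, (pos, Uuseg.mandatory seg) :: acc) `Await
  | `Await | `End -> (pos, acc)
  in
  let st = List.fold_left (fun st u -> drain st (`Uchar u)) (0, []) chars in
  List.rev (snd (drain st `End))

let () =
  (* "a b" followed by a line feed and "c". *)
  let text = List.map Uchar.of_int [0x0061; 0x0020; 0x0062; 0x000A; 0x0063] in
  List.iter
    (fun (pos, m) ->
       Printf.printf "break before index %d%s\n" pos
         (if m then " (mandatory)" else ""))
    (line_breaks text)
```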