Module Uucp.Break

module Break: sig .. end
Break properties.

These properties are mainly for the Unicode text segmentation and line breaking algorithms.

Line break


type line = [ `AI
| `AL
| `B2
| `BA
| `BB
| `BK
| `CB
| `CJ
| `CL
| `CM
| `CP
| `CR
| `EB
| `EM
| `EX
| `GL
| `H2
| `H3
| `HL
| `HY
| `ID
| `IN
| `IS
| `JL
| `JT
| `JV
| `LF
| `NL
| `NS
| `NU
| `OP
| `PO
| `PR
| `QU
| `RI
| `SA
| `SG
| `SP
| `SY
| `WJ
| `XX
| `ZW
| `ZWJ ]
The type for line breaks.
val pp_line : Format.formatter -> line -> unit
pp_line ppf l prints an unspecified representation of l on ppf.
val line : Uucp.uchar -> line
line u is u's line break property.
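
As an illustration of consuming this property, the sketch below picks out the classes that the Unicode line breaking algorithm treats as forcing a break (`BK, `CR, `LF and `NL). The name is_hard_break_class is hypothetical and not part of Uucp; the function works on the variant only, so it can be applied to the result of line u.

```ocaml
(* Hypothetical helper, not part of Uucp: true for the line break
   classes that force a break (BK, CR, LF, NL). It only inspects the
   polymorphic variant, so it accepts any value of type line. *)
let is_hard_break_class = function
  | `BK | `CR | `LF | `NL -> true
  | _ -> false
```

With the library linked, a call would look like is_hard_break_class (Uucp.Break.line u). Note that a full line breaker must still treat CR followed by LF as a single break.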

Grapheme cluster break


type grapheme_cluster = [ `CN
| `CR
| `EB
| `EBG
| `EM
| `EX
| `GAZ
| `L
| `LF
| `LV
| `LVT
| `PP
| `RI
| `SM
| `T
| `V
| `XX
| `ZWJ ]
The type for grapheme cluster breaks.
val pp_grapheme_cluster : Format.formatter -> grapheme_cluster -> unit
pp_grapheme_cluster ppf g prints an unspecified representation of g on ppf.
val grapheme_cluster : Uucp.uchar -> grapheme_cluster
grapheme_cluster u is u's grapheme cluster break property.

Word break


type word = [ `CR
| `DQ
| `EB
| `EBG
| `EM
| `EX
| `Extend
| `FO
| `GAZ
| `HL
| `KA
| `LE
| `LF
| `MB
| `ML
| `MN
| `NL
| `NU
| `RI
| `SQ
| `XX
| `ZWJ ]
The type for word breaks.
val pp_word : Format.formatter -> word -> unit
pp_word ppf w prints an unspecified representation of w on ppf.
val word : Uucp.uchar -> word
word u is u's word break property.

Sentence break


type sentence = [ `AT
| `CL
| `CR
| `EX
| `FO
| `LE
| `LF
| `LO
| `NU
| `SC
| `SE
| `SP
| `ST
| `UP
| `XX ]
The type for sentence breaks.
val pp_sentence : Format.formatter -> sentence -> unit
pp_sentence ppf s prints an unspecified representation of s on ppf.
val sentence : Uucp.uchar -> sentence
sentence u is u's sentence break property.

East Asian width


type east_asian_width = [ `A | `F | `H | `N | `Na | `W ] 
The type for East Asian widths.
val pp_east_asian_width : Format.formatter -> east_asian_width -> unit
pp_east_asian_width ppf w prints an unspecified representation of w on ppf.
val east_asian_width : Uucp.uchar -> east_asian_width
east_asian_width u is u's East Asian width property.

Terminal width


val tty_width_hint : Uucp.uchar -> int
tty_width_hint u approximates u's column width as rendered by a typical character terminal.

The current implementation returns 0, 1, 2 or -1. The value -1 is returned only for scalar values for which the property is nonsensical: those in the ranges U+0001-U+001F (C0 controls without U+0000) and U+007F-U+009F (DELETE and C1 controls). Clients are expected to sanitize their inputs and not to call the function on these scalar values.
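
One way to do this sanitizing is a thin wrapper that rejects the two control ranges before asking for a width. This is a minimal sketch: is_unsupported_control and checked_width are hypothetical names, and the width function is passed in as a parameter named hint so that the code stands on its own. In practice Uucp.Break.tty_width_hint would be supplied, assuming Uucp.uchar is int in the version at hand.

```ocaml
(* Hypothetical helper: true for U+0001-U+001F and U+007F-U+009F,
   the scalar values for which the width is reported as -1. *)
let is_unsupported_control (u : int) : bool =
  (0x0001 <= u && u <= 0x001F) || (0x007F <= u && u <= 0x009F)

(* Hypothetical wrapper: [hint] stands for a width function such as
   Uucp.Break.tty_width_hint. Returns None instead of a negative
   width, so callers have to handle the unsupported case explicitly. *)
let checked_width ~(hint : int -> int) (u : int) : int option =
  if is_unsupported_control u then None
  else
    let w = hint u in
    if w < 0 then None else Some w
```

Returning an option rather than -1 keeps an accidental negative width from being summed silently into a column count.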

Note. Converting a string to normalization form C before folding this function over its scalar values will, in general, yield better approximations (e.g. on Hangul).
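
Folding the function over a sequence of scalar values might look like the sketch below. The name columns is hypothetical, and the width function is again a parameter named hint (Uucp.Break.tty_width_hint would be passed in practice, assuming Uucp.uchar is int). Decoding and normalization to form C are left to the caller, and a width of -1 is counted as 0, which is a policy choice of this sketch rather than library behavior.

```ocaml
(* Hypothetical helper: sums per-scalar width hints over a list of
   scalar values that the caller has already decoded (and ideally
   normalized to NFC). Negative hints are clamped to 0. *)
let columns ~(hint : int -> int) (scalars : int list) : int =
  List.fold_left (fun acc u -> acc + max 0 (hint u)) 0 scalars
```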

Warning. This is a heuristic, not a normative property. If you find yourself using this function, please read the following carefully.

This function is the moral equivalent of POSIX wcwidth, in that its purpose is to help align text displayed by a character terminal. It mimics wcwidth, as widely implemented, in yet another way: it is mostly wrong.

Computing column width is a surprisingly difficult task in general. Much of the software infrastructure still carries legacy assumptions about the nature of text that hark back to the ASCII era. Different terminal emulators attempt to cope with general Unicode text in different ways, creating a fundamental problem: the width of text fragments will vary across terminal emulators, with no way of getting feedback from the output layer back into the text-producing layer.

For example: on a modern Linux system, a collection of terminals will disagree on some or all of U+00AD, U+0CBF, and U+2029. They will likewise disagree about unassigned characters (category Cn), sometimes contradicting the system's wcwidth (e.g. U+0378, U+0530). Terminals using bare libxft will display complex scripts differently from terminals using HarfBuzz, and the rendering on OS X will be slightly different from both.

tty_width_hint uses a simple and predictable width algorithm, based on Markus Kuhn's portable wcwidth.

This approach works well, in that it gives results generally consistent with a wide range of terminals, for alphabetic scripts and for East Asian syllabic and logographic scripts in non-decomposed form. Support varies for abjad scripts in the presence of vowel marks, and it mostly breaks down on abugidas.
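
For orientation, the rules that Markus Kuhn documents for his wcwidth can be sketched as follows. This illustrates that style of algorithm under stated assumptions and is not necessarily what tty_width_hint does internally. The name kuhn_style_width is hypothetical, and the two property lookups are taken as parameters: an East Asian width function (such as east_asian_width above) and a general category function (the Uucp.Gc module is assumed to provide one).

```ocaml
(* Hypothetical sketch of a Kuhn-style width rule set. The lookups
   are parameters so that the code does not depend on any library. *)
let kuhn_style_width ~east_asian_width ~general_category (u : int) : int =
  if u = 0x0000 then 0                        (* null character *)
  else if u < 0x0020 || (0x007F <= u && u <= 0x009F) then -1
                                              (* C0, DELETE, C1 controls *)
  else if u = 0x00AD then 1                   (* soft hyphen *)
  else if u = 0x200B then 0                   (* zero width space *)
  else if 0x1160 <= u && u <= 0x11FF then 0   (* Hangul medial/final jamo *)
  else
    match general_category u with
    | `Mn | `Me | `Cf -> 0                    (* non-spacing and format *)
    | _ ->
      match east_asian_width u with
      | `W | `F -> 2                          (* wide and fullwidth *)
      | _ -> 1                                (* everything else *)
```

The order of the tests matters: the soft hyphen has general category Cf, so it must be handled before the category check to come out as 1 wide.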

Moreover, non-text symbols like Emoji or Yijing hexagrams will be incorrectly classified as 1-wide, but this in fact agrees with their rendering on many terminals.

Clients should not over-rely on tty_width_hint. It provides a best-effort approximation which will sometimes fail in practice.


Low level interface


module Low: sig .. end
Low level interface.