Module Bytesrw_utf

UTF streams.

A few tools to deal with UTF encoded streams.

Encoding

type encoding = [
  | `Utf_8
  | `Utf_16be
  | `Utf_16le
]

The type for UTF encodings.

val pp_encoding : Stdlib.Format.formatter -> [< encoding | `Utf_16 ] -> unit

pp_encoding formats an encoding as its IANA character set name.
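As an illustration, here is a standalone sketch of such a formatter. It does not use the library: the names iana_name and pp_encoding_sketch are hypothetical, and the strings are the names registered with IANA for these encodings, which is assumed (not confirmed here) to be what pp_encoding outputs.

```ocaml
(* Hypothetical helper: maps an encoding tag to its IANA-registered
   character set name. Not part of Bytesrw_utf. *)
let iana_name = function
  | `Utf_8 -> "UTF-8"
  | `Utf_16 -> "UTF-16"
  | `Utf_16be -> "UTF-16BE"
  | `Utf_16le -> "UTF-16LE"

(* A formatter with the same shape as pp_encoding, usable with "%a". *)
let pp_encoding_sketch ppf e = Format.pp_print_string ppf (iana_name e)

let () = Format.printf "%a@." pp_encoding_sketch `Utf_16be
```

Because the argument is an open variant type bounded above, the same formatter accepts both the three concrete encodings and the byte-order-agnostic `Utf_16 tag.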

Encoding guess

val guess_reader_encoding : Bytesrw.Bytes.Reader.t -> encoding

guess_reader_encoding r guesses the encoding at the stream position of r by sniffing up to three bytes and applying the encoding guess heuristic described at the end of this page. The heuristic is subject to change in the future.

Validate

ensure_reads encoding r filters the reads of r to make sure the stream is a valid encoding byte stream. Invalid byte sequences

Encoding guess heuristic

Note, this was taken from Uutf. Twelve years later I'm not sure it's the best way to go about it. In particular, it was constrained by making the JSON guess work according to the old spec: JSON starts with ASCII, but international text does not.

The heuristic is compatible with BOM based recognition and the old JSON encoding recognition that relies on ASCII being present at the beginning of the stream (JSON mandates UTF-8 nowadays).

The heuristic looks at the first three bytes of input (or fewer if the input is shorter) and takes the first matching byte pattern in the table below.

xx = any byte
.. = any byte or no byte (input too small)
pp = positive (non-zero) byte
uu = valid UTF-8 first byte

Bytes    | Guess     | Rationale
---------+-----------+-----------------------------------------------
EF BB BF | `Utf_8    | UTF-8 BOM
FE FF .. | `Utf_16be | UTF-16BE BOM
FF FE .. | `Utf_16le | UTF-16LE BOM
00 pp .. | `Utf_16be | ASCII UTF-16BE and U+0000 is often forbidden
pp 00 .. | `Utf_16le | ASCII UTF-16LE and U+0000 is often forbidden
uu .. .. | `Utf_8    | ASCII UTF-8 or valid UTF-8 first byte
xx xx .. | `Utf_16be | Not UTF-8 => UTF-16, no BOM => UTF-16BE
.. .. .. | `Utf_8    | Single malformed UTF-8 byte or no input
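
The table above can be read as a first-match decision list. The following is a minimal standalone sketch of it over a string prefix, not the library's implementation. The function names guess and is_utf_8_first_byte are hypothetical. Two readings are assumptions: that "positive byte" means a non-zero byte, and that a "valid UTF-8 first byte" is an ASCII byte or a lead byte in the range 0xC2 to 0xF4.

```ocaml
type encoding = [ `Utf_8 | `Utf_16be | `Utf_16le ]

(* Assumption: "uu" is an ASCII byte (0x00-0x7F) or the lead byte of a
   well-formed multi-byte UTF-8 sequence (0xC2-0xF4). *)
let is_utf_8_first_byte b = b <= 0x7F || (0xC2 <= b && b <= 0xF4)

(* [guess s] looks at the first three bytes of [s] (or fewer if [s] is
   shorter) and returns the guess of the first matching table row.
   Assumption: "pp" (positive byte) is any byte greater than zero. *)
let guess (s : string) : encoding =
  let len = String.length s in
  let byte i = Char.code s.[i] in
  if len >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF
  then `Utf_8                                                 (* EF BB BF *)
  else if len >= 2 && byte 0 = 0xFE && byte 1 = 0xFF
  then `Utf_16be                                              (* FE FF .. *)
  else if len >= 2 && byte 0 = 0xFF && byte 1 = 0xFE
  then `Utf_16le                                              (* FF FE .. *)
  else if len >= 2 && byte 0 = 0x00 && byte 1 > 0
  then `Utf_16be                                              (* 00 pp .. *)
  else if len >= 2 && byte 0 > 0 && byte 1 = 0x00
  then `Utf_16le                                              (* pp 00 .. *)
  else if len >= 1 && is_utf_8_first_byte (byte 0)
  then `Utf_8                                                 (* uu .. .. *)
  else if len >= 2
  then `Utf_16be                                              (* xx xx .. *)
  else `Utf_8                                                 (* .. .. .. *)
```

Row order matters: the byte 00 is itself a valid UTF-8 first byte, so the 00 pp row has to be tested before the uu row for BOM-less UTF-16BE ASCII text to be recognized.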