Module Bytesrw_utf

UTF streams.

A few tools for dealing with UTF-encoded streams. For now this is limited to encoding guessing; more may be added in the future.

Sample code for decoding UTF-8 with position tracking using a byte stream reader and encoding UTF-8 with a byte stream writer can be found here.

Encodings

module Encoding : sig ... end

Encoding specification.

Encoding guess

val guess_reader_encoding : Bytesrw.Bytes.Reader.t -> Encoding.t

guess_reader_encoding r guesses the encoding at the stream position of r by sniffing up to three bytes and applying the heuristic described below, which is subject to change in the future.

Encoding guess heuristic

The heuristic is compatible with BOM-based recognition and with the old JSON encoding recognition (UTF-8 is mandated nowadays), which relies on ASCII being present at the beginning of the stream.

The heuristic looks at the first three bytes of input (or fewer if the input is shorter) and takes the first matching byte pattern in the table below.

xx = any byte
.. = any byte or no byte (input too small)
pp = positive (non-zero) byte
uu = valid UTF-8 first byte

Bytes    | Guess     | Rationale
---------+-----------+-----------------------------------------------
EF BB BF | `UTF_8    | UTF-8 BOM
FE FF .. | `UTF_16BE | UTF-16BE BOM
FF FE .. | `UTF_16LE | UTF-16LE BOM
00 pp .. | `UTF_16BE | ASCII UTF-16BE and U+0000 is often forbidden
pp 00 .. | `UTF_16LE | ASCII UTF-16LE and U+0000 is often forbidden
uu .. .. | `UTF_8    | ASCII UTF-8 or valid UTF-8 first byte
xx xx .. | `UTF_16BE | Not UTF-8 => UTF-16, no BOM => UTF-16BE
.. .. .. | `UTF_8    | Single malformed UTF-8 byte or no input
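
The table can be read as an ordered sequence of tests on a short byte prefix. The following is a minimal standalone sketch of that reading, not the library's implementation: the type guess and the functions is_utf_8_first_byte and guess_prefix are hypothetical names introduced for illustration, and it works on a plain string rather than on a Bytesrw.Bytes.Reader.t. It also assumes that "valid UTF-8 first byte" means a byte in the range 00-7F or C2-F4, and that "positive byte" means a non-zero byte.

```ocaml
(* Hypothetical local type for illustration; this is not the library's
   Encoding.t. *)
type guess = Utf_8 | Utf_16be | Utf_16le

(* Assumption: a "valid UTF-8 first byte" is a byte that can start a
   well-formed UTF-8 sequence, i.e. ASCII (00-7F) or a lead byte (C2-F4). *)
let is_utf_8_first_byte (c : char) : bool =
  let b = Char.code c in
  b <= 0x7F || (0xC2 <= b && b <= 0xF4)

(* [guess_prefix s] applies the rows of the table, in order, to the first
   bytes of [s]. Only the first three bytes are ever inspected. *)
let guess_prefix (s : string) : guess =
  let len = String.length s in
  let b i = Char.code s.[i] in
  if len >= 3 && b 0 = 0xEF && b 1 = 0xBB && b 2 = 0xBF then Utf_8 (* BOM *)
  else if len >= 2 && b 0 = 0xFE && b 1 = 0xFF then Utf_16be (* BOM *)
  else if len >= 2 && b 0 = 0xFF && b 1 = 0xFE then Utf_16le (* BOM *)
  else if len >= 2 && b 0 = 0x00 && b 1 > 0 then Utf_16be (* 00 pp *)
  else if len >= 2 && b 0 > 0 && b 1 = 0x00 then Utf_16le (* pp 00 *)
  else if len >= 1 && is_utf_8_first_byte s.[0] then Utf_8 (* uu *)
  else if len >= 2 then Utf_16be (* xx xx *)
  else Utf_8 (* single malformed byte or no input *)
```

Because the rows are tried in order, a two-byte prefix such as EF BB (a truncated UTF-8 BOM) falls through the BOM and ASCII UTF-16 rows and is classified by the uu row, while a prefix such as 80 80 reaches the xx xx row.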