Bytesrw_utf
UTF streams.
A few tools to deal with UTF-encoded streams. For now this is just encoding guessing; more may be added in the future.
Sample code for decoding UTF-8 with position tracking using a byte stream reader and encoding UTF-8 with a byte stream writer can be found here.
module Encoding : sig ... end
Encoding specification.
val guess_reader_encoding : Bytesrw.Bytes.Reader.t -> Encoding.t
guess_reader_encoding r guesses the encoding at the stream position of r by sniffing three bytes and applying this heuristic, which is subject to change in the future.
The heuristic is compatible with BOM based recognition and the old JSON encoding recognition (UTF-8 is mandated nowadays) that relies on ASCII being present at the beginning of the stream.
The heuristic looks at the first three bytes of input (or fewer if the input is shorter) and takes the first matching byte pattern in the table below.
xx = any byte
.. = any byte or no byte (input too small)
pp = positive byte
uu = valid UTF-8 first byte

Bytes    | Guess     | Rationale
---------+-----------+-----------------------------------------------
EF BB BF | `UTF_8    | UTF-8 BOM
FE FF .. | `UTF_16BE | UTF-16BE BOM
FF FE .. | `UTF_16LE | UTF-16LE BOM
00 pp .. | `UTF_16BE | ASCII UTF-16BE and U+0000 is often forbidden
pp 00 .. | `UTF_16LE | ASCII UTF-16LE and U+0000 is often forbidden
uu .. .. | `UTF_8    | ASCII UTF-8 or valid UTF-8 first byte.
xx xx .. | `UTF_16BE | Not UTF-8 => UTF-16, no BOM => UTF-16BE
.. .. .. | `UTF_8    | Single malformed UTF-8 byte or no input.
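
The table above can be read as a first-match pattern match over at most three bytes. The following is a minimal sketch of that reading on a plain string prefix, not the library's implementation. The names guess, is_positive, is_utf_8_first_byte and guess_of_prefix are hypothetical, the guess type is a local stand-in rather than Encoding.t, and two interpretations are assumptions: "pp" is taken to mean any non-zero byte, and "uu" is taken to mean an ASCII byte or a UTF-8 lead byte in the range C2..F4.

```ocaml
(* Local stand-in for the result of the guess; this is NOT the library's
   Encoding.t, only the three tags that appear in the table. *)
type guess = [ `UTF_8 | `UTF_16BE | `UTF_16LE ]

(* Assumption: "pp" (positive byte) means any non-zero byte. *)
let is_positive b = b > 0

(* Assumption: "uu" means a byte that can start a well-formed UTF-8
   sequence, i.e. ASCII (00..7F) or a lead byte in C2..F4. *)
let is_utf_8_first_byte b = b <= 0x7F || (0xC2 <= b && b <= 0xF4)

(* Apply the table rows in order to the first bytes of [s].
   A missing byte (input too small) is represented by [None]. *)
let guess_of_prefix (s : string) : guess =
  let len = String.length s in
  let byte i = if i < len then Some (Char.code s.[i]) else None in
  match byte 0, byte 1, byte 2 with
  | Some 0xEF, Some 0xBB, Some 0xBF -> `UTF_8          (* UTF-8 BOM *)
  | Some 0xFE, Some 0xFF, _ -> `UTF_16BE               (* UTF-16BE BOM *)
  | Some 0xFF, Some 0xFE, _ -> `UTF_16LE               (* UTF-16LE BOM *)
  | Some 0x00, Some p, _ when is_positive p -> `UTF_16BE
  | Some p, Some 0x00, _ when is_positive p -> `UTF_16LE
  | Some u, _, _ when is_utf_8_first_byte u -> `UTF_8
  | Some _, Some _, _ -> `UTF_16BE                     (* not UTF-8, no BOM *)
  | _, _, _ -> `UTF_8                                  (* one bad byte or no input *)
```

Because rows are tried top to bottom, the BOM rows take precedence over the ASCII rows, and a lone byte that cannot start a UTF-8 sequence falls through to the last row. For example, under the assumptions above, guess_of_prefix "{\x00" yields `UTF_16LE and guess_of_prefix "" yields `UTF_8.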