Bytesrw_utf
UTF streams.
A few tools to deal with UTF encoded streams.
val pp_encoding : Stdlib.Format.formatter -> [< encoding | `Utf_16 ] -> unit
pp_encoding
formats encoding to its IANA character setname.
val guess_reader_encoding : Bytesrw.Bytes.Reader.t -> encoding
guess_reader_encoding r
guesses the encoding at the stream position of r
by sniffing three bytes and applying this heuristic which is subject to change in the future.
val ensure_reads : encoding -> Bytesrw.Bytes.Reader.filter
ensure_reads encoding r
filters the reads of r
to make sure the stream is a valid encoding
byte stream. Invalid byte sequences
Note, this was taken from Uutf. Twelve years laters I'm not sure it's the best way to go about it, in particular this was constrained by the making the JSON guess according to the old spec, JSON starts with ASCII but international text does not.
The heuristic is compatible with BOM based recognition and the old JSON encoding recognition that relies on ASCII being present at the beginning of the stream (JSON mandates UTF-8 nowadays).
The heuristic looks at the first three bytes of input (or less if impossible) and takes the first matching byte pattern in the table below.
xx = any byte .. = any byte or no byte (input too small) pp = positive byte uu = valid UTF-8 first byte Bytes | Guess | Rationale ---------+-----------+----------------------------------------------- EF BB BF | `UTF_8 | UTF-8 BOM FE FF .. | `UTF_16BE | UTF-16BE BOM FF FE .. | `UTF_16LE | UTF-16LE BOM 00 pp .. | `UTF_16BE | ASCII UTF-16BE and U+0000 is often forbidden pp 00 .. | `UTF_16LE | ASCII UTF-16LE and U+0000 is often forbidden uu .. .. | `UTF_8 | ASCII UTF-8 or valid UTF-8 first byte. xx xx .. | `UTF_16BE | Not UTF-8 => UTF-16, no BOM => UTF-16BE .. .. .. | `UTF_8 | Single malformed UTF-8 byte or no input.