Module B0_text.Textdec

Text decoder.

A text decoder inputs UTF-8 encoded characters from a string. It checks their validity and maintains the absolute byte position and line position (incrementing on LF, CR or CRLF) of the last decoded character. It also has a token buffer that can be used for lexing.

Decodes

type decode = int

The type for decodes. This is either an arbitrary Unicode scalar value, sot or eot. If it is neither sot nor eot, it can be safely converted to an Uchar.t value with Uchar.unsafe_of_int.

val sot : decode

sot is 0x11_0000 (Uchar.max + 1), an integer that represents the start of text.

val eot : decode

eot is 0x11_0001 (Uchar.max + 2), an integer that represents the end of text.
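The decode representation described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the two sentinel values are the ones documented here, while to_uchar and describe are hypothetical helpers introduced only for the example.

```ocaml
(* Sentinels as documented: both lie just above Uchar.max (0x10_FFFF),
   so they can never collide with a Unicode scalar value. *)
let sot = 0x11_0000
let eot = 0x11_0001

(* Hypothetical helper (not part of Textdec): converts a decode to a
   scalar value, or None for the two sentinels. *)
let to_uchar (dec : int) : Uchar.t option =
  if dec = sot || dec = eot then None
  else Some (Uchar.unsafe_of_int dec)

(* Hypothetical helper: a short description of a decode. *)
let describe (dec : int) : string =
  match to_uchar dec with
  | None when dec = sot -> "start of text"
  | None -> "end of text"
  | Some u -> Printf.sprintf "U+%04X" (Uchar.to_int u)
```

Using a plain int avoids allocating an option or a variant for each decoded character, at the price of having to test for the sentinels before converting.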

val pp_decode : Stdlib.Format.formatter -> decode -> unit

pp_decode formats decodes for inspection. This can be used in error messages; it escapes control characters and uses the strings "start of text" and "end of text" for sot and eot.

Decoders

type t

The type for text decoders.

val make : ?file:Textloc.filepath -> string -> t

make ~file s is a decoder for the UTF-8 text s, assumed to have been read from the file file (defaults to Textloc.file_none).

val input : t -> string

input d is the input string of d.

val file : t -> Textloc.filepath

file d is the file associated with d.

Decoding

val current : t -> decode

current d is the current decode. This is either:

  • sot, if next was never called on d.
  • eot, if all input characters have been decoded via next.
  • A Unicode scalar value.

val is_error : t -> bool

is_error d is true if current d is Uchar.rep because of a decoding error, rather than the result of a valid UTF-8 decode of U+FFFD.

val next : t -> unit

next d decodes the next UTF-8 character into current and updates the text locations. Repeated calls to next after current has become eot have no effect.

If a UTF-8 decoding error occurs, current becomes Uchar.rep and is_error returns true. next can still be called afterwards for best-effort decoding.
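The best-effort behaviour described above can be illustrated with the standard library's UTF-8 decoder. This is a sketch under assumptions, not the module's implementation: decode_all is a hypothetical function, it requires OCaml 4.14 or later for String.get_utf_8_uchar, and it returns all decodes at once rather than stepping with next.

```ocaml
(* Hypothetical sketch: decodes [s] to a list of scalar values, each
   paired with an error flag playing the role of [is_error]. *)
let decode_all (s : string) : (Uchar.t * bool) list =
  let len = String.length s in
  let rec loop i acc =
    if i >= len then List.rev acc else
    let dec = String.get_utf_8_uchar s i in
    let is_error = not (Uchar.utf_decode_is_valid dec) in
    (* On invalid input this is Uchar.rep (U+FFFD). *)
    let u = Uchar.utf_decode_uchar dec in
    (* Skip the bytes consumed, valid or not, and keep decoding. *)
    loop (i + Uchar.utf_decode_length dec) ((u, is_error) :: acc)
  in
  loop 0 []
```

The error flag is what separates a malformed byte sequence from a U+FFFD that was validly encoded in the input: both yield Uchar.rep, but only the former sets the flag.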

Text locations

Byte positions

val first_byte_pos : t -> Textloc.byte_pos

first_byte_pos d is the first byte position of the current decode. If current is:

  • sot, this is 0.
  • eot, this is String.length (input d).
  • A Unicode scalar value, this is the first index in input d of its UTF-8 encoding.

val last_byte_pos : t -> Textloc.byte_pos

last_byte_pos d is the last byte position of the current decode. If current is:

  • sot, this is 0.
  • eot, this is String.length (input d).
  • A Unicode scalar value, this is the last index in input d of its UTF-8 encoding.
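To make the first and last byte positions of a scalar value concrete, here is a stdlib-only sketch. byte_spans is a hypothetical function, not part of this module, and it assumes OCaml 4.14 or later for String.get_utf_8_uchar.

```ocaml
(* Hypothetical sketch: for each decoded character of [s], the index of
   the first and of the last byte of its UTF-8 encoding. *)
let byte_spans (s : string) : (int * int) list =
  let len = String.length s in
  let rec loop i acc =
    if i >= len then List.rev acc else
    let n = Uchar.utf_decode_length (String.get_utf_8_uchar s i) in
    (* The character occupies bytes i .. i + n - 1, inclusive. *)
    loop (i + n) ((i, i + n - 1) :: acc)
  in
  loop 0 []
```

For the input "a\xC3\xA9" (the letters a and é) this yields (0, 0) for a and (1, 2) for é: a single-byte character has equal first and last positions, a multi-byte one does not.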

Line positions

val line_num : t -> Textloc.line_num

line_num d is the current line number.

val line_start : t -> Textloc.byte_pos

line_start d is the first byte position on the current line. See Textloc.line_pos.

val line_pos : t -> Textloc.line_pos

line_pos d is the line position of the current decode.

val prev_line_num : t -> Textloc.line_num

prev_line_num d is the previous line number. This is line_num d minus one, or 1 on the first line.

val prev_line_start : t -> Textloc.byte_pos

prev_line_start d is the line start of the previous line.

val prev_line_pos : t -> Textloc.line_pos

prev_line_pos d is the line position of the previous line.
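The line tracking rule (a new line starts after LF, CR or CRLF, with CRLF counted once) can be sketched as follows. line_starts is a hypothetical stdlib-only function, not the module's implementation, and it assumes that a line position is a pair of a one-based line number and the byte position of the line's first byte.

```ocaml
(* Hypothetical sketch: the (line number, line start) pair of every
   line of [s]. Scanning bytes is safe on UTF-8 text since the bytes
   0x0A and 0x0D never occur inside a multi-byte sequence. *)
let line_starts (s : string) : (int * int) list =
  let len = String.length s in
  let rec loop i num acc =
    if i >= len then List.rev acc else
    match s.[i] with
    | '\r' when i + 1 < len && s.[i + 1] = '\n' ->
        (* CRLF is a single newline spanning two bytes. *)
        loop (i + 2) (num + 1) ((num + 1, i + 2) :: acc)
    | '\r' | '\n' ->
        loop (i + 1) (num + 1) ((num + 1, i + 1) :: acc)
    | _ -> loop (i + 1) num acc
  in
  loop 0 1 [ (1, 0) ]
```

In this model the previous line position of a given line is simply the preceding element of the list, which is why the first line has no distinct predecessor.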

Text locations

pos d is first_byte_pos d, line_pos d. This is the first position of the current decode.

val textloc : t -> Textloc.t

textloc d is the text location of the current decode. It spans the UTF-8 bytes of the decode and is on line_pos d.

val textloc_span : t -> start:(Textloc.byte_pos * Textloc.line_pos) -> Textloc.t

textloc_span d ~start is a text location that spans from start to the last byte of the current decode.

val textloc_span_to_prev_decode : t -> start:(Textloc.byte_pos * Textloc.line_pos) -> Textloc.t

textloc_span_to_prev_decode d ~start is a text location that spans from start to the last byte of the previous decode.

Lexeme buffer

val lexeme_clear : t -> unit

lexeme_clear d clears the lexeme buffer.

val lexeme_pop : t -> string

lexeme_pop d gets the lexeme buffer contents and clears it.

val lexeme_add : t -> Stdlib.Uchar.t -> unit

lexeme_add d u adds the UTF-8 encoding of u to the lexeme buffer.
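The three lexeme buffer operations map naturally onto Stdlib.Buffer. The sketch below is an assumption about how such a buffer could work, not the module's implementation: it uses a single global buffer instead of one per decoder, so the functions take unit where the documented ones take d.

```ocaml
(* Hypothetical stand-in for the decoder's lexeme buffer. *)
let lexeme : Buffer.t = Buffer.create 255

let lexeme_clear () : unit = Buffer.clear lexeme

(* Appends the UTF-8 encoding of [u]. *)
let lexeme_add (u : Uchar.t) : unit = Buffer.add_utf_8_uchar lexeme u

(* Returns the accumulated bytes and resets the buffer. *)
let lexeme_pop () : string =
  let s = Buffer.contents lexeme in
  Buffer.clear lexeme;
  s
```

A lexer would typically add each accepted character while scanning a token and pop the buffer once the token ends, which leaves it empty for the next token.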