Module Sexpm

module Sexpm: sig .. end
S-expression codecs and updates.

Sexpm codecs sequences of s-expressions to different representations. Support is provided for:

Consult the syntax of s-expressions, tips and limitations and examples of use.

4ea5218 — homepage



IO


type input = unit -> (Bytes.t * int * int) option 
The type for input byte streams. On Some (b, pos, len) the bytes of b from pos to pos+len are readable until the next call to the function. On None the end of stream is reached.
val input_of_string : string -> input
input_of_string s is a byte stream reading from s.
val input_of_in_channel : ?bytes:Bytes.t -> Pervasives.in_channel -> input
input_of_in_channel ~bytes ic is a byte stream reading from ic using bytes to read the data.
Raises Invalid_argument if the length of bytes is 0.
type output = (Bytes.t * int * int) option -> unit 
The type for output byte streams. Given Some (b, pos, len) the bytes of b from pos to pos+len are readable until the function returns. On None the end of stream is signalled to the function.
val output_of_buffer : Buffer.t -> output
output_to_buffer b is an output stream writing to buffer b.
val output_of_out_channel : Pervasives.out_channel -> output
output_of_out_channel oc is an output stream writing to oc. It is the client's duty to flush it or close it.

Decoders


type pos = int * int 
The type for character positions. A one-based line and zero-based column number.
type range = pos * pos 
The type for character ranges.
val pp_range : Format.formatter -> range -> unit
pp_range ppf r prints r on ppf according to GNU conventions.
type error = [ `Codec of string
| `Illegal_bytes of string
| `Illegal_escape of
[ `Exp_hex of Uchar.t
| `Exp_hex_rbrace of Uchar.t
| `Exp_lbrace of Uchar.t
| `Exp_rbrace of Uchar.t
| `Not_esc of Uchar.t
| `Not_uchar of int ]
| `Illegal_uchar of Uchar.t
| `Unclosed of [ `Escape | `List | `Quoted_token ]
| `Unexpected_le ]
The type for decode errors.
val pp_error : Format.formatter -> error -> unit
pp_error ppf e prints an unspecified representation of e on ppf.
type decoder 
The type for UTF-8 s-expression decoders.
val decoder : ?layout:bool -> input -> decoder
decoder i is a decoder decoding s-expressions from i the UTF-8 encoded byte stream i.

If layout is true (defaults to false), whitespace, comments and raw atoms are preserved by the underlying lexeme decoder. This allows the layout of the decoded sequence's concrete syntax to be preserved.

val decoder_layout : decoder -> bool
decoder_layout d is the value of layout, see Sexpm.decoder.
val decoded_range : decoder -> range
decoded_range d is the range of characters spanning the last decoded lexeme or error.
val decoded_sexp_range : decoder -> range
decoder_sexp_range d is the range of characters spanning the last fully parsed s-expression. The value is accurate only after Atom _ and Le lexemes.

Encoders


type encoder 
The type for UTF-8 s-expression encoders.
type encoder_style = [ `Minify | `Pp | `Raw ] 
The type for the encoder output style.
val encoder : ?bytes:Bytes.t ->
?nl:bool ->
?quote:bool -> ?style:encoder_style -> output -> encoder
encoder ~nl ~bytes ~minify o is an encoder encoding s-expressions on o.
Raises Invalid_argument if the length of bytes is 0.

Custom s-expression codec

Codecs sequences of s-expressions by mapping them to a custom s-expression type. See an example.

val decode_custom : atom:(decoder -> string -> 'a) ->
list:(decoder -> 'a list -> 'a) ->
decoder -> ('a option, error * range) Pervasives.result
decode_custom ~atom ~list d is Some v, constructed using atom and list by folding over a single s-expression decoded from d or None if the end stream was reached.
val encode_custom : ('a -> [ `Atom of string | `List of 'a list ]) ->
encoder -> 'a option -> unit
encode_custom unfold e v is encodes a single s-expression on e by unfolding v using unfold. Use with None to signal the end of the sequence.

Tagged s-expression codec

Codecs sequences of s-expressions to a generic tagged representation of s-expressions.

type 'a t = [ `Atom of string | `List of 'a t list ] * 'a 
The type for s-expressions tagged with values of type 'a. Either an unescaped, UTF-8 encoded, atom or a list of s-expressions.
type 'a tagger = decoder -> [ `Atom of string | `List of 'a t list ] -> 'a t 
The type for s-expression taggers. Given the decoder and the last decoded s-expression, returns a tagged expression.
val unit_tag : unit tagger
unit_tag is fun _ se -> se, ().
val range_tag : range tagger
range_tag is fun d se -> se, Sexpm.decoded_sexp_range d.
val decode_tagged : tag:'a tagger ->
decoder ->
('a t option, error * range) Pervasives.result
decode_tagged tag d is an s-expression decoded from d and tagged according to tag or None if the end of stream was reached.
val encode_tagged : encoder -> 'a t option -> unit
encode_tagged e se encodes the s-expression se on e. Use with None to signal the end of sequence.

Lexeme sequence codec

Codecs sequences of s-expressions as layout preserving, well-formed, sequences of lexemes.

type lexeme = 
| White of string
| Comment of string
| Ls
| Le
| Atom of string * string option
The type for s-expression lexemes.

On decoding, whitespace, comments and raw atoms are reported only if the decoder has layout. No sequence of two White can be decoded from an input sequence.

A well-formed sequence of lexemes belongs to the language of the following RFC 5234 ABNF grammar defined on a sequence of lexemes.

sexp-seq = *(White _ / Comment _ / sexp)
    sexp = Atom _ / Ls *(sexp-seq) Le
A Sexpm.decoder returns only well-formed sequences of lexemes even when Sexpm.errors are returned (see Error recovery). The lexemes strings are always UTF-8 encoded and in these strings characters originally escaped in the input are in their unescaped representation. White _, Commment _ and raw atom representations are only ever returned if the decoder has layout.

An Sexpm.encoder accepts only well-formed sequences of lexemes or Invalid_argument is raised. The strings given to the lexemes are assumed to be UTF-8 encoded, this is not checked by the module. For atoms if raw is given this raw string will be used to output the token it is the client's duty to make sure it is properly quoted and escaped. If raw is not provided the module will handle this. For comments if the string does not end with a new line and further lexemes are encoded a new line is added.

val pp_lexeme : Format.formatter -> lexeme -> unit
pp_lexeme ppf l prints an unspecified non-sexp representation of l on ppf.
val decode_lexeme : decoder ->
(lexeme option, error * range) Pervasives.result
decode_lexeme d is:

Note. Repeated invocation eventually always returns Ok None, even in case of error.

val peek_lexeme : decoder ->
(lexeme option, error * range) Pervasives.result
peek_lexeme d is like Sexpm.decode_lexeme but doesn't remove the lexeme from the input sequence.
val encode_lexeme : encoder -> lexeme option -> unit
encode_lexeme e l encodes l on e. None should always be eventually encoded.
Raises Invalid_argument if a non well-formed sequence of lexemes is encoded.

Warning. The encoder does not check the following properties and for the output to be valid it is the client's duty to make sure that in encoding:



Atom escaping


val escape : ?quote:bool -> string -> string
escape ~quote s escapes the UTF-8 encoded string s to an s-expression token or quoted token. If quote is true (default) the token is quoted iff s needs escapes or is the empty string. If quote is false the token is never quoted except if s is the empty string.

OCaml values codecs


type trace = lexeme list 
The type for the list of lexemes that described a value.
type 'a codec 
The type for codecs for values of type 'a.
val codec_with_default : 'a -> 'a codec -> 'a codec
codec_with_default v c is c with default value c.
val codec_with_codec : ?eq:('a -> 'a -> bool) -> ?default:'a -> 'a codec -> 'a codec
codec_with_codec ~eq ~default c is c with equality eq (unchanged if unspecified) and default value c

Encoding and decoding values


type 'a c 
val encode_value : 'a c -> encoder -> 'a option -> unit
val encode_traced_value : 'a codec -> encoder -> ('a * 'a) option * trace -> unit
val decode_traced_value : 'a codec ->
decoder ->
('a option * trace, error * range) Pervasives.result

OCaml standard library type codecs


val unit : unit codec
val bool : bool codec
val int : int codec
val float : float codec
val float_fmt : (float -> string, unit, string) Pervasives.format -> float codec
val float_hex : float codec
val string : string codec
val result : ok:'a codec ->
error:'b codec -> ('a, 'b) Pervasives.result codec
val t2 : 'a codec -> 'b codec -> ('a * 'b) codec
val list : 'a codec -> 'a list codec
val set : (module Set.S with type elt = 'a and type t = 'b) ->
'a codec -> 'b codec

S-expression syntax

We define the syntax of s-expressions over a sequence of Unicode characters in which all US-ASCII control characters except whitespace are forbidden in unescaped form.

S-expressions and sequences thereof

A s-expression is either an atom or a list of s-expressions interspaced with whitespace and comments. A sequence of s-expressions is simply a succession of s-expressions interspaced with whitespace and comments.

These elements are informally described below and finally made precise via an ABNF grammar.

Whitespace

Whitespace is a sequence of whitespace characters, namely, space (U+0020), tab (U+0009), line feed (U+000A) or carriage return (U+000D).

Comments

Unless escaped or inside quoted tokens (see below) anything that follows a semicolon (U+003B) is ignored until the next end of line, that is either a line feed (U+000A), a carriage return (U+000D) or a carriage return and a line feed (<U+000D,U+000A>).

(this is not a comment) ; This is a comment
(this is not a comment)

Atoms

An atom represents ground data as a string of Unicode characters. It can, via escapes, represent any sequence of Unicode characters, including control characters and U+0000. It cannot represent an arbitrary byte sequence except via a client-defined encoding convention (e.g. Base64, OCaml byte escape codes, etc.).

An atom is specified either via a token or a quoted token. Quoted tokens can represent atoms with spaces, parentheses and semicolons without having to escape them.

abc        ; a token for the atom "abc"
abc\;\ \(d ; a token for the atom "abc; (d"
"abc"      ; a quoted token for the atom "abc"
"abc; (d"  ; a quoted token for the atom "abc; (d"
""         ; the quoted token for the atom ""

A token and its quoted form represent the same atom. All atoms except the empty atom "" can be represented in both forms.

Escapes are introduced by a backslash. Double quotes and backslashes have to be escaped everywhere except in comments.

"\u{1F42B}\n\"\\" ; atom with U+1F42B, a new line, a quote and a backslash
\u{1F42B}\n\"\\   ; idem
The following escape sequences are recognized:

Tokens and quoted tokens can be split across lines by using a backslash (U+005C) followed by a line feed (<U+000A>), a carriage return (U+000D) or a carriage return and a line feed (<U+000D,U+000A>). Any space (U+0020) or tab (U+0009) character at the beginning of the new line is ignored.

"\
  a\
  \ " ; the atom "a "
  a\
  \   ; the atom "a "

Lists

Lists are delimited by left and right parentheses (U+0028, U+0029) and their elements are s-expressions separated by optional whitespace and comments. For example:

(a list (of four) expressions)
(a list(of four)expressions)
("a"list("of"four)expressions)
(a list (of ; This is a comment
four) expressions)
() ; the empty list

S-expression grammar

The following RFC 5234 ABNF grammar is defined on a sequence of Unicode characters.

 sexp-seq = *(ws / comment / sexp)
     sexp = atom / list
     list = %x0028 sexp-seq %x0029
     atom = token / qtoken
    token = (t-char / escape) *(t-char / escape / cont)
   qtoken = %x0022 *(q-char / escape / cont) %x0022
   escape = %x005C (%x0020 / %x0022 / %x0028 / %x0029 / 0x003B / %x005C /
                    %x006E / %x0072 / %x0074 / %x0075 %x007B unum %x007D)
     unum = 1*6(HEXDIG)
     cont = %x005C nl
       ws = *(ws-char)
  comment = %x003B *(c-char) nl
       nl = %x000A / %x000D / %x000D %x000A
   t-char = %x0021 / %x0023-0027 / %x002A-%x003A / %x003C-005B /
            %x005D-D7FF / %xE000-10FFFF
   q-char = t-char / ws-char / %x0028 / %x0029 / %x003B
  ws-char = %x0020 / %x0009 / %x000A / %x000D
   c-char = %x0009 / %x0020-D7FF / %xE000-10FFFF
A few additional constraints not expressed by the grammar:

Error recovery

After a decoding error, if best-effort decoding is performed, the following happens before continuing:

Due to the U+FFFD insertions layout cannot be assumed to be preserved when error recovery is performed, however outputing an erroring stream of lexemes results in a valid sequence of s-expressions.

Tips and limitations



Examples

Custom representation

The following codecs a sequence of s-expressions to a custom definition of s-expressions. Note that for illustration purposes this representation is exactly the tagged s-expression representation of the next example.

type sexp = [ `Atom of string | `List of sexp list ] * Sexpm.range

let of_string : string -> (sexp list, Sexpm.error * Sexpm.range) result =
fun s ->
  let d = Sexpm.(decoder @@ input_of_string s) in
  let atom d a = `Atom a, Sexpm.decoded_sexp_range d in
  let list d l = `List l, Sexpm.decoded_sexp_range d in
  let rec loop acc = match Sexpm.decode_custom ~atom ~list d with
  | Ok Some v -> loop (v :: acc)
  | Ok None -> Ok (List.rev acc)
  | Error _ as e -> e
  in
  loop []

let to_string : ?style:Sexpm.encoder_style -> sexp list -> string
fun ses ->
  let unfold = fst in
  let b = Buffer.create 256 in
  let d = Sexpm.(encoder ?style @@ output_to_buffer b) in
  let rec loop = function
  | se :: ses -> Sexpm.encode_custom unfold d (Some se); loop ses
  | [] -> Sexpm.encode_custom d NoneBuffer.contents b
  in
  loop ses

Tagged representation

type sexp = Sexpm.range Sexpm.t

let of_string : string -> (sexp list, Sexpm.error * Sexpm.range) result =
fun s ->
  let d = Sexpm.(decoder @@ input_of_string s) in
  let rec loop acc = match Sexpm.decode_tagged ~tag:Sexpm.range_tag d with
  | Ok Some v -> loop (v :: acc)
  | Ok None -> Ok (List.rev acc)
  | Error _ as e -> e
  in
  loop []

let to_string : ?style:Sexpm.encoder_style -> sexp list -> string
fun ses ->
  let b = Buffer.create 256 in
  let d = Sexpm.(encoder ?style @@ output_to_buffer b) in
  let rec loop = function
  | se :: ses -> Sexpm.encode_tagged d (Some se); loop ses
  | [] -> Sexpm.encode_tagged d NoneBuffer.contents b
  in
  loop ss