Module Serialk_sexp

S-expression support.

The module Sexp has a codec for the s-expression syntax and general definitions for working with s-expressions. Sexpg generates s-expressions without going through a generic representation. Sexpq queries and updates generic representations with combinators.

A short introduction to s-expressions and a description of the syntax parsed by the codec are given below.

Open this module to use it; this only introduces modules in your scope.

Warning. Serialization functions always assume that all OCaml strings in the data you provide are UTF-8 encoded. This is not checked by the module.
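
Since the encoding is not checked, a caller may want to validate strings itself before serializing them. The following is a minimal sketch that only uses the OCaml standard library and assumes OCaml 4.14 or later for String.is_valid_utf_8. The name check_utf_8 is a hypothetical helper, not part of this module.

```ocaml
(* Hypothetical helper, not part of Serialk_sexp: rejects strings that
   are not valid UTF-8 before they are handed to a serialization
   function. Requires OCaml >= 4.14 for String.is_valid_utf_8. *)
let check_utf_8 (s : string) : (string, string) result =
  if String.is_valid_utf_8 s then Ok s
  else Error (Printf.sprintf "not valid UTF-8: %S" s)
```

Returning a result rather than raising lets the caller decide whether to reject the data or re-encode it.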

API

module Sexp : sig ... end

S-expression definitions and codec.

module Sexpg : sig ... end

S-expression generation.

module Sexpq : sig ... end

S-expression queries.

Dictionaries

An s-expression dictionary is a list of bindings. A binding is a list that starts with a key and the remaining elements of the list are the binding's value. For example in this binding:

(key v0 v1 ...)

The key is key and the value is the possibly empty list v0, v1, ... of s-expressions. The API for dictionaries represents the value by a fake s-expression list (one that does not exist syntactically) whose text location starts at the first element of the value.
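
To make the binding structure concrete, here is a minimal sketch of a dictionary lookup. It assumes a stand-in type sexp defined for illustration only; this is not the library's own representation and it carries no text locations. The names find_binding and example are hypothetical.

```ocaml
(* A stand-in generic representation, assumed for illustration only.
   It is not the library's own type and has no text locations. *)
type sexp = Atom of string | List of sexp list

(* [find_binding key dict] looks up [key] in the dictionary [dict] and
   returns the binding's value, that is the possibly empty list of
   s-expressions that follow the key. *)
let find_binding (key : string) (dict : sexp) : sexp list option =
  match dict with
  | Atom _ -> None (* an atom is not a dictionary *)
  | List bindings ->
      let is_binding = function
        | List (Atom k :: _) -> String.equal k key
        | _ -> false
      in
      match List.find_opt is_binding bindings with
      | Some (List (_ :: value)) -> Some value
      | _ -> None

(* The dictionary ((name serialk) (deps a b) (empty)) *)
let example =
  List [ List [ Atom "name"; Atom "serialk" ];
         List [ Atom "deps"; Atom "a"; Atom "b" ];
         List [ Atom "empty" ] ]
```

With this sketch, looking up deps in example yields the two atoms a and b, and looking up empty yields the empty list rather than nothing.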

Path & caret syntax

Paths and carets provide a way for end users to address s-expressions and edit locations.

A path is a sequence of key and list indexing operations. Applying the path to an s-expression leads to an s-expression, or to nothing if one of the indices does not exist, or to an error if one tries to index an atom.
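
The three possible outcomes can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the types sexp and index and the functions apply_index and apply_path are hypothetical, and a key is assumed to yield its binding's value wrapped as a list.

```ocaml
(* Stand-in types, assumed for illustration only. *)
type sexp = Atom of string | List of sexp list
type index = Nth of int | Key of string

(* Applies one indexing operation. [Ok None] means the index does not
   exist, [Error _] means an atom was indexed. *)
let apply_index (i : index) (s : sexp) : (sexp option, string) result =
  match s, i with
  | Atom a, _ -> Error (Printf.sprintf "cannot index atom %S" a)
  | List l, Nth n ->
      (* Negative indices count from the end of the list. *)
      let n = if n < 0 then List.length l + n else n in
      Ok (if n < 0 then None else List.nth_opt l n)
  | List l, Key k ->
      let value = function
        | List (Atom k' :: v) when String.equal k k' -> Some (List v)
        | _ -> None
      in
      Ok (List.find_map value l)

(* Applies the indices of a path from left to right. *)
let rec apply_path (p : index list) (s : sexp) : (sexp option, string) result =
  match p with
  | [] -> Ok (Some s)
  | i :: rest ->
      match apply_index i s with
      | Ok (Some s') -> apply_path rest s'
      | (Ok None | Error _) as r -> r

(* The dictionary ((ocaml (deps a b))) *)
let doc = List [ List [ Atom "ocaml"; List [ Atom "deps"; Atom "a"; Atom "b" ] ] ]
```

In this sketch, applying the keys ocaml then deps to doc yields the list of a and b, a missing key yields Ok None, and indexing into the atom a yields an error.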

A caret is a path and a spatial specification for the s-expression found by the path. The caret indicates either the void before that expression, the expression itself (over caret) or the void after it.

Here are a few examples of paths and carets. Syntactically, the character 'v' is used to denote the caret's insertion point before or after a path. There's no distinction between a path and an over caret.

ocaml.deps        # value of key 'deps' of dictionary 'ocaml'
ocaml.v[deps]     # before the key binding (if any)
ocaml.[deps]v     # after the key binding (if any)

ocaml.deps.[0]    # first element of key 'deps' of dictionary 'ocaml'
ocaml.deps.v[0]   # before first element (if any)
ocaml.deps.[0]v   # after first element (if any)

ocaml.deps.[-1]   # last element of key 'deps' of dictionary 'ocaml'
ocaml.deps.v[-1]  # before last element (if any)
ocaml.deps.[-1]v  # after last element (if any)

More formally, a path is a . separated list of indices.

An index is written [i] with i either a zero-based list index (with negative indices counting from the end of the list, -1 is the last element) or a dictionary key key. If there is no ambiguity, the surrounding brackets can be dropped.

A caret is a path whose last index brackets can be prefixed or suffixed by the character 'v' to respectively denote the void before or after the s-expression found by the path.

Note. The syntax has no form of quoting at the moment; this means key names can't contain [ or ], or be numbers.
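
The syntax above can be read with a few lines of code. The following is a minimal sketch written against the informal description only; the types index and caret_loc and the functions is_decimal, parse_index and parse_caret are hypothetical names and are not the library's API. It assumes list indices are written in decimal.

```ocaml
type index = Nth of int | Key of string
type caret_loc = Before | Over | After

(* True if [s] is an optional '-' followed by one or more digits. *)
let is_decimal (s : string) : bool =
  let n = String.length s in
  let start = if n > 0 && s.[0] = '-' then 1 else 0 in
  let rec digits i =
    i >= n || (s.[i] >= '0' && s.[i] <= '9' && digits (i + 1))
  in
  n > start && digits start

(* Parses one index, with or without its surrounding brackets. *)
let parse_index (seg : string) : (index, string) result =
  let n = String.length seg in
  let body =
    if n >= 2 && seg.[0] = '[' && seg.[n - 1] = ']'
    then String.sub seg 1 (n - 2)
    else seg
  in
  if body = "" || String.contains body '[' || String.contains body ']'
  then Error (Printf.sprintf "illegal index %S" seg)
  else if is_decimal body then
    (match int_of_string_opt body with
     | Some i -> Ok (Nth i)
     | None -> Error (Printf.sprintf "index out of range %S" seg))
  else Ok (Key body)

(* Parses a caret. A plain path is returned as an [Over] caret. *)
let parse_caret (s : string) : (caret_loc * index list, string) result =
  (* Splits a 'v' marker off the brackets of the last index, if any. *)
  let strip_v seg =
    let n = String.length seg in
    if n >= 3 && seg.[0] = 'v' && seg.[1] = '['
    then (Before, String.sub seg 1 (n - 1))
    else if n >= 3 && seg.[n - 1] = 'v' && seg.[n - 2] = ']'
    then (After, String.sub seg 0 (n - 1))
    else (Over, seg)
  in
  let rec loop acc = function
    | [] -> Error "empty path"
    | [ last ] ->
        let loc, seg = strip_v last in
        Result.map (fun i -> (loc, List.rev (i :: acc))) (parse_index seg)
    | seg :: rest ->
        (match parse_index seg with
         | Ok i -> loop (i :: acc) rest
         | Error e -> Error e)
  in
  loop [] (String.split_on_char '.' s)
```

Because there is no quoting, the sketch simply splits on '.' and decides between a list index and a key by checking whether the text is a decimal number.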

S-expression syntax

S-expressions are a general way of describing data via atoms (sequences of characters) and lists delimited by parentheses. Here are a few examples of s-expressions and their syntax:

this-is-an_atom
(this is a list of seven atoms)
(this list contains (a nested) list)

; This is a comment
; Anything that follows a semi-colon is ignored until the next line

(this list ; has three atoms and an embedded ()
 comment)

"this is a quoted atom, it can contain spaces ; and ()"

"quoted atoms can be split ^
 across lines or contain Unicode esc^u{0061}pes"

We define the syntax of s-expressions over a sequence of Unicode characters in which all US-ASCII control characters (U+0000..U+001F and U+007F) except whitespace are forbidden in unescaped form.

S-expressions

An s-expression is either an atom or a list of s-expressions interspaced with whitespace and comments. A sequence of s-expressions is a succession of s-expressions interspaced with whitespace and comments.

These elements are informally described below and finally made precise via an ABNF grammar.

Whitespace

Whitespace is a sequence of whitespace characters, namely, space ' ' (U+0020), tab '\t' (U+0009), line feed '\n' (U+000A), vertical tab (U+000B), form feed (U+000C) and carriage return '\r' (U+000D).

Comments

Unless it occurs inside an atom in quoted form (see below) anything that follows a semicolon ';' (U+003B) is ignored until the next end of line, that is either a line feed '\n' (U+000A), a carriage return '\r' (U+000D) or a carriage return and a line feed "\r\n" (<U+000D,U+000A>).

(this is not a comment) ; This is a comment
(this is not a comment)

Atoms

An atom represents ground data as a string of Unicode characters. It can, via escapes, represent any sequence of Unicode characters, including control characters and U+0000. It cannot represent an arbitrary byte sequence except via a client-defined encoding convention (e.g. Base64 or hex encoding).

Atoms can be specified either via an unquoted or a quoted form. In unquoted form the atom is written without delimiters. In quoted form the atom is delimited by double quote '"' (U+0022) characters. The quoted form is mandatory for atoms that contain whitespace, parentheses '(' ')', semicolons ';', quotes '"', carets '^' or characters that need to be escaped.

abc        ; a token for the atom "abc"
"abc"      ; a quoted token for the atom "abc"
"abc; (d"  ; a quoted token for the atom "abc; (d"
""         ; the quoted token for the atom ""

For atoms that do not need to be quoted, both their unquoted and quoted form represent the same string; e.g. the string "true" can be represented both by the atoms true and "true". The empty string can only be represented in quoted form by "".

In quoted form escapes are introduced by a caret '^'. Double quotes '"' and carets '^' must always be escaped.

"^^"             ; atom for ^
"^n"             ; atom for line feed U+000A
"^u{0000}"       ; atom for U+0000
"^"^u{1F42B}^""  ; atom with a quote, U+1F42B and a quote

The following escape sequences are recognized:

  • "^ " (<U+005E,U+0020>) for space ' ' (U+0020)
  • "^\"" (<U+005E,U+0022>) for double quote '"' (U+0022) mandatory
  • "^^" (<U+005E,U+005E>) for caret '^' (U+005E) mandatory
  • "^n" (<U+005E,U+006E>) for line feed '\n' (U+000A)
  • "^r" (<U+005E,U+0072>) for carriage return '\r' (U+000D)
  • "^u{X}" where X is 1 to 6 upper or lower case hexadecimal digits, standing for the corresponding Unicode character U+X.
  • Any other character following a caret, except line feed '\n' (U+000A) or carriage return '\r' (U+000D), is an illegal sequence of characters. In these two cases the atom continues on the next line and whitespace is ignored.

An atom in quoted form can be split across lines by using a caret '^' (U+005E) followed by a line feed '\n' (U+000A) or a carriage return '\r' (U+000D); any subsequent whitespace is ignored.

"^
  a^
  ^ " ; the atom "a "
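
Going the other way, the quoting and escaping rules can be applied to print an OCaml string as an atom. The following is a minimal sketch, not the library's generator: forces_quotes, needs_quotes, quote and atom_to_string are hypothetical names. It assumes the string is valid UTF-8 and works byte by byte, which is safe because all the characters involved are US-ASCII.

```ocaml
(* Bytes that force the quoted form: whitespace, parentheses, semicolon,
   double quote, caret and US-ASCII control characters. *)
let forces_quotes (c : char) : bool =
  match c with
  | ' ' | '(' | ')' | ';' | '"' | '^' -> true
  | c -> Char.code c < 0x20 || Char.code c = 0x7F

(* The empty string can only be written in quoted form. *)
let needs_quotes (s : string) : bool =
  let n = String.length s in
  let rec loop i = i < n && (forces_quotes s.[i] || loop (i + 1)) in
  n = 0 || loop 0

(* Writes [s] in quoted form. Double quotes and carets are always
   escaped. Control characters are escaped too; the grammar lets
   whitespace control characters appear unescaped, but escaping them
   is also valid and keeps the output on one line. *)
let quote (s : string) : string =
  let b = Buffer.create (String.length s + 2) in
  Buffer.add_char b '"';
  String.iter
    (function
      | '"' -> Buffer.add_string b "^\""
      | '^' -> Buffer.add_string b "^^"
      | '\n' -> Buffer.add_string b "^n"
      | '\r' -> Buffer.add_string b "^r"
      | c when Char.code c < 0x20 || Char.code c = 0x7F ->
          Buffer.add_string b (Printf.sprintf "^u{%04X}" (Char.code c))
      | c -> Buffer.add_char b c)
    s;
  Buffer.add_char b '"';
  Buffer.contents b

(* Uses the unquoted form whenever it is allowed. *)
let atom_to_string (s : string) : string =
  if needs_quotes s then quote s else s
```

Bytes above U+007F are copied unchanged, so multi-byte UTF-8 sequences pass through as they are.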

The character ^ (U+005E) is used as an escape character rather than the usual \ (U+005C) in order to make quoted Windows® file paths decently readable and, not least, to utterly please DKM.

Lists

Lists are delimited by left '(' (U+0028) and right ')' (U+0029) parentheses. Their elements are s-expressions separated by optional whitespace and comments. For example:

(a list (of four) expressions)
(a list(of four)expressions)
("a"list("of"four)expressions)
(a list (of ; This is a comment
four) expressions)
() ; the empty list
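
To illustrate that whitespace between list elements is optional, here is a minimal sketch of a reader. It is deliberately incomplete: it only handles lists and unquoted atoms, it does not support quoted atoms or comments, and it raises Failure on unbalanced parentheses. The type sexp and the functions is_ws, parse_seq and parse are hypothetical and are not the library's codec.

```ocaml
(* Stand-in generic representation, assumed for illustration only. *)
type sexp = Atom of string | List of sexp list

let is_ws = function
  | ' ' | '\t' | '\n' | '\x0B' | '\x0C' | '\r' -> true
  | _ -> false

(* Reads s-expressions from position [i] until a ')' or the end of
   input. Returns the elements read and the position where it stopped. *)
let rec parse_seq (s : string) (i : int) (acc : sexp list) : sexp list * int =
  let n = String.length s in
  if i >= n then (List.rev acc, i)
  else
    match s.[i] with
    | c when is_ws c -> parse_seq s (i + 1) acc
    | ')' -> (List.rev acc, i)
    | '(' ->
        let elts, j = parse_seq s (i + 1) [] in
        if j < n && s.[j] = ')'
        then parse_seq s (j + 1) (List elts :: acc)
        else failwith "unclosed list"
    | _ ->
        (* An unquoted atom ends at whitespace or at a parenthesis. *)
        let rec stop j =
          if j < n && not (is_ws s.[j]) && s.[j] <> '(' && s.[j] <> ')'
          then stop (j + 1)
          else j
        in
        let j = stop i in
        parse_seq s j (Atom (String.sub s i (j - i)) :: acc)

let parse (s : string) : sexp list =
  let elts, j = parse_seq s 0 [] in
  if j < String.length s then failwith "unexpected ')'" else elts
```

With this sketch, the first two examples above, with and without spaces around the nested list, read to the same value.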

Formal grammar

The following RFC 5234 ABNF grammar is defined on a sequence of Unicode characters.

 sexp-seq = *(ws / comment / sexp)
     sexp = atom / list
     list = %x0028 sexp-seq %x0029
     atom = token / qtoken
    token = t-char *(t-char)
   qtoken = %x0022 *(q-char / escape / cont) %x0022
   escape = %x005E (%x0020 / %x0022 / %x005E / %x006E / %x0072 /
                    %x0075 %x007B unum %x007D)
     unum = 1*6(HEXDIG)
     cont = %x005E nl ws
       ws = *(ws-char)
  comment = %x003B *(c-char) nl
       nl = %x000A / %x000D / %x000D %x000A
   t-char = %x0021 / %x0023-0027 / %x002A-003A / %x003C-005D /
            %x005F-%x007E / %x0080-D7FF / %xE000-10FFFF
   q-char = t-char / ws-char / %x0028 / %x0029 / %x003B
  ws-char = %x0020 / %x0009 / %x000A / %x000B / %x000C / %x000D
   c-char = %x0009 / %x000B / %x000C / %x0020-D7FF / %xE000-10FFFF

A few additional constraints not expressed by the grammar:

  • unum, once interpreted as a hexadecimal number, must be a Unicode scalar value.
  • A comment can be ended by the end of the character sequence rather than nl.
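
The first constraint can be checked with the standard library's Uchar module. The following is a minimal sketch; uchar_of_unum is a hypothetical name and is not part of this module's API.

```ocaml
(* [uchar_of_unum x] interprets [x], the hexadecimal digits found
   between the braces of a "^u{...}" escape. It enforces both the
   grammar (1 to 6 hexadecimal digits) and the additional constraint
   that the number is a Unicode scalar value. *)
let uchar_of_unum (x : string) : (Uchar.t, string) result =
  let is_hex = function
    | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true
    | _ -> false
  in
  let n = String.length x in
  let rec all_hex i = i >= n || (is_hex x.[i] && all_hex (i + 1)) in
  if n < 1 || n > 6 || not (all_hex 0)
  then Error "expected 1 to 6 hexadecimal digits"
  else
    let cp = int_of_string ("0x" ^ x) in
    (* Uchar.is_valid rejects surrogates (U+D800..U+DFFF) and numbers
       above U+10FFFF. *)
    if Uchar.is_valid cp then Ok (Uchar.of_int cp)
    else Error (Printf.sprintf "U+%04X is not a Unicode scalar value" cp)
```

The grammar alone accepts escapes such as ^u{D800} or ^u{110000}; the sketch rejects them because neither is a scalar value.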