Module Cmarkit

CommonMark parser and abstract syntax tree.

References.

Abstract syntax tree

String spans

module Span : sig ... end

String spans.

Layout information

Values of these types can be ignored by renderers. They are used to recover verbatim layout information from the original CommonMark input when the abstract syntax tree data can no longer represent it faithfully; see Best-effort layout preservation. For programmatically generated nodes these strings can be left empty, or filled with a desired layout in case CommonMark is being rendered.

type layout = string

The type for string layout information.

type layout_span = Span.t

The type for layout spans.

type layout_indent = int

The type for space indentation information.

Node metadata

module Tloc : sig ... end

Text locations.

module Meta : sig ... end

Abstract syntax tree node metadata.

Documents

type 'a node = 'a * Meta.t

The type for abstract syntax tree nodes: the node data and its metadata.

type attribute = string

The type for attributes. An attribute is text outside the text flow; it is a string which may hold entities that need resolution.

type inline = ..

The type for CommonMark inlines. See Inlines.

type block = ..

The type for CommonMark blocks. See Blocks.

type t = block list node

The type for CommonMark documents, a list of blocks.

Blocks

type blank_line = layout

The type for blank lines. Kept for best-effort layout preservation, ignore them otherwise.

type block_quote = {
  block_quote_layout : layout list;
  block_quote_blocks : block list;
}

The type for block quotes.

type code_block = {
  code_block_layout : [ `Indented | `Fenced of [ `Tilde | `Backtick ] * (layout_indent * int) * (layout_indent * int) option ];
  code_block_info : attribute node;
  code_block_lines : string node list;
}

The type for indented and fenced code blocks.

type heading = {
  heading_layout : [ `Atx of layout | `Setext of layout ];
  heading_level : int;  (* from 1 to 6 *)
  heading : inline list;
}

The type for ATX and Setext headings.

type html_block = {
  html_block_lines : string node list;
}

The type for HTML blocks.

type reference = {
  reference_label_key : string;
  reference_label : layout_indent * attribute node;
  reference_url : layout_span * bool * attribute node;
  reference_title : (layout_span * attribute node) option;
}
type reference_map = reference Stdlib.Map.Make(Stdlib.String).t

reference_map is the type for reference maps. They map reference_label_key values to their reference value.

type list_item = {
  list_item_layout : string;
  list_item : block list;
}

The type for list items.

type list' = {
  list_kind : [ `Ul of layout | `Ol of layout * int ];
  list_tight : bool;
  list_items : list_item node list;
}

The type for lists. The comment for blank_line applies here as well.

type paragraph = {
  paragraph_layout : layout * layout;
  paragraph : inline list;
}

The type for paragraphs. paragraph_layout is the leading and trailing whitespace, kept for round tripping; it can be ignored.

type thematic_break = layout

The type for thematic breaks.

type block +=
  | Blank_line of blank_line node  (* Kept for layout. *)
  | Block_quote of block_quote node
  | Blocks of block list node  (* Convenience. *)
  | Code_block of code_block node
  | Heading of heading node
  | Html_block of html_block node
  | Reference of reference node  (* Kept for layout. *)
  | List of list' node
  | Paragraph of paragraph node
  | Thematic_break of thematic_break node

CommonMark blocks.
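Because block is an extensible variant, any function that matches on it needs a catch-all case. The following is a minimal sketch of a traversal that collects heading levels. It does not use the library: Meta, node, heading and the block cases are simplified local stand-ins (the heading_text field and Meta.none are assumptions made for illustration), so only the shape of the recursion carries over.

```ocaml
(* Local stand-ins so that the sketch compiles without the library. *)
module Meta = struct type t = unit let none = () end
type 'a node = 'a * Meta.t
type block = ..

(* Simplified heading: the real record holds an inline list, not a string. *)
type heading = { heading_level : int; heading_text : string }

type block +=
  | Blocks of block list node
  | Heading of heading node
  | Paragraph of string node

(* Accumulate heading levels in reverse document order. *)
let rec heading_levels acc = function
  | Heading (h, _) -> h.heading_level :: acc
  | Blocks (bs, _) -> List.fold_left heading_levels acc bs
  | _ -> acc (* catch-all keeps the match exhaustive on an extensible type *)

let levels (doc : block list node) =
  List.rev (List.fold_left heading_levels [] (fst doc))
```

The catch-all case also means that block cases added by extensions are silently skipped, which is usually what a generic traversal wants.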

Inlines

type entity = [
  | `Numeric of [ `Dec | `Hex ] * Stdlib.Uchar.t
  | `Name of string
]

The type for entities. For

  • `Numeric (layout, u), invalid character references and U+0000 yield Uchar.rep for u. layout indicates whether this was a decimal or hexadecimal entity (leading zeros, casing information and invalid references are lost).
  • `Name, we have the entity name proper, without its starting '&' and ending ';'. The name has not been checked to be an HTML5 entity name.
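As an illustration of what resolving such a value might look like in a renderer, here is a minimal sketch. The entity type is copied locally so the code is self-contained; resolve_entity and the four-entry named table are hypothetical, and a real renderer would consult the full HTML5 entity list.

```ocaml
type entity = [ `Numeric of [ `Dec | `Hex ] * Uchar.t | `Name of string ]

(* Hypothetical lookup table, deliberately tiny. *)
let named = [ "amp", "&"; "lt", "<"; "gt", ">"; "quot", "\"" ]

let resolve_entity : entity -> string = function
  | `Numeric (_layout, u) ->
      (* The layout only matters when re-rendering CommonMark. *)
      let b = Buffer.create 4 in
      Buffer.add_utf_8_uchar b u;
      Buffer.contents b
  | `Name n ->
      match List.assoc_opt n named with
      | Some s -> s
      | None -> "&" ^ n ^ ";" (* unknown name: emit the reference verbatim *)
```

Emitting unknown names verbatim is one possible policy; it is an assumption of this sketch, not something the documentation prescribes.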

CommonMark inlines.

Parsing options

Parser

val unicode_version : string

unicode_version is the Unicode version supported by the library.

val of_string : ?with_locs:bool -> ?file:Tloc.fpath -> string -> t * reference_map

of_string s is an abstract syntax tree and a reference map for the UTF-8 encoded CommonMark document s.

  • with_locs indicates whether locations should be kept (defaults to true)
  • file is the file path from which s is assumed to have been read (defaults to Tloc.file_none)

UTF-8 decoding errors and U+0000 are turned into Uchar.rep characters.

Note. Since some of the abstract syntax tree data has Span.t values on s you can safely assume s will live as long as its nodes will.
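The replacement of UTF-8 decoding errors and U+0000 by Uchar.rep described above can be pictured with the following sketch. It is not the parser's code: sanitize_utf_8 is a hypothetical helper written against the Stdlib decoders, which assumes OCaml 4.14 or later.

```ocaml
(* Re-encode [s], replacing invalid byte sequences and U+0000 by U+FFFD. *)
let sanitize_utf_8 s =
  let b = Buffer.create (String.length s) in
  let rec loop i =
    if i >= String.length s then Buffer.contents b else
    let d = String.get_utf_8_uchar s i in
    (* Invalid sequences already decode to Uchar.rep. *)
    let u = Uchar.utf_decode_uchar d in
    (* U+0000 is valid UTF-8, so it has to be mapped explicitly. *)
    let u = if Uchar.equal u (Uchar.of_int 0) then Uchar.rep else u in
    Buffer.add_utf_8_uchar b u;
    loop (i + Uchar.utf_decode_length d)
  in
  loop 0
```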

Tools

These may be useful for abstract syntax tree processing or renderers.

module Ascii : sig ... end

US-ASCII character and string functions.

is_unsafe_link url is true if url is deemed unsafe. This is the case if url starts with a caseless match of javascript:, vbscript:, file: or data:, except if it starts with data:image/{gif,png,jpeg,webp}.

These rules were taken from cmark, the C reference implementation of CommonMark.
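Read literally, the rules above amount to a handful of caseless prefix tests. The following standalone sketch restates them; is_unsafe_link_sketch and starts_with_caseless are hypothetical names, and the code is a reading of the description rather than the library's implementation.

```ocaml
(* [prefix] must be given in lowercase. *)
let starts_with_caseless ~prefix s =
  let plen = String.length prefix in
  String.length s >= plen
  && String.lowercase_ascii (String.sub s 0 plen) = prefix

let is_unsafe_link_sketch url =
  let has p = starts_with_caseless ~prefix:p url in
  (* data: URLs are tolerated only for these image types. *)
  let safe_data_image =
    List.exists has
      [ "data:image/gif"; "data:image/png";
        "data:image/jpeg"; "data:image/webp" ]
  in
  has "javascript:" || has "vbscript:" || has "file:"
  || (has "data:" && not safe_data_image)
```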

val language_of_code_block_info : string -> string option

language_of_code_block_info i extracts a language for the code block info string i. This is the first word of the info string.
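A minimal sketch of "first word of the info string" follows. It is not the library's code: language_of_info_sketch is a hypothetical name, and treating only spaces and tabs as separators, as well as returning None for an empty or blank info string, are assumptions of the sketch.

```ocaml
let is_blank c = c = ' ' || c = '\t'

let language_of_info_sketch info =
  let len = String.length info in
  (* Index of the first non-blank character at or after [i]. *)
  let rec skip i = if i < len && is_blank info.[i] then skip (i + 1) else i in
  (* Index just past the word starting at [i]. *)
  let rec word i =
    if i < len && not (is_blank info.[i]) then word (i + 1) else i
  in
  let start = skip 0 in
  let stop = word start in
  if stop = start then None else Some (String.sub info start (stop - start))
```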

val pct_decode : string -> string

pct_decode s is s percent decoded. FIXME. Remove.

Notes

Entity handling

Entities are reified in the abstract syntax tree and left for renderers to resolve.

This is the reason why we have the attribute type for the few places where entities can appear but no inline is allowed. Renderers should be careful to resolve entities in these strings if they need to.

Best-effort layout preservation

In order to be able to transform user CommonMark documents without normalizing them too much, the abstract syntax tree has a few data cases and fields dedicated to the original document layout.

To keep things reasonably simple, a few things are not attempted, like preserving the exact layout of block quote nesting, irregular indentation, insignificant trailing whitespace or the precise way numbers were specified in character references.

In general we try to keep the following desirable properties:

  1. Generating a CommonMark document from the abstract syntax tree and ignoring all layout data should round trip the same abstract syntax tree modulo the layout fields.
  2. Generating a CommonMark document that partially uses layout information should round trip the same abstract syntax tree modulo the layout fields. This means that programmatically added nodes need not care about the layout in which they are injected.
  3. Layout information should not include document data itself. Otherwise updating node data also needs to update the layout data, which is error prone and inconvenient. There are a few exceptions, e.g. entity.

In practice, since CommonMark is quite sensitive to small syntactic elements, properties 1. and 2. may not hold in every case.

Locations

Extensions