Module B0_url

Sloppy URL processing.

URL standards are in a sorry state. This module takes a sloppy approach to URL processing. It only breaks URLs into their components and classifies them.

Warning. None of the functions here perform percent encoding or decoding. Use Percent when deemed appropriate.

URLs

type scheme = string

The type for schemes, without the ':' separator.

type authority = string

The type for HOST:PORT authorities.

type path = string

The type for paths.

type query = string

The type for queries, without the '?' separator.

type fragment = string

The type for fragments, without the '#' separator.

type t = string

The type for URLs.

val scheme : t -> scheme option

scheme u is the scheme of u, if any.

val authority : t -> authority option

authority u is the authority of u, if any.

val path : t -> path option

path u is the path of u, if any.

val query : t -> query option

query u is the query of u, if any.

val fragment : t -> fragment option

fragment u is the fragment of u, if any.

Kinds

type relative_kind = [
  | `Scheme
  | `Abs_path
  | `Rel_path
  | `Empty
]

The type for kinds of relative references. Represents this alternation.

type kind = [
  | `Abs
  | `Rel of relative_kind
]

The type for kinds of URLs. Represents this alternation.

val kind : t -> kind

kind u determines the kind of u. It decides that u is absolute if u starts with a scheme followed by ':'.
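The classification can be illustrated with a small standalone sketch. It is not the module's implementation: classify, starts_with_scheme and the character predicates are hypothetical names, the scheme syntax is assumed to follow RFC 3986, and `Scheme is assumed to denote a scheme-relative reference starting with "//".

```ocaml
type relative_kind = [ `Scheme | `Abs_path | `Rel_path | `Empty ]
type kind = [ `Abs | `Rel of relative_kind ]

(* Assumption: RFC 3986 scheme syntax,
   ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) *)
let is_alpha c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
let is_scheme_char c =
  is_alpha c || ('0' <= c && c <= '9') || c = '+' || c = '-' || c = '.'

(* true if u starts with a non-empty scheme followed by ':' *)
let starts_with_scheme u =
  let len = String.length u in
  let rec loop i =
    if i >= len then false
    else if u.[i] = ':' then true
    else if is_scheme_char u.[i] then loop (i + 1)
    else false
  in
  len > 0 && is_alpha u.[0] && loop 1

(* Hypothetical stand-in for kind; the relative cases are assumptions. *)
let classify (u : string) : kind =
  let len = String.length u in
  if starts_with_scheme u then `Abs
  else if len = 0 then `Rel `Empty
  else if len >= 2 && u.[0] = '/' && u.[1] = '/' then `Rel `Scheme
  else if u.[0] = '/' then `Rel `Abs_path
  else `Rel `Rel_path
```

Under these assumptions, classify "https://example.org/a" is `Abs and classify "/a/b" is `Rel `Abs_path. The actual kind function may treat edge cases differently.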

Operations

val update : ?scheme:scheme option -> ?authority:string option -> ?path:path option -> ?query:query option -> ?fragment:fragment option -> t -> t

update u updates the specified components of u. An unspecified component is kept as in u; a component updated with None is deleted from u.
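The three states of each optional argument (omitted, Some v, None) can be sketched on a toy record. This only illustrates the calling convention: the parts record and the pick helper are hypothetical, and the real update operates on URL strings with all five components.

```ocaml
(* Hypothetical stand-in for a parsed URL; B0_url.t itself is a string. *)
type parts = { scheme : string option; query : string option }

(* Inside the function each optional argument has type
   string option option:
   - None            the argument was omitted, keep the component
   - Some None       delete the component
   - Some (Some v)   set the component to v *)
let update ?scheme ?query p =
  let pick current = function None -> current | Some v -> v in
  { scheme = pick p.scheme scheme; query = pick p.query query }

let p = { scheme = Some "http"; query = Some "a=1" }
let kept = update p                            (* nothing changes *)
let no_query = update ~query:None p            (* query is deleted *)
let secure = update ~scheme:(Some "https") p   (* scheme is replaced *)
```

The same call shapes apply to the documented update, for example update ~query:None u to drop the query of u.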

val append : t -> t -> t

append root u is u if kind u is `Abs. Otherwise it uses root to make u absolute according to its relative_kind. The result is guaranteed to be absolute if root is; it may be surprising or nonsensical if root isn't (FIXME can't we characterize that more?).

Scraping

val list_of_text_scrape : ?root:t -> string -> t list

list_of_text_scrape ?root s roughly finds absolute and relative URLs in the ASCII compatible (including UTF-8) textual data s by looking in order:

  1. For the next href or src substring, then tries to parse the content of an HTML attribute. This may result in relative or absolute paths.
  2. For the next http substrings in s, then delimits a URL depending on the previous characters and checks that the delimited URL starts with http:// or https://.

Relative URLs are appended to root if provided. Otherwise they are kept as is. The result may have duplicates.

Formatting

val pp : Stdlib.Format.formatter -> t -> unit

pp formats a URL. For now this is just Format.pp_print_string.

val pp_kind : Stdlib.Format.formatter -> kind -> unit

pp_kind formats an unspecified representation of kinds.

Percent encoding

module Percent : sig ... end

Percent-encoding codecs according to RFC 3986.
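Since the signature of Percent is elided here, the following is only a standalone sketch of what RFC 3986 percent-encoding amounts to. The names percent_encode, percent_decode, is_unreserved and hex_value are hypothetical and are not the Percent module's API. The sketch assumes that every byte outside the RFC 3986 unreserved set is encoded, which is stricter than what a given URL component may require.

```ocaml
(* RFC 3986 section 2.3: unreserved characters are never encoded. *)
let is_unreserved c = match c with
  | 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '-' | '.' | '_' | '~' -> true
  | _ -> false

(* Encodes every byte that is not unreserved as %HH (uppercase hex). *)
let percent_encode s =
  let b = Buffer.create (String.length s * 3) in
  let add c =
    if is_unreserved c then Buffer.add_char b c
    else Buffer.add_string b (Printf.sprintf "%%%02X" (Char.code c))
  in
  String.iter add s;
  Buffer.contents b

let hex_value c = match c with
  | '0' .. '9' -> Some (Char.code c - Char.code '0')
  | 'A' .. 'F' -> Some (Char.code c - Char.code 'A' + 10)
  | 'a' .. 'f' -> Some (Char.code c - Char.code 'a' + 10)
  | _ -> None

(* Decodes %HH sequences; a malformed or truncated sequence is kept
   as is rather than raising. *)
let percent_decode s =
  let len = String.length s in
  let b = Buffer.create len in
  let rec loop i =
    if i >= len then Buffer.contents b
    else match s.[i] with
      | '%' when i + 2 < len ->
          (match hex_value s.[i + 1], hex_value s.[i + 2] with
           | Some h, Some l ->
               Buffer.add_char b (Char.chr (h * 16 + l)); loop (i + 3)
           | _ -> Buffer.add_char b '%'; loop (i + 1))
      | c -> Buffer.add_char b c; loop (i + 1)
  in
  loop 0
```

With this sketch, percent_encode "a b/c" is "a%20b%2Fc" and percent_decode reverses it. Which characters may stay unencoded depends on the URL component, so the real codecs may offer finer control.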