Module Webs_url

Sloppy URL processing.

URL standards are in a sorry state. This module takes a sloppy approach to URL processing. It only breaks URLs into their components and classifies them.

Warning. None of the functions here perform percent encoding or decoding.

URLs

type scheme = string

The type for schemes.

type authority = string

The type for authorities.

type path = string

The type for paths.

type query = string

The type for queries (without the '?' separator).

type fragment = string

The type for fragments (without the '#' separator).

type t = string

The type for URLs.

Kinds

type relative_kind = [
  | `Scheme
  | `Abs_path
  | `Rel_path
  | `Empty
]

The type for kinds of relative references. Represents this alternation.

type kind = [
  | `Abs
  | `Rel of relative_kind
]

The type for kinds of URLs. Represents this alternation.

val kind : t -> kind

kind u determines the kind of u. It decides that u is absolute if u starts with a scheme followed by ':'.
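The classification rule can be illustrated with a stand-alone sketch. It is not the Webs_url implementation: sketch_kind and its helpers are hypothetical names, the scheme grammar is assumed to be the one from RFC 3986 (a letter followed by letters, digits, '+', '-' or '.'), and the mapping of the relative kinds to a leading "//", a leading "/", anything else, and the empty string is also an assumption.

```ocaml
(* Hypothetical stand-alone sketch; not the Webs_url implementation. *)
type relative_kind = [ `Scheme | `Abs_path | `Rel_path | `Empty ]
type kind = [ `Abs | `Rel of relative_kind ]

let is_alpha c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
let is_scheme_char c =
  is_alpha c || ('0' <= c && c <= '9') || c = '+' || c = '-' || c = '.'

(* True if [u] starts with a letter, then scheme characters, then ':'. *)
let starts_with_scheme u =
  let len = String.length u in
  let rec loop i =
    if i >= len then false else
    if u.[i] = ':' then true else
    if is_scheme_char u.[i] then loop (i + 1) else false
  in
  len > 0 && is_alpha u.[0] && loop 1

(* Assumed mapping: "//..." is `Scheme, "/..." is `Abs_path,
   "" is `Empty and everything else is `Rel_path. *)
let sketch_kind (u : string) : kind =
  if starts_with_scheme u then `Abs else
  let len = String.length u in
  if len = 0 then `Rel `Empty else
  if len >= 2 && u.[0] = '/' && u.[1] = '/' then `Rel `Scheme else
  if u.[0] = '/' then `Rel `Abs_path else
  `Rel `Rel_path
```

Under these assumptions sketch_kind "https://example.org/a" is `Abs and sketch_kind "a/b" is `Rel `Rel_path. The actual kind function may decide differently on edge cases.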

val absolute : root:t -> t -> t

absolute ~root url is url if kind url is `Abs. Otherwise root is used to make url absolute according to its relative_kind. The result is guaranteed to be absolute if root is; it may be surprising or nonsensical if root isn't (FIXME maybe we should rather call that concat and make it like Fpath.concat).

Warning. This doesn't resolve relative path segments.

Components

val scheme : t -> scheme option

scheme u is the scheme from u, if any.

val authority : t -> authority option

authority u extracts a URL authority (HOST:PORT) from u, if any.

val path : t -> path option

path u is the path of u, if any.

val query : t -> query option

query u is the query of u, if any.

val fragment : t -> fragment option

fragment u is the fragment of u, if any.
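As an illustration of what "without the separator" means for queries and fragments, here is a minimal stand-alone sketch. The names sketch_fragment and sketch_query are hypothetical, and the splitting rule (fragment after the first '#', query between the first '?' and the fragment) is assumed from RFC 3986; the module's own functions may treat edge cases differently.

```ocaml
(* Hypothetical sketch; not the Webs_url implementation. *)

(* Everything after the first '#', without the '#' itself. *)
let sketch_fragment u = match String.index_opt u '#' with
  | None -> None
  | Some i -> Some (String.sub u (i + 1) (String.length u - i - 1))

(* Everything between the first '?' and the fragment, without the '?'. *)
let sketch_query u =
  let before_fragment = match String.index_opt u '#' with
    | None -> u
    | Some i -> String.sub u 0 i
  in
  match String.index_opt before_fragment '?' with
  | None -> None
  | Some i ->
      let len = String.length before_fragment - i - 1 in
      Some (String.sub before_fragment (i + 1) len)
```

With this sketch, "https://example.org/p?a=1#top" yields the query Some "a=1" and the fragment Some "top", while "/p" yields None for both.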

val update : ?scheme:scheme option -> ?authority:string option -> ?path:path option -> ?query:query option -> ?fragment:fragment option -> t -> t

update u updates the specified components of u. An unspecified component is kept as in u; a component updated with None is deleted from u.
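The signature combines optional arguments with option values, which gives three cases per component: omitted, Some v and None. The following stand-alone sketch shows that pattern on a hypothetical record type parts with only two components; Webs_url.t itself is a string, and sketch_update is not part of the module.

```ocaml
(* Hypothetical record standing in for a parsed URL. *)
type parts = { scheme : string option; query : string option }

(* Each optional argument has type [string option], so inside the function
   it is seen as a [string option option]. *)
let sketch_update ?scheme ?query p =
  let pick updated current = match updated with
    | None -> current   (* argument omitted: keep the component *)
    | Some v -> v       (* argument given: [Some _] sets, [None] deletes *)
  in
  { scheme = pick scheme p.scheme; query = pick query p.query }

let p = { scheme = Some "http"; query = Some "a=1" }
let kept = sketch_update p                            (* nothing changes *)
let no_query = sketch_update ~query:None p            (* query deleted *)
let secure = sketch_update ~scheme:(Some "https") p   (* scheme replaced *)
```

The same calling convention applies to update: pass ~query:None to delete a component and leave the argument out to keep it.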

Scraping

val list_of_text_scrape : ?root:t -> string -> t list

list_of_text_scrape ?root s roughly finds absolute and relative URLs in s by looking in order:

  1. For the next href or src substring, then tries to parse the content of an HTML attribute. This may result in relative or absolute paths.
  2. For the next http substring in s, then delimits a URL depending on the previous characters and checks that the delimited URL starts with http:// or https://.

Relative URLs are made absolute with root if provided. Otherwise they are kept as is. The result may have duplicates.
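The second step can be approximated by the stand-alone sketch below. It is a deliberate simplification and not the module's algorithm: the hypothetical sketch_scrape_http ends a URL at the first whitespace, quote or angle bracket rather than choosing a delimiter from the preceding characters, and it ignores href and src attributes entirely.

```ocaml
(* Hypothetical sketch of step 2 only; delimiting is simplified. *)

(* True if [prefix] occurs in [s] at index [i]. *)
let starts_with ~prefix s i =
  let plen = String.length prefix in
  i + plen <= String.length s && String.sub s i plen = prefix

(* Assumed set of characters that end a URL. *)
let is_delim = function
  | ' ' | '\t' | '\n' | '\r' | '"' | '\'' | '<' | '>' -> true
  | _ -> false

let sketch_scrape_http s =
  let len = String.length s in
  (* Index of the first delimiter at or after [i], or [len]. *)
  let rec url_end i =
    if i < len && not (is_delim s.[i]) then url_end (i + 1) else i
  in
  let rec loop acc i =
    if i >= len then List.rev acc else
    if starts_with ~prefix:"http://" s i || starts_with ~prefix:"https://" s i
    then
      let j = url_end i in
      loop (String.sub s i (j - i) :: acc) j
    else loop acc (i + 1)
  in
  loop [] 0
```

Like list_of_text_scrape, this sketch returns URLs in order of appearance and does not remove duplicates.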

Formatting

val pp : Stdlib.Format.formatter -> t -> unit

pp formats a URL. For now this is just Format.pp_print_string.
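A formatter of this type is typically used with the %a directive. The sketch below uses a hypothetical stand-in sketch_pp defined as Format.pp_print_string, which is what the text above says pp currently does, so that it runs without the library.

```ocaml
(* Hypothetical stand-in for pp, based on the description above. *)
let sketch_pp : Format.formatter -> string -> unit = Format.pp_print_string

(* %a takes the formatter function and then the value to format. *)
let message = Format.asprintf "fetching %a" sketch_pp "https://example.org/"
```

Here message is the string "fetching https://example.org/".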

val pp_kind : Stdlib.Format.formatter -> kind -> unit

pp_kind formats an unspecified representation of kinds.