B0_std.String
Strings.
include module type of Stdlib.String
val to_seq : t -> char Stdlib.Seq.t
val to_seqi : t -> (int * char) Stdlib.Seq.t
val of_seq : char Stdlib.Seq.t -> t
val get_utf_8_uchar : t -> int -> Stdlib.Uchar.utf_decode
val is_valid_utf_8 : t -> bool
val get_utf_16be_uchar : t -> int -> Stdlib.Uchar.utf_decode
val is_valid_utf_16be : t -> bool
val get_utf_16le_uchar : t -> int -> Stdlib.Uchar.utf_decode
val is_valid_utf_16le : t -> bool
val hash : t -> int
val seeded_hash : int -> t -> int
includes ~affix s
is true
iff there exists an index j
such that for all indices i
of affix
, affix.[i] = s.[j + i]
.
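The definition above can be read as a naive search. A minimal sketch using only Stdlib, where the name includes_ref is hypothetical and this is not the library's implementation:

```ocaml
(* Hypothetical reference sketch; not B0_std's implementation. *)
let includes_ref ~affix s =
  let alen = String.length affix and slen = String.length s in
  (* true iff affix.[i] = s.[j + i] for all remaining indices i of affix *)
  let rec matches_at j i =
    i = alen || (affix.[i] = s.[j + i] && matches_at j (i + 1))
  in
  (* try every candidate start index j where affix still fits in s *)
  let rec loop j = j + alen <= slen && (matches_at j 0 || loop (j + 1)) in
  loop 0
```

With this sketch the empty affix is included in every string, the empty string too.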
find_index ~start sat
is the index of the first character of s
that satisfies sat
after or at start
(defaults to 0
).
rfind_index ~start sat
is the index of the last character of s
that satisfies sat
before or at start
(defaults to String.length s - 1
).
find_sub ~start ~sub s
is the start position (if any) of the first occurrence of sub
in s
after or at position start
(which includes index start
if it exists, defaults to 0
). Note: if you need to search for sub
multiple times in s
, use find_sub_all
; it is more efficient.
rfind_sub ~start ~sub s
is the start position (if any) of the first occurrence of sub
in s
before or at position start
(which includes index start
if it exists, defaults to String.length s
).
Note: if you need to search for sub
multiple times in s
, use rfind_sub_all
; it is more efficient.
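To make the two search directions concrete, here is a naive sketch using only Stdlib. The names find_sub_ref and rfind_sub_ref are hypothetical, the result type int option is an assumption, and clamping out-of-range start values is a choice of this sketch, not documented behaviour:

```ocaml
(* Hypothetical sketches; the real functions are likely more efficient. *)
let matches_at ~sub s j =
  let sublen = String.length sub in
  let rec loop i = i = sublen || (s.[j + i] = sub.[i] && loop (i + 1)) in
  loop 0

(* Scans forward from start (assumption: negative start is clamped to 0). *)
let find_sub_ref ?(start = 0) ~sub s =
  let last = String.length s - String.length sub in
  let rec loop j =
    if j > last then None
    else if matches_at ~sub s j then Some j
    else loop (j + 1)
  in
  loop (max 0 start)

(* Scans backward from start, which defaults to String.length s. *)
let rfind_sub_ref ?start ~sub s =
  let slen = String.length s in
  let start = match start with None -> slen | Some i -> min i slen in
  let rec loop j =
    if j < 0 then None
    else if matches_at ~sub s j then Some j
    else loop (j - 1)
  in
  loop (min start (slen - String.length sub))
```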
find_sub_all ~start f ~sub s acc
, starting with acc
, folds f
over all non-overlapping starting positions of sub
in s
after or at position start
(which includes index start
if it exists, defaults to 0
). This is acc
if sub
could not be found in s
.
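The non-overlapping fold can be sketched as follows. The name find_sub_all_ref is hypothetical, the argument order of f (position, then accumulator) is an assumption, and the sketch rejects an empty sub because the text above does not say how it is handled:

```ocaml
(* Hypothetical sketch of a forward, non-overlapping fold over matches. *)
let find_sub_all_ref ?(start = 0) f ~sub s acc =
  let sublen = String.length sub and slen = String.length s in
  if sublen = 0 then invalid_arg "empty sub is not handled by this sketch" else
  let rec matches_at j i =
    i = sublen || (s.[j + i] = sub.[i] && matches_at j (i + 1))
  in
  let rec loop j acc =
    if j + sublen > slen then acc
    else if matches_at j 0
    then loop (j + sublen) (f j acc) (* jump past the match: no overlap *)
    else loop (j + 1) acc
  in
  loop (max 0 start) acc

(* Usage: collect all match positions in increasing order. *)
let positions ~sub s =
  List.rev (find_sub_all_ref (fun i acc -> i :: acc) ~sub s [])
```

In "aaaaa" the substring "aa" is reported at positions 0 and 2 only, since the match at 1 overlaps the one at 0.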
rfind_sub_all ~start f ~sub s acc
, starting with acc
, folds f
over all non-overlapping starting positions of sub
in s
before or at position start
(which includes index start
if it exists, defaults to String.length s
). This is acc
if sub
could not be found in s
.
replace_first ~start ~sub ~by s
replaces in s
the first occurrence of sub
at or after position start
(defaults to 0
) by by
.
replace_all ~start ~sub ~by s
replaces in s
all non-overlapping occurrences of sub
at or after position start
(defaults to 0
) by by
.
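A minimal sketch of the replacement semantics, using only Stdlib. The name replace_all_ref is hypothetical; returning s unchanged for an empty sub and clamping start to the string bounds are assumptions of this sketch:

```ocaml
(* Hypothetical sketch; not the library's implementation. *)
let replace_all_ref ?(start = 0) ~sub ~by s =
  let sublen = String.length sub and slen = String.length s in
  if sublen = 0 then s else
  let b = Buffer.create slen in
  let rec matches_at j i =
    i = sublen || (s.[j + i] = sub.[i] && matches_at j (i + 1))
  in
  let start = max 0 (min start slen) in
  (* bytes before start are copied verbatim *)
  Buffer.add_substring b s 0 start;
  let rec loop j =
    if j >= slen then ()
    else if j + sublen <= slen && matches_at j 0
    then (Buffer.add_string b by; loop (j + sublen))
    else (Buffer.add_char b s.[j]; loop (j + 1))
  in
  loop start;
  Buffer.contents b
```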
subrange ~first ~last s
are the consecutive bytes of s
whose indices exist in the range [first
;last
].
first
defaults to 0
and last to String.length s - 1
.
Note that both first
and last
can be any integer. If first > last
the interval is empty and the empty string is returned.
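Since first and last can be any integer, the range is in effect clamped to the existing indices. A sketch of that reading with Stdlib only (subrange_ref is a hypothetical name):

```ocaml
(* Hypothetical sketch: clamp [first;last] to the indices of s. *)
let subrange_ref ?(first = 0) ?last s =
  let max_idx = String.length s - 1 in
  let last = match last with None -> max_idx | Some l -> min l max_idx in
  let first = max 0 first in
  if first > last then "" else String.sub s first (last - first + 1)
```

For example subrange_ref ~first:1 ~last:3 "hello" is "ell" and subrange_ref ~first:(-5) ~last:100 "hello" is "hello".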
take n s
are the first n
bytes of s
. This is s
if n >= length s
and ""
if n <= 0
.
rtake n s
are the last n
bytes of s
. This is s
if n >= length s
and ""
if n <= 0
.
drop n s
is s
without the first n
bytes of s
. This is ""
if n >= length s
and s
if n <= 0
.
rdrop n s
is s
without the last n
bytes of s
. This is ""
if n >= length s
and s
if n <= 0
.
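The four functions above differ only in which end they keep. A sketch with Stdlib's String.sub, using hypothetical _ref names:

```ocaml
(* Hypothetical sketches of take, rtake, drop and rdrop. *)
let take_ref n s =
  if n <= 0 then "" else
  if n >= String.length s then s else String.sub s 0 n

let rtake_ref n s =
  let len = String.length s in
  if n <= 0 then "" else
  if n >= len then s else String.sub s (len - n) n

let drop_ref n s =
  let len = String.length s in
  if n <= 0 then s else
  if n >= len then "" else String.sub s n (len - n)

let rdrop_ref n s =
  let len = String.length s in
  if n <= 0 then s else
  if n >= len then "" else String.sub s 0 (len - n)
```

For any n, take_ref n s ^ drop_ref n s = s and rdrop_ref n s ^ rtake_ref n s = s.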
take_while sat s
are the first consecutive sat
satisfying bytes of s
.
rtake_while sat s
are the last consecutive sat
satisfying bytes of s
.
drop_while sat s
is s
without the first consecutive sat
satisfying bytes of s
.
rdrop_while sat s
is s
without the last consecutive sat
satisfying bytes of s
.
span_while sat s
is (take_while sat s, drop_while sat s)
.
rspan_while sat s
is (rdrop_while sat s, rtake_while sat s)
.
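Both spans can be computed with a single scan for the split index. A sketch with hypothetical names, using only Stdlib:

```ocaml
(* Hypothetical sketches: find the split index, then cut once. *)
let span_while_ref sat s =
  let len = String.length s in
  (* index of the first byte that does not satisfy sat *)
  let rec loop i = if i < len && sat s.[i] then loop (i + 1) else i in
  let i = loop 0 in
  String.sub s 0 i, String.sub s i (len - i)

let rspan_while_ref sat s =
  let len = String.length s in
  (* start index of the longest suffix whose bytes all satisfy sat *)
  let rec loop i = if i > 0 && sat s.[i - 1] then loop (i - 1) else i in
  let i = loop len in
  String.sub s 0 i, String.sub s i (len - i)
```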
cut ~sep s
is the pair Some (left, right)
made of the two (possibly empty) substrings of s
that are delimited by the first match of the separator sep
or None
if sep
can't be matched in s
. Matching starts at position 0
using find_sub
.
The invariant concat sep [left; right] = s
holds.
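The invariant is easiest to see on a small sketch. The name cut_ref is hypothetical and the naive search below stands in for find_sub:

```ocaml
(* Hypothetical sketch of cutting on the first match of sep. *)
let cut_ref ~sep s =
  let seplen = String.length sep and slen = String.length s in
  let rec matches_at j i =
    i = seplen || (s.[j + i] = sep.[i] && matches_at j (i + 1))
  in
  let rec loop j =
    if j + seplen > slen then None
    else if matches_at j 0 then
      let right_start = j + seplen in
      Some (String.sub s 0 j, String.sub s right_start (slen - right_start))
    else loop (j + 1)
  in
  loop 0
```

For example cut_ref ~sep:"=" "key=val=x" is Some ("key", "val=x"), and String.concat "=" ["key"; "val=x"] gives back the input.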
split ~sep s
is the list of all substrings of s
that are delimited by non-overlapping matches of the separator sep
. If sep
can't be matched in s
, the list [s]
is returned. Matches start at position 0
and are determined using find_sub_all
.
Substrings sub
for which drop sub
is true
are not included in the result. drop
defaults to Fun.const false
.
The invariant concat sep (split ~sep s) = s
holds.
rsplit ~sep s
is like split
but matching starts at position length s
using rfind_sub_all
.
fold_ascii_lines ~strip_newlines f acc s
folds over the lines of s
by calling f linenum acc' line
with linenum
the one-based line number count, acc'
the result of accumulating acc
with f
so far and line
the data of the line (without the newline found in the data if strip_newlines
is true
).
Lines are delimited by newline sequences which are either one of "\n"
, "\r\n"
or "\r"
. More precisely the function determines lines and line data as follows:
s = ""
, the function considers there are no lines in s
and acc
is returned without f
being called.
s <> ""
, s
is repeatedly split on the first newline sequence "\n"
, "\r\n"
or "\r"
into (left, newline, right)
, left
(or left ^ newline
when strip_newlines = false
) is given to f
and the process is repeated with right
until a split can no longer be found. At that point this final string is given to f
and the process stops.
detach_ascii_newline s
is (data, endline)
with:
endline
either the suffix "\n"
, "\r\n"
or "\r"
of s
or ""
if s
has no such suffix.
data
the bytes before endline
such that data ^ endline = s
.
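A sketch of detaching the final newline sequence with Stdlib only. The name detach_ascii_newline_ref is hypothetical; note that "\r\n" has to be tested before the one-byte suffixes:

```ocaml
(* Hypothetical sketch; not the library's implementation. *)
let detach_ascii_newline_ref s =
  let len = String.length s in
  let endline_len =
    if len >= 2 && s.[len - 2] = '\r' && s.[len - 1] = '\n' then 2
    else if len >= 1 && (s.[len - 1] = '\n' || s.[len - 1] = '\r') then 1
    else 0
  in
  let data_len = len - endline_len in
  String.sub s 0 data_len, String.sub s data_len endline_len
```

Concatenating the two components always gives back s.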
next_token ~is_sep ~is_token s
skips characters satisfying is_sep
from s
, then gathers zero or more consecutive characters satisfying is_token
into a string which is returned along with the remaining characters after that. is_sep
defaults to Char.Ascii.is_white
and is_token
is Char.Ascii.is_graphic
.
tokens s
are the strings separated by sequences of is_sep
characters (defaults to Char.Ascii.is_white
). The empty list is returned if s
is empty or made only of separators.
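A sketch of the tokenization with Stdlib only. The names tokens_ref and is_white are hypothetical; is_white is an assumed stand-in for Char.Ascii.is_white (space and the bytes '\t' to '\r'):

```ocaml
(* Assumed stand-in for Char.Ascii.is_white. *)
let is_white = function ' ' | '\t' .. '\r' -> true | _ -> false

(* Hypothetical sketch: alternate between skipping separators and tokens. *)
let tokens_ref ?(is_sep = is_white) s =
  let len = String.length s in
  let rec skip_seps i = if i < len && is_sep s.[i] then skip_seps (i + 1) else i in
  let rec skip_token i =
    if i < len && not (is_sep s.[i]) then skip_token (i + 1) else i
  in
  let rec loop acc i =
    let first = skip_seps i in
    if first >= len then List.rev acc else
    let next = skip_token first in
    loop (String.sub s first (next - first) :: acc) next
  in
  loop [] 0
```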
distinct ss
is ss
without duplicates, the list order is preserved.
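An order-preserving deduplication can be sketched with a set of already seen strings. The name distinct_ref is hypothetical, and keeping the first occurrence of each duplicate is an assumption of this sketch:

```ocaml
module String_set = Set.Make (String)

(* Hypothetical sketch: keep the first occurrence of each string. *)
let distinct_ref ss =
  let rec loop seen acc = function
    | [] -> List.rev acc
    | s :: ss when String_set.mem s seen -> loop seen acc ss
    | s :: ss -> loop (String_set.add s seen) (s :: acc) ss
  in
  loop String_set.empty [] ss
```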
unique ~limit ~exists n
is n
if exists n
is false
or r = strf "%s~%d" n d
with d
the smallest integer such that exists r
is false
. If no d
in [1
;limit
] satisfies the condition Invalid_argument
is raised, limit
defaults to 1e6
.
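The search for the smallest suffix can be sketched as a simple loop. The name unique_ref is hypothetical, Printf.sprintf stands in for strf, and the error message is made up for the sketch:

```ocaml
(* Hypothetical sketch; not the library's implementation. *)
let unique_ref ?(limit = 1_000_000) ~exists n =
  if not (exists n) then n else
  let rec loop d =
    if d > limit then invalid_arg "unique_ref: limit reached" else
    let r = Printf.sprintf "%s~%d" n d in
    if exists r then loop (d + 1) else r
  in
  loop 1
```

If "a" and "a~1" both exist, unique_ref ~exists "a" is "a~2".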
All additions available in OCaml 5.4
edit_distance s0 s1
is the number of single character edits (understood as insertion, deletion, substitution, transposition) that are needed to change s0
into s1
.
If limit
is provided the function returns with limit
as soon as it was determined that s0
and s1
have distance of at least limit
. This is faster if you have a fixed limit, for example for spellchecking.
The function assumes the strings are UTF-8 encoded and uses Uchar.t
for the notion of character. Decoding errors are replaced by Uchar.rep
. Normalizing the strings to NFC gives better results.
Note. This implements the simpler Optimal String Alignment (OSA) distance, not the Damerau-Levenshtein distance. With this function "ca"
and "abc"
have a distance of 3 not 2.
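To illustrate the OSA distance, here is the textbook dynamic programming formulation. It is a sketch that compares bytes rather than decoded Uchar.t values and has no limit parameter, so it only agrees with the description above on ASCII input; osa_distance is a hypothetical name:

```ocaml
(* Textbook OSA distance on bytes; a sketch, not the library's code. *)
let osa_distance s0 s1 =
  let m = String.length s0 and n = String.length s1 in
  (* d.(i).(j) is the distance between the first i bytes of s0 and the
     first j bytes of s1 *)
  let d = Array.make_matrix (m + 1) (n + 1) 0 in
  for i = 0 to m do d.(i).(0) <- i done;
  for j = 0 to n do d.(0).(j) <- j done;
  for i = 1 to m do
    for j = 1 to n do
      let cost = if s0.[i - 1] = s1.[j - 1] then 0 else 1 in
      let best =
        min (min (d.(i - 1).(j) + 1)      (* deletion *)
                 (d.(i).(j - 1) + 1))     (* insertion *)
            (d.(i - 1).(j - 1) + cost)    (* substitution or match *)
      in
      let best =
        (* transposition of two adjacent bytes *)
        if i > 1 && j > 1 && s0.[i - 1] = s1.[j - 2] && s0.[i - 2] = s1.[j - 1]
        then min best (d.(i - 2).(j - 2) + 1)
        else best
      in
      d.(i).(j) <- best
    done
  done;
  d.(m).(n)
```

With this sketch osa_distance "ab" "ba" is 1 and osa_distance "ca" "abc" is 3, matching the note above.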
spellcheck iter_dict s
are the strings enumerated by the iterator iter_dict
whose edit distance to s
is the smallest and at most max_dist s
. If multiple corrections are returned their order is as found in iter_dict
. The default max_dist s
is:
0
if s
has 0 to 2 Unicode characters.
1
if s
has 3 to 4 Unicode characters.
2
otherwise.
If your dictionary is a list l
, a suitable iter_dict
is given by (fun yield -> List.iter yield l)
.
All strings are assumed to be UTF-8 encoded, decoding errors are replaced by Uchar.rep
characters.
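The default max_dist described above can be sketched with Stdlib's UTF-8 decoder (OCaml 4.14 or later). The names utf_8_char_count and default_max_dist are hypothetical; malformed sequences count as one character each, in line with their replacement by Uchar.rep:

```ocaml
(* Counts decoded characters; each decode error counts as one. *)
let utf_8_char_count s =
  let len = String.length s in
  let rec loop i n =
    if i >= len then n else
    let dec = String.get_utf_8_uchar s i in
    loop (i + Uchar.utf_decode_length dec) (n + 1)
  in
  loop 0 0

(* Hypothetical sketch of the default maximal distance. *)
let default_max_dist s =
  match utf_8_char_count s with
  | 0 | 1 | 2 -> 0
  | 3 | 4 -> 1
  | _ -> 2
```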
The following functions can only (un)escape a single byte. See also these functions to convert a string to printable ASCII characters.
byte_escaper char_len set_char
is a byte escaper such that:
char_len c
is the length of the unescaped byte c
in the escaped form. If 1
is returned then c
is assumed to be unchanged; use byte_replacer
if that does not hold.
set_char b i c
sets an unescaped byte c
to its escaped form at index i
in b
and returns the next writable index. set_char
is called regardless of whether c
needs to be escaped or not; in the latter case you must write c
(use byte_replacer
if that is not the case). No bounds checks need to be performed on i
or the returned value.
For any b
, c
and i
the invariant i + char_len c = set_char b i c
must hold.
Here's a small example that escapes double quotes '"'
by prefixing them with a backslash:
let escape_dquotes s =
  let char_len = function '"' -> 2 | _ -> 1 in
  let set_char b i = function
  | '"' -> Bytes.set b i '\\'; Bytes.set b (i+1) '"'; i + 2
  | c -> Bytes.set b i c; i + 1
  in
  String.byte_escaper char_len set_char s
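To see why the invariant on char_len and set_char matters, here is one way such an escaper could be driven, using only Stdlib. This is a sketch under assumptions, not the library's implementation; byte_escaper_ref is a hypothetical name:

```ocaml
(* Hypothetical driver: a first pass sizes the result with char_len,
   a second pass fills it with set_char. *)
let byte_escaper_ref char_len set_char s =
  let slen = String.length s in
  let rec escaped_len i acc =
    if i >= slen then acc else escaped_len (i + 1) (acc + char_len s.[i])
  in
  let blen = escaped_len 0 0 in
  if blen = slen then s (* every length is 1: nothing to escape *) else
  let b = Bytes.create blen in
  let rec loop i k =
    if i >= slen then Bytes.unsafe_to_string b
    else loop (i + 1) (set_char b k s.[i])
  in
  loop 0 0

(* The example above, run through the sketch. *)
let escape_dquotes s =
  let char_len = function '"' -> 2 | _ -> 1 in
  let set_char b i = function
  | '"' -> Bytes.set b i '\\'; Bytes.set b (i + 1) '"'; i + 2
  | c -> Bytes.set b i c; i + 1
  in
  byte_escaper_ref char_len set_char s
```

Because the buffer is sized from char_len alone, a set_char that advances by a different amount would write out of bounds or leave bytes uninitialized. The shortcut that returns s when the lengths agree is also why a length of 1 must mean the byte is unchanged.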
byte_replacer char_len set_char
is like byte_escaper
but a byte can be substituted by another one by set_char
.
See byte_unescaper
.
val byte_unescaper :
(string -> int -> int) ->
(bytes -> int -> string -> int -> int) ->
string ->
(string, int) Stdlib.result
byte_unescaper char_len_at set_char
is a byte unescaper such that:
char_len_at s i
is the length of an escaped byte at index i
of s
. If 1
is returned then the byte is assumed to be unchanged by the unescape; use byte_unreplacer
if that does not hold.
set_char b k s i
sets at index k
in b
the unescaped byte read at index i
in s
and returns the next readable index in s
. set_char
is called regardless of whether the byte at i
must be unescaped or not; in the latter case you must write s.[i]
only (use byte_unreplacer
if that is not the case). No bounds checks need to be performed on k
, i
or the returned value.
For any b
, s
, k
and i
the invariant i + char_len_at s i = set_char b k s i
must hold.
Both char_len_at
and set_char
may raise Illegal_escape i
if the given index i
has an illegal or truncated escape. The unescaper turns this exception into Error i
if that happens.
val byte_unreplacer :
(string -> int -> int) ->
(bytes -> int -> string -> int -> int) ->
string ->
(string, int) Stdlib.result
byte_unreplacer char_len_at set_char
is like byte_unescaper
except set_char
can set a different byte whenever char_len_at
returns 1
.
module Ascii : sig ... end
ASCII string support.
subst_pct_vars ~buf vars s
substitutes in s
sub-strings of the form %%VAR%%
by the value of vars "VAR"
(if any).
val pp : string Fmt.t
pp ppf s
prints s
's bytes on ppf
.
module Set : sig ... end
String sets.
module Map : sig ... end
String maps.