Module `Dataset.Var`

Variable processing.

Note. Unless otherwise specified, `Var`.Float variables treat `nan`s as missing values. `nan` values can still be returned, for example if that's the single value available.

Summarize

`val count : ('o, 'a) Var.t -> 'o t -> ('a * int) t`

`count var d` groups observations by unique values of variable `var` and reports the number of observations found in each group. Variable values are sorted in increasing order. The resulting dataset can be accessed with `Var.for_var_count var`. See also `group`.
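The counting behaviour described above can be sketched on a plain list of integers. This is only an illustration: `count_ints` and `IntMap` are hypothetical names, the sketch does not use the library's dataset or variable types, and it is not the library's implementation.

```ocaml
(* Standalone sketch: count occurrences of each distinct value in a list.
   [count_ints] and [IntMap] are illustrative names, not library API. *)
module IntMap = Map.Make (Int)

let count_ints (xs : int list) : (int * int) list =
  let bump = function None -> Some 1 | Some n -> Some (n + 1) in
  xs
  |> List.fold_left (fun m x -> IntMap.update x bump m) IntMap.empty
  (* [bindings] returns the pairs sorted by increasing key. *)
  |> IntMap.bindings
```

For example, `count_ints [3; 1; 3; 2; 1; 3]` pairs each distinct value with its number of occurrences, in increasing order of value.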

`val sum : ('o, 'a) Var.t -> 'o t -> float`

`sum var d` is the sum of the values of `var` in `d`. `nan` values are excluded from the computation. This is `nan` on non-numeric types or if there are only `nan`s. On floats uses the Kahan-Babuška algorithm (§3).
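As a rough illustration of compensated summation, here is a minimal sketch of the Kahan-Babuška scheme (in Neumaier's formulation) on a plain `float array`. The name `kbn_sum` is hypothetical, and the sketch only mirrors the documented `nan` conventions; the library's actual code may differ.

```ocaml
(* Standalone sketch on a plain float array; not the library's code.
   [kbn_sum] is an illustrative name. *)
let kbn_sum (xs : float array) : float =
  let sum = ref 0.0 and comp = ref 0.0 and seen = ref false in
  Array.iter
    (fun x ->
      (* nan values are treated as missing and skipped. *)
      if not (Float.is_nan x) then begin
        seen := true;
        let t = !sum +. x in
        (* Accumulate the low-order bits lost by the addition above. *)
        if Float.abs !sum >= Float.abs x
        then comp := !comp +. ((!sum -. t) +. x)
        else comp := !comp +. ((x -. t) +. !sum);
        sum := t
      end)
    xs;
  (* No usable value at all: report nan, as documented for [sum]. *)
  if !seen then !sum +. !comp else Float.nan
```

On `[| 1.0; 1e100; 1.0; -1e100 |]` a naive left-to-right sum loses both `1.0` terms, whereas the compensated version recovers them.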

`val mean : ('o, 'a) Var.t -> 'o t -> float`

`mean var d` is the arithmetic mean of variable `var` in `d`. `nan` values are excluded from the computation. This is `nan` on non-numeric types or if there are only `nan`s. Uses `sum` to compute the result.

`val quantile : ('o, 'a) Var.t -> 'o t -> float -> float`

`quantile var d` is a function `quant` such that `quant p` is the `p`-quantile of `d` on variable `var` using the R-7 definition. `quant` clamps its argument to [`0`;`1`]. `nan` values are excluded from the computation. The function is `Fun.const nan` on non-numeric types or if there are only `nan`s.
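The R-7 definition interpolates linearly between the two order statistics surrounding position `(n - 1) * p` in the sorted sample. A minimal sketch on a plain `float array` follows; `quantile_r7` is a hypothetical name, and returning `nan` for a `nan` argument is an assumption of this sketch, not something the documentation states.

```ocaml
(* Standalone sketch of the R-7 quantile on a plain float array;
   [quantile_r7] is an illustrative name, not library API. *)
let quantile_r7 (xs : float array) : float -> float =
  let sorted =
    xs |> Array.to_list
    |> List.filter (fun x -> not (Float.is_nan x))
    |> Array.of_list
  in
  (* Sort once; the returned function can then be queried many times. *)
  Array.sort Float.compare sorted;
  let n = Array.length sorted in
  if n = 0 then Fun.const Float.nan
  else fun p ->
    if Float.is_nan p then Float.nan (* assumption of this sketch *)
    else begin
      let p = Float.min 1.0 (Float.max 0.0 p) in (* clamp to [0;1] *)
      let h = float_of_int (n - 1) *. p in
      let lo = int_of_float (Float.floor h) in
      let hi = min (lo + 1) (n - 1) in
      sorted.(lo) +. (h -. float_of_int lo) *. (sorted.(hi) -. sorted.(lo))
    end
```

Sorting happens when the function is built, which is why reusing the returned function is cheaper than recomputing it for each quantile.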

`val median : ('o, 'a) Var.t -> 'o t -> float`

`median var d` is `quantile var d 0.5`. If you also need other quantiles, use the function returned by `quantile var d`.

`val variance : ('o, 'a) Var.t -> 'o t -> float`

`variance var d` is the unbiased sample variance of variable `var` in `d` computed using Welford's algorithm. This is `nan` on non-numeric types, if there are only `nan`s, or if there are fewer than two numbers.
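Welford's algorithm updates a running mean and a running sum of squared deviations in a single pass. The following is a minimal sketch on a plain `float array`; `welford_variance` is a hypothetical name, and skipping `nan` values is an assumption carried over from the module-level note.

```ocaml
(* Standalone sketch of Welford's one-pass variance on a float array;
   [welford_variance] is an illustrative name, not library API. *)
let welford_variance (xs : float array) : float =
  let n = ref 0 and mean = ref 0.0 and m2 = ref 0.0 in
  Array.iter
    (fun x ->
      (* Assumption: nan values are skipped as missing. *)
      if not (Float.is_nan x) then begin
        incr n;
        let delta = x -. !mean in
        mean := !mean +. delta /. float_of_int !n;
        (* Uses the deviation from the old mean and from the new mean. *)
        m2 := !m2 +. delta *. (x -. !mean)
      end)
    xs;
  (* Unbiased estimate divides by n - 1, so two values are required. *)
  if !n < 2 then Float.nan else !m2 /. float_of_int (!n - 1)
```

The standard deviation is then simply `sqrt` of this result.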

`val deviation : ('o, 'a) Var.t -> 'o t -> float`

`deviation var d` is `sqrt (variance var d)`, the standard deviation of variable `var` in `d`.

Grouping

`val group : by:('o, 'a) Var.t -> 'o t -> ('a * 'o t) t`

`group ~by d` groups observations of `d` by the equivalence relation determined by variable `by`. The sequence of groups is ordered by `Var.compare_value by`.

Range

`val min : ('o, 'a) Var.t -> 'o t -> 'a`

`min var d` is the minimal value of `var` in `d` as determined by `Evidence.Var.min_value`.

`val max : ('o, 'a) Var.t -> 'o t -> 'a`

`max var d` is the maximal value of `var` in `d` as determined by `Evidence.Var.max_value`.

`val min_max : ('o, 'a) Var.t -> 'o t -> 'a * 'a`

`min_max var d` is `(min var d, max var d)` but more efficient.
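One plausible reason for the efficiency gain is that both bounds can be tracked in a single traversal. The sketch below illustrates this on a plain array with an explicit comparison function; `min_max_one_pass` is a hypothetical name, and returning `None` on empty input is an assumption of this sketch, not the library's behaviour.

```ocaml
(* Standalone sketch: minimum and maximum in one traversal.
   [min_max_one_pass] is an illustrative name, not library API. *)
let min_max_one_pass ~(compare : 'a -> 'a -> int) (xs : 'a array)
  : ('a * 'a) option =
  if Array.length xs = 0 then None (* assumption: no bounds on empty input *)
  else
    let step (lo, hi) x =
      ( (if compare x lo < 0 then x else lo),
        (if compare x hi > 0 then x else hi) )
    in
    Some (Array.fold_left step (xs.(0), xs.(0)) xs)
```

For instance, `min_max_one_pass ~compare:Int.compare [| 3; 1; 4; 1; 5 |]` visits each element once and yields both bounds together.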

`val values : ('o, 'a) Var.t -> 'o t -> 'a t`

`values var d` is the unique values found in `var` sorted by increasing `Var.compare_value` order.

`val dom : (module Stdlib.Set.S with type elt = 'a and type t = 'set) -> ('o, 'a) Var.t -> 'o t -> 'set`

`dom` is `values` but as a set.

Transforming

`val update : ('o, 'a) Var.t -> (int -> 'o -> 'a) -> 'o t -> 'o t`

`update var f d` updates variable `var` of each observation of `d` with `f`. Note that this fails on observations with absurd products.

`val set : ('o, 'a) Var.t -> 'a -> int -> 'o t -> 'o t`

`set var v i d` sets variable `var` of the `i`th observation of `d` to `v`. For efficiency, do not call this function repeatedly; `update` is a better option.

`val cumsum : ('o, 'a) Var.t -> 'o t -> float t`

`cumsum var d` is the cumulative sum of variable `var`.
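As an illustration of what a cumulative sum produces, here is a minimal sketch on a plain `float array`. `cumsum_sketch` is a hypothetical name; the sketch uses naive addition and lets `nan` propagate, since the documentation does not say how `cumsum` treats `nan` values or whether it uses compensated summation.

```ocaml
(* Standalone sketch: element i of the result is the sum of xs.(0..i).
   [cumsum_sketch] is an illustrative name, not library API. *)
let cumsum_sketch (xs : float array) : float array =
  let out = Array.make (Array.length xs) 0.0 in
  let acc = ref 0.0 in
  (* Assumption: plain IEEE addition, so a nan poisons later entries. *)
  Array.iteri (fun i x -> acc := !acc +. x; out.(i) <- !acc) xs;
  out
```

For example, `cumsum_sketch [| 1.0; 2.0; 3.0; 4.0 |]` gives the running totals of the input in order.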