# Statistics API

Vega includes **statistics** functions to model probability distributions and perform other statistical calculations. These methods are bound to the top-level `vega` object, and can also be used in a stand-alone fashion by using the vega-statistics project.

## Statistics API Reference

## Random Number Generation

vega.**random**()

Returns a uniform pseudo-random number in the domain [0, 1). By default this is simply a call to JavaScript’s built-in `Math.random` function. All Vega routines that require random numbers should use this function.

vega.**setRandom**(*randfunc*)

Sets the random number generator to the provided function *randfunc*. Subsequent calls to `random` will invoke the new function to generate random numbers. Setting a custom generator can be helpful if one wishes to use an alternative source of randomness or replace the default generator with a deterministic function for testing purposes.

vega.**randomLCG**(*seed*)

Returns a new random number generator with the given random *seed*. The returned function takes zero arguments and generates random values in the domain [0, 1) using a linear congruential generator (LCG). This method is helpful in conjunction with `setRandom` to provide seeded random numbers for stable outputs and testing.
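To make the idea of a seeded generator concrete, here is a minimal sketch of a linear congruential generator in plain JavaScript. The function name `lcg` is hypothetical and the multiplier and increment are common textbook constants chosen for illustration; they are an assumption and may not match the constants Vega uses internally.

```javascript
// Hypothetical helper: a seeded linear congruential generator (LCG).
// Constants are illustrative textbook values, not necessarily Vega's.
function lcg(seed) {
  let state = seed >>> 0; // coerce the seed to an unsigned 32-bit integer
  return function() {
    // state = (a * state + c) mod 2^32, computed with 32-bit integer math
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 4294967296; // scale into [0, 1)
  };
}

// Two generators created with the same seed produce the same sequence.
const first = lcg(42);
const second = lcg(42);
const seqA = [first(), first(), first()];
const seqB = [second(), second(), second()];
console.log(seqA.join() === seqB.join()); // true
```

A generator of this shape takes zero arguments and returns values in [0, 1), so it could be passed wherever a replacement for `Math.random` is expected.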

## Probability Distributions

Methods for sampling and calculating probability distributions. Each method takes a set of distributional parameters and returns a distribution object representing a random variable.

Distribution objects expose the following methods:

- dist.**sample**(): Samples a random value drawn from this distribution.
- dist.**pdf**(*value*): Calculates the value of the probability density function at the given input domain *value*.
- dist.**cdf**(*value*): Calculates the value of the cumulative distribution function at the given input domain *value*.
- dist.**icdf**(*probability*): Calculates the inverse of the cumulative distribution function for the given input *probability*.

vega.**randomNormal**([*mean*, *stdev*])

Creates a distribution object representing a normal (Gaussian) probability distribution with specified *mean* and standard deviation *stdev*. If unspecified, the mean defaults to `0` and the standard deviation defaults to `1`.

Once created, *mean* and *stdev* values can be accessed or modified using the `mean` and `stdev` getter/setter methods.

vega.**randomUniform**([*min*, *max*])

Creates a distribution object representing a continuous uniform probability distribution over the interval [*min*, *max*). If unspecified, *min* defaults to `0` and *max* defaults to `1`. If only one argument is provided, it is interpreted as the *max* value.

Once created, *min* and *max* values can be accessed or modified using the `min` and `max` getter/setter methods.
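The four distribution methods listed earlier (`sample`, `pdf`, `cdf`, `icdf`) can be illustrated with a small stand-in for a continuous uniform distribution. This is a sketch, not Vega's implementation: the names `uniform` and `rng` are hypothetical, the single-argument form and the getter/setter methods are omitted, and the behavior outside the interval is an assumption.

```javascript
// Illustrative stand-in for a continuous uniform distribution on [min, max).
// `uniform` and `rng` are hypothetical names, not part of the Vega API.
function uniform(min = 0, max = 1, rng = Math.random) {
  const span = max - min;
  return {
    // draw a value by scaling a [0, 1) random number into the interval
    sample: () => min + span * rng(),
    // constant density inside the interval, zero outside
    pdf: v => (v >= min && v <= max ? 1 / span : 0),
    // fraction of the interval lying at or below v
    cdf: v => (v < min ? 0 : v > max ? 1 : (v - min) / span),
    // invert the cdf; probabilities outside [0, 1] are invalid
    icdf: p => (p >= 0 && p <= 1 ? min + p * span : NaN)
  };
}

const u = uniform(2, 6);
console.log(u.pdf(3));    // 0.25
console.log(u.cdf(4));    // 0.5
console.log(u.icdf(0.5)); // 4
```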

vega.**randomInteger**([*min*,] *max*)

Creates a distribution object representing a discrete uniform probability distribution over the integer domain [*min*, *max*). If only one argument is provided, it is interpreted as the *max* value. If unspecified, *min* defaults to `0`.

Once created, *min* and *max* values can be accessed or modified using the `min` and `max` getter/setter methods.

vega.**randomMixture**(*distributions*[, *weights*])

Creates a distribution object representing a (weighted) mixture of probability distributions. The *distributions* argument should be an array of distribution objects. The optional *weights* array provides proportional numerical weights for each distribution. If provided, the values in the *weights* array will be normalized to ensure that weights sum to 1. Any unspecified weight values default to `1` (prior to normalization). Mixture distributions do **not** support the `icdf` method: calling `icdf` will result in an error.

Once created, the *distributions* and *weights* arrays can be accessed or modified using the `distributions` and `weights` getter/setter methods.
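The weight handling described above can be sketched as follows. The helpers `normalizeWeights` and `mixturePdf` are hypothetical names used only for illustration, and the two component objects are toy stand-ins that expose just a `pdf` method.

```javascript
// Normalize proportional weights so they sum to 1.
// Missing entries default to 1 before normalization.
function normalizeWeights(n, weights = []) {
  const w = Array.from({length: n}, (_, i) => (weights[i] == null ? 1 : +weights[i]));
  const sum = w.reduce((s, v) => s + v, 0);
  return w.map(v => v / sum);
}

// Mixture density: the weighted sum of the component densities.
function mixturePdf(dists, weights) {
  const w = normalizeWeights(dists.length, weights);
  return x => dists.reduce((p, d, i) => p + w[i] * d.pdf(x), 0);
}

// Two toy components (uniform densities on [0, 1) and [0, 2)).
const flat01 = {pdf: x => (x >= 0 && x < 1 ? 1 : 0)};
const flat02 = {pdf: x => (x >= 0 && x < 2 ? 0.5 : 0)};

console.log(normalizeWeights(3, [2, 1]));               // [0.5, 0.25, 0.25]
console.log(mixturePdf([flat01, flat02], [3, 1])(0.5)); // 0.875
```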

vega.**randomKDE**(*values*[, *bandwidth*])

Creates a distribution object representing a kernel density estimate for an array of numerical *values*. This method uses a Gaussian kernel to estimate a smoothed, continuous probability distribution. The optional *bandwidth* parameter determines the width of the Gaussian kernel. If the *bandwidth* is either `0` or unspecified, a default bandwidth value will be automatically estimated based on the input data. KDE distributions do **not** support the `icdf` method: calling `icdf` will result in an error.

Once created, *data* and *bandwidth* values can be accessed or modified using the `data` and `bandwidth` getter/setter methods.
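As a rough illustration of what a Gaussian kernel density estimate computes, the sketch below averages normal densities centered on each data point. The function `kdePdf` is a hypothetical name, and the sketch assumes the bandwidth is supplied explicitly; the automatic bandwidth estimate mentioned above is not reproduced here.

```javascript
// Gaussian kernel density estimate: the average of normal densities
// centered on each data point, with standard deviation = bandwidth.
function kdePdf(values, bandwidth) {
  const n = values.length;
  const norm = 1 / (n * bandwidth * Math.sqrt(2 * Math.PI));
  return x => {
    let sum = 0;
    for (const v of values) {
      const z = (x - v) / bandwidth; // standardized distance to the data point
      sum += Math.exp(-0.5 * z * z);
    }
    return norm * sum;
  };
}

// Density is higher near the cluster around 2 than in the gap near 5.
const density = kdePdf([1, 2, 2, 3, 7], 0.8);
console.log(density(2) > density(5)); // true
```

A larger bandwidth smooths over more of the data, while a smaller one follows individual points more closely.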

## Regression

Two-dimensional regression methods to predict one variable given another.

vega.**regressionLinear**(*data*, *x*, *y*)

Fit a linear regression model with functional form *y = a + b * x* for the input *data* array and corresponding *x* and *y* accessor functions. Returns an object for the fit model parameters with the following properties:

- *coef*: An array of fitted coefficients of the form *[a, b]*.
- *predict*: A function that returns a regression prediction for an input *x* value.
- *rSquared*: The R² coefficient of determination, indicating the amount of total variance of *y* accounted for by the model.
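The shape of the returned object can be illustrated with a minimal ordinary least squares fit. This is a sketch under the assumption that the model is fit by standard least squares; `linearFit` is a hypothetical name and the code is not Vega's implementation.

```javascript
// Ordinary least squares for y = a + b * x, using accessor functions.
function linearFit(data, x, y) {
  const n = data.length;
  let sx = 0, sy = 0;
  for (const d of data) { sx += x(d); sy += y(d); }
  const mx = sx / n, my = sy / n; // means of x and y

  let sxy = 0, sxx = 0;
  for (const d of data) {
    const dx = x(d) - mx;
    sxy += dx * (y(d) - my);
    sxx += dx * dx;
  }
  const b = sxy / sxx;    // slope
  const a = my - b * mx;  // intercept
  const predict = v => a + b * v;

  // R^2 = 1 - (residual sum of squares / total sum of squares)
  let ssRes = 0, ssTot = 0;
  for (const d of data) {
    const r = y(d) - predict(x(d));
    const t = y(d) - my;
    ssRes += r * r;
    ssTot += t * t;
  }
  return {coef: [a, b], predict, rSquared: 1 - ssRes / ssTot};
}

const points = [{u: 0, v: 1}, {u: 1, v: 3}, {u: 2, v: 5}, {u: 3, v: 7}];
const fit = linearFit(points, d => d.u, d => d.v);
console.log(fit.coef);       // [1, 2]
console.log(fit.predict(4)); // 9
console.log(fit.rSquared);   // 1
```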

vega.**regressionLog**(*data*, *x*, *y*)

Fit a logarithmic regression model with functional form *y = a + b * log(x)* for the input *data* array and corresponding *x* and *y* accessor functions. Returns an object for the fit model parameters with the following properties:

- *coef*: An array of fitted coefficients of the form *[a, b]*.
- *predict*: A function that returns a regression prediction for an input *x* value.
- *rSquared*: The R² coefficient of determination, indicating the amount of total variance of *y* accounted for by the model.

vega.**regressionExp**(*data*, *x*, *y*)

Fit an exponential regression model with functional form *y = a * e^(b * x)* for the input *data* array and corresponding *x* and *y* accessor functions. Returns an object for the fit model parameters with the following properties:

- *coef*: An array of fitted coefficients of the form *[a, b]*.
- *predict*: A function that returns a regression prediction for an input *x* value.
- *rSquared*: The R² coefficient of determination, indicating the amount of total variance of *y* accounted for by the model.

vega.**regressionPow**(*data*, *x*, *y*)

Fit a power law regression model with functional form *y = a * x^b* for the input *data* array and corresponding *x* and *y* accessor functions. Returns an object for the fit model parameters with the following properties:

- *coef*: An array of fitted coefficients of the form *[a, b]*.
- *predict*: A function that returns a regression prediction for an input *x* value.
- *rSquared*: The R² coefficient of determination, indicating the amount of total variance of *y* accounted for by the model.

vega.**regressionQuad**(*data*, *x*, *y*)

Fit a quadratic regression model with functional form *y = a + b * x + c * x²* for the input *data* array and corresponding *x* and *y* accessor functions. Returns an object for the fit model parameters with the following properties:

- *coef*: An array of fitted coefficients of the form *[a, b, c]*.
- *predict*: A function that returns a regression prediction for an input *x* value.
- *rSquared*: The R² coefficient of determination, indicating the amount of total variance of *y* accounted for by the model.

vega.**regressionPoly**(*data*, *x*, *y*, *order*)

Fit a polynomial regression model of specified *order* with functional form *y = a + b * x + … + k * x^order* for the input *data* array and corresponding *x* and *y* accessor functions. Returns an object for the fit model parameters with the following properties:

- *coef*: An *(order + 1)*-length array of polynomial coefficients of the form *[a, b, c, d, …]*.
- *predict*: A function that returns a regression prediction for an input *x* value.
- *rSquared*: The R² coefficient of determination, indicating the amount of total variance of *y* accounted for by the model.

vega.**regressionLoess**(*data*, *x*, *y*, *bandwidth*)

Fit a smoothed, non-parametric trend line to the input *data* array and corresponding *x* and *y* accessor functions using *loess* (locally-estimated scatterplot smoothing). Loess performs a sequence of local weighted regressions over a sliding window of nearest-neighbor points. The *bandwidth* argument determines the size of the sliding window, expressed as a [0, 1] fraction of the total number of data points included.

vega.**sampleCurve**(*f*, *extent*[, *minSteps*, *maxSteps*])

Generate sample points from an interpolation function *f* for the provided domain *extent* and return an array of *[x, y]* points. Performs adaptive subdivision to dynamically sample more points in regions of higher curvature. Subdivision stops when the difference in angles between the current samples and a proposed subdivision falls below one-quarter of a degree. The optional *minSteps* argument (default 25) determines the minimum number of initial, uniformly-spaced sample points to draw. The optional *maxSteps* argument (default 200) indicates the maximum resolution at which adaptive sampling will stop, defined relative to a uniform grid of size *maxSteps*. If *minSteps* and *maxSteps* are identical, no adaptive sampling will be performed and only the initial, uniformly-spaced samples will be returned.

## Statistics Routines

Statistical methods for calculating bins, bootstrapped confidence intervals, and quartile boundaries.

vega.**bin**(*options*)

Determine a quantitative binning scheme, for example to create a histogram. Based on the provided options, this method will search over a space of possible bins, aligning step sizes with a given number base and applying constraints such as the maximum number of allowable bins. Given a set of options (see below), returns an object describing the binning scheme, in terms of `start`, `stop` and `step` properties.

The supported options properties are:

- *extent*: (required) A two-element (`[min, max]`) array indicating the range of desired bin values.
- *base*: The number base to use for automatic bin determination (default base `10`).
- *maxbins*: The maximum number of allowable bins (default `20`).
- *step*: An exact step size to use between bins. If provided, the *maxbins* and *steps* options will be ignored.
- *steps*: An array of allowable step sizes to choose from. If provided, the *maxbins* option will be ignored.
- *minstep*: A minimum allowable step size (particularly useful for integer values, default `0`).
- *divide*: An array of scale factors indicating allowable subdivisions. The default value is `[5, 2]`, which indicates that the method may consider dividing bin sizes by 5 and/or 2. For example, for an initial step size of 10, the method can check if bin sizes of 2 (= 10/5), 5 (= 10/2), or 1 (= 10/(5*2)) might also satisfy the given constraints.
- *nice*: Boolean indicating if the start and stop values should be nicely-rounded relative to the step size (default `true`).

```
vega.bin({extent:[0, 1], maxbins:10}); // {start:0, stop:1, step:0.1}
vega.bin({extent:[0, 1], maxbins:5}); // {start:0, stop:1, step:0.2}
vega.bin({extent:[5, 10], maxbins:5}); // {start:5, stop:10, step:1}
```

vega.**bootstrapCI**(*array*, *samples*, *alpha*[, *accessor*])

Calculates a bootstrapped confidence interval for an input *array* of values, based on a given number of *samples* iterations and a target *alpha* value. For example, an *alpha* value of `0.05` corresponds to a 95% confidence interval. An optional *accessor* function can be used to first extract numerical values from an array of input objects, and is equivalent to first calling `array.map(accessor)`. This method ignores null, undefined and NaN values.
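The general bootstrap procedure can be sketched as below. Several choices here are assumptions rather than facts from this reference: the statistic being bootstrapped is taken to be the mean, the interval is read off the sorted resampled means using a simple percentile rule, and `bootstrapMeanCI` and `rng` are hypothetical names. The *accessor* argument is omitted for brevity.

```javascript
// Percentile bootstrap for the mean (an assumed statistic and method).
function bootstrapMeanCI(values, samples, alpha, rng = Math.random) {
  // drop null, undefined and NaN values before resampling
  const data = values.filter(v => v != null && !Number.isNaN(+v)).map(Number);
  const n = data.length;
  const means = new Array(samples);

  for (let i = 0; i < samples; i++) {
    // resample n values with replacement and record their mean
    let sum = 0;
    for (let j = 0; j < n; j++) sum += data[Math.floor(rng() * n)];
    means[i] = sum / n;
  }

  // take the alpha/2 and 1 - alpha/2 positions of the sorted means
  means.sort((p, q) => p - q);
  const lo = means[Math.floor((alpha / 2) * (samples - 1))];
  const hi = means[Math.ceil((1 - alpha / 2) * (samples - 1))];
  return [lo, hi];
}

// An approximate 95% interval for the mean; output varies between runs.
console.log(bootstrapMeanCI([1, 2, 3, 4, 5, null, NaN], 1000, 0.05));
```

Passing a seeded generator as `rng` makes the interval reproducible, which is useful in tests.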

vega.**quartiles**(*array*[, *accessor*])

Given an *array* of numeric values, returns an array of quartile boundaries. The return value is a 3-element array consisting of the first, second (median), and third quartile boundaries. An optional *accessor* function can be used to first extract numerical values from an array of input objects, and is equivalent to first calling `array.map(accessor)`. This method ignores null, undefined and NaN values.
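A minimal sketch of the quartile computation follows. It assumes linear interpolation between the two nearest sorted values, which is one common quantile convention and may not be the exact rule Vega applies; `quantileSorted` and `quartilesOf` are hypothetical names.

```javascript
// Quantile of a sorted array using linear interpolation (an assumed convention).
function quantileSorted(sorted, p) {
  const h = (sorted.length - 1) * p; // fractional index of the quantile
  const i = Math.floor(h);
  const lo = sorted[i];
  const hi = sorted[Math.min(i + 1, sorted.length - 1)];
  return lo + (hi - lo) * (h - i);
}

// Extract values, drop null/undefined/NaN, sort, then read off three quantiles.
function quartilesOf(array, accessor = d => d) {
  const values = array
    .map(accessor)
    .filter(v => v != null && !Number.isNaN(+v))
    .map(Number)
    .sort((a, b) => a - b);
  return [0.25, 0.5, 0.75].map(p => quantileSorted(values, p));
}

console.log(quartilesOf([1, 2, 3, 4, 5, null, NaN])); // [2, 3, 4]
```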