Statistics API
Vega includes statistics functions to model probability distributions and perform other statistical calculations. These methods are bound to the top-level vega object, and can also be used in a stand-alone fashion by using the vega-statistics project.
Statistics API Reference
- Random Number Generation
- Distribution Methods
- Distribution Objects
- Probability Distributions
- Regression
- Statistics Routines
Random Number Generation
# vega.random() <>
Returns a uniform pseudo-random number in the domain [0, 1). By default this is simply a call to JavaScript’s built-in Math.random function. All Vega routines that require random numbers should use this function.
# vega.setRandom(randfunc) <>
Sets the random number generator to the provided function randfunc. Subsequent calls to random will invoke the new function to generate random numbers. Setting a custom generator is helpful for using an alternative source of randomness or for replacing the default generator with a deterministic function during testing.
# vega.randomLCG(seed) <>
Returns a new random number generator with the given random seed. The returned function takes zero arguments and generates random values in the domain [0, 1) using a linear congruential generator (LCG). This method is helpful in conjunction with setRandom to provide seeded random numbers for stable outputs and testing.
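The seeding pattern described above can be sketched without Vega loaded. The snippet below is a minimal, self-contained illustration: makeLCG, currentSource, and the local random and setRandom functions are hypothetical stand-ins, and the LCG constants are the widely used Numerical Recipes parameters, which is an assumption and not necessarily what Vega uses.

```javascript
// Illustrative stand-ins only; these are not Vega's implementations.
// The LCG constants (Numerical Recipes) are an assumption.
function makeLCG(seed) {
  let state = seed >>> 0; // coerce to an unsigned 32-bit integer
  return function () {
    // state = (a * state + c) mod 2^32
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 4294967296; // scale into [0, 1)
  };
}

// A swappable source of randomness, mirroring the random/setRandom pairing.
let currentSource = Math.random;
function random() { return currentSource(); }
function setRandom(randfunc) { currentSource = randfunc; }

// Seeding twice with the same value replays the same sequence.
setRandom(makeLCG(42));
const firstRun = [random(), random(), random()];
setRandom(makeLCG(42));
const secondRun = [random(), random(), random()];
```

Because every consumer calls random rather than Math.random directly, swapping the source in one place is enough to make downstream sampling repeatable.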
Distribution Methods
Methods for sampling and calculating values for probability distributions.
# vega.sampleNormal([mean, stdev]) <>
Returns a sample from a univariate normal (Gaussian) probability distribution with specified mean and standard deviation stdev. If unspecified, the mean defaults to 0 and the standard deviation defaults to 1.
# vega.cumulativeNormal(value[, mean, stdev]) <>
Returns the value of the cumulative distribution function at the given input domain value for a normal distribution with specified mean and standard deviation stdev. If unspecified, the mean defaults to 0 and the standard deviation defaults to 1.
# vega.densityNormal(value[, mean, stdev]) <>
Returns the value of the probability density function at the given input domain value, for a normal distribution with specified mean and standard deviation stdev. If unspecified, the mean defaults to 0 and the standard deviation defaults to 1.
# vega.quantileNormal(probability[, mean, stdev]) <>
Returns the quantile value (the inverse of the cumulative distribution function) for the given input probability, for a normal distribution with specified mean and standard deviation stdev. If unspecified, the mean defaults to 0 and the standard deviation defaults to 1.
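To make the density and cumulative quantities concrete, here is a small stand-alone sketch of the underlying math. The names normalPdf, normalCdf, and erf are hypothetical, and the error function uses the Abramowitz and Stegun 7.1.26 polynomial approximation, which is an assumption about method and not a statement of how Vega computes these values.

```javascript
// Illustrative sketch; not Vega's implementation.
const SQRT_2PI = Math.sqrt(2 * Math.PI);

// Probability density of a normal distribution at `value`.
function normalPdf(value, mean = 0, stdev = 1) {
  const z = (value - mean) / stdev;
  return Math.exp(-0.5 * z * z) / (stdev * SQRT_2PI);
}

// Error function via Abramowitz & Stegun 7.1.26 (absolute error near 1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Cumulative distribution of a normal distribution at `value`.
function normalCdf(value, mean = 0, stdev = 1) {
  return 0.5 * (1 + erf((value - mean) / (stdev * Math.SQRT2)));
}
```

The quantile function is the inverse of normalCdf: it maps a probability back to the domain value at which the cumulative distribution reaches that probability.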
# vega.sampleLogNormal([mean, stdev]) <>
Returns a sample from a univariate log-normal probability distribution with specified log mean and log standard deviation stdev. If unspecified, the log mean defaults to 0 and the log standard deviation defaults to 1.
# vega.cumulativeLogNormal(value[, mean, stdev]) <>
Returns the value of the cumulative distribution function at the given input domain value for a log-normal distribution with specified log mean and log standard deviation stdev. If unspecified, the log mean defaults to 0 and the log standard deviation defaults to 1.
# vega.densityLogNormal(value[, mean, stdev]) <>
Returns the value of the probability density function at the given input domain value, for a log-normal distribution with specified log mean and log standard deviation stdev. If unspecified, the log mean defaults to 0 and the log standard deviation defaults to 1.
# vega.quantileLogNormal(probability[, mean, stdev]) <>
Returns the quantile value (the inverse of the cumulative distribution function) for the given input probability, for a log-normal distribution with specified log mean and log standard deviation stdev. If unspecified, the log mean defaults to 0 and the log standard deviation defaults to 1.
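The "log mean" and "log standard deviation" parameters describe the normal distribution of the logarithm of the variable. The sketch below illustrates that relationship; the function names and the rand parameter are hypothetical, and the Box-Muller transform is an assumed sampling method, not necessarily the one Vega uses.

```javascript
// Illustrative sketch; not Vega's implementation.
// Standard normal sample via the Box-Muller transform.
function sampleStandardNormal(rand = Math.random) {
  const u = 1 - rand(); // shift [0, 1) to (0, 1] so that log(u) is finite
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// A log-normal sample is exp() of a normal sample on the log scale.
function sampleLogNormalSketch(mean = 0, stdev = 1, rand = Math.random) {
  return Math.exp(mean + stdev * sampleStandardNormal(rand));
}

// Density: normal density of log(value), divided by value (change of variables).
function densityLogNormalSketch(value, mean = 0, stdev = 1) {
  if (value <= 0) return 0;
  const z = (Math.log(value) - mean) / stdev;
  return Math.exp(-0.5 * z * z) / (value * stdev * Math.sqrt(2 * Math.PI));
}
```

Note that samples are always positive, and that the median of the distribution is exp(mean) rather than mean itself.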
# vega.sampleUniform([min, max]) <>
Returns a sample from a univariate continuous uniform probability distribution over the interval [min, max). If unspecified, min defaults to 0 and max defaults to 1. If only one argument is provided, it is interpreted as the max value.
# vega.cumulativeUniform(value[, min, max]) <>
Returns the value of the cumulative distribution function at the given input domain value for a uniform distribution over the interval [min, max). If unspecified, min defaults to 0 and max defaults to 1. If only one argument is provided, it is interpreted as the max value.
# vega.densityUniform(value[, min, max]) <>
Returns the value of the probability density function at the given input domain value, for a uniform distribution over the interval [min, max). If unspecified, min defaults to 0 and max defaults to 1. If only one argument is provided, it is interpreted as the max value.
# vega.quantileUniform(probability[, min, max]) <>
Returns the quantile value (the inverse of the cumulative distribution function) for the given input probability, for a uniform distribution over the interval [min, max). If unspecified, min defaults to 0 and max defaults to 1. If only one argument is provided, it is interpreted as the max value.
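The single-argument convention is easy to misread, so the following sketch spells it out together with the closed-form uniform formulas. All names here (resolveRange, uniformPdf, uniformCdf, uniformQuantile) are hypothetical, and the treatment of boundary and out-of-range inputs is an assumption.

```javascript
// Illustrative sketch; not Vega's implementation.
// Resolve (min, max) using the convention that a lone argument is the max.
function resolveRange(min, max) {
  if (max == null) {
    max = min == null ? 1 : min;
    min = 0;
  }
  return [min, max];
}

function uniformPdf(value, min, max) {
  [min, max] = resolveRange(min, max);
  return value >= min && value <= max ? 1 / (max - min) : 0;
}

function uniformCdf(value, min, max) {
  [min, max] = resolveRange(min, max);
  if (value < min) return 0;
  if (value > max) return 1;
  return (value - min) / (max - min);
}

function uniformQuantile(probability, min, max) {
  [min, max] = resolveRange(min, max);
  return probability >= 0 && probability <= 1
    ? min + probability * (max - min)
    : NaN;
}
```

For instance, uniformQuantile(0.5, 10) treats 10 as the max and evaluates the midpoint of [0, 10).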
Distribution Objects
Methods for sampling and calculating probability distributions. Each method takes a set of distributional parameters and returns a distribution object representing a random variable.
Distribution objects expose the following methods:
- dist.sample(): Samples a random value drawn from this distribution.
- dist.pdf(value): Calculates the value of the probability density function at the given input domain value.
- dist.cdf(value): Calculates the value of the cumulative distribution function at the given input domain value.
- dist.icdf(probability): Calculates the inverse of the cumulative distribution function for the given input probability.
# vega.randomNormal([mean, stdev]) <>
Creates a distribution object representing a normal (Gaussian) probability distribution with specified mean and standard deviation stdev. If unspecified, the mean defaults to 0 and the standard deviation defaults to 1.
Once created, mean and stdev values can be accessed or modified using the mean and stdev getter/setter methods.
# vega.randomLogNormal([mean, stdev]) <>
Creates a distribution object representing a log-normal probability distribution with specified log mean and log standard deviation stdev. If unspecified, the log mean defaults to 0 and the log standard deviation defaults to 1.
Once created, mean and stdev values can be accessed or modified using the mean and stdev getter/setter methods.
# vega.randomUniform([min, max]) <>
Creates a distribution object representing a continuous uniform probability distribution over the interval [min, max). If unspecified, min defaults to 0 and max defaults to 1. If only one argument is provided, it is interpreted as the max value.
Once created, min and max values can be accessed or modified using the min and max getter/setter methods.
# vega.randomInteger([min,] max) <>
Creates a distribution object representing a discrete uniform probability distribution over the integer domain [min, max). If only one argument is provided, it is interpreted as the max value. If unspecified, min defaults to 0.
Once created, min and max values can be accessed or modified using the min and max getter/setter methods.
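As a rough picture of what a distribution object provides, the sketch below builds a discrete uniform distribution over [min, max) with sample, pdf, and cdf methods. The integerDistribution name and the rand parameter are hypothetical, icdf is omitted, and none of this is taken from Vega's source.

```javascript
// Illustrative sketch; not Vega's implementation.
// `rand` is a hypothetical injection point for the random source.
function integerDistribution(min, max, rand = Math.random) {
  if (max == null) { max = min; min = 0; } // a lone argument is the max
  const count = max - min; // number of integers in [min, max)
  return {
    // Each integer in [min, max) is drawn with equal probability.
    sample: () => min + Math.floor(count * rand()),
    // Probability mass is 1 / count on integers in range, 0 elsewhere.
    pdf: (value) =>
      Number.isInteger(value) && value >= min && value < max ? 1 / count : 0,
    // Fraction of the integers in range that are <= value.
    cdf: (value) => {
      const v = Math.floor(value);
      if (v < min) return 0;
      if (v >= max) return 1;
      return (v - min + 1) / count;
    },
  };
}
```

Since the upper bound is exclusive, integerDistribution(2, 6) can produce 2, 3, 4, or 5, but never 6.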
# vega.randomMixture(distributions[, weights]) <>
Creates a distribution object representing a (weighted) mixture of probability distributions. The distributions argument should be an array of distribution objects. The optional weights array provides proportional numerical weights for each distribution. If provided, the values in the weights array will be normalized to ensure that weights sum to 1. Any unspecified weight values default to 1 (prior to normalization). Mixture distributions do not support the icdf method: calling icdf will result in an error.
Once created, the distributions and weights arrays can be accessed or modified using the distributions and weights getter/setter methods.
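The weight handling described above can be sketched as follows. This is a hypothetical mixture function written for illustration: it assumes each component exposes sample, pdf, and cdf, takes an assumed rand parameter for the random source, and is not Vega's implementation.

```javascript
// Illustrative sketch; not Vega's implementation.
function mixture(distributions, weights = [], rand = Math.random) {
  // Missing weights default to 1, then all weights are normalized to sum to 1.
  const raw = distributions.map((_, i) => (weights[i] == null ? 1 : +weights[i]));
  const total = raw.reduce((a, b) => a + b, 0);
  const w = raw.map((x) => x / total);
  return {
    weights: w,
    // Pick a component with probability equal to its weight, then sample it.
    sample() {
      let r = rand();
      for (let i = 0; i < w.length; ++i) {
        r -= w[i];
        if (r < 0) return distributions[i].sample();
      }
      return distributions[w.length - 1].sample(); // guard against round-off
    },
    // Mixture density and CDF are weighted sums of the component values.
    pdf: (value) => distributions.reduce((s, d, i) => s + w[i] * d.pdf(value), 0),
    cdf: (value) => distributions.reduce((s, d, i) => s + w[i] * d.cdf(value), 0),
  };
}
```

The weighted-sum form also suggests why an inverse CDF is awkward for mixtures: there is generally no closed-form inverse of a sum of CDFs.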
# vega.randomKDE(values[, bandwidth]) <>
Creates a distribution object representing a kernel density estimate for an array of numerical values. This method uses a Gaussian kernel to estimate a smoothed, continuous probability distribution. The optional bandwidth parameter determines the width of the Gaussian kernel. If the bandwidth is either 0 or unspecified, a default bandwidth value will be automatically estimated based on the input data. KDE distributions do not support the icdf method: calling icdf will result in an error.
Once created, data and bandwidth values can be accessed or modified using the data and bandwidth getter/setter methods.
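For intuition, a Gaussian kernel density estimate is the average of Gaussian bumps centered on each data value. The sketch below evaluates that density directly; gaussianKdePdf is a hypothetical name, the bandwidth must be supplied explicitly here (no automatic estimation), and this is not Vega's implementation.

```javascript
// Illustrative sketch; not Vega's implementation.
function gaussianKdePdf(values, bandwidth) {
  const n = values.length;
  const norm = 1 / (n * bandwidth * Math.sqrt(2 * Math.PI));
  // Average of Gaussian bumps of width `bandwidth`, one centered on each datum.
  return function pdf(x) {
    let sum = 0;
    for (const v of values) {
      const z = (x - v) / bandwidth;
      sum += Math.exp(-0.5 * z * z);
    }
    return sum * norm;
  };
}
```

A larger bandwidth widens each bump and yields a smoother estimate, while a smaller bandwidth follows the individual data values more closely.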
Regression
Two-dimensional regression methods to predict one variable given another.
# vega.regressionLinear(data, x, y) <>
Fit a linear regression model with functional form y = a + b * x for the input data array and corresponding x and y accessor functions. Returns an object for the fit model parameters with the following properties:
- coef: An array of fitted coefficients of the form [a, b].
- predict: A function that returns a regression prediction for an input x value.
- rSquared: The R^2 coefficient of determination, indicating the amount of total variance of y accounted for by the model.
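The shape of the returned object (coef, predict, rSquared) can be illustrated with an ordinary least squares fit written from scratch. The linearFit function and the sample points below are hypothetical and only mirror the documented return structure; they are not Vega's implementation.

```javascript
// Illustrative sketch; not Vega's implementation.
function linearFit(data, x, y) {
  const n = data.length;
  let sx = 0, sy = 0;
  for (const d of data) { sx += x(d); sy += y(d); }
  const mx = sx / n, my = sy / n;

  // Slope = covariance(x, y) / variance(x); intercept follows from the means.
  let sxy = 0, sxx = 0;
  for (const d of data) {
    const dx = x(d) - mx;
    sxy += dx * (y(d) - my);
    sxx += dx * dx;
  }
  const b = sxy / sxx;
  const a = my - b * mx;
  const predict = (v) => a + b * v;

  // R^2 = 1 - (residual sum of squares / total sum of squares).
  let ssRes = 0, ssTot = 0;
  for (const d of data) {
    ssRes += (y(d) - predict(x(d))) ** 2;
    ssTot += (y(d) - my) ** 2;
  }
  return { coef: [a, b], predict, rSquared: 1 - ssRes / ssTot };
}

// Hypothetical data lying exactly on y = 1 + 2 * x.
const points = [{ u: 0, v: 1 }, { u: 1, v: 3 }, { u: 2, v: 5 }, { u: 3, v: 7 }];
const fit = linearFit(points, (d) => d.u, (d) => d.v);
```

The x and y arguments are accessor functions, so the same data array can be fit against different fields without reshaping it.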
# vega.regressionLog(data, x, y) <>
Fit a logarithmic regression model with functional form y = a + b * log(x) for the input data array and corresponding x and y accessor functions. Returns an object for the fit model parameters with the following properties:
- coef: An array of fitted coefficients of the form [a, b].
- predict: A function that returns a regression prediction for an input x value.
- rSquared: The R^2 coefficient of determination, indicating the amount of total variance of y accounted for by the model.
# vega.regressionExp(data, x, y) <>
Fit an exponential regression model with functional form y = a * e^(b * x) for the input data array and corresponding x and y accessor functions. Returns an object for the fit model parameters with the following properties:
- coef: An array of fitted coefficients of the form [a, b].
- predict: A function that returns a regression prediction for an input x value.
- rSquared: The R^2 coefficient of determination, indicating the amount of total variance of y accounted for by the model.
# vega.regressionPow(data, x, y) <>
Fit a power law regression model with functional form y = a * x^b for the input data array and corresponding x and y accessor functions. Returns an object for the fit model parameters with the following properties:
- coef: An array of fitted coefficients of the form [a, b].
- predict: A function that returns a regression prediction for an input x value.
- rSquared: The R^2 coefficient of determination, indicating the amount of total variance of y accounted for by the model.
# vega.regressionQuad(data, x, y) <>
Fit a quadratic regression model with functional form y = a + b * x + c * x^2 for the input data array and corresponding x and y accessor functions. Returns an object for the fit model parameters with the following properties:
- coef: An array of fitted coefficients of the form [a, b, c].
- predict: A function that returns a regression prediction for an input x value.
- rSquared: The R^2 coefficient of determination, indicating the amount of total variance of y accounted for by the model.
# vega.regressionPoly(data, x, y, order) <>
Fit a polynomial regression model of specified order with functional form y = a + b * x + … + k * x^order for the input data array and corresponding x and y accessor functions. Returns an object for the fit model parameters with the following properties:
- coef: An (order + 1)-length array of polynomial coefficients of the form [a, b, c, d, …].
- predict: A function that returns a regression prediction for an input x value.
- rSquared: The R^2 coefficient of determination, indicating the amount of total variance of y accounted for by the model.
# vega.regressionLoess(data, x, y, bandwidth) <>
Fit a smoothed, non-parametric trend line for the input data array and corresponding x and y accessor functions using loess (locally-estimated scatterplot smoothing). Loess performs a sequence of local weighted regressions over a sliding window of nearest-neighbor points. The bandwidth argument determines the size of the sliding window, expressed as a [0, 1] fraction of the total number of data points included.
# vega.sampleCurve(f, extent[, minSteps, maxSteps]) <>
Generate sample points from an interpolation function f for the provided domain extent and return an array of [x, y] points. Performs adaptive subdivision to dynamically sample more points in regions of higher curvature. Subdivision stops when the difference in angles between the current samples and a proposed subdivision falls below one-quarter of a degree. The optional minSteps argument (default 25) determines the minimal number of initial, uniformly-spaced sample points to draw. The optional maxSteps argument (default 200) indicates the maximum resolution at which adaptive sampling will stop, defined relative to a uniform grid of size maxSteps. If minSteps and maxSteps are identical, no adaptive sampling will be performed and only the initial, uniformly-spaced samples will be returned.
Statistics Routines
Statistical methods for bandwidth estimation, bin calculation, bootstrapped confidence intervals, and quartile boundaries.
# vega.bandwidthNRD(array[, accessor]) <>
Given an array of numeric values, estimates a bandwidth value for use in Gaussian kernel density estimation, assuming a normal reference distribution. The underlying formula (from Scott 1992) is 1.06 * min(s, IQR / 1.34) * n^(-1/5), where s is the standard deviation, IQR is the interquartile range, and n is the sample size, along with special case handling in case of zero values for the interquartile range or deviation. An optional accessor function can be used to first extract numerical values from an array of input objects, and is equivalent to first calling array.map(accessor).
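The normal reference rule above translates into only a few lines of code. The sketch below is a hypothetical bandwidthSketch function: the quantile interpolation rule, the use of the sample (n - 1) standard deviation, and the exact order of the zero-value fallbacks are assumptions, not details confirmed by this document.

```javascript
// Illustrative sketch; not Vega's implementation.
// Quantile by linear interpolation between order statistics (an assumption).
function quantileSorted(sorted, p) {
  const i = (sorted.length - 1) * p;
  const lo = Math.floor(i);
  const hi = Math.min(lo + 1, sorted.length - 1);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (i - lo);
}

function bandwidthSketch(array, accessor = (d) => d) {
  const values = array.map(accessor).sort((a, b) => a - b);
  const n = values.length;
  const mean = values.reduce((s, v) => s + v, 0) / n;
  // Sample standard deviation (n - 1 denominator; an assumption).
  const sd = Math.sqrt(values.reduce((s, v) => s + (v - mean) ** 2, 0) / (n - 1));
  const scaledIqr =
    (quantileSorted(values, 0.75) - quantileSorted(values, 0.25)) / 1.34;
  // Fall back to whichever spread estimate is nonzero; the exact fallback
  // order here is an assumption.
  const spread = Math.min(sd, scaledIqr) || sd || scaledIqr || 1;
  return 1.06 * spread * Math.pow(n, -0.2);
}
```

For the values 1 through 5, the standard deviation is about 1.58 and IQR / 1.34 is about 1.49, so the scaled interquartile range is the smaller spread estimate and determines the bandwidth.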
# vega.bin(options) <>
Determine a quantitative binning scheme, for example to create a histogram. Based on the provided options, this method searches over a space of possible bins, aligning step sizes with a given number base and applying constraints such as the maximum number of allowable bins. Given a set of options (see below), returns an object describing the binning scheme in terms of start, stop, and step properties.
The supported options properties are:
- extent: (required) A two-element ([min, max]) array indicating the range over which the bin values are defined.
- base: The number base to use for automatic bin determination (default base 10).
- maxbins: The maximum number of allowable bins (default 20). There will often be fewer bins as the domain is sliced at “nicely” rounded values.
- span: The value span over which to generate bin boundaries. The default is extent[1] - extent[0]. This parameter allows automatic step size determination over custom spans (for example, a zoomed-in region) while retaining the overall extent.
- step: An exact step size to use between bins. If provided, the maxbins, span, and steps options will be ignored.
- steps: An array of allowable step sizes to choose from. If provided, the maxbins option will be ignored.
- minstep: A minimum allowable step size (particularly useful for integer values, default 0).
- divide: An array of scale factors indicating allowable subdivisions. The default value is [5, 2], which indicates that the method may consider dividing bin sizes by 5 and/or 2. For example, for an initial step size of 10, the method can check if bin sizes of 2 (= 10/5), 5 (= 10/2), or 1 (= 10/(5*2)) might also satisfy the given constraints.
- nice: Boolean indicating if the start and stop values should be nicely-rounded relative to the step size (default true).
vega.bin({extent:[0, 1], maxbins:10}); // {start:0, stop:1, step:0.1}
vega.bin({extent:[0, 10], maxbins:5}); // {start:0, stop:10, step:2}
vega.bin({extent:[5, 10], maxbins:5}); // {start:5, stop:10, step:1}
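To show how a base-aligned step search with subdivisions might proceed, here is a simplified, hypothetical binSketch function. It handles only the extent, base, maxbins, minstep, and divide options, always rounds start and stop to the step, and is an assumption-laden illustration, not Vega's implementation.

```javascript
// Illustrative sketch; not Vega's implementation.
// Supports only extent, base, maxbins, minstep, and divide.
function binSketch({ extent, base = 10, maxbins = 20, minstep = 0, divide = [5, 2] }) {
  const [min, max] = extent;
  const span = max - min || 1;
  const logb = Math.log(base);

  // Start from a power of the base that is fine enough for maxbins.
  const level = Math.ceil(Math.log(maxbins) / logb);
  let step = Math.max(
    minstep,
    Math.pow(base, Math.round(Math.log(span) / logb) - level)
  );

  // Coarsen by the base until the bin count fits within maxbins.
  while (Math.ceil(span / step) > maxbins) step *= base;

  // Try successive subdivisions, keeping each one that still fits.
  for (const d of divide) {
    const v = step / d;
    if (v >= minstep && span / v <= maxbins) step = v;
  }

  // Snap start and stop outward to multiples of the step.
  return {
    start: Math.floor(min / step) * step,
    stop: Math.ceil(max / step) * step,
    step,
  };
}

const coarse = binSketch({ extent: [0, 10], maxbins: 5 });
const shifted = binSketch({ extent: [5, 10], maxbins: 5 });
```

In the first call the search coarsens the step from 1 to 10, then accepts the division by 5 (step 2, five bins) and rejects the further division by 2 (step 1, ten bins).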
# vega.bootstrapCI(array, samples, alpha[, accessor]) <>
Calculates a bootstrapped confidence interval for an input array of values, based on a given number of samples iterations and a target alpha value. For example, an alpha value of 0.05 corresponds to a 95% confidence interval. An optional accessor function can be used to first extract numerical values from an array of input objects, and is equivalent to first calling array.map(accessor). This method ignores null, undefined and NaN values.
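The general bootstrap procedure can be sketched as follows. This hypothetical bootstrapMeanCI function assumes the statistic of interest is the mean and uses a simple percentile interval; both choices, along with the rand parameter, are assumptions made for illustration and not a description of Vega's implementation.

```javascript
// Illustrative sketch; not Vega's implementation.
// `rand` is a hypothetical injection point for the random source.
function bootstrapMeanCI(array, samples, alpha, rand = Math.random) {
  // Drop null, undefined, and NaN entries before resampling.
  const values = array
    .filter((v) => v != null && !Number.isNaN(+v))
    .map(Number);
  const n = values.length;
  const means = new Array(samples);
  for (let j = 0; j < samples; ++j) {
    // Resample n values with replacement and record the resample mean.
    let sum = 0;
    for (let i = 0; i < n; ++i) sum += values[Math.floor(rand() * n)];
    means[j] = sum / n;
  }
  means.sort((a, b) => a - b);
  // Percentile interval: cut alpha / 2 from each tail of the sorted means.
  const lo = means[Math.floor((alpha / 2) * (samples - 1))];
  const hi = means[Math.ceil((1 - alpha / 2) * (samples - 1))];
  return [lo, hi];
}
```

Passing a seeded generator as rand makes the interval reproducible across runs, which is useful in tests.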
# vega.dotbin(sortedArray, step[, smooth, accessor]) <>
Calculates dot plot bin locations for an input sortedArray of numerical values, and returns an array of bin locations with indices matching the input sortedArray. This method implements the “dot density” algorithm of Wilkinson, 1999. The step parameter determines the bin width: points within step values of an anchor point will be assigned the same bin location. The optional smooth parameter is a boolean value indicating if the bin locations should additionally be smoothed to reduce variance. An optional accessor function can be used to first extract numerical values from an array of input objects, and is equivalent to first calling array.map(accessor). Any null, undefined, or NaN values should be removed prior to calling this method.
# vega.quantiles(array, p[, accessor]) <>
Given an array of numeric values and an array p of probability thresholds in the range [0, 1], returns an array of p-quantiles. The return value is an array the same length as the input p. An optional accessor function can be used to first extract numerical values from an array of input objects, and is equivalent to first calling array.map(accessor). This method ignores null, undefined and NaN values.
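A quantile lookup of this kind can be sketched by sorting the valid values and interpolating at a fractional rank. The quantilesSketch function below is hypothetical, and its linear interpolation rule is an assumption; other quantile definitions give slightly different results on small samples.

```javascript
// Illustrative sketch; not Vega's implementation.
function quantilesSketch(array, p, accessor = (d) => d) {
  // Extract numbers, drop null/undefined/NaN, and sort ascending.
  const sorted = array
    .map(accessor)
    .filter((v) => v != null && !Number.isNaN(+v))
    .map(Number)
    .sort((a, b) => a - b);
  const last = sorted.length - 1;
  return p.map((prob) => {
    // Fractional rank, then linear interpolation between neighbors
    // (the interpolation rule is an assumption).
    const rank = last * prob;
    const lo = Math.floor(rank);
    const hi = Math.min(lo + 1, last);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
  });
}

// Quartile boundaries are the quantiles at p = 0.25, 0.5, and 0.75.
const quartileValues = quantilesSketch([5, 1, null, 4, 2, NaN, 3], [0.25, 0.5, 0.75]);
```

Under this interpolation rule, the valid values sort to 1 through 5 and the three quartile boundaries fall exactly on 2, 3, and 4.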
# vega.quartiles(array[, accessor]) <>
Given an array of numeric values, returns an array of quartile boundaries. The return value is a 3-element array consisting of the first, second (median), and third quartile boundaries. An optional accessor function can be used to first extract numerical values from an array of input objects, and is equivalent to first calling array.map(accessor). This method ignores null, undefined and NaN values.