Class EmpiricalDistribution
- java.lang.Object
-
- org.apache.commons.math4.distribution.AbstractRealDistribution
-
- org.apache.commons.math4.distribution.EmpiricalDistribution
-
- All Implemented Interfaces:
java.io.Serializable, RealDistribution, ContinuousDistribution
public class EmpiricalDistribution extends AbstractRealDistribution implements ContinuousDistribution
Represents an empirical probability distribution -- a probability distribution derived from observed data without making any assumptions about the functional form of the population distribution that the data come from.
An EmpiricalDistribution maintains data structures, called distribution digests, that describe empirical distributions and support the following operations:
- loading the distribution from a file of observed data values
- dividing the input data into "bin ranges" and reporting bin frequency counts (data for histogram)
- reporting univariate statistics describing the full set of data values as well as the observations within each bin
- generating random values from the distribution
Applications can use EmpiricalDistribution to build grouped frequency histograms representing the input data or to generate random values "like" those in the input file -- i.e., the values generated will follow the distribution of the values in the file.
The implementation uses what amounts to the Variable Kernel Method with Gaussian smoothing:
Digesting the input file
- Pass the file once to compute min and max.
- Divide the range from min to max into binCount "bins."
- Pass the data file again, computing bin counts and univariate statistics (mean, std dev.) for each of the bins.
- Divide the interval (0,1) into subintervals associated with the bins, with the length of a bin's subinterval proportional to its count.
Generating random values from the distribution
- Generate a uniformly distributed value in (0,1).
- Select the subinterval to which the value belongs.
- Generate a random Gaussian value with mean = mean of the associated bin and std dev = std dev of the associated bin.
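The digest and generation steps above can be sketched in plain Java. This is a minimal illustration, not the library's implementation: the class name `EmpiricalSketch` and its fields are hypothetical, the bins are assumed to be of equal width, and the per-bin statistics are computed from running sums rather than with `SummaryStatistics`.

```java
import java.util.Random;

/** Illustrative two-pass digest and sampler; all names here are hypothetical. */
final class EmpiricalSketch {
    final double min, max;
    final long[] count;        // observations per bin
    final double[] mean, std;  // per-bin mean and sample standard deviation
    final double[] genUpper;   // cumulative bin probabilities partitioning (0,1)

    EmpiricalSketch(double[] data, int binCount) {
        // Pass 1: compute min and max.
        double lo = Double.POSITIVE_INFINITY, hi = Double.NEGATIVE_INFINITY;
        for (double v : data) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
        min = lo;
        max = hi;
        double width = (max - min) / binCount;  // assumption: equal-width bins

        // Pass 2: per-bin counts, sums and sums of squares.
        count = new long[binCount];
        double[] sum = new double[binCount], sumSq = new double[binCount];
        for (double v : data) {
            int b = binIndex(v, width, binCount);
            count[b]++;
            sum[b] += v;
            sumSq[b] += v * v;
        }

        // Per-bin statistics, plus subintervals of (0,1) proportional to counts.
        mean = new double[binCount];
        std = new double[binCount];
        genUpper = new double[binCount];
        long seen = 0;
        for (int b = 0; b < binCount; b++) {
            if (count[b] > 0) mean[b] = sum[b] / count[b];
            if (count[b] > 1) {
                double var = (sumSq[b] - count[b] * mean[b] * mean[b]) / (count[b] - 1);
                std[b] = Math.sqrt(Math.max(0.0, var));  // guard against round-off
            }
            seen += count[b];
            genUpper[b] = (double) seen / data.length;
        }
    }

    /** Bins are [min, u0], (u0, u1], ..., so a value on a boundary goes to the lower bin. */
    private int binIndex(double v, double width, int binCount) {
        if (width == 0) return 0;  // all observations identical
        int b = (int) Math.ceil((v - min) / width) - 1;
        return Math.min(Math.max(b, 0), binCount - 1);
    }

    /** Uniform draw, subinterval lookup, then a Gaussian draw from the selected bin. */
    double sample(Random rng) {
        double u = rng.nextDouble();
        int b = 0;
        while (b < genUpper.length - 1 && u >= genUpper[b]) b++;
        return mean[b] + std[b] * rng.nextGaussian();
    }
}
```

Because an empty bin has a zero-length subinterval, the lookup in `sample` never selects it. A bin with a single observation has a standard deviation of zero in this sketch, so it always reproduces that observation.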
EmpiricalDistribution implements the RealDistribution interface as follows. Given x within the range of values in the dataset, let B be the bin containing x and let K be the within-bin kernel for B. Let P(B-) be the sum of the probabilities of the bins below B, let P(B) be the probability of B, and let K(B) be the mass of B under K (i.e., the integral of the kernel density over B). Then set P(X < x) = P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the kernel distribution evaluated at x and K(B-) is the kernel distribution evaluated at the lower endpoint of B. This results in a cdf that matches the grouped frequency distribution at the bin endpoints and interpolates within bins using within-bin kernels.
USAGE NOTES:
- The binCount is set by default to 1000. A good rule of thumb is to set the bin count to approximately the length of the input file divided by 10.
- The input file must be a plain text file containing one valid numeric entry per line.
- See Also:
- Serialized Form
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.commons.statistics.distribution.ContinuousDistribution
ContinuousDistribution.Sampler
-
-
Field Summary
Fields (modifier and type, field, description):
- static int DEFAULT_BIN_COUNT: Default bin count.
Fields inherited from class org.apache.commons.math4.distribution.AbstractRealDistribution
SOLVER_DEFAULT_ABSOLUTE_ACCURACY
-
-
Constructor Summary
Constructors (constructor, description):
- EmpiricalDistribution(): Creates a new EmpiricalDistribution with the default bin count.
- EmpiricalDistribution(int binCount): Creates a new EmpiricalDistribution with the specified bin count.
-
Method Summary
Methods (modifier and type, method, description):
- ContinuousDistribution.Sampler createSampler(UniformRandomProvider rng): Creates a sampler.
- double cumulativeProbability(double x): For a random variable X whose values are distributed according to this distribution, this method returns P(X <= x).
- double density(double x): Returns the probability density function (PDF) of this distribution evaluated at the specified point x.
- int getBinCount(): Returns the number of bins.
- java.util.List<SummaryStatistics> getBinStats(): Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins.
- double[] getGeneratorUpperBounds(): Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution.
- protected ContinuousDistribution getKernel(SummaryStatistics bStats): The within-bin smoothing kernel.
- double getMean(): Gets the mean of this distribution.
- StatisticalSummary getSampleStats(): Returns a StatisticalSummary describing this distribution.
- double getSupportLowerBound(): Gets the lower bound of the support.
- double getSupportUpperBound(): Gets the upper bound of the support.
- double[] getUpperBounds(): Returns a fresh copy of the array of upper bounds for the bins.
- double getVariance(): Gets the variance of this distribution.
- double inverseCumulativeProbability(double p): Computes the quantile function of this distribution.
- boolean isLoaded(): Property indicating whether or not the distribution has been loaded.
- boolean isSupportConnected(): Indicates whether the support is connected.
- void load(double[] in): Computes the empirical distribution from the provided array of numbers.
- void load(java.io.File file): Computes the empirical distribution from the input file.
- void load(java.net.URL url): Computes the empirical distribution using data read from a URL.
- double probability(double x): For a random variable X whose values are distributed according to this distribution, this method returns P(X = x).
Methods inherited from class org.apache.commons.math4.distribution.AbstractRealDistribution
getSolverAbsoluteAccuracy, logDensity, probability, sample
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.commons.statistics.distribution.ContinuousDistribution
logDensity, probability
-
-
-
-
Field Detail
-
DEFAULT_BIN_COUNT
public static final int DEFAULT_BIN_COUNT
Default bin count
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
EmpiricalDistribution
public EmpiricalDistribution()
Creates a new EmpiricalDistribution with the default bin count.
-
EmpiricalDistribution
public EmpiricalDistribution(int binCount)
Creates a new EmpiricalDistribution with the specified bin count.
- Parameters:
binCount - number of bins. Must be strictly positive.
- Throws:
NotStrictlyPositiveException - if binCount <= 0.
-
-
Method Detail
-
load
public void load(double[] in) throws NullArgumentException
Computes the empirical distribution from the provided array of numbers.
- Parameters:
in - the input data array
- Throws:
NullArgumentException - if in is null
-
load
public void load(java.net.URL url) throws java.io.IOException, NullArgumentException, ZeroException
Computes the empirical distribution using data read from a URL.
The input file must be an ASCII text file containing one valid numeric entry per line.
- Parameters:
url - url of the input file
- Throws:
java.io.IOException - if an IO error occurs
NullArgumentException - if url is null
ZeroException - if URL contains no data
-
load
public void load(java.io.File file) throws java.io.IOException, NullArgumentException
Computes the empirical distribution from the input file.
The input file must be an ASCII text file containing one valid numeric entry per line.
- Parameters:
file - the input file
- Throws:
java.io.IOException - if an IO error occurs
NullArgumentException - if file is null
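To illustrate the expected input format, the sketch below parses text with one numeric entry per line into a double array. It is only an assumption about how such input might be read: `NumericLines.read` is a hypothetical helper, not part of the library, and it rejects blank or malformed lines with a `NumberFormatException`, which may differ from the library's behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reads one double per line. */
final class NumericLines {
    /** Parses every line as a double; an invalid entry raises NumberFormatException. */
    static double[] read(Reader source) throws IOException {
        List<Double> values = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                values.add(Double.parseDouble(line.trim()));
            }
        }
        // Unbox into a primitive array.
        double[] out = new double[values.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = values.get(i);
        }
        return out;
    }
}
```

An array produced this way could then be passed to load(double[] in) instead of loading from a file.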
-
getSampleStats
public StatisticalSummary getSampleStats()
Returns a StatisticalSummary describing this distribution.
Preconditions:
- the distribution must be loaded before invoking this method
- Returns:
- the sample statistics
- Throws:
java.lang.IllegalStateException - if the distribution has not been loaded
-
getBinCount
public int getBinCount()
Returns the number of bins.
- Returns:
- the number of bins.
-
getBinStats
public java.util.List<SummaryStatistics> getBinStats()
Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins. The list is indexed on the bin number.
- Returns:
- List of bin statistics.
-
getUpperBounds
public double[] getUpperBounds()
Returns a fresh copy of the array of upper bounds for the bins. Bins are:
[min, upperBounds[0]], (upperBounds[0], upperBounds[1]], ..., (upperBounds[binCount-2], upperBounds[binCount-1] = max].
Note: In versions 1.0-2.0 of commons-math, this method incorrectly returned the array of probability generator upper bounds now returned by getGeneratorUpperBounds().
- Returns:
- array of bin upper bounds
- Since:
- 2.1
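The bin layout described for this method can be illustrated with a short sketch. It assumes equal-width bins, which the text does not state explicitly, and the names `BinBounds`, `upperBounds` and `binOf` are hypothetical.

```java
/** Hypothetical helpers illustrating the closed/half-open bin layout. */
final class BinBounds {
    /** Upper bounds of binCount equal-width bins over [min, max]; the last bound is max. */
    static double[] upperBounds(double min, double max, int binCount) {
        double width = (max - min) / binCount;
        double[] upper = new double[binCount];
        for (int i = 0; i < binCount - 1; i++) {
            upper[i] = min + width * (i + 1);
        }
        upper[binCount - 1] = max;  // set exactly, avoiding round-off in the last bound
        return upper;
    }

    /** Index of the bin containing x: the first i with x <= upper[i]. */
    static int binOf(double x, double[] upper) {
        for (int i = 0; i < upper.length; i++) {
            if (x <= upper[i]) {
                return i;
            }
        }
        return upper.length - 1;  // values above max are clamped to the last bin
    }
}
```

With min = 0, max = 10 and four bins, the bounds are 2.5, 5, 7.5 and 10. The value 2.5 falls in bin 0 because each bin includes its upper bound.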
-
getGeneratorUpperBounds
public double[] getGeneratorUpperBounds()
Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution. Subintervals correspond to bins with lengths proportional to bin counts.
Preconditions:
- the distribution must be loaded before invoking this method
In versions 1.0-2.0 of commons-math, this array was (incorrectly) returned by getUpperBounds().
- Returns:
- array of upper bounds of subintervals used in data generation
- Throws:
java.lang.NullPointerException - unless a load method has been called beforehand.
- Since:
- 2.1
-
isLoaded
public boolean isLoaded()
Property indicating whether or not the distribution has been loaded.
- Returns:
- true if the distribution has been loaded
-
probability
public double probability(double x)
For a random variable X whose values are distributed according to this distribution, this method returns P(X = x). In other words, this method represents the probability mass function (PMF) for the distribution.
- Specified by:
probability in interface ContinuousDistribution
- Overrides:
probability in class AbstractRealDistribution
- Parameters:
x - Point at which the PMF is evaluated.
- Returns:
- zero.
- Since:
- 3.1
-
density
public double density(double x)
Returns the probability density function (PDF) of this distribution evaluated at the specified point x. In general, the PDF is the derivative of the CDF. If the derivative does not exist at x, then an appropriate replacement should be returned, e.g. Double.POSITIVE_INFINITY, Double.NaN, or the limit inferior or limit superior of the difference quotient.
Returns the kernel density normalized so that its integral over each bin equals the bin mass.
Algorithm description:
- Find the bin B that x belongs to.
- Compute K(B) = the mass of B with respect to the within-bin kernel (i.e., the integral of the kernel density over B).
- Return k(x) * P(B) / K(B), where k is the within-bin kernel density and P(B) is the mass of B.
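The normalization in the last two steps can be sketched for a single bin whose endpoints and mass are already known. This is an illustration under stated assumptions: `DensitySketch` and its methods are hypothetical, and a logistic kernel stands in for the Gaussian kernel because its CDF has a closed form, whereas java.lang.Math provides no error function.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the within-bin density normalization. */
final class DensitySketch {
    /**
     * Returns k(x) * P(B) / K(B) for a bin with endpoints lower and upper,
     * where binMass is P(B) and K(B) is the kernel mass over the bin.
     */
    static double density(double x, double lower, double upper, double binMass,
                          DoubleUnaryOperator kernelPdf, DoubleUnaryOperator kernelCdf) {
        double kernelMass = kernelCdf.applyAsDouble(upper) - kernelCdf.applyAsDouble(lower);
        return kernelPdf.applyAsDouble(x) * binMass / kernelMass;
    }

    /** Logistic CDF with location mu and scale s (stand-in kernel, not the library's). */
    static DoubleUnaryOperator logisticCdf(double mu, double s) {
        return x -> 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /** Logistic PDF matching logisticCdf. */
    static DoubleUnaryOperator logisticPdf(double mu, double s) {
        return x -> {
            double e = Math.exp(-(x - mu) / s);
            return e / (s * (1.0 + e) * (1.0 + e));
        };
    }
}
```

Dividing by K(B) is what makes the density integrate to P(B) over the bin, even though part of the kernel's own mass lies outside the bin.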
- Specified by:
density in interface ContinuousDistribution
- Parameters:
x - Point at which the PDF is evaluated.
- Returns:
- the value of the probability density function at x.
- Since:
- 3.1
-
cumulativeProbability
public double cumulativeProbability(double x)
For a random variable X whose values are distributed according to this distribution, this method returns P(X <= x). In other words, this method represents the (cumulative) distribution function (CDF) for this distribution.
Algorithm description:
- Find the bin B that x belongs to.
- Compute P(B) = the mass of B and P(B-) = the combined mass of the bins below B.
- Compute K(B) = the probability mass of B with respect to the within-bin kernel and K(B-) = the kernel distribution evaluated at the lower endpoint of B
- Return P(B-) + P(B) * [K(x) - K(B-)] / K(B) where K(x) is the within-bin kernel distribution function evaluated at x.
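The four steps above can be sketched as follows. The sketch assumes that every bin is non-empty and that the caller supplies one kernel CDF per bin. `CdfSketch` and its parameter names are hypothetical, and the logistic kernel is a stand-in for the Gaussian kernel, chosen only because its CDF has a closed form.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the piecewise CDF built from within-bin kernels. */
final class CdfSketch {
    /**
     * upper[i] is the upper bound of bin i, mass[i] its probability P(B),
     * kernelCdf[i] the CDF of its within-bin kernel; min is the lower bound of bin 0.
     */
    static double cdf(double x, double min, double[] upper, double[] mass,
                      DoubleUnaryOperator[] kernelCdf) {
        if (x < min) return 0.0;
        if (x >= upper[upper.length - 1]) return 1.0;

        // Steps 1 and 2: find bin B and accumulate P(B-), the mass of the bins below it.
        int b = 0;
        double pBelow = 0.0;
        while (x > upper[b]) {
            pBelow += mass[b];
            b++;
        }
        double lower = (b == 0) ? min : upper[b - 1];

        // Step 3: K(B-) and K(B) under the within-bin kernel.
        double kLower = kernelCdf[b].applyAsDouble(lower);
        double kMass = kernelCdf[b].applyAsDouble(upper[b]) - kLower;

        // Step 4: interpolate within the bin.
        return pBelow + mass[b] * (kernelCdf[b].applyAsDouble(x) - kLower) / kMass;
    }

    /** Logistic CDF with location mu and scale s (stand-in kernel, not the library's). */
    static DoubleUnaryOperator logisticCdf(double mu, double s) {
        return x -> 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }
}
```

At the upper endpoint of a bin the ratio in step 4 equals 1, so the result is P(B-) + P(B). This is why the CDF matches the grouped frequency distribution at the bin endpoints.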
- Specified by:
cumulativeProbability in interface ContinuousDistribution
- Parameters:
x - Point at which the CDF is evaluated.
- Returns:
- the probability that a random variable with this distribution takes a value less than or equal to x.
- Since:
- 3.1
-
inverseCumulativeProbability
public double inverseCumulativeProbability(double p) throws OutOfRangeException
Computes the quantile function of this distribution. For a random variable X distributed according to this distribution, the returned value is
- inf{x in R | P(X<=x) >= p} for 0 < p <= 1,
- inf{x in R | P(X<=x) > 0} for p = 0.
The boundary cases return
- ContinuousDistribution.getSupportLowerBound() for p = 0,
- ContinuousDistribution.getSupportUpperBound() for p = 1.
Algorithm description:
- Find the smallest i such that the sum of the masses of the bins through i is at least p.
- Let B be bin i and define:
Let K be the within-bin kernel distribution for B.
Let K(B) be the mass of B under K.
Let K(B-) be K evaluated at the lower endpoint of B (the combined mass of the bins below B under K).
Let P(B) be the probability of B.
Let P(B-) be the sum of the bin masses below B.
Let pCrit = p - P(B-).
- Return the inverse of K evaluated at K(B-) + pCrit * K(B) / P(B).
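The quantile steps above can be sketched as follows. As a sketch under stated assumptions: `QuantileSketch` and its parameters are hypothetical, every bin is assumed non-empty, an IllegalArgumentException stands in for the library's OutOfRangeException, and a logistic kernel replaces the Gaussian kernel because both its CDF and its inverse have closed forms.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the quantile function built from within-bin kernels. */
final class QuantileSketch {
    static double quantile(double p, double min, double[] upper, double[] mass,
                           DoubleUnaryOperator[] kernelCdf,
                           DoubleUnaryOperator[] kernelInverseCdf) {
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must lie in [0, 1]: " + p);
        }
        if (p == 0.0) return min;                      // lower bound of the support
        if (p == 1.0) return upper[upper.length - 1];  // upper bound of the support

        // Step 1: smallest i such that the cumulative mass through bin i is at least p.
        int i = 0;
        double pBelow = 0.0;  // P(B-)
        while (i < upper.length - 1 && pBelow + mass[i] < p) {
            pBelow += mass[i];
            i++;
        }

        // Step 2: kernel quantities for bin i.
        double lower = (i == 0) ? min : upper[i - 1];
        double kLower = kernelCdf[i].applyAsDouble(lower);             // K(B-)
        double kMass = kernelCdf[i].applyAsDouble(upper[i]) - kLower;  // K(B)
        double pCrit = p - pBelow;

        // Step 3: invert the kernel at K(B-) + pCrit * K(B) / P(B).
        return kernelInverseCdf[i].applyAsDouble(kLower + pCrit * kMass / mass[i]);
    }

    /** Logistic CDF with location mu and scale s (stand-in kernel). */
    static DoubleUnaryOperator logisticCdf(double mu, double s) {
        return x -> 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /** Inverse of logisticCdf. */
    static DoubleUnaryOperator logisticQuantile(double mu, double s) {
        return q -> mu + s * Math.log(q / (1.0 - q));
    }
}
```

The argument passed to the inverse kernel always lies between K(B-) and K(B-) + K(B), so the returned value stays inside bin i.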
- Specified by:
inverseCumulativeProbability in interface ContinuousDistribution
- Overrides:
inverseCumulativeProbability in class AbstractRealDistribution
- Parameters:
p - Cumulative probability.
- Returns:
- the smallest p-quantile of this distribution (largest 0-quantile for p = 0).
- Throws:
OutOfRangeException
- Since:
- 3.1
-
getMean
public double getMean()
Gets the mean of this distribution.
- Specified by:
getMean in interface ContinuousDistribution
- Returns:
- the mean, or Double.NaN if it is not defined.
- Since:
- 3.1
-
getVariance
public double getVariance()
Gets the variance of this distribution.
- Specified by:
getVariance in interface ContinuousDistribution
- Returns:
- the variance, or Double.NaN if it is not defined.
- Since:
- 3.1
-
getSupportLowerBound
public double getSupportLowerBound()
Gets the lower bound of the support. It must return the same value as inverseCumulativeProbability(0), i.e. inf {x in R | P(X <= x) > 0}.
- Specified by:
getSupportLowerBound in interface ContinuousDistribution
- Returns:
- the lower bound of the support.
- Since:
- 3.1
-
getSupportUpperBound
public double getSupportUpperBound()
Gets the upper bound of the support. It must return the same value as inverseCumulativeProbability(1), i.e. inf {x in R | P(X <= x) = 1}.
- Specified by:
getSupportUpperBound in interface ContinuousDistribution
- Returns:
- the upper bound of the support.
- Since:
- 3.1
-
isSupportConnected
public boolean isSupportConnected()
Indicates whether the support is connected, i.e. whether all values between the lower and upper bound of the support are included in the support.
- Specified by:
isSupportConnected in interface ContinuousDistribution
- Returns:
- whether the support is connected.
- Since:
- 3.1
-
createSampler
public ContinuousDistribution.Sampler createSampler(UniformRandomProvider rng)
Creates a sampler.
- Specified by:
createSampler in interface ContinuousDistribution
- Overrides:
createSampler in class AbstractRealDistribution
- Parameters:
rng - Generator of uniformly distributed numbers.
- Returns:
- a sampler that produces random numbers according to this distribution.
-
getKernel
protected ContinuousDistribution getKernel(SummaryStatistics bStats)
The within-bin smoothing kernel. Returns a Gaussian distribution parameterized by bStats, unless the bin contains only one observation, in which case a constant distribution is returned.
- Parameters:
bStats - summary statistics for the bin
- Returns:
- within-bin kernel parameterized by bStats
-
-