When data is available, an empirical distribution can be used in a Monte Carlo simulation. In this article we'll cover the basics of the empirical distribution and how to use it in your models.
Empirical vs. Parametric Distributions
Before we get into details, let's look at the big picture for a minute. A parametric probability distribution is a mathematical function whose shape is governed by parameters. For example, the normal distribution has two parameters: mean and standard deviation.
On the other hand, an empirical distribution is not based on a mathematical function. It is based on data itself.
When we build a model to simulate, some of the inputs are random variables (otherwise we wouldn't be simulating). If a parametric distribution is used to represent a variable, we do one of two things. Either we come up with the parameters through "expert" estimates, or by fitting a distribution to data and estimating the parameters.
When representing a variable through an empirical distribution, we are not making judgements about the distribution, we are saying it follows the data.
There are advantages and disadvantages to both methods, so using an empirical distribution every time there is data may not make sense as we'll see later.
Empirical Distribution Defined
For a value x, the probability ≤ X is:
P(x) = (number of data points ≤ x) / N
Where N is the total number of data points.
Let's say we have the following data that has been sorted from lowest to highest. The data set is small for illustrative purposes.
1 |
2 |
2 |
4 |
4 |
5 |
5 |
5 |
6 |
7 |
7 |
8 |
9 |
9 |
10 |
The empirical distribution of this data is plotted below. Note how cumulative probability steps up at each new unique value. The more data points at a unique value, the bigger the step. Also notice how each vertical line corresponds to a unique value in the data.
Sampling
Sampling from a distribution comes in two flavors: discrete and continuous.
If the nature of what we're modeling is discrete, it means that when we sample from the distribution, the variable can only assume specific values. In our example data set, if the variable is discrete, it could only assume a value contained in the data.
On the other hand, if the variable can assume any value between the minimum and maximum data points, the variable is continuous. In this case we need to do a little extra work to sample.
The first step in any sample is to generate a random number between 0 and 1. Why 0 and 1? This represents the probability of our sample. With the probability, we go into the distribution and find the sample value that corresponds to the probability. How we map the random number to a value depends on the type of data.
Discrete Sampling
Discrete sampling is relatively straightforward. We simply find the random number value on the vertical axis and go to the right until intersecting the distribution curve. Then go down to the horizontal axis and record the value. The value is the sample.
In the example below, we generated a random number of 0.44. The sample value is 5.
Continuous Sampling
Continuous sampling is a little more complicated since we need to interpolate between data points. One way to do this is to use percentiles. To do this, we calculate the value at a given percentile. The random number is the percentile, as a fraction of 1. The chart below has the original empirical distribution as well as the interpolated values calculated as percentiles.
Notice how data values with more than one instance will return that value exactly for some range of probability (vertical lines). For example, if the random number is .3752 to .5, the sample will be exactly 5.
In the example, let's say we generate a random number of 0.77. Moving to the right until we intersect the green line, and then going down to the horizontal axis we get a sample value of 7.78.
Simulation Master RVEMPIRICAL Function
Simulation Master is an Excel® add-in for Monte Carlo simulation. It allows for empirical sampling with the RVEMPIRICAL function. Each time the spreadsheet is calculated, the function samples from the data and returns a new value.
An empirical distribution can be added to the model using the empirical distribution function. The function is typed in the desired cell as shown below.
=RVEMPIRICAL (range, type)
Where:
-
- range is the cell range containing the data upon which the distribution depends.
- type is the type of data, either "continuous" or "discrete". Type must be entered in quotation marks.
The data range must conform to the following:
- All cells must be numeric.
- The range must be one column.
- The range must be sorted ascending (smallest to largest) with the smallest value in the top cell and the largest value in the bottom cell.
Discrete will return only values included in the data range. Continuous uses interpolation to return values between data points. The Excel PERCENTILE.INC function is used to generate the continuous values.
The function must evaluate to a numeric value or an error will result.
Limitations
Using empirical data instead of a parametric random variable function has some limitations:
- The data may not be a good representation of the underlying probability distribution.
- Values outside the range of data cannot be sampled.
- For large data sets, simulation time may increase due to the nature of sampling from data versus calculation of a parametric distribution.
Excel is a registered trademark of Microsoft Corporation. Used with permission from Microsoft.