When there is historical data for a random variable, the distribution fitting tool can be used to find an appropriate probability distribution and estimate the distribution parameters.
The first step in distribution fitting is to determine if the data is discrete or continuous. Discrete data can only assume specific values while continuous data can assume any value within a range.
Once the data type is determined, select some appropriate probability distributions to attempt a fit. When selecting distributions, consider if the data is positive, symmetrical, or skewed.
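As a rough illustration of the checks above, the following sketch reports whether a data set is strictly positive and computes its sample skewness. The function name `describe_shape` is hypothetical and is not part of Simulation Master; it only shows the kind of quick screening that can narrow the list of candidate distributions.

```python
def describe_shape(data):
    """Report positivity and (population) sample skewness of the data."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    # Skewness near 0 suggests symmetry; > 0 a right tail; < 0 a left tail.
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {"all_positive": min(data) > 0, "skewness": skewness}


print(describe_shape([1.0, 2.0, 3.0]))         # symmetric sample
print(describe_shape([1.0, 1.0, 1.0, 10.0]))   # right-skewed sample
```

Strictly positive, right-skewed data would point toward distributions such as the lognormal, while roughly symmetric data that can take any sign would point toward the Normal.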
Simulation Master uses maximum likelihood estimation to fit distributions to data. For some distributions, such as the Normal distribution, the parameters can be estimated analytically. For other distributions the parameters must be found numerically, which takes longer.
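To make the analytical versus numerical distinction concrete, here is a minimal sketch, not Simulation Master's actual implementation. The Normal parameters come from closed-form formulas, while the Weibull shape parameter has no closed form and is found here by bisection on its likelihood equation. The function names, the search bracket, and the choice of the Weibull as the numerical example are all assumptions made for illustration.

```python
import math
import random


def fit_normal_mle(data):
    """Closed-form MLE for the Normal distribution."""
    n = len(data)
    mean = sum(data) / n
    # The MLE of the variance divides by n, not n - 1.
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(var)


def fit_weibull_mle(data, lo=0.05, hi=50.0, iterations=60):
    """Numerical MLE for a two-parameter Weibull (data must be > 0)."""
    n = len(data)
    logs = [math.log(x) for x in data]
    mean_log = sum(logs) / n

    def score(k):
        # The root of this increasing function is the MLE of the shape k.
        powers = [x ** k for x in data]
        weighted = sum(p * l for p, l in zip(powers, logs)) / sum(powers)
        return weighted - 1.0 / k - mean_log

    for _ in range(iterations):  # plain bisection on the bracket [lo, hi]
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2
    scale = (sum(x ** shape for x in data) / n) ** (1.0 / shape)
    return shape, scale


random.seed(1)
normal_data = [random.gauss(100, 10) for _ in range(5000)]
weibull_data = [random.weibullvariate(3.0, 2.0) for _ in range(5000)]  # scale 3, shape 2
print(fit_normal_mle(normal_data))
print(fit_weibull_mle(weibull_data))
```

The Normal fit is a single pass over the data, whereas the Weibull fit evaluates the likelihood equation once per iteration, which is why numerically fitted distributions take longer.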
This tutorial consists of three parts:
- Basic distribution fitting
- Fitting with fixed parameters
- Parametric bootstrapping of parameters and goodness of fit p-values
Part I - Basic Distribution Fitting
In Part I we will cover the basics of performing distribution fitting. This will be the basis for Parts II and III. For the tutorial, we will be fitting 5000 data points that were placed in a worksheet.
Click the Fit Distribution button on the ribbon.
The Fit Distribution form will appear.
Select Data to Fit
Enter the cell range containing the data to fit. The range can be typed in directly or selected in the worksheet using the minimize button next to the box.
Data outside a range can be filtered (excluded) by selecting lower and upper bounds of data to include when fitting a distribution. Data that equals the lower or upper bounds will be included. If all data is to be included, select -Infinity and +Infinity.
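The filtering rule can be sketched in a few lines. This is only an illustration of the inclusive-bounds behavior described above; `filter_data` is a hypothetical helper, not a Simulation Master function.

```python
import math


def filter_data(data, lower=-math.inf, upper=math.inf):
    """Keep values inside [lower, upper]; values equal to a bound are kept."""
    return [x for x in data if lower <= x <= upper]


values = [1, 2, 3, 4, 5]
print(filter_data(values, 2, 4))   # bounds are inclusive
print(filter_data(values))         # -Infinity / +Infinity keeps everything
```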
If a distribution's parameters must be solved numerically, the intensity setting controls how much optimization effort is spent searching for them. A higher intensity performs a more exhaustive search for parameters and takes longer to complete.
Distributions to Fit
Select the type of data. If the data is continuous, a list of continuous distributions will appear. If the data is discrete, a list of discrete distributions will appear. Select distributions that you want to attempt to fit to the data.
The Fit Distribution form will look like this:
Click the Fit Data button to start fitting.
Distribution Fitting Results
Once the fitting process is complete the Distribution Fit Results form will appear.
Simulation Master uses information criteria to provide a relative ranking of distribution fits. The three information criteria are:
- Bayesian Information Criterion (BIC)
- Akaike Information Criterion (AIC)
- Hannan-Quinn Information Criterion (HQIC)
A lower information criterion value indicates a better relative fit.
An information criterion value on its own does not say anything about how well a distribution fits the data; it is only meaningful as a comparison against other candidate distributions. Information criteria should not be the only factor in selecting a distribution. How well the distribution matches the data statistics and how well it fits the data visually are also important considerations.
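For reference, the sketch below computes the three criteria from their textbook definitions, where ln L is the maximized log-likelihood, k is the number of fitted parameters, and n is the number of data points. Whether Simulation Master uses exactly these forms is an assumption; the helper names and the Exponential comparison candidate are illustrative only.

```python
import math
import random


def information_criteria(log_likelihood, k, n):
    """Textbook AIC, BIC, and HQIC; lower values rank better."""
    return {
        "AIC": 2 * k - 2 * log_likelihood,
        "BIC": k * math.log(n) - 2 * log_likelihood,
        "HQIC": 2 * k * math.log(math.log(n)) - 2 * log_likelihood,
    }


def normal_loglik(data, mean, sd):
    n = len(data)
    ss = sum((x - mean) ** 2 for x in data)
    return -n * math.log(sd) - 0.5 * n * math.log(2 * math.pi) - ss / (2 * sd * sd)


def exponential_loglik(data, rate):
    return len(data) * math.log(rate) - rate * sum(data)


random.seed(2)
data = [random.gauss(100, 10) for _ in range(5000)]
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

normal_ic = information_criteria(normal_loglik(data, mean, sd), k=2, n=n)
expon_ic = information_criteria(exponential_loglik(data, 1.0 / mean), k=1, n=n)
print("Normal:     ", normal_ic)
print("Exponential:", expon_ic)
```

On Normal-looking data the Normal candidate should score lower on all three criteria, but the numbers only rank the candidates against each other; they do not show that either fit is good.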
The upper left box of the Distribution Fit Results form lists the top three distributions for each information criterion. They are sorted according to BIC. Clicking on each distribution will show parameters, statistics, and the data histogram with a curve of the fitted distribution as an overlay.
The Normal distribution has the best fit.
The messages box shows whether an error occurred while fitting or whether a distribution's support does not match the data. For example, if the data contains negative values and the lognormal distribution was selected for fitting, a message will appear stating that the lognormal distribution is not supported for values <= 0.
The parameters for the selected distribution will appear in this box. If parametric bootstrapping was selected (covered in Part III), the confidence interval for each parameter will also appear.
Goodness of Fit
For continuous distributions, the Anderson-Darling, Kolmogorov-Smirnov, and Chi Square goodness of fit statistics are calculated and displayed in this box. For discrete distributions, only the Chi Square statistic is calculated. If p-value bootstrapping is selected (covered in Part III) then the p-values will also be displayed.
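As a sketch of what two of these statistics measure, the code below computes the Kolmogorov-Smirnov and Anderson-Darling statistics for a fitted Normal distribution using their standard formulas. It is not Simulation Master's code; the function names are hypothetical, and the clamping of CDF values is an added safeguard against taking the logarithm of zero.

```python
import math
import random


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and the fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ad_statistic(data, cdf):
    """Anderson-Darling A^2, which weights the tails more heavily than KS."""
    xs = sorted(data)
    n = len(xs)
    tiny = 1e-12  # keep CDF values away from 0 and 1 before taking logs
    f = [min(max(cdf(x), tiny), 1.0 - tiny) for x in xs]
    total = 0.0
    for i in range(1, n + 1):
        total += (2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
    return -n - total / n


random.seed(3)
data = [random.gauss(100, 10) for _ in range(5000)]
mean = sum(data) / len(data)
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))


def fitted(x):
    return normal_cdf(x, mean, sd)


print("KS:", ks_statistic(data, fitted))
print("AD:", ad_statistic(data, fitted))
```

Smaller values indicate closer agreement between the data and the fitted distribution. Turning a statistic into a p-value requires its null distribution, which is what the p-value bootstrapping in Part III estimates.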
The statistics box shows statistics for the data and statistics for the fitted distribution based on the parameters in the parameters box. This is useful for comparing the actual data to the distribution to see if the fit makes sense.
A histogram of the data overlaid with a curve of the distribution using the fitted parameters allows for a visual inspection of the fit.
Write to Cell
A random variable formula can be inserted into the workbook using the Write to Cell button. The selected distribution’s RV formula and the fitted parameters will be inserted into the selected cell.
A fit report can be created by clicking the Create Report button. The Fit Reports form will appear as shown below.
Select any distributions for which to create a report. In our example we will just create a report for the Normal distribution.
Select/deselect any charts to include in the report. We will include all charts.
Charts & Data
There are two ways to include charts in a report. The "Live charts with data" option adds all data used to create the charts to the report worksheet and then creates Excel charts that can be edited. The "Charts as images, no data" option creates the charts in a temporary worksheet and adds them to the report as images, so no chart data is placed in the report worksheet. We will use live charts with data.
The report worksheets can be created in the current workbook or in a new workbook. We will create the report in the current workbook.
When ready, click OK to create the report(s). The report is shown below.
Part II - Fitting with Fixed Parameters
If one or more parameters are known, they can be fixed and the software will find the unknown parameters. In this part we'll fit the Normal distribution using a fixed value for the mean. Assume we've opened the Fit Distribution form and entered information on the Data page as was done in Part I, except that only the Normal distribution is selected for fitting.
With the Fit Distribution form open, click Fixed Parameters on the menu. The Fixed Parameters page will appear.
Select the Normal distribution in the list box. The parameters for the distribution will appear. Enter the fixed parameter(s) in the appropriate box and leave the rest of the boxes blank.
You can enter fixed parameters for more than one distribution by selecting each distribution and entering any fixed parameters for each one.
In our earlier example we found the Normal distribution parameters mean = 99.92 and standard deviation = 10.0588. Let's say we know that the mean is 100, but want the software to find the standard deviation. To use a fixed mean we enter the following in the Fixed Parameters page.
When finished, click the Fit Data button to start fitting.
After running, the Distribution Fit Results form will appear as shown below. Notice in the parameters box that the mean is fixed at 100, and the software has found that the standard deviation is 10.0591.
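For the Normal distribution this result can be checked by hand. With the mean fixed at a known value, the maximum likelihood estimate of the standard deviation is the root mean squared deviation from that fixed mean, which equals sqrt(free_sd^2 + (free_mean - fixed_mean)^2). The sketch below uses hypothetical helper names and the rounded values from this tutorial, so it reproduces the reported figure only approximately.

```python
import math


def fit_sd_with_fixed_mean(data, fixed_mean):
    """MLE of the Normal standard deviation when the mean is held fixed."""
    n = len(data)
    return math.sqrt(sum((x - fixed_mean) ** 2 for x in data) / n)


def sd_from_free_fit(free_mean, free_sd, fixed_mean):
    """Same estimate, derived from the unconstrained fit's parameters."""
    return math.sqrt(free_sd ** 2 + (free_mean - fixed_mean) ** 2)


# Rounded values from Part I; the original data is not reproduced here.
approx_sd = sd_from_free_fit(99.92, 10.0588, 100.0)
print(round(approx_sd, 4))
```

Fixing the mean away from the sample mean can only increase the estimated standard deviation, which is why the fixed-mean value is slightly larger than 10.0588.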
Part III - Parametric Bootstrapping of Parameters and Goodness of Fit p-values
With parametric bootstrapping we can obtain a possible range of parameter values for a given confidence level. We can also obtain p-value estimates through simulation. Both of these processes are very time-consuming, so the best practice is to first find the best-fitting distribution among several candidates and then perform bootstrapping on that distribution only.
Assume we've opened the Fit Distribution form and entered information on the Data page as was done in Part I, except that only the Normal distribution is selected for fitting. With the Fit Distribution form open, click Other Settings on the menu. The Other Settings page will appear.
This pane controls the parametric bootstrapping of parameter ranges. Enter a confidence level and the number of simulations (bootstraps) to perform. In our example, we will use a 95% confidence level and 500 bootstraps.
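The idea behind parametric bootstrapping of parameter ranges can be sketched as follows: fit the distribution, repeatedly simulate new data sets of the same size from the fitted distribution, refit each one, and read the confidence interval off the percentiles of the refitted parameters. This is a minimal percentile-method sketch for the Normal distribution; the exact interval method Simulation Master uses is not described in this tutorial, and the function names are hypothetical. A smaller data set and fewer bootstraps than in the tutorial are used to keep it fast.

```python
import math
import random


def fit_normal(data):
    n = len(data)
    mean = sum(data) / n
    return mean, math.sqrt(sum((x - mean) ** 2 for x in data) / n)


def bootstrap_ci(data, confidence=0.95, n_boot=500, seed=0):
    """Percentile confidence intervals from a parametric bootstrap."""
    rng = random.Random(seed)
    n = len(data)
    mean, sd = fit_normal(data)
    means, sds = [], []
    for _ in range(n_boot):
        # Simulate from the fitted distribution, then refit.
        sample = [rng.gauss(mean, sd) for _ in range(n)]
        m, s = fit_normal(sample)
        means.append(m)
        sds.append(s)

    alpha = 1.0 - confidence

    def interval(values):
        ordered = sorted(values)
        last = len(ordered) - 1
        low = ordered[int(alpha / 2 * last)]
        high = ordered[int(math.ceil((1 - alpha / 2) * last))]
        return low, high

    return {"mean": interval(means), "sd": interval(sds)}


random.seed(4)
data = [random.gauss(100, 10) for _ in range(1000)]
ci = bootstrap_ci(data, confidence=0.95, n_boot=200)
print(ci)
```

Each bootstrap repeats the full fitting process, so the run time grows in proportion to the number of bootstraps, which is why bootstrapping is best reserved for the distribution already chosen.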
We will estimate the p-value for the Anderson-Darling goodness of fit statistic with 500 simulations.
When finished, click the Fit Data button to start fitting. After running, the Distribution Fit Results form will appear as shown below. Now the 95% confidence intervals for each parameter are shown in the parameters box. The p-value for the Anderson-Darling statistic is shown in the goodness of fit box.
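A simulated p-value of this kind can be sketched as follows: compute the Anderson-Darling statistic for the fitted distribution, simulate many data sets from that fitted distribution, refit and recompute the statistic for each, and report the fraction of simulated statistics at least as large as the observed one. This is an illustrative sketch for the Normal distribution under those assumptions, not Simulation Master's implementation, and all names in it are hypothetical.

```python
import math
import random


def fit_normal(data):
    n = len(data)
    mean = sum(data) / n
    return mean, math.sqrt(sum((x - mean) ** 2 for x in data) / n)


def ad_statistic(data, mean, sd):
    """Anderson-Darling A^2 against a Normal(mean, sd) distribution."""
    xs = sorted(data)
    n = len(xs)
    tiny = 1e-12  # avoid log(0) in the extreme tails
    f = []
    for x in xs:
        p = 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
        f.append(min(max(p, tiny), 1.0 - tiny))
    total = 0.0
    for i in range(1, n + 1):
        total += (2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
    return -n - total / n


def bootstrap_p_value(data, n_boot=500, seed=0):
    rng = random.Random(seed)
    n = len(data)
    mean, sd = fit_normal(data)
    observed = ad_statistic(data, mean, sd)
    exceed = 0
    for _ in range(n_boot):
        sample = [rng.gauss(mean, sd) for _ in range(n)]
        m, s = fit_normal(sample)  # refit so estimation error is accounted for
        if ad_statistic(sample, m, s) >= observed:
            exceed += 1
    return observed, exceed / n_boot


random.seed(5)
data = [random.gauss(100, 10) for _ in range(500)]
statistic, p_value = bootstrap_p_value(data, n_boot=200)
print("AD statistic:", statistic, "p-value:", p_value)
```

A small p-value suggests the data is unlikely to have come from the fitted distribution, while a large one means the test found no evidence against it.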