Simulation in R, Part 2

CataractLake

In Simulation in R, Part 1, I introduced the role of simulation in Operations Research. I’ll continue the discussion here and attempt to provide a realistic answer to the estimate for my hypothetical home project. As a refresher, here’s what the estimates for the home project were in the last post:

Design:  4 week estimate with a range of 2 to 8 weeks

Build:  8 week estimate with a range of 4 to ? weeks

Finish:  2 week estimate with a range of 1 to 3 weeks

Install:  1 week estimate with a range of 1 to 2 weeks

Each step has a degree of uncertainty that ultimately may result in a best guess of 15 weeks being significantly off. My goal is to build a model of the uncertainty and then use a simulation to provide a more realistic estimate.

A critical part of the simulation is determining how to model the uncertainty in each step. This is done by choosing the appropriate probability density function (PDF). The PDF models the likelihood that a particular result might happen.

One of the simplest PDFs is the uniform distribution. An example of a uniform distribution is a six sided die. If you roll the die many times, you will get a uniform split between all six sides. The example below pulls 60,000 random samples between 0 and 6 from a uniform distribution function in R. As expected, the results are uniformly distributed.

clip_image001

The Install phase of my home project maps to a uniform function. I noted in the earlier post that is was equally likely to be 1 or 2 weeks in duration. This is the same as flipping a coin to see how long this phase will take.

Another common PDF used in simulations is the triangle function. The Design and Finish steps in my project fit this model. For example, the Design phase had a minimum of 2 weeks, a maximum of 8 weeks, and a most likely result of 4 weeks. As the name implies, the distribution shown below from a 10,000 point sample is triangular. Of 10,000 samples, about 3,000 are very close to 4 weeks. But many other results also exist.

clip_image002

That leaves me with the Build phase where I have minimum and most likely estimates, but there’s a ‘long tail’ of potential maximum estimates. A good way to model this is with a log normal distribution. Log normal is a close relative to the normal distribution (the ‘bell shaped curve’ that describes true randomness and is the basis for many statistical techniques). The major advantage of log normal for modeling a time estimate is that it’s one sided and never negative.

The plot below simulates my Build phase. Over 60% of the samples happen in 10 weeks or less, but there are possibilities that stretch from 1-2 years. This seems about right (sad but true!).

clip_image003

OK, the hard part is over. I have models for all four of my input steps. Now it’s just a few lines of code to perform a 10,000 sample simulation of my project.

# Create vectors of 10,000 random samples using PDF function
Design <- triangle::rtriangle(n=10000, a=2, b=8, c=4)
Build <- rlnorm(n=10000, meanlog=log(8), sdlog=log(2))
Finish <- triangle::rtriangle(n=10000, a=1, b=3, c=2)
Install <- ceiling(runif(n=10000, min=0, max=2)) 

If you’re new to R, a vector is like a one dimensional array. Here are the first three values of each of the vectors above. These represent random samples from the PDF for each project step. For example, remember that Design is a triangle PDF in the range of 2 to 8.

> Design[1:3]
[1] 3.721639 3.724241 2.290769
> Build[1:3]
[1] 29.701198 7.580633 2.523138
> Finish[1:3]
[1] 1.750079 1.875441 1.936408
> Install[1:3]
[1] 1 1 1

R is built to easily perform vector operations. The following statement does an element by element addition of the 10,000 samples in each vector. You can see this by verifying that the first three values of TotalTime are the sum of the values above. In just these three samples, the project time range varies from less than 8 weeks to over 36 weeks.

# Sum each of the random samples

TotalTime <- Design + Build + Finish + Install

> TotalTime[1:3]
[1] 36.172916 14.180314 7.750316

Now I have 10,000 simulated results for my project generated from randomized values for each step. The summary function will provide some basic statistics of these results. Plotting a histogram is a nice way to visualize the variation.

> summary(TotalTime)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.979 13.167 16.278 18.214 20.966 155.142

hist(TotalTime, main=”Distribution of home project estimates”)

clip_image004

The results provide a realistic understanding of how long my “15 week” project might take. On average, it will take 18 weeks and not 15 weeks. It will be completed between 13 and 21 weeks about half of the time (25th and 75th percentiles). While neither the 6 week minimum nor the 155 week maximum is very likely, these extremes give me an additional idea of how this project could go.

While this turned out to be a long post, it was seven lines of R code that performed a 10,000 simulation of a 4 step process along with descriptive statistics and a plot of the result. This is why I like R so much for this type of analysis… the language capabilities are optimized for statistical programming. You can do a lot with very few steps, but it takes much longer to explain what you did!

Picture details:  Cataract Lake (Maplewood State Park), 11/22/2018, iPhone 7, f1.8, 1/40 s, ISO 25