Accurate estimates of IT work effort are critical for deciding where in technology a business should invest.

Lacking experience with similar projects, the business is often at a loss for hard data. In this article, we describe our benefit from the power and convenience of R in the elicitation task, or, in other words, in quantifying the uncertainty around IT project lifespans using probability distributions. We show how R's built in functionality makes the elicitation task painless, while demonstrating how the methodology can be implemented in a user-friendly format.

## Introduction

An **IT estimate** attempts to approximate the cost of a roughly defined set of work. It is a difficult problem. In some cases, the scope may not be fully articulated when the estimate is requested. In other cases, the items being estimated may represent entirely new capabilities never attempted before. Unlike the mass production of a 'widget', there is no prior experience with the effort required.

Regardless of prior experience with a task, actual units of work will undoubtedly miss the estimate provided. The resulting uncertainty has historically been both difficult to quantify and difficult to communicate. The cost of building a new capability in-house is not as certain as the cost of buying a software license or a piece of hardware. Decision makers, as humans, tend to focus on the number given: if the estimate is 5,000 hours, the decision on whether to proceed tends to be based on the cost of * exactly * 5,000 hours of effort.

At Nationwide Insurance, we are evolving the discipline of IT estimation to include sound statistical techniques that directly address this uncertainty. Although the technical leads lack fully articulated scope and prior experience, they possess knowledge of the technical skills required and the similarity of the new work to previously completed work. Individuals with IT experience know that the actual cost can vary both above and below the estimate, and they can acknowledge that the total effort will at best fall within a given range.

## Eliciting Distributions with R

**Elicitation** is the process of translating an expert’s knowledge and uncertainty about an unknown quantity into a probability distribution. The need to elicit probability distributions arises naturally in Bayesian statistical applications, where prior probability distributions must be specified for all unknown quantities.

Here we attempt the elicitation task in the context of IT estimation, where the unknown quantity is the number of hours (assumed to be proportional to dollars) that a project will take. Specifically, we wish to interview the project's technical lead, one with extensive knowledge in the IT domain of interest.

Since our data are time (in hours) until a project ends, we use a popular distribution in the analysis of survival data, the Weibull distribution, to characterize project lifetime uncertainty. Define T to be the number of hours a particular project will take to complete. Thus, we assume

**T ~ Weibull(λ,k)**.

The distribution function F_T(t) = Pr(T ≤ t) contains all information about the distribution of T, and since we've restricted ourselves to the Weibull family, F_T(t) is entirely specified when λ,k are known.

In R, the distribution of the Weibull function is conveniently built in:

# Demonstration of the pweibull() function k <- 3; lambda <- 10; t <- c(1:15) pweibull(t, shape = k, scale = lambda)

In the example above, the Weibull shape parameter, k, is set to three; the Weibull scale parameter, λ, is set to ten. With the distribution now fully specified, the code above computes F_T(t) = Pr(T ≤ t) for t = 1,...,15.

The problem of finding the "best" distribution function has been reduced to the problem of finding the "best" values for λ and k. While we live in a world where nonparametric techniques do not confine the number of descriptive parameters, this two-parameter approach is chosen for the sake of simplicity and convenience, at the cost of confinement to the Weibull class.

In their review of elicitation methodologies, Garthwaite et. al. (2005) discuss the *quantile method,* where for a given value of q between 0 and 1, the user must guess the value of the unknown quantity, v, such that its distribution function F_T(v) equals q. In our application, this would correspond to guessing the hour t such that the project will be completed with probability q (with q provided).

The quantile method, applied directly, has the advantage of being neutral to the distribution, as all real-valued random variables have a cumulative distribution function. However, we seek questions that fit comfortably in everyday conversation, and instead ask questions about probabilities of events. Consequently, our questions are phrased in the following form:

*"What are the chances that you will finish the project in under [t] hours (from .00 to 1.00)?"*

This direction, however, creates one complication: a reasonable set of questions that asks about F_T(t) *does* depend on the specifics of the distribution. For instance, it would be wholly unsatisfactory to inquire only about F_T(1.5), F_T(4.1), and F_T(9.0) if the expected value of T is 1000 hours!

Our solution is a two-stage survey, where the first question is used to generate future questions. It is

*"In an over-under betting scenario, what hour would be a cutoff for making a fair bet on how long the project will last?"*

From this point on, we invite the reader to imagine a personal project (e.g., a project at work, doing taxes, or even cleaning the house), and take part in the survey as would our project managers:

User.Median.Guess <- as.numeric( winDialogString("In an over-under betting scenario, what hour would be a cutoff for making a fair bet on how long the project will last?", default = "1000") )

This is, of course, the median, and a direct application of the quantile method. Since the level to exposure to the concept of a "fair bet" is relatively high, this meets our standard of conversational questioning.

With the two unknown parameters of the Weibull distribution, the median alone does not uniquely specify k and λ values for the Weibull distribution. However, the case k = 1 is special. Here, the Weibull distribution simplifies to the exponential distribution with scale parameter λ. Taken on faith, it would imply that projects don't "age," and that new projects are just as likely to end as old ones. Considering the reality of unforeseen circumstances and scope creep, this simplification may not be entirely unrealistic.

Since the theoretical median of an exponential distribution is ln(2)/λ, the user’s guess of the fair betting point gives us an initial value for scale parameter λ=[Median Guess] /ln(2). In R,

lambda.initial <- User.Median.Guess / log(2) k.initial <- 1

The initial value of k=1 and the expert’s implied λ comes in handy as a way to dynamically "spread out" the rest of the questions. The built-in R function qweibull() quickly gets us percentiles 1-99 from this initial distribution:

percentile <- seq(from = .01, to = .99, by = .01) Weibull.percentiles <- sapply( percentile, function(i) qweibull(i, shape=k.initial, scale=lambda.initial) )

We now ask the user about the probabilities of finishing the project in less than t1, t2, …, tk hours, where the ti are appropriately spread out percentiles of the initial distribution. Specifically, we used the 10th, 25th, 75th, 90th and 95th percentiles, in addition to the median (50th percentile) asked in the "fair bet" form.

By running the code below, we invite the reader to answer these questions with the same personal project in mind. The defaults, when entered, produce the output shown throughout the remainder of this article.

User.P10.Guess <- as.numeric( winDialogString( paste("What are the chances that you will finish the project in under", floor(Weibull.percentiles[10]), "hours?") , ".01") ) User.P25.Guess <- as.numeric( winDialogString( paste("What are the chances that you will finish the project in under", floor(Weibull.percentiles[25]), "hours?") , ".07") ) User.P75.Guess <- as.numeric( winDialogString( paste("What are the chances that you will finish the project in under", floor(Weibull.percentiles[75]), "hours?") , ".80") ) User.P90.Guess <- as.numeric( winDialogString( paste("What are the chances that you will finish the project in under", floor(Weibull.percentiles[90]), "hours?") , ".97") ) User.P95.Guess <- as.numeric( winDialogString( paste("What are the chances that you will finish the project in under", floor(Weibull.percentiles[95]), "hours?") , ".995") )

At the point that the user finishes her questions, we create the R data frame *user.summary*, consisting of only two columns: the user's estimates that the project will be finished by the hours asked about, and the hours themselves.

user.summary <- data.frame( user.cdf = c(User.P10.Guess, User.P25.Guess, .50, User.P75.Guess, User.P90.Guess, User.P95.Guess), hour = c(floor(Weibull.percentiles[10]), floor(Weibull.percentiles[25]), User.Median.Guess, floor(Weibull.percentiles[75]), floor(Weibull.percentiles[90]), floor(Weibull.percentiles[95]) ) )

With data in hand, we now remove the restriction that k = 1 and free λ from [Median Guess] /ln(2). In fact, these two previous values make excellent starting points for the non-linear regression below, effortlessly implemented in R using the nls() function.

nls.CDF <- nls( user.cdf ~ 1 - exp(-(hour/lambda)^k) , data = user.summary, start = list(lambda = lambda.initial, k = k.initial) ) lambda.final <- summary(nls.CDF)$parameters["lambda","Estimate"] k.final <- summary(nls.CDF)$parameters["k","Estimate"]

We can visualize the fit of the Weibull distribution to the user summaries by plotting the guesses as points alongside the Weibull CDF. The code below defines a visualization window using the qweibull() function and the Weibull CDF using the pweibull() function.

#Define a sequence covering the middle 99% of the final Weibull distribution hour <- seq(from = qweibull(.005, shape = k.final, scale=lambda.final), to = qweibull(.995, shape = k.final, scale=lambda.final), by = .5) #Plot the final cumulative distribution function over the previous sequence plot(pweibull(hour, shape=k.final, scale=lambda.final) ~ hour, col = "blue", t = "l", ylab = "Probability", ylim = c(0,1), xlab = "Hour into Job", main = "CDF of Best Fit Weibull Distribution", sub = paste("lambda =", round(lambda.final, 2), ", k =", round(k.final, 2) ) ) #Overlaying points to represent the user's probability guesses points(user.summary$user.cdf~user.summary$hour, col = "red")

Furthermore, we can visualize the resulting density using R's dweibull() function:

plot(dweibull(hour, scale=lambda.final, shape = k.final) ~ hour, col = "blue", t = "l", ylab = "Density", xlab = "Hour into Job", main = "Density of Best Fit Weibull Distribution", sub = paste("lambda =", round(lambda.final, 2), ", k =", round(k.final, 2) ) )

Again, using the defaults results in

Finally, we simulate data from the final Weibull distribution, with the sample size as chosen by the user. The rweibull() function allows this to be done in a single line of code.

> print(sim.data) [1] 1203.2629 1697.6471 345.0274 1418.3952 2038.9148 216.3513 505.0067 2282.9990 563.4299 1410.0922 231.5188 735.0479 [13] 1217.1852 483.2952 966.5491

The simulated data serves two purposes. First, since there is no convenient form for the sum of Weibull random variables, the user may estimate the distribution of time to complete a multi-task project using Monte Carlo simulation. Second, seeing a list of “possible universes” in which the project plays out is a different way of absorbing information from the elicited distribution, perhaps one that is particularly enlightening for someone without formal training in probability.

## Discussion

Assigning probabilities to human beliefs is an aggressive application of probability theory. Yet the internally consistent framework of probability theory can provide a quantification of uncertainty where it would otherwise not exist. Newer and more sophisticated methods than ours exist for eliciting probability distributions. For instance, Oakley and O’hagan (2007) allow the elicited distribution function to vary non-parametrically about a known class. We would be delighted to see their method implemented in R.

The power of R's probability toolbox allowed us to rapidly prototype an application which transported the basic concepts of elicitation to the IT project management space. By continuing to refine the system's interview process, underlying elicitation methodology, and user output, we hope to improve the process by which technology investments are evaluated and decided upon.

## References

Garthwaite, P. H., Kadane, J.B. and O’Hagan, A. (2005). “Statistical Methods for Eliciting Probability Distributions.” Journal of the American Statistical Association. Vol. 100, No. 470, pp. 680-701.

Oakley, J.E. and O’hagan, A. (2007). “Uncertainty in prior elicitations: a nonparametric approach.” Biometrika. Vol 94, No. 2, pp. 427-441.

R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Contact us: ogorekb@nationwide.com, terrys@nationwide.com