Suppose a random sample x = (x1, x2, ..., xn) from an unknown probability distribution F has been observed, and we wish to estimate a parameter of interest θ = t(F) on the basis of x. For this purpose, we calculate an estimate θ̂ = s(x) from x. How accurate is θ̂? The bootstrap [1] was introduced in 1979 as a computer-based method for estimating the standard error of θ̂.
The bootstrap is a data-based simulation method for statistical inference. It allows scientists to explore data and draw valid statistical inferences without worrying about mathematical formulas and derivations. The bootstrap parameter estimate is available no matter how mathematically complicated the estimator θ̂ = s(x) may be. In its non-parametric form, the bootstrap provides standard errors and confidence intervals without the usual normal-theory assumptions.
The bootstrap method draws repeated samples (with replacement) from the observed sample itself to generate the sampling distribution of a statistic (a data set of size n has 2^n - 1 nonempty subsets).
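The subset count quoted above can be checked by brute force for a small data set. The sketch below is only an illustration: the function name `nonempty_subsets` and the four data values are hypothetical, and enumeration is feasible only for small n.

```python
from itertools import combinations


def nonempty_subsets(data):
    """Return every nonempty subset of `data` as a tuple."""
    return [
        subset
        for size in range(1, len(data) + 1)
        for subset in combinations(data, size)
    ]


# Hypothetical data set with n = 4 observations.
sample = [2.1, 3.4, 5.9, 7.2]
subsets = nonempty_subsets(sample)

# Both numbers are 15, i.e. 2^4 - 1.
print(len(subsets), 2 ** len(sample) - 1)
```

The count doubles (plus one) with every added observation, which is why resampling is done by random draws rather than by enumeration.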
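Bootstrapping of a statistic θ̂ = s(x) consists of the following steps:
1. B bootstrap samples x*1, x*2, ..., x*B, each of size n, are drawn with replacement from x.
2. The bootstrap replicate θ̂*(b) is computed for each bootstrap sample, that is, θ̂*(b) = s(x*b) for b = 1, 2, ..., B.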
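The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this report: the function name `bootstrap_replicates`, the data values, and the choice of B = 200 are all hypothetical. Taking the sample standard deviation of the replicates as the standard-error estimate is a common convention and is assumed here.

```python
import random
import statistics


def bootstrap_replicates(x, s, B, seed=None):
    """Return [s(x*1), ..., s(x*B)], where each x*b is drawn with replacement from x."""
    rng = random.Random(seed)
    n = len(x)
    replicates = []
    for _ in range(B):
        x_star = rng.choices(x, k=n)  # one bootstrap sample of size n
        replicates.append(s(x_star))  # replicate s(x*b)
    return replicates


# Hypothetical observed sample.
x = [9.6, 10.4, 8.8, 11.3, 10.1, 9.2, 12.0, 10.7]

theta_hat = statistics.mean(x)  # estimate s(x) from the original data
reps = bootstrap_replicates(x, statistics.mean, B=200, seed=0)

# Assumed convention: standard error = sample standard deviation of the replicates.
se_boot = statistics.stdev(reps)
print(theta_hat, se_boot)
```

Any function of a sample can be passed as `s` (for example `statistics.median`), which reflects the point that the method does not depend on how complicated the estimator is.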
Implementation of these steps in a computer language is not difficult. A necessary ingredient for any bootstrap program is a high-quality uniform random number generator. It is important to keep in mind that the bootstrap (and associated methods) is not a tool used in isolation but rather one that is applied to other statistical techniques. For this reason, it is most effectively used in an integrated environment for data analysis. In such an environment, a bootstrap procedure can call other procedures with different sets of inputs (data) and then collect and analyze the results. The S, S-PLUS [2], Gauss, and Matlab packages are examples of integrated environments. In this report, we use the S-PLUS function summary.bootstrap( ). For each data column, 1000 samples of the data (with replacement), or replicates, were generated and the mean of each sample was calculated. The 2.5th and 97.5th empirical percentiles of the replicates of the parameter estimate (sample mean) are the lower and upper bounds of the 95% confidence interval, respectively.
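The per-column calculation described above can be approximated outside S-PLUS as follows. This sketch does not reproduce summary.bootstrap( ), whose internals are not described here. The function names, the column names, and the data values are hypothetical, and the linear-interpolation rule used for the empirical percentiles is an assumption (other percentile definitions give slightly different bounds).

```python
import random
import statistics


def percentile(sorted_values, p):
    """Empirical p-th percentile (0 to 100), interpolating linearly between order statistics."""
    if not sorted_values:
        raise ValueError("percentile of an empty sequence is undefined")
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (k - lo) * (sorted_values[hi] - sorted_values[lo])


def bootstrap_mean_interval(column, B=1000, seed=None):
    """Return the 2.5th and 97.5th percentiles of B bootstrap replicates of the mean."""
    rng = random.Random(seed)
    n = len(column)
    means = sorted(
        statistics.mean(rng.choices(column, k=n))  # mean of one resample
        for _ in range(B)
    )
    return percentile(means, 2.5), percentile(means, 97.5)


# Hypothetical data columns.
data = {
    "column_a": [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4],
    "column_b": [12.0, 15.5, 11.2, 14.8, 13.1, 16.4, 12.7],
}

for name, column in data.items():
    lower, upper = bootstrap_mean_interval(column, B=1000, seed=0)
    print(name, statistics.mean(column), lower, upper)
```

Fixing `seed` makes the replicates reproducible; without it, the bounds vary slightly from run to run because they depend on the random resamples.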