CHAPTER 3

SIMULATION OUTPUT DATA ANALYSIS

Simulation output data analysis involves computing estimates of model parameters from data generated by the simulation model. This step is important because sound analysis produces reliable estimates from the simulation and supports reliable predictions about the real-world system. In the JSIM simulation package, the output data is typically stochastic, because the simulation model draws some of its input values from random number generators.

3.1 Nature of Simulation Output

In most cases, the output data from the model is used to estimate a parameter θ of the model in which the user is interested. From a statistical point of view, it is necessary to have both a point estimate and an interval estimate of θ. If the simulation output data is of the form {Y1, Y2, …, Yn}, then the output is discrete-time data. If the output process Y1, Y2, … is stationary and θ is the mean of the process, then the point estimate of θ based on the data {Y1, Y2, …, Yn} is:

$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} Y_i \qquad (1)$$

An interval estimate of θ with confidence coefficient 1 − α is given by:

$$\hat{\theta} \pm t_{\alpha/2,f}\,\hat{\sigma}(\hat{\theta}) \qquad (2)$$

where $\hat{\sigma}(\hat{\theta})$ is an estimate of the standard deviation of the point estimate and $t_{\alpha/2,f}$ is the 100(1 − α/2) percentage point of a t distribution with f degrees of freedom.

If Y1, Y2, … are independent and identically distributed, then the variance of the point estimate can be estimated by the following two equations:

$$\hat{\sigma}^2(\hat{\theta}) = \frac{S^2}{n} \qquad (3)$$

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \hat{\theta}\right)^2 \qquad (4)$$

and the appropriate number of degrees of freedom is f = n − 1.
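As a minimal sketch of the i.i.d. case described above, the following Python function computes the sample mean and a t-based confidence interval. The function name `iid_confidence_interval`, the sample data, and the convention of passing the t critical value in as `t_crit` (read from a t table) are assumptions made for illustration; they are not part of JSIM.

```python
import math
from statistics import mean, variance


def iid_confidence_interval(ys, t_crit):
    """Point estimate and t-based confidence interval for i.i.d. data.

    t_crit is the t critical value for f = n - 1 degrees of freedom.
    It is supplied by the caller (for example from a t table) so that
    this sketch needs only the standard library.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    theta_hat = mean(ys)            # sample mean (point estimate)
    s2 = variance(ys)               # sample variance, divisor n - 1
    std_err = math.sqrt(s2 / n)     # estimated std. deviation of the mean
    half_width = t_crit * std_err
    return theta_hat, (theta_hat - half_width, theta_hat + half_width)


# Hypothetical data: n = 10 observations, so f = 9 and t_{0.025, 9} = 2.262.
ys = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0, 6.0, 4.0, 5.0, 5.0]
theta_hat, (lower, upper) = iid_confidence_interval(ys, t_crit=2.262)
print(f"estimate {theta_hat:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

The interval is only trustworthy when the observations really are independent and identically distributed, which raw simulation output usually is not.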

For the purpose of analyzing output data from a stochastic model, there are two distinct types of simulation: terminating (or transient) simulation and steady-state simulation. A transient simulation is one that runs for some duration of time TE, where E is a specified event (or set of events) that stops the simulation. Such a simulation starts at time 0 and ends at time TE. A steady-state simulation is one whose objective is to study the long-run, or steady-state, behavior of a non-terminating system [Banks, et al. 1996].

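To estimate the steady-state mean θ, the following limit equation shows that the sample mean is an appropriate point estimator:

$$\theta = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} Y_i \quad \text{with probability 1} \qquad (5)$$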

3.2 The Initial Conditions

The difficulty lies in estimating θ from simulation data that actually reflects steady-state conditions. The simulation analyst must decide two things: the run length TE of the simulation, and how to deal with the bias in the point estimator caused by artificial or arbitrary initial conditions.

There are at least two ways to reduce the bias caused by the initial conditions. If the real system exists, the simulation analyst can collect typical initial conditions from it and use them to initialize the simulation. The other method is to divide the simulation run into two phases. The first phase runs from time 0 to time T0; the system warms up during this phase and its output data is discarded. The data collected from time T0 to time TE is then used to estimate θ. The choice of T0 is very important: the state of the system at T0 should be more representative of steady-state behavior than the original initial conditions at time 0. If T0 is too small, the system has not warmed up and the estimator will be biased. If T0 is too large, a lot of useful data is wasted.
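The effect of the warm-up phase can be illustrated with a short Python sketch. It assumes the warm-up period is expressed as a number of leading observations rather than as a clock time T0. The function name `truncated_mean` and the data values are hypothetical.

```python
from statistics import mean


def truncated_mean(ys, warmup):
    """Mean of the observations left after discarding the first `warmup` values."""
    if not 0 <= warmup < len(ys):
        raise ValueError("warmup must leave at least one observation")
    return mean(ys[warmup:])


# Hypothetical output of a system that starts empty (0) and settles near 10.
ys = [0.0, 2.0, 5.0, 8.0, 10.0, 11.0, 9.0, 10.0, 10.0, 10.0, 9.0, 11.0]

for warmup in (0, 2, 4, 8):
    kept = len(ys) - warmup
    print(f"warm-up {warmup}: {kept} observations kept, "
          f"mean {truncated_mean(ys, warmup):.3f}")
```

With no deletion the low start-up values pull the estimate down. Deleting more than the transient does not change the estimate much, but leaves fewer observations for estimating the variance.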

3.3 Independent Replications

For most simulations, the observed process is non-stationary and autocorrelated. Equations 2, 3 and 4, which belong to the classical statistical analysis of independent and identically distributed (i.i.d.) observations, cannot be used directly [Law, Kelton, 1982]. To get around this problem, several methods have been suggested in the output data analysis literature. The most widely accepted are the independent replications method and the batch means method. Both try to avoid autocorrelation by breaking the data into "independent" segments. The means of these segments are treated as i.i.d. and used to calculate a confidence interval.

In the independent replications method, several independent runs of the simulation are made. Each run is made from scratch, using a different stream of random numbers. Let k represent the number of runs and n denote the number of observations in each run, so the total number of observations is k * n.

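To reduce the bias from the initial conditions, the simulation analyst can choose to delete the first d observations of each run from the output data. Let Yij denote the jth observation of the ith run, and let

$$\bar{Y}_i = \frac{1}{n-d}\sum_{j=d+1}^{n} Y_{ij} \qquad (6)$$

be the sample mean of the ith run, computed from the n − d observations that remain after the first d are deleted. These means are i.i.d. random variables, since the runs themselves are independent.

The point estimator is

$$\hat{\theta} = \frac{1}{k}\sum_{i=1}^{k} \bar{Y}_i \qquad (7)$$

The variance of the replication means is estimated by the sample variance

$$S^2 = \frac{1}{k-1}\sum_{i=1}^{k}\left(\bar{Y}_i - \hat{\theta}\right)^2 \qquad (8)$$

and the standard error of the point estimator is given by

$$\hat{\sigma}(\hat{\theta}) = \frac{S}{\sqrt{k}} \qquad (9)$$

The 100(1 − α)% confidence interval for θ, based on the t distribution with k − 1 degrees of freedom, is given by

$$\hat{\theta} \pm t_{\alpha/2,k-1}\,\frac{S}{\sqrt{k}} \qquad (10)$$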
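The independent replications calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the JSIM implementation: the function name `replication_estimate`, the sample data, and the caller-supplied critical value `t_crit` (for k − 1 degrees of freedom) are all hypothetical.

```python
import math
from statistics import mean, stdev


def replication_estimate(runs, d, t_crit):
    """Independent-replications estimate from k runs.

    runs   : list of k runs, each a list of n observations
    d      : number of initial observations deleted from every run
    t_crit : t critical value for k - 1 degrees of freedom (from a t table)
    """
    k = len(runs)
    if k < 2:
        raise ValueError("need at least two replications")
    if any(len(run) <= d for run in runs):
        raise ValueError("d must leave at least one observation per run")
    rep_means = [mean(run[d:]) for run in runs]   # one mean per replication
    point = mean(rep_means)                       # overall point estimate
    std_err = stdev(rep_means) / math.sqrt(k)     # S / sqrt(k), S uses k - 1
    half_width = t_crit * std_err
    return point, std_err, (point - half_width, point + half_width)


# Hypothetical data: k = 4 runs of n = 6 observations, first d = 2 deleted.
runs = [
    [0.0, 3.0, 5.0, 6.0, 4.0, 5.0],
    [0.0, 2.0, 6.0, 5.0, 6.0, 7.0],
    [0.0, 4.0, 4.0, 5.0, 5.0, 6.0],
    [0.0, 3.0, 7.0, 6.0, 5.0, 6.0],
]
point, std_err, (lower, upper) = replication_estimate(runs, d=2, t_crit=3.182)
print(f"estimate {point:.3f}, standard error {std_err:.3f}, "
      f"95% interval [{lower:.3f}, {upper:.3f}]")
```

Only the k replication means enter the variance estimate, so the autocorrelation inside each run does not invalidate the interval.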

3.4 Batch Means

To reduce the initial bias, the independent replications method deletes data from the beginning of every run, which can add up to a substantial amount of useful data being wasted. The batch means method tries to make use of all the data collected after a single initial transient period. In the batch means method, one long run is conducted. The data from the long run is divided into several segments, or batches, of equal size. The means of these batches are calculated and used to compute the point and interval estimates.

The batch means give reliable results only if they are approximately uncorrelated. However, the means of contiguous batches are usually autocorrelated, and much research has been devoted to this problem. With a proper choice of batch size and a sufficiently long simulation run, the batch means become approximately uncorrelated.

The bias caused by the initial conditions still exists. Data deletion can be applied here as well, but because only one run is made, the amount of deleted data is reduced substantially and the waste of useful data is kept to a minimum.

The following tips summarize the proper approach to batch means analysis.

First, for a fixed sample size, a plot of the batch means is a very useful tool for checking the effects of initial conditions, non-normality of the batch means, and correlation between batch means. Second, for a long run, a fixed number of batches can give better results. In the fixed-number-of-batches method, the batch size bn grows with the sample size n: bn = n / k, where k is the number of batches. Third, for sample size n, choosing both the batch size and the number of batches to be the square root of the sample size, i.e.

$$b_n = k = \sqrt{n}$$

gives a better chance of obtaining uncorrelated batch means. Fishman and Yarberry [Fishman and Yarberry, 1997] designed and implemented the ABATCH and LBATCH algorithms, which are believed to be a very effective way of performing simulation output data analysis by the batch means method.
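The batch means calculation with the square-root rule can be sketched in Python as below. This is a simple illustration, not the ABATCH or LBATCH algorithm. The function names, the AR(1) test process used as stand-in simulation output, and the hard-coded t value are assumptions made for the example.

```python
import math
import random
from statistics import mean, stdev


def batch_means(ys, k):
    """Means of k contiguous, equal-size batches (leftover tail values are dropped)."""
    b = len(ys) // k                      # batch size n / k, rounded down
    if k < 2 or b < 1:
        raise ValueError("need at least two non-empty batches")
    return [mean(ys[i * b:(i + 1) * b]) for i in range(k)]


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation, a rough check on correlation between batch means."""
    m = mean(xs)
    denom = sum((x - m) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(len(xs) - 1))
    return num / denom


# Hypothetical autocorrelated output: an AR(1) process around a mean of 5.
rng = random.Random(42)
ys, x = [], 0.0
for _ in range(400):
    x = 0.8 * x + rng.gauss(0.0, 1.0)
    ys.append(5.0 + x)

k = math.isqrt(len(ys))                   # square-root rule: k = batch size = 20
means = batch_means(ys, k)
point = mean(means)
half_width = 2.093 * stdev(means) / math.sqrt(k)   # 2.093 = t_{0.025, 19}
print(f"estimate {point:.3f} +/- {half_width:.3f}")
print(f"lag-1 autocorrelation of raw data:    {lag1_autocorrelation(ys):.3f}")
print(f"lag-1 autocorrelation of batch means: {lag1_autocorrelation(means):.3f}")
```

Comparing the two autocorrelation values gives a quick indication of whether the chosen batch size is large enough. A lag-1 value for the batch means that is still far from zero suggests a longer run or larger batches.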

3.5 Sequential Estimation

The quality of an estimate is usually measured by its precision or relative precision. The goal is to estimate the mean μ within a tolerance d, where d is specified by the user. More formally, one can make k runs

where 1 − α is the confidence level.

One sequential procedure is to run one replication at a time and stop at run k* such that

(12)

For relative precision:

where

Stopping rules 12 and 14 are used to stop the simulation once the user-specified precision or relative precision is satisfied [Alexopoulos and Seila, 1998].
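The idea of a sequential procedure can be sketched in Python. The sketch does not reproduce the stopping rules of [Alexopoulos and Seila, 1998]. It assumes a commonly used form instead: stop as soon as the t-based confidence-interval half-width is at most d (absolute precision), or at most a fraction `rel` of the absolute value of the current estimate (relative precision). The function names, the `k_min` and `k_max` safeguards, and the fixed 95% confidence level are illustrative assumptions.

```python
import math
import random
from statistics import mean, stdev

# Two-sided 95% t critical values t_{0.025, f} for f = 1..30 (standard table).
T_975 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
         2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
         2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042]


def t_crit_95(f):
    """t_{0.025, f}; above 30 degrees of freedom the normal value 1.96 is used
    as an approximation."""
    return T_975[f - 1] if f <= 30 else 1.96


def sequential_replications(run_once, d=None, rel=None, k_min=5, k_max=1000):
    """Add one replication at a time until the 95% half-width is small enough.

    run_once : function returning the mean of one independent replication
    d        : absolute tolerance on the half-width
    rel      : relative tolerance (half-width divided by |estimate|)
    Exactly one of d and rel must be given. Returns (estimate, half_width, k).
    """
    if (d is None) == (rel is None):
        raise ValueError("give exactly one of d and rel")
    if k_min < 2:
        raise ValueError("k_min must be at least 2")
    means = [run_once() for _ in range(k_min)]
    while True:
        k = len(means)
        estimate = mean(means)
        half_width = t_crit_95(k - 1) * stdev(means) / math.sqrt(k)
        target = d if d is not None else rel * abs(estimate)
        if half_width <= target or k >= k_max:   # k_max guards against endless runs
            return estimate, half_width, k
        means.append(run_once())


# Hypothetical replication: each run returns a noisy value around 10.
rng = random.Random(1)
estimate, half_width, k = sequential_replications(
    lambda: rng.gauss(10.0, 2.0), d=0.5)
print(f"stopped after {k} runs: {estimate:.3f} +/- {half_width:.3f}")
```

Because the stopping time depends on the data, the actual coverage of the final interval can fall somewhat below the nominal level, which is one reason to require a minimum number of replications before the rule is checked.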

In summary, to make good use of simulation output data, proper statistical analysis methods must be used to capture the main properties of the simulated system. Independent replications and batch means are two proven, effective methods. To reduce initial bias, reasonable deletion of initial data is very helpful. For automated simulation output data analysis, suitable stopping rules are necessary. Relative precision is a widely accepted criterion for measuring the quality of the estimates.