Computing Global Sensitivity Measures from Given Samples

Elmar Plischke

Institute of Disposal Research

Clausthal University of Technology

This is handcrafted (X)HTML+MathML. You have been warned if your browser chokes on it.

If you're using IE+MathType, you might want to look at this page with xht suffix instead.

Global Sensitivity Measures

We are investigating computationally effective methods for Sensitivity Analysis. (Saltelli et al., 2006) defines this as "Sensitivity analysis is the study of how the variation on the output of a model (numerical or otherwise) can be apportioned, qualitatively or quantitatively, to different sources of variation, and of how the given model depends upon the information fed into it."

Methods for the Sensitivity Analysis of Model Outputs

This is the current favourable choice for computing an importance measure.

This approach may be computationally expensive as there are model-evaluations in the loop!

Sensitivity Analysis in Disposal Research

In Disposal Research, there is the concept of a Safety Case which might be described as a systematic collection of arguments demonstrating the safety of a repository for hazardous or radioactive waste in the deep underground.

For this demonstration, model calculations predicting the risk are needed. But We need a sensitivity analysis to support the safety case to show the robustness of our proposed strategy against uncertainties.

General approach

A deterministic model of nuclide transport is embedded into a probabilistic framework to study the variability in the inputs and outputs.

Deterministic model Probabilistic framework
This approach can be found in many applied disciplines where simulation models are used and their outcomes have to be judged which forms part of a larger decision process.

Condensing Scatterplots into Numbers

Given the input sources and the outputs (model runs or measurements) we consider the problem of measuring the "degree of information" available in a scatterplot of the data for one input parameter (or more) plotted against the output data. As a goal, we want to use one and the same sample for computing sensitivity measures. This sample may already be available from an uncertainty analysis, requiring only minimal assumptions about the sample design. This breaks with the tradition of a sensitivity analysis with a simulation model in the loop.

A sensitivity analysis is a reduction of a functional or statistical dependency to a number: It is always a question which kind of information is lost during this analysis step. In most cases, our visual intuition is able to judge this question.

Example: The Ishigami test function

This test function is widely used in sensitivity analysis. It is given by Y=sin X1 + 7 sin2 X2 + 0.1X34 sin X1 where the parameters Xi U(-π,π) are independently uniformly distributed. We use it with a fourth factor which is a dummy.



Scatterplots for the Ishigami function (including a dummy parameter)

In the following we will attribute different sensitivity indicators to these scatterplots.

Linear Regression

A widely used measure of association is given by Pearson's product-moment correlation coefficient ρ2= i xi -x yi-y 2 i xi - x 2 yi - y 2 where
xiyi i=1,,n pairs of data
x mean of xs
y mean of ys

The correlation coefficient ρ2 is the product of the linear regression coefficients (gradients) β1 xy and β1 yx . In the example, the linear trend is only a suitable measure for the first factor. Factor 2 and 3 show nonrandom influences on the output that are not captured by the zero indicator value.

Parameter 1 2 3 4
ρ2 0.1936 0.0000 0.0026 0.0026

Today, linear regression is often used as a sensitivity indicator. However, as (Pearson, 1912) notes

"Nothing can be learnt of association by assuming linearity in a case with a regression line (plane, etc.) like A, much in a case like B."

Measures of dependency stronger than linear correlation

We can strengthen this approach to derive sensitivity indicators by using We discuss variance and density sensitivity methods.

Correlation Ratios

Knowing of the deficiencies of linear regression based indicators, (Pearson, 1905) introduced the Correlation Ratio (CR). Given data pairs xiyi i=1,,n of input and output values, the CR is obtained by the following steps. In particular, let use define yr = 1 nr xj Cr yj with nr = xj Cr 1.

Then the CR is given by η2= r=1s nr yr - y 2 i yi - y 2 .

The Correlation Ratio is using a regression model which consists of step functions on a given class partition.

Plotting the class means

As a visual feedback, we plot the class means into the scatterplot. This is an approximation of the non-parametric regression curve.



Parameter 1 2 3 4
η2 0.3191 0.4138 0.0188 0.0148

First Order Effect

(Kolmogorov, 1933) identifies the Correlation Ratio as an estimate of the quotient between the variance of the conditional expectation of the output given the input and the unconditional variance of the output, η2= 𝕍[ 𝔼 [ Y|X] ] 𝕍[Y] where X is the input random variable and Y is the output random variable.

This quantity is called first order effect, main effect, or Sobol' index .

It is an example of a variance-based global sensitivity measure.

Estimators for the First Order Effect

If φ(x) is an estimate of the regression curve 𝔼 [ Y|X=x] (local mean, "backbone" of the scatterplot) then the first order effect can be estimated by η2= j=1n φ(xj) - y 2 j=1n yj - y 2 .

Hence this quantity measures the goodness-of-fit for the (non-linear) model prediction φ.

How to choose φ?

For smoothing the scatterplots we are free to choose a suitable regression model from the toolbox for non-linear/non-parametric regression: Clearly, a linear regression model would yield ρ2 as an estimator of η2.

Example - Moving Averages

Let us try a moving window to compute a local mean for φ.


Parameter 1 2 3 4
η2 0.3499 0.4564 0.0616 0.0420
The window for the moving average is too small: the regression curve is jiggling.

Example - Orthogonal Polynomials

Instead of fitting a linear dependency, we can fit a polynomial model. If we use a orthogonal family of polynomials then the computation of the conditional variances is simplified by Parseval's theorem (which is a Pythagoras theorem for functions): We just have to form the sum of squares of the computed/selected coefficients.


Parameter 1 2 3 4
η2 0.3107 0.4500 0.0280 0.0070

Orthogonal Polynomials: Higher order effects

This polynomial regression model can also contain interacting terms. With their help, higher order sensitivity indices can be computed which are based on the following regression model.

Y=φi(Xi) +φj(Xj) +φi,j (Xi,Xj) +ε

The combination of x1 and x3 contributes 0.2442 of output variance.

Fourier Amplitude Sensitivity Test (FAST)

The state of the art of computing first order, higher order and/or total effects consists of using one of All these methods need special sampling schemes or repeated model evaluations.

(Cukier et al., 1973) introduced the FAST for Sensitivity Analysis.

Each input factor i is associated with a certain frequency ωi in the specially design input sample. All resonances to this frequency in the corresponding output sample are then attributed to the first order effect of this factor.

FAST with sample size 8192



The Ishigami function analysed with FAST: Here factor 1 is associated with frequency ω1=1, factor 2 with ω2=9, and factor 3 with ω3= 92=81. The middle plot shows the output response and the lower plot its frequency spectrum. Blue frequencies contribute to first order effects, green to second order effects and red to third order effects. For the first order effects of factor 1 we find two active coefficients, the first one with approx. 20% being the only one identified by linear regression. An additional coefficient of approx. 10% would pass the linear regression unnoticed. For the second factor, the contributions are only visible in the fourth higher harmonic, hence a linear regression predicts a zero influence. Factor 3 shows no first order effects, but second order effects with factor 1.

First and selected second order effects from FAST

Parameter 1 2 3 4 1,3
η2 0.3076 0.4428 0.0000 0.0000 0.2435

The sample constructed for FAST is not reusable. As the sum is almost 100%, all of the output variance accounted for. Some noteworthy points from FAST:

Computationally efficient FAST-inspired Methods

How to combine the pro's from FAST into a computationally efficient method?

An example for such a FAST-inspired method could be based on the following decisions:

Artificial timescale Sort the output data using the input data of interest as key
Suitable orthogonal transformation bisect the data using Haar wavelet decomposition
Regression curve First few coefficients of the transformed, reordered output values

We don't need a back-transformation because of orthonormality of the transformation.

The Haar wavelet decomposition transforms an adjacent data pair into midpoint and (signed) radius and repeats this recursively for the midpoints.

MATLAB: Haar wavelets

With only a few lines of MatLab we can implement the program sketched above.
M = 5; 				% Higher harmonics
[xr,index] = sort(x); yr = y(index);
allcoeff = wavetrafo(yr);
Vi = sum(allcoeff(1+1:(2^M),:).^2);
V  = sum(allcoeff(2:end,:).^2);
etai = Vi./V;

function y=wavetrafo(x)
%WAVETRAFO Haar wavelet transformation.
[n,k]=size(x);
 if(n==1), y=x;  else
   sums = x(1:2:end,:)+x(2:2:end,:); 
   diffs= x(1:2:end,:)-x(2:2:end,:);
   y=[ wavetrafo(sums); diffs]/sqrt(2);
 end

Example: Haar Wavelet on the Ishigami test data



Parameter 1 2 3 4
η2 0.3191 0.4138 0.0188 0.0148


These are exactly the same results as for CR. So was all of this brain strain for nothing? No, using the Haar wavelet we can refine the partition by including further wavelet coefficients.

The regression curves are still step functions: Do we find a better orthogonal transformation? Yes: The discrete cosine transformation (DCT) which is used for image compression in JPEG uses continuous base functions.

For example, the coefficient associated with the first base function, namely half a cosine wave (green line), almost provides the same information as a linear regression.

Example DCT



Parameter 1 2 3 4
η2 0.3121 0.4333 0.0162 0.0079

The DCT is a special case of the DFT with certain symmetry properties. We can also directly apply the Fourier transformation by sorting and shuffling the input values (Plischke, 2010): By rearranging the order of realisations in the sample a triangular shape can be constructed. Again, using this sorting procedure an artificial timeframe is imposed on the sample. This method is termed EASI.

The rearranged input will show a triangular peak ("Mount Toblerone") so that resonances in the accordingly rearranged output can be detected like in FAST and then attributed to the input factor.

Example EASI

We obtain two regression curves, one from even indices and one from odd indices which gives us a visual feedback on the quality of the fit and/or the data



Parameter 1 2 3 4
η2 0.3172 0.4371 0.0218 0.0149

Second order effects (and higher if you dare)

Given data triplets { xi1 xi2 yi } (as realisations of some RVs), we want to compute the joint influence of X1 and X2 on Y

The Method in a nutshell

Plow Track Search Curve: Code the xi1 xi2 position by the length of the search curve going back and forth through a checkerboard partition of the data.

This curve has a detectable frequency behaviour per dimension, it can easily extended to higher dimensions.

Ping-Pong Search Curve: This is an alternative search curve design with large freedom of choosing the frequencies. Checkerboard sides have to be incommensurable to reach all boxes.

The quality of this length-assignment can be improved when using information on the exact position of the realisation within the box.

Examples of search curve approaches

Rearranging the given realisations "like pearls on a string" lets us perform a FAST analysis. Note that in contrast to the above analysis, only a subset of the parameters is considered and the assignment to different frequencies is not fixed.



Parameter 1 3 1,3
η2 0.2864 0.0163 0.3154


Parameter 1 3 1,3
η2 0.2424 0.0161 0.2327

Collecting the example results

The number of digits reported in the following table is a total overkill, as the Monte Carlo error is approx. 1 n 3%
1 2 3 4 1,3
ρ2 0.19 0 0 0
Linear regression 0.1936 0.0000 0.0026 0.0026
η2 0.3139 0.4424 0 0 0.2437
Correlation ratios 0.3191 0.4138 0.0188 0.0148
Moving average 0.3499 0.4564 0.0616 0.0420
HDMR 0.3107 0.4500 0.0280 0.0070 0.2442
FAST 0.3076 0.4428 0.0000 0.0000 0.2435
Haar wavelet 0.3191 0.4138 0.0188 0.0148
DCT 0.3121 0.4333 0.0162 0.0079
Fourier 0.31720.43710.02180.0149
Plow-track 0.28640.01630.3154
Ping-pong0.24240.01610.2327

Note that the FAST method in this table uses a specially designed sample, while the others always consider the same sample.

Open problems

The presented methods are computationally effective, but some questions remain open:

Estimating first order effects for first factor of Ishigami, 25 runs per sample size.

Density-based Sensitivity Measures

The variance of the conditional expectation is not a good indicator for multi-modal or highly skewed conditional probability distributions.

Instead of comparing local means with the global mean, a moment-independent measure should compare conditional densities with the unconditional one.

(Borgonovo, 2007) proposed to use the mean distance between the conditional densities and the unconditional density

δ(X,Y)= 12 fX(x) fY(y)- fY|X( y|x)dy dx = 12 fX(x) fY(y) - fX,Y( x,y) dy dx

The δ-importance measure is symmetric in X and Y. It is a L1-distance of the product of the marginal pdfs and the joint pdf.

An estimator for δ

An efficient way of computing δ was until now out of reach.

Comparison with variance based sensitivity measure shows that instead of the conditional mean 𝔼 [ Y|X=x] the conditional density fY|X( y|x) is needed for the estimation of density-based methods.

This requires to switch from pointwise distances to functional distances where these functions have to be estimated from the given data.

We apply the same idea as for correlation ratios:

Replace the pointwise conditional density by a partition based fY( y|Cr) , for suitable classes Cr , r=1,, s by using kernel density estimates from the class data yi xi Cr

Example: Ishigami function, parameter 2



Needs for the computation of δ

We discuss the highlighted items in the following.

Kernel density estimators

With a given kernel K we define

f^Y (y)= 1n i=1n 1α K y-yiα

f^Y|r (y)= 1n xi Cr 1αr K y-yi αr


The parameters α, α are called bandwidths or smoothing parameters

Suitable Kernels K(x) with mean 0 and variance 1 are:
Normal 1 2π exp -x22
Triangle 16 max 6 -|x| 0
Epanechnikov 320 5 max 5 -x2 0
Biweight 15784 7 max 7- x2 2 0
Uniform { 123, |x | <3 0, |x | 3

However, when performing numerical experiments one notices that it is not the shape which is important, but the bandwidth selection for which there is a vast body of literature.

Kernel bandwidth

Results on optimal bandwidths are mostly available under the assumption of a Gaussian distribution. For this, it is advantageous that δ is invariant under monotonic transformations. Hence we can reshape the unconditional pdf to be Gaussian by first passing from the original output data to the empirical cdf and then applying the inverse normal cdf to the empirical cdf of the data. The optimal bandwidth derived for the unconditional pdf is then also applied to compute the kernel density of the conditionals.

Difference of two densities

A theorem of [Scheffé 1947] shows that for all densities f and g, we have f-g=2maxB B f-g . Choosing B= f>g shows that there is as much positive mass as there is negative mass in f-g.

This theorem links the δ measure to the Kolmogorov-Smirnov distance (B restricted to half-rays) and the discrepancy measure (B restricted to balls).

For kernel density estimates, if f^Y (y) and f^Y|r (y) are close to each other then we expect many zeros crossings. Hence it is difficult to apply advanced quadrature to the class separation

S^r= f^Y (y) - f^Y|r (y) dy

Hence we use the trapezoidal quadrature rule for numerical integration.

When are the densities the same?

The asymptotic Two-Sample Kolmogorov-Smirnov Test helps us in deciding whether two densities are sufficiently close to each other to yield a zero contribution: If F^Y (y) and f^Y|r (y) denote the empirical cumulative distributions then the contribution to δ is insignificant if

maxy F^Y (y) - F^Y|r (y) 1n+ 1nr Kα

where Kα is the (upper) α-quantile of the Kolmogorov distribution.

Typical values for Kα are K0.90=1.22, K0.95=1.36, K0.99=1.63, especially, Kα2 for all α.

The test is good in eliminating numerical dirt. Moreover, the test statistics can be replaced with 0.5 S^r using the above-mentioned theorem.

Ishigami Example, 8192 Quasi-MC samples

The following plots show the unconditional density of the output (fat) compared to the densities conditioned on the inputs (low x values: blue, high x values: red). The bottom plot shows the separation between them.



Generally, conditioning alters the density. Conditioning on factor 1 gives a shift in the density, factor 2 alters the shape completely, factor 3 does not lead to a change in the conditional mean, but only to the variance - a fact not detectable by variance-based first order sensitivity indices. Factor 4 as a dummy only changes the shape slightly. This slight changes are effectively eliminated by the Kolmogorov-Smirnov test (Its cutoff value is marked with dots in the lower plot). However, also some parts of factor 3 are effected by this cut-off.

Parameter 1 2 3 4
δ 0.2037 0.3918 0.1392 0
η2 0.3132 0.4365 0.0000 0.0000

Example: Letter B

The data for the letter B was obtained by a rejection method. There are no linear and no functional dependencies in the data, however, the plot definitely shows a structure.

δ(X,Y)=0.1729

δ(Y,X)=0.1887

Ideas, Problems, Open Issues

Conclusions

Computationally effective methods for the estimation of global sensitivity indicators

Search curves in action

References

(Saltelli et al., 2006) Sensitivity Analysis. Wiley&Sons, Chichester

(Pearson, 1905) On the General Theory of Skew Correlation and the Non-Linear Regression. Dulau&Co, London

(Pearson, 1912) On the General Theory of the Influence of Selection on Correlation and Variation. Biometrika 8, 437-443

(Kolmogorov, 1933) Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin

(Plischke, 2010) An effective algorithm for computing global sensitivity indices (EASI). Reliability Engineering&System Safety 95, 354-360

(Cukier et al., 1973) Study of the sensitivity of coupled reaction systems to uncertainties in rate coefficients. J. Chem. Phys. 59, 3873-3878 3879-3888

(Borgonovo, 2007) A new uncertainty importance measure. Reliability Engineering&System Safety 92, 771-784

(Rabitz&Alis, 1999) General foundations of high-dimensional model representations. J. Math. Chem. 25, 197-233

(Doksum&Samarov, 1995) Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. The Annals of Statistics 23, 1443-1473