Verification and Validation
Go to the ANGEL web site and complete Reading Assignment 1. Come prepared to discuss what you've read.
Finish HW3 and start HW4.
The title of this lecture may seem redundant to you. However, in the field of numerical simulation, each of these words has a very distinct definition. Although there isn't universal agreement on the details of these definitions, by checking other references you'll see that there is fairly broad agreement on their usage. The simplest definitions that I've seen are given in Roache's book on Verification and Validation (V&V). To paraphrase him, Verification demonstrates that you are solving the equations right. Validation demonstrates that you are solving the right equations.
If at any point you use computer simulations to understand the behavior of a physical system, you need to give some serious thought to V&V. You will need to have confidence that the computer code you use was properly verified, and its physical models validated for the range of conditions that are important to you. You will also need to verify the input model that you provide to the computer code, and do whatever is possible to validate the results for your particular system simulation. To prepare for these tasks, I recommend that you read Roache's book and Sandia Lab reports by William Oberkampf and his colleagues. Oberkampf's SANDIA reports are free via the DOE Information Bridge. A condensed form of Roache's book on V&V is available as Chapters 18 and 19 of his book "Fundamentals of Computational Fluid Dynamics" in the PSU Engineering Library. The full book can be obtained from Hermosa Publishers and is worth the price. A discussion with recent references can also be found in "Best Practice Guidelines for the use of CFD in Nuclear Reactor Safety Applications," by Mahaffy et al.
Verification must precede Validation. Don't expect reasonable conclusions when comparing to experimental results if you've got serious errors due to:
1. Spatial or temporal discretization that is too coarse (mesh or time step sizes that are too large);
2. Selection of convergence criteria for iterative equation solution that are too loose;
3. Programming errors in the computer code;
4. Errors in the specification of input for the simulation of the experiment;
5. Errors in understanding of code output.
The presence of such errors is a particular problem if a finite volume, finite difference, or finite element code is used in the process of determining values for coefficients (a.k.a. fudge factors) in engineering correlations used to model specific physical processes (e.g. heat transfer coefficients, turbulence models, …). This process is sometimes called “tuning” or “calibration”. I have seen a number of instances where input models were tuned to operate well on a “reasonable” spatial discretization, but later when good mesh convergence studies were done, comparisons to experimental data became unacceptable. In such cases, the code developers have canceled discretization errors with errors in one or more physical models. I have also seen cases where engineering correlations were adapted without change from the literature and produced poor matches to data on a “standard” spatial discretization, but performed very well on an adequately refined mesh.
When considering Validation, an important fine point is that you validate a simulation code over a specific (and limited) set of conditions. For each new result that you produce, you should document the fact that the physical conditions fall within the range of conditions already validated. If they don’t, you need to qualify your presentation of results clearly noting where you are extrapolating beyond the existing validated region. Ideally when the bounds of existing validation are exceeded, additional experiments are analyzed (and if necessary designed and performed).
People have been using the predecessors and relatives of TRACE for thirty years to successfully analyze experiments and perform licensing analysis for nuclear power plants. Why, at this stage of the game, do I put such a strong emphasis on V&V? Much of the answer is in Mahaffy's first law of human nature:
Nobody’s perfect, and most people drastically underestimate their distance from that state.
Some errors persist in simulation codes and code
input models for decades, and physical models implemented for one class
of reactor designs may not cover the range of conditions present in a
new design, or a newly imagined accident scenario for an old
design. Any new or modified input model of a power plant or
experiment is subject to errors that must be removed or quantified.
Another fundamental human
trait feeding the need for V&V
is that we see what we want to see (Mahaffy's Fourth Law of Human Nature). This is not always a bad trait; when it's working for people, we refer to them as visionaries. However, when it's not working, people end up in metaphorical dead-end alleys, or their victims simply end up dead. Humanity has come up with various ways of dealing with this behavior. I view Science as the collection of procedures and knowledge developed over millennia to overcome this trait. It permits us to see what is really there rather than what we want to see. To make progress you need a balance of seeing what you want to see, and checking what you thought you saw with good scientific practice.
At its heart V&V is just good science. However, neither Verification nor Validation are simple processes. This lecture is just meant to give you a brief introduction, and hopefully keep you from getting into too much trouble.
As with Verification and Validation, specific meanings are attached to terms used in the discussion of error. Generally the word “error” is only used to describe a source of inaccuracy that can in principle be corrected or limited to any desired level. The five items listed above fall into that category. These errors are frequently subdivided into recognized errors (hopefully just 1 and 2 in the above list) and unrecognized errors (items 3-5 above).
Inaccuracies in physical model implementations are normally due to lack of knowledge of underlying processes, the fundamentally stochastic nature of those processes (e.g. turbulence), or low precision experimental measurements for key quantities. In this instance the term "uncertainty" rather than "error" is applied in discussions. If you've ever looked at the data underlying various engineering correlations, you can appreciate this problem. However, even state quantities such as conductivity have an experimental basis and associated experimental uncertainty. At some point it is important to determine the sensitivity of key outputs from a simulation to these uncertainties.
I will focus the remainder of this section on what can be done about the five sources of error listed above, and those resulting from a specific choice of a physical model. My discussion relies on over 30 years of personal experience, and insights derived from Roache’s V&V book and numerous SANDIA National Laboratory Reports by Oberkampf and his colleagues.
Mesh and time step sensitivity studies lead to an estimate of error associated with discretization in space and time, and are also important in procedures used to detect software errors. Roache and Oberkampf have good discussions of this error analysis based upon Richardson Extrapolation.
Most rigorous studies to quantify error associated with the selected mesh or time step sizes are based on Richardson Extrapolation. This started as a means of improving the accuracy of numerical solutions to differential equations, but it can also be used as a basis for estimating errors associated with the selection of mesh and time step size. Without understanding these errors, speculation on the quality of various physical models associated with a reactor safety code is on shaky ground. If your mesh is reasonably fine, and you know the order of accuracy of your method, you can use Richardson Extrapolation and results from two different grids (or two different time steps) to say something about the error. If you don't know the method's accuracy, or don't have confidence that your mesh is fine enough, you need a study with at least three different spatial divisions (or time step sizes).
The analysis basically boils down to fitting a curve to a sequence of results and extrapolating beyond those results to estimate the limiting answer with zero mesh length or time step. Consider a sequence of three mesh lengths or time step sizes (from smallest to largest) h1, h2, and h3. Normally the sequence is generated with a constant refinement ratio:
r = h2/h1 = h3/h2.
Let f1, f2, and f3 be the computed results at the same point in space and time for the three corresponding values of h. Taking a clue from truncation error analysis, we look for an expression for f as a function of h in the form:

f(h) = f0 + C h^p,

where f0 is the limiting result at zero mesh length (or time step), C is a constant, and p is the order of accuracy. Writing this expression for each of the three results and subtracting the equations in pairs gives

f2 - f1 = C h1^p (r^p - 1),
f3 - f2 = C h1^p r^p (r^p - 1),

so the ratio of the two differences yields the observed order of accuracy:

p = ln[(f3 - f2)/(f2 - f1)] / ln(r).

Given a value of p, equations for the two finest meshes can be solved for the remaining unknowns:

f0 = f1 - (f2 - f1)/(r^p - 1), C = (f2 - f1)/[h1^p (r^p - 1)].

As a result the error on the finest mesh can be estimated as:

E1 = f1 - f0 = (f2 - f1)/(r^p - 1).

Note that if you have faith in the value of p (say, the formal order of accuracy of your method), results from only two mesh or time step sizes are needed to form this error estimate.
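The algebra above is easy to automate. Below is a small Python sketch (function and variable names are my own) that returns the observed order p, the extrapolated zero-h result, and the finest-mesh error estimate, assuming smooth monotone convergence and a constant refinement ratio r:

```python
import math

# Richardson extrapolation from three results f1, f2, f3 computed on meshes
# h1 < h2 < h3 with constant refinement ratio r = h2/h1 = h3/h2.
def richardson(f1, f2, f3, r):
    # observed order of accuracy from the ratio of successive differences
    p = math.log((f3 - f2) / (f2 - f1)) / math.log(r)
    err1 = (f2 - f1) / (r**p - 1.0)    # error estimate on the finest mesh
    f0 = f1 - err1                     # extrapolated zero-h result
    return p, f0, err1

# Example: results behaving like f(h) = 2 + 3*h**2 on h = 0.1, 0.2, 0.4
p, f0, err1 = richardson(2.03, 2.12, 2.48, 2.0)
```

For the sample inputs, which behave like f(h) = 2 + 3h^2, the sketch recovers an observed order of 2 and an extrapolated value of 2.0.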
You do not attempt Validation until you have demonstrated the level of error associated with your time and space discretization.
Check to see that the convergence criteria are reasonable. For example, for an iterative method with a slow convergence rate, look directly at the equation residuals rather than the change in independent variables. The simplest study of sensitivity to iteration convergence is to drop all convergence criteria by an order of magnitude and measure changes in key simulation state variables. If a key state variable changes significantly, the original criteria were too loose.
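As a small illustration of why this matters, consider a fixed-point iteration that contracts very slowly; the map g below is purely illustrative. The change between successive iterates can satisfy a loose tolerance while the iterate is still far from the answer:

```python
# A slowly contracting fixed-point iteration x_{k+1} = g(x_k): the change
# between successive iterates can be tiny while the iterate is still far
# from the true fixed point.
def iterate(g, x0, tol, max_it=100000):
    x = x0
    for _ in range(max_it):
        x_new = g(x)
        if abs(x_new - x) < tol:   # step-change stopping criterion
            return x_new
        x = x_new
    return x

g = lambda x: 0.999 * x + 0.001    # fixed point x* = 1, contraction rate 0.999
x_loose = iterate(g, 0.0, 1e-4)    # stops well short of x* = 1
x_tight = iterate(g, 0.0, 1e-7)    # tightened by three orders of magnitude
```

With the loose tolerance the iteration stops with an error near 0.1; tightening the tolerance by three orders of magnitude reduces the error to roughly 1e-4.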
If you are a code developer, these are the things that give you nightmares. However, there are systematic things that you can do, using review, careful programming practice, and testing, to cut the number of errors and get a good night’s sleep.
Although the sections below are primarily oriented toward the actual simulation software (e.g. TRACE), the discussion of Quality Assurance is also relevant to limiting errors in input models.
The first thing to remember about programming errors is that they occur regardless of programming practices. Testing procedures must be in place to minimize the number of bugs that survive for any significant time. Quality Assurance (QA) procedures are one way to control the introduction of bugs and formalize a test procedure used to localize bugs. However, I don’t recommend rigorous adherence to international standards for QA programs. At some point the system becomes rigid enough that the best scientist/programmers leave to find a better work environment, and the project under QA is doomed to mediocrity at best.
The three components of QA are documentation, testing, and review. Written standards for these components should be established at the beginning of a project and accepted by all involved. Documentation of a new simulation capability usually begins with a simple statement of requirements for what must be modeled, what approximations are and are not acceptable, and the form of implementation. A complete written description of the underlying mathematical model provides a basis for Verification activities. A clear description of any experimental basis for the model aids Validation. Good validation testing compares against data beyond the set originally used to generate the model. A clear description of any uncertainties in the model can be valuable in later studies of sensitivity of results to model uncertainties.
Basic documentation should also include a clear written description of the model’s software implementation. This aids later review or modifications of the programming. Relatively little effort is normally expended here on the coding implementing the model itself. More time should be spent documenting flow of data, revisions to data structures, and definitions of important variables.
The final piece of the basic documentation is a test plan. Here a careful explanation is provided of a set of test problems that clearly exercise all new and revised programming. The tests should cleanly isolate individual features of the new capability, and demonstrate that the software correctly implements the underlying model. For revisions to physical models, relevant tests against experiments should also be specified.
This documentation should be generated in two drafts. The first precedes actual implementation of the software, and the second is issued as a final report including the final form implemented and the results of all proposed tests. It should be accompanied by two phases of independent review, the first focusing on the viability of the proposed approach, and the second focusing on the completeness of testing. My experience has been that even without review, generation of this documentation significantly cuts the number of programming errors introduced into the final product. The act of describing an implementation in words forces a careful review of the software. More importantly, a systematic written description of a test procedure ensures that very little can slip through the testing process.
Documentation must also exist at a more automated level via a source code configuration management procedure. This starts with a systematic record of all changes, dates of change, and individuals responsible for the changes. When under software control this level of code management lets you remove old updates from a program if they are found to be inappropriate, and maintain specialized versions of a base code. These capabilities have been used for a long time on large software projects. The current favorite configuration control tool is CVS, which is GNU open source software, and free. The project under which I do most of my research uses CVS at its heart, but extends capabilities via a web page that provides links to all accepted versions of the software, related documentation, test problems, and supporting scripts for version generation and execution of test problems.
The act of bringing a new simulation capability under configuration control (creating a new code version) should provide the most rigorous review for code errors. However, this is largely a function of the individual appointed to be the configuration control manager. Success of a software project often depends on the quality of the individual doing that job. He or she must have the breadth of technical experience to understand all documentation associated with updates. He or she should also be well versed in testing procedures and basic scientific method in order to judge the completeness of the test sets submitted with each update.
Any new problems submitted with an update should be included in a regression test suite. This is also a major line of defense against introduction of coding errors. In a complicated simulation code, it's much easier than you might think to introduce your own amazing improvement and unintentionally cripple another portion of the program. However, if that portion went through the documentation and testing procedure that I've described, its specific test problems were embedded into the regression test set. By running the regression test set for each new change, bugs affecting older capabilities are detected very quickly, and corrected before being accepted into the official program. The project where I do most of my software development started seven years ago with a regression test set of about 50 problems. It's now over 1300 problems, taking about 3 hours to run on a high end Intel based workstation. The rate of increase of computer speed and adaptation to use of parallel clusters will keep our testing productive and growing through the useful life of the software.
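In Python, the core of such a regression check can be sketched in a few lines; the output names and baseline values here are hypothetical:

```python
import math

# Compare key outputs of a test problem against stored baseline values.
def check_case(result, baseline, rel_tol=1e-10):
    return all(math.isclose(result[k], baseline[k], rel_tol=rel_tol)
               for k in baseline)

# Hypothetical stored baseline and two candidate runs
baseline = {"peak_T": 612.5, "outlet_flow": 3.75}
good_run = {"peak_T": 612.5, "outlet_flow": 3.75}
bad_run  = {"peak_T": 640.0, "outlet_flow": 3.75}   # a regression to catch
```

A real harness would loop this check over every archived test problem and flag any case whose key outputs drift from the accepted baseline.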
Roache and Oberkampf are both advocates of the method of manufactured solutions (MMS) as a way to verify coding. I've tried it and also consider it to be very valuable. The idea is fairly simple. Start with the basic PDE (or system of PDEs) in the mathematical model for your problem, for example a 1-D transient conduction problem:

ρc ∂T/∂t = k ∂²T/∂x² + q.

The next thing that you do is pick a solution T(x,t) that you like and run it through the differential operators. Take, for example,

T(x,t) = 300 + t (9 - x²),

so that ∂T/∂t = 9 - x² and ∂²T/∂x² = -2t. So all I've got to do is set

q(x,t) = ρc (9 - x²) + 2kt,

declare initial conditions T(x,0) = 300, boundary conditions T(3,t) = 300 and T(-3,t) = 300, and I've got a conduction problem that I can feed to my finite difference or finite volume code with a known answer. The particular function above is a nice choice for testing methods that are at least first order accurate in time and second order accurate in space. When such methods are functioning correctly, they will reproduce the solution to machine accuracy.
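A minimal pure-Python sketch of such a test, assuming a manufactured solution T(x,t) = 300 + t(9 - x²) and a simple explicit scheme (first order in time, second order central in space); grid sizes and material properties are arbitrary illustrative choices:

```python
# Explicit finite-difference solve of rho_c*dT/dt = k*d2T/dx2 + q on [-3, 3]
# with the manufactured solution T(x, t) = 300 + t*(9 - x**2), which requires
# q(x, t) = rho_c*(9 - x**2) + 2*k*t.  Returns the max error versus the
# manufactured solution at the final time.
def mms_conduction(nx=61, nt=200, t_end=0.5, k=1.0, rho_c=1.0):
    dx = 6.0 / (nx - 1)
    x = [-3.0 + i * dx for i in range(nx)]
    dt = t_end / nt
    alpha = k / rho_c
    assert alpha * dt / dx**2 <= 0.5, "explicit stability limit violated"
    T = [300.0] * nx                  # initial condition T(x, 0) = 300
    t = 0.0
    for _ in range(nt):
        Tn = T[:]
        for i in range(1, nx - 1):    # boundary values held at 300
            lap = (Tn[i + 1] - 2.0 * Tn[i] + Tn[i - 1]) / dx**2
            q = rho_c * (9.0 - x[i]**2) + 2.0 * k * t
            T[i] = Tn[i] + dt * (alpha * lap + q / rho_c)
        t += dt
    return max(abs(Ti - (300.0 + t * (9.0 - xi**2))) for Ti, xi in zip(T, x))
```

Because this manufactured solution is linear in time and quadratic in space, a correctly coded scheme of this accuracy reproduces it to round-off; a large returned error signals a coding bug.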
If you want to test more aggressively with non-zero derivatives at all orders, move away from polynomials. For a conduction problem, there are simple analytic solutions available. If I go for a solution with q = 0, then it's going to be a Fourier expansion. I can isolate one term in the expansion by judicious choice of the boundary conditions and initial conditions, for example

T(x,t) = 300 + A exp(-α (π/6)² t) cos(π x/6), with α = k/(ρc),

which gives T(3,t) = T(-3,t) = 300 and T(x,0) = 300 + A cos(π x/6). This particular choice could be considered as testing against an analytic solution. However, I have manipulated initial and boundary conditions to manufacture a solution to the conduction problem.
When checking against solutions like this one, it is important to perform mesh and time step convergence studies as described above. Subtle bugs may be hidden on a coarse mesh or with a time step that is too large.
The best way that I know to avoid coding bugs is through the practice of evolutionary programming. There is next to nothing completely new under the sun. Whether you know it or not, any simulation tool that you are likely to write will be an extension of something already in existence. If you can obtain source code that implements a large subset of your goal, I recommend that you either gradually change that software to meet your goals, or structure the creation of your new product so that it should in principle match results with the older program for some set of test problems.
If you start with new code, first check to make certain that you can reproduce the results of the older code to within machine round-off error. You should not expect to match results exactly, because your programming will probably implement expressions that are formally identical, but use a different ordering of arithmetic operations to get the result. This generally produces results that differ in the low order bits. To get a quick feel for the level of impact to be expected from this change in round-off error, compile either the old or the new program both with and without full optimization, and compare the results.
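A tiny Python illustration of how regrouping the same arithmetic changes the round-off; the values are chosen to exaggerate the effect:

```python
# Floating-point addition is not associative: regrouping the same three
# numbers changes the result (here, dramatically).
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c      # the large terms cancel first, so c survives
right = a + (b + c)     # c is lost in the round-off of b + c
```

In IEEE double precision `left` is 1.0 while `right` is 0.0; formally identical expressions in old and new codes can differ for exactly this reason, normally only in the low order bits.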
From this point either the new or adapted code approach follows the same path. Add new features in a way that separates three classes of changes: (1) changes that should reproduce the old results exactly (e.g. restructuring or renaming); (2) changes that should alter results only at the level of round-off error; and (3) changes to numerical methods or physical models that are expected to change the results.
The first change class is easy to test and debug, although you may need to suppress compiler optimization to see the exact match. The second is more difficult, in that you really do need to confirm that differences in results can be attributed to differences in round-off error. The third is where you apply techniques described above such as the Method of Manufactured Solutions for changes to numerical methods, and rigorous validation to check changes in physical models.
As you review discrepancies between results of new and old programs, remember that you may find bugs in the old program. You have no guarantee that it is perfect.
As a final suggestion for detection of programming errors, always create at least one version of your program that generates elements of the Jacobian matrix numerically for comparison against any analytic expressions for the same elements. When creating analytic derivatives use features available in Mathematica, Maple, or MacSyma to do the necessary symbolic differentiation and to automatically convert the results to Fortran or C expressions. Also consider the use of automatic differentiation tools such as ADIFOR (http://www-fp.mcs.anl.gov/autodiff/).
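A minimal Python sketch of the numerical-versus-analytic Jacobian comparison; the residual function f and point x0 are invented for illustration:

```python
# Forward-difference numerical Jacobian for comparison against hand-coded
# analytic derivatives.
def numerical_jacobian(f, x, eps=1e-6):
    fx = f(x)
    J = [[0.0] * len(x) for _ in fx]
    for j in range(len(x)):
        xp = list(x)
        xp[j] += eps                   # perturb one variable at a time
        fp = f(xp)
        for i in range(len(fx)):
            J[i][j] = (fp[i] - fx[i]) / eps
    return J

# Illustrative residual function and its analytic Jacobian at x0
f = lambda x: [x[0]**2 + x[1], x[0] * x[1]]
x0 = [2.0, 3.0]
J_analytic = [[2.0 * x0[0], 1.0], [x0[1], x0[0]]]
J_numeric = numerical_jacobian(f, x0)
```

With a forward difference the two Jacobians should agree to roughly the size of the perturbation; a disagreement much larger than that usually means a mistake in the analytic derivatives.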
Input model errors fall into two general categories. The first is due to entry of incorrect information into the input file. In my experience this is the largest source of correctable errors in a simulation. As a code developer, I have dealt with a huge number of code bug reports, at least 90% of which have turned out to be errors in the input file. Mostly these are typographic errors introduced as the input model is created. They are also frequently related to failure to read guidelines for creation of the input model. At times these errors are a secondary result of errors in documentation of the system being modeled or errors in documentation of the code's input file structure. Input to any simulation tool can in a broad sense be viewed as a programming language, and this type of error is fully analogous to a software error in the simulation code itself. The procedures above (except for MMS and numerical Jacobians) are applicable to minimizing or locating input errors. In particular, good QA for each input file and separately for the simulation code's input manual are crucial.
The second error category is uncertainty in system geometry and initial and boundary conditions. These are often difficult to quantify, but when the uncertainty bounds are known the impact can be assessed during the validation process.
In my experience this is a relatively small source of problems. Usually this is simply a case of assuming the wrong units for a value, or associating a value with the wrong point in space or time. The root cause is normally failure to read documentation or errors in documentation.
I don’t claim to be an expert on validation, and will keep my comments on the subject very brief. If you need to perform serious validation of results from a computer simulation, start by reading work by William Oberkampf and his associates, and by Patrick Roache. It’s also worth doing a search of the validation literature in the Nuclear Reactor Safety (NRS) community. Nuclear plant regulators (e.g. U. S. Nuclear Regulatory Commission) have taken validation very seriously for a very long time. One valuable lesson from NRS is that new experiments should be analyzed before experimental results are published. There are too many ways to use a modern simulation code as a curve fitting tool. Ideally release of most experimental results would be delayed for some period after the test, so that system configuration and measurements needed for boundary conditions to a calculation can be provided as accurately as possible without revealing other measurements of state information within the system.
Validation of a simulation must be performed at several levels of detail. Appropriate separate effects experiments must be modeled to capture the quality of local modeling capabilities (e.g. nucleate boiling, specific chemical reactions). Tests on individual components of a system should also be used in the validation process (e.g. pump, wing). Full system capabilities should be checked against scaled test facilities (e.g. scale model of a plane in a wind tunnel), and whenever possible data from the full scale system. As you start to build a list of all processes, components, and states of the full system that may need testing, you will realize that neither the budget nor the time exists to fully validate every modeling capability that might influence the results of your simulation. This is the point at which you gather a group of experts and construct a Phenomena Identification and Ranking Table (PIRT). In this process you take the long list and rank phenomena by importance (high, medium, or low) to the particular scenario being simulated and validated.
In addition to ranking the importance of phenomena, the PIRT panel of experts documents the adequacy of models within the simulation code for each phenomenon, the adequacy of the code verification and validation test sets, the adequacy of existing experimental data for validation, and needs for additional data. Metrics for quantitative judgments of adequacy of the simulation are also reviewed and, if necessary, alternate metrics suggested.
I have participated in one PIRT process and reviewed others. At first glance the process seemed to me to be too qualitative to be effective. However, the structure of the process does work very well and produces a validation plan that is very effective in the real world of deadlines and finite resources. One key to success with PIRT is that it is an iterative process. During the validation process, results must be reviewed with the understanding that later analysis of experimental results may increase the ranking of a phenomenon, and require revisions to recommendations.
Validation metrics are a subject of ongoing research and debate. Researchers try to come up with ways to measure the quality of predicted results. I believe that the key to success here is the use of the plural: metrics. Although I may come up with a single final metric, it will be a weighted average of other metrics that account for various aspects of the match to data. The predictive computer codes that I use, like those used by most practitioners of CFD and structural analysis, are deterministic. If I repeat a calculation with the same initial and boundary conditions, I get the same answer. I want to start with a metric that considers the match of a deterministic calculation using my best understanding of initial and boundary conditions to the data (maybe some variant of an L2 norm applied to the vector of data points). In the process, I will consider experimental error, but I need to distinguish between experimental error due to some systematic bias in the measurements and random experimental error. That part of experimental error resulting from bias needs to be identified, and used to quantify how seriously I take my initial metric. If I've got a set of error bars that bound systematic bias in an experiment, I would make certain that I understood the nature of the bias. If appropriate, I could draw one "experimental" curve through the top of the error bars, and a second through the bottom, then apply the same metric(s) to each that I used for comparison to the data as measured. Presentation of the best and worst of the three metrics would tell me something useful.
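A minimal Python sketch of this three-curve idea, using an RMS (L2-type) metric; the data, calculation, and bias bound are invented for illustration:

```python
import math

# RMS (L2-type) difference between a calculation and data points
def l2_metric(calc, data):
    return math.sqrt(sum((c - d)**2 for c, d in zip(calc, data)) / len(data))

data = [1.00, 1.20, 1.10, 0.90]          # measured values (invented)
calc = [1.02, 1.15, 1.12, 0.95]          # calculated values (invented)
bias = 0.05                              # hypothetical systematic-bias bound

m_mid = l2_metric(calc, data)
m_hi = l2_metric(calc, [d + bias for d in data])   # curve through bar tops
m_lo = l2_metric(calc, [d - bias for d in data])   # curve through bar bottoms
best, worst = min(m_mid, m_hi, m_lo), max(m_mid, m_hi, m_lo)
```

Reporting both `best` and `worst` conveys how much the conclusion depends on where the true (bias-free) data curve lies within the error bars.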
I could do something similar for random experimental error but with less faith in the meaning of the results. Take a look at the comparison of a computer simulation to experimental data below. The mean data values and calculation have different trends, which will be reflected to some degree in the construction of metrics just mentioned. However, if the error bars reflect truly random error, I’d conclude that I’ve potentially got a pretty good match. A metric involving percentage of the time that the calculation is outside the data error bounds might be appropriate.
Although the simulation code is deterministic, you need to account for the fact that the specification of the calculation is not. The next level of metrics accounts for uncertainty in the experiment's boundary conditions, initial conditions, and geometry. In some CFD applications you might be surprised at the impact of small deviations between reported and actual geometries. To the extent that these input uncertainties are available, a statistically based sensitivity study should be performed to put uncertainty bounds on calculated results. Metrics could be developed to say something about the overlap of the region within the experimental error bounds and the region within the calculation's uncertainty bounds.
There is another aspect of a calculation’s uncertainty that should be treated in addition to input. Physical models within the computer program contain parameters derived from experiments and subject to uncertainty. For a really important prediction like whether or not the core of a nuclear reactor will melt, I need to determine the impact of these internal uncertainties on prediction of critical state variables (such as maximum metal temperatures in a reactor core). In this instance I’m not trying to make a direct statement about the quality of a simulation code and result. I need to construct an envelope for highly probable states, and then determine if any unacceptable results fall within that envelope. A number of statistical methodologies have been developed to minimize the number of simulations that must be done to construct such an envelope. I recommend that you look at papers authored by D’Auria, de Crecy, and G. E. Wilson for some specific approaches and books by Gamerman and Cullen for general information. D’Auria has an unusual approach that has a great deal of merit when the number of uncertain model parameters is very large, or the level of uncertainty in these parameters can’t be adequately specified (very frequently the case).
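A crude Monte Carlo sketch of building such an envelope; the stand-in model and parameter ranges below are invented, and a real study would sample a full simulation code with a more careful sampling design:

```python
import random

# Stand-in for a full simulation: peak temperature as a function of an
# uncertain conductivity k and heat source q (purely illustrative).
def model(k, q):
    return 300.0 + q / k

random.seed(1)                           # reproducible sampling
samples = sorted(model(random.uniform(0.8, 1.2),     # k range (invented)
                       random.uniform(90.0, 110.0))  # q range (invented)
                 for _ in range(1000))
lo, hi = samples[24], samples[974]       # roughly a 95% two-sided envelope
```

The envelope [lo, hi] bounds highly probable outcomes given the sampled input uncertainties; the safety question is then whether any unacceptable result (e.g. a temperature limit) falls inside it.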