Model Verification - Guiding Principles

Updated: March 02, 2026

Stan Benjamin - 2015, updated Nov 2016, Feb 2017, 2023 (in intro), 2025 (a few small changes)

The original Google Doc format is available here

Introduction

Verification can be used to assess forecast accuracy and to compare different versions of forecast models.

Behind verification, there must be a hypothesis: a proposed innovation (a model component or a data assimilation variation) produces forecast accuracy equal to or better than that of a previous treatment (the control). The hypothesis should also state how the changed representation of physical processes (for a model change) is expected to modify verification scores.

This hypothesis may be supported or rejected using verification: do the verification results agree with what was hypothesized? Before using statistical verification to examine an experiment, a case study should also be conducted to examine the same hypothesis: do the spatial and temporal patterns of the experiment results support it?

Verification for models is usually best applied with well-designed controlled experiments with isolatable changes. Of course, experiments need to eventually be conducted for a combined set of many changes. But these combined-change experiments cannot give good scientific answers on cause-and-effect. There should be hypotheses given for combined-change experiments based on results already available from a set of more tightly controlled experiments regarding the individual components contributing to the combined change.

Results from a variety of different verification assessment measures (vs. different kinds of observations: e.g., upper-air, surface, radiation, precipitation) for a given pair of experiments must be intercompared to look for consistency to draw a conclusion. Use of a single measure without these intercomparisons can often lead to an incorrect assessment. Examples of needed verification combinations are shown below.

Key verification consistency principles

In general, an innovation to a model/assimilation system should provide some kind of improved forecast skill in some verification area (e.g., precipitation, 2m temperature, clouds/radiation) and do no harm in other areas (e.g., upper-air). This means that innovations need to be evaluated over different geographic areas, times of day, and seasons, and against different observing systems, and must pass this test: provide some benefit in at least some conditions and do no harm in other conditions. (This criterion is shared by NWS/NCEP and by other international NWP centers.)
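
The "some benefit, no harm" test can be sketched as a simple check over per-area score changes. This is only an illustration: the function name `passes_benefit_no_harm`, the 1% tolerance, and the sample numbers are all assumptions, and in practice a fixed tolerance would normally be replaced by a statistical significance test on each score difference.

```python
def passes_benefit_no_harm(error_change, tolerance=0.01):
    """Check one experiment-vs-control comparison (hypothetical helper).

    error_change maps a verification area to the relative change in error,
    (experiment - control) / control, so negative values are improvements.
    tolerance is an assumed threshold below which a change counts as neutral.
    Returns (passes, improved_areas, degraded_areas).
    """
    improved = sorted(k for k, v in error_change.items() if v < -tolerance)
    degraded = sorted(k for k, v in error_change.items() if v > tolerance)
    # Pass only if at least one area improves and no area is degraded.
    return (bool(improved) and not degraded, improved, degraded)


# Made-up relative error changes for three verification areas.
changes = {
    "2m temperature": -0.04,
    "precipitation": -0.02,
    "upper-air wind": 0.003,
}
ok, improved, degraded = passes_benefit_no_harm(changes)
print(ok, improved, degraded)
```

The same check would be repeated for each region, time of day, and season, since a change that is neutral in the aggregate can still do harm in one of those subsets.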

There are 2 breakdowns below for verification principles: by model and by observation type.

Familiarity with both areas of principles is important for coming up with an appropriate set of verification tests to test an NWP change hypothesis.

These verification focus areas below are related to key phenomena addressed by different model niches. (GSL model verification webpage (Model Analysis Tool Suite) - gsl.noaa.gov/mats )

Verification Tips by Model

Regional/RAP-scale

Upper-air

Upper-air verification shows dynamics/physics “backbone” accuracy

  1. RMS error is generally more important than bias, although bias for temp and RH at 00z or 12z (evaluated separately, not combined) can be revealing, especially for the accuracy of boundary-layer processes. Biases can also indicate possible issues with radiation, cumulus, aerosols, momentum mixing aloft, or gravity-wave drag.

    1. Bias - temperature and RH (RH is the better moisture variable because it is relative and so comparable at different levels; mixing ratio is not helpful because of its exponential behavior.)

      1. Look at by time of day - 12z vs. 00z - do not combine.

      2. Use both raobs (2x/day) and aircraft (at all times of day, at least over certain areas like US)

    2. Use 3 sources

      1. In situ obs - raobs and aircraft

      2. Gridded data - e.g. GFS analyses or RAP or HRRR analyses

      3. If results from all three sources (raob, aircraft, grids) give approximately the same answer, then the result is more certain. (Open question: can we assess significance using multiple observation types?)

    3. NOTE: Recent research by Stan Benjamin and Dave Turner has revealed that raobs appear to have a low RH bias. RH bias should also be assessed using aircraft (AMDAR) observations.
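
As a minimal sketch of why 00z and 12z should not be combined, the snippet below computes bias and RMS error separately for each valid hour and then shows what pooling does. The function name `bias_and_rmse_by_hour` and the matched forecast/observation values are made up for illustration; bias is taken here as mean(forecast - observation).

```python
import math
from collections import defaultdict


def bias_and_rmse_by_hour(pairs):
    """Group matched pairs by valid hour and compute bias and RMS error.

    pairs: iterable of (valid_hour_utc, forecast, observation).
    Returns {hour: (bias, rmse, count)}.
    """
    diffs = defaultdict(list)
    for hour, fcst, obs in pairs:
        diffs[hour].append(fcst - obs)
    stats = {}
    for hour, d in diffs.items():
        n = len(d)
        bias = sum(d) / n
        rmse = math.sqrt(sum(x * x for x in d) / n)
        stats[hour] = (bias, rmse, n)
    return stats


# Made-up 850 hPa temperature pairs (K): warm bias at 00z, cold bias at 12z.
pairs = [
    (0, 285.0, 284.0), (0, 286.0, 285.0),
    (12, 279.0, 280.0), (12, 278.0, 279.0),
]
stats = bias_and_rmse_by_hour(pairs)
pooled_bias = sum(f - o for _, f, o in pairs) / len(pairs)
print(stats[0], stats[12], pooled_bias)
```

In this constructed case the pooled bias is zero even though each time of day has a 1 K bias of opposite sign, which is the kind of cancellation that separate 00z and 12z statistics avoid. The same function could be run on raob, aircraft, and gridded-analysis pairs to compare the three sources.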

Surface

Reveals physics accuracy, especially over the eastern US, where surface station elevations agree better with model terrain elevation and so there is limited non-physical contamination of the results

  1. 2m temp/dewpoint, and especially their biases, are most important due to their impact on the convective environment.

    1. Note: These variables are sensitive to the diagnostic method used. The linearly interpolated 2-m dewpoint (used, unfortunately, in RRFSv1) gave an unrepresentative value compared to the more commonly used flux-based diagnostic (used in HRRR, RAP, RUC).

  2. 10m wind is not as important as 2m T/Td, although a high wind speed bias at night can be (but is not necessarily) related to a warm bias.

Clouds

Reveals physics accuracy, also post-processing and DA effects

  1. Ceiling

    1. Are model geographical coverage and even cloud base level correct, as measured by CSI and bias values? Look at HSS (more appropriate than TSS for rare events and for penalizing high bias).

    2. Use event contingency verification, not mean or RMS error in ceiling height (which gives too much weight to errors in mid- or high-level cloud that are less relevant to transportation).

    3. Critical for aviation users (IFR/MVFR/VFR levels are important for aviation activities), but also revealing for PBL behavior. Deficient or excessive cloud coverage will lead to (or is at least associated with) boundary-layer temp/RH biases.

  2. GOES-GCIP cloud coverage

    1. Is model mean cloudiness correct? Is there an under- or overforecast?

  3. SURFRAD/SOLRAD - downward shortwave radiation

    1. Is model mean downward shortwave radiation reaching surface correct?

    2. Is model clear-air downward SW radiation correct?

    3. Is mean absolute error for downward SW radiation reduced?

    4. Are there geographical variations in downward SW forecast accuracy?
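
The event contingency scores named under Ceiling (CSI, bias, HSS, TSS) can be sketched from a 2x2 contingency table as follows. The formulas are the standard definitions. The function names, the 1000 ft threshold, and the counts are illustrative assumptions, and the sketch does not guard against empty categories (division by zero).

```python
def contingency_counts(forecast, observed, threshold):
    """Count outcomes for the event 'ceiling below threshold'.

    Returns (hits, false_alarms, misses, correct_negatives).
    """
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f < threshold, o < threshold
        if f_event and o_event:
            hits += 1
        elif f_event:
            false_alarms += 1
        elif o_event:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def scores(a, b, c, d):
    """a=hits, b=false alarms, c=misses, d=correct negatives."""
    csi = a / (a + b + c)              # critical success index
    bias = (a + b) / (a + c)           # frequency bias: forecast / observed events
    tss = a / (a + c) - b / (b + d)    # true skill statistic: POD - POFD
    hss = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    return {"CSI": csi, "bias": bias, "TSS": tss, "HSS": hss}


# Made-up ceilings (ft) at four stations, verified at an assumed 1000 ft threshold.
counts = contingency_counts([500, 800, 3000, 5000], [700, 2000, 900, 6000], 1000)

# Made-up counts for a rare event that is overforecast by a factor of 2.
s = scores(a=50, b=100, c=25, d=825)
print(counts, s)
```

With these overforecast counts, HSS comes out well below TSS, consistent with the point above that HSS penalizes high bias more strongly for rare events.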

Precipitation

  1. Best treated by event-based contingency verification at different precipitation thresholds.

  2. 0-1h precipitation accuracy is of special importance, since the evolution of land-surface fields depends critically on this first hour of precipitation for the hourly updated models (e.g., HRRR, RRFS, RAP, RUC).

HRRR - storm-scale

  1. Radar data (especially in warm-season)

    1. Reflectivity - CSI with obs/model scale averaged up to 20km or 40km, bias without upscaling

    2. (secondary - from radial wind) - updraft helicity tracks

  2. Surface (biases, in particular, are critical for assessing accuracy of storm/cloud/precip environment. Year-round)

  3. Upper-air (Are the dynamics/larger-scale fields, also important for storm prediction, accurate in the HRRR or other convection-allowing model? Also year-round)

  4. Clouds (Downward shortwave radiation, ceiling)

  5. Precipitation - look at bias at different thresholds.
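
The reflectivity item above (CSI on upscaled fields, bias on the native grid) can be sketched as below. This is an illustration under stated assumptions: the helper names, the tiny 4x4 grids, the upscaling factor of 2, and the 25 dBZ threshold are all made up, and the block average is taken directly on the values as given (whether to average in dBZ or in linear reflectivity units is a choice the text above does not specify).

```python
def upscale(grid, factor):
    """Block-average a 2-D grid (list of rows) by an integer factor.

    Assumes both dimensions are divisible by factor.
    """
    out = []
    for i in range(0, len(grid), factor):
        out_row = []
        for j in range(0, len(grid[0]), factor):
            block = [grid[ii][jj]
                     for ii in range(i, i + factor)
                     for jj in range(j, j + factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def csi(fcst, obs, threshold):
    """Critical success index for the event 'value >= threshold'."""
    hits = false_alarms = misses = 0
    for f_row, o_row in zip(fcst, obs):
        for f, o in zip(f_row, o_row):
            f_event, o_event = f >= threshold, o >= threshold
            if f_event and o_event:
                hits += 1
            elif f_event:
                false_alarms += 1
            elif o_event:
                misses += 1
    total = hits + false_alarms + misses
    return hits / total if total else float("nan")


def frequency_bias(fcst, obs, threshold):
    """Ratio of forecast to observed event counts on the grid as given."""
    n_f = sum(f >= threshold for row in fcst for f in row)
    n_o = sum(o >= threshold for row in obs for o in row)
    return n_f / n_o if n_o else float("nan")


# Made-up grids: same storm size, displaced by one grid point.
fcst = [[40, 40, 0, 0], [40, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
obs = [[0, 40, 0, 0], [40, 40, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

native_csi = csi(fcst, obs, 25)
upscaled_csi = csi(upscale(fcst, 2), upscale(obs, 2), 25)
native_bias = frequency_bias(fcst, obs, 25)
print(native_csi, upscaled_csi, native_bias)
```

In this constructed case the small displacement error lowers CSI on the native grid but not after upscaling, while the native-grid bias still reports that the forecast and observed storm coverage match.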

Global

  1. Upper-air - rawinsonde

    1. vertical profile of RMS/bias for wind, RH, temp

    2. scorecard levels - 250hPa, 850hPa, 850-100 hPa average

  2. Anomaly correlation coefficient - 500 hPa heights - a close second place

  3. Surface (a distant third place)

  4. Clouds

    1. Simulated visible + IR imagery (as from SOS) can give good qualitative validation

    2. CERES

  5. Precipitation
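
For the anomaly correlation coefficient listed above, a minimal sketch of the centered form (mean anomaly removed) is given below. The function name, the optional weights (for example cos(latitude) on a latitude-longitude grid), and the sample 500 hPa height values are assumptions for illustration; operational centers may use slightly different conventions.

```python
import math


def anomaly_correlation(forecast, verifying, climatology, weights=None):
    """Centered anomaly correlation between forecast and verifying analysis.

    All inputs are flat sequences over the same grid points. weights is
    optional (e.g., cos(latitude)); equal weights are used if omitted.
    """
    if weights is None:
        weights = [1.0] * len(forecast)
    fa = [f - c for f, c in zip(forecast, climatology)]   # forecast anomaly
    va = [v - c for v, c in zip(verifying, climatology)]  # verifying anomaly
    wsum = sum(weights)
    fbar = sum(w * x for w, x in zip(weights, fa)) / wsum
    vbar = sum(w * y for w, y in zip(weights, va)) / wsum
    num = sum(w * (x - fbar) * (y - vbar) for w, x, y in zip(weights, fa, va))
    den = math.sqrt(
        sum(w * (x - fbar) ** 2 for w, x in zip(weights, fa))
        * sum(w * (y - vbar) ** 2 for w, y in zip(weights, va))
    )
    return num / den


# Made-up 500 hPa heights (m) at four grid points.
lats = [30.0, 40.0, 50.0, 60.0]
weights = [math.cos(math.radians(lat)) for lat in lats]
climatology = [5500.0, 5600.0, 5650.0, 5680.0]
forecast = [5520.0, 5580.0, 5640.0, 5700.0]
analysis = [5515.0, 5590.0, 5635.0, 5695.0]

acc = anomaly_correlation(forecast, analysis, climatology, weights)
print(round(acc, 3))
```

A forecast whose anomalies match the verifying anomalies exactly gives a value of 1, and anomalies of opposite sign give -1.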

Subseasonal

Useful for evaluating the same physics suite used for regional or short-to-medium-range NWP, but here within a coupled global model

  1. Precipitation - anomalies

  2. MJO predictability

  3. 2m temperature

  4. 500 hPa behavior - blocking frequency, mean heights, etc.

  5. Stratospheric warming events

Hints for all verification

Verification Tips by Observation Type

Upper-air / raob verification

Surface verification

Aircraft

Clouds

Reflectivity

Precipitation (against gridded QPE)

Precipitation - by stations or SYNOPs

Anomaly correlation

Other observation datasets that can be used for verification

Technical Considerations

Also behind proposed model/assimilation changes and related verification are these issues:

Examples of misunderstood verification

References
  1. Turner, D. D., Hamilton, J., Moninger, W., Smith, M., Strong, B., Pierce, R., Hagerty, V., Holub, K., & Benjamin, S. G. (2020). A Verification Approach Used in Developing the Rapid Refresh and Other Numerical Weather Prediction Models. Journal of Operational Meteorology, 39–53. 10.15191/nwajom.2020.0803
  2. Benjamin, S. G., James, E. P., Turner, D. D., Balmes, K. A., Sedlar, J., Lantz, K. O., Jensen, A. A., Riihimaki, L. D., & Augustine, J. A. (2025). Excessive Downward Shortwave Radiation in the HRRR and RAP Weather Models and Testing Strategies for Improvements. Monthly Weather Review, 153(11), 2279–2293. 10.1175/mwr-d-25-0094.1
  3. Dorninger, M., Friederichs, P., Wahl, S., Mittermaier, M. P., Marsigli, C., & Brown, B. G. (2018). Editorial: Forecast verification methods across time and space scales – Part I. Meteorologische Zeitschrift, 27(6), 433–434. 10.1127/metz/2018/0955