Statistical methods as per draft of ICH Q1 – part 1

The article presents statistical methods that can be used in the evaluation of stability studies in accordance with the preliminary version of the ICH Q1 document published on April 11, 2025. The ICH document is a consolidated version that is intended to replace ICH Q1A-F and Q5C guidelines. It contains additional recommendations on the principles of conducting stability studies, including new and updated recommendations on the statistical evaluation of stability studies. This article is the first part of a series and covers statistical methods that can be used in the evaluation of single batch stability studies. The second part will discuss the statistical approach to the evaluation of multiple batches using a fixed-effects model. The next part will present the analysis of multiple batches based on a mixed-effects model.

Recommendations

The basic requirements for the statistical evaluation of stability studies are contained in Chapter 13 and Annex 2. According to the recommendations, each primary batch stored under the long-term conditions may be evaluated individually to establish the re-test period or shelf life. Where differences in stability are observed among batches, or among other factors or factor combinations, that preclude the combining of data, the proposed re-test period or shelf life should not exceed the earliest (worst-case) period supported by any batch, other factor, or factor combination. For quantitative attributes expected to change with time following a linear pattern, or log-transformed data that follow a linear pattern at the recommended storage condition, an approach for evaluating the data is linear regression analysis. The appropriateness of the assumed linear relationship over time and the normal distribution of the variables may be supported by an evaluation of the residuals for the regression line (goodness of fit). Analysis of a quantitative attribute can be performed by determining the earliest time at which the 95% confidence limit for the mean intersects the proposed acceptance criterion. For attributes with upper and lower acceptance criteria, a two-sided 95% confidence limit is recommended. For attributes with only a lower or an upper acceptance criterion, a one-sided 95% confidence limit is recommended.

Statistical evaluation of data is recommended for characteristics that show significant change over time and/or significant variability for a given factor or between factors (e.g., between doses or types of packaging). If a parameter shows little or no change over time and at the same time little or no variability, statistical evaluation is not required. However, the omission of statistical evaluation should be justified in the stability study report.

Population and sample

When evaluating a single batch, the object (population) being analysed is all units of that batch, e.g., tablets. The population under study should be homogeneous, meaning the individual units should be subject to the same systematic causes with regard to the parameter (e.g., assay), and variability within the defined population should result only from random causes. Each characteristic of the general population is referred to as a population parameter and is denoted using Greek letters, such as the mean (µ) or standard deviation (σ). According to the ICH approach, the population/batch mean value, µ, should meet the predefined acceptance criteria at the end of the shelf life, e.g., after 24 months. The null hypothesis, H₀, may assume that the population mean does not meet the specified acceptance criteria, while the alternative hypothesis, Hₐ, assumes that it does. Therefore, the objective of the study is defined by the alternative hypothesis. In hypothesis testing, it is essential to determine the probability of a false positive decision regarding compliance with quality specifications (patient risk, type I error, alpha), that is, the acceptable risk of incorrectly rejecting a true null hypothesis. The same applies to the probability of a false negative decision (manufacturer's risk, type II error, beta). From the perspective of a regulatory agency, the primary concern remains a false positive conclusion regarding the compliance of the population mean with quality requirements.


Since it is physically impossible to analyse the entire batch at specific time points, only a sample consisting of randomly selected units from the batch is taken for stability testing. This sample should be representative of the batch being tested, i.e., it should reflect the properties of the population being tested. The values determined on the basis of a random sample are called sample statistics and are denoted by Latin letters, e.g., the mean x̄ or the standard deviation s. However, the compliance of the mean value determined on the basis of the sample with the assumed acceptance criteria does not provide a high degree of confidence that the true mean value of the population under test will comply with the specification limits. To ensure a high degree of confidence, according to ICH at least 95%, a confidence interval should be constructed around the mean value. A 95% confidence level means that if 20 random samples are taken from the same population, it can be expected that 19 of the 20 confidence intervals will contain the true population value. Therefore, if the confidence interval falls within the assumed criteria, the H₀ hypothesis can be rejected in favor of the alternative. The purpose of statistical analysis is therefore to determine the confidence interval around the mean value at the end of the shelf life, e.g., after 24 months of storage of the medicinal product under the recommended conditions.
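As a numerical illustration of the confidence-interval idea, the sketch below computes a two-sided 95% confidence interval around a sample mean using the t-distribution. The assay results, sample size and values are hypothetical, not taken from the article:

```python
import numpy as np
from scipy import stats

# Hypothetical assay results (% of label claim) for a random sample of units
sample = np.array([99.1, 98.7, 99.4, 98.9, 99.0, 98.6])

mean = sample.mean()
s = sample.std(ddof=1)                  # sample standard deviation
n = len(sample)
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% confidence level
half_width = t_crit * s / np.sqrt(n)

print(f"mean = {mean:.2f}, 95% CI = [{mean - half_width:.2f}, {mean + half_width:.2f}]")
```

If this interval lies entirely within the specification limits, the sample supports rejecting H₀ at the 5% significance level.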

Procedure

The procedure for statistical evaluation of stability data may include five basic steps:

Step 1 Visual evaluation

Step 2 Model selection

Step 3 Model significance verification

Step 4 Model adequacy verification

Step 5 Shelf life determination

Step 1 – Visual evaluation

Let’s start by determining the shelf life for an individual batch. The first step is to collect data for a given parameter at appropriate time intervals, verify compliance with specification limits, and visually evaluate the parameter’s change over time, its variability, and the presence of any outlying points. However, visual assessment does not provide the high degree of confidence expected by ICH. In the case shown in the figure, a gradual decrease in the active substance content over time is visually evident after 18 months of storage.

Step 2 – Model selection

Step 2 involves the selection of an appropriate statistical model describing the dependence between a given parameter and time. In general, quantitative chemical attributes (such as assay or degradation products) can be assumed to follow zero-order kinetics during long-term storage. Therefore, the relationship between those attributes and time is assumed to be linear. The regression coefficients β0 (intercept) and β1 (slope) can be estimated using the least squares method. In the analysed case, a linear model was fitted. The slope coefficient is 0.33 and indicates the rate of degradation of the active substance per unit of time. The average content determined by the regression line is above the specification limit within the proposed shelf life of 24 months. However, this does not provide a high degree of confidence (min. 95%) that the true mean of the parameter will meet the acceptance criteria, in the sense that the regression line determined for another sample may cross the specification limit before 24 months.
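The least-squares estimation of β0 and β1 can be sketched as follows. The time points and assay values are hypothetical, chosen only to mimic a slowly degrading attribute; they are not the article’s data:

```python
import numpy as np

# Hypothetical stability data: time (months) and assay (% of label claim)
t = np.array([0, 3, 6, 9, 12, 18, 24], dtype=float)
y = np.array([100.1, 99.6, 98.9, 97.8, 96.9, 95.2, 93.0])

# Least-squares estimates of the slope (b1) and intercept (b0)
b1, b0 = np.polyfit(t, y, deg=1)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.3f} %/month")
```

A negative slope of roughly 0.3 %/month plays the same role as the 0.33 degradation rate discussed in the text.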

Step 3 – Model significance verification

Step 3 answers the question of whether the fitted model of the parameter’s dependence on time is statistically significant. To assess the significance of the model, a t-test can be used to verify the statistical significance of the regression coefficients. The null hypothesis assumes that β0 and β1 are equal to zero, while the alternative hypothesis assumes that they are different from zero. The t-statistic, which is the ratio of the regression coefficient to the standard error of the coefficient estimate, is used to verify the hypothesis. The table contains the estimated intercept and slope coefficients, the standard errors of the regression coefficient estimates, the values of the t-statistic, and the p-values. If the p-value is less than the assumed significance level alpha of 0.05, the null hypothesis should be rejected, which means that the slope of the line is statistically significant. In the analysed case, the p-value is 0.04 and is below the significance level alpha of 0.05. The dependence of the parameter on time is statistically significant at the significance level alpha of 5%.

The F-test can also be used to verify the statistical significance of the model; it decomposes total variability into variability explained by the assumed model and variability resulting from random error. The null hypothesis assumes no dependence of the parameter on time, while the alternative hypothesis assumes a linear dependence on time. The F statistic, the ratio of the variance explained by the assumed model to the random variance not explained by the model, is used to verify the hypothesis. The table contains the variability explained and not explained by the model, the value of the F statistic, and the p-value. If the p-value is less than the assumed significance level alpha of 0.05, the null hypothesis should be rejected, which means that the assumed model of the parameter’s dependence on time is statistically significant. For simple linear regression, the t-test and the F-test are equivalent: they yield the same p-value and lead to the same conclusion.
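Both tests can be reproduced with standard tools. The sketch below uses `scipy.stats.linregress` on hypothetical data and confirms that, for simple linear regression, F = t² and the two p-values coincide:

```python
import numpy as np
from scipy import stats

# Hypothetical stability data (illustrative, not from the article)
t = np.array([0, 3, 6, 9, 12, 18, 24], dtype=float)
y = np.array([100.1, 99.6, 98.9, 97.8, 96.9, 95.2, 93.0])

res = stats.linregress(t, y)

# t-test on the slope: t = b1 / SE(b1); linregress reports the two-sided p-value
t_stat = res.slope / res.stderr
print(f"slope = {res.slope:.3f}, t = {t_stat:.2f}, p = {res.pvalue:.2e}")

# Equivalent F-test: for simple linear regression F = t^2, with 1 and n-2 df
f_stat = t_stat**2
p_f = stats.f.sf(f_stat, 1, len(t) - 2)
print(f"F = {f_stat:.2f}, p = {p_f:.2e}")
```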

The statistical assessment is supplemented by the determination of goodness-of-fit indicators. These include:

  • standard error of estimation – a measure of the dispersion of points around the model
    (the lower the value, the better the model fit);
  • coefficient of determination (R²) – indicates how much of the variability is explained by the model
    (the higher the value, the better the model fit);
  • PRESS statistic – describes the predictive power of the model
    (the lower the value, the greater the predictive ability of the model);
  • AIC (Akaike information criterion) – a measure of model quality
    (the lower the value, the better the model).
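These indicators can be computed directly from the fitted model. A minimal sketch on hypothetical data follows; note that the AIC is given only up to an additive constant, and PRESS is computed via the leave-one-out hat-matrix shortcut:

```python
import numpy as np

# Hypothetical stability data (illustrative)
t = np.array([0, 3, 6, 9, 12, 18, 24], dtype=float)
y = np.array([100.1, 99.6, 98.9, 97.8, 96.9, 95.2, 93.0])
n, p = len(t), 2                       # p = number of regression coefficients

X = np.column_stack([np.ones(n), t])   # design matrix [1, t]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sse = float(resid @ resid)

se_est = np.sqrt(sse / (n - p))                       # standard error of estimation
r2 = 1 - sse / float(((y - y.mean())**2).sum())       # coefficient of determination

# PRESS: leave-one-out prediction errors via e_i / (1 - h_ii)
H = X @ np.linalg.inv(X.T @ X) @ X.T
press = float(((resid / (1 - np.diag(H)))**2).sum())

# AIC for a Gaussian likelihood, up to an additive constant
aic = n * np.log(sse / n) + 2 * (p + 1)

print(f"SE = {se_est:.3f}, R^2 = {r2:.4f}, PRESS = {press:.3f}, AIC = {aic:.2f}")
```

PRESS is always at least as large as the residual sum of squares, since each leave-one-out error inflates the ordinary residual by 1/(1 − hᵢᵢ).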

Based on the assessment of the goodness-of-fit indicators, you can select the appropriate model. In our case, the linear regression model doesn’t seem to be the best model for describing the variability of the parameter over time. A better model is the model with a logarithmic time transformation, which has a lower standard error of estimation, a higher coefficient of determination, and better predictive power. We will see the consequences of model selection in a moment, but let’s stick with simple linear regression for now.

Goodness-of-fit measures, as well as the t-test and F-test, do not answer the question of whether the selected linear regression model is correct, i.e., adequate. For instance, if the null hypothesis is rejected, this may mean that the linear model is indeed correct (A), but it may also mean that a more complex model applies (B). On the other hand, not rejecting the null hypothesis may indeed mean that there is no dependence of the parameter on time (C), but it may also mean that a more complex model applies (D).

Step 4 Model adequacy verification

Step 4 in regression analysis is therefore to assess the adequacy of the selected statistical model using the ICH-recommended analysis of residuals, which are the differences between observations and values calculated from the fitted model. The analysis may be performed on raw or scaled residuals, e.g., standardized residuals. The latter should fall within the range of ±3; standardized residuals falling outside this range may indicate the presence of outliers. The analysis of residuals can be done using simple tools: a plot of residuals as a function of values predicted by the model and a normality plot of the residual distribution. A linear model can be considered adequate if the residuals are randomly distributed around 0, do not show systematic trends or patterns, have similar dispersion across the entire range, and have a distribution close to normal. If the assumptions are met, the linear regression procedure allows for the estimation of regression coefficients with the smallest variance, free from systematic bias, enabling a reliable statistical assessment of the significance of the regression coefficients and the reliable estimation of confidence intervals and, consequently, the shelf life. On the other hand, significant violations of the assumptions regarding the errors ε can lead to an unstable model, in the sense that a different sample may lead to a completely different model and opposite conclusions. It should be noted that for a small amount of data, it may be difficult to assess the distribution of the residuals. In our case, the residuals are randomly distributed around 0, do not show any systematic trends/patterns, have a similar spread across the entire range, and have a distribution close to normal. The model can be considered acceptable.
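A minimal numerical counterpart of this residual check, on hypothetical data, is sketched below. For simplicity the residuals are standardized by the residual standard error alone (a full internal standardization would also divide by √(1 − hᵢᵢ)), and the Shapiro-Wilk test is used as one possible numerical check of normality alongside the plots:

```python
import numpy as np
from scipy import stats

# Hypothetical stability data (illustrative)
t = np.array([0, 3, 6, 9, 12, 18, 24], dtype=float)
y = np.array([100.1, 99.6, 98.9, 97.8, 96.9, 95.2, 93.0])
n = len(t)

b1, b0 = np.polyfit(t, y, deg=1)
resid = y - (b0 + b1 * t)
se = np.sqrt((resid**2).sum() / (n - 2))

std_resid = resid / se                 # simple standardization by the residual SE
outliers = np.abs(std_resid) > 3       # candidate outliers outside the +/-3 band
print("standardized residuals:", np.round(std_resid, 2))
print("any outliers:", bool(outliers.any()))

# Shapiro-Wilk test: small p would suggest non-normal residuals
w, p_sw = stats.shapiro(resid)
print(f"Shapiro-Wilk p = {p_sw:.3f}")
```

With only a handful of time points such tests have low power, which matches the article’s caution that residual distributions are hard to judge from small samples.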

 

What if the residual analysis indicates that the model is incorrect and the parameter exhibits a non-linear dependence on time? In this case, a time transformation can be applied to linearize this dependence. The transformation recommended by ICH is the logarithmic transformation of time. After applying the transformation, the previous steps of the analysis should be repeated, with the model adequacy verified again. The figure shows an example of the logarithmic transformation of time. As mentioned before, the initial evaluation of the models indicated that the model with log transformation describes the dependence of the parameter on time better than the simple regression model without transformation. Although the linear regression model is not the best model, ICH allows for this situation provided that the simple linear regression model represents the worse (more conservative) case. On the other hand, an inadequate model may make the statistical assessment unreliable, including the covariance analysis, the confidence level and, as a consequence, the estimated shelf life. Let’s see what consequences this will have for the estimated shelf life.
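The transformation itself is a one-line change. Since log(0) is undefined, the sketch below uses log(t + 1) as one common workaround (the draft does not prescribe how t = 0 should be handled, so this is an assumption). The data are hypothetical, constructed to show a decelerating trend for which the log model fits better:

```python
import numpy as np
from scipy import stats

# Hypothetical data with a decelerating (non-linear) degradation trend
t = np.array([0, 3, 6, 9, 12, 18, 24], dtype=float)
y = np.array([101.0, 98.0, 96.6, 96.0, 95.3, 94.6, 93.9])

# Log transformation of time; log(t + 1) keeps the t = 0 point usable
x = np.log(t + 1)

lin = stats.linregress(t, y)
log_fit = stats.linregress(x, y)

def se_of_fit(x, y, fit):
    """Standard error of estimation for a simple linear fit."""
    resid = y - (fit.intercept + fit.slope * x)
    return np.sqrt((resid**2).sum() / (len(y) - 2))

print(f"linear model: SE = {se_of_fit(t, y, lin):.3f}, R^2 = {lin.rvalue**2:.4f}")
print(f"log(t) model: SE = {se_of_fit(x, y, log_fit):.3f}, R^2 = {log_fit.rvalue**2:.4f}")
```

For data of this shape, the log-transformed model shows a lower standard error of estimation and a higher coefficient of determination, mirroring the comparison in the article.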

Step 5 Shelf life determination, individual batch

Step 5 is to determine the shelf life based on the established model. According to ICH, the shelf life is determined based on the earliest point of intersection of the 95% confidence interval with the specification limits. The confidence interval built around the linear regression is the range of values that may contain the true mean value of the parameter at a given time point, e.g., 24 months, at a given confidence level. In other words, the true regression line may fall within this confidence interval at the given confidence level. If the confidence interval does not cross the specification limit before the proposed shelf life, e.g., 24 months, this provides a high degree of confidence that the true mean value of the parameter will meet the specification at the end of the shelf life. For our data, the lower limit of the 95% confidence interval intersects the lower specification limit in month 19; for the simple linear regression model, the estimated shelf life is therefore shorter than the proposed shelf life of 24 months. However, if we choose a model which better describes the change of the parameter over time, the estimated shelf life may be extended. In our case, the estimated shelf life for the model after the time transformation is 53 months, which is longer than the proposed one.
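The intersection search can be sketched as follows: a one-sided lower 95% confidence limit for the mean response is evaluated on a fine time grid, and the earliest crossing of the lower specification limit is reported. All data, and the specification limit of 95.0%, are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical stability data (illustrative); lower specification limit 95.0%
t = np.array([0, 3, 6, 9, 12, 18, 24], dtype=float)
y = np.array([100.1, 99.6, 98.9, 97.8, 96.9, 95.2, 93.0])
lsl = 95.0

n = len(t)
b1, b0 = np.polyfit(t, y, deg=1)
resid = y - (b0 + b1 * t)
se = np.sqrt((resid**2).sum() / (n - 2))
sxx = ((t - t.mean())**2).sum()
t_crit = stats.t.ppf(0.95, df=n - 2)   # one-sided 95% (lower limit only)

def lower_cl(tp):
    """Lower 95% confidence limit for the mean response at time tp."""
    mean = b0 + b1 * tp
    half = t_crit * se * np.sqrt(1 / n + (tp - t.mean())**2 / sxx)
    return mean - half

# Earliest month at which the lower confidence limit crosses the spec limit
grid = np.arange(0, 36.01, 0.01)
below = grid[lower_cl(grid) < lsl]
shelf_life = float(below[0]) if below.size else None
if shelf_life is not None:
    print(f"estimated shelf life ~ {shelf_life:.1f} months")
```

Because the confidence band widens away from the mean time point, the crossing occurs earlier than the point where the regression line itself reaches the limit.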

 

Summary

In this way, the shelf life can be estimated for each batch individually and, if it is longer than expected, no further statistical analysis is necessary. However, if the estimated shelf life is shorter than expected for at least one batch, data from several batches/factors/combinations of factors can be combined to estimate a common shelf life for all batches tested. ICH recommends two approaches. The first approach is to use an analysis of covariance, where the batch is a fixed factor (fixed effects model), and the second approach is a mixed effects model, where the batch is treated as a random factor. The first approach is applicable in cases where the number of batches is limited, i.e., the three typical registration batches, while the second approach can be used when there are more than 5 batches. The first approach, with the batch as a fixed factor, will be discussed in the second part of the series, and the second approach in the third.

Under discussion

Comments on the draft version of the ICH Q1 document highlighted the need to clarify the recommendations, including:

  • Whether to use two one-sided 95% confidence intervals or a two-sided 90% confidence interval for parameters with a two-sided specification limit?
  • Should confidence intervals be determined for each batch based on separately estimated variance or based on pooled variance from three batches?
  • Should a confidence interval (population mean), prediction interval (single future population result), or tolerance interval (population proportion) be used to determine the shelf life?

The analyses were performed using the StatSoft PS application. For more information about the application itself and the systems used for stability studies management and trending, visit https://www.statsoft.pl/en/solutions/pharma/.

Author: Dr. Marek Skowronek, Head of Quality and Pharmaceutical Solutions

 
