## 1. Introduction

Toxicological investigation of environmental chemical mixtures is evolving, with attention focused on defined mixtures, involving a limited number of chemicals, and complex environmental mixtures, involving a large number of chemicals and, typically, an unidentified fraction. Dosing regimens are being developed to evaluate the toxicity of complex mixtures using approaches consistent with human environmental exposures that are typically much lower than those used in traditional toxicology studies. Consequently, newer studies are designed to evaluate the toxicity of complex mixtures (1) with the inclusion of low, environmentally relevant dose levels; (2) with the relative proportions of component chemicals similar to those measured in environmental samples; and, (3) with approaches that maintain the chemically unidentified components of the mixture [

1,

2]. This type of study design differs from many traditional defined-mixture studies, conducted mostly on binary mixtures at relatively high dose levels where adverse effects are more readily detected if present, but whose relevance to human health risks associated with low environmental exposures is often unclear. This newer type of study design has been proposed specifically for application to drinking water disinfection by-products (DBPs) [

3–

5].

Disinfection of drinking water for microbial contamination provides an essential public health benefit in reduction of water-borne disease. However, oxidizing disinfectants react with materials in the source water resulting in the formation of a wide variety of DBPs. DBP mixtures are highly complex, containing numerous chemicals not routinely measured and many that are unknown; approximately 50% of the total organic halide compounds formed when water is disinfected remains unidentified [

6–

8]. Some epidemiologic studies report adverse developmental effects associated with exposure to DBPs, including low birth weight at term, small for gestational age [

9,

10], and stillbirths [

11,

12]. Epidemiologic evidence is mixed regarding associations of DBPs with spontaneous abortion [

13,

14]. Toxicity bioassays have been conducted on approximately 35 individual DBPs and a limited number of DBP mixtures [

15–

19]; some of the tested DBPs and DBP mixtures were shown to be reproductive or developmental toxicants in experimental animals.

Because concerns identified from epidemiologic studies on whole DBP mixtures cannot be readily addressed by investigating either individual DBPs or simple, defined DBP mixtures, scientists from four of the national laboratories and centers of the U.S. EPA’s Office of Research and Development have developed and, along with extramural partners, undertaken a research project (the Four Lab Study) that integrates toxicological and chemical evaluation of environmentally realistic complex mixtures of DBPs [

4,

5,

20]. For complex DBP mixtures formed by chlorination, improved understanding of

in vivo reproductive/developmental toxicology is a priority; consequently, the Four Lab study included a multigenerational reproductive/developmental toxicity rodent bioassay. Preparation for this experiment included: conducting a phased series of experiments involving water concentration, toxicology, and chemistry [

8,

21–

25]; development of new risk assessment methodology [

26]; and conducting developmental toxicity screens on sodium, sulfate, and concentrated DBPs [

27].

The resulting database of toxicological and analytical chemistry data on the whole DBP mixture provides important information for health risk assessment of DBPs [

26]. Risk assessment investigations include the analysis of associations between positive assay results (e.g., delays in attaining puberty) and dose level of specific DBPs (e.g., dichloroacetic acid) or groups of DBPs (e.g., trihalomethanes) identified within the whole mixture. Using existing dose-response data on individual chemicals, statistical models can then be used to test whether the observed toxicity of the whole mixture can be attributed to known DBPs, their joint toxic action, or to the unidentified fraction of the DBP mixture [

26]. These distinctions are important for informed decisions with regard to controlling the levels of specific DBPs or groups of DBPs in finished drinking water, resulting in the production of clean, safe water. To allow for expanded use of the results, statistical and toxicological criteria have been developed to determine the “sufficient similarity” of DBP mixture composition and toxicity potential among finished drinking waters [

28]. Using such criteria, it may be possible to extrapolate Four Lab Study results on the chlorinated concentrates to evaluate the potential for health effects from different DBP mixtures that could arise from: (1) treatment with other types of disinfectants (e.g., chloramination), (2) differences among source waters (e.g., differences in NOM) or (3) differences in treatment practices across treatment plants or over time within the same treatment plant.

Experimental constraints considered in the design of the multigenerational bioassay included: the number of dams that could be accommodated at one time (maximum of 100); the extent that water could be concentrated while retaining palatability and conserving organics (a concentration factor of 136× for total organic carbon was achieved for use in the multigenerational bioassay [

29]); and, the quantity of concentrate that would be produced (~1,500 liters). Designing a meaningful study given these constraints required careful attention to statistical power.

In the multigenerational bioassay, timed-pregnant Sprague-Dawley rats comprising the parental (P_{0}) generation would be assigned randomly to either a control group or a treatment group which would consume chlorinated water concentrate. Each P_{0} dam was to deliver a litter (F_{1} generation). An issue addressed in the present work was whether to breed one or two females from each F_{1} litter to a non-sibling F_{1} male from the same exposure group to produce the F_{2} generation. Priority study endpoints were prenatal loss (number of uterine implantation sites minus number of live pups at birth, divided by implantation sites) and pup birth weight. In comparison with epidemiologic endpoints of concern, prenatal loss is analogous to spontaneous abortion, whereas reduced pup weight is analogous to small for gestational age and term low birth weight.

U.S. EPA testing guidelines for reproductive toxicity call for 20 pregnant females per group as the standard protocol for single chemical bioassays [

30]. The guidelines also state “the highest dose level should induce toxicity in the parental (P) animals and intermediate dose levels should produce minimal observable toxic effects. The lowest dose level should not produce any evidence of toxicity” [

30]. Under these guidelines, the requirement for toxicity in the high and middle dose levels ensures adequate statistical power to detect effects, but not necessarily at the low dose. In contrast, in the multigenerational bioassay whose design was the goal of this paper, pregnant animals were to be exposed to much lower (

i.e., environmentally relevant) DBP concentrations than the guidelines suggest. Thus, study design, particularly estimation of statistical power, was essential to optimize sample sizes within the experimental constraints, so that effects, if present, would be detected, increasing the utility of the experimental results for risk assessment. This article describes the development and application of the methodology used for calculating statistical power for non-independent observations for this bioassay with chlorinated water concentrates in the Four Lab Study, taking into account the multigenerational bioassay design as well as constraints on sample size, concentration factor, and sample volume.

## 3. Model

For analyzing pup weight, male and female pups were considered separately as well as combined; the results presented in this paper are for the combined male and female pups. The litter was treated as the experimental unit, with each pup within a litter representing a repeated measurement. A linear model was assumed for the data, so that:

where y_{ijk} is the weight of the k^{th} live pup in the j^{th} litter of the i^{th} group (where i = 1 for the control group and i = 2 for the treatment group):

and:

In this model, the pup weights, y_{ijk}, were assumed to follow a Gaussian distribution with mean β_{i} and common variance-covariance matrix ∑ within litters. A constant correlation was assumed between the weights of all pups from the same litter (i.e., a compound symmetric correlation structure), while weights of pups from different litters were assumed to be uncorrelated (independent).

An additional key endpoint, prenatal loss, was also examined with respect to power. For prenatal loss, litter again represented the experimental unit, with each implantation site representing a repeated measurement. The endpoint was then coded as:

where i equals 1 (control) or 2 (treatment).

If π_{i} denotes the mean prenatal loss probability among dams in the i^{th} group, then under a linear model:

where x_{1}_{j} and x_{2}_{j} are defined as above. Alternatively, when a linear logistic model was assumed for the data, then:

where x_{i}_{j} and x_{2}_{j} are defined as above.

The linear and linear logistic models differ based upon the alternative hypotheses being tested. In both the linear and linear logistic models for prenatal loss a binomial error distribution was assumed for the model. In the linear model, the alternative hypothesis H_{α}:β_{1} ≠ β_{2} represents the difference between the prenatal loss proportions of the control and treatment groups (e.g., the prenatal loss for the treatment group is 7.1 percentage points greater than the prenatal loss for the control group). In contrast, under the linear logistic model, the alternative hypothesis H_{α}:β_{1} ≠ β_{2} represents the difference between the log odds of the control and treatment groups, which for small π_{i} is approximately equal to the ratio of the prenatal loss proportions of the treatment and control groups (e.g., the prenatal loss for the treatment group is approximately twice the prenatal loss for the control group). Therefore, with respect to the biological interpretation of the endpoint, the specific question of interest, i.e., an absolute increase over control vs. a proportional increase relative to control, is important to the appropriate model specification. The results for the two tests should be asymptotically equivalent under the null hypothesis, but sensitivity may differ under different alternative hypotheses. Results for both models are presented here.

For both the pup weight data and the prenatal loss data, an explicit relationship exists between the over-dispersion parameter, which represents the proportion of the observed variability that is due to the correlation among the observations, and the intra-litter correlation. For both distributions, the over-dispersion parameter, ψ, can be expressed as a function of the intra-litter correlation, ρ:

where

n_{ij} is the number of pups in the

j^{th} litter of the

i^{th} group. From this equation it follows that

ψ_{ij} = 1 if and only if

ρ = 0 or

n_{ij} = 1. For each scenario examined, the intra-litter correlation

ρ was specified based on the pilot data, and the dispersion parameter

ψ was calculated according to the above equation. By doing this, the estimates for

ρ given

ψ closely matched those observed from modeling the Narotsky

et al. [

27] data.

#### 3.1. Calculations and Assumptions

The number of P_{0} dams available for study was limited to 100 dams per block where experimental blocks were logistically constrained to being evaluated sequentially over time. Therefore, the problem was that of determining whether the study would have sufficient power to detect the treatment effects of interest (i.e., a 0.6 g difference in mean pup weight and a 7.1 percentage point difference in mean prenatal loss or 1.9 treatment-to-control ratio in mean prenatal loss), with respect to the key endpoints of pup weight and prenatal loss. The targeted level of power was 80%, while maintaining a 0.05 significance level.

In each block, 100 dams would be divided into two groups: a control group and a treatment group receiving water concentrate containing a complex mixture of DBPs. In this study, one treatment group would be used due to the limited amount of water concentrate available.

The expected control group and treatment group means were estimated from the Narotsky

et al. [

27] data (

Table 1); the variances and intra-litter correlations were also estimated from these pilot data. The modeled estimates used to develop power calculations differed only slightly from the statistics calculated from the raw data (

Table 1). Litter size and the number of implantation sites were set equal to the average litter size and number of implantation sites based on the preliminary study data.

Having equal numbers of repeated measurements per experimental unit was a necessary assumption for the implementation of Rochon’s [

31] methodology. Therefore, it was assumed that an equal number of live pups would be produced in each litter for all scenarios considered with respect to pup weight. Similarly, for prenatal loss, it was assumed that all females would produce an equal number of implantation sites. These assumptions are unlikely to be realized in practice, but were expected to have little impact on the observed power of the study.

#### 3.2. Initial Designs

#### Single-Block Design, One F_{1} Female per Litter

The first design considered was a single-block design in which one F

_{1} female rat per litter would be bred. In this design, a maximum of 100 dams would be available for assignment to the two groups (control and treatment). The compound symmetric correlation structure was used for pups within litters, and pups from different litters were assumed to be independent. The correlation matrix is given in

Figure 1a. This design has the advantage of simplicity, both of execution and analysis.

The cohort size of 100 dams was sufficient to achieve the desired 80% power to detect a 0.6 g difference in average pup weight between the control and treatment groups at a significance level of 0.05 for all cases considered except one. With this simple design, greater than 99% power can be achieved with 50 dams assigned to each of the control and treatment groups, or with a 40:60 control:treatment group ratio assignment.

The single-block design failed to produce a sufficient level of power to detect a treatment effect on prenatal loss across the effect size, litter size, and correlation scenarios considered. Only when the intra-litter correlation was assumed to be zero (a very unlikely assumption) would 100 dams be sufficient to achieve the desired level of power. When the intra-litter correlation was assumed to be non-zero, the maximum achievable power, regardless of whether equal or unequal allocation of the dams occurred, was substantially lower than the desired 80% for the prenatal loss endpoint for either the linear (36%) or the logistic (35%) model. Because the single-block design showed such poor performance with respect to prenatal loss, other design options were examined.

#### Single-Block Design, Two F_{1} Females per Litter

In an attempt to increase the power of the experiment without increasing the number of P

_{0} dams, a design in which two females per F

_{1} litter (

i.e., sisters) are bred was examined. A major complication with this unconventional design is that the pups from the litters of the two related F

_{1} females would not be independent. The resulting correlation structure is presented in

Figure 1b.

Rochon’s [

31] sample size methodology is flexible enough to accommodate such a correlation structure. However, this correlation structure requires additional assumptions. First, it must be assumed that the correlation between litters of sisters is the same for all sister pairs. Likewise, it must be assumed that correlation between any pup from one litter and any pup from the related litter is the same for any such pair of pups.

In addition, the size of the inter-litter correlation relative to the intra-litter correlation must be estimated. It is a reasonable assumption that the inter-litter correlation would be smaller than the intra-litter correlation, though it is unclear how much smaller. To address this last issue, a sensitivity analysis could be conducted. The true nature of the correlation structure for such a design is uncertain, and a violation of any one of these assumptions could affect the results. Validation of these assumptions is not possible, because Narotsky

et al. [

27] does not provide data on litters of related females. Moreover, this type of design, breeding more than one F

_{1} female per litter, is unconventional and was not encountered in the scientific literature. Despite the likelihood that this design would not be usable for the current study, it was examined to determine its potential usefulness for future studies; if large increases in expected power can be realized with such a design, then pilot data could be collected to provide the required estimates and support validation of the assumptions for future work.

Based on a single value for the relative value of the inter-litter correlation to the intra-litter correlation (δ = 0.75), the results were encouraging. As expected based on the results above with one F_{1} female, power for the pup weight endpoint was greater than 99% with a control:treatment group ratio of either 50:50 or 40:60. Power for the prenatal loss endpoint continued to fall below the desired 80% level, but increased using the linear (43%) and logistic (42%) models, respectively, and with unequal allocation of the dams, as stated above for the one female per litter design.

Despite the increased power using two females per litter, this design was not further considered. This design required more water concentrate than the alternative two-block design (discussed below). In addition, the project team lacked confidence in the method of handling the correlation between related litters and was doubtful that all necessary assumptions would be met. Nonetheless, it is important to note that this approach,

i.e., using multiple offspring per litter, can enhance the sensitivity of reproductive toxicity testing for several endpoints (e.g., onset of puberty, estrous cyclicity) [

32,

33] and warrants further exploration.

#### 3.3. Final Design

#### Two-Block Design, with One F_{1} Female per Litter

To achieve the desired level of power without the complication of inter-litter correlation, a design with two blocks of 100 P_{0} dams per block (total of 200 dams) was examined. Although a single-block design lacks the complication of a blocking factor, managing 200 dams in a single block exceeded the technical capability available to conduct the study. A design with two blocks (i.e., replicates) of 100 dams each was logistically more feasible and was expected to produce increases in power dependent upon the size of the group × block interaction effects.

If dams are examined in two blocks, a factor must be included in the model to account for differences between the two blocks. The blocks would be treated as a random factor in the analysis of the data from the multi-generational study being powered here. The model under consideration for pup weight was revised as:

where y_{ijkl} is the weight of the l^{th} live pup in the k^{th} block in the j^{th} litter of the i^{th} group:

where i equals 1 (control) or 2 (treatment). τ_{k} is the random effect of the k^{th} block, with τ_{k} ~ N (0, σ_{βτ}^{2}), and (βτ)_{ik} is the random effect of the i^{th} treatment group by k^{th} block interaction, with (βτ)_{ik} ~ N(0, σ_{βτ}^{2}). The experiment was designed to include replication so that both the block and the treatment group by block interaction can be estimated.

Using the same notation, the revised model for prenatal loss was:

Three different scenarios representing the experimental outcomes were examined in this analysis: (1) the block effect is zero (i.e., no significant block effect), (2) the main effect of block is significant, and (3) the main effect of block and the group × block interaction are significant.

To determine the power achieved for a two-block design, the methodology described by Rochon [

31] was used. Implementation of the methodology was similar to that described above for the single-block case, with the same correlation structure used. Parameter estimates required for the algorithm were based on the complex mixture screening-level experiment (

Table 1) [

27], with the exception of estimates for the block effects, which were not available and, therefore, varied.

To calculate sample size and power for a given scenario with fixed parameters, the algorithm calculations needed to be performed only once. However, because of the random nature of the block effect, a single fixed parameter estimate for the block effect could not be used. Treating the blocking factor as random, the model assumed that the observed block effect was a random observation from the distribution of block effects and that the observed group × block interaction effect was a random observation from the distribution of group × block interaction effects.

To incorporate the random block main effect into the calculations, random noise was generated according to a N (0, σ_{τ}^{2}) distribution and added to the group means by block. Random noise was also generated according to a N (0, σ_{βτ}^{2})and added to each group mean to incorporate the random interaction effect. As stated above, the values for σ_{τ}^{2} and σ_{βτ}^{2} were varied since no estimates were available. The ranges for these random block effects were selected to provide sufficiently realistic variability without overwhelming the true effects of interest in the model. Power was calculated 500 times and the average power reported.

## 5. Discussion

This article describes the methodology used for calculating statistical power for non-independent observations in a two-block design for the multigenerational reproductive/developmental toxicity rodent bioassay in the Four Lab Study. It takes into account the multigenerational bioassay design as well as constraints on sample size, water concentrate volume and concentration factor.

Designing this bioassay under these constraints necessitated thoughtful consideration of statistical power. Determining power and sample size for multiple block designs is complicated, because it must account for the interaction of groups with blocks. Though the effect of the block is not of inherent interest, it might influence the group (treatment) effect if the DBP levels within the concentrate change during the course of the study. Using developmental toxicity screening data [

27], power calculations were made for prenatal loss and pup weight. These endpoints were considered of primary importance because they correspond to effects reported in positive epidemiologic studies [

9,

10,

13,

14].

While several rodent developmental toxicity investigations have been conducted using exposures to concentrated tap waters [

15,

27,

34,

35], only Uriu-Hare

et al. [

35] reported power calculations. They calculated the sample size necessary to provide 80% power to detect a 50% increase in the number of dams with at least one resorption. The authors subsequently found a significant increase in the incidence of dams with one or more resorptions among those dams drinking unconcentrated tap water compared to dams drinking deionized water. The power calculations developed for this multigenerational bioassay are more detailed than those of Uriu-Hare

et al. [

35]. Other DBP toxicology studies to date show a lack of adverse developmental effects or only marginal or subtle effects on rodent development, however, without estimates of power, these results are difficult to interpret [

15,

24,

34,

35].

Based on the results of the power analyses and the physical constraints of the study, the Four Lab team selected a two-block design for the multigenerational bioassay, assigning 40 and 60 timed-pregnant rats to the control and treatment groups in each block, respectively. The two-block design achieved greater than the desired 80% power at a significance level of 0.05 with respect to pup weight for all the scenarios examined; more than half of the scenarios for the two-block design achieved 100% power. This design ensured that at least one sensitive endpoint (

i.e., pup weight) had maximum power to detect an effect at the low-response region of the dose-response curve. Fetal or pup weight is known to be a sensitive developmental endpoint in animal studies that is often observed at doses below those causing other developmental effects [

36]. This endpoint is also relevant to human health, as low birth weight children have been shown to be at increased risk for developing several chronic sequelae [

37,

38].

For this research, detecting a prenatal loss effect, if present, also was desirable. The two-block design, that optimizes the power for pup weight loss, also provides the most power for prenatal loss from among the possible designs, given the study constraints. This analysis shows that the two-block design provides a modest amount of power (

i.e., at most, 57–58% using the linear model) to detect differences in prenatal loss. Analyses showed that the simplest design, consisting of a single block, would have insufficient power. A second possible single-block design, which bred two females per F

_{1} litter, was eliminated from consideration due to the uncertainty surrounding the inter-litter correlation, and the need for larger quantities of water concentrate. In general, the power associated with the two-block design was approximately twice that of the single-block designs considered. The somewhat large discrepancy between the power for pup weight and prenatal loss was expected and is inevitable in a reproductive toxicity study. This is because, relative to their respective means, the variance for prenatal loss is generally much larger than for pup weight (e.g.,

Table 1). Thus, based on this analysis, the two-block design was considered the best design option for the Four Lab multigenerational bioassay.

The constraints imposed by conducting toxicological investigations with highly complex environmental mixtures in an environmentally relevant medium at environmentally relevant dose levels are not unique to the Four Lab Study. The methodology described here may be applied to appropriately design other toxicology studies with environmentally realistic complex mixtures, as similar constraints likely will be encountered.

This work highlights the importance of considering statistical power in the design of bioassays that evaluate health effects of chemical mixtures in the low-response region of the dose-response curve. Such biostatistical analyses provide meaningful quantitative insights into the trade-offs inherent in the design of studies conducted in the low-response region and provide a clear and logical rationale for choice of study design. These analyses and insights lead to toxicological studies in the low-response region that provide meaningful results and allow for appropriate interpretation of experiments when no observable adverse effect is detected. The conduct of such toxicological studies is critical for improved dose-response assessments of complex chemical mixtures, because they increase understanding of the potential human health effects from exposure to chemical mixtures near environmental exposure levels, which are of increased relevance to human health risk assessment [

39].