We will simulate some data to illustrate the relationship between heritability estimates derived from a regression of offspring phenotype on mid-parent value and those derived from analysis of half-sibs. To do the simulation, we'll assume that there are a large number of loci that affect the trait and that each locus has a small effect. With this assumption, we can approximate the genotype of each individual as a random variable drawn from a normal distribution. The `R` code below implements the following simple algorithm:

1. Select a genotype for the sire at random from a normal distribution with mean \(\mu\) and variance \(\sigma^2_g\). Call that genotype \(x_p\).
2. Select a genotype for the dam at random from a normal distribution with the same mean and variance. Call that genotype \(x_m\).
3. Calculate the mid-parent genotypic value as \(x_{mp} = \frac{x_m + x_p}{2}\).
4. Set the offspring genotype to \(x_{mp}\).
5. Repeat steps #2-#4 `n_dams` times.
6. Repeat steps #1-#5 `n_sires` times.
7. Set phenotypes for the maternal parent, the paternal parent, and the offspring by drawing them from normal distributions with the appropriate means and variance \(\sigma^2_e\).
8. Record the paternal ID, the genotype of each parent, the mid-parent genotypic value, the phenotype of each parent, the mid-parent phenotypic value, and the offspring phenotype.
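Under this scheme the offspring genotype is exactly the mid-parent genotypic value, so the expected slope of the regression of offspring phenotype on mid-parent phenotype equals the heritability. A quick derivation (my addition, using the notation above, with \(z\) denoting phenotypes): since \(\mathrm{Cov}(z_o, z_{mp}) = \mathrm{Var}(x_{mp}) = \sigma^2_g/2\) and \(\mathrm{Var}(z_{mp}) = \sigma^2_g/2 + \sigma^2_e/2\),

\[
\beta = \frac{\mathrm{Cov}(z_o, z_{mp})}{\mathrm{Var}(z_{mp})}
      = \frac{\sigma^2_g/2}{\sigma^2_g/2 + \sigma^2_e/2}
      = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e} .
\]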

NOTE: I am using `set.seed(1234)` to ensure that if you run this code on your own, you'll get the same results as I do. If you erase or comment out that line and run the code, your results will differ slightly from mine.
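If you haven't used `set.seed()` before, here's a minimal illustration of what it does (this snippet is mine and is not part of the simulation):

```
set.seed(1234)
x1 <- rnorm(3)
set.seed(1234)
x2 <- rnorm(3)
identical(x1, x2)  # TRUE: resetting the seed reproduces exactly the same draws
```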

```
library(tidyverse)
```

```
Registered S3 methods overwritten by 'dbplyr':
  method from
  print.tbl_lazy
  print.tbl_sql
── Attaching packages ─────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.4     ✓ dplyr   1.0.7
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   2.0.1     ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
```

```
rm(list = ls())
set.seed(1234)

n_sires <- 100
n_dams <- 10
mu <- 50          # mean genotypic value
sigma_2_g <- 25   # genotypic variance
sigma_2_e <- 9    # environmental variance

dat <- tibble(pat_ID = NA,
              ind_ID = NA,
              mat_g = NA,
              pat_g = NA,
              mid_g = NA,
              mat_p = NA,
              pat_p = NA,
              mid_p = NA,
              off_p = NA)
ct <- 0
for (i in 1:n_sires) {
  ## genotype and phenotype for sire i
  patg <- rnorm(1, mu, sqrt(sigma_2_g))
  patp <- rnorm(1, patg, sqrt(sigma_2_e))
  for (j in 1:n_dams) {
    ct <- ct + 1
    ## genotype and phenotype for dam j
    matg <- rnorm(1, mu, sqrt(sigma_2_g))
    matp <- rnorm(1, matg, sqrt(sigma_2_e))
    ## offspring genotype is the mid-parent genotypic value
    midg <- (matg + patg)/2
    offp <- rnorm(1, midg, sqrt(sigma_2_e))
    dat <- add_row(dat,
                   pat_ID = i,
                   ind_ID = ct,
                   mat_g = matg,
                   pat_g = patg,
                   mid_g = midg,
                   mat_p = matp,
                   pat_p = patp,
                   mid_p = (matp + patp)/2,
                   off_p = offp)
  }
}
## drop the all-NA initialization row and convert IDs to factors
dat <- dat %>%
  filter(!is.na(pat_ID)) %>%
  mutate(pat_ID = factor(pat_ID),
         ind_ID = factor(ind_ID))
## each sire appears n_dams times, so use unique sire records
## when calculating the sire variances
pat_g_var <- var(unique(dat[, c("pat_ID", "pat_g")])$pat_g)
pat_p_var <- var(unique(dat[, c("pat_ID", "pat_p")])$pat_p)
variances <- tibble(Parent = c("Sire", "Dam"),
                    Genotypic = round(c(pat_g_var, var(dat$mat_g)), 3),
                    Environmental = round(c(pat_p_var - pat_g_var,
                                            var(dat$mat_p) -
                                              var(dat$mat_g)), 3),
                    Phenotypic = round(c(pat_p_var, var(dat$mat_p)),
                                       3))
knitr::kable(variances)
```

| Parent | Genotypic | Environmental | Phenotypic |
|---|---|---|---|
| Sire | 26.680 | 5.934 | 32.614 |
| Dam | 23.884 | 9.868 | 33.752 |

As you can see, the simulated genotypic and environmental variances are pretty close to what we specified, i.e., \(\sigma^2_g = 25\) and \(\sigma^2_e = 9\).^{1} You can also see that the phenotypic variance is pretty close to \(\sigma^2_g + \sigma^2_e\), as we expect.
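You can verify these expectations directly from the simulation parameters (a quick check of my own, assuming the objects from the chunk above are still in your workspace):

```
nrow(dat) == n_sires * n_dams        # TRUE: one offspring per sire-dam pair
sigma_2_g + sigma_2_e                # expected phenotypic variance: 34
sigma_2_g / (sigma_2_g + sigma_2_e)  # expected heritability: 25/34, about 0.735
```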

If you're familiar with regression in `R`, you know about `lm()` and `glm()`. I'm going to use a similar function from `rstanarm`, namely `stan_glm()`, to run a Bayesian linear regression. In addition to providing an estimate of the slope of the regression of offspring on mid-parent value, it provides a sense of how precise that estimate is in the form of 90 percent credible intervals.^{2}

```
library(rstanarm)
```

```
Loading required package: Rcpp
Registered S3 method overwritten by 'htmlwidgets':
  method           from
  print.htmlwidget tools:rstudio
This is rstanarm version 2.21.1
- See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
- Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
- For execution on a local, multicore CPU with excess RAM we recommend calling
  options(mc.cores = parallel::detectCores())
```

```
## This line allows rstanarm to run several chains at the same time
## instead of running them sequentially
options(mc.cores = parallel::detectCores())
off_r <- stan_glm(off_p ~ mid_p,
                  data = dat,
                  refresh = 0)
summary(off_r, digits = 3)
```

```
Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      off_p ~ mid_p
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1000
 predictors:   2

Estimates:
              mean   sd    10%    50%    90%
(Intercept) 12.970  1.323 11.273 12.986 14.608
mid_p        0.740  0.027  0.708  0.740  0.775
sigma        3.425  0.076  3.328  3.425  3.524

Fit Diagnostics:
           mean   sd    10%    50%    90%
mean_PPD 49.663  0.149 49.474 49.662 49.853

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse  Rhat  n_eff
(Intercept)   0.020 1.000 4331
mid_p         0.000 1.000 4335
sigma         0.001 1.000 3749
mean_PPD      0.002 1.000 3919
log-posterior 0.028 1.002 1827

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
```
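If you'd rather pull the 90 percent credible interval out directly than read it off the summary table, `rstanarm` provides `posterior_interval()`; a usage sketch with the model fit above:

```
## 90% credible interval for the slope of the offspring-midparent regression
posterior_interval(off_r, prob = 0.90, pars = "mid_p")
```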

We see that the slope of the regression, the `mid_p` line, is 0.740 (0.708, 0.775), while we expect to see \(\frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e} = \frac{25}{34} = 0.735\). That's not bad at all. Now let's look at the half-sib analysis.

Besides providing both point estimates and estimates of uncertainty, `stan_glm()` also sets us up nicely for `stan_glmer()`, which lets us do the same thing with variance components. That's what we'll be doing here. We are fitting a model in which each offspring's phenotype is drawn from a normal distribution with a mean that depends on `pat_ID`.^{3}
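Written out, this is a standard random-intercept model. In the notation below (mine, not from the original text), \(y_{ij}\) is the phenotype of offspring \(j\) of sire \(i\):

\[
y_{ij} = \mu + s_i + \epsilon_{ij}, \qquad
s_i \sim N(0, \sigma^2_{hs}), \qquad
\epsilon_{ij} \sim N(0, \sigma^2_w) .
\]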

```
off_h <- stan_glmer(off_p ~ (1|pat_ID),
                    data = dat,
                    refresh = 0)
summary(off_h, digits = 3,
        pars = c("sigma", "Sigma[pat_ID:(Intercept),(Intercept)]"))
```

```
Model Info:
 function:     stan_glmer
 family:       gaussian [identity]
 formula:      off_p ~ (1 | pat_ID)
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1000
 groups:       pat_ID (100)

Estimates:
                                       mean  sd    10%   50%   90%
sigma                                 3.891 0.090 3.775 3.891 4.006
Sigma[pat_ID:(Intercept),(Intercept)] 5.531 1.027 4.310 5.429 6.907

MCMC diagnostics
                                      mcse  Rhat  n_eff
sigma                                 0.001 1.000 4085
Sigma[pat_ID:(Intercept),(Intercept)] 0.030 1.001 1142

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
```

`sigma` is the standard deviation within a sibship, \(\sigma_w\), and `Sigma[pat_ID:(Intercept),(Intercept)]` is the variance among half-sib families, \(\sigma^2_{hs}\). Since families are grouped by `pat_ID`, the variance among half-sib families is the variance among fathers, which reflects the genetic variance. There are, however, 10 offspring in each half-sib family. Thus, we can estimate the broad-sense heritability with this little function.^{4}

```
heritability <- function(sigma_w, sigma_hs, n_off) {
  h_2 <- sigma_hs*n_off/(sigma_hs*n_off + sigma_w^2)
  return(h_2)
}
## as.data.frame() returns one row per posterior sample, so applying
## heritability() gives a full posterior distribution for the estimate
off_h_mat <- as.data.frame(off_h)
h_2 <- heritability(off_h_mat$sigma,
                    off_h_mat$`Sigma[pat_ID:(Intercept),(Intercept)]`,
                    n_dams)
off_h_mat$h_2 <- h_2
library(ggplot2)
p <- ggplot(off_h_mat, aes(x = h_2)) +
  geom_histogram(bins = 100, alpha = 0.4) +
  geom_vline(xintercept = mean(off_h_mat$h_2),
             color = "red",
             linetype = "dashed") +
  xlab("Heritability") +
  theme_bw()
p
```
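Since `off_h_mat$h_2` holds one value per posterior sample, you can summarize it like any other posterior quantity. For example (my addition, mirroring the 90 percent intervals reported above):

```
mean(off_h_mat$h_2)                             # posterior mean, about 0.78
quantile(off_h_mat$h_2, probs = c(0.05, 0.95))  # 90% credible interval
```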

As you can see, in this simulation the half-sib estimate, 0.781, is pretty close to what we expect to see.^{5}

As part of her dissertation, Nora Mitchell measured several traits in two closely related species of *Protea*.

*Protea punctata* is a large, upright shrub.