This vignette demonstrates three major functions in the `brokenstick`

package: `brokenstick()`

, `predict()`

and `plot()`

. We also need `dplyr`

and `ggplot2`

.

```
require("brokenstick")
require("dplyr")
library("ggplot2")
```

For more elaborate documentation, visit https://growthcharts.org/brokenstick/articles/manual/manual.html.

The `smocc_200`

data in the `brokenstick`

package contain the heights of 200 Dutch children measured on 10 visits at ages 0-2 years.

```
<- brokenstick::smocc_200
data head(data, 3)
```

```
## # A tibble: 3 × 7
## id age sex ga bw hgt hgt_z
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 10001 0 female 40 3960 52 0.575
## 2 10001 0.0821 female 40 3960 55.6 0.888
## 3 10001 0.159 female 40 3960 58.2 0.797
```

Figure 1 dispays the data from the first 500 rows as a set of growth curves of Dutch children. Curves are steeper during the first few months, so child growth is faster for young infants. Note also there are more cross-overs during the first half year, whereas fewer occur later. This means that the relative positions have been settled by the age of 2 years, or - put differently - that the correlation between time points at those ages is high.

```
ggplot(data[1:500, ], aes(x = age, y = hgt_z, group = id, color = as.factor(id))) +
geom_line(size = 0.1) +
geom_point(size = 0.7) +
scale_colour_viridis_d(option = "cividis") +
xlab("Age (years)") +
ylab("Length SDS") +
theme_light() +
theme(legend.position = "none")
```

Figure 2 dispays the same data, but with the vertical axis changed to Standard Deviation Scores (SDS), or \(Z\)-score. The \(Z\)-score is the height corrected for age relative to the Dutch height reference from the Fourth Dutch Growth Study. The \(Z\)-score transformation takes away the major time trend, so all curves are more or less flat. This allows us to see a more detailed assessment of individual growth.

The plots also show how the measurements are clustered around ten ages: birth, 1, 2, 3, 6, 9, 12, 15, 18 and 24 months. While the design was followed rigorously in the study, some variation in timing is inevitable because of weekends, holidays, sickness, and other events. The timing variation poses a problem because we cannot directly compare the measurement between different children (especially for figure 1). Also, we cannot easily construct the “broad” matrix with 10 time point per child.

Of course, we can divide the time axis into ten age groups, and treat all point within the same age group as being measured at the same point. This is probably a good strategy for nicely looking data - as we have here -, but this approach is problematic in data with irregular time intervals, of there are multiple measurement per age group, if the measurement schedules vary by child, or in data combined from studies that employed different designs.

The `brokenstick`

package contains tools to approximate the observed data by a series of connecting straight lines. When these lines closely follow the data, we may replace each trajectory by its values at the breakpoints. The statistical analysis can then be done on the regularised trajectories, which is easier than working with the observed data.

We fit a trivial broken stick model with just one line anchored at the minimum and maximum age, and plot the trajectories of three selected children as follows:

```
set.seed(123)
<- brokenstick(hgt ~ age | id, data)
fit <- c(10001, 10005, 10022)
ids plot(fit, group = ids, what = "all",
xlab = "Age (years)", ylab = "Length (cm)")
```

The following plot displays the same data, but in standardised units so as to increase the analytic resolution:

```
<- brokenstick(hgt_z ~ age | id, data)
fit0 plot(fit0, group = ids, what = "all",
xlab = "Age (years)", ylab = "Length (SDS)")
```

Note that both approximations are quite bad.

The *broken stick model* describes a trajectory by a series of connected straight lines. We first calculate a model with two connected lines. The first line starts at birth and end at the age of exactly 1 years. The second line spans the period between 1 to 2 years. In addition, the lines must connect at the age of 1 year. We estimate and plot the model as follows:

```
<- brokenstick(hgt_z ~ age | id, data = data, knots = 0:3)
fit2 plot(fit2, group = ids, xlab = "Age (years)", ylab = "Length (SDS)")
```

The `fit2`

object holds the parameter estimates of the model:

` fit2`

```
## Class brokenstick (kr)
## Variables hgt_z (outcome), age (predictor), id (group)
## Knots 0 1 2 3
## Mean resid 0.1821095
## R-squared 0.8667319
```

The console output lists the knots of the model, including the left and right boundary knots at 0 and 3. The row of `means`

correspond to the fixed effect estimates of the linear mixed model. We may interpret these as the global means. Next, the output lists the variance-covariance matrix of the random effects. The model contains 4 random effects (3 visits + 1 end knot). Finally, the residual variance is the model error at the subject level. These three set of parameters are enough to reconstruct the broken stick model.

The plot shows that the two-line model is still fairly crude. We refine the model in the first two years by adding a knot for each age at which a visit was scheduled. This model can be run as

```
<- round(c(0, 1, 2, 3, 6, 9, 12, 15, 18, 24)/12, 4)
knots <- brokenstick(hgt_z ~ age | id, data = data,
fit9 knots = knots, boundary = c(0, 3))
```

This optimization problem is more difficult, so it takes longer to run. In addition, it is common that the optimization routines issue a number of warnings related to the number of random effects relative to the number of observations. While these may appear a little discomforting, we have found that the warnings are generally somewhat at the conservative side, so the parameter estimates are probably still OK.

The nine-line broken stick model fits the observed data very well.

The `predict()`

function allows us to obtain various types of predictions from the broken stick model. The simplest call

```
<- predict(fit2)
p1 head(p1)
```

```
## .pred
## 1 0.7183916
## 2 0.6590848
## 3 0.6036788
## 4 0.5344755
## 5 0.3544602
## 6 0.1745172
```

`identical(nrow(data), nrow(p1))`

`## [1] TRUE`

produces a `tibble`

with the one column called `.pred`

for each row in `data`

. We can bind column `.pred`

to `data`

for further processing.

Sometimes, we also want the prediction at the knot values, for example, to create graphs that contain observed and modelled trajectories. We obtain predictions at the knots by the special `x = "knots"`

argument, e.g.

```
<- predict(fit2, x = "knots")
p2 head(p2)
```

```
## .source id age sex ga bw hgt hgt_z .pred
## 1 added 10001 0 <NA> NA NA NA NA 0.718391617
## 2 added 10001 1 <NA> NA NA NA NA -0.003981157
## 3 added 10001 2 <NA> NA NA NA NA 0.042961066
## 4 added 10001 3 <NA> NA NA NA NA 0.478864260
## 5 added 10002 0 <NA> NA NA NA NA -0.230643864
## 6 added 10002 1 <NA> NA NA NA NA -0.381720863
```

`nrow(p2)`

`## [1] 800`

The output is more verbose and includes the grid of knots for each child (`id`

, `age`

). The column `.source`

is equal to `added`

as rows are added to the data.

Note there are also knots at ages 0.00 and 3.00 years. These are boundary knots, and added by the `brokenstick()`

function. By default, the boundary knots span the age range in the data. The estimate for the knot at the maximum age has no useful interpretation, and should be ignored.

If we wish to obtain estimates at both the knots and the observed data use:

```
<- predict(fit2, x = "knots", strip_data = FALSE)
p3 table(p3$.source)
```

```
##
## added data
## 800 1942
```

This return 1942 rows for the data and 800 rows for the knots.

The proportion of the variance of the outcome explained by the two-line model is

`get_r2(fit2)`

`## [1] 0.8667319`

For the second model we get

`get_r2(fit9)`

`## [1] 0.9846596`

so the nine-line broken stick model explains about 98 percent of the variance of the height SDS.

Suppose we are interest in knowing the effect of sex, gestational age and birth weight on the height SDS at the age of 2 years. This is an analysis at the subject level. Let us first extract the subject-level data with variables that vary over subjects only.

```
<- data %>%
subj select(id, sex, ga, bw) %>%
group_by(id) %>%
slice(1)
head(subj, 3)
```

```
## # A tibble: 3 × 4
## # Groups: id [3]
## id sex ga bw
## <dbl> <chr> <dbl> <dbl>
## 1 10001 female 40 3960
## 2 10002 male 38 3210
## 3 10003 female 40 4170
```

We also need the outcome variable. We take it from the broken stick estimates from the nine line solution and append it to the subject level data.

```
<- predict(fit9, x = "knots", shape = "wide")
bs <- bind_cols(subj, select(bs, -id))
data head(data, 3)
```

```
## # A tibble: 3 × 15
## # Groups: id [3]
## id sex ga bw `0` `0.0833` `0.1667` `0.25` `0.5` `0.75` `1`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10001 female 40 3960 0.571 0.871 0.737 0.662 0.233 -0.191 0.130
## 2 10002 male 38 3210 -0.182 -0.330 -0.254 -0.320 -0.216 -0.279 -0.341
## 3 10003 female 40 4170 1.16 2.12 1.98 2.03 2.04 1.83 1.33
## # … with 4 more variables: `1.25` <dbl>, `1.5` <dbl>, `2` <dbl>, `3` <dbl>
```

The names of the columns in `bs`

correspond to the knot values.

The effect of the subject’s sex, gestational age and birth weight on the height SDS at the age of 2 years (here denoted by the variable named `2`

) can be estimated as

```
<- lm(`2` ~ sex + ga + I(bw / 1000), data = data)
fit1_lm summary(fit1_lm)
```

```
##
## Call:
## lm(formula = `2` ~ sex + ga + I(bw/1000), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.96585 -0.46305 0.06656 0.63552 2.05131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.30186 1.48019 0.880 0.38020
## sexmale 0.01451 0.12930 0.112 0.91078
## ga -0.07418 0.04359 -1.702 0.09038 .
## I(bw/1000) 0.50933 0.14295 3.563 0.00046 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8849 on 196 degrees of freedom
## Multiple R-squared: 0.06722, Adjusted R-squared: 0.05294
## F-statistic: 4.708 on 3 and 196 DF, p-value: 0.003386
```

Note that the analysis shows there is a substantial effect of birth weight. Of course, it might be that birth weight is directly related to height at the age of 2 years. Alternatively, the relation could be mediated by birth length. The following model adds birth length (the variable named `0`

) to the model:

```
<- lm(`2` ~ sex + ga + I(bw / 1000) + `0`, data = data)
fit2_lm summary(fit2_lm)
```

```
##
## Call:
## lm(formula = `2` ~ sex + ga + I(bw/1000) + `0`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.00784 -0.50862 0.05078 0.62563 1.89610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.427186 1.673835 2.048 0.0419 *
## sexmale 0.003071 0.127536 0.024 0.9808
## ga -0.093100 0.043585 -2.136 0.0339 *
## I(bw/1000) 0.131026 0.202911 0.646 0.5192
## `0` 0.236379 0.091229 2.591 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8723 on 195 degrees of freedom
## Multiple R-squared: 0.09826, Adjusted R-squared: 0.07976
## F-statistic: 5.312 on 4 and 195 DF, p-value: 0.0004414
```

The effect of birth length on length at age 2 is very strong. There is no separate effect of birth weight anymore, so this analysis suggests that the relation between birth weight and length at age 2 can be explained by their mutual associations to birth length.

This vignette illustrated the use of the `brokenstick()`

, `plot()`

and `predict()`

functions. Other vignettes highlight various other capabilities of the package.