---
title: "Beta Regression in R"
author:
- name: Francisco Cribari-Neto
affiliation: Universidade Federal de Pernambuco
address: Brazil
orcid: 0000-0002-5909-6698
email: cribari@ufpe.br
url: https://www.de.ufpe.br/~cribari/
- name: Achim Zeileis
affiliation: Universität Innsbruck
address: Austria
orcid: 0000-0003-0918-3766
email: Achim.Zeileis@R-project.org
url: https://www.zeileis.org/
format:
html:
html-math-method: mathjax
toc: true
number-sections: true
abstract: >
This introduction to the R package **betareg** is a (slightly)
modified version of @betareg:Cribari-Neto+Zeileis:2010, published in the
_Journal of Statistical Software_. A follow-up paper with various
extensions is @betareg:Gruen+Kosmidis+Zeileis:2012 -- a slightly modified
version of which is also provided within the package as
`vignette("betareg-ext", package = "betareg")`\
The class of beta regression models is commonly used by
practitioners to model variables that assume values in the standard
unit interval $(0, 1)$. It is based on the assumption that the
dependent variable is beta-distributed and that its mean is related to a set
of regressors through a linear predictor with unknown coefficients
and a link function. The model also includes a precision parameter
which may be constant or depend on a (potentially different)
set of regressors through a link function as well. This approach naturally
incorporates features such as heteroskedasticity or skewness
which are commonly observed in data taking values in the standard
unit interval, such as rates or proportions. This paper describes
the **betareg** package which provides the class of beta regressions
in the R system for statistical computing. The
underlying theory is briefly outlined, the implementation discussed
and illustrated in various replication exercises.
bibliography: betareg.bib
vignette: >
%\VignetteIndexEntry{Beta Regression in R}
%\VignetteEngine{quarto::html}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{stats,betareg,car,lmtest,sandwich,strucchange}
%\VignetteKeywords{beta regression, rates, proportions, R}
%\VignettePackage{betareg}
---
```{r preliminaries, include = FALSE}
library("betareg")
data("GasolineYield", package = "betareg")
data("FoodExpenditure", package = "betareg")
gy_loglog <- betareg(yield ~ batch + temp, data = GasolineYield,
link = "loglog")
fe_beta2 <- betareg(I(food/income) ~ income + persons | persons,
data = FoodExpenditure)
knitr::opts_chunk$set(
engine = "R",
collapse = TRUE,
comment = "##",
message = FALSE,
warning = FALSE,
echo = TRUE
)
options(width = 70, prompt = "R> ", continue = "+ ", useFancyQuotes = FALSE, digits = 5)
```
## Introduction {#sec-intro}
How should one perform a regression analysis in which the dependent variable
(or response variable), $y$, assumes values in the standard unit interval $(0,1)$? The usual
practice used to be to transform the data so that the transformed response, say
$\tilde y$, assumes values in the real line and then apply a standard linear
regression analysis. A commonly used transformation is the logit, $\tilde y =
\log(y/(1-y))$. This approach, nonetheless, has shortcomings. First, the
regression parameters are interpretable in terms of the mean of $\tilde y$, and
not in terms of the mean of $y$ (given Jensen's inequality). Second, regressions
involving data from the unit interval such as rates and proportions are
typically heteroskedastic: they display more
variation around the mean and less variation as we approach the lower and upper
limits of the standard unit interval. Finally, the distributions of rates and
proportions are typically asymmetric, and thus Gaussian-based approximations for
interval estimation and hypothesis testing can be quite inaccurate in small
samples. @betareg:Ferrari+Cribari-Neto:2004 proposed a regression model
for continuous variates that assume values in the standard unit interval, e.g.,
rates, proportions, or concentration indices. Since the model is
based on the assumption that the response is beta-distributed, they called their
model _the beta regression model_. In their model, the regression
parameters are interpretable in terms of the mean of $y$ (the variable of
interest) and the model is naturally heteroskedastic and easily
accomodates asymmetries. A variant of the beta regression model that allows for
nonlinearities and variable dispersion was proposed by
@betareg:Simas+Barreto-Souza+Rocha:2010. In particular, in this more
general model, the parameter accounting for the precision of the data is not
assumed to be constant across observations but it is allowed to vary, leading to the
_variable dispersion beta regression model_.
The chief motivation for the beta regression model lies in the flexibility
delivered by the assumed beta law. The beta density can assume a number of
different shapes depending on the combination of parameter values, including
left- and right-skewed or the flat shape of the uniform density (which is a
special case of the more general beta density). This is illustrated in
@fig-beta-distributions which depicts several different beta densities.
Following @betareg:Ferrari+Cribari-Neto:2004, the densities are
parameterized in terms of the mean $\mu$ and the precision parameter $\phi$;
all details are explained in the next section. The evident flexiblity makes
the beta distribution an attractive candidate for data-driven statistical modeling.
```{r beta-distributions}
#| echo: false
#| fig-width: 8.5
#| fig-height: 4.5
#| out-width: 100%
#| label: fig-beta-distributions
#| fig-cap: "Probability density functions for beta distributions with varying parameters $\\mu = 0.10, 0.25, 0.50, 0.75, 0.90$ and $\\phi = 5$ (left) and $\\phi = 100$ (right)."
par(mfrow = c(1, 2), mar = c(4.1, 4.1, 4.1, 0.1))
dbeta2 <- function(x, mu, phi = 1) dbeta(x, mu * phi, (1 - mu) * phi)
x <- seq(from = 0.01, to = 0.99, length = 200)
xx <- cbind(x, x, x, x, x)
yy <- cbind(
dbeta2(x, 0.10, 5),
dbeta2(x, 0.25, 5),
dbeta2(x, 0.50, 5),
dbeta2(x, 0.75, 5),
dbeta2(x, 0.90, 5)
)
matplot(xx, yy, type = "l", xlab = "y", ylab = "Density", main = expression(phi == 5),
lty = 1, col = "black", ylim = c(0, 15))
text(0.05, 12 , "0.10")
text(0.95, 12 , "0.90")
text(0.22, 2.8, "0.25")
text(0.78, 2.8, "0.75")
text(0.50, 2.3, "0.50")
yy <- cbind(
dbeta2(x, 0.10, 100),
dbeta2(x, 0.25, 100),
dbeta2(x, 0.50, 100),
dbeta2(x, 0.75, 100),
dbeta2(x, 0.90, 100)
)
matplot(xx, yy, type = "l", xlab = "y", ylab = "", main = expression(phi == 100),
lty = 1, col = "black", ylim = c(0, 15))
text(0.10, 14.5, "0.10")
text(0.90, 14.5, "0.90")
text(0.25, 9.8, "0.25")
text(0.75, 9.8, "0.75")
text(0.50, 8.6, "0.50")
```
The idea underlying beta regression models dates back to earlier approaches such as
@betareg:Williams:1982 or @betareg:Prentice:1986. The initial
motivation was to model binomial random variables with extra variation. The
model postulated for the (discrete) variate of interest included a more flexible
variation structure determined by independent beta-distributed variables which
are related to a set of independent variables through a regression structure.
However, unlike the more recent literature, the main focus
was to model binomial random variables. Our interest in what follows will be
more closely related to the recent literature, i.e., modeling continous random variables that
assume values in $(0,1)$, such as rates, proportions, and concentration or
inequality indices (e.g., Gini).
In this paper, we describe the **betareg** package which can
be used to perform inference in both fixed and variable dispersion beta
regressions. The package is implemented in the R system
for statistical computing [@betareg:R:2009] and available
from the Comprehensive R Archive Network (CRAN) at
. The initial version of the
package was written by @betareg:Simas+Rocha:2006 up to version 1.2
which was orphaned and archived on CRAN in mid-2009. Starting from version 2.0-0,
Achim Zeileis took over maintenance after rewriting/extending the package's
functionality.
The paper unfolds as follows:
@sec-model outlines the theory underlying the beta regression model
before @sec-implementation describes its implementation in R.
@sec-illustrations and @sec-replications provide various
empirical applications: The former focuses on illustrating various aspects
of beta regressions in practice while the latter provides further replications
of previously published empirical research. Finally, @sec-conclusion
contains concluding remarks and directions for future research and
implementation.
## Beta regression {#sec-model}
The class of beta regression models, as introduced by
@betareg:Ferrari+Cribari-Neto:2004, is useful for modeling continuous
variables $y$ that assume values in the open standard unit interval $(0,1)$.
Note that if the variable takes on values in $(a, b)$ (with $a < b$ known) one
can model $(y - a)/(b - a)$. Furthermore, if $y$ also assumes the extremes $0$ and $1$,
a useful transformation in practice is $(y \cdot (n - 1) + 0.5) / n$
where $n$ is the sample size [@betareg:Smithson+Verkuilen:2006].
The beta regression model is based on an alternative parameterization of the beta density in
terms of the variate mean and a precision parameter. The beta density is
usually expressed as
$$
f(y;p,q) = \frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)}y^{p-1}(1-y)^{q-1}, \quad 00$ and $\Gamma(\cdot)$ is the gamma function.^[A beta regression model
based on this parameterization was proposed by @betareg:Vasconcellos+Cribari-Neto:2005.
We shall, however, focus on the parameterization indexed by the mean and a precision
parameter.]
@betareg:Ferrari+Cribari-Neto:2004 proposed a different parameterization
by setting $\mu = p/(p+q)$ and $\phi = p+q$:
$$
f(y;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma((1-\mu)\phi)}y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1}, \quad 00$.
We write $y \,\sim\, \mathcal{B}(\mu, \phi)$. Here, $\text{E}(y) = \mu$ and $\text{Var}(y) = \mu(1-\mu)/(1+\phi)$. The
parameter $\phi$ is known as the precision parameter since, for fixed $\mu$, the larger $\phi$ the smaller the variance of $y$; $\phi^{-1}$ is a dispersion parameter.
Let $y_1,\ldots,y_n$ be a random sample such that $y_i \sim
{\mathcal{B}}(\mu_i,\phi)$, $i=1,\ldots,n$. The beta regression model is
defined as
$$
g(\mu_i) = x_{i}^\top \beta = \eta_i,
$$
where $\beta=(\beta_1,\ldots,\beta_k)^\top$ is a $k \times 1$ vector of unknown
regression parameters ($k