The surveytable
package provides short and
understandable commands that generate tabulated, formatted, and rounded
survey estimates.
There are two important concepts that we need to learn and distinguish:
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
A data frame, in an of itself, cannot represent a complex survey. This is because, just by looking at a data frame, R does not know what the sampling weights are, what the strata are, etc. Even if the variables that represent the sampling weights, etc, are part of the data frame, just by looking at the data frame, R does not know which variable represents the weights or other survey design variables.
You can get a data frame into R in many different ways. If your data
is currently in a comma-separated values (CSV) file, you can use
read.csv()
. If it’s in a SAS file, you can use a package
like haven
or importsurvey. If it’s
already in R format, use readRDS()
, and so on.
survey::svydesign()
function; if a survey uses replicate
weights, the survey::svrepdesign()
function should be
used.Generally speaking, you only need to convert a data frame to a survey
object once. After it has been converted, you can save it with
saveRDS()
(or similar). In the future, you can load it with
readRDS()
. You do not need to re-convert a data frame to a
survey object every time.
Examples in this tutorial use a survey called the National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File (PUF). NAMCS is “an annual nationally representative sample survey of visits to non-federal office-based patient care physicians, excluding anesthesiologists, radiologists, and pathologists.” Note that the unit of observation is visits, not patients – this distinction is important since a single patient can make multiple visits.
The surveytable
package comes with a data frame of
selected variables from NAMCS, called namcs2019sv_df
(sv
= selected variables; df
= data frame).
The survey object of this survey is called namcs2019sv
.
namcs2019sv
is the object that we analyze. You really
only need namcs2019sv
. The reason that the package has
namcs2019sv_df
is to illustrate how to convert the data
frame to the survey object.
When importing data from another source, such as SAS or CSV, analysts should be aware of the standard way in which variables are handled in R.
factor
.factor
as well, some programming tasks are easier if they are stored as
logical
.NA
). If a
variable contains “special values”, such as a negative value indicating
that the age is missing, those “special values” need to be converted to
NA
.Variables in namcs2019sv_df
are already stored
correctly. Thus,
AGER
(patient’s age group) is a factor
variable;PAYNOCHG
(which indicates whether there was no charge
for the physician visit) is a logical
variable; andAGE
(patient’s age in years) is a numeric
variable.As seen below, tables produced by surveytable
are
clearer if either the variable names themselves are descriptive, or if
the variables have the "label"
attribute that is
descriptive. In namcs2019sv_df
, all variables already have
the "label"
attribute set. For example, while the variable
name AGE
itself is not very descriptive, the variable does
have a more descriptive "label"
attribute:
Documentation for the NAMCS survey provides the names of the survey design variables. Specifically, in NAMCS,
CPSUM
;CSTRATM
; andPATWT
.Thus, the namcs2019sv_df
data frame can be turned into a
survey object as follows:
mysurvey = survey::svydesign(ids = ~ CPSUM
, strata = ~ CSTRATM
, weights = ~ PATWT
, data = namcs2019sv_df)
Tables produced by surveytable
are clearer if either the
name of the survey object is descriptive, or if the object has the
"label"
attribute that is descriptive. Let’s set this
attribute for mysurvey
:
The mysurvey
object should now be identical to
namcs2019sv
. Let’s verify this:
We have just successfully created a survey object from a data frame.
First, specify the survey object that you’d like to analyze.
Variables | Observations | Design |
---|---|---|
33 | 8,250 |
Stratified 1 - level Cluster Sampling design (with replacement) With (398) clusters. survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, data = namcs2019sv_df) |
Check the survey label, survey design variables, and the number of observations to verify that it all looks correct.
The var_list()
function lists the variables in the
survey. To avoid unintentionally listing all the variables in a survey,
which can be many, the starting characters of variable names are
specified. For example, to list the variables that start with the
letters age
, type:
Variable | Class | Long name |
---|---|---|
AGE | numeric | Patient age in years |
AGER | factor | Patient age recode |
The table lists
Common classes are factor
(categorical variable),
logical
(yes / no variable), and numeric
.
The main function of the surveytable
package is
tab()
, which tabulates variables. It operates on
categorical and logical variables, and presents both estimated counts,
with their standard errors (SEs) and 95% confidence intervals (CIs), and
percentages, with their SEs and CIs. For example, to tabulate
AGER
, type:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 887 | 117,917 | 14,097 | 93,229 | 149,142 | 11.4 | 1.3 | 8.9 | 14.2 |
15-24 years | 542 | 64,856 | 7,018 | 52,387 | 80,292 | 6.3 | 0.6 | 5.1 | 7.5 |
25-44 years | 1,435 | 170,271 | 13,966 | 144,925 | 200,049 | 16.4 | 1.1 | 14.3 | 18.8 |
45-64 years | 2,283 | 309,506 | 23,290 | 266,994 | 358,787 | 29.9 | 1.4 | 27.2 | 32.6 |
65-74 years | 1,661 | 206,866 | 14,366 | 180,481 | 237,109 | 20 | 1.2 | 17.6 | 22.5 |
75 years and over | 1,442 | 167,069 | 15,179 | 139,746 | 199,735 | 16.1 | 1.3 | 13.7 | 18.8 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
The table title shows the variable label (the long variable name) and the survey label.
For each level of the variable, the table shows:
NCHS presentation standards. The tab()
function also applies the National Center for Health Statistics (NCHS)
presentation standards for counts and percentages, and flags estimates
if, according to the standards, they should be suppressed, footnoted, or
reviewed by an analyst. The CIs that are displayed are the ones that are
used by the NCHS presentation standards. Specifically, for counts, the
tables show the log Student’s t 95% CI, with adaptations for complex
surveys; for percentages, they show the 95% Korn and Graubard CI.
One does not need to do anything extra to perform presentation
standards checking – it is performed automatically. For example, let’s
tabulate PAYNOCHG
:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL | Flags |
---|---|---|---|---|---|---|---|---|---|---|
FALSE | 8,234 | 1,034,338 | 48,874 | 942,808 | 1,134,754 | 99.8 | 0.2 | 99 | 100 | |
TRUE | 16 | 2,146 | 1,919 | 293 | 15,703 | 0.2 | 0.2 | 0 | 1 | Cx |
N = 8250. Checked NCHS presentation standards: Cx: suppress count (and rate). |
This table tells us that, according to the NCHS presentation standards, the estimated number of visits in which there was no charge for the visit should be suppressed due to low precision. However, the lack of a percentage flag indicates that the estimated percentage of such visits can be shown.
Drop missing values. Some variables might contain
missing values (NA
). Consider the following variable, which
is not part of the actual survey, but was constructed specifically for
this example:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Primary care specialty | 2,406 | 422,807 | 26,382 | 374,099 | 477,857 | 40.8 | 2.2 | 36.5 | 45.2 |
Surgical care specialty | 2,444 | 170,714 | 23,333 | 130,514 | 223,297 | 16.5 | 2.3 | 12.2 | 21.5 |
Medical care specialty | 1,750 | 235,502 | 35,527 | 175,049 | 316,831 | 22.7 | 2.9 | 17.2 | 29 |
<N/A> | 1,650 | 207,462 | 12,458 | 184,378 | 233,436 | 20 | 0.8 | 18.5 | 21.6 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
To calculate percentages based on the non-missing values only, use
the drop_na
argument:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Primary care specialty | 2,406 | 422,807 | 26,382 | 374,099 | 477,857 | 51 | 2.6 | 45.7 | 56.3 |
Surgical care specialty | 2,444 | 170,714 | 23,333 | 130,514 | 223,297 | 20.6 | 2.9 | 15.2 | 26.9 |
Medical care specialty | 1,750 | 235,502 | 35,527 | 175,049 | 316,831 | 28.4 | 3.6 | 21.5 | 36.2 |
N = 6600. Checked NCHS presentation standards. Nothing to report. |
The above table gives percentages based only on the knowns, that is,
based only on non-NA
values.
Multiple tables. Multiple tables can be created with a single command:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
M.D. - Doctor of Medicine | 7,498 | 980,280 | 48,388 | 889,842 | 1,079,910 | 94.6 | 0.7 | 93.1 | 95.8 |
D.O. - Doctor of Osteopathy | 752 | 56,204 | 6,602 | 44,597 | 70,832 | 5.4 | 0.7 | 4.2 | 6.9 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Primary care specialty | 2,993 | 521,466 | 31,136 | 463,840 | 586,252 | 50.3 | 2.6 | 45.1 | 55.5 |
Surgical care specialty | 3,050 | 214,832 | 31,110 | 161,661 | 285,490 | 20.7 | 3 | 15.1 | 27.3 |
Medical care specialty | 2,207 | 300,186 | 43,497 | 225,806 | 399,067 | 29 | 3.6 | 22.1 | 36.6 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
MSA (Metropolitan Statistical Area) | 7,496 | 973,676 | 50,515 | 879,490 | 1,077,947 | 93.9 | 1.7 | 89.7 | 96.8 |
Non-MSA | 754 | 62,809 | 17,549 | 36,249 | 108,830 | 6.1 | 1.7 | 3.2 | 10.3 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Estimate the total count for the entire population using the
total()
command:
n | Number (000) | SE (000) | LL (000) | UL (000) |
---|---|---|---|---|
8,250 | 1,036,484 | 48,836 | 945,014 | 1,136,809 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
To create a table of AGER
for each value of the variable
SEX
, type:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 434 | 59,958 | 7,206 | 47,318 | 75,974 | 9.9 | 1.2 | 7.6 | 12.6 |
15-24 years | 346 | 41,128 | 4,532 | 33,066 | 51,156 | 6.8 | 0.7 | 5.4 | 8.4 |
25-44 years | 923 | 113,708 | 11,461 | 93,256 | 138,646 | 18.8 | 1.6 | 15.8 | 22.1 |
45-64 years | 1,253 | 175,978 | 16,009 | 147,153 | 210,450 | 29.1 | 1.7 | 25.8 | 32.6 |
65-74 years | 891 | 120,099 | 11,066 | 100,171 | 143,992 | 19.8 | 1.5 | 17 | 22.9 |
75 years and over | 762 | 94,173 | 11,085 | 74,682 | 118,751 | 15.6 | 1.5 | 12.8 | 18.7 |
N = 4609. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 453 | 57,959 | 7,728 | 44,570 | 75,371 | 13.4 | 1.7 | 10.3 | 17.1 |
15-24 years | 196 | 23,728 | 4,344 | 16,457 | 34,210 | 5.5 | 0.8 | 4 | 7.3 |
25-44 years | 512 | 56,562 | 7,277 | 43,861 | 72,942 | 13.1 | 1.3 | 10.7 | 15.8 |
45-64 years | 1,030 | 133,528 | 12,956 | 110,319 | 161,619 | 30.9 | 1.6 | 27.8 | 34.3 |
65-74 years | 770 | 86,766 | 6,767 | 74,409 | 101,176 | 20.1 | 1.5 | 17.3 | 23.1 |
75 years and over | 680 | 72,896 | 6,661 | 60,872 | 87,296 | 16.9 | 1.5 | 14 | 20.2 |
N = 3641. Checked NCHS presentation standards. Nothing to report. |
In addition to giving the long name of the variable being tabulated,
the title of each table reflects the value of the subsetting variable
(in this case, either Female
or Male
).
With the tab_subset()
command, in each table (that is,
in each subset), the percentages add up to 100%.
The tab_cross()
function is similar – it crosses or
interacts two variables and generates a table using this new variable.
Thus, to create a table of the interaction of AGER
and
SEX
, type:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years: Female | 434 | 59,958 | 7,206 | 47,318 | 75,974 | 5.8 | 0.7 | 4.5 | 7.3 |
15-24 years: Female | 346 | 41,128 | 4,532 | 33,066 | 51,156 | 4 | 0.4 | 3.2 | 4.9 |
25-44 years: Female | 923 | 113,708 | 11,461 | 93,256 | 138,646 | 11 | 1 | 9 | 13.2 |
45-64 years: Female | 1,253 | 175,978 | 16,009 | 147,153 | 210,450 | 17 | 1.1 | 14.9 | 19.3 |
65-74 years: Female | 891 | 120,099 | 11,066 | 100,171 | 143,992 | 11.6 | 1 | 9.7 | 13.7 |
75 years and over: Female | 762 | 94,173 | 11,085 | 74,682 | 118,751 | 9.1 | 0.9 | 7.3 | 11.1 |
Under 15 years: Male | 453 | 57,959 | 7,728 | 44,570 | 75,371 | 5.6 | 0.7 | 4.3 | 7.2 |
15-24 years: Male | 196 | 23,728 | 4,344 | 16,457 | 34,210 | 2.3 | 0.4 | 1.6 | 3.2 |
25-44 years: Male | 512 | 56,562 | 7,277 | 43,861 | 72,942 | 5.5 | 0.6 | 4.3 | 6.8 |
45-64 years: Male | 1,030 | 133,528 | 12,956 | 110,319 | 161,619 | 12.9 | 1 | 10.9 | 15.1 |
65-74 years: Male | 770 | 86,766 | 6,767 | 74,409 | 101,176 | 8.4 | 0.6 | 7.2 | 9.7 |
75 years and over: Male | 680 | 72,896 | 6,661 | 60,872 | 87,296 | 7 | 0.6 | 5.9 | 8.3 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
While the estimated counts produced by tab_subset()
and
tab_cross()
are the same, the percentages are
different.
tab_subset()
command, within each table (that
is, within each subset), the percentages add up to 100%.tab_cross()
, the percentages
across the entire population add up to 100%.The tab()
and tab_subset()
functions also
work with numeric variables, though with such variables, the output is
different. To tabulate NUMMED
(number of medications), a
numeric variable, type:
% known | Mean | SEM | SD |
---|---|---|---|
100 | 3.46 | 0.268 | 4.43 |
As before, the table title shows the variable label (the long variable name) and the survey label.
The table shows the percentage of values that are not missing (not
NA
), the mean, the standard error of the mean (SEM), and
the standard deviation (SD).
Subsetting works too:
Level | % known | Mean | SEM | SD |
---|---|---|---|---|
Under 15 years | 100 | 1.58 | 0.168 | 1.75 |
15-24 years | 100 | 1.64 | 0.112 | 1.7 |
25-44 years | 100 | 2.15 | 0.225 | 2.74 |
45-64 years | 100 | 3.49 | 0.303 | 4.49 |
65-74 years | 100 | 4.44 | 0.431 | 5.03 |
75 years and over | 100 | 5.53 | 0.494 | 5.59 |
The tab_subset()
function makes it easy to perform
hypothesis testing by using the test
argument. When the
argument is TRUE
, a test of association is performed. In
addition, t-tests for all pairs of levels are performed as well.
Consider the relationship between AGER
an
SPECCAT
:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 609 | 102,720 | 14,137 | 78,373 | 134,631 | 19.7 | 2.5 | 15 | 25.1 |
15-24 years | 247 | 40,808 | 4,941 | 32,127 | 51,835 | 7.8 | 0.8 | 6.3 | 9.6 |
25-44 years | 626 | 95,305 | 9,118 | 78,964 | 115,027 | 18.3 | 1.5 | 15.4 | 21.5 |
45-64 years | 726 | 124,384 | 10,371 | 105,582 | 146,535 | 23.9 | 1.5 | 20.9 | 27 |
65-74 years | 411 | 85,504 | 10,210 | 67,581 | 108,182 | 16.4 | 1.6 | 13.4 | 19.7 |
75 years and over | 374 | 72,745 | 9,886 | 55,660 | 95,073 | 14 | 1.6 | 11 | 17.4 |
N = 2993. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 191 | 6,201 | 1,359 | 4,017 | 9,571 | 2.9 | 0.7 | 1.6 | 4.7 |
15-24 years | 129 | 8,561 | 1,622 | 5,824 | 12,585 | 4 | 0.6 | 2.9 | 5.3 |
25-44 years | 435 | 35,953 | 11,539 | 18,976 | 68,119 | 16.7 | 3.6 | 10.2 | 25.1 |
45-64 years | 900 | 73,204 | 12,475 | 52,307 | 102,450 | 34.1 | 1.6 | 31 | 37.2 |
65-74 years | 787 | 53,482 | 6,405 | 42,227 | 67,736 | 24.9 | 2.8 | 19.6 | 30.8 |
75 years and over | 608 | 37,431 | 6,364 | 26,763 | 52,352 | 17.4 | 2.4 | 12.9 | 22.8 |
N = 3050. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 87 | 8,996 | 3,158 | 4,330 | 18,690 | 3 | 1 | 1.4 | 5.6 |
15-24 years | 166 | 15,487 | 5,035 | 7,908 | 30,326 | 5.2 | 1.4 | 2.8 | 8.5 |
25-44 years | 374 | 39,012 | 6,892 | 27,419 | 55,507 | 13 | 1.3 | 10.5 | 15.8 |
45-64 years | 657 | 111,918 | 22,754 | 74,907 | 167,215 | 37.3 | 3.1 | 31.1 | 43.8 |
65-74 years | 463 | 67,880 | 10,945 | 49,319 | 93,425 | 22.6 | 2.9 | 17.1 | 28.9 |
75 years and over | 460 | 56,894 | 10,806 | 39,008 | 82,980 | 19 | 3.3 | 12.8 | 26.5 |
N = 2207. Checked NCHS presentation standards. Nothing to report. |
p-value | Flag |
---|---|
0 | * |
Pearson’s X^2: Rao & Scott adjustment. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Under 15 years | 15-24 years | 0 | * |
Under 15 years | 25-44 years | 0.67 | |
Under 15 years | 45-64 years | 0.259 | |
Under 15 years | 65-74 years | 0.334 | |
Under 15 years | 75 years and over | 0.083 | |
15-24 years | 25-44 years | 0 | * |
15-24 years | 45-64 years | 0 | * |
15-24 years | 65-74 years | 0 | * |
15-24 years | 75 years and over | 0.002 | * |
25-44 years | 45-64 years | 0.007 | * |
25-44 years | 65-74 years | 0.461 | |
25-44 years | 75 years and over | 0.092 | |
45-64 years | 65-74 years | 0.001 | * |
45-64 years | 75 years and over | 0 | * |
65-74 years | 75 years and over | 0.194 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Under 15 years | 15-24 years | 0.221 | |
Under 15 years | 25-44 years | 0 | * |
Under 15 years | 45-64 years | 0 | * |
Under 15 years | 65-74 years | 0 | * |
Under 15 years | 75 years and over | 0 | * |
15-24 years | 25-44 years | 0 | * |
15-24 years | 45-64 years | 0 | * |
15-24 years | 65-74 years | 0 | * |
15-24 years | 75 years and over | 0 | * |
25-44 years | 45-64 years | 0 | * |
25-44 years | 65-74 years | 0.196 | |
25-44 years | 75 years and over | 0.904 | |
45-64 years | 65-74 years | 0.027 | * |
45-64 years | 75 years and over | 0 | * |
65-74 years | 75 years and over | 0.007 | * |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Under 15 years | 15-24 years | 0.112 | |
Under 15 years | 25-44 years | 0 | * |
Under 15 years | 45-64 years | 0 | * |
Under 15 years | 65-74 years | 0 | * |
Under 15 years | 75 years and over | 0 | * |
15-24 years | 25-44 years | 0 | * |
15-24 years | 45-64 years | 0 | * |
15-24 years | 65-74 years | 0 | * |
15-24 years | 75 years and over | 0 | * |
25-44 years | 45-64 years | 0 | * |
25-44 years | 65-74 years | 0 | * |
25-44 years | 75 years and over | 0.139 | |
45-64 years | 65-74 years | 0.009 | * |
45-64 years | 75 years and over | 0.003 | * |
65-74 years | 75 years and over | 0.4 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0 | * |
Primary care specialty | Medical care specialty | 0 | * |
Surgical care specialty | Medical care specialty | 0.364 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0 | * |
Primary care specialty | Medical care specialty | 0.002 | * |
Surgical care specialty | Medical care specialty | 0.124 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0.001 | * |
Primary care specialty | Medical care specialty | 0 | * |
Surgical care specialty | Medical care specialty | 0.848 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0.005 | * |
Primary care specialty | Medical care specialty | 0.631 | |
Surgical care specialty | Medical care specialty | 0.163 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0.006 | * |
Primary care specialty | Medical care specialty | 0.248 | |
Surgical care specialty | Medical care specialty | 0.298 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0.003 | * |
Primary care specialty | Medical care specialty | 0.291 | |
Surgical care specialty | Medical care specialty | 0.099 | |
Design-based t-test. *: p <= 0.05 |
According to these tables, there is an association between physician specialty type and patient age. For instance, for patients under 15 years, there is a statistical difference between primary care physician specialty and medical care specialty. But for older patients, such as in the 45-64 age group, there is no statistical difference between the two specialty types.
As another example, consider the relationship between
MRI
and SPECCAT
:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL | Flags |
---|---|---|---|---|---|---|---|---|---|---|
FALSE | 2,973 | 515,172 | 30,724 | 458,304 | 579,096 | 98.8 | 0.5 | 97.3 | 99.6 | |
TRUE | 20 | 6,295 | 2,768 | 2,295 | 17,268 | 1.2 | 0.5 | 0.4 | 2.7 | Cx |
N = 2993. Checked NCHS presentation standards: Cx: suppress count (and rate). |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
FALSE | 2,968 | 207,915 | 30,117 | 156,442 | 276,323 | 96.8 | 0.7 | 95.1 | 98 |
TRUE | 82 | 6,917 | 1,845 | 3,925 | 12,191 | 3.2 | 0.7 | 2 | 4.9 |
N = 3050. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL | Flags |
---|---|---|---|---|---|---|---|---|---|---|
FALSE | 2,163 | 291,560 | 40,805 | 221,456 | 383,855 | 97.1 | 1.4 | 93 | 99.2 | Pc |
TRUE | 44 | 8,626 | 4,768 | 2,451 | 30,364 | 2.9 | 1.4 | 0.8 | 7 | Cx Px |
N = 2207. Checked NCHS presentation standards: Cx: suppress count (and rate); Px: suppress percent; Pc: footnote percent - complement. |
p-value | Flag |
---|---|
0.169 | |
Pearson’s X^2: Rao & Scott adjustment. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
FALSE | TRUE | 0 | * |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
FALSE | TRUE | 0 | * |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
FALSE | TRUE | 0 | * |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0 | * |
Primary care specialty | Medical care specialty | 0 | * |
Surgical care specialty | Medical care specialty | 0.156 | |
Design-based t-test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0.856 | |
Primary care specialty | Medical care specialty | 0.657 | |
Surgical care specialty | Medical care specialty | 0.733 | |
Design-based t-test. *: p <= 0.05 |
According to these tables, there is no statistical association between MRI and physician specialty. For each of the 3 specialty types, a minority of visits have MRI’s. For the visits with MRI’s, there was no statistical difference between specialty types.
As a general rule of thumb, since there is no statistical association
between MRI and physician specialty, presenting this tabulation would
not be particularly interesting, especially since the subsetting
decreases the sample size and therefore also decreases the estimate
reliability. Instead, it would generally make more sense to just
tabulate MRI
without subsetting by
SPECCAT
.
The relationship between NUMMED
and
AGER
:
Level | % known | Mean | SEM | SD |
---|---|---|---|---|
Under 15 years | 100 | 1.58 | 0.168 | 1.75 |
15-24 years | 100 | 1.64 | 0.112 | 1.7 |
25-44 years | 100 | 2.15 | 0.225 | 2.74 |
45-64 years | 100 | 3.49 | 0.303 | 4.49 |
65-74 years | 100 | 4.44 | 0.431 | 5.03 |
75 years and over | 100 | 5.53 | 0.494 | 5.59 |
p-value | Flag |
---|---|
0 | * |
Wald test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Under 15 years | 15-24 years | 0.739 | |
Under 15 years | 25-44 years | 0.043 | * |
Under 15 years | 45-64 years | 0 | * |
Under 15 years | 65-74 years | 0 | * |
Under 15 years | 75 years and over | 0 | * |
15-24 years | 25-44 years | 0.029 | * |
15-24 years | 45-64 years | 0 | * |
15-24 years | 65-74 years | 0 | * |
15-24 years | 75 years and over | 0 | * |
25-44 years | 45-64 years | 0 | * |
25-44 years | 65-74 years | 0 | * |
25-44 years | 75 years and over | 0 | * |
45-64 years | 65-74 years | 0.007 | * |
45-64 years | 75 years and over | 0 | * |
65-74 years | 75 years and over | 0.002 | * |
Design-based t-test. *: p <= 0.05 |
According to these tables, there is an association between the number
of medications and age category. NUMMED
is statistically
similar for the “Under 15 years” and “15-24 years” AGER
categories. It is statistically different for all other pairs of age
categories.
Finally, let’s look at the relationship between NUMMED
and SPECCAT
:
Level | % known | Mean | SEM | SD |
---|---|---|---|---|
Primary care specialty | 100 | 3.7 | 0.309 | 4.46 |
Surgical care specialty | 100 | 2.87 | 0.616 | 4.59 |
Medical care specialty | 100 | 3.46 | 0.637 | 4.22 |
p-value | Flag |
---|---|
0.52 | |
Wald test. *: p <= 0.05 |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0.254 | |
Primary care specialty | Medical care specialty | 0.738 | |
Surgical care specialty | Medical care specialty | 0.501 | |
Design-based t-test. *: p <= 0.05 |
According to these tables, there is no association between the number
of medications and physician specialty type. NUMMED
is
statistically similar for all pairs of physician specialties.
As a general rule of thumb, since there is no statistical association
between the number of medications and physician specialty, presenting
this tabulation would not be particularly interesting, especially since
the subsetting decreases the sample size and therefore also decreases
the estimate reliability. Instead, it would generally make more sense to
just tabulate NUMMED
without subsetting by
SPECCAT
.
To test whether any pair of SPECCAT
levels is
statistically similar or different, type:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Primary care specialty | 2,993 | 521,466 | 31,136 | 463,840 | 586,252 | 50.3 | 2.6 | 45.1 | 55.5 |
Surgical care specialty | 3,050 | 214,832 | 31,110 | 161,661 | 285,490 | 20.7 | 3 | 15.1 | 27.3 |
Medical care specialty | 2,207 | 300,186 | 43,497 | 225,806 | 399,067 | 29 | 3.6 | 22.1 | 36.6 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Level 1 | Level 2 | p-value | Flag |
---|---|---|---|
Primary care specialty | Surgical care specialty | 0 | * |
Primary care specialty | Medical care specialty | 0 | * |
Surgical care specialty | Medical care specialty | 0.168 | |
Design-based t-test. *: p <= 0.05 |
According to this, surgical and medical care specialties are statistically similar, and are statistically different from primary care.
A rate is a ratio of count estimates based on the survey in question
divided by population size, which is assumed to be known. For example,
the number of physician visits per 100 people in the population is a
rate: the number of physician visits is estimated from the
namcs2019sv
survey, while the number of people in the
population comes from another source.
To calculate rates, in addition to the survey, we need a source of
information on population size. You would typically use a function such
as read.csv()
to load the population figures and get them
into the correct format. The surveytable
package comes with
an object called uspop2019
that contains several population
figures for use in these examples.
Let’s examine uspop2019
:
class(uspop2019)
#> [1] "list"
names(uspop2019)
#> [1] "total" "MSA" "AGER" "Age group" "SEX"
#> [6] "AGER x SEX" "Age group 5"
The overall population size for the country as a whole is:
Once we have the overall population size, the overall rate is:
n | Rate | SE | LL | UL |
---|---|---|---|---|
8,250 | 320.7 | 15.1 | 292.4 | 351.7 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
To calculate the rates for a particular variable, we need to provide
a data frame with a column called Level
that matches the
levels of the variable in the survey, and a column called
Population
that gives the size of the population for that
level.
For example, for AGER
, this data frame as follows:
uspop2019$AGER
#> Level Population
#> 1 Under 15 years 60526656
#> 2 15-24 years 41718700
#> 3 25-44 years 85599410
#> 4 45-64 years 82562049
#> 5 65-74 years 31260202
#> 6 75 years and over 21519680
Now that we have the appropriate population figures, the rates table is obtained by typing:
Level | n | Rate | SE | LL | UL |
---|---|---|---|---|---|
Under 15 years | 887 | 194.8 | 23.3 | 154 | 246.4 |
15-24 years | 542 | 155.5 | 16.8 | 125.6 | 192.5 |
25-44 years | 1,435 | 198.9 | 16.3 | 169.3 | 233.7 |
45-64 years | 2,283 | 374.9 | 28.2 | 323.4 | 434.6 |
65-74 years | 1,661 | 661.8 | 46 | 577.3 | 758.5 |
75 years and over | 1,442 | 776.4 | 70.5 | 649.4 | 928.1 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
To calculate the rates for one variable (AGER
) by
another variable (SEX
), we need population figures in the
following format:
uspop2019$`AGER x SEX`
#> Level Subset Population
#> 1 Under 15 years Female 29604762
#> 2 15-24 years Female 20730118
#> 3 25-44 years Female 43192143
#> 4 45-64 years Female 42508901
#> 5 65-74 years Female 16673240
#> 6 75 years and over Female 12421444
#> 7 Under 15 years Male 30921894
#> 8 15-24 years Male 20988582
#> 9 25-44 years Male 42407267
#> 10 45-64 years Male 40053148
#> 11 65-74 years Male 14586962
#> 12 75 years and over Male 9098236
With this data frame, the rates table is obtained by typing:
Level | n | Rate | SE | LL | UL |
---|---|---|---|---|---|
Under 15 years | 434 | 202.5 | 24.3 | 159.8 | 256.6 |
15-24 years | 346 | 198.4 | 21.9 | 159.5 | 246.8 |
25-44 years | 923 | 263.3 | 26.5 | 215.9 | 321 |
45-64 years | 1,253 | 414 | 37.7 | 346.2 | 495.1 |
65-74 years | 891 | 720.3 | 66.4 | 600.8 | 863.6 |
75 years and over | 762 | 758.1 | 89.2 | 601.2 | 956 |
N = 4609. Checked NCHS presentation standards. Nothing to report. |
Level | n | Rate | SE | LL | UL |
---|---|---|---|---|---|
Under 15 years | 453 | 187.4 | 25 | 144.1 | 243.7 |
15-24 years | 196 | 113.1 | 20.7 | 78.4 | 163 |
25-44 years | 512 | 133.4 | 17.2 | 103.4 | 172 |
45-64 years | 1,030 | 333.4 | 32.3 | 275.4 | 403.5 |
65-74 years | 770 | 594.8 | 46.4 | 510.1 | 693.6 |
75 years and over | 680 | 801.2 | 73.2 | 669.1 | 959.5 |
N = 3641. Checked NCHS presentation standards. Nothing to report. |
In some situations, it might be necessary to modify survey variables, or to create new ones. This section describes how to do this.
Convert factor to logical. The variable
MAJOR
(major reason for this visit) has several levels.
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Blank | 175 | 15,887 | 3,354 | 10,335 | 24,419 | 1.5 | 0.3 | 1 | 2.3 |
New problem (less than 3 mos. onset) | 2,193 | 275,014 | 19,691 | 238,955 | 316,514 | 26.5 | 1.5 | 23.7 | 29.5 |
Chronic problem, routine | 2,861 | 380,910 | 35,080 | 317,916 | 456,386 | 36.8 | 2.5 | 31.9 | 41.9 |
Chronic problem, flare-up | 635 | 74,017 | 9,329 | 57,706 | 94,939 | 7.1 | 0.9 | 5.5 | 9.1 |
Pre-surgery | 159 | 12,864 | 2,151 | 9,188 | 18,010 | 1.2 | 0.2 | 0.9 | 1.7 |
Post-surgery | 659 | 54,170 | 6,749 | 42,350 | 69,289 | 5.2 | 0.7 | 4 | 6.7 |
Preventive care | 1,568 | 223,624 | 18,520 | 190,068 | 263,103 | 21.6 | 1.7 | 18.3 | 25.1 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Notice that one of the levels is called
"Preventive care"
. Suppose an analyst is only interested in
whether or not a visit is a preventive care visit – they are not
interested in the other visit types. They can create a new variable
called Preventive care visits
that is TRUE
for
preventive care visits and FALSE
for all other types of
visits, as follows:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
FALSE | 6,682 | 812,861 | 45,220 | 728,841 | 906,566 | 78.4 | 1.7 | 74.9 | 81.7 |
TRUE | 1,568 | 223,624 | 18,520 | 190,068 | 263,103 | 21.6 | 1.7 | 18.3 | 25.1 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
This creates a logical variable that is TRUE
for
preventive care visits and then tabulates it. When using the
var_case()
function, specify the name of the new logical
variable to be created, an existing factor variable, and one or more
levels of the factor variable that should be set to TRUE
in
the logical variable.
Thus, if an analyst is interested in surgery-related visits, which
are indicated by two different levels of MAJOR
, they could
type:
var_case("Surgery-related visits"
, "MAJOR"
, c("Pre-surgery", "Post-surgery"))
tab("Surgery-related visits")
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
FALSE | 7,432 | 969,451 | 47,976 | 879,793 | 1,068,246 | 93.5 | 0.8 | 91.9 | 94.9 |
TRUE | 818 | 67,034 | 7,810 | 53,273 | 84,348 | 6.5 | 0.8 | 5.1 | 8.1 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Collapse levels. The variable PRIMCARE
(whether the physician is this patient’s primary care provider) has
levels Unknown
and Blank
, among others.
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL | Flags |
---|---|---|---|---|---|---|---|---|---|---|
Blank | 16 | 1,150 | 478 | 440 | 3,005 | 0.1 | 0 | 0 | 0.2 | Cx |
Unknown | 300 | 39,519 | 9,507 | 24,520 | 63,692 | 3.8 | 0.9 | 2.3 | 6 | |
Yes | 2,278 | 383,481 | 28,555 | 331,362 | 443,798 | 37 | 2.6 | 31.9 | 42.3 | |
No | 5,656 | 612,335 | 43,282 | 533,050 | 703,413 | 59.1 | 2.5 | 53.9 | 64.1 | |
N = 8250. Checked NCHS presentation standards: Cx: suppress count (and rate). |
To collapse Unknown
and Blank
into a single
level, type:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Unknown if PCP | 316 | 40,669 | 9,479 | 25,619 | 64,560 | 3.9 | 0.9 | 2.4 | 6.1 |
Yes | 2,278 | 383,481 | 28,555 | 331,362 | 443,798 | 37 | 2.6 | 31.9 | 42.3 |
No | 5,656 | 612,335 | 43,282 | 533,050 | 703,413 | 59.1 | 2.5 | 53.9 | 64.1 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Convert numeric to factor. The variable
AGE
is numeric.
% known | Mean | SEM | SD |
---|---|---|---|
100 | 51 | 1.04 | 24.3 |
To create a new variable of age categories based on AGE
,
type:
var_cut("Age group", "AGE"
, c(-Inf, 0, 4, 14, 64, Inf)
, c("Under 1", "1-4", "5-14", "15-64", "65 and over") )
tab("Age group")
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 1 | 203 | 31,148 | 5,282 | 22,269 | 43,566 | 3 | 0.5 | 2.1 | 4.1 |
1-4 | 281 | 38,240 | 5,444 | 28,864 | 50,662 | 3.7 | 0.5 | 2.7 | 4.8 |
5-14 | 403 | 48,529 | 5,741 | 38,430 | 61,282 | 4.7 | 0.5 | 3.7 | 5.9 |
15-64 | 4,260 | 544,632 | 36,082 | 478,254 | 620,223 | 52.5 | 2 | 48.6 | 56.5 |
65 and over | 3,103 | 373,935 | 24,523 | 328,777 | 425,296 | 36.1 | 1.9 | 32.3 | 40 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
In the var_cut()
command, specify the following
information:
Check whether any variable is true. For a series of
logical variables, you can check whether any of them are
TRUE
using the var_any()
command.
A physician visit is considered to be an “imaging services” visit if
it had any of a number of imaging services ordered or provided. Imaging
services are indicated using logical variables, such as MRI
and XRAY
. To create the Imaging services
variable, type:
var_any("Imaging services"
, c("ANYIMAGE", "BONEDENS", "CATSCAN", "ECHOCARD", "OTHULTRA"
, "MAMMO", "MRI", "XRAY", "OTHIMAGE"))
tab("Imaging services")
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
FALSE | 7,148 | 901,115 | 43,298 | 820,085 | 990,151 | 86.9 | 1.1 | 84.6 | 89.1 |
TRUE | 1,102 | 135,369 | 13,574 | 111,134 | 164,890 | 13.1 | 1.1 | 10.9 | 15.4 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Interact variables. The tab_cross()
function creates a table of an interaction of two variables, but it does
not save the interacted variable. To create the interacted variable, use
the var_cross()
command:
Specify the name of the new variable as well as names of the two variables to interact.
Copy a variable. Create a new variable that is a
copy of another variable using var_copy()
. You can modify
the copy, while the original remains unchanged. For example:
var_copy("Age group", "AGER")
#> Warning in var_copy("Age group", "AGER"): Age group: overwriting a variable
#> that already exists.
var_collapse("Age group", "65+", c("65-74 years", "75 years and over"))
var_collapse("Age group", "25-64", c("25-44 years", "45-64 years"))
tab("AGER", "Age group")
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 887 | 117,917 | 14,097 | 93,229 | 149,142 | 11.4 | 1.3 | 8.9 | 14.2 |
15-24 years | 542 | 64,856 | 7,018 | 52,387 | 80,292 | 6.3 | 0.6 | 5.1 | 7.5 |
25-44 years | 1,435 | 170,271 | 13,966 | 144,925 | 200,049 | 16.4 | 1.1 | 14.3 | 18.8 |
45-64 years | 2,283 | 309,506 | 23,290 | 266,994 | 358,787 | 29.9 | 1.4 | 27.2 | 32.6 |
65-74 years | 1,661 | 206,866 | 14,366 | 180,481 | 237,109 | 20 | 1.2 | 17.6 | 22.5 |
75 years and over | 1,442 | 167,069 | 15,179 | 139,746 | 199,735 | 16.1 | 1.3 | 13.7 | 18.8 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Under 15 years | 887 | 117,917 | 14,097 | 93,229 | 149,142 | 11.4 | 1.3 | 8.9 | 14.2 |
15-24 years | 542 | 64,856 | 7,018 | 52,387 | 80,292 | 6.3 | 0.6 | 5.1 | 7.5 |
25-64 | 3,718 | 479,777 | 32,175 | 420,624 | 547,247 | 46.3 | 1.8 | 42.7 | 49.9 |
65+ | 3,103 | 373,935 | 24,523 | 328,777 | 425,296 | 36.1 | 1.9 | 32.3 | 40 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Here, the AGER
variable remains unchanged, while the
Age group
variable has fewer categories.
The tab*
and total*
functions have an
argument called csv
that specifies the name of a
comma-separated values (CSV) file to save the output to. Alternatively,
you can name the default CSV output file using the
set_output()
function. For example, the following directs
surveytable
to send all future output to a CSV file, create
some tables, and then turn off sending output to the file:
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
M.D. - Doctor of Medicine | 7,498 | 980,280 | 48,388 | 889,842 | 1,079,910 | 94.6 | 0.7 | 93.1 | 95.8 |
D.O. - Doctor of Osteopathy | 752 | 56,204 | 6,602 | 44,597 | 70,832 | 5.4 | 0.7 | 4.2 | 6.9 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
Primary care specialty | 2,993 | 521,466 | 31,136 | 463,840 | 586,252 | 50.3 | 2.6 | 45.1 | 55.5 |
Surgical care specialty | 3,050 | 214,832 | 31,110 | 161,661 | 285,490 | 20.7 | 3 | 15.1 | 27.3 |
Medical care specialty | 2,207 | 300,186 | 43,497 | 225,806 | 399,067 | 29 | 3.6 | 22.1 | 36.6 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
Level | n | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|---|
MSA (Metropolitan Statistical Area) | 7,496 | 973,676 | 50,515 | 879,490 | 1,077,947 | 93.9 | 1.7 | 89.7 | 96.8 |
Non-MSA | 754 | 62,809 | 17,549 | 36,249 | 108,830 | 6.1 | 1.7 | 3.2 | 10.3 |
N = 8250. Checked NCHS presentation standards. Nothing to report. |
If the tabulation functions are called from within an R Markdown
notebook, they produce HTML tables. This makes it easy to incorporate
the output of the surveytable
package directly into
documents, presentations, “shiny” web apps, and other output types.
Finally, the tabulation functions return the tables that they
produce. More advanced analysts can use this functionality to integrate
surveytable
into other programming tasks.