The `SNPassoc`

package contains facilities for data
manipulation, tools for exploratory data analysis, convenient graphical
facilities, and tools for assessing genetic association for both
quantitative and categorial (case-control) traits in whole genome
approaches. Genome-based studies are normally analyzed using a
multistage approach. In the first step researchers are interested in
assessing association between the outcome and thousands of SNPs. In a
second and possibly third step, medium/large scale studies are performed
in which only a few hundred of SNPs, those with a putative association
found in the first step, are genotyped. `SNPassoc`

is
specially designed for analyzing this kind of designs. In addition, a
convenience-based approach (select variants on the basis of logistical
considerations such as the ease and cost of genotyping) can also be
analyzed using `SNPassoc`

. Different genetic models are also
implemented in the package. Analysis of multiple SNPs can be analyzed
using either haplotype or gene-gene interaction approaches.

This document is an updated version of the initial vignette that was published with the SNPassoc paper González et al. (2007). It contains a more realistic example belonging to a real dataset. The original vignette is still available here.

SNP data are typically available in text format or Excel spreadsheets
which are easily uploaded in `R`

as a data frame. Here, as an
illustrative example, we are analyzing a dataset containing
epidemiological information and 51 SNPs from a case-control study on
asthma. The data is available within `SNPassoc`

and can be
loaded by

Then, the data is loaded into the R session by

```
'data.frame': 1578 obs. of 57 variables:
$ country : Factor w/ 10 levels "Australia","Belgium",..: 5 5 5 5 5 5 5 5 5 5 ...
$ gender : Factor w/ 2 levels "Females","Males": 2 2 2 1 1 1 1 2 1 1 ...
$ age : num 42.8 50.2 46.7 47.9 48.4 ...
$ bmi : num 20.1 24.7 27.7 33.3 25.2 ...
$ smoke : int 1 0 0 0 0 1 0 0 0 0 ...
$ casecontrol: int 0 0 0 0 1 0 0 0 0 0 ...
$ rs4490198 : Factor w/ 3 levels "AA","AG","GG": 3 3 3 2 2 2 3 2 2 2 ...
$ rs4849332 : Factor w/ 3 levels "GG","GT","TT": 3 2 3 2 1 2 3 3 2 1 ...
$ rs1367179 : Factor w/ 3 levels "CC","GC","GG": 2 2 2 3 3 3 2 3 3 3 ...
[list output truncated]
```

```
country gender age bmi smoke casecontrol rs4490198 rs4849332
1 Germany Males 42.80630 20.14797 1 0 GG TT
2 Germany Males 50.22861 24.69136 0 0 GG GT
3 Germany Males 46.68857 27.73230 0 0 GG TT
4 Germany Females 47.86311 33.33187 0 0 AG GT
5 Germany Females 48.44079 25.23634 0 1 AG GG
```

We observe that we have case-control status (0: control, 1: asthma) and another 4 variables encoding the country of origin, gender, age, body mass index (bmi) and smoking status (0: no smoker, 1: ex-smoker, 2: current smoker). There are 51 SNPs whose genotypes are given by the alleles names.

To start the analysis, we must indicate which columns of the dataset
`asthma`

contain the SNP data, using the
`setupSNP`

function. In our example, SNPs start from column 7
onwards, which we specify in argument *colSNPs*

This is an alternative way of determining the columns containing the SNPs

The argument *sep* indicates the character separating the
alleles. The default value is ’‘/´´. In our case, there is no separating
character, so that, we set *sep=““*. The argument
*name.genotypes* can be used when genotypes are available in
other formats, such as 0, 1, 2 or’‘norm´´,’‘het´´,’’mut´´. The purpose
of the `setupSNP`

function is to assign the class
*snp* to the SNPs variables, to which `SNPassoc`

methods will be applied. The function labels the most common genotype
across subjects as the reference one. When numerous SNPs are available,
the function can be parallelized through the argument *mc.cores*
that indicates the number of processors to be used. We can verify that
the SNP variables are given the new class *snp*

```
[1] G/G G/T G/G G/T G/T G/G
Genotypes: G/G G/T T/T
Alleles: G T
```

`[1] "snp" "factor"`

and summarize their content with `summary`

```
Genotypes:
frequency percentage
G/G 903 57.224335
G/T 570 36.121673
T/T 105 6.653992
Alleles:
frequency percentage
G 2376 75.28517
T 780 24.71483
HWE (p value): 0.250093
```

which shows the genotype and allele frequencies for a given SNP, testing for Hardy-Weinberg equilibrium (HWE). We can also visualize the results in a plot by

The argument *type* helps to get a pie chart

The *summary* function can also be applied to the whole
dataset

```
alleles major.allele.freq HWE missing (%)
rs4490198 A/G 59.2 0.174133 0.6
rs4849332 G/T 61.8 0.522060 0.1
rs1367179 G/C 81.4 0.738153 1.0
rs11123242 C/T 81.7 0.932898 0.6
rs13014858 G/A 58.3 0.351116 0.1
rs1430094 G/A 66.9 0.305509 0.4
rs1430093 C/A 66.6 0.817701 3.5
rs746710 G/C 51.5 0.614368 0.0
rs1430090 T/G 70.0 0.025180 1.6
rs6737251 C/T 69.3 0.235996 0.3
rs11685217 C/T 80.1 0.009462 4.5
rs1430097 C/A 65.1 0.738166 1.0
rs10496465 A/G 85.8 0.917997 0.6
rs3756688 T/C 63.9 0.154632 0.6
rs2303063 A/G 53.0 0.722069 1.1
rs1422993 G/T 75.3 0.250093 0.0
rs2400478 G/A 62.6 0.256786 0.9
rs714588 A/G 54.9 0.838329 0.8
rs1023555 T/A 76.8 0.943443 0.5
rs898070 G/A 62.6 1.000000 0.6
rs963218 C/T 53.2 0.389387 0.3
rs1419835 C/T 78.2 0.505391 0.6
rs765023 T/C 64.5 0.030513 6.9
rs1345267 A/G 61.0 0.112183 0.1
rs324381 G/A 64.6 0.242223 11.6
rs184448 T/G 55.9 0.008446 2.2
rs324396 C/T 71.2 0.197291 0.3
rs324957 G/A 57.0 0.007417 0.4
rs324960 C/T 66.6 0.077777 1.1
rs10486657 C/T 81.3 0.672703 4.3
rs324981 A/T 53.2 0.048438 0.2
rs1419780 C/G 80.8 0.569652 0.2
rs325462 T/A 51.0 0.337862 0.3
rs727162 G/C 78.5 0.765708 0.0
rs10250709 G/A 65.4 0.266434 0.0
rs6958905 T/C 64.7 0.377472 0.4
rs10238983 T/C 75.7 0.216435 0.4
rs4941643 A/G 54.1 0.635887 7.2
rs3794381 C/G 71.7 0.652089 7.1
rs2031532 G/A 65.0 0.911918 0.0
rs2247119 T/C 71.5 0.457710 0.5
rs8000149 T/C 63.2 0.588077 0.4
rs2274276 G/C 57.0 0.571386 0.6
rs7332573 G/T 91.5 0.869947 1.5
rs3829366 T/A 51.6 0.722626 1.3
rs6084432 G/A 83.7 0.266716 0.6
rs512625 G/A 69.5 0.905395 0.4
rs3918395 G/T 86.8 0.508732 1.2
rs2787095 G/C 60.2 0.102053 0.8
rs2853215 G/A 73.0 0.249516 0.2
```

showing the SNP labels with minor/major allele format, the major allele frequency the HWE test and the percentage of missing genotypes. Missing values can be further explored plotting with

This plot can be used to inspect if missing values appear randomly across individuals and SNPs. In our case, we can see that the missing pattern may be considered random, except for three clusters in consecutive SNPs (large black squares). These individuals should be further checked for possible problems with genotyping.

Genotyping of SNPs needs to pass quality control measures. Aside from
technical details that need to be considered for filtering SNPs with low
quality, genotype calling error can be detected by a HWE test. The test
compares the observed genotype frequencies with those expected under
random mating, which follows when the SNPs are in the absence of
selection, mutation, genetic drift, or other forces. Therefore, HWE must
be checked only in controls. There are several tests described in the
literature to verify HWE. In `SNPassoc`

HWE is tested for all
the bi-allelic SNP markers using a fast exact test Wigginton, Cutler, and Abecasis (2005)
implemented in the *tableHWE* function.

```
HWE (p value)
rs4490198 0.1741325
rs4849332 0.5220596
rs1367179 0.7381531
rs11123242 0.9328981
rs13014858 0.3511162
rs1430094 0.3055089
```

We observe that the first SNPs in the dataset are under HWE since their P-values rejecting the HWE hypothesis (null hypothesis) are larger than 0.05. However, when tested in control samples only, by stratifying by cases and controls

```
hwe2 <- tableHWE(asthma.s, casecontrol)
#SNPs is HWE in the whole sample but not controls
snpNHWE <- hwe2[,1]>0.05 & hwe2[,2]<0.05
rownames(hwe2)[snpNHWE]
```

`[1] "rs1345267"`

```
all groups 0 1
0.11218285 0.04956349 0.81604706
```

We see that rs1345267 is not in HWE within controls because its P-value is <0.05. Notice that one is interested in keeping those SNPsthat do not reject the null hypothesis. As several SNPs are tested, multiple comparisons must be considered. In this particular setting, a threshold of 0.001 is normally considered. As a quality control measure, it is not necessary to be as conservative as in those situations where false discovery rates need to be controlled.

SNPs that do not pass the HWE test must be removed form further
analyses. We can recall *setupSNP* and indicate the columns of
the SNPs to be kept

```
snps.ok <- rownames(hwe2)[hwe2[,2]>=0.001]
pos <- which(colnames(asthma)%in%snps.ok, useNames = FALSE)
asthma.s <- setupSNP(asthma, pos, sep="")
```

Note that in the variable *pos* we redefine the SNP variables
to be considered as class *snp*.

We are interested in finding those SNPs associated with asthma status
that is encoded in the variable *casecontrol*. We first
illustrate the association between case-control status and the SNP
rs1422993. The association analysis for all genetic models is performed
by the function *association* that regresses *casecontrol*
on the variable rs1422993 from the dataset *asthma.s* that
already contains the SNP variables of class *snp*.

```
SNP: rs1422993 adjusted by:
0 % 1 % OR lower upper p-value AIC
Codominant
G/G 730 59.0 173 50.9 1.00 0.017768 1642
G/T 425 34.3 145 42.6 1.44 1.12 1.85
T/T 83 6.7 22 6.5 1.12 0.68 1.84
Dominant
G/G 730 59.0 173 50.9 1.00 0.007826 1642
G/T-T/T 508 41.0 167 49.1 1.39 1.09 1.77
Recessive
G/G-G/T 1155 93.3 318 93.5 1.00 0.877863 1649
T/T 83 6.7 22 6.5 0.96 0.59 1.57
Overdominant
G/G-T/T 813 65.7 195 57.4 1.00 0.005026 1641
G/T 425 34.3 145 42.6 1.42 1.11 1.82
log-Additive
0,1,2 1238 78.5 340 21.5 1.22 1.01 1.47 0.040151 1644
```

The function *association* follows the usual syntax of R
modelling functions with the difference that the variables in the model
that are of class *snp* are tested using different genetic
models. In our example, we observe that all genetic models but the
recessive one are statistically significant. *association* also
fits the overdominant model, which compares the two homozygous genotypes
versus the heterozygous one. This genetic model of inheritance is
biologically rare although it has been linked to sickle cell anemia in
humans. The result table describes the number of individuals in each
genotype across cases and controls. The ORs and CI-95% are also
computed. The last column describes the AIC (Akaike information
criteria) that can be used to decide which is the best model of
inheritance; the lower the better the model is.

In the example, one may conclude that rs1422993 is associated with asthma and that, for instance, the risk of being asthmatic is 39% higher in people having at least one alternative allele (T) with respect to individuals having none (dominant model). This risk is statistically significant since the CI-95% does not contain 1, the P-value is 0.0078<0.022, or the P-value of the max-statistics is 0.01

```
dominant recessive log-additive MAX-statistic Pr(>z)
[1,] 7.073 0.024 4.291 7.073 0.0179 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

If an expected model of inheritance is hypothesized, the association
analysis for the model can be specified in the argument *model*,
which by default test all models,

```
SNP: rs1422993 adjusted by:
0 % 1 % OR lower upper p-value AIC
Dominant
G/G 730 59 173 50.9 1.00 0.007826 1642
G/T-T/T 508 41 167 49.1 1.39 1.09 1.77
```

Association tests are typically adjusted by covariates, which are incorporated in the model in the usual form

```
SNP: rs1422993 adjusted by: country smoke
0 % 1 % OR lower upper p-value AIC
Codominant
G/G 728 59.1 173 51.0 1.00 0.06940 1408
G/T 423 34.3 144 42.5 1.38 1.05 1.82
T/T 81 6.6 22 6.5 1.06 0.61 1.85
Dominant
G/G 728 59.1 173 51.0 1.00 0.03429 1407
G/T-T/T 504 40.9 166 49.0 1.33 1.02 1.73
Recessive
G/G-G/T 1151 93.4 317 93.5 1.00 0.79338 1411
T/T 81 6.6 22 6.5 0.93 0.54 1.60
Overdominant
G/G-T/T 809 65.7 195 57.5 1.00 0.02147 1406
G/T 423 34.3 144 42.5 1.37 1.05 1.80
log-Additive
0,1,2 1232 78.4 339 21.6 1.19 0.96 1.46 0.11191 1409
```

ORs for stratified analysis on given categorical covariates are used to verify whether the risk is constant across groups

```
SNP: rs1422993 adjusted by: survival::strata(gender)
0 % 1 % OR lower upper p-value AIC
Codominant
G/G 730 59.0 173 50.9 1.00 0.022940 1634
G/T 425 34.3 145 42.6 1.42 1.11 1.83
T/T 83 6.7 22 6.5 1.09 0.66 1.80
Dominant
G/G 730 59.0 173 50.9 1.00 0.011144 1633
G/T-T/T 508 41.0 167 49.1 1.37 1.07 1.74
Recessive
G/G-G/T 1155 93.3 318 93.5 1.00 0.805330 1640
T/T 83 6.7 22 6.5 0.94 0.58 1.53
Overdominant
G/G-T/T 813 65.7 195 57.4 1.00 0.006378 1632
G/T 425 34.3 145 42.6 1.41 1.10 1.80
log-Additive
0,1,2 1238 78.5 340 21.5 1.21 1.00 1.46 0.055231 1636
```

We can see, for instance, that the dominant model is significant only in males. The subset argument allows fitting the model in a subgroup of individuals

```
SNP: rs1422993 adjusted by:
0 % 1 % OR lower upper p-value AIC
Codominant
G/G 179 54.6 22 44.9 1.00 0.3550 295.2
G/T 125 38.1 24 49.0 1.56 0.84 2.91
T/T 24 7.3 3 6.1 1.02 0.28 3.66
Dominant
G/G 179 54.6 22 44.9 1.00 0.2059 293.7
G/T-T/T 149 45.4 27 55.1 1.47 0.81 2.70
Recessive
G/G-G/T 304 92.7 46 93.9 1.00 0.7576 295.2
T/T 24 7.3 3 6.1 0.83 0.24 2.85
Overdominant
G/G-T/T 203 61.9 25 51.0 1.00 0.1502 293.2
G/T 125 38.1 24 49.0 1.56 0.85 2.85
log-Additive
0,1,2 328 87.0 49 13.0 1.23 0.77 1.96 0.3816 294.5
```

These analyses can be also be performed in quantitative traits, such
as body mass index, since *association* function automatically
selects the error distribution of the regression analysis (either
Gaussian or binomial).

```
SNP: rs1422993 adjusted by:
n me se dif lower upper p-value AIC
Codominant
G/G 896 25.53 0.1446 0.000000 0.9069 9069
G/T 565 25.50 0.1834 -0.027059 -0.4874 0.4332
T/T 105 25.71 0.4676 0.178076 -0.7057 1.0619
Dominant
G/G 896 25.53 0.1446 0.000000 0.9818 9067
G/T-T/T 670 25.54 0.1710 0.005089 -0.4324 0.4426
Recessive
G/G-G/T 1461 25.52 0.1135 0.000000 0.6694 9067
T/T 105 25.71 0.4676 0.188540 -0.6769 1.0540
Overdominant
G/G-T/T 1001 25.55 0.1383 0.000000 0.8424 9067
G/T 565 25.50 0.1834 -0.045739 -0.4965 0.4050
log-Additive
0,1,2 0.033951 -0.3153 0.3832 0.8489 9067
```

For BMI, *association* tests whether the difference between
means is statistically significant, rather than computing an OR.

For multiple SNP data, our objective is to identify the variants that
are significantly associated with the trait. The most basic strategy is,
therefore, to fit an association test like the one described above for
each of the SNPs in the dataset and determine which of those
associations are significant. The massive univariate testing is the most
widely used analysis method for *omic* data because of its
simplicity. In *SNPassoc*, this type of analysis is done with the
function *WGassociation*

```
comments codominant dominant recessive overdominant log-additive
rs4490198 - 0.52765 0.29503 0.96400 0.29998 0.49506
rs4849332 - 0.96912 0.92986 0.84806 0.82327 0.97049
rs1367179 - 0.62775 0.59205 0.35786 0.86419 0.43994
rs11123242 - 0.68622 0.67596 0.39801 0.92878 0.52009
rs13014858 - 0.52578 0.26739 0.88011 0.34966 0.40897
rs1430094 - 0.13375 0.10569 0.54432 0.04490 0.36611
```

Here, only the outcome is required in the formula argument (first
argument) since the function successively calls *association* on
each of the variables of class *snp* within *data*. The
function returns the P-values of association of each SNP under each
genetic model. Covariates can also be introduced in the model

`SNPassoc`

is computationally limited on large genomic
data. The computing time can be reduced by parallelization, specifying
in the argument *mc.cores* the number of computing cores to be
used. Alternatively, the function *scanWGassociation*, a C
compiled function, can be used to compute a predetermined genetic model
across all SNPs, passed in the argument *model*, which by default
is the additive model

**NOTE**: This function is not available on the
`SNPassoc`

version available on CRAN. The user can install
the development version available on GitHub to get access to this
function just executing

The P-values obtained from massive univariate analyses are visualized
with the generic *plot* function

This produces a Manhattan plot of the -log10(P-values) for all the SNPs over all models. It shows the nominal level of significance and the Bonferroni level, which is the level corrected by the multiple testing across all SNPs. The overall hypothesis of massive univariate association tests is whether there is any SNP that is significantly associated with the phenotype. As multiple SNPs are tested, the probability of finding at least one significant finding increases if we do not lower the significance threshold. The Bonferroni correction lowers the threshold by the number of SNPs tested (0.0001=0.05/51). In the Manhattan plotof our analysis, we see that no SNP is significant at the Bonferroni level, and therefore there is no SNP that is significantly associated with asthma.

Maximum-statistic (see González et al. (2008)) can also be used to test association between asthma status and SNPs

```
dominant recessive log-additive MAX-statistic Pr(>z)
rs4490198 1.097 0.002 0.466 1.097 0.50281
rs4849332 0.008 0.037 0.001 0.037 0.97632
rs1367179 0.287 0.845 0.602 0.845 0.58891
rs11123242 0.175 0.714 0.417 0.714 0.63924
rs13014858 1.230 0.023 0.683 1.230 0.46342
rs1430094 2.617 0.368 0.821 2.617 0.20708
rs1430093 1.051 0.042 0.743 1.051 0.51905
rs746710 0.728 0.679 1.051 1.051 0.51656
rs1430090 0.172 0.463 0.000 0.463 0.74274
rs6737251 0.143 0.156 0.217 0.217 0.86875
rs11685217 0.894 0.030 0.705 0.894 0.56865
rs1430097 0.003 0.183 0.029 0.183 0.88910
rs10496465 0.003 0.020 0.008 0.020 0.98723
rs3756688 0.016 0.738 0.266 0.738 0.62503
rs2303063 0.060 1.271 0.658 1.271 0.45218
rs1422993 7.073 0.024 4.291 7.073 0.01765 *
rs2400478 1.662 0.056 1.055 1.662 0.35825
rs714588 0.659 0.061 0.150 0.659 0.65827
rs1023555 0.221 0.104 0.261 0.261 0.84686
rs898070 0.020 1.794 0.346 1.794 0.33203
rs963218 0.165 0.190 0.263 0.263 0.84326
rs1419835 1.084 0.775 0.295 1.084 0.51120
rs765023 1.959 0.526 0.483 1.959 0.30303
rs1345267 2.582 0.003 1.383 2.582 0.21188
rs324381 0.175 0.402 0.377 0.402 0.77469
rs184448 9.710 2.026 8.253 9.710 0.00327 **
rs324396 2.312 0.188 1.051 2.312 0.24882
rs324957 7.598 2.134 7.157 7.598 0.01440 *
rs324960 2.901 7.928 6.443 7.928 0.00890 **
rs10486657 0.054 0.500 0.001 0.500 0.73005
rs324981 3.168 2.431 4.270 4.270 0.08184 .
rs1419780 0.123 1.193 0.005 1.193 0.47784
rs325462 1.545 1.645 2.412 2.412 0.23145
rs727162 3.022 0.643 3.074 3.074 0.16196
rs10250709 0.737 0.007 0.360 0.737 0.62815
rs6958905 0.490 0.099 0.447 0.490 0.73239
rs10238983 0.094 0.692 0.328 0.692 0.64811
rs4941643 1.294 0.006 0.465 1.294 0.44629
rs3794381 0.127 0.998 0.489 0.998 0.53523
rs2031532 0.025 0.240 0.127 0.240 0.85690
rs2247119 0.379 0.000 0.229 0.379 0.78478
rs8000149 0.020 0.043 0.043 0.043 0.97249
rs2274276 0.005 0.034 0.023 0.034 0.97753
rs7332573 1.247 1.703 1.825 1.825 0.33044
rs3829366 1.137 0.448 0.069 1.137 0.49161
rs6084432 2.999 0.848 3.279 3.279 0.14106
rs512625 2.180 0.030 1.119 2.180 0.26479
rs3918395 0.903 0.414 0.455 0.903 0.57180
rs2787095 0.178 0.069 0.184 0.184 0.88709
rs2853215 0.686 0.931 1.130 1.130 0.49269
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We note that even under the max-statistics none of the SNPs tested is significant under the Bonferroni correction (<0.0001) for multiple SNP testing

`[1] 0.003267284`

Information for specific association models for given SNPs can also
be retrieved with *WGstats*

Therefore, we can have access to the results for a given SNP by

```
SNP: rs1422993 adjusted by:
0 % 1 % OR lower upper p-value AIC
Codominant
G/G 730 59.0 173 50.9 1.00 0.017768 1642
G/T 425 34.3 145 42.6 1.44 1.12 1.85
T/T 83 6.7 22 6.5 1.12 0.68 1.84
Dominant
G/G 730 59.0 173 50.9 1.00 0.007826 1642
G/T-T/T 508 41.0 167 49.1 1.39 1.09 1.77
Recessive
G/G-G/T 1155 93.3 318 93.5 1.00 0.877863 1649
T/T 83 6.7 22 6.5 0.96 0.59 1.57
Overdominant
G/G-T/T 813 65.7 195 57.4 1.00 0.005026 1641
G/T 425 34.3 145 42.6 1.42 1.11 1.82
log-Additive
0,1,2 1238 78.5 340 21.5 1.22 1.01 1.47 0.040151 1644
```

recovering our previous results given by *association*
function.

**NOTE**: The R output of specific association analyses
can be exported into LaTeX by using *getNiceTable* function and
*xtable* R package. The following code creates a table for the
SNPs rs1422993 and rs184448

```
library(xtable)
out <- getNiceTable(ans[c("rs1422993", "rs184448")])
nlines <- attr(out, "nlines")
hlines <- c(-1, -1, 0, cumsum(nlines+1), nrow(out), nrow(out))
print(xtable(out, caption='Genetic association using
different genetic models from asthma
data example of rs1422993 and rs184448
SNPs obtained with SNPassoc.',
label = 'tab-2SNPs'),
tabular.enviroment="longtable", file="tableSNPs",
floating=FALSE, include.rownames = FALSE,
hline.after= hlines, sanitize.text.function=identity)
```

Gene-enviroment (GxE) analyses can be performed within
`SNPassoc`

using *association* function. Assume that
we are interested in testing whether the risk of rs1422993 for asthma
under the dominant model is different among smokers (variable smoke;
0=never, 1=ever). This code fits a model with an interaction term where
the environmental variable is required to be a factor factor
variable.

```
SNP: dominant(rs1422993 adjusted by:
Interaction
---------------------
0 OR lower upper 1 OR lower upper
G/G 486 130 1.00 NA NA 242 43 0.66 0.46 0.97
G/T-T/T 354 128 1.35 1.02 1.79 150 38 0.95 0.63 1.42
p interaction: 0.8513
factor(smoke) within dominant(rs1422993
---------------------
G/G
0 1 OR lower upper
0 486 130 1.00 NA NA
1 242 43 0.66 0.46 0.97
G/T-T/T
0 1 OR lower upper
0 354 128 1.0 NA NA
1 150 38 0.7 0.47 1.06
p trend: 0.8513
dominant(rs1422993 within factor(smoke)
---------------------
0
0 1 OR lower upper
G/G 486 130 1.00 NA NA
G/T-T/T 354 128 1.35 1.02 1.79
1
0 1 OR lower upper
G/G 242 43 1.00 NA NA
G/T-T/T 150 38 1.43 0.88 2.31
p trend: 0.8513
```

The result is an interaction table showing that the risk of individuals carrying the T allele increases the risk of asthma in never smokers (OR=1.35; CI: 1.02-1.79) while it is not significant in ever smokers (OR=0.95; CI: 0.63-1.42). However, the interaction is not statistically significant (\(P\)-interaction=0.8513). The output also shows the stratified ORs that can help in interpreting the results.

In a similar way, gene-gene interaction (GxG) of a given SNP
epistasis model can also be fitted using the same function. In that
case, the genetic model of the interacting SNP must be indicated in the
*model.inteaction* argument.

```
association(casecontrol ~ rs1422993*factor(rs184448),
data=asthma.s, model.interaction = "dominant" )
```

```
SNP: rs1422993 adjusted by:
Interaction
---------------------
T/T OR lower upper T/G OR lower upper 0 1 G/G lower upper
G/G 227 43 1.00 NA NA 359 96 1.41 0.95 2.10 128 30 1.24 0.74 2.07
G/T-T/T 154 33 1.13 0.69 1.86 265 93 1.85 1.24 2.77 78 38 2.57 1.55 4.27
p interaction: 0.24499
factor(rs184448) within rs1422993
---------------------
G/G
0 1 OR lower upper
T/T 227 43 1.00 NA NA
T/G 359 96 1.41 0.95 2.10
G/G 128 30 1.24 0.74 2.07
G/T-T/T
0 1 OR lower upper
T/T 154 33 1.00 NA NA
T/G 265 93 1.64 1.05 2.55
G/G 78 38 2.27 1.32 3.90
p trend: 0.24499
rs1422993 within factor(rs184448)
---------------------
T/T
0 1 OR lower upper
G/G 227 43 1.00 NA NA
G/T-T/T 154 33 1.13 0.69 1.86
T/G
0 1 OR lower upper
G/G 359 96 1.00 NA NA
G/T-T/T 265 93 1.31 0.95 1.82
G/G
0 1 OR lower upper
G/G 128 30 1.00 NA NA
G/T-T/T 78 38 2.08 1.19 3.62
p trend: 0.12743
```

We observe that the interaction between these two SNPs is not statistically significant (P-value=0.24). However, the OR of GG genotype of rs184448 differs across individuals between the GG and GT-TT genotypes of rs1422993 (see ORs for GG in the second table of the output).

The user also can perform GxG for a set of SNPs using this code. Let us assume we are interested in assessing interaction between the SNPs that are significant at 10% level

```
ans <- WGassociation(casecontrol, data=asthma.s)
mask <- apply(ans, 1, function(x) min(x, na.rm=TRUE)<0.1)
sig.snps <- names(mask[mask])
sig.snps
```

```
[1] "rs1430094" "rs1422993" "rs765023" "rs184448" "rs324396" "rs324957"
[7] "rs324960" "rs324981" "rs727162" "rs6084432"
```

```
idx <- which(colnames(asthma)%in%sig.snps)
asthma.s2 <- setupSNP(asthma, colSNPs = idx, sep="")
ans.int <- interactionPval(casecontrol ~ 1, data=asthma.s2)
ans.int
```

```
rs1430094 rs1422993 rs765023 rs184448 rs324396
rs1430094 0.132526514 0.653457029 0.816154586 0.787386835 0.694311497
rs1422993 0.140433948 0.016719949 0.133131712 0.375246182 0.683959376
rs765023 0.171993821 0.144029395 0.182805988 0.961405419 0.194520134
rs184448 0.100074594 0.019963426 0.034666219 0.007969948 0.036825613
rs324396 0.204742661 0.190583402 0.131945703 0.344952544 0.209589962
rs324957 0.103680541 0.019820456 0.034446832 0.696623215 0.318903026
rs324960 0.107216765 0.024180405 0.174801032 0.291457716 0.584132142
rs324981 0.128489057 0.144077239 0.153521356 0.646139555 0.301782199
rs727162 0.258828613 0.240900610 0.188971595 0.173341993 0.120240909
rs6084432 0.214413938 0.187372912 0.187830183 0.264160389 0.252965105
rs324957 rs324960 rs324981 rs727162 rs6084432
rs1430094 0.644661690 0.096913557 0.774083175 0.043758067 0.946641185
rs1422993 0.257661509 0.175946789 0.095251421 0.577504655 0.027437424
rs765023 0.437535350 0.614095008 0.117705086 0.217580773 0.896707843
rs184448 0.018290318 0.356124918 0.311031078 0.297592380 0.349332875
rs324396 0.232076552 0.747574191 0.064231657 0.361336095 0.388626207
rs324957 0.019578585 0.595796247 0.965290164 0.402781904 0.272919867
rs324960 0.404830873 0.017401694 0.168077057 0.406220478 0.725774374
rs324981 0.657378128 0.533095852 0.117043376 0.463436271 0.862291944
rs727162 0.174960014 0.500811687 0.167364803 0.206131411 0.885294485
rs6084432 0.174006927 0.146808311 0.210647392 0.203957720 0.193711884
```

we can visualize the results by

Genetic association studies can be extended from single SNP associations to haplotype associations. Several examples including different complex diseases can be found in . While alleles naturally occur in haplotypes, as they belong to chromosomes, the phase, or the knowledge of the chromosome an allele belongs to is lost in the genotyping process. As each genotype is measured with a different probe the phase between the alleles in a chromosome is broken. Consider for instance an individual for which one chromosome has alleles A and T at two different loci and alleles G or C at the second chromosome. The individual’s genotypes for the individual at the two loci are A/G and T/C, which are the same genotypes of another individual that has alleles A and C in one chromosome and G and T in the second chromosome. Clearly, if the pair A-T confers a risk to a disease, subject one is at risk while subject two is not, despite both of them having the same genotypes.

The only unequivocal method of resolving phase ambiguity is sequencing the chromosomes of individuals. However, given the correlational structure of SNPs, it is possible to estimate the probability of a particular haplotype in a subject. This can be done in a genetic association study where a number of cases and controls are genotyped. There are numerous methods to infer unobserved haplotypes, two of the most popular are maximum likelihood, implemented via the expectation-maximization (EM) algorithm , and a parsimony method . Recent methods based on Bayesian models have also been proposed . In addition, haplotypes inferences carry uncertainty, which should be considered in association analyses.

We now illustrate how to perform haplotype estimation from genotype
data using the EM algorithm and how to integrate haplotype uncertainty
when evaluating the association between traits and haplotypes. Haplotype
inference is performed with `haplo.stats`

, for which
genotypes are encoded in a different format. *make.geno*, from
`SNPassoc`

, formats data for `haplo.stats`

. The
function *haplo.em* computes the haplotype frequency in the data
for the SNPs of interest. Here we illustrate how to estimate haplotypes
built from the SNPs rs714588, rs1023555 and rs898070:

```
library(haplo.stats)
snpsH <- c("rs714588", "rs1023555", "rs898070")
genoH <- make.geno(asthma.s, snpsH)
em <- haplo.em(genoH, locus.label = snpsH, miss.val = c(0, NA))
em
```

```
================================================================================
Haplotypes
================================================================================
rs714588 rs1023555 rs898070 hap.freq
1 1 1 1 0.04090
2 1 1 2 0.02439
3 1 2 1 0.04265
4 1 2 2 0.44047
5 2 1 1 0.08271
6 2 1 2 0.08403
7 2 2 1 0.20794
8 2 2 2 0.07691
================================================================================
Details
================================================================================
lnlike = -4102.691
lr stat for no LD = 774.5411 , df = 4 , p-val = 0
```

Coding the common and variant alleles as 1 and 2, we can see there
are 8 possible haplotypes across the subjects and are listed with an
estimated haplotype frequency. Clearly, the haplotypes are not equally
probable, as expected from the high LD between the SNPs. In particular,
we observe that haplotypes 4 and 7 are the most probable, accumulating
65% of the haplotype sample. *haplo.em* estimates for each
subject the probability of a given haplotype in each of the subject’s
chromosomes.

We then want to assess if any of these haplotypes significantly
associates with asthma. The *haplo.glm* fits a regression model
between the phenotype and the haplotypes, incorporating the uncertainty
for the probable haplotypes of individuals. The function
*intervals* of `SNPassoc`

provides a nice summary of
the results

```
trait <- asthma.s$casecontrol
mod <- haplo.glm(trait ~ genoH,
family="binomial",
locus.label=snpsH,
allele.lev=attributes(genoH)$unique.alleles,
control = haplo.glm.control(haplo.freq.min=0.05))
intervals(mod)
```

```
freq or 95% C.I. P-val
ATG 0.4405 1.00 Reference haplotype
GAA 0.0827 1.09 ( 0.77 - 1.56 ) 0.6160
GAG 0.0841 0.99 ( 0.69 - 1.42 ) 0.9642
GTA 0.2080 1.06 ( 0.84 - 1.34 ) 0.6379
GTG 0.0769 1.09 ( 0.76 - 1.58 ) 0.6367
genoH.rare 0.1079 1.15 ( 0.83 - 1.59 ) 0.3968
```

*haplo.glm* fits a logistic regression model for the asthma
status (*trait* - NOTE: this must be a 0/1 variable) on the
inferred haplotypes (*genoH*), the names of SNPs and their allele
names are passed in the *locus.label* and *allele.lev*
arguments, while only haplotypes with at least 5% frequency are
considered. As a result, we obtain the OR for each haplotype with their
significance P-value, with respect to the most common haplotype (ATG
with 44% frequency). In particular, from this analysis, we cannot see
any haplotype significantly associated with asthma.

The inference of haplotypes depends on a predefined region or sets of SNPs. For instance, in the last section, we selected three SNPs that were in high LD. However, when no previous knowledge is available about the region or SNPs for which haplotypes should be inferred, we can apply a sliding window for haplotype inference .

To illustrate this type of analysis, we now consider a second block
of 10 SNPs in our asthma example (from 6th to 15th SNP). Considering
large haplotypes, however, increases the number of possible haplotypes
in the sample, decreasing the power of finding real associations. In
addition, in predefined blocks, it is possible to miss the most
efficient length of the susceptible haplotype, incrementing the loss of
power. This is overcome using a sliding window. We thus ask which is the
haplotype combination from any of 4, 5, 6 or 7 consecutive SNPs that
gives the highest association with asthma status. We then reformat SNP
genotypes in the region with the *make.geno* function and perform
an association analysis for multiple haplotypes of *i* SNPs
sliding from the 6th to the 15th SNP in the data. We perform an analysis
for each window length *i* varying from 4 to 7 SNPs.

```
snpsH2 <- labels(asthma.s)[6:15]
genoH2 <- make.geno(asthma.s, snpsH2)
haplo.score <- list()
for (i in 4:7) {
trait <- asthma.s$casecontrol
haplo.score[[i-3]] <- haplo.score.slide(trait, genoH2,
trait.type="binomial",
n.slide=i,
simulate=TRUE,
sim.control=score.sim.control(min.sim=100,
max.sim=200))
}
```

The results can be visualized with the following plot

```
par(mfrow=c(2,2))
for (i in 4:7) {
plot(haplo.score[[i-3]])
title(paste("Sliding Window=", i, sep=""))
}
```

We observe that the highest -log10(P-value) is obtained for a haplotype of 4 SNP length starting at the 4th SNP of the selected SNPs. After deciding the best combination of SNPs, the haplotype association with asthma can be estimated by

```
snpsH3 <- snpsH2[4:7]
genoH3 <- make.geno(asthma.s, snpsH3)
mod <- haplo.glm(trait~genoH3,
family="binomial",
locus.label=snpsH3,
allele.lev=attributes(genoH3)$unique.alleles,
control = haplo.glm.control(haplo.freq.min=0.05))
intervals(mod)
```

```
freq or 95% C.I. P-val
TCCC 0.3521 1.00 Reference haplotype
GCCC 0.2963 1.01 ( 0.82 - 1.24 ) 0.9495
TTCA 0.1184 0.89 ( 0.67 - 1.20 ) 0.4433
TTTA 0.1830 1.12 ( 0.87 - 1.43 ) 0.3796
genoH3.rare 0.0502 0.78 ( 0.50 - 1.22 ) 0.2750
```

Here, we observe that individuals carrying the haplotype GCAC have a
53% increased risk of asthma relative to those having the reference
haplotype TCGT (OR=1.53, p=0.0021). The haplotype GTCA is also
significantly associated with the disease (p=0.0379). A likelihood ratio
test for haplotype status can be extracted from the results of the
*haplo.glme* function:

`[1] 0.5044636`

We can also test the association between asthma and the haplotype adjusted for smoking status

```
smoke <- asthma.s$smoke
mod.adj.ref <- glm(trait ~ smoke, family="binomial")
mod.adj <- haplo.glm(trait ~ genoH3 + smoke ,
family="binomial",
locus.label=snpsH3,
allele.lev=attributes(genoH3)$unique.alleles,
control = haplo.glm.control(haplo.freq.min=0.05))
lrt.adj <- mod.adj.ref$deviance - mod.adj$deviance
pchisq(lrt.adj, mod.adj$lrt$df, lower=FALSE)
```

`[1] 0.6457956`

```
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=C LC_CTYPE=Spanish_Spain.utf8
[3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.utf8
time zone: Europe/Madrid
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] haplo.stats_1.9.7 arsenal_3.6.3 SNPassoc_2.1-2
loaded via a namespace (and not attached):
[1] gtable_0.3.5 xfun_0.48 bslib_0.8.0 ggplot2_3.5.1
[5] htmlwidgets_1.6.4 lattice_0.22-6 vctrs_0.6.5 tools_4.4.1
[9] generics_0.1.3 parallel_4.4.1 sandwich_3.1-1 tibble_3.2.1
[13] fansi_1.0.6 highr_0.11 cluster_2.1.6 pkgconfig_2.0.3
[17] Matrix_1.7-0 data.table_1.16.2 checkmate_2.3.2 lifecycle_1.0.4
[21] farver_2.1.2 compiler_4.4.1 stringr_1.5.1 MatrixModels_0.5-3
[25] munsell_0.5.1 codetools_0.2-20 poisbinom_1.0.1 SparseM_1.84-2
[29] quantreg_5.99 htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
[33] htmlTable_2.4.3 Formula_1.2-5 tidyr_1.3.1 pillar_1.9.0
[37] jquerylib_0.1.4 MASS_7.3-60.2 cachem_1.1.0 rms_6.8-2
[41] Hmisc_5.1-3 rpart_4.1.23 multcomp_1.4-26 nlme_3.1-164
[45] tidyselect_1.2.1 digest_0.6.37 polspline_1.1.25 mvtnorm_1.3-1
[49] stringi_1.8.4 purrr_1.0.2 dplyr_1.1.4 labeling_0.4.3
[53] splines_4.4.1 fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
[57] cli_3.6.3 magrittr_2.0.3 base64enc_0.1-3 survival_3.6-4
[61] utf8_1.2.4 TH.data_1.1-2 withr_3.0.1 foreign_0.8-86
[65] scales_1.3.0 backports_1.5.0 rmarkdown_2.28 nnet_7.3-19
[69] gridExtra_2.3 zoo_1.8-12 evaluate_1.0.1 knitr_1.48
[73] rlang_1.1.4 Rcpp_1.0.13 glue_1.8.0 rstudioapi_0.17.1
[77] jsonlite_1.8.9 plyr_1.8.9 R6_2.5.1
```

González, Juan R, Lluı́s Armengol, Xavier Solé, Elisabet Guinó, Josep M
Mercader, Xavier Estivill, and Vı́ctor Moreno. 2007.
“SNPassoc: An R Package to
Perform Whole Genome Association Studies.”
*Bioinformatics* 23 (5): 654–55.

González, Juan R, Josep L Carrasco, Frank Dudbridge, Lluı́s Armengol,
Xavier Estivill, and Victor Moreno. 2008. “Maximizing Association
Statistics over Genetic Models.” *Genetic Epidemiology* 32
(3): 246–54.

Wigginton, Janis E, David J Cutler, and Gonçalo R Abecasis. 2005.
“A Note on Exact Tests of Hardy-Weinberg
Equilibrium.” *The American Journal of Human Genetics* 76
(5): 887–93.