Chapter 10: Exercise 11

b

dd = as.dist(1 - cor(data))
plot(hclust(dd, method="complete"))

[Plot: dendrogram of the samples, complete linkage]

plot(hclust(dd, method="single"))

[Plot: dendrogram of the samples, single linkage]

plot(hclust(dd, method="average"))

[Plot: dendrogram of the samples, average linkage]

The samples separate into two or three groups, depending on the linkage method.
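The group counts read off the dendrograms can also be checked numerically with cutree. The following is a minimal sketch on simulated data, not the exercise data: the matrix sim, its dimensions (1000 genes by 40 samples), and the shift applied to the last 20 samples are all assumptions made for illustration.

```r
# Simulated stand-in for the exercise data (an assumption, not the real file):
# 1000 genes in rows, 40 samples in columns.
set.seed(1)
sim <- matrix(rnorm(1000 * 40), nrow = 1000, ncol = 40)
# Hypothetical group effect: shift the first 100 genes in the last 20 samples.
sim[1:100, 21:40] <- sim[1:100, 21:40] + 2

# Correlation-based distance between samples, as in the code above.
dd_sim <- as.dist(1 - cor(sim))

# Cut each dendrogram into two clusters and print the cluster sizes.
for (m in c("complete", "single", "average")) {
  groups <- cutree(hclust(dd_sim, method = m), k = 2)
  cat(m, ":", table(groups), "\n")
}
```

Changing k to 3 shows how the cluster sizes shift under each linkage, which is a quick way to compare the methods without reading the plots.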

c

To find the genes that differ the most between the healthy and diseased patients, we could examine the loading vectors from PCA and see which genes contribute the most to the directions of greatest variance.

pr.out = prcomp(t(data))
summary(pr.out)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6    PC7
## Standard deviation     11.941 6.0682 5.9348 5.8312 5.7521 5.7003 5.6345
## Proportion of Variance  0.127 0.0327 0.0313 0.0302 0.0294 0.0289 0.0282
## Cumulative Proportion   0.127 0.1594 0.1907 0.2209 0.2503 0.2792 0.3074
##                           PC8    PC9   PC10   PC11   PC12   PC13   PC14
## Standard deviation     5.5773 5.5494 5.5062 5.4885 5.4602 5.4023 5.3344
## Proportion of Variance 0.0276 0.0274 0.0269 0.0268 0.0265 0.0259 0.0253
## Cumulative Proportion  0.3350 0.3624 0.3893 0.4160 0.4425 0.4685 0.4938
##                          PC15   PC16  PC17   PC18   PC19   PC20   PC21
## Standard deviation     5.2776 5.2159 5.200 5.1514 5.1160 5.0559 5.0384
## Proportion of Variance 0.0248 0.0242 0.024 0.0236 0.0232 0.0227 0.0226
## Cumulative Proportion  0.5185 0.5427 0.567 0.5903 0.6135 0.6362 0.6588
##                          PC22   PC23   PC24  PC25   PC26   PC27   PC28
## Standard deviation     5.0187 4.9597 4.9139 4.864 4.8180 4.8081 4.7348
## Proportion of Variance 0.0224 0.0219 0.0215 0.021 0.0206 0.0205 0.0199
## Cumulative Proportion  0.6812 0.7030 0.7245 0.745 0.7661 0.7866 0.8066
##                          PC29   PC30   PC31   PC32   PC33  PC34   PC35
## Standard deviation     4.7010 4.6556 4.6162 4.5673 4.5303 4.495 4.3650
## Proportion of Variance 0.0196 0.0193 0.0189 0.0185 0.0182 0.018 0.0169
## Cumulative Proportion  0.8262 0.8455 0.8644 0.8829 0.9012 0.919 0.9360
##                          PC36   PC37   PC38   PC39     PC40
## Standard deviation     4.3586 4.2670 4.2028 4.1392 5.25e-15
## Proportion of Variance 0.0169 0.0162 0.0157 0.0152 0.00e+00
## Cumulative Proportion  0.9529 0.9691 0.9848 1.0000 1.00e+00
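The "Proportion of Variance" row in this summary is each component's squared standard deviation divided by the sum of all squared standard deviations. The last component has a standard deviation that is numerically zero because centering 40 observations leaves at most 39 directions with nonzero variance. A minimal sketch of the calculation, using a simulated matrix sim (an assumption; it stands in for the exercise data):

```r
# Simulated stand-in for the exercise data (an assumption, not the real file).
set.seed(1)
sim <- matrix(rnorm(1000 * 40), nrow = 1000, ncol = 40)

# PCA with samples as observations, as in the code above.
pr_sim <- prcomp(t(sim))

# Proportion of variance explained by each component.
pve <- pr_sim$sdev^2 / sum(pr_sim$sdev^2)
# Cumulative proportion of variance explained.
cum_pve <- cumsum(pve)

round(pve[1:5], 4)
round(cum_pve[1:5], 4)
# Standard deviation of the last component (expected to be close to zero).
pr_sim$sdev[40]
```

Note that summary() rounds these proportions before printing, so the values computed here may differ from the printed table in the last digits.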

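# Sum each gene's loadings across all principal components
total_load = apply(pr.out$rotation, 1, sum)
# Rank genes by the absolute value of the summed loading
indices = order(abs(total_load), decreasing=TRUE)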
indices[1:10]

##  [1] 865  68 911 428 624  11 524 803 980 822

total_load[indices[1:10]]

##  [1]  0.7765  0.7138 -0.7100 -0.6364 -0.6196  0.5885  0.5583  0.5535
##  [9] -0.5217  0.4982

These are the 10 genes (the top 1% of the 1,000) with the largest absolute summed loadings, which is one way to identify the genes that differ the most.

(*) I’m not sure this is the correct way to aggregate the loading vectors.
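One reason for the doubt: summing signed loadings across all components lets positive and negative entries cancel, and it gives the minor components the same weight as PC1. A possible alternative, offered only as a sketch, is to rank genes by the absolute value of their PC1 loading alone. This assumes that PC1 captures the healthy versus diseased separation, which is not verified here. The data are simulated, and the names sim, pr_sim, pc1_load, and top10 are hypothetical.

```r
# Simulated stand-in for the exercise data (an assumption, not the real file):
# 1000 genes in rows, 40 samples in columns.
set.seed(1)
sim <- matrix(rnorm(1000 * 40), nrow = 1000, ncol = 40)
# Hypothetical group effect: shift the first 100 genes in the last 20 samples.
sim[1:100, 21:40] <- sim[1:100, 21:40] + 2

pr_sim <- prcomp(t(sim))

# Rank genes by the magnitude of their loading on the first component only.
pc1_load <- pr_sim$rotation[, 1]
top10 <- order(abs(pc1_load), decreasing = TRUE)[1:10]

top10
pc1_load[top10]
```

The absolute value is used because the sign of a principal component is arbitrary. If the groups do not separate along PC1, this ranking would not reflect group differences either.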