snpgdsPCA {SNPRelate} | R Documentation |
To calculate the eigenvectors and eigenvalues for principal component analysis in GWAS.
snpgdsPCA(gdsobj, sample.id=NULL, snp.id=NULL, autosome.only=TRUE, remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, algorithm=c("exact", "randomized"), eigen.cnt=ifelse(identical(algorithm, "randomized"), 16L, 32L), num.thread=1L, bayesian=FALSE, need.genmat=FALSE, genmat.only=FALSE, eigen.method=c("DSPEVX", "DSPEV"), aux.dim=eigen.cnt*2L, iter.num=10L, verbose=TRUE) ## S3 method for class 'snpgdsPCAClass' plot(x, eig=c(1L,2L), ...)
gdsobj |
an object of class |
sample.id |
a vector of sample id specifying selected samples; if NULL, all samples are used |
snp.id |
a vector of snp id specifying selected SNPs; if NULL, all SNPs are used |
autosome.only |
if |
remove.monosnp |
if TRUE, remove monomorphic SNPs |
maf |
to use the SNPs with ">= maf" only; if NaN, no MAF threshold |
missing.rate |
to use the SNPs with "<= missing.rate" only; if NaN, no missing threshold |
eigen.cnt |
output the number of eigenvectors; if eigen.cnt <= 0, then return all eigenvectors |
algorithm |
"exact", traditional exact calculation; "randomized", fast PCA with randomized algorithm introduced in Galinsky et al. 2016 |
num.thread |
the number of (CPU) cores used; if |
bayesian |
if TRUE, use bayesian normalization |
need.genmat |
if TRUE, return the genetic covariance matrix |
genmat.only |
return the genetic covariance matrix only, do not compute the eigenvalues and eigenvectors |
eigen.method |
"DSPEVX" – compute the top |
aux.dim |
auxiliary dimension used in fast randomized algorithm |
iter.num |
iteration number used in fast randomized algorithm |
verbose |
if TRUE, show information |
x |
a |
eig |
indices of eigenvectors, like |
... |
the arguments passed to or from other methods, like
|
The minor allele frequency and missing rate for each SNP passed in
snp.id
are calculated over all the samples in sample.id
.
Return a snpgdsPCAClass
object, and it is a list:
sample.id |
the sample ids used in the analysis |
snp.id |
the SNP ids used in the analysis |
eigenval |
eigenvalues |
eigenvect |
eigenvactors, "# of samples" x "eigen.cnt" |
varprop |
variance proportion for each principal component |
TraceXTX |
the trace of the genetic covariance matrix |
Bayesian |
whether use bayerisan normalization |
genmat |
the genetic covariance matrix |
Xiuwen Zheng
Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006 Dec;2(12):e190.
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.
snpgdsPCACorr
, snpgdsPCASNPLoading
,
snpgdsPCASampLoading
, snpgdsAdmixProp
,
snpgdsEIGMIX
# open an example dataset (HapMap) genofile <- snpgdsOpen(snpgdsExampleFileName()) # run PCA RV <- snpgdsPCA(genofile) # eigenvalues head(RV$eigenval) # variance proportion (%) head(round(RV$varprop*100, 2)) # [1] 12.23 5.84 1.01 0.95 0.84 0.74 plot(RV) plot(RV, 1:4) #### there is no population information #### # make a data.frame tab <- data.frame(sample.id = RV$sample.id, EV1 = RV$eigenvect[,1], # the first eigenvector EV2 = RV$eigenvect[,2], # the second eigenvector stringsAsFactors = FALSE) head(tab) # sample.id EV1 EV2 # 1 NA19152 -0.08411287 -0.01226860 # 2 NA19139 -0.08360644 -0.01085849 # 3 NA18912 -0.08110808 -0.01184524 # 4 NA19160 -0.08680864 -0.01447106 # 5 NA07034 0.03109761 0.07709255 # 6 NA07055 0.03228450 0.08155730 # draw plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1") #### there are population information #### # get population information # or pop_code <- scan("pop.txt", what=character()) # if it is stored in a text file "pop.txt" pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group")) # get sample id samp.id <- read.gdsn(index.gdsn(genofile, "sample.id")) # assume the order of sample IDs is as the same as population codes cbind(samp.id, pop_code) # samp.id pop_code # [1,] "NA19152" "YRI" # [2,] "NA19139" "YRI" # [3,] "NA18912" "YRI" # [4,] "NA19160" "YRI" # [5,] "NA07034" "CEU" # ... # make a data.frame tab <- data.frame(sample.id = RV$sample.id, pop = factor(pop_code)[match(RV$sample.id, samp.id)], EV1 = RV$eigenvect[,1], # the first eigenvector EV2 = RV$eigenvect[,2], # the second eigenvector stringsAsFactors = FALSE) head(tab) # sample.id pop EV1 EV2 # 1 NA19152 YRI -0.08411287 -0.01226860 # 2 NA19139 YRI -0.08360644 -0.01085849 # 3 NA18912 YRI -0.08110808 -0.01184524 # 4 NA19160 YRI -0.08680864 -0.01447106 # 5 NA07034 CEU 0.03109761 0.07709255 # 6 NA07055 CEU 0.03228450 0.08155730 # draw plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1") legend("bottomright", legend=levels(tab$pop), pch="o", col=1:4) # close the file snpgdsClose(genofile)