R/Bioconductor Package: HIBAG
Email: Xiuwen Zheng or Bruce S. Weir
This page was last updated on Mar 1, 2017
HIBAG is a state-of-the-art software package for imputing HLA types from SNP data, implemented in the R statistical programming language. HIBAG is highly accurate and computationally tractable, and researchers can use it with published parameter estimates (provided for subjects of European, Asian, Hispanic and African ancestries) instead of needing access to large training datasets. It combines attribute bagging, an ensemble classifier method, with haplotype inference for SNPs and HLA types. Attribute bagging improves the accuracy and stability of a classifier ensemble by building each classifier from a bootstrap sample of the training data (bootstrap aggregating) and a random subset of the variables (random variable selection).
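To make the idea of attribute bagging concrete, here is a minimal base-R sketch on simulated data. It is an illustration only: the nearest-centroid base learner, the helper names (`fit_centroid`, `predict_centroid`, `attr_bagging`, `predict_ensemble`) and the toy data are all assumptions made for this example and are not part of HIBAG, which uses haplotype-based classifiers instead.

```r
# Toy attribute bagging: every classifier is trained on a bootstrap sample of
# the rows and a random subset of the columns; predictions are combined by
# majority vote. The nearest-centroid learner is a stand-in, not HIBAG's model.

fit_centroid <- function(x, y) {
  # One row of feature means per class present in the sample
  classes <- sort(unique(y))
  centroids <- do.call(rbind, lapply(classes, function(k) {
    colMeans(x[y == k, , drop = FALSE])
  }))
  rownames(centroids) <- classes
  centroids
}

predict_centroid <- function(centroids, x) {
  # Squared Euclidean distance from every sample to every class centroid
  d <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(x, 2, centroids[k, ], "-")^2)
  })
  d <- matrix(d, nrow = nrow(x))
  rownames(centroids)[max.col(-d, ties.method = "first")]
}

attr_bagging <- function(x, y, nclassifier = 25,
                         nfeature = ceiling(sqrt(ncol(x)))) {
  lapply(seq_len(nclassifier), function(i) {
    rows  <- sample(nrow(x), replace = TRUE)  # bootstrap sample of individuals
    feats <- sample(ncol(x), nfeature)        # random subset of attributes
    list(feats = feats,
         centroids = fit_centroid(x[rows, feats, drop = FALSE], y[rows]))
  })
}

predict_ensemble <- function(ensemble, x) {
  # One column of votes per classifier, then a majority vote per sample
  votes <- sapply(ensemble, function(m) {
    predict_centroid(m$centroids, x[, m$feats, drop = FALSE])
  })
  votes <- matrix(votes, nrow = nrow(x))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# Simulated data: 10 attributes, only the first three carry signal
set.seed(1)
n <- 200
y <- rep(c("A", "B"), each = n / 2)
x <- matrix(rnorm(n * 10), nrow = n)
x[y == "B", 1:3] <- x[y == "B", 1:3] + 3

ens  <- attr_bagging(x, y, nclassifier = 25)
pred <- predict_ensemble(ens, x)
mean(pred == y)  # proportion of training samples classified correctly
```

Individual classifiers that happen to draw only uninformative attributes vote close to randomly, but the majority vote across the ensemble is dominated by the classifiers that drew informative ones, which is the stabilizing effect described above.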
R/Bioconductor: http://www.bioconductor.org/packages/release/bioc/html/HIBAG.html
The published parameters were estimated from HLA and SNP genotypes of multiple GlaxoSmithKline clinical trials (referred to as “HLARES”) and HapMap Phase 2. The HIBAG models were built from SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms. The training data consist of 1) HLARES data of European ancestry, 2) HLARES data of Asian ancestry (East & South Asia) and HapMap CHB+JPT, 3) HLARES data of Hispanic ancestry, and 4) African American HLARES data and 60 African parents of HapMap YRI.
HLA Nomenclature Updates (important update: April 2010)
Summary of training data set:
Ethnic-specific models of two-field (4-digit) resolution, presented in Zheng et al. (2014):
Prediction accuracy was used to assess overall model performance,
defined as "the number of chromosomes with HLA alleles predicted correctly"
over "the total number of chromosomes". The standard statistical quantities of
prediction quality for a specific HLA allele H:
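As a worked illustration of the accuracy definition above, the following base-R sketch counts correctly predicted chromosomes for a few made-up individuals. The function name `hla_accuracy` and the example alleles are hypothetical; the HIBAG package provides `hlaCompareAllele()` for comparing true and predicted HLA types on real data, and this sketch is not a substitute for it.

```r
# Accuracy = (chromosomes with the HLA allele predicted correctly) /
#            (total number of chromosomes), with two chromosomes per individual.
hla_accuracy <- function(true1, true2, pred1, pred2) {
  # The two alleles of an individual are unordered, so evaluate both pairings
  # and keep whichever gives more matches.
  direct  <- (true1 == pred1) + (true2 == pred2)
  swapped <- (true1 == pred2) + (true2 == pred1)
  correct <- pmax(direct, swapped)  # 0, 1 or 2 correct chromosomes
  sum(correct) / (2 * length(true1))
}

# Made-up example: three individuals, six chromosomes
true1 <- c("01:01", "02:01", "03:01")
true2 <- c("02:01", "02:01", "24:02")
pred1 <- c("02:01", "02:01", "03:01")
pred2 <- c("01:01", "11:01", "01:01")

acc <- hla_accuracy(true1, true2, pred1, pred2)
acc  # 2 + 1 + 1 = 4 correct chromosomes out of 6
```

The first individual is fully correct even though the alleles are listed in the opposite order, which is why both pairings have to be checked.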
    library(HIBAG)

    # Load the published parameter estimates from European ancestry
    model.list <- get(load("European-HLA4.RData"))

    #########################################################################
    # Import your PLINK BED file
    #
    yourgeno <- hlaBED2Geno(bed.fn=".bed", fam.fn=".fam", bim.fn=".bim")
    summary(yourgeno)

    # HLA imputation at HLA-A
    hla.id <- "A"
    model <- hlaModelFromObj(model.list[[hla.id]])
    summary(model)

    # HLA allele frequencies
    cbind(frequency = model$hla.freq)

    # SNPs in the model
    head(model$snp.id)
    # "rs2523442" "rs9257863" "rs2107191" "rs4713226" "rs1362076" "rs7751705"
    head(model$snp.position)
    # 29525796 29533563 29542274 29542393 29549148 29549597

    # best-guess genotypes and all posterior probabilities
    pred.guess <- predict(model, yourgeno, type="response+prob")
    summary(pred.guess)
    pred.guess$value
    pred.guess$postprob
    library(HIBAG)

    # Import your PLINK BED file
    geno <- hlaBED2Geno(bed.fn=".bed", fam.fn=".fam", bim.fn=".bim")
    summary(geno)

    # The HLA type of the first individual is 01:02/02:01, the second is 05:01/03:01, ...
    train.HLA <- hlaAllele(geno$sample.id, H1=c("01:02", "05:01", ...),
        H2=c("02:01", "03:01", ...), locus="A")

    # Or the HLA types are saved in a text file "YourHLATypes.txt":
    # SampleID Allele1 Allele2
    # NA001101 01:02 02:01
    # NA001201 05:01 03:01
    # ...
    D <- read.table("YourHLATypes.txt", header=TRUE, stringsAsFactors=FALSE)
    train.HLA <- hlaAllele(D$SampleID, H1=D$Allele1, H2=D$Allele2, locus="A")
    summary(train.HLA)

    # Selected SNPs, two options:
    # 1) the flanking region of 500kb on each side,
    #    or an appropriate flanking size without sacrificing predictive accuracy
    snpid <- hlaFlankingSNP(geno$snp.id, geno$snp.position, "A", 500*1000)
    # 2) the SNPs in our pre-fit models
    model.list <- get(load("European-HLA4.RData"))
    snpid <- model.list[["A"]]$snp.id

    # Subset training SNP genotypes
    train.geno <- hlaGenoSubset(geno, snp.sel=match(snpid, geno$snp.id))

    # Building ...
    set.seed(1000)
    model <- hlaAttrBagging(train.HLA, train.geno, nclassifier=100, verbose.detail=TRUE)
    summary(model)

    # Save your model
    model.obj <- hlaModelToObj(model)
    save(model.obj, file="your_model.RData")

    # Predict ...
    model.obj <- get(load("your_model.RData"))
    model <- hlaModelFromObj(model.obj)
    summary(model)

    # best-guess genotypes and all posterior probabilities
    # (newgeno: SNP genotypes of the samples to impute, e.g. imported with hlaBED2Geno)
    pred.guess <- predict(model, newgeno, type="response+prob")
    summary(pred.guess)
    pred.guess$value
    pred.guess$postprob
    library(parallel)
    library(HIBAG)

    # Import your PLINK BED file
    geno <- hlaBED2Geno(bed.fn=".bed", fam.fn=".fam", bim.fn=".bim")
    summary(geno)

    # The HLA type of the first individual is 01:02/02:01, the second is 05:01/03:01, ...
    train.HLA <- hlaAllele(geno$sample.id, H1=c("01:02", "05:01", ...),
        H2=c("02:01", "03:01", ...), locus="A")

    # Or the HLA types are saved in a text file "YourHLATypes.txt":
    # SampleID Allele1 Allele2
    # NA001101 01:02 02:01
    # NA001201 05:01 03:01
    # ...
    D <- read.table("YourHLATypes.txt", header=TRUE, stringsAsFactors=FALSE)
    train.HLA <- hlaAllele(D$SampleID, H1=D$Allele1, H2=D$Allele2, locus="A")
    summary(train.HLA)

    # Create an environment with an appropriate cluster size
    cl <- makeCluster(8)

    # Building ...
    set.seed(1000)
    hlaParallelAttrBagging(cl, train.HLA, geno, nclassifier=100,
        auto.save="AutoSaveModel.RData")
    model.obj <- get(load("AutoSaveModel.RData"))
    model <- hlaModelFromObj(model.obj)
    summary(model)

    # best-guess genotypes and all posterior probabilities
    # (yourgeno: SNP genotypes of the samples to impute, e.g. imported with hlaBED2Geno)
    pred.guess <- predict(model, yourgeno, type="response+prob", cl=cl)
    summary(pred.guess)
    pred.guess$value
    pred.guess$postprob

    # Shut down the worker processes when finished
    stopCluster(cl)