We will show here a typical pipeline for analysis of gene expression data. This tutorial is based on the use of Monocytes and Macrophage data from the following paper:

van de Laar L, Saelens W, De Prijck S, Martens L et al. Yolk Sac Macrophages, Fetal Liver, and Adult Monocytes Can Colonize an Empty Niche and Develop into Functional Tissue-Resident Macrophages. Immunity 2016 Apr 19;44(4):755-68. PMID: 26992565

This tutorial contains the following steps.

  1. Download the data from GEO
  2. Process the expression table for analysis
  3. DE Analysis by limma
  4. PCA (Principal component analysis)
  5. Clustering by Heatmap (only top 250 genes)
  6. GO (Gene Ontology) Enrichment Analysis
  7. Combining other dataset for PCA and clustering

Steps 1 through 3 correspond to code that can be automatically generated by GEO2R from Gene Expression Omnibus (GEO), as shown during the course.

Here you can find the files necessary to complete this tutorial. Inside this zip folder, course_files.zip, you have:

File Description
handout.pdf PDF version of this page, a step-by-step explanation of the tasks.
Imun_gen.ex.Rdump Preprocessed object from GEO for loading in the analysis.
pheno.table A table storing the phenotype data for comparison.

Loading the libraries

Before running the analysis, all the necessary libraries need to be loaded. We assume you have successfully installed all libraries as described here (http://costalab.org/r-installation/)


Biobase contains base functions for Bioconductor;
GEOquery is the bridge between GEO and BioConductor;
limma is a library for the analysis of gene expression microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression;
gplots is for making graphical figures;
RColorBrewer can make colorful palettes;
gProfileR is the package to perform gene enrichment analysis;
sva (ComBat) is used here to remove known batch effects from microarray data.

1 Download the data from GEO

First, we need to download the data from GEO by the GEO number.

gset <- getGEO("GSE76999", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL6246", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

ExpressionSet (gset here) is a Container for high-throughput assays and experimental metadata.

2 Process the expression table for analysis

Here we select only the first 12 samples for downstream analysis and eliminate others. These 12 samples are BM-MOs (bone marrow monocytes), FL-MOs (fetal liver monocytes), and YS-Macs (yolk sac-derived macrophages), with 4 replicates for each.

# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# keep only first 12 samples
gset <- gset[ ,1:12]
# pData function allows us to check the sample data information in the gset container
# as a sanity check, we examine the sample information to make sure we picked the correct samples
# the columns "title" and "tissue:ch1" give us the relevant information for this
##                                                                                 title
## GSM2042244 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 1
## GSM2042245 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 2
## GSM2042246 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 3
## GSM2042247 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 4
## GSM2042248          Monocyte extracted from E15.5 Fetal Liver, biological replicate 1
## GSM2042249          Monocyte extracted from E15.5 Fetal Liver, biological replicate 2
## GSM2042250          Monocyte extracted from E15.5 Fetal Liver, biological replicate 3
## GSM2042251          Monocyte extracted from E15.5 Fetal Liver, biological replicate 4
## GSM2042252           Macrophage extracted from E12.5 Yolk Sac, biological replicate 1
## GSM2042253           Macrophage extracted from E12.5 Yolk Sac, biological replicate 2
## GSM2042254           Macrophage extracted from E12.5 Yolk Sac, biological replicate 3
## GSM2042255           Macrophage extracted from E12.5 Yolk Sac, biological replicate 4
##             tissue:ch1
## GSM2042244 Bone Marrow
## GSM2042245 Bone Marrow
## GSM2042246 Bone Marrow
## GSM2042247 Bone Marrow
## GSM2042248 Fetal Liver
## GSM2042249 Fetal Liver
## GSM2042250 Fetal Liver
## GSM2042251 Fetal Liver
## GSM2042252    Yolk Sac
## GSM2042253    Yolk Sac
## GSM2042254    Yolk Sac
## GSM2042255    Yolk Sac
# make label (remember 12 samples, 3 different types, 4 replicates of each)
factor_label <- as.factor(c(rep(0,4), rep(1,4), rep(2,4)))
# line below creates the same label but directly from the sample data information 
# factor_label <- as.factor(as.numeric(as.factor(pData(gset)[,"tissue:ch1"]))-1)

labels <- c("BM-MOs", "FL-MOs", "YS-Macs")

Before we do any further analysis, we need to do quality control of the data. We use boxplot to check whether we need further normalization on the data.

palette(c("#dfeaf4","#f4dfdf", "#f2cb98"))
title <- paste ("GSE76999", '/', annotation(gset), " selected samples", sep ='')
boxplot(exprs(gset), boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=factor_label)
legend("topleft", labels, fill=palette(), bty="n")

The boxplot indicates all samples have the same distribution, therefore there is no need for any normalization.

3 Differential Expression Analysis by limma

Next, we want to find genes that are differentially expressed between BM-MOs, FL-MOs and YS-Macs using limma. To this end, we perform a log2 transformation of the expression data. This is how gene expression data is often displayed. Then, we create a design matrix defining the groups to compare. Finally, using limma, we fit a model to the data given the design matrix we defined. Here we show you how to get the top 250 DE genes.

# transform into log2 scale
exprs(gset) <- log2(exprs(gset))

# set up the data and proceed with analysis
tmp <- paste("G", factor_label, sep="") # set group names
factor_label <- as.factor(tmp)
gset$description <- factor_label
design <- model.matrix(~ description - 1, gset) # "- 1" removes the intercept term
colnames(design) <- levels(factor_label)
fit <- lmFit(gset, design) # fit model to data given design
cont.matrix <- makeContrasts(G2-G0, G1-G0, G2-G1, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.05)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","F","Gene.symbol","Gene.title"))
write.table(tT, file="de_genes.csv", row.names=F, sep="\t")

Please open the output table de_genes.csv and try to understand each column.

The above script is (basically) from GEO2R and only gets the top 250 DE genes. In order to get all significant DE genes, we use the script below:

tT <- topTable(fit2, adjust="fdr", sort.by="B", p.value=0.05, number=Inf)

Up to this point most of the code can be automatically generated with GEO2R. For these data specifically, you can do it from https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE76999, by selecting the correct sample groups and clicking the R Script tab.

MA plot

The MA plot visualizes the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values. Here we take BM-MOs (G0) v.s. YS-Macs (G2) as an example.

ma <- fit2[,"G2 - G0"]

Volcano plot

A volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets. It plots significance versus fold-change on the y and x axes, respectively.

volcanoplot(fit2, coef=1, xlim=c(-2,2), col="gray")