1 Intro

This exercise will show how to obtain clinical and genomic data from the Cancer Genome Atlas (TGCA) and to perform classical analysis important for clinical data.

These include:

  1. Download the data (clinical and expresion) from TGCA
  2. Processing of the data (normalization) and saving it locally using simple table formats.
  3. Unsupervised analysis includes differential expression, PCA and clustering.
  4. Build a machine learning model (classifier) to predict cancer.
  5. Perform survival analysis of molecular markers detected in previous analysis.

First, we start by loading all libraries necessary for this exercise. Please check their documentation if you want to know more.

# Load packages
library("TCGAbiolinks")
library("limma")
library("edgeR")
library("glmnet")
library("factoextra")
library("FactoMineR")
library("caret")
library("SummarizedExperiment")
library("gplots")
library("survival")
library("survminer")
library("RColorBrewer")
library("gProfileR")
library("genefilter")

2 TCGA data

In this tutorial, we will focus on Liver Hepatocellular Carcinoma, which is identified in TCGA as LIHC. For LIHC, TCGA provides data for 377 patients including: clinical, expression, DNA methylation and genotyping data. In this tutorial, we will work with clinical and expression data (RNA-seq). Go to https://portal.gdc.cancer.gov/ and search for TCGA-LIHC if you want to understand the data deposited in TCGA. You can also try to find your way through the previous data to look for other data sets of your interest.

We will make use of the TCGAbiolinks library, which allows us to query, prepare and download data from the TCGA portal. TCGAbiolinks provides important functionality as matching data of same the donors across distinct data types (clinical vs expression) and provides data structures to make its analysis in R easy.

To download TCGA data with TCGAbiolinks, you need to follow 3 steps. First, you will query the TCGA database through R with the function GDCquery. This will allow you to investigate the data available at the TCGA database. Next, we use GDCdownload to download raw version of desired files into your computer. Finally GDCprepare will read these files and make R data structures so that we can further analyse them.

Before we get there however we need to know what we are searching for. We can check all the available projects at TCGA with the command bellow. Since there are many lets look at the first 6 projects using the command head().

GDCprojects = getGDCprojects()

head(GDCprojects[c("project_id", "name")])

As a general rule in R (and especially if you are working in RStudio) whenever some method returns some value or table you are not familiar with, you should check its structure and dimensions. You can always use functions such as head() to only show the first entries and dim() to check the dimension of the data.

We already know that Liver Hepatocellular Carcinoma has as id TCGA-LIHC. We can use the following function to get details on all data deposited for TCGA-LIHC.

TCGAbiolinks:::getProjectSummary("TCGA-LIHC")
## $data_categories
##   case_count file_count               data_category
## 1        376       2122     Transcriptome Profiling
## 2        376       1537       Copy Number Variation
## 3        375       3032 Simple Nucleotide Variation
## 4        377        430             DNA Methylation
## 5        377        423                    Clinical
## 6        377       1654            Sequencing Reads
## 7        377       1634                 Biospecimen
## 
## $case_count
## [1] 377
## 
## $file_count
## [1] 10832
## 
## $file_size
## [1] 1.780716e+13

Of note, not all patients were measured for all data types. Also, some data types have more files than samples. This is the case when more experiments were performed per patient, i.e. transcriptome profiling was performed both in mRNA and miRNA, or that data have been analysed by distinct computational strategies.

Let us start by querying all RNA-seq data from LIHC project. When using GDCquery we always need to specify the id of the project, i.e. “TCGA_LIHC”, and the data category we are interested in, i.e. “Transcriptome Profiling”. Here, we will focus on a particular type of data summarization for mRNA-seq data (workflow.type), which is based on raw counts estimated with HTSeq.

Note that performing this query will take a few of minutes

query_TCGA = GDCquery(
  project = "TCGA-LIHC",
  data.category = "Transcriptome Profiling", # parameter enforced by GDCquery
  experimental.strategy = "RNA-Seq",
  workflow.type = "HTSeq - Counts")

To visualize the query results in a more readable way, we can use the command getResults.

lihc_res = getResults(query_TCGA) # make results as table
# head(lihc_res) # data of the first 6 patients.
colnames(lihc_res) # columns present in the table
##  [1] "data_release"              "data_type"                
##  [3] "updated_datetime"          "file_name"                
##  [5] "submitter_id"              "file_id"                  
##  [7] "file_size"                 "cases"                    
##  [9] "id"                        "created_datetime"         
## [11] "md5sum"                    "data_format"              
## [13] "access"                    "state"                    
## [15] "version"                   "data_category"            
## [17] "type"                      "experimental_strategy"    
## [19] "project"                   "analysis_id"              
## [21] "analysis_updated_datetime" "analysis_created_datetime"
## [23] "analysis_submitter_id"     "analysis_state"           
## [25] "analysis_workflow_link"    "analysis_workflow_type"   
## [27] "analysis_workflow_version" "tissue.definition"

One interesting question is the tissue type measured at an experiment (normal, solid tissue, cell line). This information is present at column “tisse.definition”.

head(lihc_res$tissue.definition) # first 6 types of tissue.
## [1] Primary solid Tumor Primary solid Tumor Primary solid Tumor
## [4] Solid Tissue Normal Primary solid Tumor Primary solid Tumor
## 19 Levels: Additional - New Primary ... Solid Tissue Normal

tissue.definition is a factor variable, which can be better visualized with the summary function.

summary(lihc_res$tissue.definition) # summary of distinct tissues types present in this study
##                          Additional - New Primary 
##                                                 0 
##                             Additional Metastatic 
##                                                 0 
##                              Blood Derived Normal 
##                                                 0 
##                                Bone Marrow Normal 
##                                                 0 
##                                Buccal Cell Normal 
##                                                 0 
##                Cell Line Derived Xenograft Tissue 
##                                                 0 
##                                        Cell Lines 
##                                                 0 
##                                   Control Analyte 
##                                                 0 
##                           EBV Immortalized Normal 
##                                                 0 
##                        Human Tumor Original Cells 
##                                                 0 
##                                        Metastatic 
##                                                 0 
##        Primary Blood Derived Cancer - Bone Marrow 
##                                                 0 
##   Primary Blood Derived Cancer - Peripheral Blood 
##                                                 0 
##                          Primary Xenograft Tissue 
##                                                 0 
##                               Primary solid Tumor 
##                                               371 
##      Recurrent Blood Derived Cancer - Bone Marrow 
##                                                 0 
## Recurrent Blood Derived Cancer - Peripheral Blood 
##                                                 0 
##                             Recurrent Solid Tumor 
##                                                 3 
##                               Solid Tissue Normal 
##                                                50

As you see, there are 50 controls (Solid Tissue Normal) and 371 cancer samples (Primary Solid Tumors). For simplicity, we will ignore the small class of recurrent solid tumors. Therefore, we will redo the query as

query_TCGA = GDCquery(
  project = "TCGA-LIHC",
  data.category = "Transcriptome Profiling", # parameter enforced by GDCquery
  experimental.strategy = "RNA-Seq",
  workflow.type = "HTSeq - Counts",
  sample.type = c("Primary solid Tumor", "Solid Tissue Normal"))

Next, we need to download the files from the query. Before, be sure that you set your current working directory to the place you want to save your data. TCGA will save the data in a directory structure startign with a directory “GDCdata”.

Let us now download the files specified in the query.

GDCdownload(query = query_TCGA)

Given that you need to download many files, the previous operation might take some time. Often the download fails for one or another file. You can re-run the previous command until no error message is given. The method will recognize that the data has already been downloaded and won’t download the data again.

Finally, lets load the actual RNASeq data into R. Remember that the output directory set must be the same to where you downloaded the data.

tcga_data = GDCprepare(query_TCGA)

We can then check the size of the object with the command.

dim(tcga_data)
## [1] 56512   421

There are 3 functions that allow us to access to most important data present in this object, these are: colData(), rowData(), assays(). Use the command ?SummarizedExperiment to find more details. colData() allows us to access the clinical data associated with our samples. The functions colnames() and rownames() can be used to extract the column and rows names from a given table respectively.

# In R (and other programming languages) you can often
# chain functions to save time and space
colnames(colData(tcga_data))
##  [1] "sample"                            "patient"                          
##  [3] "barcode"                           "shortLetterCode"                  
##  [5] "definition"                        "year_of_diagnosis"                
##  [7] "classification_of_tumor"           "last_known_disease_status"        
##  [9] "updated_datetime.x"                "primary_diagnosis"                
## [11] "tumor_stage"                       "age_at_diagnosis"                 
## [13] "morphology"                        "days_to_last_known_disease_status"
## [15] "created_datetime.x"                "prior_treatment"                  
## [17] "ajcc_pathologic_n"                 "ajcc_pathologic_m"                
## [19] "state.x"                           "days_to_last_follow_up"           
## [21] "days_to_recurrence"                "diagnosis_id"                     
## [23] "tumor_grade"                       "treatments"                       
## [25] "icd_10_code"                       "days_to_diagnosis"                
## [27] "tissue_or_organ_of_origin"         "progression_or_recurrence"        
## [29] "prior_malignancy"                  "ajcc_staging_system_edition"      
## [31] "ajcc_pathologic_stage"             "synchronous_malignancy"           
## [33] "site_of_resection_or_biopsy"       "ajcc_pathologic_t"                
## [35] "cigarettes_per_day"                "weight"                           
## [37] "updated_datetime.y"                "alcohol_history"                  
## [39] "alcohol_intensity"                 "bmi"                              
## [41] "years_smoked"                      "created_datetime.y"               
## [43] "state.y"                           "exposure_id"                      
## [45] "height"                            "updated_datetime"                 
## [47] "created_datetime"                  "gender"                           
## [49] "state"                             "year_of_birth"                    
## [51] "race"                              "days_to_birth"                    
## [53] "ethnicity"                         "vital_status"                     
## [55] "demographic_id"                    "age_at_index"                     
## [57] "year_of_death"                     "days_to_death"                    
## [59] "bcr_patient_barcode"               "dbgap_accession_number"           
## [61] "disease_type"                      "releasable"                       
## [63] "released"                          "primary_site"                     
## [65] "project_id"                        "name"                             
## [67] "is_ffpe"

This link provides a basic explanation about tcga_data. Note that both clinical and expression data are present in this object.

Lets look at some potentially interesting features. The table() function (in this context) produces a small summary with the sum of each of the factors present in a given column.

table(tcga_data@colData$vital_status)
## 
##        Alive         Dead Not Reported 
##          255          164            2
table(tcga_data@colData$tumor_stage)
## 
## not reported      stage i     stage ii    stage iii   stage iiia   stage iiib 
##           32          189           97            6           73            8 
##   stage iiic     stage iv    stage iva    stage ivb 
##           10            3            1            2
table(tcga_data@colData$definition)
## 
## Primary solid Tumor Solid Tissue Normal 
##                 371                  50
table(tcga_data@colData$tissue_or_organ_of_origin)
## 
## Liver 
##   421
table(tcga_data@colData$gender)
## 
## female   male 
##    143    278
table(tcga_data@colData$race)
## 
## american indian or alaska native                            asian 
##                                2                              164 
##        black or african american                     not reported 
##                               24                               13 
##                            white 
##                              218

Is there a particular column (feature) that allows you to distinguish tumor tissue from normal tissue?

What about the RNA-seq data? We can use the assay function to obtain the RNA-seq count matrices and rowData to see gene mapping information. Can you tell how many genes and how many samples are included there?

dim(assay(tcga_data))     # gene expression matrices.
## [1] 56512   421
head(assay(tcga_data)[,1:10]) # expression of first 6 genes and first 10 samples
##                 TCGA-EP-A2KA-01A-11R-A180-07 TCGA-DD-AACP-01A-11R-A41C-07
## ENSG00000000003                         1819                         8203
## ENSG00000000005                            0                            1
## ENSG00000000419                          789                         1313
## ENSG00000000457                          791                          460
## ENSG00000000460                          171                          802
## ENSG00000000938                          418                           83
##                 TCGA-DD-A11C-01A-11R-A131-07 TCGA-DD-A3A8-11A-11R-A22L-07
## ENSG00000000003                         6408                         4374
## ENSG00000000005                           16                            0
## ENSG00000000419                         3543                          453
## ENSG00000000457                         1314                          210
## ENSG00000000460                          452                           42
## ENSG00000000938                          200                          207
##                 TCGA-BD-A3ER-01A-11R-A213-07 TCGA-DD-A4NO-01A-11R-A28V-07
## ENSG00000000003                         3068                         8685
## ENSG00000000005                            0                            1
## ENSG00000000419                          635                         2054
## ENSG00000000457                          505                          779
## ENSG00000000460                          127                          191
## ENSG00000000938                          237                          111
##                 TCGA-ED-A5KG-01A-11R-A27V-07 TCGA-BC-A110-01A-11R-A131-07
## ENSG00000000003                         4941                         2326
## ENSG00000000005                            2                            2
## ENSG00000000419                         1717                          544
## ENSG00000000457                         1287                          247
## ENSG00000000460                          450                           58
## ENSG00000000938                          898                          336
##                 TCGA-DD-A1EL-11A-11R-A155-07 TCGA-DD-AAD5-01A-11R-A41C-07
## ENSG00000000003                         6314                         1523
## ENSG00000000005                            1                            1
## ENSG00000000419                         1616                         1717
## ENSG00000000457                          806                         1092
## ENSG00000000460                          143                          265
## ENSG00000000938                          351                          169
head(rowData(tcga_data))     # ensembl id and gene id of the first 6 genes.
## DataFrame with 6 rows and 3 columns
##                 ensembl_gene_id external_gene_name original_ensembl_gene_id
##                     <character>        <character>              <character>
## ENSG00000000003 ENSG00000000003             TSPAN6       ENSG00000000003.13
## ENSG00000000005 ENSG00000000005               TNMD        ENSG00000000005.5
## ENSG00000000419 ENSG00000000419               DPM1       ENSG00000000419.11
## ENSG00000000457 ENSG00000000457              SCYL3       ENSG00000000457.12
## ENSG00000000460 ENSG00000000460           C1orf112       ENSG00000000460.15
## ENSG00000000938 ENSG00000000938                FGR       ENSG00000000938.11

Finally, we can use some core functionality of R to save the TCGA_data as a .RDS file. This is faster than repeating the previous operations and useful if you need to work in your data for several days.

# Save the data as a file, if you need it later, you can just load this file
# instead of having to run the whole pipeline again
saveRDS(object = tcga_data,
        file = "tcga_data.RDS",
        compress = FALSE)

The data can be loaded with the following command

tcga_data = readRDS(file = "tcga_data.RDS")

3 RNASeq Normalization

3.1 Defining a pipeline

A typical task on RNA-Seq data is differential expression (DE) analysis, based on some clinical phenotypes. This, in turn, requires normalization of the data, as in its raw format it may have batch effects and other artifacts.

A common approach to such complex tasks is to define a computational pipeline, performing several steps in sequence, allowing the user to select different parameters.

We will now define and run one such pipeline, through the use of an R function.

The function is called limma_pipeline(tcga_data, condition_variable, reference_group), where tcga_data is the data object we have gotten from TCGA and condition_variable is the interesting variable/condition by which you want to group your patient samples. You can also decide which one of the values of your conditional variable is going to be the reference group, with the reference_group parameter.

This function returns a list with three different objects:

  • A complex object, resulting from running voom, this contains the TMM+voom normalized data;
  • A complex object, resulting from running eBayes, this contains the the fitted model plus a number of statistics related to each of the probes;
  • A simple table, resulting from running topTable, which contains the top 100 differentially expressed genes sorted by p-value. This is how the code of this function:
limma_pipeline = function(
  tcga_data,
  condition_variable,
  reference_group=NULL){

  design_factor = colData(tcga_data)[, condition_variable, drop=T]

  group = factor(design_factor)
  if(!is.null(reference_group)){group = relevel(group, ref=reference_group)}

  design = model.matrix(~ group)

  dge = DGEList(counts=assay(tcga_data),
                 samples=colData(tcga_data),
                 genes=as.data.frame(rowData(tcga_data)))

  # filtering
  keep = filterByExpr(dge,design)
  dge = dge[keep,,keep.lib.sizes=FALSE]
  rm(keep)

  # Normalization (TMM followed by voom)
  dge = calcNormFactors(dge)
  v = voom(dge, design, plot=TRUE)

  # Fit model to data given design
  fit = lmFit(v, design)
  fit = eBayes(fit)

  # Show top genes
  topGenes = topTable(fit, coef=ncol(design), number=100, sort.by="p")

  return(
    list(
      voomObj=v, # normalized data
      fit=fit, # fitted model and statistics
      topGenes=topGenes # the 100 most differentially expressed genes
    )
  )
}

Please read the Workflow in Details tab for a step by step explanation of what this pipeline does. For now, copy the limma_pipeline function in your R console, so we can start using it on the TCGA data.

With the following command, we can obtain the DE analysis comparing Primary solid Tumor samples against Solid Tissue Normal. This will be used in the next section, on the classification task.

limma_res = limma_pipeline(
  tcga_data=tcga_data,
  condition_variable="definition",
  reference_group="Solid Tissue Normal"
)