Tophat2 : Download, build reference genome and align the reads to the reference genome

Objectives
Download data
Download the reference genome
- Download reference genome from Ensembl
- Download reference genome from UCSC
Get gene model annotations
Build the reference index
Workflow
Inspect alignments with IGV

This analysis was performed using R (ver. 3.1.0).

Objectives

The aim of this article is to show you:

How to download and build reference genome?
How to align the reads using Tophat2?
How to inspect BAM file using IGV?

Download data

The data from Popovic et al., (GEO accession number: GSE57478) where used in the following example. The SRA files are available here : http://www.ncbi.nlm.nih.gov/sra?term=SRP032510.

Download the SRA files and convert to FASTQ files. This is described here.

Download the reference genome

Create a directory called “genome/” to save your reference genome.

Note that bowtie2/tophat2 indices for many commonly used reference genomes can be downloaded directly from http://tophat.cbcb.umd.edu/igenomes.html.

Reference genome index (from FASTA file) for bowtie2/tophat2, can be build by following the explanation down below.

User have to download the reference genome sequence for the organism under study in (compressed) FASTA format. This can be done from Ensembl and UCSC databases among many others.

Ensembl FTP server : http://www.ensembl.org/info/data/ftp/index.html UCSC FTP server : ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/

For Ensembl, choose the FASTA (DNA) link instead of FASTA (cDNA), since alignments to the genome, not the transcriptome, are desired.

Download reference genome from Ensembl

In this article, homo sapiens reference genome from Ensembl database is used. For homo sapiens the file labeled toplevel combines all chromosomes. Download and uncompress the reference genome, using the following UNIX commands :

wget ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz

Download reference genome from UCSC

Click here to download the data corresponding to your organism of interest
Click on “Homo sapiens” -> bigZips -> download the “chromFa.tar.gz”

UNIX command :

wget ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/chromFa.tar.gz gunzip chromFa.tar.gz

Get gene model annotations

Download a GTF file with gene models for the organism of interest.

Homo sapiens gene model annotation can be downloaded as follow:

wget ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz gunzip Homo_sapiens.GRCh38.77.gtf.gz

Always download the FASTA reference sequence and the GTF annotation data from the same resource provider.

Build the reference index

Before reads can be aligned, the reference FASTA files need to be preprocessed into an index that allows the aligner easy access.

To build a bowtie2-specific index from the FASTA file use the command:

bowtie2-build -f Homo_sapiens.GRCh38.dna.toplevel.fa. Homo_sapiens_GRCh38

A set of BT2 files will be produced, with names starting with Homo_sapiens_GRCh38 as specified.

This procedure needs to be run only once for each reference genome used. As mentioned, pre-built indices for many commonly used genomes are available from http://tophat.cbcb.umd.edu/igenomes.html.

Workflow

Assess sequence quality control with ShortRead

From R, set your working directory (using `setwd’ command) to the directory where FASTQ file are situated.
Run the folllowing R code :

library("ShortRead")
QC = qa(dirPath=".", pattern=".fastq$", type="fastq")
report(QC, type="html", dest="fastqQAreport")

The output is an html files located in the “fastqQAreport” directory.

Use a web browser to inspect the generated HTML file.

Prepare your sample metadata informations

# Initial SRA info table
sri<-read.csv("SraRunInfo.csv", stringsAsFactors=FALSE)
# Prepare sample information table
#Add important descriptive columns
samples=as.data.frame(
  list(
     name=c('KMS11_GSK343_1', 'KMS11_GSK343_2','TKO_GSK343_1', 'TKO_GSK343_2',
           'KMS11_GSK669_1', 'KMS11_GSK669_2', 'TKO_GSK669_1', 'TKO_GSK669_2'),
     fastq=paste(sri$Run, ".fastq", sep=""),
     avgLength=sri$avgLength,
     data_type=sri$LibraryStrategy,
     layout = sri$LibraryLayout,
     geoid=sri$SampleName,
     cell_line=c(1, 1, 2, 2, 1,1,2,2),
     treatment=c("ctr", "ctr", "ctr", "ctr", "ezh2i", "ezh2i", "ezh2i", "ezh2i")
    )
  )
rownames(samples)=samples$name
samples[sort(rownames(samples)), ]

##                          name            fastq avgLength data_type layout      geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq        51   RNA-Seq SINGLE GSM1383539         1       ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq        51   RNA-Seq SINGLE GSM1383540         1       ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq        51   RNA-Seq SINGLE GSM1383543         1     ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq        51   RNA-Seq SINGLE GSM1383544         1     ezh2i
## TKO_GSK343_1     TKO_GSK343_1 SRR1282058.fastq        51   RNA-Seq SINGLE GSM1383541         2       ctr
## TKO_GSK343_2     TKO_GSK343_2 SRR1282059.fastq        51   RNA-Seq SINGLE GSM1383542         2       ctr
## TKO_GSK669_1     TKO_GSK669_1 SRR1282062.fastq        51   RNA-Seq SINGLE GSM1383545         2     ezh2i
## TKO_GSK669_2     TKO_GSK669_2 SRR1282063.fastq        51   RNA-Seq SINGLE GSM1383546         2     ezh2i

head(samples)

##                          name            fastq avgLength data_type layout      geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq        51   RNA-Seq SINGLE GSM1383539         1       ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq        51   RNA-Seq SINGLE GSM1383540         1       ctr
## TKO_GSK343_1     TKO_GSK343_1 SRR1282058.fastq        51   RNA-Seq SINGLE GSM1383541         2       ctr
## TKO_GSK343_2     TKO_GSK343_2 SRR1282059.fastq        51   RNA-Seq SINGLE GSM1383542         2       ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq        51   RNA-Seq SINGLE GSM1383543         1     ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq        51   RNA-Seq SINGLE GSM1383544         1     ezh2i

Since the downstream statistical analysis of differential expression relies on this table, carefully inspect (and correct, if necessary) the metadata table.

Align the reads to reference genome using tophat2

Use R to create the list of shell commands :

genome="genome/Homo_sapiens_GRCh38"
cmd=with(samples, paste("tophat2 -o ", name, " -p 10  ",genome, " ", fastq, sep="" ))
for(c in cmd) system(c)

## tophat2 -o KMS11_GSK343_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282056.fastq 
## tophat2 -o KMS11_GSK343_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282057.fastq 
## tophat2 -o TKO_GSK343_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282058.fastq 
## tophat2 -o TKO_GSK343_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282059.fastq 
## tophat2 -o KMS11_GSK669_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282060.fastq 
## tophat2 -o KMS11_GSK669_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282061.fastq 
## tophat2 -o TKO_GSK669_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282062.fastq 
## tophat2 -o TKO_GSK669_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282063.fastq

In the call to tophat2, the option -o specifies the output directory, -p specifies the number of threads to use (affect run times, vary depending on the resources available). The first argument is the name of the index. The second argument is a list of all FASTQ files.

Other parameters can be specified to tophat2 :

tophat2 -G Homo_sapiens.GRCh38.77.gtf -o output_dir -p 10 –-no-coverage-search genome/Homo_sapiens_GRCh38 file.fastq

- The option -G points tophat2 to a GTF file of annotation to facilitate mapping reads across exon-exon junctions (some of witch can be found de novo).

The coverage-search algorithm was turned off because it took too long.

Note that the FASTQ files are concatenated with commas, without spaces. For experiments with paired-end reads, pairs of FASTQ files are given as separate arguments and the order in both arguments must match.

paired-end sequencing

tophat2 -o tophat_output_dir -p 8 –no-coverage-search /path/to/genome/Bowtie2Index/genome file_1.fastq file_2.fastq samtools sort -n file_tophat_out/accepted_hits.bam _sorted

Single-end sequencing tophat2 -o tophat_output_dir -p 8 /path/to/genome/Bowtie2Index/genome file.fastq samtools sort -n file_tophat_out/accepted_hits.bam _sorted

Run the command

The commands can be executed by copy and paste in UNIX terminal. system function can be used to execute these command direct from R. The list of the commands can also be stored in a text file and use UNIX source command.

Sort and index the BAM files and create SAM files

BAM files are organized into a single directory. SAM files is required when you want to count reads with htseq-count.

for(i in 1:nrow(samples)){
  lib=samples$name[i]
  bamFile=file.path(lib, "accepted_hits.bam")
  #sort by name and convert to SAM for htseq-count counting
  system(paste0("samtools sort -n ",bamFile," ",lib,"_sn")) #sort
  system(paste0("samtools view -o ",lib,"_sn.sam ",lib,"_sn.bam")) #convert to sam
 
 # sort by position and index for IGV
 system(paste0("samtools sort ",bamFile," ",lib,"_s"))
 system(paste0("samtools index ",lib,"_s.bam"))
}

The following code is executed :

## samtools sort -n KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_sn 
## samtools view -o KMS11_GSK343_1_sn.sam KMS11_GSK343_1_sn.bam 
## samtools sort KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_s 
## samtools index KMS11_GSK343_1_s.bam 
## 
## samtools sort -n KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_sn 
## samtools view -o KMS11_GSK343_2_sn.sam KMS11_GSK343_2_sn.bam 
## samtools sort KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_s 
## samtools index KMS11_GSK343_2_s.bam 
## 
## samtools sort -n TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_sn 
## samtools view -o TKO_GSK343_1_sn.sam TKO_GSK343_1_sn.bam 
## samtools sort TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_s 
## samtools index TKO_GSK343_1_s.bam 
## 
## samtools sort -n TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_sn 
## samtools view -o TKO_GSK343_2_sn.sam TKO_GSK343_2_sn.bam 
## samtools sort TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_s 
## samtools index TKO_GSK343_2_s.bam 
## 
## samtools sort -n KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_sn 
## samtools view -o KMS11_GSK669_1_sn.sam KMS11_GSK669_1_sn.bam 
## samtools sort KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_s 
## samtools index KMS11_GSK669_1_s.bam 
## 
## samtools sort -n KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_sn 
## samtools view -o KMS11_GSK669_2_sn.sam KMS11_GSK669_2_sn.bam 
## samtools sort KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_s 
## samtools index KMS11_GSK669_2_s.bam 
## 
## samtools sort -n TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_sn 
## samtools view -o TKO_GSK669_1_sn.sam TKO_GSK669_1_sn.bam 
## samtools sort TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_s 
## samtools index TKO_GSK669_1_s.bam 
## 
## samtools sort -n TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_sn 
## samtools view -o TKO_GSK669_2_sn.sam TKO_GSK669_2_sn.bam 
## samtools sort TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_s 
## samtools index TKO_GSK669_2_s.bam

For each original accepted_hits.bam file, sorted-by-name SAM and BAM files (for htseq-count), as well as a sorted-by-chromosome-position BAM file (for IGV) are created.

Inspect alignments with IGV

An extensive explanation is described by clicking on this link : igv-integrative-genomics-viewer.

IGV can be downloaded here : http://www.broadinstitute.org/igv/

Briefly, start IGV Select the correct genome (Human ) Load BAM file and the GTF file Search for a gene of interest and zoom in.

References: http://master.bioconductor.org/help/course-materials/2013/CSAMA2013/

Enjoyed this article? I’d be very grateful if you’d help it spread by emailing it to a friend, or sharing it on Twitter, Facebook or Linked In.

Show me some love with the like buttons below... Thank you and please don't forget to share and comment below!!

Avez vous aimé cet article? Je vous serais très reconnaissant si vous aidiez à sa diffusion en l'envoyant par courriel à un ami ou en le partageant sur Twitter, Facebook ou Linked In.

Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et n'oubliez pas, s'il vous plaît, de partager et de commenter ci-dessous!

Recommended for You!

Machine Learning Essentials: Practical Guide in R

Practical Guide to Cluster Analysis in R

Practical Guide to Principal Component Methods in R

R Graphics Essentials for Great Data Visualization

Network Analysis and Visualization in R

More books on R and data science

Recommended for you

This section contains best data science and self-development resources to help you on your path.

Coursera - Online Courses and Specialization

Books - Data Science

Our Books

Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
Network Analysis and Visualization in R by A. Kassambara (Datanovia)
Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)

Others

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
Deep Learning with R by François Chollet & J.J. Allaire
Deep Learning with Python by François Chollet

Want to Learn More on R Programming and Data Science?

Follow us by Email On Social Networks:

Get involved :
Click to follow us on Facebook and Google+ :
Comment this article by clicking on "Discussion" button (top-right position of this page)