Tophat2 : Download, build reference genome and align the reads to the reference genome
This analysis was performed using R (ver. 3.1.0).
Objectives
The aim of this article is to show you:
- How to download and build reference genome?
- How to align the reads using Tophat2?
- How to inspect BAM file using IGV?
Download data
The data from Popovic et al., (GEO accession number: GSE57478) where used in the following example. The SRA files are available here : http://www.ncbi.nlm.nih.gov/sra?term=SRP032510.
Download the SRA files and convert to FASTQ files. This is described here.
Download the reference genome
- Create a directory called “genome/” to save your reference genome.
Note that bowtie2/tophat2 indices for many commonly used reference genomes can be downloaded directly from http://tophat.cbcb.umd.edu/igenomes.html.
Reference genome index (from FASTA file) for bowtie2/tophat2, can be build by following the explanation down below.
User have to download the reference genome sequence for the organism under study in (compressed) FASTA format. This can be done from Ensembl and UCSC databases among many others.
Ensembl FTP server : http://www.ensembl.org/info/data/ftp/index.html UCSC FTP server : ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/
For Ensembl, choose the FASTA (DNA) link instead of FASTA (cDNA), since alignments to the genome, not the transcriptome, are desired.
Download reference genome from Ensembl
In this article, homo sapiens reference genome from Ensembl database is used. For homo sapiens the file labeled toplevel combines all chromosomes. Download and uncompress the reference genome, using the following UNIX commands :
wget ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Download reference genome from UCSC
- Click here to download the data corresponding to your organism of interest
- Click on “Homo sapiens” -> bigZips -> download the “chromFa.tar.gz”
UNIX command :
wget ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/chromFa.tar.gz gunzip chromFa.tar.gz
Get gene model annotations
Download a GTF file with gene models for the organism of interest.
Homo sapiens gene model annotation can be downloaded as follow:
wget ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz gunzip Homo_sapiens.GRCh38.77.gtf.gz
Always download the FASTA reference sequence and the GTF annotation data from the same resource provider.
Build the reference index
Before reads can be aligned, the reference FASTA files need to be preprocessed into an index that allows the aligner easy access.
To build a bowtie2-specific index from the FASTA file use the command:
bowtie2-build -f Homo_sapiens.GRCh38.dna.toplevel.fa. Homo_sapiens_GRCh38
A set of BT2 files will be produced, with names starting with Homo_sapiens_GRCh38 as specified.
This procedure needs to be run only once for each reference genome used. As mentioned, pre-built indices for many commonly used genomes are available from http://tophat.cbcb.umd.edu/igenomes.html.
Workflow
Assess sequence quality control with ShortRead
From R, set your working directory (using `setwd’ command) to the directory where FASTQ file are situated.
Run the folllowing R code :
library("ShortRead")
QC = qa(dirPath=".", pattern=".fastq$", type="fastq")
report(QC, type="html", dest="fastqQAreport")
The output is an html files located in the “fastqQAreport” directory.
- Use a web browser to inspect the generated HTML file.
Prepare your sample metadata informations
# Initial SRA info table
sri<-read.csv("SraRunInfo.csv", stringsAsFactors=FALSE)
# Prepare sample information table
#Add important descriptive columns
samples=as.data.frame(
list(
name=c('KMS11_GSK343_1', 'KMS11_GSK343_2','TKO_GSK343_1', 'TKO_GSK343_2',
'KMS11_GSK669_1', 'KMS11_GSK669_2', 'TKO_GSK669_1', 'TKO_GSK669_2'),
fastq=paste(sri$Run, ".fastq", sep=""),
avgLength=sri$avgLength,
data_type=sri$LibraryStrategy,
layout = sri$LibraryLayout,
geoid=sri$SampleName,
cell_line=c(1, 1, 2, 2, 1,1,2,2),
treatment=c("ctr", "ctr", "ctr", "ctr", "ezh2i", "ezh2i", "ezh2i", "ezh2i")
)
)
rownames(samples)=samples$name
samples[sort(rownames(samples)), ]
## name fastq avgLength data_type layout geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq 51 RNA-Seq SINGLE GSM1383539 1 ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq 51 RNA-Seq SINGLE GSM1383540 1 ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq 51 RNA-Seq SINGLE GSM1383543 1 ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq 51 RNA-Seq SINGLE GSM1383544 1 ezh2i
## TKO_GSK343_1 TKO_GSK343_1 SRR1282058.fastq 51 RNA-Seq SINGLE GSM1383541 2 ctr
## TKO_GSK343_2 TKO_GSK343_2 SRR1282059.fastq 51 RNA-Seq SINGLE GSM1383542 2 ctr
## TKO_GSK669_1 TKO_GSK669_1 SRR1282062.fastq 51 RNA-Seq SINGLE GSM1383545 2 ezh2i
## TKO_GSK669_2 TKO_GSK669_2 SRR1282063.fastq 51 RNA-Seq SINGLE GSM1383546 2 ezh2i
head(samples)
## name fastq avgLength data_type layout geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq 51 RNA-Seq SINGLE GSM1383539 1 ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq 51 RNA-Seq SINGLE GSM1383540 1 ctr
## TKO_GSK343_1 TKO_GSK343_1 SRR1282058.fastq 51 RNA-Seq SINGLE GSM1383541 2 ctr
## TKO_GSK343_2 TKO_GSK343_2 SRR1282059.fastq 51 RNA-Seq SINGLE GSM1383542 2 ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq 51 RNA-Seq SINGLE GSM1383543 1 ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq 51 RNA-Seq SINGLE GSM1383544 1 ezh2i
Since the downstream statistical analysis of differential expression relies on this table, carefully inspect (and correct, if necessary) the metadata table.
Align the reads to reference genome using tophat2
Use R to create the list of shell commands :
genome="genome/Homo_sapiens_GRCh38"
cmd=with(samples, paste("tophat2 -o ", name, " -p 10 ",genome, " ", fastq, sep="" ))
for(c in cmd) system(c)
## tophat2 -o KMS11_GSK343_1 -p 10 genome/Homo_sapiens_GRCh38 SRR1282056.fastq
## tophat2 -o KMS11_GSK343_2 -p 10 genome/Homo_sapiens_GRCh38 SRR1282057.fastq
## tophat2 -o TKO_GSK343_1 -p 10 genome/Homo_sapiens_GRCh38 SRR1282058.fastq
## tophat2 -o TKO_GSK343_2 -p 10 genome/Homo_sapiens_GRCh38 SRR1282059.fastq
## tophat2 -o KMS11_GSK669_1 -p 10 genome/Homo_sapiens_GRCh38 SRR1282060.fastq
## tophat2 -o KMS11_GSK669_2 -p 10 genome/Homo_sapiens_GRCh38 SRR1282061.fastq
## tophat2 -o TKO_GSK669_1 -p 10 genome/Homo_sapiens_GRCh38 SRR1282062.fastq
## tophat2 -o TKO_GSK669_2 -p 10 genome/Homo_sapiens_GRCh38 SRR1282063.fastq
In the call to tophat2, the option -o specifies the output directory, -p specifies the number of threads to use (affect run times, vary depending on the resources available). The first argument is the name of the index. The second argument is a list of all FASTQ files.
Other parameters can be specified to tophat2 :
tophat2 -G Homo_sapiens.GRCh38.77.gtf -o output_dir -p 10 –-no-coverage-search genome/Homo_sapiens_GRCh38 file.fastq
- The option -G points tophat2 to a GTF file of annotation to facilitate mapping reads across exon-exon junctions (some of witch can be found de novo).
- The coverage-search algorithm was turned off because it took too long.
Note that the FASTQ files are concatenated with commas, without spaces. For experiments with paired-end reads, pairs of FASTQ files are given as separate arguments and the order in both arguments must match.
- paired-end sequencing
tophat2 -o tophat_output_dir -p 8 –no-coverage-search /path/to/genome/Bowtie2Index/genome file_1.fastq file_2.fastq samtools sort -n file_tophat_out/accepted_hits.bam _sorted
- Single-end sequencing tophat2 -o tophat_output_dir -p 8 /path/to/genome/Bowtie2Index/genome file.fastq samtools sort -n file_tophat_out/accepted_hits.bam _sorted
Run the command
The commands can be executed by copy and paste in UNIX terminal. system
function can be used to execute these command direct from R. The list of the commands can also be stored in a text file and use UNIX source command.
Sort and index the BAM files and create SAM files
BAM files are organized into a single directory. SAM files is required when you want to count reads with htseq-count.
for(i in 1:nrow(samples)){
lib=samples$name[i]
bamFile=file.path(lib, "accepted_hits.bam")
#sort by name and convert to SAM for htseq-count counting
system(paste0("samtools sort -n ",bamFile," ",lib,"_sn")) #sort
system(paste0("samtools view -o ",lib,"_sn.sam ",lib,"_sn.bam")) #convert to sam
# sort by position and index for IGV
system(paste0("samtools sort ",bamFile," ",lib,"_s"))
system(paste0("samtools index ",lib,"_s.bam"))
}
The following code is executed :
## samtools sort -n KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_sn
## samtools view -o KMS11_GSK343_1_sn.sam KMS11_GSK343_1_sn.bam
## samtools sort KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_s
## samtools index KMS11_GSK343_1_s.bam
##
## samtools sort -n KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_sn
## samtools view -o KMS11_GSK343_2_sn.sam KMS11_GSK343_2_sn.bam
## samtools sort KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_s
## samtools index KMS11_GSK343_2_s.bam
##
## samtools sort -n TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_sn
## samtools view -o TKO_GSK343_1_sn.sam TKO_GSK343_1_sn.bam
## samtools sort TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_s
## samtools index TKO_GSK343_1_s.bam
##
## samtools sort -n TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_sn
## samtools view -o TKO_GSK343_2_sn.sam TKO_GSK343_2_sn.bam
## samtools sort TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_s
## samtools index TKO_GSK343_2_s.bam
##
## samtools sort -n KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_sn
## samtools view -o KMS11_GSK669_1_sn.sam KMS11_GSK669_1_sn.bam
## samtools sort KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_s
## samtools index KMS11_GSK669_1_s.bam
##
## samtools sort -n KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_sn
## samtools view -o KMS11_GSK669_2_sn.sam KMS11_GSK669_2_sn.bam
## samtools sort KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_s
## samtools index KMS11_GSK669_2_s.bam
##
## samtools sort -n TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_sn
## samtools view -o TKO_GSK669_1_sn.sam TKO_GSK669_1_sn.bam
## samtools sort TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_s
## samtools index TKO_GSK669_1_s.bam
##
## samtools sort -n TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_sn
## samtools view -o TKO_GSK669_2_sn.sam TKO_GSK669_2_sn.bam
## samtools sort TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_s
## samtools index TKO_GSK669_2_s.bam
For each original accepted_hits.bam file, sorted-by-name SAM and BAM files (for htseq-count), as well as a sorted-by-chromosome-position BAM file (for IGV) are created.
Inspect alignments with IGV
An extensive explanation is described by clicking on this link : igv-integrative-genomics-viewer.
IGV can be downloaded here : http://www.broadinstitute.org/igv/
Briefly, start IGV Select the correct genome (Human ) Load BAM file and the GTF file Search for a gene of interest and zoom in.
References: http://master.bioconductor.org/help/course-materials/2013/CSAMA2013/
Show me some love with the like buttons below... Thank you and please don't forget to share and comment below!!
Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et n'oubliez pas, s'il vous plaît, de partager et de commenter ci-dessous!
Recommended for You!
Recommended for you
This section contains best data science and self-development resources to help you on your path.
Coursera - Online Courses and Specialization
Data science
- Course: Machine Learning: Master the Fundamentals by Standford
- Specialization: Data Science by Johns Hopkins University
- Specialization: Python for Everybody by University of Michigan
- Courses: Build Skills for a Top Job in any Industry by Coursera
- Specialization: Master Machine Learning Fundamentals by University of Washington
- Specialization: Statistics with R by Duke University
- Specialization: Software Development in R by Johns Hopkins University
- Specialization: Genomic Data Science by Johns Hopkins University
Popular Courses Launched in 2020
- Google IT Automation with Python by Google
- AI for Medicine by deeplearning.ai
- Epidemiology in Public Health Practice by Johns Hopkins University
- AWS Fundamentals by Amazon Web Services
Trending Courses
- The Science of Well-Being by Yale University
- Google IT Support Professional by Google
- Python for Everybody by University of Michigan
- IBM Data Science Professional Certificate by IBM
- Business Foundations by University of Pennsylvania
- Introduction to Psychology by Yale University
- Excel Skills for Business by Macquarie University
- Psychological First Aid by Johns Hopkins University
- Graphic Design by Cal Arts
Books - Data Science
Our Books
- Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
- Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
- Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
- R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
- GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
- Network Analysis and Visualization in R by A. Kassambara (Datanovia)
- Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
- Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)
Others
- R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
- Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
- Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
- An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
- Deep Learning with R by François Chollet & J.J. Allaire
- Deep Learning with Python by François Chollet