Mapping algorithms and softwares


This analysis was performed using R (ver. 3.1.0).

FASTQ file

FASTQ file, is the result of a next-generation sequencing experiment. Every four lines, indicates a read. The first line gives the name of the read and some other information, including its length. The second line is the base pairs of the read. The fourth line contains information about the quality of each base pair. The third line links all the qualities to the first line.

FASTQ Format

A FASTQ file normally uses four lines per sequence.

  • Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

source : wiki

Mapping Algorithms and Softwares

Mapping software is also called alignment software. One of the most popular short read aligners is Bowtie. Help can be found on Bowtie website or from linux command: bowtie –help.

Download sequencing data and extract the FASQ files

Download RNA sequencing reads from Short Read Archive (SRA) :

GSE52166, SRP032775, http://www.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1177756.

wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1177756/SRR1177756.sra

We use the software fastq-dump to extract the FASTQ files from the .sra file :

fastq-dump --split-file-3 SRR1177756.sra
# view generated files with size
ls -lh *.fastq
The option --split-file-3 is used for paired-end sequencing.
Two FastQ files are generated (SRR1177756_1.fastq, SRR1177756_2.fastq), because data is a paired-end sequencing. Each read is paired with the same name in each file. 

Align RNA sequencing data using Tophat

TopHat software is used to align RNA sequencing data. It takes care of the fact that the reads might span an intron. The TopHat software actually uses Bowtie internally to do part of the mapping.

To use TopHat, you need a Bowtie index, which is a precompiled version of the genome, which is efficiently built so that you can map these short reads against it. Bowtie index can be downloaded for different species on TopHat website.

Download reference genome from tophat website:

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/Ensembl/GRCh37/Homo_sapiens_Ensembl_GRCh37.tar.gz
tar zxvf Homo_sapiens_Ensembl_GRCh37.tar.gz

Running TopHat:

Tophat needs the path to reference genome. This path might be different on your computer.

tophat2 -o SRR117756_out -p 10 genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index SRR1177756_1.fastq  SRR1177756_2.fastq

The 2 files indicate left and right paired-end reads.

In the call to tophat2, the option -o specifies the output directory, -p specifies the number of threads to use (this may affect run times and can vary depending on the resources available).

Use bioinformatic cluster to align reads

A bioinformatic cluster is needed to process huge data generated by next generation sequencing. I use genotool plateform to run my alignement experiment. The aim of the following section is to show, how to submit a job to genotoul. You can skip this section

Submit job to genotoul

My working directory hierarchy:

  • work
  • akassambara
    • seq
    • genomes : contains reference genome
    • SRR117756_out
    • SRR1177756_1.fastq
    • SRR1177756_2.fastq
    • SRR1177756.sra
    • tophat_SRR1177756.sh

Preparing my script (tophat_SRR1177756.sh) to run on genotool:

#!/bin/bash
export BOWTIE2_INDEXES=/work/akassambara/seq/genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index
tophat2 -o SRR117756_out -p 10 genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome SRR1177756_1.fastq SRR1177756_2.fastq

Submit the job

qsub -q workq -l h_vmem=100G -l mem=100G tophat_SRR1177756.sh 

Tophat result files

TopHat generates a list of files which are located in SRR117756_out folder.

akassambara@genotoul ~/work/seq/SRR117756_out $ ls -lh
total 4,1G
-rw-r--r-- 1 akassambara U1040 2,2G 30 juil. 05:12 accepted_hits.bam
-rw-r--r-- 1 akassambara U1040 9,5M 30 juil. 04:57 deletions.bed
-rw-r--r-- 1 akassambara U1040 7,4M 30 juil. 04:57 insertions.bed
-rw-r--r-- 1 akassambara U1040  11M 30 juil. 04:57 junctions.bed
drwxr-xr-x 2 akassambara U1040  16K 30 juil. 05:12 logs
-rw-r--r-- 1 akassambara U1040  188 29 juil. 16:59 prep_reads.info
-rw-r--r-- 1 akassambara U1040 1,9G 30 juil. 05:24 unmapped.bam

accepted_hits.bam file

This is a compressed file. It contains the reads which could be successfully mapped to the genome. You can use the samtools view function to read this compressed file.

The following line code can be used to view the first 1000 lines. Use “return” to view more and type q to quit.

#Look at the first 1000 lines, and then use less.
#You have to type q to quit less command
samtools view accepted_hits.bam | head -1000 | less
SRR1177756.30816791     393     1       12006   0       101M    *       0       0       GCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGAT   BBBBF

This gives each read with the name of the read on the left, and where it was possible to align the read to the genome. For example, the read SRR1177756.30816791 was aligned to chromosome 1 at position 12,006. We could see that this whole read aligned (101 matches).

Down below, we see that there are reads where the whole read didn't align. If we consider that we're looking at RNA sequencing, and that some of these reads come from RNA pieces, which span junction in the genome, it makes sense that there would be parts of these which wouldn't align to the genome. So instead of a match, we get an N, which indicates there was a gap: 10M385N91M = 10 matches then 385 nucleotide gaps then 91 matches. 

The following SAMtools calls can be used to process the BAM files.

samtools sort -n accepted_hits.bam accepted_hits_name_sorted
samtools sort -n unmapped.bam unmapped_name_sorted
samtools merge -n all_reads.bam accepted_hits_name_sorted unmapped_name_sorted

The -n option indicates that I want to sort these reads, not by genomic location but, by name of the read. Sorted reads are put in the accepted_hits_name_sorted file.

Sorted reads: mapping-sorted-reads

After sorting, you can see something interesting, which is that the fragment number 4 only has one read in this file. And also you can see that fragment number 1, and 6, and 7 are also not in this file. If we want to find those reads, we can look into the unmapped.bam.

If we want a single file with all of the alignments in it, we can use the samtools merge command. 

Licence

Licence


Enjoyed this article? I’d be very grateful if you’d help it spread by emailing it to a friend, or sharing it on Twitter, Facebook or Linked In.

Show me some love with the like buttons below... Thank you and please don't forget to share and comment below!!
Avez vous aimé cet article? Je vous serais très reconnaissant si vous aidiez à sa diffusion en l'envoyant par courriel à un ami ou en le partageant sur Twitter, Facebook ou Linked In.

Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et n'oubliez pas, s'il vous plaît, de partager et de commenter ci-dessous!





This page has been seen 10487 times