Read Mapping with bowtie2 Tutorial - Bioinformatics Team (BioITeam) at the University of Texas (2024)

The first step in nearly every next-gen sequence analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll explore these basic principles using bowtie2 on TACC.

The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed, accuracy, and configurability among read mappers that have remained popular.There are over50 read mapping programs listed here. Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.). As evidence of how things are settling down, we're going to just use bowtie2 in this course.

Tool	TACC	Version	Download	Manual	Example
Bowtie2	module load bowtie/2.2.6 You may recall we added this to our .bashrc file yesterday so it is already loaded	2.2.6	link	link	#Bowtie2

Transferring Data

We have already downloaded data files for this example and put them in the path:

$BI/gva_course/mapping/data

You may recognize this as the same files we used for the fastqc and fastx_toolkit tutorial. If you chose to improve the quality of R2 reads using fastx_toolkit as you did for R1 in the tutorial, you could use the improved reads in this tutorial to see what a difference the improved reads can make for read mapping.

File Name	Description	Sample
`SRR030257_1.fastq`	Paired-end Illumina, First of pair, FASTQ format	Re-sequenced E. coli genome
`SRR030257_2.fastq`	Paired-end Illumina, Second of pair, FASTQ format	Re-sequenced E. coli genome
`NC_012967.1.gbk`	Reference Genome in Genbank format	E. coli B strain REL606

The easiest way to run the tutorial is to copy this entire directory into a new folder called "BDIB_bowtie2_mapping" on your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from thelscommand.

tacc:~$ lsNC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq SRR030257_2.fastq.gz

Hint

Remember that to copy an entire folder requires the use of the recursive (-r) option.

Still stuck? click here for the correct code

cdscp -r $BI/gva_course/mapping/data BDIB_bowtie2_mappingcd BDIB_bowtie2_mappingls

Useful commands

Often you will have general questions about your sequencing files that you want to answer before or after starting your actual analysis. Here we show you some very handy commands after a warning:

Beware the cat command when working with NGS data

NGS data can be quite large, a single lane of an Illumina Hi-Seq run generates 2 files each with 100s of millions of lines. Printing all of that can take an enormous amount of time and may crash your terminal long before it finishes. If you find yourself in a seemingly endless scroll of sequence (or anything else for that matter) remember ctrl+c will kill whatever command you just executed

Reminder about Linux 1 liners

Below are several commands we've already been using, and some new ones put together to improve your skills.

How to look at the top/bottom of files to determine their type

head * # will show the top 10 lines to give you an idea of the file type/structuretail * # will show the last 10 lines to determine if the file is a repetitive structure

How to count the total number of lines in a file

Converting sequence file formats

Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky?

Thebp_seqconvert.plscript that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can run it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded. We also took care of this for you when we edited your~/.bashrcfile in the Linux introduction.

Run the script without any arguments to get the help message:

module load gccmodule load bioperlbp_seqconvert.pl

Exercises

The fileNC_012967.1.gbkis inGenbankformat. The filesSRR030257_*.fastqare inFASTQformat.

ConvertNC_012967.1.gbktoEMBLformat. Call the outputNC_012967.1.embl.
- Does EMBL format have sequence features (like genes) annotated?
  Click here for a hint
  Try reading through the program help when you run the bp_seqconvert.pl without any options to see the syntax required
  Sill need help?
  
  bp_seqconvert.pl --from genbank --to embl < NC_012967.1.gbk > NC_012967.1.emblhead -n 100 NC_012967.1.embl
  You might get an error or a warning like the following, even if the bp_seqconvert.pl script executed correctly so don't worry.
  Use of uninitialized value in substitution (s///) at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 777, <STDIN> line 164674.Use of uninitialized value in concatenation (.) or string at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 779, <STDIN> line 164674.
  From the head command, you should see that yes, EMBL files do maintain gene annotation features.
Convert only the first 10,000 lines ofSRR030257_1.fastqtoFASTAformat.
- What information was lost by this conversion?
  Click here if you need a hint
  Remember use the | character to have the output of head feed into the bp_seqconvert.pl script.
  Click here for the answer
  
  head -n 10000 SRR030257_1.fastq | bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fastahead SRR030257_1.fastqhead SRR030257_1.fasta
  The line of funny ASCII characters was lost. Remember, those are your "base quality scores". Many mappers will use the base quality scores to improve how the reads are aligned by not placing as much emphasis on poor bases.

Mapping with bowtie2

Bowtie2is a complete rewrite ofbowtie. It is currently the latest and greatest in the eyes of one very picky instructor (and his postdoc/gradstudent) in terms of configurability, sensitivity, and speed. After years of teaching bwa mapping along with bowtie2, we've decided that you will be the first class to use only bowtie2 since we never recommend anyone use bwa. For some more details about the differences between them see the bonus presentation, and if you find a compelling reason to use bwa rather than bowtie2, we'd love to hear from you.

Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapperwithinthe directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.

Commands for making a directory and changing into it

See Also

Bowtie2 manual | BioQueue Encyclopedia

mkdir bowtie2

Next, make sure thebowtie2module is loaded (we usemodule spiderto get the current name, which may not bebowtie/2.2.6if you re-run this tutorial):

Click here for a hint without the answer

Remember in our earlier tutorial we discussed the use of lonestar's module commands "spider" and "load" to install new functionality

click here for the answer without having to go back through the previous tutorial

module list bowtie# This should return the following 2 lines:# Currently Loaded Modules Matching: bowtie# 1) bowtie/2.2.6# If you do not see the above output, run these commands, and repeat this blockmodule spider bowtiemodule load bowtie/2.2.6

OPTIONAL -- How to check what version of bowtie2 was loaded?

Here are a few of the possibilities that will work.

In this case all of these methods will work, that may not be true of all programs

bowtie2 --versionmodule listwhich bowtie2

Note that which can be very useful for making sure you are running the executable that you think you are running, especially if you install your own programs. In particular make sure that the path matches up to what you expect. The most common situations arise from wanting to run a simplistically named script in your $HOME directory conflicting with something of the same name in the $BI directories or TACC modules.

Now convert the reference file from GenBank to FASTA using what you learned above. Name the new output fileNC_012967.1.fastaand put it in the same directory asNC_012967.1.gbk.

If you're stuck and want a hint without the answer...

Use the bp_seqconvert.pl script

Click here to get the answer or to check your command

bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta

Generally speaking, the first step in mapping is quite often indexing the reference file regardless of what mapping program is used. Put the output of this command into thebowtiedirectory we created a minute ago. The command you need is:

bowtie2-build

Try typing this alone in the terminal and figuring out what to do from the help show just from typing the command by itself.

If you're stuck click here for an explanation of what arguments the command does need

The command requires 2 arguments. The first argument is the reference FASTA. The second argument is the "base" file name to use for the created index files. It will create a bunch of files beginning bowtie/NC_012967.1*.

Click here to check your work, or for the answer if needed

bowtie2-build NC_012967.1.fasta bowtie2/NC_012967.1

Take a look at your output directory usingls bowtie2to see what new files have appeared. These files are binary files, so looking at them withheador tail isn't instructive and can cause issues with your terminal. If you insist on looking at them and your terminal begins behaving oddly, simply close it and log back into lonestar with a new terminal.

Why do so many different mapping programs create an index as a first step you may be wondering?

Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time.

Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence.

Finally, map the reads! The command you need is:

bowtie2

Try reading the help to figure out how to run the command yourself.Remember these are paired-end reads.

IMPORTANT

This command can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the showq -u command to make sure you are on an idev node.

Solution

bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam # the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands likeheadorgrepor more or using a viewer like IGV, which we will cover later.
SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus tutorial.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

head bowtie2/SRR030257.sam

What do you think the 4th and 8th columns mean(click for answer)?

If you thought the answer was the mapping coordinates of the read pairs you were right!

Multithreaded execution

We have actually massively under-utilized Lonestar in this example. We ran the command using only a single processor(a single "thread") rather than the 48 we have available. For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 48 processors for the bowtie process.

See if you can figure out how to re-run this using all 48 processors. Click here for a hint

You need to use the -p, for "processors" option. Since we had 48processors available to our job.

click here to check your answer

bowtie2 -t -p 48 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the log files.

How much faster was it using all 48 processors?

1 processor took ~5 minutes, 48 processors took ~ 1 minute. Can you think of any reasons why it was ~ 5x faster rather than 48x faster?

Answer

Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back?

If you want to launch many processes as part of one job, so that they are distributed one per node and use the maximum number of processors available, then you need to learn about the "wayness" of how you request nodes on Lonestar and possibly edit your slurm script in the future (more on this on Friday), or make better use of running commands in the background using the & symbol at the end of the command line.

One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed)by adding the--reorderflag to your command, but is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output.

Optional Exercises for your free time

In thebowtie2example, we mapped in--localmode. Try mapping in--end-to-endmode (aka global mode).
Do theBWA tutorial so you can compare their outputs.
- Didbowtie2orBWAmap more reads?
- In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
- Which aligner took less time to run?Are there any options you can change that:
  - Lead to a larger percentage of the reads being mapped? (increase sensitivity)
  - Speed up performance without causing many fewer reads to be mapped? (increase performance)

Next steps...

The next steps are often to view the output using a specific viewer on your local machine, or to begin identifying variant locations where the reads differ from the reference sequence. These will be the next things we cover in the course. Here is a link to help you return to the GVA 2016 course schedule.

Read Mapping with bowtie2 Tutorial - Bioinformatics Team (BioITeam) at the University of Texas (2024)

FAQs

What is read mapping in bioinformatics? ›

Read mapping is the process to align the reads on a reference genomes. A mapper takes as input a reference genome and a set of reads.

Read On ›

What is the algorithm of Bowtie2? ›

Bowtie2 utilizes Smith–Waterman algorithm. The tool allows two alignment modes: end-to-end and local. End-to-end approach searches for alignments involving whole-length sequences. The local mode supports clipping an alignment from one or both ends to maximize the alignment score.

Learn More Now ›

How does Bowtie2 align to genome? ›

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long (e.g. mammalian) genomes.

Explore More ›

What is the difference between Bowtie and Bowtie2? ›

Bowtie support only end-to-end alignments, while Bowtie2 supports both end-to-end and local alignment. Bowtie has an upper limit on read length of around 1,000 bp, while Bowtie2 does not have any. Bowtie2's paired-end alignment is more flexible that Bowtie's. Bowtie2 does not align colorspace reads.

Know More ›

What are the mapping techniques in bioinformatics? ›

A physical map provides the nucleotide numbers and the precise physical distance between genetic markers. Radiation hybrid mapping, sequence mapping, and cytogenetic mapping are the techniques used to produce a physical map. Physical mapping assembles larger DNA sections using DNA markers and DNA fragments.

Learn More ›

What is a read in Illumina sequencing? ›

In next-generation sequencing, a read refers to the DNA sequence from one fragment (a small section of DNA).

What is the Bowtie algorithm in bioinformatics? ›

Bowtie is promoted as "an ultrafast, memory-efficient short aligner for short DNA sequences." The speed increase of Bowtie is partly due to implementing the Burrows–Wheeler transform for aligning, which reduces the memory footprint (typically to around 2.2GB for the human genome); a similar method is used by the BWA ...

Read The Full Story ›

What are Bowtie2 index files? ›

Bowtie and Bowtie 2 are read aligners for sequencing reads. Bowtie specializes in short reads, generally about 50bp or shorter. Bowtie 2 specializes in longer reads, up to around hundreds of base pairs.

Read On ›

How much memory does Bowtie2 need? ›

Bowtie2 indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

Get More Info ›

What is the advantage of Bowtie2? ›

Bowtie2 is a fast and accurate alignment tool that supports gapped, local and paired-end alignment modes and works best for reads that are at least 50 bp (shorter read lengths should use Bowtie1).

See Details ›

What is the difference between star alignment and Bowtie2? ›

What is the fundamental difference between STAR and Bowtie(2). STAR for mapping spliced (i.e. with introns) short RNA-seq reads against a genome. Bowtie2 for mapping short reads without splicing.