Code Repository 2014

Code Repository for BMMB 852, Fall 2014

This page is a collection of most of the examples shown during lectures.

Lecture 1 - Basic Command Line

# Lines that start with the # sign are comments that won't be executed!

# What folder are we in
pwd

# Make a directory
mkdir work

# See the directory
ls

# See more details
ls -l

# Change to the work directory
cd work

# Where am I
pwd

# Let's go back one level
cd ..

# Where am I
pwd

# Remove the directory
rmdir work

# Create a file, then list it.
touch abc.txt
ls

# Rename (move the file)
mv abc.txt xyz.txt

# How does the mv command work?
man mv

# Learn about cp, ls, mv, rm, mkdir, rmdir
# via the manual pages
man ls

Lecture 2 - Simple Data Manipulations

# Lines that start with the # sign are comments that won't be executed!

# Get the data from SGD.
# Download the URL with curl and store it in the sc.gff file.
curl http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff -o sc.gff

#
# Or if the site seems slow get the data from the course site. But then ask yourself
# isn't it odd when a class website is faster than the official data repository? Way to go SGD!
#
# curl http://www.personal.psu.edu/iua1/courses/files/2014/data/saccharomyces_cerevisiae.gff -o sc.gff

# How many lines does the file have.
wc -l sc.gff

# Page through the file with more (or less)
# SPACE and f moves forward, b moves backward, q quits, h help, / allows you to search
more sc.gff

# Let's look at the head of the file.
head sc.gff

# And the end of the file.
tail sc.gff

# Find simple patterns in the file
grep YAL060W sc.gff

# We can place the results of the matching into a new file.
grep YAL060W sc.gff > match.gff

# Look at your files.
ls

# Paginate the match file
head match.gff

# Piping (chaining) commands together.
# How many lines match the word gene?
grep gene sc.gff | wc -l

# How many lines match both the word gene and the word chrVI (chromosome six)?
grep gene sc.gff | grep chrVI | wc -l

# How many lines of the above DO NOT match the word Dubious?
grep gene sc.gff | grep chrVI | grep -v Dubious | wc -l

# This file is a bit odd. It contains two sections, a tabular
# section with 9 tab delimited columns and a sequence section
# that lists the entire genome.

# Break up the file into two parts. Put all lines that occur before
# the genome  information into features.gff.
# Our spies from behind enemy lines have reported that the word FASTA
# in the file indicates that the genome information is about to start
cat -n sc.gff | head

# Find the position of the word FASTA in the file. The -n flag adds line numbers.
cat -n sc.gff | grep FASTA

# Keep only the lines up to that point and redirect them into a new file that
# contains only the features. Also remove the comment lines from the file.
head -22994 sc.gff | grep -v '#' > features.gff

# We did all this work to have simple tabular formatted data that only
# contains columnar information. So how many features are there?
wc -l features.gff

# Take the first three columns.
cut -f 1,2,3 features.gff | head

# Find unique words in column 3 of GFF.
# The uniq command collapses consecutive identical words into one.
cut -f 3 features.gff | sort | uniq | head

# It can also count how many identical consecutive words there are.
cut -f 3 features.gff | sort | uniq -c | head
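
# To see which feature types are most common, sort the counts numerically;
# the -rn flags sort in reverse numerical order.
cut -f 3 features.gff | sort | uniq -c | sort -rn | head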

Lecture 3 - Installing and Using Entrez Direct

# Installing and running the Entrez Direct suite

# Go to your home directory.
cd

# Equivalent to using ~/ that represents your home directory.
cd ~/

# Create an src directory that will contain the
# programs we will install during this lecture.
mkdir ~/src

# go to the src directory
cd ~/src

# Get the entrez direct toolkit, the capital -O flag instructs curl
# to figure out the filename from the url.
curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip -O

# The above is equivalent to:
curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip -o edirect.zip

# Unzip the tools.
unzip edirect.zip

# Investigate the new tool.
cd edirect
ls

# To make your system aware of where your new tools are
# you need to add their location to your environment.
# There are different ways to do this. To add all the tools
# at once, append the directory to the PATH variable.
export PATH=$PATH:~/src/edirect

# Create the work folder for the current lecture.
mkdir -p ~/edu/lec3
cd  ~/edu/lec3
pwd

# Run einfo
einfo -help

# Fetch the descriptions, then look at them.
einfo -dbs > einfo-dbs.txt
more einfo-dbs.txt

einfo -db sra > einfo-sra.txt
more einfo-sra.txt

# Run an esearch.
esearch -help
esearch -db nucleotide -query PRJNA257197
esearch -db sra -query PRJNA257197
esearch -db protein -query PRJNA257197

# Fetch the data for nucleotides.
esearch -db nucleotide -query PRJNA257197 | efetch -format fasta > ebola.fasta

# How many sequences are in the file?
cat ebola.fasta | grep ">" | wc -l

# Get the data in genbank format.
esearch -db nucleotide -query PRJNA257197 | efetch -format gb > ebola.gb

Lecture 4 - Installing and Using the SRA toolkit

# Go to your source directory.
cd ~/src

# Download the SRA toolkit (make sure to put the link that is relevant to your platform)
#
# Mac OSX.
#
curl -O http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.3.5-2/sratoolkit.2.3.5-2-mac64.tar.gz

# Linux.
curl -O http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.3.5-2/sratoolkit.2.3.5-2-ubuntu64.tar.gz

# Unpack the archive.
tar xzvf sratoolkit.2.3.5-2-mac64.tar.gz

# Switch to the directory and look at the files there
cd sratoolkit.2.3.5-2-mac64

# Programs are collected in the bin directory.
cd bin

# See what is there.
ls

# Let's add these paths permanently to our PATH.

# There is a special file that is read when your shell starts. It applies
# the settings to each terminal that you start. The file is called ~/.profile
# on a Mac and ~/.bashrc on linux. The >> symbol appends to a file rather than
# overwriting it. You may also edit this file with a text editor.

# On the Mac:
echo 'export PATH=$PATH:~/src/edirect:~/src/sratoolkit.2.3.5-2-mac64/bin' >> ~/.profile

# On Linux (note the different sratoolkit directory name):
echo 'export PATH=$PATH:~/src/edirect:~/src/sratoolkit.2.3.5-2-ubuntu64/bin' >> ~/.bashrc

# You need to open a new terminal to apply the settings or do
# (on Linux source ~/.bashrc instead):
source ~/.profile

# Go to the lecture folder. What does the -p flag do? Look at the manual page.
mkdir -p ~/edu/lec4
cd ~/edu/lec4

# Where is the prefetch command?
which prefetch

# The prefetch command downloads files from the remote site.
# Look at the help for the command.
prefetch -h

# Now let's obtain the file.
prefetch SRR1553610

# Where did the file go? There is a default folder in your home directory.
ls ~/ncbi

# What files have been added there? Use the find tool.
find ~/ncbi

# We unpack the file with the fastq-dump program.
fastq-dump -h

# Unpack the file
fastq-dump --split-files SRR1553610

# The raw data files in FASTQ format are in the current folder.
ls

# Pattern matching in the shell. The * (star) matches anything.
wc -l *.fastq

# look at the file.
head SRR1553610_1.fastq

# How many reads are there in all the files?
cat *.fastq | grep @SRR | wc -l

# How to download multiple files? Create a file that contains SRR runs.
echo SRR1553608 > sra.ids
echo SRR1553605 >> sra.ids

# Use this file to prefetch the runs.
prefetch --option-file sra.ids

# Unpack all the downloaded files. Note how this is not quite right since
# there may be more prefetched files than what we had in sra.ids
fastq-dump --split-files ~/ncbi/public/sra/SRR15536*

# The right solution is to unpack only the files listed in sra.ids. Alas
# that is not directly supported by fastq-dump. Grr!

# The right solution will take a little more string processing at the command line.
# This will use the sed (streamline editor) tool to replace the patterns SRR with
# the command to extract the file.
cat sra.ids | sed 's/SRR/fastq-dump --split-files SRR/'

# Now pass the output of it into a bash process
cat sra.ids | sed 's/SRR/fastq-dump --split-files SRR/' | bash

# We can still do better. Why copy all those SRR ids by hand? We could get them all
# from the command line.
# There seems to be a bug in the efetch program. It produces the data but
# then also an error at the end! Why? Probably a bug! Grr! x 2
esearch -db sra -query PRJNA257197  | efetch -format runinfo

# The command to put the results into a file.
esearch -db sra -query PRJNA257197  | efetch -format runinfo > runinfo.txt

# Also, since it is a comma-separated file, we need to specify the
# delimiter (column separator) to the cut program.
cat runinfo.txt | cut -f 1 -d ","

# The runs seem to be all there. It does list 195 runs as reported in the search.
# Conundrum: do we trust this or not. We could get the file from the BioProject site
# as well by selecting Send -> RunInfo but you have to be in the right place for it
# to work.
cat runinfo.txt | cut -f 1 -d ',' | grep SRR | wc -l

# We'll trust it but you may have a nagging feeling that
# you might have made a mistake. Welcome to the club!
cat runinfo.txt | cut -f 1 -d ',' | grep SRR > sra.ids

# Warning! This will now download a lot of files (195 to be precise).
# Depending on the connection speed this may take hours.
prefetch --option-file sra.ids

# Check the total size of the download (about 15Gb)
du -hs ~/ncbi

# Let's convert them all
cat sra.ids | sed 's/SRR/fastq-dump --split-files SRR/' | bash

# How many lines in total in all the files?
wc -l *.fastq

# And with that we got all the sequencing data for the entire PRJNA257197 project .

Lecture 5 - Writing reusable scripts. Fetching subsequences.

# Get a sequence by locus.
efetch -db nucleotide -id KM233090 -format fasta > KM233090.fa

# The same sequence is also available by its GI number.
efetch -db nucleotide -id 667853062 -format fasta > 667853062.fa

# One can fetch multiple ids at the same time
efetch -db nucleotide -id KM233090,KM233066,KM233113.1 -format fasta > multi.fa

# How many lines in the file
wc -l multi.fa

# Count the header lines.
cat multi.fa | grep ">" | wc -l

# See the header lines.
cat multi.fa | grep ">"

# We can also fetch a subset from the sequences.
efetch -db nucleotide -id KM233090,KM233066,KM233113.1 -format fasta -seq_start 1 -seq_stop 10 > multi.fa

# Let's get the start of all ebola genomic builds.
esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta -seq_start 1 -seq_stop 10 > starts.fa

# This is too much to type each time. Time to solve it
# at a higher level and start writing shell scripts.
# A shell script is a collection of commands that we can
# execute repeatedly.
# Put the line below into a file; the usual extension is .sh. I'll call it get_seq.sh
# esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta -seq_start 1 -seq_stop 10

# You can now run this entire command from a single script.
bash get_seq.sh > starts.fa

# What would be nice is to be able to specify some parts of this script from the outside.
# The two dollar-sign variables ($1 and $2) will be substituted with the parameters given
# on the command line. See the scripts included below.
bash get_seq.sh 1 10 > starts.fa

# Now extract just the sequences from this file.
# the -A 1 flag prints the match and the one line after the match.
# The -v flag prints lines that do not match the pattern.
# Never type a long pipeline in one go. Type one command, check the results, add the next one and so on.
cat starts.fa | grep ">" -A 1 | grep -v ">" | grep -v "-"

The first version of our script:

#!/bin/sh
#
# Fetches nucleotide subsequences from all sequences
# deposited in the PRJNA257197 project.
#
# Expects two parameters start and stop.

# This setting makes the script stop on errors and on undefined variables.
set -ue

# Access NCBI via Entrez Direct.
esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta -seq_start $1 -seq_stop $2

A more advanced version of our script. As you add functions it becomes easier to build different commands:

#
# Fetches nucleotide subsequences from all sequences
# deposited in the PRJNA257197 project.
#
# Expects two parameters start and stop.

# This setting makes the script stop on errors and on undefined variables.
set -ue

function esearch_ebola() {
    # Search the nucleotide sequences for ebola.
    esearch -db nucleotide -query PRJNA257197
}

function efetch_limits() {
    # You can make it more explicit where variables get assigned.
    # For more complex command lines this may help.
    START=$1
    END=$2
	# Run the efetch.
    efetch -db nucleotide -format fasta -seq_start $START -seq_stop $END
}

# Chain the two commands to fetch.
# Note how parameters are passed to functions.
esearch_ebola | efetch_limits $1 $2

Lecture 6 - Encodings and the FASTQ format.

# This command executes the string following the -c using python.
python -c 'print "Hello World"'

# The ordinal (integer code) of a character.
python -c 'print ord("A")'

# The character (letter code) of an integer.
python -c "print chr(65)"

# Find the quality value of a character in the Sanger (+33) encoding.
python -c 'print ord("I") - 33'

# Find the quality value of a character in the Illumina 1.5 (+64) encoding.
python -c 'print ord("I") - 64'

# Find the quality character of the value 40 in the Sanger (+33) encoding.
python -c 'print chr(40 + 33)'

# Find the quality character of the value 40 in the Illumina 1.5 (+64) encoding.
python -c 'print chr(40 + 64)'
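
# The same idea works on a whole quality string. The string below is just a
# made-up example; substitute one copied from your own FASTQ file.
python -c 'print [ord(c) - 33 for c in "IIIIHHGGFFEE"]'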

# Get a test dataset from the course website.
curl -O http://localhost:8080/courses/files/2014/illumina-data.fq.gz

# You can unzip this with gunzip illumina-data.fq.gz or keep it compressed
# and read it with gzcat (Mac OSX) or zcat (Linux). When you have
# lots of datasets it may be worth saving the space, but reading the files will be
# a little slower.

time gzcat illumina-data.fq.gz > tmp

# Unzip the data while keeping the compressed version.
# Without the -c flag you would not need to redirect the output and
# it would unzip in place.
gunzip -c illumina-data.fq.gz > illumina-data.fq
time cat illumina-data.fq > tmp

# Show one FASTQ record (Mac OSX)
gzcat illumina-data.fq.gz | head -4

# Show one FASTQ record (Linux)
zcat illumina-data.fq.gz | head -4

# Unpack an SRA file into the raw data produced by the instrument.
# This generates two files.
fastq-dump --split-files SRR1553605

# Show a single fastq record.
cat SRR1553605_1.fastq | head -4

# "Quick and dirty" quality checks, are there
# stretches of ### characters in the first 10,000 lines?

gzcat illumina-data.fq.gz | head -10000 | grep '###' |  wc -l
cat SRR1553605_1.fastq | head -10000 | grep '###' | wc -l

# Which one looks better?
# It is always a bit risky to look at first/last lines since
# the order of data may not be random. Typically we want to
# get a random sample (we'll see how to do this later)

# We can look for and color patterns in the data.
# Are there start codons? Is the pattern TATA present?
gzcat illumina-data.fq.gz | head -10000 | grep 'ATG' --color=always | head

gzcat illumina-data.fq.gz | head -10000 | grep 'TATA' --color=always | head

# Are there sequences that look like the Illumina adapter: GATCGGA?
gzcat illumina-data.fq.gz | head -10000 | grep 'GATCGGA' --color=always | head

Lecture 7 - Quality control with FastQC. Interpreting the outputs.

# Compressing and decompressing files
# Get an SRA file that we will use later in this exercise.

# Use prefetch if you want to keep the SRA files.
# Otherwise you can use fastq-dump directly.
# In class I have added the -X 10000 flag to limit the number of spots so that we don't have
# to wait that long.
fastq-dump --split-files SRR519926

# See what you got.
ls -l

# Compress the files. You can also time any process by adding the word time
# in front of the command. This is a good command to test overall
# computer performance as it requires both fast disk access and high performance CPU.
# Here are some timings
#
# - Grit: desktop iMac: 29 seconds (4 core medium speed CPU, 64GB  RAM, medium speed disk)
# - Apollo: high performance linux server at the BCC: 38 seconds (32 cores fast CPUs, 256GB RAM, but it is also running lots of other commands)
# - Peppy: Chromebook with virtualized Linux via Crouton: 49 seconds (slow 2 core CPU but very fast solid state disk)
# - Airy: MacBook Air: 53 seconds (the CPU seems to be slower than that in the ChromeBook)

# What is your timing?
#
time gzip *.fastq

# See how gzip above replaced each file with a compressed one.

# Test what the unzipping times are.
#
# - Airy: 2.1 seconds
# - Grit: 2.8 seconds
# - Apollo: 3.6 seconds
# - Peppy: 6.5 seconds
#
time gunzip *.gz

# Package both files into a single archive
tar cf mydata.tar *.fastq
gzip mydata.tar

# The two commands can be combined into one
tar cfzv mydata.tar.gz *.fastq

# Remove the fastq files
rm -f *.fastq
ls -l

# Unpacking the archive extracts the files.
tar xzvf mydata.tar.gz

# You can archive entire directory trees. Pack up the entire lecture 7 folder.
cd ..
tar cfzv lec7.tar.gz lec7

# List what is inside of the archive with the t action.
tar tf lec7.tar.gz

# Uncompress the archive at a different location.
mkdir ~/edu/tmp
mv lec7.tar.gz ~/edu/tmp
cd ~/edu/tmp
tar xzvf lec7.tar.gz

# Here comes a tarbomb. Handle with care!
# http://www.xkcd.com/1168/
curl -O http://www.personal.psu.edu/iua1/courses/files/2013/data/tarbomb.tar.gz

# Unpack and enjoy! Always check your tarfiles or put them into a separate directory.
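
# For example, list the contents first, then extract into a directory of its own
# (the directory name below is arbitrary).
tar tzf tarbomb.tar.gz | head
mkdir -p bomb
tar xzvf tarbomb.tar.gz -C bomb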

# Quality control.
# Download and install FastQC into the src directory.
cd ~/src
curl -O http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip

# Linux people: do an: sudo apt-get install unzip
unzip fastqc_v0.11.2.zip
cd FastQC
ls -l

# There is an error in the distribution. The executable flag is not set.
chmod +x fastqc

# We want fastqc to be globally available. We have two ways to do this. We could
# add the path to the program to the main PATH (as before), but that gets old when you
# have to do it for every single program and reload the PATH each time.
# Alternatively we could add one directory, say ~/bin, to the PATH once and then create
# links (shortcuts) in this directory to the programs that we want to be available globally.

# Create the ~/bin directory
mkdir -p ~/bin

# Add the ~/bin folder into the PATH

# On a Mac:
echo 'export PATH=~/bin:$PATH' >> ~/.profile
# to apply to the current shell not just future ones:
source ~/.profile

# Under Linux do the following:
# echo 'export PATH=~/bin:$PATH' >> ~/.bashrc
# source ~/.bashrc

# Link fastqc as a "shortcut" in ~/bin
ln -s ~/src/FastQC/fastqc ~/bin/fastqc

# Test that the tool works.
# Linux people may need to install java.
# See the linux install guide: http://www.personal.psu.edu/iua1/courses/code-repository-2014.html#chromebook
fastqc -h

# Run a fastqc report on all files in the directory.
cd ~/edu/lec7
fastqc *.fastq

# This creates *.html files. Open and investigate the outputs.

Lecture 8 - Quality control: base quality trimming.

# Install prinseq, trimmomatic and seqtk

# Download and unpack the software in your src directory
cd ~/src

# Install prinseq
# We need to also pass the -L flag since this site uses link redirects and
# we want to follow those
curl -OL http://downloads.sourceforge.net/project/prinseq/standalone/prinseq-lite-0.20.4.tar.gz

# Unpack the archive.
tar zxvf prinseq-lite-0.20.4.tar.gz

# Go into the archive and read the instructions
# located in the README file. Let's follow what it says there.
cd prinseq-lite-0.20.4

# Make the script executable.
chmod +x prinseq-lite.pl

# Link it into your ~/bin folder.
ln -s ~/src/prinseq-lite-0.20.4/prinseq-lite.pl ~/bin/prinseq-lite

# Now you have prinseq running anywhere on your system.
prinseq-lite

# Install trimmomatic
cd ~/src
curl -O http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip
unzip Trimmomatic-0.32.zip
cd Trimmomatic-0.32

# Alas not much is there. You have to hit the manual at:
# http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
# It has a really user-unfriendly way to run it. Thanks a bunch!
java -jar trimmomatic-0.32.jar

# Let's make a script that launches it for us.
# Create a script with Komodo or from the command line that contains
# the content shown below:

# You can also do it from the command line
echo '#!/bin/bash' > ~/bin/trimmomatic
echo 'java -jar ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar $@' >> ~/bin/trimmomatic
chmod +x ~/bin/trimmomatic

# Look Ma! It works.
trimmomatic

# Now we have both prinseq and trimmomatic. Let's use them.
fastq-dump --split-files SRR519926

# Run prinseq and trim low quality bases from the right in a window of 4 bases
prinseq-lite -fastq SRR519926_1.fastq -trim_qual_right 30 -trim_qual_window 4 -min_len 35 -out_good prinseq_1

# Trimmomatic uses a different set of parameters and a different philosophy. Parameters need to be
# space-separated words where the sub-parameters are separated by colons. Basically they invented their own parameter format.
trimmomatic SE -phred33 SRR519926_1.fastq trimmomatic_1.fq SLIDINGWINDOW:4:30 MINLEN:35 TRAILING:3

# Time the execution speed for both tools. Note just how radically faster trimmomatic is.
# Depending on the operation it could be 100x faster.
# On the other hand prinseq is more than just a quality control tool, it has
# a whole slew of other functionality.
fastqc SRR519926_1.fastq prinseq_1.fq trimmomatic_1.fq

# By default FastQC creates bins in the plots and that may sometimes
# hide details.
fastqc --nogroup SRR519926_1.fastq prinseq_1.fq trimmomatic_1.fq

# To process files in batch you can make use of simple shell looping constructs.
for name in *.fastq; do echo $name; done

Create the trimmomatic file in your ~/bin folder:

#!/bin/bash

# The line above tells the shell to execute the program with bash.
# In the line below, the $@ passes all command line arguments on to the program

java -jar ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar $@

Example of a simple looping construct

#!/bin/bash

# This script will run the trimming on
# all fastq files in the directory
for name in *.fastq
do

	echo "*** currently processing file $name, saving to trimmed-$name"
	trimmomatic SE -phred33 $name trimmed-$name SLIDINGWINDOW:4:30 MINLEN:35 TRAILING:3
	echo "*** done"
done

# Now run the fastqc report.
# You can add any more commands here.
fastqc trimmed-*.fastq

Lecture 9 - Quality control: pattern matching and adapter trimming

# This is data from last lecture
fastq-dump --split-files SRR519926

# Note: pattern searches take place on lines; if the sequences wrap,
# other methods are needed. Most instruments produce reads on one line
# so for those this works. But do remember that the pattern may also match
# the quality string (although that is usually unlikely). A proper
# solution would first extract the sequence.
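
# A simple way to look only at the sequence lines: in a standard 4-line FASTQ
# record the sequence is the second line, so awk can pick every 4th line
# starting at line 2 before the pattern match.
cat SRR519926_1.fastq | awk 'NR % 4 == 2' | egrep "ATG" --color=always | head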

# Find  ATG anchored at the start of the line
cat SRR519926_1.fastq | egrep "^ATG" --color=always | head

# Find ATG anchored at the end of the line
cat SRR519926_1.fastq | egrep "ATG\$" --color=always | head

# Find TAATA or TATTA patterns; [AT] is a set of characters
cat SRR519926_1.fastq | egrep "TA[AT]TA" --color=always | head

# Find TAAATA or TACCTA; (AA|CC) is an alternation of groups
cat SRR519926_1.fastq | egrep "TA(AA|CC)TA" --color=always | head

# Quantify matches with metacharacters
# Find TA followed by zero or more A followed by TA
cat SRR519926_1.fastq | egrep "TA(A*)TA" --color=always | head

# Find TA followed by one or more A followed by TA
cat SRR519926_1.fastq | egrep "TA(A+)TA" --color=always | head

# Find TA followed by two to five As followed by TA
cat SRR519926_1.fastq | egrep "TAA{2,5}TA" --color=always | head

# Practice RegExp matches on online regexp testers
# http://regexpal.com/

# Match Illumina adapters at the end of the reads
# Match AGATCGG anywhere followed by any number of bases
cat SRR519926_1.fastq | egrep "AGATCGG.*" --color=always | head

# See the contaminant and adapter files for FastQC and Trimmomatic
ls ~/src/FastQC/Configuration/

# The content of this file contains a few known adapter sequences.
more ~/src/FastQC/Configuration/adapter_list.txt

# The content of this file shows known contaminants.
more ~/src/FastQC/Configuration/contaminant_list.txt

# To perform the adapter cutting we need to find a file with adapter sequences.
# You can create your own adapter or use the ones that come with Trimmomatic
# Let's create an Illumina adapter file.
echo ">adapter" > adapter.fa
echo "AGATCGGAAGAG" >> adapter.fa

# Let's do both quality and adapter trimming.
trimmomatic SE SRR519926_1.fastq trimmed.fq ILLUMINACLIP:adapter.fa:2:30:10 SLIDINGWINDOW:4:30 TRAILING:30 ILLUMINACLIP:adapter.fa:2:30:5

# Trimmomatic comes with a number of adapters that we can use.
# Depending on the content of the file different modes of operations may take place.

# This is a palindromic adapter. Trimmomatic detects the
# mode of operation from the name of the adapter.
ln -s ~/src/Trimmomatic-0.32/adapters/TruSeq3-PE.fa

# This is an extended palindromic adapter file that also lists single adapters.
ln -s ~/src/Trimmomatic-0.32/adapters/TruSeq3-PE-2.fa

# These are the classic adapters.
ln -s ~/src/Trimmomatic-0.32/adapters/TruSeq3-SE.fa

trimmomatic SE SRR519926_1.fastq trimmed.fq ILLUMINACLIP:adapter.fa:2:30:10 SLIDINGWINDOW:4:30 TRAILING:30 ILLUMINACLIP:TruSeq3-SE.fa:2:30:5

# To run it in paired end mode both files need to be filtered and clipped simultaneously.
# Command line mayhem ensues.
trimmomatic PE SRR519926_1.fastq SRR519926_2.fastq trim_1P.fq trim_1U.fq trim_2P.fq trim_2U.fq ILLUMINACLIP:TruSeq3-PE-2.fa:2:10:10 SLIDINGWINDOW:4:30 TRAILING:20

# A slightly simpler option is to provide the -baseout flag.
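# A sketch of that form (check trimmomatic's usage message for the exact argument
# order in your version); -baseout derives the four output file names from a
# single base name such as trimmed.fq:
trimmomatic PE -phred33 -baseout trimmed.fq SRR519926_1.fastq SRR519926_2.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:10:10 SLIDINGWINDOW:4:30 TRAILING:20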

Lecture 10 - Alignments

#
# For this lecture there are no command line tools to run.
# Instead we will perform alignments via the EBI Pairwise alignment site
#
#
# http://www.ebi.ac.uk/Tools/psa/
#
# The following two protein sequences are a fun example
# from the book: Understanding Bioinformatics by Marketa Zvelebil
# http://www.amazon.com/Understanding-Bioinformatics-Marketa-Zvelebil/dp/0815340249
#

>1
THISLINE

#  Second protein.

>2
ISALIGNED

#
# You will need to perform global and local alignments
# with different parameter settings.
#

# Now if we have time we will investigate the ebola viruses a bit.
# Remember those virus genome starts that in some cases looked so different?
# Let's get the first 100 bases from the two most different genomes and align those.
#
# KM034561 and KM233037
#
efetch -db nucleotide -id KM034561 -seq_start 1 -seq_stop 100 -format fasta > 1.fa
efetch -db nucleotide -id KM233037 -seq_start 1 -seq_stop 100 -format fasta > 2.fa

# Align the two sequences with the Needle tool. What do you see?
# Remember the CGAATAACT was the most common start sequence.

Lecture 11 - Installing and using Blast

#
# Get and install BLAST+
#
# URLs given for a MAC, change it accordingly for Linux

# Switch to the src folder
cd ~/src

# Get the blast executable for the Mac
curl -O  ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.29+-universal-macosx.tar.gz
tar zxvf ncbi-blast-2.2.29+-universal-macosx.tar.gz

# For Linux download a different package
# curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.29+-x64-linux.tar.gz
# tar zxvf ncbi-blast-2.2.29+-x64-linux.tar.gz

# I can either link the executable or add the entire directory to
# my path. When there are lots of programs I usually add the entire
# directory.

echo 'export PATH=$PATH:~/src/ncbi-blast-2.2.29+/bin' >> ~/.profile

# Or on Linux do a
# echo 'export PATH=$PATH:~/src/ncbi-blast-2.2.29+/bin' >> ~/.bashrc

# Load the settings into the current terminal (new terminals will auto load this)
source ~/.profile

# Check that it works. Short help.
blastn -h

# Long help.
blastn -help

# Checklist:
#   1. What are we looking to find? -> query sequence
#   2. Where are we looking? -> database -> target sequence(s)
#   3. How are we looking for it? -> search type

# Let's make a blast database

# Get a genome sequence of the Ebola virus
# When we index databases it may create more files. It is
# best if we place these in a separate folder. Let's call that refs.
mkdir -p refs
efetch -db nucleotide -id KM233118 -format fasta > refs/KM233118.fa

# The reference may contain one or more Fasta records. To
# turn it into a blast database we need to transform it to a
# format the program is able to operate on.

# Long and short help for the blast database indexer.
makeblastdb -h
makeblastdb -help

# Create a nucleotide database out of the genome.
makeblastdb -in refs/KM233118.fa -dbtype nucl

# See what happened.
ls refs

# Create a query.
head -2 refs/KM233118.fa > query.fa

# Search with the query: nucleotide sequence vs nucleotide database
# Take note of the formats.
blastn -db refs/KM233118.fa -query query.fa | more

# We can change the output format. Tabular forms
# can be used when base by base alignment information is not needed.
blastn -db refs/KM233118.fa -query query.fa -outfmt 6

# This will also write header information.
blastn -db refs/KM233118.fa -query query.fa -outfmt 7

# You may format the output in very different ways.
blastn -db refs/KM233118.fa -query query.fa -outfmt '6 qseqid sseqid qlen length mismatch'

# By default blastn uses the megablast search strategy.
# Shorten the query to:
# >short
# AATCATACCTGG
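
# You can create that shortened query from the command line, the same
# way we built the adapter file earlier.
echo ">short" > query.fa
echo "AATCATACCTGG" >> query.fa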

# There are different search strategies/tasks for blast.
# These represent a pre-packaged set of parameters.
blastn -db refs/KM233118.fa -query query.fa

blastn -db refs/KM233118.fa -query query.fa -task blastn

#
# Make the sequence even shorter.
# >short
# AATCATA
#
# Now -task blastn won't work either, there is another task blastn-short
# Tuning Blast can be a little world on its own.
#
blastn -db refs/KM233118.fa -query query.fa -task blastn-short

# You can have more than one sequence in the target database.
esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta > refs/all-genomes.fa

# Create the blast database of all genomes.
makeblastdb -in refs/all-genomes.fa -dbtype nucl

# Pick any region of a genome
efetch -db nucleotide -id KM233118 -format fasta -seq_start 1 -seq_stop 1000 > 1000.fa

Lecture 12 - blastdbcmd, blastp, blastx, tblastn usage

#
# blastdbcmd, blastp, blastx, tblastn usage
#
#
# Compare the new virus to an outbreak from 1999.
# RefSeq accession number NC_002549.1, Nucleoprotein example: NP_066243.1

# http://www.ncbi.nlm.nih.gov/genome/4887

# The Bioproject for the data from 1999.

# Get a feature table
esearch -db protein -query PRJNA14703 | efetch -format ft

# Search the features of a genome, ft -> feature table.
efetch -db nucleotide -id NC_002549 -format ft | more

# Get the protein data for the project from the protein database
# and store it next to the other references in the refs folder.
esearch -db protein -query PRJNA14703 | efetch -format fasta > refs/NC_002549-prot.fa

# Fetch the genome itself as well, if it is not already in the refs folder.
efetch -db nucleotide -id NC_002549 -format fasta > refs/NC_002549.fa

# Add the -parse_seqids option to makeblastdb.
# Create Blast databases for both the nucleotides and the proteins.
makeblastdb -in refs/NC_002549.fa -dbtype nucl -parse_seqids
makeblastdb -in refs/NC_002549-prot.fa -dbtype prot -parse_seqids

# blastdbcmd can take many parameters that can query and format
# the content of the blast database.

# Lists all the content in the database.
blastdbcmd -db refs/NC_002549.fa -entry 'all' | head -3

# Get a specific entry of the database.
blastdbcmd -db refs/NC_002549.fa -entry '10313991' | head -3

# Get a range of the nucleotide entries in the database.
blastdbcmd -db refs/NC_002549.fa -entry 'all' -outfmt '%a %s' -range 1-10 -strand minus

# Format the content of the database.
blastdbcmd -db refs/NC_002549.fa -entry 'all' -outfmt '%a %l'

# List each protein with its length.
blastdbcmd -db refs/NC_002549-prot.fa -entry 'all' -outfmt '%a %l'

# Extract an entry of the protein database into a file.
blastdbcmd -db refs/NC_002549-prot.fa -entry 'NP_066243.1' > query-p.fa

# Run blastp.
blastp -db refs/NC_002549-prot.fa -query query-p.fa

# Run blastx and tblastn.
# Extract a region that is protein coding. I found the coordinates from the GenBank page.
blastdbcmd -db refs/NC_002549.fa -entry 'NC_002549' -range 470-2689  > nucleotide.fa

# Align the nucleotide to the protein database.
blastx -db refs/NC_002549-prot.fa -query nucleotide.fa | more

# Fetch the nucleoprotein.
efetch -db protein -id NP_066243.1 -format fasta > NP_066243.fa
tblastn -db refs/NC_002549.fa -query NP_066243.fa  | more

# Get all proteins of the 2014 ebola project.
esearch -db protein -query PRJNA257197 | efetch -format fasta > refs/ebola-2014.fa
makeblastdb -in refs/ebola-2014.fa -dbtype prot -parse_seqids


Lecture 13 - installing and using bwa

#
# Get bwa from SourceForge.
#
# The URL you get from SourceForge is complicated because SourceForge really wants you
# to visit the site and download from there so that you see the advertising there.
#
# Alternatively visit the webpage and download manually.
#
cd ~/src

curl -OL http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.10.tar.bz2

# It is compressed with bzip2 so use the -j flag to uncompress.
tar jxvf bwa-0.7.10.tar.bz2

# We need to compile the program
cd bwa-0.7.10/

# The make command invokes the compilation.
make

# Link the bwa program to the bin folder
ln -s ~/src/bwa-0.7.10/bwa ~/bin

# Make a folder for this lecture and switch to it.
mkdir -p ~/edu/lec13; cd ~/edu/lec13

# Run the tool to see how it works.
bwa

# Nice manual too!
man ~/src/bwa-0.7.10/bwa.1

# Make a global directory to store the references in.
mkdir -p ~/refs/852

# Get the reference genome for the 1999 outbreak.
efetch -db nucleotide -id NC_002549 -format fasta > ~/refs/852/ebola-1999.fa

# Index the genome to make it searchable with bwa.
bwa index ~/refs/852/ebola-1999.fa

# See the results.
ls ~/refs/852/

# Also make it searchable via blast.
makeblastdb -in ~/refs/852/ebola-1999.fa -dbtype nucl -parse_seqids

# Now look what happened. Whoa, thanks a bunch! (not!)
ls ~/refs/852/

# 16 files from a single one. That's why it is best to keep
# references in a separate folder.
ls -l ~/refs/852/ | wc -l

# Get the first line of the ebola genome.
head -2 ~/refs/852/ebola-1999.fa > query.fa

# Map it back to the genome itself this time with bwa-mem
bwa mem ~/refs/852/ebola-1999.fa query.fa > results.sam

# Compare the process to blasting the same sequence
blastn -db ~/refs/852/ebola-1999.fa -query query.fa > results.blastn

Lecture 14 - understanding the SAM format

# Continuing where we left off in lecture 13
head -2 ~/refs/852/ebola-1999.fa  > query.fa

# Run the alignment.
bwa mem ~/refs/852/ebola-1999.fa query.fa > results.sam

# Note how bwa still prints messages to the terminal even though we redirected the output.
# There are two output streams. They are called: standard output and standard error.
# Here is how you would redirect each of them into its own file.
bwa mem ~/refs/852/ebola-1999.fa query.fa 1> results.sam 2> errors.txt

# There are many tricks to redirect one stream into another, although in the
# vast majority of cases this is not necessary. Sometimes you really need to
# store the error message.
#
# http://www.tldp.org/LDP/abs/html/io-redirection.html
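
# For example, 2>&1 sends the standard error into the standard output, so both
# streams end up in the same file (the file name below is arbitrary).
bwa mem ~/refs/852/ebola-1999.fa query.fa > everything.txt 2>&1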

# Create queries based on the line you have.
# Add or remove bases, create insertions and deletions
# Run the alignment.

bwa mem ~/refs/852/ebola-1999.fa query.fa > results.sam
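
# For instance, one could cut ten bases out of the middle of the query
# (the file names below are arbitrary) and watch how the CIGAR string changes.
head -2 ~/refs/852/ebola-1999.fa | tail -1 | cut -c 1-30,41- > deleted.txt
(echo ">deleted"; cat deleted.txt) > query-del.fa
bwa mem ~/refs/852/ebola-1999.fa query-del.fa | grep -v '^@' | cut -f 1,4,6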

# To view certain columns you can cut out only those.
# It is important that you understand the meaning of each column.

# Query id, alignment flag and target id
cat results.sam | cut -f 1,2,3

# Alignment start, mapping quality, CIGAR string
cat results.sam | cut -f 4,5,6

# Paired end read information: name, position and length of template
cat results.sam | cut -f 7,8,9

# Query sequence, query quality, other attributes
cat results.sam | cut -f 10,11,12,13,14

# If you list two files to bwa-mem it will align them in paired-end mode.
# Get the example dataset from the course website.
curl -O http://www.personal.psu.edu/iua1/courses/files/2014/data-14.tar.gz

# Unpack and align this dataset then use the results to answer the homework questions.
bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam

Lecture 15 - BAM files and samtools

#
# Install a short read simulator. wgsim by Heng Li.
#
cd ~/src
git clone https://github.com/lh3/wgsim.git
cd wgsim
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
ln -s ~/src/wgsim/wgsim ~/bin/wgsim

# Check that it works
wgsim

# Now download and install the samtools package.
# Link samtools to the ~/bin directory

cd ~/src
curl -OL http://sourceforge.net/projects/samtools/files/samtools/1.1/samtools-1.1.tar.bz2
tar jxvf samtools-1.1.tar.bz2
cd samtools-1.1
make
ln -s ~/src/samtools-1.1/samtools ~/bin/

# Check that it works.
samtools

# Look at the samtools manual.
man ~/src/samtools-1.1/samtools.1

# We'll generate short reads from a mutated reference genome
# then map these back to it.
mkdir ~/edu/lec15
cd ~/edu/lec15

# Generate some data, put the output into a file
wgsim -N 10000 ~/refs/852/ebola-1999.fa read1.fq read2.fq > mutations.txt

# Run the alignment
bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam

# Start the conversion to a BAM file.
samtools view -Sb results.sam > temp.bam

# Sort the alignment.
samtools sort -f temp.bam results.bam

# Index the alignment.
samtools index results.bam

# Typically we put all commands into a single script and run it as such.
# See script at the end.
bash align.sh read1.fq read2.fq results.bam

# Filtering with samtools
# -f match the flag (keep the reads that match the flags),
# -F filter the flags (remove reads that match the flag and keep the rest)

# Reads that align to the reverse strand.
samtools view -f 16 results.bam

# Reads that align to the forward strand.
samtools view -F 16 results.bam

# The -c flag counts
samtools view -F 16 results.bam | wc -l
samtools view -c -F 16 results.bam

# Filter by quality. BWA shows a mapping quality of zero
# for reads that map to multiple locations. q>=1 thus means
# uniquely mapping reads.
samtools view -c -q 1 results.bam

# Count the high quality alignment.
samtools view -c -q 40 results.bam

# The reference name is gi|10313991|ref|NC_002549.1|
# It is tedious to type that. Create a shortcut.
CHR='gi|10313991|ref|NC_002549.1|'

# Slice into the datafile.
samtools view results.bam $CHR:1-100

# View specific columns of the file.
samtools view results.bam $CHR:1-100 | cut -f 4 |  more
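
The align.sh script referred to above is not reproduced on this page. A minimal sketch, assembled from the commands shown in this lecture (it assumes the bwa-indexed ebola-1999.fa reference from lecture 13 and the samtools 1.1 syntax used above):

#!/bin/bash
#
# align.sh - align a pair of FASTQ files and produce a sorted, indexed BAM file.
#
# Usage: bash align.sh read1.fq read2.fq results.bam

# Stop on errors and on undefined variables.
set -ue

READ1=$1
READ2=$2
BAM=$3

# Reference genome, already indexed with bwa index.
REFS=~/refs/852/ebola-1999.fa

# Align, convert to BAM, sort, then index.
bwa mem $REFS $READ1 $READ2 > tmp.sam
samtools view -Sb tmp.sam > tmp.bam
samtools sort -f tmp.bam $BAM
samtools index $BAM

# Remove the intermediate files.
rm -f tmp.sam tmp.bam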

Lecture 16 - Converting data with ReadSeq

#
# Install the file format converter
# We have to make a directory for it.
#
mkdir -p ~/src/readseq

cd ~/src/readseq

curl -OL http://iubio.bio.indiana.edu/soft/molbio/readseq/java/readseq.jar

# It is invoked similarly to trimmomatic.
# Modify the script for trimmomatic
cp ~/bin/trimmomatic ~/bin/readseq

#
# Replace the program with
# java -jar ~/src/readseq/readseq.jar $@
#
# You can also add
# java -cp ~/src/readseq/readseq.jar app $@
#
# to run it with a graphical user interface.

# Get the 1999 ebola genome as a full GenBank record.
# http://www.ncbi.nlm.nih.gov/genome/4887

# Get the full genbank file.
mkdir -p ~/edu/lec16

cd ~/edu/lec16

# Get the complete data.
efetch -db nucleotide -id NC_002549.1 -format gb > NC.gb

# Format the data as GFF (Generic Feature Format).
# Autodetects the input format.
readseq -format=GFF NC.gb

# You may also set the output into a filename.
readseq -format=GFF -o NC-all.gff NC.gb

# Extract into a fasta file.
readseq -format=FASTA -o NC.fa NC.gb

# Retain only the gene rows from the GFF file.
cat NC-all.gff | egrep '\tgene\t'

# We also want the GFF header line at the top since that identifies the file format.
echo '##gff-version 2' > NC-genes.gff
cat NC-all.gff | egrep '\tgene\t' >> NC-genes.gff

#
# Chromosomal coordinates do not match! We need to fix that.
#
# We either need to redo the alignment, or swap the coordinates
# of the existing alignment or features.
#
# Reindex and realign the data
cp NC.fa ~/refs/852/
bwa index ~/refs/852/NC.fa
cp ../lec15/*.fq .

# Edit align.sh to make use of the new file and rerun everything.
bash align.sh read1.fq read2.fq results.bam

# Load the data and view it in IGV
#
# Look at the depth in the 100-500 bp region.
# The output is sequence id, genomic index, coverage
#
samtools depth -r NC_002549:100-500 results.bam

Lecture 17 - Simple programming with awk

#
# Simple programming with awk
#
# We did this already but let's do it again so that we
# know what we are working with.
# Extract the Ebola coding features, genes and coding sequences
efetch -db nucleotide -id NC_002549.1 -format gb > NC.gb
readseq -format=GFF -o NC.gff NC.gb

# Print a few columns. Note how the "magic"
# variable names $1, $2, ... get created.
cat NC.gff | awk ' { print $1, $2, $3 } ' | head -5

# This is (almost) equivalent to cutting columns
cat NC.gff | cut -f 1,2,3 | head -5

# We can use operations between the fields. How long is each feature?
cat NC.gff | awk ' { print $3, $5-$4 + 1 } ' | head -5

# We can make use of the pattern matching to extract only CDS features.
cat NC.gff | awk '$3 =="gene" { print $3, $5-$4 + 1, $9 } '

# What is the cumulative length of all genes
cat NC.gff | awk '$3 =="gene" { len=$5-$4 + 1; size += len; print "Size:", size } '

# What is the cumulative length of all coding sequences
cat NC.gff | awk '$3 =="CDS" { len=$5-$4 + 1; size += len; print "Size:", size } '

# How big is the genome? There are many ways to find out, perhaps from the SAM headers.
samtools view -h ../lec16/results.bam | head -2
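
# The LN: tag on the @SQ header line carries the length; awk can pull out just
# that number (assuming the BAM file from lecture 16 as above).
samtools view -H ../lec16/results.bam | awk '$1 == "@SQ" { for (i=1; i<=NF; i++) if ($i ~ /^LN:/) print substr($i, 4) }'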

# What percent of the genome is covered by coding sequences?
cat NC.gff | awk '$3 =="CDS" { len=$5-$4 + 1; size += len; perc=size/18959; print "Size:", size, perc } '

# What we need is to force awk to only split by tabs.
# Splitting by all whitespace is useful but in data analysis
# can lead to very annoying bugs.
# Tell awk to split on tabs both on input (-F, the field separator) and on output (OFS, the output field separator)

# Put this into your .profile or .bashrc file
alias awk="awk -F '\t' -v OFS='\t'"

# Apply the setting.
source ~/.profile

# Make a new gff file with the genes and coding sequences.
# The first line specifies what kind of file this is.
head -1 NC.gff  > NC-genes.gff
head -1 NC.gff  > NC-cds.gff

# Separate the files for various features.
cat NC.gff | awk ' $3=="gene" { print $0 }' >> NC-genes.gff
cat NC.gff | awk ' $3=="CDS" { print $0 }'  >> NC-cds.gff

# Sam files are tab delimited and can be processed via awk.

# In last week's data how many bases have the coverage of more than 50?
samtools depth ../lec16/results.bam | awk '$3 > 50 { print $0 } ' | wc -l

# How many templates are longer than 50 bp.
# The template length can be negative though...
samtools view ../lec16/results.bam | awk ' $9 > 50 { print $0 } '  | wc -l

Split a GFF file to extract the gene name

$3 == "gene" {

    # Split the 9th column by ;
    split($9, x, ";")

    # The gene name is in the first resulting element.
    # Split that by space. The gene name is the second
    # element.
    split(x[1], y, " ")

    # Remove the double quotes around the gene name
    name = y[2]

    # Global substitution of " with the empty string.
    # Since " is also a special character we have to
    # write it as \"
    gsub("\"", "", name)

    # Print the type of the feature, the name and the size
    print $3, name, $5 - $4 + 1
}

Lecture 18 - Comparing alignment tools

#
# Download and install bowtie2 aligner
#
cd ~/src

# On a Mac OSX use
curl -OL http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.4/bowtie2-2.2.4-macos-x86_64.zip
# On Linux use
#curl -OL http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.4/bowtie2-2.2.4-linux-x86_64.zip

# This is already a binary so it is executable
unzip bowtie2-2.2.4-macos-x86_64.zip

# Link the aligner and other related programs into the bin folder
ln -s ~/src/bowtie2-2.2.4/bowtie2 ~/bin/
ln -s ~/src/bowtie2-2.2.4/bowtie2-align ~/bin/
ln -s ~/src/bowtie2-2.2.4/bowtie2-build ~/bin/

# Build the index to our reference genome
# This is now the bowtie2 specific index
bowtie2-build ~/refs/852/NC.fa ~/refs/852/NC.fa

# Format the mutations so that these can be displayed in the browser
# Make it into a gff file.
cat mutations.txt | awk ' {print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "." }' > mutations.gff

# Run the comparisons
bash compare.sh

# Modify the script to have larger error rates.
# Count the number of mapped reads as you change the error rate.
samtools view -cF 4 bwa.bam
samtools view -cF 4 bow.bam

The final script

# Compare the output of two aligners.
#
# Usage: bash compare.sh

# Stop on errors.
set -ue

# Data file names.
READ1=r1.fq
READ2=r2.fq

# Reference file. Must be indexed with both aligners.
REFS=~/refs/852/NC.fa

# Simulate reads from a genome.
# Edit the error rates and rerun the pipeline.
wgsim -N 10000 -e 0.1 $REFS $READ1 $READ2 > mutations.txt

# convert mutations into a GFF file.
cat mutations.txt | awk -F '\t' ' BEGIN { OFS="\t"; print "##gff-version 2" } { print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "name " $3 "/" $4 }' > mutations.gff

# We need to add a read group to the mapping
GROUP='@RG\tID:123\tSM:liver\tLB:library1'

# Run bwa and create the bwa.sam file.
bwa mem -R $GROUP $REFS $READ1 $READ2 > bwa.sam

# Run bowtie2 and create the bow.sam file.
bowtie2 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# Tuning bowtie2
#bowtie2 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# For each sam file convert it to bam format.
for name in *.sam;
do
    samtools view -Sb $name > tmp.bam
    samtools sort -f tmp.bam $name.bam
done

# Get rid of intermediate files.
rm -f tmp.sam tmp.bam

# Fix names so that they don't have two extensions.
mv bwa.sam.bam bwa.bam
mv bow.sam.bam bow.bam

# Index the data.
for name in *.bam
do
    samtools index $name
done

# Clean all the data generated by this program
# rm -f *.bam *.bai *fq *.txt *.sam *.gff

Lecture 19 - samtools faidx and pileups

#
# Continuing with the code from lecture 18
#
#
# We have the bwa.bam and bow.bam files.

#
# Pileup output. The wgsim simulator produces reads with low quality
# that would otherwise be ignored by samtools.
#

# index the fasta file with samtools
samtools faidx ~/refs/852/NC.fa

# Query the fasta file with samtools
samtools faidx ~/refs/852/NC.fa NC_002549:1-10

# You can list multiple ranges
samtools faidx ~/refs/852/NC.fa NC_002549:1-10 NC_002549:100-110


# Investigate the output with or without a fasta reference
#
# look at the help file
#
samtools mpileup

# pileup with no reference
samtools mpileup -Q 0 bwa.bam | more

# pileup with reference
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

Lecture 20 - Pileups and coverages

#
# First off install bcftools a suite for interacting with variant call formats.
#
#
cd ~/src
curl -OL http://sourceforge.net/projects/samtools/files/samtools/1.1/bcftools-1.1.tar.bz2
tar jxvf bcftools-1.1.tar.bz2
cd bcftools-1.1
make
ln -s ~/src/bcftools-1.1/bcftools ~/bin

# Note the similarity in the usage of bcftools and samtools.
bcftools

#
# Using the alignment script for lecture 18.
#
#
bash compare.sh

#
# What is the average coverage for each run.
#
# NR is the number of records, which may not be equal to the size of the genome.
# By default samtools depth only outputs regions that have non-zero coverage.
samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/NR } '

#
# How to find the size of the genome.
# Many ways, none of them particularly pretty.
#

# The @SQ header contains the lengths.
samtools view -h bwa.bam | head | grep @SQ

# Now we know the size of the chromosome.
samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/18959 } '

#
# Pileups with or without a reference
#
# Look at the usage
#
samtools mpileup

# pileup with no reference
samtools mpileup -Q 0 bwa.bam | more

# Pileup with reference.
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

#
# Samtools text genome browser.
# Press ? for help, q or ESC to exit.
samtools tview bwa.bam

# Generate a variant Call format for the entire genome.
samtools mpileup -Q 0 -f ~/refs/852/NC.fa -uv bwa.bam | more

# Generate genotypes then call variants.
# Samtools is designed for human genomes - it does not seem to
# work all that well with viral genomes.
samtools mpileup -Q 0 -ugf ~/refs/852/NC.fa bwa.bam | bcftools call -vm > snps.vcf

#
# Welcome to your next file format.
#
# A column by column guide.
#
samtools mpileup -Q 0 -f ~/refs/852/NC.fa -uv bwa.bam | grep -v "##" | cut -f 1,2,3,4,5 | head

The updated script

# Compare the output of two aligners.
#
# Usage: bash compare.sh

# Stop on errors.
set -ue

# Data file names.
READ1=r1.fq
READ2=r2.fq

# Reference file. Must be indexed with both aligners.
REFS=~/refs/852/NC.fa

# Simulate reads from a genome.
# Edit the error rates and rerun the pipeline.
wgsim -N 10000 -e 0.1 $REFS $READ1 $READ2 > mutations.txt

# convert mutations into a GFF file.
cat mutations.txt | awk -F '\t' ' BEGIN { OFS="\t"; print "##gff-version 2" } { print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "name " $3 "/" $4 }' > mutations.gff

# We need to add a read group to the mapping
GROUP='@RG\tID:123\tSM:liver\tLB:library1'

# Run bwa and create the bwa.sam file.
bwa mem -R $GROUP $REFS $READ1 $READ2 > bwa.sam

# Run bowtie2 and create the bow.sam file.
bowtie2 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# Tuning bowtie2
#bowtie2 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# For each sam file convert it to bam format.
for name in *.sam;
do
    samtools view -Sb $name > tmp.bam
    samtools sort -f tmp.bam $name.bam
done

# Get rid of intermediate files.
rm -f tmp.sam tmp.bam

# Fix names so that they don't have two extensions.
mv bwa.sam.bam bwa.bam
mv bow.sam.bam bow.bam

# Index the data.
for name in *.bam
do
    samtools index $name
done

# Clean all the data generated by this program
# rm -f *.bam *.bai *fq *.txt *.sam *.gff

Lecture 21 - Variant calling with samtools

# Shortcut to convert and sort a bam file.
alias tobam='samtools view -b - | samtools sort -o - tmp '

# Install bcftools.
cd ~/src
curl -OL http://sourceforge.net/projects/samtools/files/samtools/1.1/bcftools-1.1.tar.bz2
tar jxvf bcftools-1.1.tar.bz2
cd bcftools-1.1
make
ln -s ~/src/bcftools-1.1/bcftools ~/bin

#
# Using the alignment script for lecture 18.
#
#
bash compare.sh

# Pileup with reference.
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

# Generate genotypes then call variants.
# Samtools is designed for diploid human genomes
# It may have tacit assumptions associated with it.
samtools mpileup -Q 0 -ugf ~/refs/852/NC.fa bwa.bam | bcftools call -vc > samtools.vcf

# View the snp calls.

# How does snp calling work?
# A DIY snp caller I wrote as a simple demo (see website)
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam > bwa.pileup
cat bwa.pileup | python snpcaller.py > diy.vcf

# Look at that my simplistic caller produces the same
# result as samtools on default settings -- how about that ...

#
# Welcome to your next file format.
#
# A column by column guide.
#
cat samtools.vcf| grep -v "##" | cut -f 1,2,3,4,5 | head

# Let's install FreeBayes

# On a Mac we need another program to compile freebayes
brew install cmake

# Now onto installing FreeBayes
cd ~/src
git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes

# This produces an error on a Mac.
make

# But it will work when run a second time. It is not clear what is going
# on. The program seems to work - but probably some functionality is missing.
make

ln -sf ~/src/freebayes/bin/freebayes ~/bin

# Now switch back to the main directory and call snps with FreeBayes
cd ~/edu/lec21/

# How does the tool work?
freebayes

# Call snps with it
freebayes -f ~/refs/852/NC.fa bwa.bam > freebayes.vcf

# Visualize the results.

DIY snp caller

#
# DIY snp caller
#
import sys
from collections import defaultdict

# This code was written in about 20 minutes to demonstrate
# a really simple snp caller. As it is, it calls only
# homozygous mutations, defined as over 50% of the calls.

print "##fileformat=VCFv4.2"
print "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bwa.bam".split())

for line in sys.stdin:
    # Strip the newline then split by tabs.
    column = line.strip().split("\t")

    # The fifth column contains the base calls
    # (zero based counting, hence index 4).
    bases = column[4]

    # The quals column will tell us how many bases are called.
    quals = column[5]

    # Upper case everything.
    # We do lose strand information with that.
    bases = bases.upper()
    count = defaultdict(int)
    for c in bases:
        count[c] += 1

    # How many bases cover the site
    depth = float(len(quals))

    for m in "ATGC":
        if count[m]/depth > 0.50:
            # We got a homozygous mutation;
            # produce the VCF record.
            # Collect output into an array.

            info = "DP=%s" % (int(depth))
            format = "GT:PL"
            values = "1/1:30"
            out = [column[0], column[1], ".", column[2], m, 30, ".", info, format, values]

            # Make all elements of the array a string.
            out = map(str, out)

            # Join the fields of the output array by tabs
            # Print the joined fields.
            print "\t".join(out)

Lecture 22 - Variant effect prediction

# Install, build and link bioawk
cd ~/src
git clone https://github.com/lh3/bioawk.git
cd bioawk
make
ln -sf ~/src/bioawk/bioawk ~/bin

# There is a manual that comes with bioawk.
man ~/src/bioawk/awk.1

# Bioawk knows how to handle bioinformatics data types.
# It not only knows how to split records by column,
# it can also populate the fields by name.

#
# After you run the compare.sh script that simulates reads
# and aligns them (see previous lectures),
# process the outputs with bioawk. Help on the formats:
#
bioawk -c help

# Print the sequence of the first read.
cat r1.fq | bioawk -c fastx ' { print $seq } ' | head -1

cat mutations.gff | bioawk -c gff ' { print $seqname, $start } ' | head -1

# Code becomes more readable instead of writing:
cat mutations.gff | grep -v "#" | awk -F '\t' -v OFS='\t' ' { print $1, $5-$4+1 } ' | head -1

# We can write the following:
cat mutations.gff | bioawk -c gff ' { print $seqname, $end-$start+1 } ' | head -1

# We'll use bioawk from now on. We'll cover seqtk, tabtk, tabix at later times.


# Let's download snpEff
cd ~/src
curl -OL http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
unzip snpEff_latest_core.zip

# We can make a script to run snpEff, like we did for trimmomatic and readseq;
# for simplicity we'll make an alias for now.
alias snpEff='java -jar ~/src/snpEff/snpEff.jar'

# Run snpEff
snpEff

# You may get an error of the form: UnsupportedClassVersionError
# This means that you need to update your java program to at least version 7
java -version

# Search for, download and install the Java JDK (the Java Development Kit, not the JRE, the Java Runtime Environment)
# Now snpEff should work.
snpEff

# There are prebuilt snpEff databases
snpEff databases | more

# Search for the ebola data
snpEff databases | grep ebola

# Download the ebola database
# -v makes the process verbose: prints more messages
snpEff download -v ebola_zaire

# But what are those annotations?
snpEff dump ebola_zaire

#
# Ok, so the annotations are built relative to a different genome, not
# the one we currently use. So we need to get that one. Oy.
#
# Also get the GenBank file for that genome and extract the sequence and
# the coding features from it.
efetch -db nucleotide -id KJ660346 -format gb > KJ660346.gb
readseq -format=FASTA -o ~/refs/852/KJ660346.fa KJ660346.gb
readseq -format=GFF  -o KJ660346.gff KJ660346.gb

# The query will be the old genome from 1999 that is in NC.fa

# Index the genome with bwa and samtools
bash align-genomes.sh ~/refs/852/KJ660346.fa ~/refs/852/NC.fa

# Now we need to create a script that takes two files as input
# and generates a VCF file as output.

# Run snpEff
java -jar ~/src/snpEff/snpEff.jar -v ebola_zaire results.vcf > annotated.vcf

Align two genomes

#
# Align one genome against another and call snps.
#
# First parameter is the reference, second parameter is the query genome.
#
#

# This will stop the program from
# running if no parameters are passed
set -ue

# Parameters from the command line.

# First parameter is the first genome (reference) file.
GENOME1=$1

# The second parameter is the second genome.
GENOME2=$2

# This is only needed to be performed once
bwa index $GENOME1

# Run bwa and create the bwa.sam file
bwa mem $GENOME1 $GENOME2 > aln.sam

# If you have installed the lastz aligner you can use that.
#lastz $GENOME1[unmask] $GENOME2[unmask] --format=sam > aln.sam

samtools view -Sb aln.sam > tmp.bam
samtools sort -f tmp.bam aln.bam
samtools index aln.bam

# Get rid of intermediate files.
rm -f tmp.bam

# Call snps
samtools mpileup -f $GENOME1 -ug aln.bam | bcftools call -vm > results.vcf

Lecture 23 - Call SNPs from the ebola data

#
# Call SNPS on multiple samples
#
# Get multiple datasets from the Ebola project
#
#
#

#
# Ouch! The data was aligned against a different reference sequence!
#
# We have to prepare that new reference then. There is no other workaround.
#
efetch -id KM034562 -format gb -db nucleotide > KM034562.gb
readseq -format=FASTA -o ~/refs/852/KM034562.fa KM034562.gb
readseq -format=GFF -o KM034562.gff KM034562.gb
bwa index ~/refs/852/KM034562.fa
samtools faidx ~/refs/852/KM034562.fa

# Get the run info file
#
esearch -db sra -query PRJNA257197  | efetch -format runinfo > runinfo.csv

#
# Select a few SRR (try more if feeling adventurous)
#
cat runinfo.csv | cut -f 1 -d ',' | grep SRR | head -10 > sraids.txt

#
# Let's put the data into a different folder.
# There is going to be a lot of it.
#
mkdir -p sra
mkdir -p bam

#
# Fetch each SRA id from the file
#
cat sraids.txt | awk ' { print "fastq-dump -X 20000 --split-files -O sra " $1 } ' > get-data.sh

# Run the generated script to download the data.
bash get-data.sh

# Now align each pair of read files.
# The align-ebola.sh script (listed below) should be present. It aligns the reads
# of one SRA run and generates a BAM file from them.
cat sraids.txt | awk ' { print "bash align-ebola.sh " $1 } ' > align-data.sh
bash align-data.sh

# Call snps on all the datasets.
freebayes -p 1 -f ~/refs/852/KM034562.fa bam/*.bam > freebayes.vcf
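
# Quick check: skip the header lines and look at the first few variant calls.
grep -v "^##" freebayes.vcf | head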

Alignment script: align-ebola.sh

# Align one SRA run against the Ebola reference and produce a sorted, indexed BAM file.
#
# Usage: bash align-ebola.sh SRA-RUN-ID
#

# Put the alignment files into a separate folder
mkdir -p bam

# Stop on errors.
set -ue

# Data file names.
NAME=$1

# Create read file names from the incoming parameter.
# ABC will become sra/ABC_1.fastq and sra/ABC_2.fastq

# Form the input names
READ1=sra/"$NAME"_1.fastq
READ2=sra/"$NAME"_2.fastq

# Reference file. Must be indexed with both aligners.
REFS=~/refs/852/KM034562.fa

# We need to add a read group to the mapping so that downstream tools
# can tell samples apart (ID = read group, SM = sample, LB = library).
GROUP="@RG\tID:$NAME\tSM:$NAME\tLB:$NAME"

# Run bwa and create the bwa.sam file.
# Create a sorted bamfile as output.

# This will be the bamfile name.
BAM=bam/$NAME.bam

bwa mem -R $GROUP $REFS $READ1 $READ2 2> logfile.txt | samtools view -b - | samtools sort -o - tmp > $BAM

# Print the exit status of the alignment pipeline.
echo $?

samtools index $BAM

echo "*** Created alignment file $BAM"

# print out a simple statistic line
samtools flagstat $BAM | grep proper

Lecture 24 - Creating interval data

#
# Get the ebola genome data and features
# (if you do not already have it)
#

#
# A demo genome. It could be any other.
#
efetch -id KM034562 -format gb -db nucleotide > KM034562.gb
readseq -format=FASTA -o ~/refs/852/KM034562.fa KM034562.gb
readseq -format=GFF -o KM034562.gff KM034562.gb

#
# We will build BED and GFF files by hand
# with an editor and display them in a genome browser (IGV).
#
#
# Example BED file; the columns are tab delimited:
# (when you copy/paste from a web browser the tabs usually get turned into spaces)
#
KM034562	100	200	Happy	0	-
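
# If the tabs got converted into spaces during copy/paste, awk can restore them.
# (my.bed and fixed.bed below are just placeholder names; use your own files)
cat my.bed | awk -v OFS='\t' '{ $1=$1; print }' > fixed.bed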

#
# Example BEDgraph file
#
KM034562	100	200	50
KM034562	150	250	-50

#
# Example GTF file
#
KM034562	lecture	CDS	100	200	.	-	.	gene "Happy"; protein_id "HAP2"

#
# Example GFF file
#
##gff-version 3
KM034562	lecture	CDS	100	200	.	-	.	gene=Happy; protein_id=HAP2

#
# Hierarchical data representations: for example, splicing
#
#
# GTF representation
#

##gff-version 2
KM034562	demo	exon	1000	2000	.	+	.	gene_id "Happy"; transcript_id "Sad";
KM034562	demo	exon	3000	4000	.	+	.	gene_id "Happy"; transcript_id "Sad";
KM034562	demo	exon	5000	6000	.	+	.	gene_id "Happy"; transcript_id "Sad";
KM034562	demo	exon	7000	8000	.	+	.	gene_id "Happy"; transcript_id "Sad";

KM034562	demo	exon	1000	2000	.	+	.	gene_id "Happy"; transcript_id "Mad";
KM034562	demo	exon	5000	6000	.	+	.	gene_id "Happy"; transcript_id "Mad";
KM034562	demo	exon	7000	8000	.	+	.	gene_id "Happy"; transcript_id "Mad";

KM034562	demo	exon	1000	2000	.	+	.	gene_id "Happy"; transcript_id "Bad";
KM034562	demo	exon	5000	6000	.	+	.	gene_id "Happy"; transcript_id "Bad";

#
# GFF representation
#
##gff-version 3
KM034562	demo	exon	1000	2000	.	+	.	Parent=Sad,Mad,Bad;
KM034562	demo	exon	3000	4000	.	+	.	Parent=Sad;
KM034562	demo	exon	5000	6000	.	+	.	Parent=Sad,Mad,Bad;
KM034562	demo	exon	7000	8000	.	+	.	Parent=Sad,Mad;
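
#
# Note: strictly valid GFF3 also expects the parent features themselves
# to be defined with ID attributes, for example:
#
# KM034562	demo	mRNA	1000	8000	.	+	.	ID=Sad
# KM034562	demo	mRNA	1000	8000	.	+	.	ID=Mad
# KM034562	demo	mRNA	1000	6000	.	+	.	ID=Bad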

#
# BED representation
#
KM034562	999	8000	Sad	0	+	999	8000	.	4	1000,1000,1000,1000	0,2000,4000,6000
KM034562	999	8000	Mad	0	+	999	8000	.	3	1000,1000,1000	0,4000,6000
KM034562	999	6000	Bad	0	+	999	6000	.	2	1000,1000	0,4000

Lecture 25 - Introduction to Bedtools

#
# Install bedtools
#
cd ~/src
curl -OL https://github.com/arq5x/bedtools2/releases/download/v2.22.0/bedtools-2.22.0.tar.gz
tar zxvf bedtools-2.22.0.tar.gz
cd bedtools2
make
ln -sf ~/src/bedtools2/bin/bedtools ~/bin/bedtools

# Demo bed file (save it as demo.bed)
KM034562	100	200	one	0	+
KM034562	400	500	two	0	-

# Our genome file (save it as genome.txt)
KM034562	18959
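
# One way to create these two files from the command line (or just use a text editor):
printf 'KM034562\t100\t200\tone\t0\t+\nKM034562\t400\t500\ttwo\t0\t-\n' > demo.bed
printf 'KM034562\t18959\n' > genome.txt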

# Unstranded operation.
bedtools slop -i demo.bed -g genome.txt -l 10 -r 0

# Strand aware operation
bedtools slop -i demo.bed -g genome.txt -l 10 -r 0 -s

# Transform BED coordinates to GFF
# This is not really a correct BED to GFF transformation
# (BED is 0-based, half-open; GFF is 1-based, inclusive, hence the $start+1).
cat demo.bed | bioawk -c bed -v OFS='\t' '{print $chrom, ".", ".", $start+1, $end, $score, $strand, ".", "." }' > demo.gff

# Look MA! It works seamlessly with other formats! How is that possible? Magic.
bedtools slop -i demo.gff -g genome.txt -l 10 -r 0 -s

# Flank operation. Redirects the output into a file.
bedtools flank -i demo.gff -g genome.txt -l 10 -r 0 -s > flank.gff

# Complement operation.
bedtools complement -i demo.gff -g genome.txt > complement.gff

#
# Get the Ebola genome.
#
efetch -id KM034562 -format gb -db nucleotide > KM034562.gb
readseq -format=FASTA -o ~/refs/852/KM034562.fa KM034562.gb
readseq -format=GFF -o KM034562.gff KM034562.gb

# Reminding ourselves of what formats the tool can take.
bioawk -c help

# Get the genes from the entire file.
cat KM034562.gff | bioawk -c gff ' $feature=="gene" { print $0 } ' > genes.gff

# Create flanking intervals upstream of each gene (10 bases, strand-aware).
bedtools flank -i genes.gff -g genome.txt -l 10 -r 0 -s > flank.gff

# Extract the sequences that correspond to the intervals in genes.gff
bedtools getfasta -fi ~/refs/852/KM034562.fa -bed genes.gff -fo genes.fa

# Look at the resulting file.
head genes.fa

Lecture 26 - Bedtools Tutorial

#
# Go to lecture 23 where we computed alignments of the Ebola short read data
#
cd ~/edu/lec23

#
# Create an interval file called region.bed
#
# KM034562	1000	2000
#
# We can use this interval file to intersect the data with.
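
# One way to create region.bed from the command line (or just use a text editor):
printf 'KM034562\t1000\t2000\n' > region.bed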

bedtools intersect -a bam/SRR1553593.bam -b region.bed > region.bam
samtools index region.bam

# We can also output the overlapping alignments in BED format.
# (write to a new file; region.bed is still needed as the input)
bedtools intersect -a bam/SRR1553593.bam -b region.bed -bed > overlap.bed

# Or alternatively we can turn the bamfile into a bedfile
bedtools bamtobed -i bam/SRR1553595.bam > SRR1553595.bed

#
# We will follow the bedtools tutorial from
#
# http://quinlanlab.org/tutorials/cshl2014/bedtools.html
#
mkdir ~/edu/lec26
cd ~/edu/lec26

#
# Get all the data then unpack it
#
curl -O http://quinlanlab.cs.virginia.edu/cshl2013/maurano.dnaseI.tgz
curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed
curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed
curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed
curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt
curl -O http://quinlanlab.cs.virginia.edu/cshl2013/hesc.chromHmm.bed

#
# This dataset contains more data.
# Fetal tissue samples from the brain, heart, intestine, kidney, lung, muscle, skin, and stomach.
#
tar -zxvf maurano.dnaseI.tgz

# We will follow the tutorial in class.

# A few tips for the homework.

#
# Select the promoters from the annotation file
#
cat hesc.chromHmm.bed | grep  Promoter > promoters.bed

#
# Find the closest promoter to each exon
#
# the -d flag adds a column that reports the distance to the closest feature
#
bedtools closest -a exons.bed -b promoters.bed  -d | head -1

#
# Find the CPG island that has the most exons covering it
#
bedtools intersect -a cpg.bed -b exons.bed -c | sort -k5,5nr | head -1

#
# Cover the human genome with windows of 50kb in size
#
bedtools makewindows -g genome.txt -w 50000 > windows.bed
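
# The windows can then be combined with other interval operations, for example
# counting how many exons fall into each window (a sketch using the tutorial data):
bedtools intersect -a windows.bed -b exons.bed -c | sort -k4,4nr | head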

Lecture 27 - Introduction to RNA-Seq

#
# Install TopHat
#
cd ~/src

# Mac OSX
curl -OL http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.OSX_x86_64.tar.gz

# Linux
# curl -OL http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz
#

# Unpack and install it
tar zxvf tophat-2.0.13.OSX_x86_64.tar.gz
ln -s ~/src/tophat-2.0.13.OSX_x86_64/tophat ~/bin

#
# Following the RNA-Seq bootcamp by Stephen Turner
#
# http://bioconnector.github.io/workshops/lessons/rnaseq-1day/#setup
#

#
# Download genomic data information from Ensembl
#
# Make a directory for the human genome
mkdir -p ~/refs/hg38

# Get the genome sequence
curl -L ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz > ~/refs/hg38/chr4.fa.gz

# Get the annotations
curl -L ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz > ~/refs/hg38/hg38.gtf.gz

# Unzip the files in the reference
gunzip ~/refs/hg38/*.gz

# We can build a simpler GTF file that only contains the data for chromosome 4
cat ~/refs/hg38/hg38.gtf | awk ' $1=="4" { print $0 } ' > chr4.gtf

# You can also extract parts of the annotation file, say exons.
cat chr4.gtf | awk '$3=="exon" { print $0 } ' > exons.gtf

# Build the bowtie2 reference
bowtie2-build ~/refs/hg38/chr4.fa  ~/refs/hg38/chr4

# Build the bwa reference
bwa index ~/refs/hg38/chr4.fa

# Get and unpack the data
curl -OL http://www.personal.psu.edu/iua1/courses/files/2014/rnaseq-data.tar.gz
tar zxvf rnaseq-data.tar.gz

#
# Let's align one dataset first
#
#
# Shortcut to convert and sort a SAM file into a BAM  file.
#
alias tobam='samtools view -b - | samtools sort -o - tmp '

time bwa mem ~/refs/hg38/chr4.fa  data/ctl1.fastq.gz | tobam > ctl1.bwa.bam
samtools index ctl1.bwa.bam

#
# What is the highest covered region
#
bedtools coverage -abam ctl1.bwa.bam -b chr4.gtf | more

# Put it into a file
bedtools coverage -abam ctl1.bwa.bam -b chr4.gtf > coverages.txt

#
# Sort by column 10 (the column that contains the number of reads mapped)
# Tab delimited files need to get the construct below to work
# otherwise sort will split on any space.
#
cat coverages.txt | sort -t$'\t' -rn -k10,10 | head

#
# Load your data into IGV
# Find gene RPS3A
#

#
# Let's align the same data with a splice aware aligner: TopHat
#
tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz

#
# Index the results
#
samtools index tophat_out/accepted_hits.bam
#
# Investigate the results.
# with IGV

Lecture 28 - Running the Tuxedo suite

#
# Get and install cuffdiff and cufflinks
#
cd ~/src

#
# Download and install the Mac version
#
curl -OL http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.1.1.OSX_x86_64.tar.gz
tar zxvf cufflinks-2.1.1.OSX_x86_64.tar.gz
ln -fs ~/src/cufflinks-2.1.1.OSX_x86_64/cufflinks ~/bin
ln -fs ~/src/cufflinks-2.1.1.OSX_x86_64/cuffdiff ~/bin

#
# On Linux download a different version. Note the change in the name of directory.
# Adjust the commands above accordingly.
#
# curl -OL http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.1.1.Linux_x86_64.tar.gz
#


# Check that it works.
cuffdiff

#
# We need to generate alignments with tophat then quantify these with cuffdiff
#
# We assume the genome has been downloaded and indexed as shown in lecture 27
#
curl -OL http://www.personal.psu.edu/iua1/courses/files/2014/rnaseq-data.tar.gz
tar zxvf rnaseq-data.tar.gz
rm rnaseq-data.tar.gz

# Get this only if you did not already
# curl -L ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz > ~/refs/hg38/hg38.gtf.gz
#gunzip ~/refs/hg38/*.gz

cat ~/refs/hg38/hg38.gtf | awk ' $1=="4" { print $0 } ' > chr4.gtf

#
# Run tophat on each of the files
#
tophat -G chr4.gtf -o data/ctl1-tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz

#
# Copy the resulting file to the bam directory under a different name
#
cp data/ctl1-tophat/accepted_hits.bam bam/ctl1.bam
samtools index bam/ctl1.bam

#
# Now repeat each step.
# You could create a script file, either one that lists all commands or
# a more automatic one (see run-tophat.sh below).
#
# bash run-tophat.sh

#
# Now let's look for differential expression with cuffdiff
#
cuffdiff

#
# Run cuffdiff and place run into the bam/cuffdiff_out folder
#
cd bam
cuffdiff -o cuffdiff_out chr4.gtf ctl1.bam,ctl2.bam,ctl3.bam uvb1.bam,uvb2.bam,uvb3.bam

#
# Look at the files that cuffdiff produced
#
cd cuffdiff_out
ls

#
# Open the gene_exp.diff file, sort by the 10th column, and show only a few columns
#
cat gene_exp.diff | grep yes | sort -k10n | cut -f 3,8,9,10 | head

Tophat alignment script: run-tophat.sh

# Runs tophat on each data file

# Stop on errors.
set -ue

REF=~/refs/hg38/chr4

# This folder will contain the alignments
mkdir -p bam

for name in ctl1 ctl2 ctl3 uvb1 uvb2 uvb3
do
	# Create the fastq file name.
	FASTQ=data/$name.fastq.gz

	# Output directory name.
	OUTPUT=data/$name-tophat

	# Run tophat and put the results in OUTPUT directory.
	tophat -G chr4.gtf -o $OUTPUT $REF $FASTQ

	# Move the resulting accepted_hits.bam file to a different name.
	mv -f $OUTPUT/accepted_hits.bam bam/$name.bam

done;

Lecture 29 - Comparing feature counting software

#
# Download and install eXpress
#
cd ~/src
curl -OL http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-macosx_x86_64.tgz
tar zxvf express-1.5.1-macosx_x86_64.tgz
ln -fs ~/src/express-1.5.1-macosx_x86_64/express ~/bin

#
# For Linux
#
# curl -OL http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz

# Check that the eXpress program works.
express

#
# Install the subread package
#
cd ~/src
curl -OL http://downloads.sourceforge.net/project/subread/subread-1.4.6/subread-1.4.6-source.tar.gz
tar zxvf subread-1.4.6-source.tar.gz
cd subread-1.4.6-source/src
make -f Makefile.MacOS
ln -fs ~/src/subread-1.4.6-source/bin/featureCounts ~/bin
ln -fs ~/src/subread-1.4.6-source/bin/subread-align ~/bin
ln -fs ~/src/subread-1.4.6-source/bin/subread-buildindex ~/bin

#
# On Linux
#
# make -f Makefile.Linux

# Check that the featureCounts program works
featureCounts
subread-align

#
# Data preparation.
#
# Get the data if you don't already have it
curl -OL http://www.personal.psu.edu/iua1/courses/files/2014/rnaseq-data.tar.gz
tar zxvf rnaseq-data.tar.gz
rm -f rnaseq-data.tar.gz

# Get the GTF file if you don't already have it
curl -L ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz | gunzip -c > ~/refs/hg38/hg38.gtf

# Make the GTF file for just chromosome 4
# since we know our reads only align to this chromosome.
cat ~/refs/hg38/hg38.gtf | awk ' $1=="4" { print $0 } ' > chr4.gtf


#
# Quantify with TopHat
# --------------------
tophat -G chr4.gtf -o data/ctl1-tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz
tophat -G chr4.gtf -o data/ctl2-tophat ~/refs/hg38/chr4 data/ctl2.fastq.gz

# Compare two control samples ... see what happens. Should anything be differentially expressed?
cuffdiff -o diff chr4.gtf data/ctl1-tophat/accepted_hits.bam data/ctl2-tophat/accepted_hits.bam


# Quantify with eXpress
# ---------------------

#
# eXpress quantifies reads that were aligned against transcript sequences.
# We can download the transcript data from a public source
# or generate it from the known annotations.
#
# There is a GFF to FASTA converter (gffread) packaged with Cufflinks.
# This will make use of the reference FASTA and the chr4.gtf file
#
#
ln -sf ~/src/cufflinks-2.1.1.OSX_x86_64/gffread ~/bin

# Run it to see the help
gffread -h

#
# Extract sequences for the exonic coordinates.
#
gffread chr4.gtf -t exon -g ~/refs/hg38/chr4.fa -w ~/refs/hg38/chr4-transcripts.fa
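
# How many transcript sequences did we extract?
grep -c ">" ~/refs/hg38/chr4-transcripts.fa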

#
# We could also get the transcripts directly from Ensembl.
#
# curl -OL ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
# gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
#

#
# eXpress needs an alignment file to quantify.
# The alignment file needs to be sorted by read name.
# Create an alignment file via bwa. You could use other aligners as well.
#
mkdir -p bwa
bwa index ~/refs/hg38/chr4-transcripts.fa

#
# Note that this BAM file will not be sorted by position.
# It cannot be visualized in IGV!
#
bwa mem ~/refs/hg38/chr4-transcripts.fa data/ctl1.fastq.gz | samtools view -b - > bwa/ctl1.bam

#
# Quantify with express
#
express ~/refs/hg38/chr4-transcripts.fa bwa/ctl1.bam -o bwa

#
# Quantify with SubRead
# ---------------------
#
# For this we need alignments against genome!
#

#
# Create an alignment with tophat
#
tophat -G chr4.gtf -o data/ctl1-tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz

#
# Count features with featureCounts
#
mkdir -p sub
featureCounts -a chr4.gtf -g transcript_id data/ctl1-tophat/accepted_hits.bam  -o sub/ctl1.tophat.txt

#
# Let's compare the outputs
#

# The counts produced by featureCounts with TopHat
cat sub/ctl1.tophat.txt | sort -rn -k 7,7 | cut -f 1,6,7 | head | cat -n

# The counts produced by eXpress on the bwa alignment
cat bwa/results.xprs | sort -rn -k5,5 | cut -f 2,3,5 | head | cat -n

# To get FPKM from the featureCounts output you need to divide the count by the
# feature length and by the library size (60683 reads), then multiply by 10^9:
cat sub/ctl1.tophat.txt | grep "ENST" |  awk '{ print $1, $7/$6/60683 * 10^9 } ' | sort -rn -k2 | head

Lecture 30 - Gene Ontology

#
# Gene Ontologies. Making sense of data.
#

# Work on the results of TopHat.

# What columns do we have?

cat gene_exp.diff | head -1

# We have options to sort by various columns:
# largest or smallest fold change, coverage in sample 1 or sample 2, etc.

# We typically want to compile a list of gene names with p-values.

# Sometimes you need to separate genes by a certain property (upregulated, downregulated)
# and perform enrichment on each gene set separately.

# Genes with a positive fold change.
cat gene_exp.diff |  grep yes | awk '$10 > 0 { print $0 }' | cut -f 3,13 > positive.txt

# Genes with a negative fold change.
cat gene_exp.diff |  grep yes | awk '$10 < 0 { print $0 }' | cut -f 3,13 > negative.txt

# Other tools need all the genes because they compute
# the background distribution from all gene values. In that case
# we cut column 2 (the Ensembl ID).
cat gene_exp.diff | cut -f 2,13 > all-ensemble.txt

# Take these lists and paste them into GO enrichment tools.


# Install ErmineJ

cd ~/src

curl -OL http://www.chibi.ubc.ca/ermineJ/distributions/ermineJ-3.0.2-generic-bundle.zip
unzip ermineJ-3.0.2-generic-bundle.zip
ln -sf ~/src/ermineJ-3.0.2/bin/ermineJ.sh ~/bin

# Seems that we need to make it executable.
chmod +x ~/bin/ermineJ.sh

# Run it with graphical user interface.
ermineJ.sh --gui



Ubuntu Linux on a Chromebook

It is possible to install Linux on a low-cost Chromebook ($239) and thus bypass the need to boot from a USB drive, dual boot, or virtualize Linux. This is what you need:

  1. Get an Intel processor-based Chromebook. Choose a model that is known to run Linux. After some research I chose the Acer C720 Chromebook from Amazon, the 32GB storage version.
  2. Linux can be enabled via a tool named Crouton (Chromium OS Universal Chroot Environment) as described at https://github.com/dnschneid/crouton. The installation is an extremely simple three-step process.
  3. Apply the software setup commands listed below.

Linux setup commands. Check back for changes as we proceed with the course.

# Install a number of required libraries.
sudo apt-get install -y curl build-essential ncurses-dev byacc zlib1g-dev python-dev git cmake

# Install java and python libraries.
sudo apt-get install -y unzip default-jre python-pip

# Install the required perl modules.
sudo perl -MCPAN -e 'install "LWP::Simple"'
sudo perl -MCPAN -e 'install "HTML::Entities"'

Success!

ChromeBook Screenshot

Bash Customization

To customize your bash environment edit the file named `.profile` (on a Mac) or `.bashrc` (on Linux). Make sure tab-completion works. Never type full filenames; instead type them partially, then tap the TAB key. If nothing seems to happen, double-tap TAB to see all the options.

# Make the prompt show the user, host and current working directory.
export PS1='\[\e]0;\w\a\]\n\[\e[32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '

# Aliases are a way to invoke more complex commands under a short name.
# On Linux use: alias ls='ls -h --color'
alias ls='ls -hG'

# Safer versions of the default commands.
alias rm='rm -i'
alias mv='mv -i'
alias cp='cp -i'

# Extend the path.
export PATH=~/bin/:$PATH


Created by Istvan Albert • Last updated on Thursday, December 11, 2014 • Site powered by PyBlue