Code Repository 2014

Code Repository for BMMB 852, Fall 2014

This page is a collection of most of the examples shown during lectures.

Lecture 1 - Basic Command Line

# Lines that start with the # sign are comments that won't be executed!

# What folder are we in?
pwd

# Make a directory
mkdir work

# See the directory
ls

# See more details
ls -l

# Change to the work directory
cd work

# Where am I?
pwd

# Let's go back one level
cd ..

# Where am I?
pwd

# Remove the directory
rmdir work

# Create a file, then list it.
touch abc.txt
ls

# Rename (move the file)
mv abc.txt xyz.txt

# How does the mv command work?
man mv

# Learn about cp, ls, mv, rm, mkdir, rmdir
# via the manual pages
man ls

Lecture 2 - Simple Data Manipulations

# Lines that start with the # sign are comments that won't be executed!

# Get the data from SGD.
# Download the URL with curl and store it in the sc.gff file.
curl -o sc.gff

# Or if the site seems slow get the data from the course site. But then ask yourself
# isn't it odd when a class website is faster than the official data repository? Way to go SGD!
# curl -o sc.gff

# How many lines does the file have?
wc -l sc.gff

# Page through the file with more (or less)
# SPACE and f move forward, b moves backward, q quits, h shows help, / lets you search
more sc.gff

# Let's look at the head of the file.
head sc.gff

# And the end of the file.
tail sc.gff

# Find simple patterns in the file
grep YAL060W sc.gff

# We can place the results of the matching into a new file.
grep YAL060W sc.gff > match.gff

# Look at your files.
ls

# Paginate the match file
head match.gff

# Piping (chaining) commands together.
# How many lines match the word gene?
grep gene sc.gff | wc -l

# How many lines match both the word gene and the word chrVI (chromosome six)?
grep gene sc.gff | grep chrVI | wc -l

# How many lines of the above DO NOT match the word Dubious?
grep gene sc.gff | grep chrVI | grep -v Dubious | wc -l

# This file is a bit odd. It contains two sections, a tabular
# section with 9 tab delimited columns and a sequence section
# that lists the entire genome.

# Break up the file into two parts. Put all lines that occur before
# the genome  information into features.gff.
# Our spies from behind enemy lines have reported that the word FASTA
# in the file indicates that the genome information is about to start
cat -n sc.gff | head

# Find the position of the word FASTA in the file. The -n flag adds line numbers.
cat -n sc.gff | grep FASTA

# Keep only the lines before that position and redirect them into a new file
# that contains only the features. Also remove the comment lines from the file.
head -22994 sc.gff | grep -v '#' > features.gff
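# A hedged alternative: derive the line number instead of hardcoding 22994.
# grep -n prefixes each match with its line number; assuming the FASTA marker
# line itself contains a # character, the grep -v below removes it anyway.
head -$(grep -n FASTA sc.gff | cut -f 1 -d ':') sc.gff | grep -v '#' > features.gff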

# We did all this work to have simple tabular formatted data that only
# contains columnar information. So how many features are there?
wc -l features.gff

# Take the first three columns.
cut -f 1,2,3 features.gff | head

# Find unique words in column 3 of GFF.
# The uniq command collapses consecutive identical words into one.
cut -f 3 features.gff | sort | uniq | head

# It can also count how many identical consecutive words are there.
cut -f 3 features.gff | sort | uniq -c | head
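# To rank the feature types by how often they occur,
# sort the counts numerically, largest first (-rn reverses a numeric sort).
cut -f 3 features.gff | sort | uniq -c | sort -rn | head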

Lecture 3 - Installing and Using Entrez Direct

# Installing and running the Entrez Direct suite

# Go to your home directory.
cd

# Equivalent, since ~/ represents your home directory.
cd ~/

# Create an src directory that will contain the
# programs we will install during this lecture.
mkdir ~/src

# go to the src directory
cd ~/src

# Get the entrez direct toolkit, the capital -O flag instructs curl
# to figure out the filename from the url.
curl -O

# The above is equivalent to:
curl -o

# Unzip the tools.

# Investigate the new tool.
cd edirect

# To make the system aware of where your new tools are,
# you need to add their location to your environment.
# There are different ways to do this. To add all the
# tools at once, append the directory to the PATH variable.
export PATH=$PATH:~/src/edirect
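# Verify that the new directory now shows up at the end of the PATH.
echo $PATH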

# Create the work folder for the current lecture.
mkdir -p ~/edu/lec3
cd  ~/edu/lec3

# Run einfo
einfo -help

# Fetch descriptions then look at them.
einfo -dbs > einfo-dbs.txt
more einfo-dbs.txt

einfo -db sra > einfo-sra.txt
more einfo-sra.txt

# Run an esearch.
esearch -help
esearch -db nucleotide -query PRJNA257197
esearch -db sra -query PRJNA257197
esearch -db protein -query PRJNA257197

# Fetch the data for nucleotides.
esearch -db nucleotide -query PRJNA257197 | efetch -format fasta > ebola.fasta

# How many sequences in the file
cat ebola.fasta | grep ">" | wc -l
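# Equivalently, grep -c counts the matching lines directly.
grep -c ">" ebola.fasta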

# Get the data in genbank format.
esearch -db nucleotide -query PRJNA257197 | efetch -format gb >

Lecture 4 - Installing and Using the SRA toolkit

# Go to your source directory.
cd ~/src

# Download the SRA toolkit (make sure to put the link that is relevant to your platform)
# Mac OSX.
curl -O

# Linux.
curl -O

# Unpack the archive.
tar xzvf sratoolkit.2.3.5-2-mac64.tar.gz

# Switch to the directory and look at the files there
cd sratoolkit.2.3.5-2-mac64

# Programs are collected in the bin directory.
cd bin

# See what is there.
ls
# Let's add these paths permanently to our PATH.

# There is a special file that is read when your shell starts. It applies
# the settings to each terminal that you start. The file is called ~/.profile
# on a Mac and ~/.bashrc on linux. The >> symbol appends to a file rather than
# overwriting it. You may also edit this file with a text editor.

# On the Mac:
echo export PATH=$PATH:~/src/edirect:~/src/sratoolkit.2.3.5-2-mac64/bin >> ~/.profile

# On Linux:
echo export PATH=$PATH:~/src/edirect:~/src/sratoolkit.2.3.5-2-mac64/bin >> ~/.bashrc

# You need to open a new terminal to apply the settings or do:
source ~/.profile

# Go to the lecture folder. What does the -p flag do? Look at the manual page.
mkdir -p ~/edu/lec4
cd ~/edu/lec4

# Where is the prefetch command?
which prefetch

# The prefetch command downloads files from the remote site.
# Look at the help for the command.
prefetch -h

# Now let's obtain the file.
prefetch SRR1553610

# Where did the file go? There is a default folder in your home directory.
ls ~/ncbi

# What files have been added there? Use the find tool.
find ~/ncbi

# We unpack the file with the fastq-dump program.
fastq-dump -h

# Unpack the file
fastq-dump --split-files SRR1553610

# The raw data files in FASTQ format are in the current folder.
ls

# Pattern matching in the shell. The * (star) indicates match anything.
wc -l *.fastq

# look at the file.
head SRR1553610_1.fastq

cat *.fastq | grep @SRR | wc -l
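# A sturdier count (a sketch): the @ character may also appear in quality
# strings, but every FASTQ record is exactly four lines (assuming unwrapped
# reads), so dividing the total line count by four gives the number of reads.
echo $(( $(cat *.fastq | wc -l) / 4 ))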

# How to download multiple files? Create a file that contains SRR runs.
echo SRR1553608 > sra.ids
echo SRR1553605 >> sra.ids

# Use this file to prefetch the runs.
prefetch --option-file sra.ids

# Unpack all the downloaded files. Note how this is not quite right since
# there may be more prefetched files than what we had in sra.ids
fastq-dump --split-files ~/ncbi/public/sra/SRR15536*

# The right solution is to unpack only the files listed in sra.ids. Alas,
# that is not directly supported by fastq-dump. Grr!

# The right solution will take a little more string processing at the command line.
# This will use the sed (stream editor) tool to turn each SRR id into
# the command that extracts that file.
cat sra.ids | sed 's/SRR/fastq-dump --split-files SRR/'

# Now pipe the output into a bash process.
cat sra.ids | sed 's/SRR/fastq-dump --split-files SRR/' | bash
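# An equivalent approach (one of several): let xargs run fastq-dump
# once for each id read from the file.
cat sra.ids | xargs -n 1 fastq-dump --split-files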

# We can do still better. Why copy all those SRR ids by hand? We could get
# them all from the command line.
# There seems to be a bug in the efetch program. It produces the data but
# then also prints an error at the end! Why? Probably a bug! Grr! x 2
esearch -db sra -query PRJNA257197  | efetch -format runinfo

# The command to put the results into a file.
esearch -db sra -query PRJNA257197  | efetch -format runinfo > runinfo.txt

# Also, since it is a comma-separated file, we need to specify the
# delimiter (column separator) for the cut program.
cat runinfo.txt | cut -f 1 -d ","

# The ids all seem to be there: the count is 195, as reported by the search.
# Conundrum: do we trust this or not? We could get the file from the BioProject site
# as well by selecting Send -> RunInfo but you have to be in the right place for it
# to work.
cat runinfo.txt | cut -f 1 -d ',' | grep SRR | wc -l

# We'll trust it but you may have a nagging feeling that
# you might have made a mistake. Welcome to the club!
cat runinfo.txt | cut -f 1 -d ',' | grep SRR > sra.ids

# Warning! This will now download a lot of files (195 to be precise).
# Depending on the connection speed may take hours.
prefetch --option-file sra.ids

# Check the total size of the download (about 15Gb)
du -hs ~/ncbi

# Let's convert them all
cat sra.ids | sed 's/SRR/fastq-dump --split-files SRR/' | bash

# How many lines in total in all the files?
wc -l *.fastq

# And with that we got all the sequencing data for the entire PRJNA257197 project.

Lecture 5 - Writing reusable scripts. Fetching subsequences.

# Get a sequence by accession number.
efetch -db nucleotide -id KM233090 -format fasta > KM233090.fa

# The same sequence is also available by GI number.
efetch -db nucleotide -id 667853062 -format fasta > 667853062.fa

# One can fetch multiple ids at the same time
efetch -db nucleotide -id KM233090,KM233066,KM233113.1 -format fasta > multi.fa

# How many lines in the file
wc -l multi.fa

# Count the header lines.
cat multi.fa | grep ">" | wc -l

# See the header lines.
cat multi.fa | grep ">"

# We can also fetch a subset from the sequences.
efetch -db nucleotide -id KM233090,KM233066,KM233113.1 -format fasta -seq_start 1 -seq_stop 10 > multi.fa

# Let's get the start of all ebola genomic builds.
esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta -seq_start 1 -seq_stop 10 > starts.fa

# This is too much to type each time. Time to solve it
# at a higher level and start writing shell scripts.
# A shell script is a collection of commands that we can
# repeatedly execute.
# Put the line below into a file; the usual extension is .sh
# esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta -seq_start 1 -seq_stop 10

# You can now run this entire command from a single script.
bash > starts.fa

# What would be nice is to be able to specify some parts of this script from the outside.
# The dollar-sign variables ($1 and $2) will be substituted with the parameters
# given on the command line. See the included scripts below.
bash 1 10 > starts.fa
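# As a small illustration of how $1 and $2 get substituted
# (demo.sh is a hypothetical name used only for this example):
echo 'echo "start=$1 stop=$2"' > demo.sh
bash demo.sh 1 10
# prints: start=1 stop=10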

# Now extract just the sequences from this file.
# the -A 1 flag prints the match and the one line after the match.
# The -v flag prints lines that do not match the pattern.
# Never type these in directly. Type one statement, check the results, type a second one and so on.
cat starts.fa | grep ">" -A 1 | grep -v ">" | grep -v "-"

The first version of our script:

# Fetches nucleotide subsequences from all sequences
# deposited in the PRJNA257197 project.
# Expects two parameters start and stop.

# This setting makes the script stop on errors and on undefined variables.
set -ue

# Access NCBI via Entrez Direct.
esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta -seq_start $1 -seq_stop $2

A more advanced version of our script. As you add functions it becomes easier to build different commands:

# Fetches nucleotide subsequences from all sequences
# deposited in the PRJNA257197 project.
# Expects two parameters start and stop.

# This setting makes the script stop on errors and on undefined variables.
set -ue

function esearch_ebola() {
    # Search the nucleotide sequences for ebola.
    esearch -db nucleotide -query PRJNA257197
}

function efetch_limits() {
    # You can make it more explicit where variables get assigned to.
    # For more complex command lines this may help.
    START=$1
    END=$2
    # Run the efetch.
    efetch -db nucleotide -format fasta -seq_start $START -seq_stop $END
}

# Chain the two commands to fetch.
# Note how parameters are passed to functions.
esearch_ebola | efetch_limits $1 $2

Lecture 6 - Encodings and the FASTQ format.

# This command executes the string following the -c using python.
python -c 'print "Hello World"'

# The ordinal (integer code) of a character.
python -c 'print ord("A")'

# The character (letter code) of an integer.
python -c "print chr(65)"

# Find the quality value of a character in the Sanger (+33) encoding.
python -c 'print ord("I") - 33'

# Find the quality value of a character in the Illumina 1.5 (+64) encoding.
python -c 'print ord("I") - 64'

# Find the quality character of the value 40 in the Sanger (+33) encoding.
python -c 'print chr(40 + 33)'

# Find the quality character of the value 40 in the Illumina 1.5 (+64) encoding.
python -c 'print chr(40 + 64)'
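# To see the whole Sanger-encoded scale at once (a small illustration),
# print the characters for quality values 0 through 40.
python -c 'print "".join(chr(i + 33) for i in range(41))'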

# Get a test dataset from the course website.
curl -O http://localhost:8080/courses/files/2014/illumina-data.fq.gz

# You can unzip this with gunzip illumina-data.fq.gz or keep it compressed
# and open it with gzcat (Mac OSX) or zcat (Linux). When you have
# lots of datasets it may be worth saving the space, but reading the files will be
# a little slower.

time gzcat illumina-data.fq.gz > tmp

# Unzip the data while keeping the compressed version.
# Without the -c flag you would not need to redirect the output and
# it would unzip in place.
gunzip -c illumina-data.fq.gz > illumina-data.fq
time cat illumina-data.fq > tmp

# Show one FASTQ record (Mac OSX)
gzcat illumina-data.fq.gz | head -4

# Show one FASTQ record (Linux)
zcat illumina-data.fq.gz | head -4

# Unpack an SRA dataset: raw data from the instrument.
# This generates two files.
fastq-dump --split-files SRR1553605

# Show a single fastq record.
cat SRR1553605_1.fastq | head -4

# "Quick and dirty" quality checks, are there
# stretches of ### characters in the first 10,000 lines?

gzcat illumina-data.fq.gz | head -10000 | grep '###' |  wc -l
cat SRR1553605_1.fastq | head -10000 | grep '###' | wc -l

# Which one looks better?
# It is always a bit risky to look at first/last lines since
# the order of data may not be random. Typically we want to
# get a random sample (we'll see how to do this later)

# We can look for and color patterns in the data.
# Are there start codons? Is the pattern TATA present?
gzcat illumina-data.fq.gz | head -10000 | grep 'ATG' --color=always | head

gzcat illumina-data.fq.gz | head -10000 | grep 'TATA' --color=always | head

# Are there sequences that look like the Illumina adapter: GATCGGA?
gzcat illumina-data.fq.gz | head -10000 | grep 'GATCGGA' --color=always | head

Lecture 7 - Quality control with FastQC. Interpreting the outputs.

# Compressing and decompressing files
# Get an SRA file that we will use later in this exercise.

# Use prefetch if you want to keep the SRA files.
# Otherwise you can use fastq-dump directly.
# In class I have added the -X 10000 flag to limit the number of spots so that we don't have
# to wait that long.
fastq-dump --split-files SRR519926

# See what you got.
ls -l

# Compress the files. You can also time any process by adding the word time
# in front of the command. This is a good command to test overall
# computer performance as it requires both fast disk access and high performance CPU.
# Here are some timings
# - Grit: desktop iMac: 29 seconds (4 core medium speed CPU, 64GB  RAM, medium speed disk)
# - Apollo: high performance linux server at the BCC: 38 seconds (32 cores, fast CPUs, 256GB RAM, but it is also running lots of other commands)
# - Peppy: Chromebook with virtualized Linux via Crouton: 49 seconds (slow 2 core CPU but very fast solid state disk)
# - Airy: MacBook Air: 53 seconds (the CPU seems to be slower than that in the ChromeBook)

# What is your timing?
time gzip *.fastq

# See how gzip above replaced each file with its compressed version.

# Test what the unzipping times are.
# - Airy: 2.1 seconds
# - Grit: 2.8 seconds
# - Apollo: 3.6 seconds
# - Peppy: 6.5 seconds
time gunzip *.gz

# Package both files into a single archive
tar cf mydata.tar *.fastq
gzip mydata.tar

# The two commands can be combined into one
tar cfzv mydata.tar.gz *.fastq

# Remove the fastq files
rm -f *.fastq
ls -l

# Unpacking the archive extracts the files.
tar xzvf mydata.tar.gz

# You can archive entire directory trees. Pack up the entire lecture 7 folder.
cd ..
tar cfzv lec7.tar.gz lec7

# List what is inside of the archive with the t action.
tar tf lec7.tar.gz

# Uncompress the archive at a different location.
mkdir ~/edu/tmp
mv lec7.tar.gz ~/edu/tmp
cd ~/edu/tmp
tar xzvf lec7.tar.gz

# Here comes a tarbomb. Handle with care!
curl -O

# Unpack and enjoy! Always check your tarfiles or put them into a separate directory.

# Quality control.
# Download and install FastQC into the src directory.
cd ~/src
curl -O

# Linux people may need to install unzip first: sudo apt-get install unzip
cd FastQC
ls -l

# There is an error in the distribution. The executable flag is not set.
chmod +x fastqc

# We want fastqc to be globally available. We have two ways to do this. We could
# add the program's directory to the PATH (as before), but that gets old when you
# have to do it for every single program and reload the PATH each time.
# Alternatively we could add one directory, say ~/bin, to the PATH, then create
# links (shortcuts) in this directory to the programs that we want to be available globally.

# Create the ~/bin directory
mkdir -p ~/bin

# Add the ~/bin folder into the PATH

# On a Mac:
echo 'export PATH=~/bin:$PATH' >> ~/.profile
# to apply to the current shell not just future ones:
source ~/.profile

# Under Linux do the following:
# echo 'export PATH=~/bin:$PATH' >> ~/.bashrc
# source ~/.bashrc

# Link fastqc as a "shortcut" in ~/bin
ln -s ~/src/FastQC/fastqc ~/bin/fastqc

# Test that the tool works.
# Linux people may need to install java.
# See the linux install guide:
fastqc -h

# Run a fastqc report on all files in the directory.
cd ~/edu/lec7
fastqc *.fastq

# This creates *.html files. Open and investigate the outputs.

Lecture 8 - Quality control: base quality trimming.

# Install prinseq, trimmomatic and seqtk

# Download and unpack the software in your src directory
cd ~/src

# Install prinseq
# We need to also pass the -L flag since this site uses link redirects and
# we want to follow those
curl -OL

# Unpack the archive.
tar zxvf prinseq-lite-0.20.4.tar.gz

# Go into the archive and read the instructions
# located in the README file. Let's follow what it says there.
cd prinseq-lite-0.20.4

# Make the script executable.
chmod +x prinseq-lite.pl

# Link it into you ~/bin folder.
ln -s ~/src/prinseq-lite-0.20.4/prinseq-lite.pl ~/bin/prinseq-lite

# Now you have prinseq running anywhere on your system.

# Install trimmomatic
cd ~/src
curl -O
cd Trimmomatic-0.32

# Alas, not much is there. You have to hit the manual at:
# It has a really user unfriendly way to run it. Thanks a bunch!
java -jar trimmomatic-0.32.jar

# Let's make a script that launches it for us.
# Create a script with Komodo or from the command line that contains
# the lines shown below:

# You can also do it from the command line
echo '#!/bin/bash' > ~/bin/trimmomatic
echo 'java -jar ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar $@' >> ~/bin/trimmomatic
chmod +x ~/bin/trimmomatic

# Look Ma! It works.
trimmomatic

# Now we have both prinseq and trimmomatic. Let's use them.
fastq-dump --split-files SRR519926

# Run prinseq and trim bases from the right using a quality window of 4 bases
prinseq-lite -fastq SRR519926_1.fastq -trim_qual_right 30 -trim_qual_window 4 -min_len 35 -out_good prinseq_1

# Trimmomatic uses a different set of parameters and a different philosophy. Parameters are space separated words
# where the sub-parameters are separated by colons. Basically they invented their own parameter format.
trimmomatic SE -phred33 SRR519926_1.fastq trimmomatic_1.fq SLIDINGWINDOW:4:30 MINLEN:35 TRAILING:3

# Time the execution speed for both tools. Note just how radically faster trimmomatic is.
# Depending on the operation it could be 100x faster.
# On the other hand prinseq is more than just a quality control tool, it has
# a whole slew of other functionality.
fastqc SRR519926_1.fastq prinseq_1.fq trimmomatic_1.fq

# By default FastQC creates bins in the plots and that may sometimes
# hide details.
fastqc --nogroup SRR519926_1.fastq prinseq_1.fq trimmomatic_1.fq

# To process files in batch you can make use of simple shell looping constructs.
for name in *.fastq; do echo $name; done

Create the trimmomatic file in your ~/bin folder:

#!/bin/bash
# The line above tells the shell to execute the program with bash.
# In the line below, $@ passes all command line arguments into the program.

java -jar ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar $@

Example of a simple looping construct


# This script will run the trimming on
# all fastq files in the directory
for name in *.fastq
do
	echo "*** currently processing file $name, saving to trimmed-$name"
	trimmomatic SE -phred33 $name trimmed-$name SLIDINGWINDOW:4:30 MINLEN:35 TRAILING:3
	echo "*** done"
done

# Now run the fastqc report.
# You can add any more commands here.
fastqc trimmed-*.fastq

Lecture 9 - Quality control: pattern matching and adapter trimming

# This is data from last lecture
fastq-dump --split-files SRR519926

# Note: pattern searches take place on lines; if the sequences wrap,
# other methods are needed. Most instruments produce reads on one line
# so those would work. But do remember that this may also match
# the quality string (although that is usually unlikely). A proper
# solution would first extract the sequence.

# Find ATG anchored at the start of the line
cat SRR519926_1.fastq | egrep "^ATG" --color=always | head

# Find ATG anchored at the end of the line
cat SRR519926_1.fastq | egrep "ATG\$" --color=always | head

# Find TAATA or TATTA patterns; [AT] is a character class matching A or T
cat SRR519926_1.fastq | egrep "TA[AT]TA" --color=always | head

# Find TAAATA or TACCTA; (AA|CC) matches alternative groups of words
cat SRR519926_1.fastq | egrep "TA(AA|CC)TA" --color=always | head

# Quantify matches with metacharacters
# Find TA followed by zero or more A followed by TA
cat SRR519926_1.fastq | egrep "TA(A*)TA" --color=always | head

# Find TA followed by one or more A followed by TA
cat SRR519926_1.fastq | egrep "TA(A+)TA" --color=always | head

# Find TA followed by two to five As followed by TA
cat SRR519926_1.fastq | egrep "TAA{2,5}TA" --color=always | head

# Practice RegExp matches on online regexp testers

# Match Illumina adapters at the end of the reads
# Match AGATCGG anywhere followed by any number of bases
cat SRR519926_1.fastq | egrep "AGATCGG.*" --color=always | head

# See the contaminant and adapter files for FastQC and Trimmomatic
ls ~/src/FastQC/Configuration/

# The content of this file contains a few known adapter sequences.
more ~/src/FastQC/Configuration/adapter_list.txt

# The content of this file shows known contaminants.
more ~/src/FastQC/Configuration/contaminant_list.txt

# To perform the adapter cutting we need to find a file with adapter sequences.
# You can create your own adapter or use the ones that come with Trimmomatic
# Let's create an Illumina adapter file.
echo ">adapter" > adapter.fa
echo "AGATCGGAAGAG" >> adapter.fa

# Let's do both quality and adapter trimming.
trimmomatic SE SRR519926_1.fastq trimmed.fq ILLUMINACLIP:adapter.fa:2:30:10 SLIDINGWINDOW:4:30 TRAILING:30 ILLUMINACLIP:adapter.fa:2:30:5

# Trimmomatic comes with a number of adapters that we can use.
# Depending on the content of the file different modes of operations may take place.

# This is a palindromic adapter. Trimmomatic detects the
# mode of operation from the name of the adapter.
ln -s ~/src/Trimmomatic-0.32/adapters/TruSeq3-PE.fa

# This is an extended palindromic adapter file that also lists single adapters.
ln -s ~/src/Trimmomatic-0.32/adapters/TruSeq3-PE-2.fa

# These are the classic adapters.
ln -s ~/src/Trimmomatic-0.32/adapters/TruSeq3-SE.fa

trimmomatic SE SRR519926_1.fastq trimmed.fq ILLUMINACLIP:adapter.fa:2:30:10 SLIDINGWINDOW:4:30 TRAILING:30 ILLUMINACLIP:TruSeq3-SE.fa:2:30:5

# To run it in paired end mode both files need to be filtered and clipped simultaneously.
# Command line mayhem ensues.
trimmomatic PE SRR519926_1.fastq SRR519926_2.fastq trim_1P.fq trim_1U.fq trim_2P.fq trim_2U.fq ILLUMINACLIP:TruSeq3-PE-2.fa:2:10:10 SLIDINGWINDOW:4:30 TRAILING:20

# A slightly simpler option is to provide the -baseout flag.

Lecture 10 - Alignments

# For this lecture there are no command line tools to run.
# Instead we will perform alignments via the EBI Pairwise alignment site
# The following two protein sequences are a fun example
# from the book: Understanding Bioinformatics by Marketa Zvelebil


#  Second protein.


# You will need to perform global and local alignments
# with different parameter settings.

# Now if we have time we will investigate the ebola viruses a bit.
# Remember those virus genome starts that in some cases looked so different?
# Let's get the first 100 bases from the two most different genomes and align those.
# KM034561 and KM233037
efetch -db nucleotide -id KM034561 -seq_start 1 -seq_stop 100 -format fasta > 1.fa
efetch -db nucleotide -id KM233037 -seq_start 1 -seq_stop 100 -format fasta > 2.fa

# Align the two sequences with the Needle tool. What do you see?
# Remember the CGAATAACT was the most common start sequence.

Lecture 11 - Installing and using Blast

# Get and install BLAST+
# URLs are given for a Mac; change them accordingly for Linux

# Switch to the src folder
cd ~/src

# Get the blast executable for the Mac
curl -O
tar zxvf ncbi-blast-2.2.29+-universal-macosx.tar.gz

# For Linux download a different package
# curl -O
# tar zxvf ncbi-blast-2.2.29+-x64-linux.tar.gz

# I can either link the executable or add the entire directory to
# my path. When there are lots of programs I usually add the entire
# directory.

echo 'export PATH=$PATH:~/src/ncbi-blast-2.2.29+/bin' >> ~/.profile

# Or on Linux do a
# echo 'export PATH=$PATH:~/src/ncbi-blast-2.2.29+/bin' >> ~/.bashrc

# Load the settings into the current terminal (new terminals will auto load this)
source ~/.profile

# Check that it works. Short help.
blastn -h

# Long help.
blastn -help

# Checklist:
#   1. What are we looking to find? -> query sequence
#   2. Where are we looking? -> database -> target sequence(s)
#   3. How are we looking for it? -> search type

# Let's make a blast database

# Get a genome sequence of the Ebola virus
# When we index a database it may create more files. It is
# best if we place these in a separate folder. Let's call that refs.
mkdir -p refs
efetch -db nucleotide -id KM233118 -format fasta > refs/KM233118.fa

# The reference may contain one or more Fasta records. To
# turn it into a blast database we need to transform it to a
# format the program is able to operate on.

# Long and short help for the blast database indexer.
makeblastdb -h
makeblastdb -help

# Create a nucleotide database out of the genome.
makeblastdb -in refs/KM233118.fa -dbtype nucl

# See what happened.
ls refs

# Create a query.
head -2 refs/KM233118.fa > query.fa

# Search with the query: nucleotide sequence vs nucleotide database
# Take note of the formats.
blastn -db refs/KM233118.fa -query query.fa | more

# We can change the output format. Tabular forms
# can be used when base by base alignment information is not needed.
blastn -db refs/KM233118.fa -query query.fa -outfmt 6

# This will also write header information.
blastn -db refs/KM233118.fa -query query.fa -outfmt 7

# You may format the output in very different ways.
blastn -db refs/KM233118.fa -query query.fa -outfmt '6 qseqid sseqid qlen length mismatch'

# By default blastn uses the megablast search strategy.
# Shorten the query to:
# >short

# There are different search strategies/tasks for blast.
# These represent a pre-packaged set of parameters.
blastn -db refs/KM233118.fa -query query.fa

blastn -db refs/KM233118.fa -query query.fa -task blastn

# Make the sequence even shorter.
# >short
# Now -task blastn won't work either, there is another task blastn-short
# Tuning Blast can be a little world on its own.
blastn -db refs/KM233118.fa -query query.fa -task blastn-short

# You can have more than one sequence in the target database.
esearch -db nucleotide -query PRJNA257197 | efetch -db nucleotide -format fasta > refs/all-genomes.fa

# Create the blast database of all genomes.
makeblastdb -in refs/all-genomes.fa -dbtype nucl

# Pick any region of a genome
efetch -db nucleotide -id KM233118 -format fasta -seq_start 1 -seq_stop 1000 > 1000.fa

Lecture 12 - blastdbcmd, blastp, blastx, tblastn usage

# blastdbcmd, blastp, blastx, tblastn usage
# Compare the new virus to an outbreak from 1999.
# RefSeq accession number NC_002549.1, Nucleoprotein example: NP_066243.1


# The Bioproject for the data from 1999.

# Get a feature table
esearch -db protein -query PRJNA14703 | efetch -format ft

# Search the features of a genome, ft -> feature table.
efetch -db nucleotide -id NC_002549 -format ft | more

# Get the data for the project from the protein database.
esearch -db protein -query PRJNA14703 | efetch -format fasta > refs/NC_002549-prot.fa

# Also fetch the genome itself into the refs folder so it can be indexed below.
efetch -db nucleotide -id NC_002549 -format fasta > refs/NC_002549.fa

# Add the -parse_seqids option to makeblastdb.
# Create Blast databases for both nucleotides and proteins.
makeblastdb -in refs/NC_002549.fa -dbtype nucl -parse_seqids
makeblastdb -in refs/NC_002549-prot.fa -dbtype prot -parse_seqids

# blastdbcmd can take many parameters that can query and format
# the content of the blast database.

# Lists all the content in the database.
blastdbcmd -db refs/NC_002549.fa -entry 'all' | head -3

# Get a specific entry of the database.
blastdbcmd -db refs/NC_002549.fa -entry '10313991' | head -3

# Get a range of the nucleotides entries in the database.
blastdbcmd -db refs/NC_002549.fa -entry 'all' -outfmt '%a %s' -range 1-10 -strand minus

# Format the content of the database.
blastdbcmd -db refs/NC_002549.fa -entry 'all' -outfmt '%a %l'

# List each protein with its length.
blastdbcmd -db refs/NC_002549-prot.fa -entry 'all' -outfmt '%a %l'

# Extract an entry of the protein database into a file.
blastdbcmd -db refs/NC_002549-prot.fa -entry 'NP_066243.1' > query-p.fa

# Run blastp.
blastp -db refs/NC_002549-prot.fa -query query-p.fa

# Run tblastn.
# Extract a region that is protein coding. I found the coordinates from the GenBank page.
blastdbcmd -db refs/NC_002549.fa -entry 'NC_002549' -range 470-2689  > nucleotide.fa

# Align the nucleotide to the protein database.
blastx -db refs/NC_002549-prot.fa -query nucleotide.fa | more

# Fetch the nucleoprotein.
efetch -db protein -id NP_066243.1 -format fasta > NP_066243.fa
tblastn -db refs/NC_002549.fa -query NP_066243.fa  | more

# Get all proteins of the 2014 ebola project.
esearch -db protein -query PRJNA257197 | efetch -format fasta > refs/ebola-2014.fa
makeblastdb -in refs/ebola-2014.fa -dbtype prot -parse_seqids

# Fetch the nucleoprotein.
efetch -db protein -id NP_066243.1 -format fasta > NP_066243.fa

Lecture 13 - installing and using bwa

# Get bwa from SourceForge.
# The URL you get from SourceForge is complicated because SourceForge really wants you
# to visit the site and download from there so that you see the advertising there.
# Alternatively visit the webpage and download manually.
cd ~/src

curl -OL

# It is compressed with bzip2 so use the -j flag to uncompress.
tar jxvf bwa-0.7.10.tar.bz2

# We need to compile the program
cd bwa-0.7.10/

# The make command invokes the compilation.
make
# Link the bwa program to the bin folder
ln -s ~/src/bwa-0.7.10/bwa ~/bin

# Run the program to verify that it works.
mkdir -p ~/edu/lec13; cd ~/edu/lec13

# Run the tool to see how it works.
bwa
# Nice manual too!
man ~/src/bwa-0.7.10/bwa.1

# Make a global directory to store the references in.
mkdir -p ~/refs/852

# Get the reference genome for the 1999 outbreak.
efetch -db nucleotide -id NC_002549 -format fasta > ~/refs/852/ebola-1999.fa

# Index the genome to make it searchable with bwa.
bwa index ~/refs/852/ebola-1999.fa

# See the results.
ls ~/refs/852/

# Also make it searchable via blast.
makeblastdb -in ~/refs/852/ebola-1999.fa -dbtype nucl -parse_seqids

# Now look what happened. Whoa, thanks a bunch! (not!)
ls ~/refs/852/

# 16 files from a single one. That's why it is best to keep
# references in a separate folder.
ls -l ~/refs/852/ | wc -l

# Get the first line of the ebola genome.
head -2 ~/refs/852/ebola-1999.fa > query.fa

# Map it back to the genome itself this time with bwa-mem
bwa mem ~/refs/852/ebola-1999.fa query.fa > results.sam

# Compare the process to blasting the same sequence
blastn -db ~/refs/852/ebola-1999.fa -query query.fa > results.blastn

Lecture 14 - understanding the SAM format

# Continuing where we left off in lecture 13
head -2 ~/refs/852/ebola-1999.fa  > query.fa

# Run the alignment.
bwa mem ~/refs/852/ebola-1999.fa query.fa > results.sam

# Note how bwa still prints messages to the screen even when the output is redirected.
# There are two output streams. They are called: the standard out and the standard error.
# Here is how you would redirect each into its own file.
bwa mem ~/refs/852/ebola-1999.fa query.fa 1> results.sam 2> errors.txt

# There are many tricks to redirect one stream into another, although in the
# vast majority of cases this is not necessary. Sometimes you really need to
# store the error message.
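# One such trick (shown here as an illustration): 2>&1 sends standard error
# to wherever standard out currently points, capturing both streams in one file.
bwa mem ~/refs/852/ebola-1999.fa query.fa > all-output.txt 2>&1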

# Create queries based on the line you have.
# Add or remove bases, create insertions and deletions,
# then run the alignment.

bwa mem ~/refs/852/ebola-1999.fa query.fa > results.sam

# To view certain columns you can cut out only those.
# It is important that you understand the meaning of each column.

# Query id, alignment flag and target id
cat results.sam | cut -f 1,2,3

# Alignment start, mapping quality, CIGAR string
cat results.sam | cut -f 4,5,6

# Paired end read information: name, position and length of template
cat results.sam | cut -f 7,8,9

# Query sequence, query quality, other attributes
cat results.sam | cut -f 10,11,12,13,14

# If you list two files to bwa-mem it will align them in paired end mode.
# Get the example dataset from the course website.
curl -O

# Unpack and align this dataset then use the results to answer the homework questions.
bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam

Lecture 15 - BAM files and samtools

# Install wgsim, a short read simulator by Heng Li.
cd ~/src
git clone
cd wgsim
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
ln -s ~/src/wgsim/wgsim ~/bin/wgsim

# Check that it works.
wgsim

# Now download and install the samtools package.
# Link samtools to the ~/bin directory

cd ~/src
curl -OL
tar jxvf samtools-1.1.tar.bz2
cd samtools-1.1
ln -s ~/src/samtools-1.1/samtools ~/bin/

# Check that it works.
samtools

# Look at the samtools manual.
man ~/src/samtools-1.1/samtools.1

# We'll generate short reads from a mutated reference genome
# then map these back to it.
mkdir ~/edu/lec15
cd ~/edu/lec15

# Generate some data, put the output into a file
wgsim -N 10000 ~/refs/852/ebola-1999.fa read1.fq read2.fq > mutations.txt

# Run the alignment
bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam

# Start the conversion to a BAM file.
samtools view -Sb results.sam > temp.bam

# Sort the alignment.
samtools sort -f temp.bam results.bam

# Index the alignment.
samtools index results.bam

# Typically we put all commands into a single script and run it as such.
# See script at the end.
bash read1.fq read2.fq results.bam
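# A minimal sketch of what such a script might contain (its name is elided
# above; assume it takes the two read files and the output BAM name as $1, $2, $3):
#
#   set -ue
#   bwa mem ~/refs/852/ebola-1999.fa $1 $2 > tmp.sam
#   samtools view -Sb tmp.sam > tmp.bam
#   samtools sort -f tmp.bam $3
#   samtools index $3
#   rm -f tmp.sam tmp.bam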

# Filtering with samtools
# -f match the flag (keep the reads that match the flags),
# -F filter the flags (remove reads that match the flag and keep the rest)

# Reads that align to the reverse strand.
samtools view -f 16 results.bam

# Reads that align to the forward strand.
samtools view -F 16 results.bam

# The -c flag counts
samtools view -F 16 results.bam | wc -l
samtools view -c -F 16 results.bam

# Filter by quality. BWA shows a mapping quality of zero
# for reads that map to multiple locations. q>=1 thus means
# uniquely mapping reads.
samtools view -c -q 1 results.bam

# Count the high quality alignment.
samtools view -c -q 40 results.bam

# The reference name is gi|10313991|ref|NC_002549.1|
# It is tedious to type that. Create a shortcut.
CHR='gi|10313991|ref|NC_002549.1|'

# Slice into the datafile.
samtools view results.bam $CHR:1-100

# View specific columns of the file.
samtools view results.bam $CHR:1-100 | cut -f 4 |  more

Lecture 16 - Converting data with ReadSeq

# Install the file format converter
# We have to make a directory for it.
mkdir -p ~/src/readseq

cd ~/src/readseq

curl -OL

# It is invoked similarly to trimmomatic.
# Modify the script for trimmomatic
cp ~/bin/trimmomatic ~/bin/readseq

# Replace the program with
# java -jar ~/src/readseq/readseq.jar $@
# You can also add
# java -cp ~/src/readseq/readseq.jar app $@
# to run it with a graphical user interface.

# Get the 1999 ebola genome as a full GenBank record.

# Get the full genbank file.
mkdir -p ~/edu/lec16

cd ~/edu/lec16

# Get the complete data.
efetch -db nucleotide -id NC_002549.1 -format gb >

# Format the data as GFF (Generic Feature Format).
# Autodetects the input format.
readseq -format=GFF

# You may also set the output into a filename.
readseq -format=GFF -o NC-all.gff

# Extract into a fasta file.
readseq -format=FASTA -o NC.fa

# Retain only the gene rows from the GFF file.
cat NC-all.gff | egrep '\tgene\t'

# Really we also want a header line first, since that identifies the file as GFF.
echo '##gff-version 2' > NC-genes.gff
cat NC-all.gff | egrep '\tgene\t' >> NC-genes.gff

# Chromosomal coordinates do not match! We need to fix that.
# We either need to redo the alignment, or swap the coordinates
# of the existing alignment or features.
# Reindex and realign the data
cp NC.fa ~/refs/852/
bwa index ~/refs/852/NC.fa
cp ../lec15/*.fq .

# Edit the script to make use of the new file and rerun everything.
bash read1.fq read2.fq results.bam

# Load the data and view it in IGV
# Look at the depth in the 100-500 bp region.
# The output is sequence id, genomic index, coverage
samtools depth -r NC_002549:100-500 results.bam

Lecture 17 - Simple programming with awk

# Simple programming with awk
# We did this already but let's do it again so that we
# know what we are working with.
# Extract the Ebola coding features, genes and coding sequences
efetch -db nucleotide -id NC_002549.1 -format gb >
readseq -format=GFF -o NC.gff

# Print the first few columns. Note how the "magic"
# variables $1, $2, ... get created for each column.
cat NC.gff | awk ' { print $1, $2, $3 } ' | head -5

# This is (almost) equivalent to cutting columns
cat NC.gff | cut -f 1,2,3 | head -5

# We can use operations between the fields. How long is each feature?
cat NC.gff | awk ' { print $3, $5-$4 + 1 } ' | head -5

# We can make use of the pattern matching to extract only CDS features.
cat NC.gff | awk '$3 =="gene" { print $3, $5-$4 + 1, $9 } '

# What is the cumulative length of all genes
cat NC.gff | awk '$3 =="gene" { len=$5-$4 + 1; size += len; print "Size:", size } '

# What is the cumulative length of all coding sequences
cat NC.gff | awk '$3 =="CDS" { len=$5-$4 + 1; size += len; print "Size:", size } '

# How big is the genome? There are many ways to find out, perhaps from the SAM headers.
samtools view -h ../lec16/results.bam | head -2

# What percent of the genome is covered by genes.
cat NC.gff | awk '$3 =="CDS" { len=$5-$4 + 1; size += len; perc=size/18959; print "Size:", size, perc } '

# What we need is to force awk to only split by tabs.
# Splitting by all whitespace is useful but in data analysis
# can lead to very annoying bugs.
# Tell awk to separate by tab both as F (Field Separator) and as OFS (Output Field Separator)

# Put this into your .profile or .bashrc file
alias awk="awk -F '\t' -v OFS='\t'"

# Apply the setting.
source ~/.profile

# Make a new gff file with the genes and coding sequences.
# The first line specifies what kind of file this is.
head -1 NC.gff  > NC-genes.gff
head -1 NC.gff  > NC-cds.gff

# Separate the files for various features.
cat NC.gff | awk ' $3=="gene" { print $0 }' >> NC-genes.gff
cat NC.gff | awk ' $3=="CDS" { print $0 }'  >> NC-cds.gff

# Sam files are tab delimited and can be processed via awk.

# In last week's data how many bases have the coverage of more than 50?
samtools depth ../lec16/results.bam | awk '$3 > 50 { print $0 } ' | wc -l

# How many templates are longer than 50 bp.
# The template length can be negative though...
samtools view ../lec16/results.bam | awk ' $9 > 50 { print $0 } '  | wc -l

Split a GFF file to extract the gene name

$3 == "gene" {

    # Split the 9th column by ;
    split($9, x, ";")

    # The gene name is in the first resulting element.
    # Split that by space. The gene name is the second
    # element.
    split(x[1], y, " ")

    # Remove the double quotes around the gene name
    name = y[2]

    # Global substitution of " with the empty string.
    # Since " is also a special character we have to
    # write it as \"
    gsub("\"", "", name)

    # Print the type of the feature, the name and the size
    print $3, name, $5 - $4 + 1
}
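Assuming the program above is saved into a file, say genes.awk (a hypothetical name used here for illustration), it can be run over the GFF file like this. The explicit -F '\t' matters because the 9th column contains spaces:

cat NC.gff | awk -F '\t' -f genes.awk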

Lecture 18 - Comparing alignment tools

# Download and install bowtie2 aligner
cd ~/src

# On a Mac OSX use
curl -OL
# On Linux use
#curl -OL

# This is already a binary so it is executable

# Link the aligner and other related programs into the bin folder
ln -s ~/src/bowtie2-2.2.4/bowtie2 ~/bin/
ln -s ~/src/bowtie2-2.2.4/bowtie2-align ~/bin/
ln -s ~/src/bowtie2-2.2.4/bowtie2-build ~/bin/

# Build the index to our reference genome
# This is now the bowtie2 specific index
bowtie2-build ~/refs/852/NC.fa ~/refs/852/NC.fa

# Format the mutations so that these can be displayed in the browser
# Make it into a gff file.
cat mutations.txt | awk ' {print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "." }' > mutations.gff

# Run the comparisons

# Modify the script to have larger error rates.
# Count the number of mapped reads as you change the error rate.
samtools view -cF 4 bwa.bam
samtools view -cF 4 bow.bam

The final script

# Compare the output of two aligners.
# Usage: bash

# Stop on errors.
set -ue

# Data file names. (Example values, reconstructed from context; adjust to your setup.)
READ1=read1.fq
READ2=read2.fq

# Reference file. Must be indexed with both aligners.
REFS=~/refs/852/NC.fa
# Simulate reads from a genome.
# Edit the error rates and rerun the pipeline.
wgsim -N 10000 -e 0.1 $REFS $READ1 $READ2 > mutations.txt

# convert mutations into a GFF file.
cat mutations.txt | awk -F '\t' ' BEGIN { OFS="\t"; print "##gff-version 2" } { print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "name " $3 "/" $4 }' > mutations.gff

# We need to add a read group to the mapping.
# (The value below is an example; any valid @RG line will do.)
GROUP='@RG\tID:123\tSM:123\tLB:123'
# Run bwa and create the bwa.sam file.
bwa mem -R $GROUP $REFS $READ1 $READ2 > bwa.sam

# Run bowtie2 and create the bow.sam file.
bowtie2 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# Tuning bowtie2
#bowtie2 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# For each sam file convert it to bam format.
for name in *.sam; do
    samtools view -Sb $name > tmp.bam
    samtools sort -f tmp.bam $name.bam
done

# Get rid of intermediate files.
rm -f tmp.sam tmp.bam

# Fix names so that they don't have two extensions.
mv bwa.sam.bam bwa.bam
mv bow.sam.bam bow.bam

# Index the data.
for name in *.bam; do
    samtools index $name
done

# Clean all the data generated by this program
# rm -f *.bam *.bai *fq *.txt *.sam *.gff

Lecture 19 - samtools faidx and pileups

# Continuing with the code from lecture 18
# We have the bwa.bam and bow.bam files.

# Pileup output. The wgsim simulator produces reads with low quality values
# that samtools would otherwise ignore; the -Q 0 flag keeps them.

# index the fasta file with samtools
samtools faidx ~/refs/852/NC.fa

# Query the fasta file with samtools
samtools faidx ~/refs/852/NC.fa NC_002549:1-10

# You can list multiple ranges
samtools faidx ~/refs/852/NC.fa NC_002549:1-10 NC_002549:100-110

# Investigate the output with or without a fasta reference
# look at the help file
samtools mpileup

# pileup with no reference
samtools mpileup -Q 0 bwa.bam | more

# pileup with reference
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

Lecture 20 - Pileups and coverages

# First off install bcftools a suite for interacting with variant call formats.
cd ~/src
curl -OL
tar jxvf bcftools-1.1.tar.bz2
cd bcftools-1.1
ln -s ~/src/bcftools-1.1/bcftools ~/bin

# Note the similarity in the usage of bcftools and samtools.

# Using the alignment script for lecture 18.

# What is the average coverage for each run.
# NR is the number of records, which may not be equal to the size of the genome.
# By default samtools depth only outputs regions that have non-zero coverage.
samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/NR } '

# How to find the size of the genome.
# Many ways, none of them particularly pretty.

# The @SQ header contains the lengths.
samtools view -h bwa.bam | head | grep @SQ

# Now we know the size of the chromosome.
samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/18959 } '
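# A hedged way to avoid hardcoding the length: pull the LN field out of the
# first @SQ header line and hand it to awk as a variable (assumes one chromosome).
LEN=$(samtools view -H bwa.bam | grep @SQ | head -1 | tr '\t' '\n' | grep LN: | cut -f 2 -d ':')
samtools depth bwa.bam | awk -v len=$LEN '{ sum += $3 } END { print sum/len }'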

# Pileups with or without a reference
# Look at the usage
samtools mpileup

# pileup with no reference
samtools mpileup -Q 0 bwa.bam | more

# Pileup with reference.
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

# Samtools text genome browser.
# Press ? for help, q or ESC to exit.
samtools tview bwa.bam

# Generate a variant Call format for the entire genome.
samtools mpileup -Q 0 -f ~/refs/852/NC.fa -uv bwa.bam | more

# Generate genotypes then call variants.
# Samtools is designed for human genomes - it does not seem to
# work all that well with viral genomes.
samtools mpileup -Q 0 -ugf ~/refs/852/NC.fa bwa.bam | bcftools call -vm > snps.vcf

# Welcome to your next file format.
# A column by column guide.
samtools mpileup -Q 0 -f ~/refs/852/NC.fa -uv bwa.bam | grep -v "##" | cut -f 1,2,3,4,5 | head

The updated script

# Compare the output of two aligners.
# Usage: bash

# Stop on errors.
set -ue

# Data file names. (Example values, reconstructed from context; adjust to your setup.)
READ1=read1.fq
READ2=read2.fq

# Reference file. Must be indexed with both aligners.
REFS=~/refs/852/NC.fa
# Simulate reads from a genome.
# Edit the error rates and rerun the pipeline.
wgsim -N 10000 -e 0.1 $REFS $READ1 $READ2 > mutations.txt

# convert mutations into a GFF file.
cat mutations.txt | awk -F '\t' ' BEGIN { OFS="\t"; print "##gff-version 2" } { print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "name " $3 "/" $4 }' > mutations.gff

# We need to add a read group to the mapping.
# (The value below is an example; any valid @RG line will do.)
GROUP='@RG\tID:123\tSM:123\tLB:123'
# Run bwa and create the bwa.sam file.
bwa mem -R $GROUP $REFS $READ1 $READ2 > bwa.sam

# Run bowtie2 and create the bow.sam file.
bowtie2 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# Tuning bowtie2
#bowtie2 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -x $REFS -1 $READ1 -2 $READ2 > bow.sam

# For each sam file convert it to bam format.
for name in *.sam; do
    samtools view -Sb $name > tmp.bam
    samtools sort -f tmp.bam $name.bam
done

# Get rid of intermediate files.
rm -f tmp.sam tmp.bam

# Fix names so that they don't have two extensions.
mv bwa.sam.bam bwa.bam
mv bow.sam.bam bow.bam

# Index the data.
for name in *.bam; do
    samtools index $name
done

# Clean all the data generated by this program
# rm -f *.bam *.bai *fq *.txt *.sam *.gff

Lecture 21 - Variant calling with samtools

# Shortcut to convert and sort a bam file.
alias tobam='samtools view -b - | samtools sort -o - tmp '
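# Usage sketch: pipe SAM output straight into the alias and redirect the
# sorted BAM into a file. (Note that aliases only expand in interactive
# shells, not inside scripts.)
bwa mem ~/refs/852/NC.fa read1.fq read2.fq | tobam > results.bam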

# Install bcftools.
cd ~/src
curl -OL
tar jxvf bcftools-1.1.tar.bz2
cd bcftools-1.1
ln -s ~/src/bcftools-1.1/bcftools ~/bin

# Using the alignment script for lecture 18.

# Pileup with reference.
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

# Generate genotypes then call variants.
# Samtools is designed for diploid human genomes
# It may have tacit assumptions associated with it.
samtools mpileup -Q 0 -ugf ~/refs/852/NC.fa bwa.bam | bcftools call -vc > samtools.vcf

# View the snp calls.

# How does snp calling work?
# A DIY snp caller I wrote as a simple demo (see website)
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam > bwa.pileup
cat bwa.pileup | python > diy.vcf

# Look at that: my simplistic caller produces the same
# results as samtools on default settings -- how about that ...

# Welcome to your next file format.
# A column by column guide.
cat samtools.vcf| grep -v "##" | cut -f 1,2,3,4,5 | head

# Let's install FreeBayes

# On a Mac we need another program to compile freebayes
brew install cmake

# Now onto installing FreeBayes
cd ~/src
git clone --recursive git://
cd freebayes

# This produces an error on a Mac.
make

# But it will work when run the second time. It is not clear what is going
# on. The program seems to work - but probably some functionality is missing.
make

ln -sf ~/src/freebayes/bin/freebayes ~/bin

# Now switch back to the main directory and call snps with FreeBayes
cd ~/edu/lec21/

# How does the tool work?
freebayes --help

# Call snps with it
freebayes -f ~/refs/852/NC.fa bwa.bam > freebayes.vcf

# Visualize the results.

DIY snp caller

# DIY snp caller
import sys
from collections import defaultdict

# This code was written in about 20 minutes to demonstrate
# a really simple snp caller. As it is, it calls only
# homozygous mutations, defined as a base seen in over 50% of the calls.

print "##fileformat=VCFv4.2"
print "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bwa.bam".split())

for line in sys.stdin:
    # Strip the newline then split by tabs.
    column = line.strip().split("\t")

    # The fifth column contains the base calls
    # (zero based counting).
    bases = column[4]

    # The quals column will tell us how many bases are called.
    quals = column[5]

    # Upper case everything.
    # We do lose strand information with that.
    bases = bases.upper()
    count = defaultdict(int)
    for c in bases:
        count[c] += 1

    # How many bases cover the site
    depth = float(len(quals))

    for m in "ATGC":
        if count[m]/depth > 0.50:
            # We got a homozygous mutation;
            # produce the VCF record.
            # Collect output into an array.

            info = "DP=%s" % (int(depth))
            format = "GT:PL"
            values = "1/1:30"
            out = [column[0], column[1], ".", column[2], m, 30, ".", info, format, values]

            # Make all elements of the array a string.
            out = map(str, out)

            # Join the fields of the output array by tabs
            # Print the joined fields.
            print "\t".join(out)

Lecture 22 - Variant effect prediction

# Install, build and link bioawk
cd ~/src
git clone
cd bioawk
make
ln -sf ~/src/bioawk/bioawk ~/bin

# There is a manual that comes with bioawk.
man ~/src/bioawk/awk.1

# Bioawk knows how to handle bioinformatics formats.
# It not only knows how to split records by column,
# it can also populate the fields of records by name.

# After you run the script that simulates reads
# and aligns them (see previous lectures),
# process the results with bioawk. Help on formats:
bioawk -c help

cat r1.fq | bioawk -c fastx ' { print $seq } ' | head -1

cat mutations.gff | bioawk -c gff ' { print $seqname, $start } ' | head -1

# Code becomes more readable instead of writing:
cat mutations.gff | grep -v "#" | awk -F '\t' -v OFS='\t' ' { print $1, $5-$4+1 } ' | head -1

# We can write the following:
cat mutations.gff | bioawk -c gff ' { print $seqname, $end-$start+1 } ' | head -1

# We'll use bioawk from now on. We'll cover seqtk, tabtk, tabix at later times.

# Let's download snpEff
cd ~/src
curl -OL

# we can make a script to run snpEff, like trimmomatic and readseq
# for simplicity we'll make an alias for now
alias snpEff='java -jar ~/src/snpEff/snpEff.jar'

# Run snpEff
snpEff

# You may get an error of the form: UnsupportedClassVersionError
# This means that you need to update your java program to at least version 7
java -version

# Search for, download, and install the Java JDK (the Java Development Kit, not the JRE, Java Runtime Environment)
# Now snpEff should work.

# There are prebuilt snpEff databases
snpEff databases | more

# Search for the ebola data
snpEff databases | grep ebola

# Download the ebola database
# -v makes the process verbose: prints more messages
snpEff download -v ebola_zaire

# But what are those annotations?
snpEff dump ebola_zaire

# OK, so the annotations are built relative to a different genome, not
# the one we currently use. So we need to get that one too. Oy.
# Also get the genbank file and extract the coding sequences into a gff.
readseq -format=FASTA -o ~/refs/852/KJ660346.fa
readseq -format=GFF  -o KJ660346.gff

# The query will be the old genome from 1999 that is in NC.fa

# Index the genome with bwa and samtools
bash ~/refs/852/KJ660346.fa ~/refs/852/NC.fa

# Now we need to create a script that takes two files as input
# and generates a VCF file as output.

# Run snpEff
java -jar ~/src/snpEff/snpEff.jar -v  ebola_zaire aln.vcf > annotated.vcf

Align two genomes

# General single end aligner and snp caller.
# First parameter is the reference, second parameter the query

# This will stop the program from
# running if no parameters are passed
set -ue

# Parameters from the command line.

# First parameter is the first genome (reference) file.
GENOME1=$1

# The second parameter is the second genome.
GENOME2=$2

# This only needs to be performed once
bwa index $GENOME1

# Run bwa and create the bwa.sam file
bwa mem $GENOME1 $GENOME2 > aln.sam

# If you have installed the lastz aligner you can use that.
#lastz $GENOME1[unmask] $GENOME2[unmask] --format=sam > aln.sam

samtools view -Sb aln.sam > tmp.bam
samtools sort -f tmp.bam aln.bam
samtools index aln.bam

# Get rid of intermediate files.
rm -f tmp.bam

# Call snps
samtools mpileup -f $GENOME1 -ug aln.bam | bcftools call -vm > results.vcf

Lecture 23 - Call SNPs from the ebola data

# Call SNPS on multiple samples
# Get multiple datasets from the Ebola project

# Ouch! The data was aligned to another sequence as reference!
# Prepare the new reference then. No other workarounds.
efetch -id KM034562 -format gb -db nucleotide >
readseq -format=FASTA -o ~/refs/852/KM034562.fa
readseq -format=GFF -o KM034562.gff
bwa index ~/refs/852/KM034562.fa
samtools faidx ~/refs/852/KM034562.fa

# Get the run info file
esearch -db sra -query PRJNA257197  | efetch -format runinfo > runinfo.csv

# Select a few SRR (try more if feeling adventurous)
cat runinfo.csv | cut -f 1 -d ',' | grep SRR | head -10 > sraids.txt

# Let's put the data into a different folder.
# There is going to be a lot of it.
mkdir -p sra
mkdir -p bam

# Fetch each SRA id from the file
cat sraids.txt | awk ' { print "fastq-dump -X 20000 --split-files -O sra " $1 } ' >

# This generated script will download the data.

# Now align each pair.
# The align-ebola script should be present. That script aligns an SRA file
# and generates a BAM file from it.
cat sraids.txt | awk ' { print "bash " $1 } ' >

# Call snps on all the datasets.
freebayes -p 1 -f ~/refs/852/KM034562.fa bam/*.bam > freebayes.vcf

Alignment script:

# Compare the output of two aligners.
# Usage: bash SRA-RUN-ID

# Put the alignment files into a separate folder
mkdir -p bam

# Stop on errors.
set -ue

# Data file names.

# Create read names from the incoming parameter.
# ABC will become sra/ABC_1.fastq and sra/ABC_2.fastq

# Form the input names. (Assignments below are reconstructed from context.)
READ1=sra/${1}_1.fastq
READ2=sra/${1}_2.fastq

# Reference file. Must be indexed with both aligners.
REFS=~/refs/852/KM034562.fa

# We need to add a read group to the mapping.
# (The read group below is an example value; any valid @RG line will do.)
GROUP='@RG\tID:'$1'\tSM:'$1

# Run bwa and create a sorted bamfile as output.

# This will be the bamfile name.
BAM=bam/$1.bam
bwa mem -R $GROUP $REFS $READ1 $READ2 2> logfile.txt | samtools view -b - | samtools sort -o - tmp > $BAM

echo $?

samtools index $BAM

echo "*** Created alignment file $BAM"

# print out a simple statistic line
samtools flagstat $BAM | grep proper

Lecture 24 - Creating interval data

# Get the ebola genome data and features
# (if you do not already have it)

# A demo genome. It could be any other.
efetch -id KM034562 -format gb -db nucleotide >
readseq -format=FASTA -o ~/refs/852/KM034562.fa
readseq -format=GFF -o KM034562.gff

# We will build BED and GFF files by hand
# with an editor and display them in the browser.
# Example BED file, columns are tab delimited:
# (when you copy/paste with the browser they usually get turned into spaces)
KM034562	100	200	Happy	0	-

# Example BEDgraph file
KM034562	100	200	50
KM034562	150	250	-50

# Example GTF file
KM034562	lecture	CDS	100	200	.	-	.	gene "Happy"; protein_id "HAP2"

# Example GFF file
##gff-version 3
KM034562	lecture	CDS	100	200	.	-	.	gene=Happy; protein_id=HAP2

# Hierarchical data representations: for example splicing
# GTF representation

##gff-version 2
KM034562	demo	exon	1000	2000	.	+	.	gene_id "Happy"; transcript_id "Sad";
KM034562	demo	exon	3000	4000	.	+	.	gene_id "Happy"; transcript_id "Sad";
KM034562	demo	exon	5000	6000	.	+	.	gene_id "Happy"; transcript_id "Sad";
KM034562	demo	exon	7000	8000	.	+	.	gene_id "Happy"; transcript_id "Sad";

KM034562	demo	exon	1000	2000	.	+	.	gene_id "Happy"; transcript_id "Mad";
KM034562	demo	exon	5000	6000	.	+	.	gene_id "Happy"; transcript_id "Mad";
KM034562	demo	exon	7000	8000	.	+	.	gene_id "Happy"; transcript_id "Mad";

KM034562	demo	exon	1000	2000	.	+	.	gene_id "Happy"; transcript_id "Bad";
KM034562	demo	exon	5000	6000	.	+	.	gene_id "Happy"; transcript_id "Bad";

# GFF representation
##gff-version 3
KM034562	demo	exon	1000	2000	.	+	.	Parent=Sad,Mad,Bad;
KM034562	demo	exon	3000	4000	.	+	.	Parent=Sad;
KM034562	demo	exon	5000	6000	.	+	.	Parent=Sad,Mad,Bad;
KM034562	demo	exon	7000	8000	.	+	.	Parent=Sad,Mad;

# BED representation
KM034562	999	8000	Sad	0	+	999	8000	.	4	1001,1001,1001,1001	0,2000,4000,6000
KM034562	999	8000	Mad	0	+	999	8000	.	3	1001,1001,1001	0,4000,6000
KM034562	999	6000	Bad	0	+	999	6000	.	2	1001,1001	0,4000
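
# In BED12 the last three columns are blockCount, blockSizes and blockStarts;
# the block starts are relative to chromStart, and the coordinates are
# zero based, hence the 999 above.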

Lecture 25 - Introduction to Bedtools

# Install bedtools
cd ~/src
curl -OL
tar zxvf bedtools-2.22.0.tar.gz
cd bedtools2
make
ln -sf ~/src/bedtools2/bin/bedtools ~/bin/bedtools

# Demo bed file
KM034562	100	200	one	0	+
KM034562	400	500	two	0	-

# Our genome file
KM034562	18959

# Unstranded operation.
bedtools slop -i demo.bed -g genome.txt -l 10 -r 0

# Strand aware operation
bedtools slop -i demo.bed -g genome.txt -l 10 -r 0 -s
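
# Note the difference: without -s both features grow 10 bases before their
# start coordinate (90-200 and 390-500); with -s the minus strand feature
# grows at its end instead (400-510), since -l is relative to the 5' end.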

# Transform BED coordinates to GFF
# This is not really a correct BED to GFF transformation.
# The -t flag makes bioawk produce tab delimited output.
cat demo.bed | bioawk -t -c bed '{print $chrom, ".", ".", $start+1, $end, $score, $strand, ".", "." }' > demo.gff

# Look MA! It works seamlessly with other formats! How is that possible? Magic.
bedtools slop -i demo.gff -g genome.txt -l 10 -r 0 -s

# Flank operation. Redirects the output into a file.
bedtools flank -i demo.gff -g genome.txt -l 10 -r 0 -s > flank.gff

# Complement operation.
bedtools complement -i demo.gff -g genome.txt > complement.gff

# Get the Ebola genome.
efetch -id KM034562 -format gb -db nucleotide > KM034562.gb
readseq -format=FASTA -o ~/refs/852/KM034562.fa KM034562.gb
readseq -format=GFF -o KM034562.gff KM034562.gb

# Reminding ourselves of what formats the tool can take.
bioawk -c help

# Get the genes from the entire file.
cat KM034562.gff | bioawk -c gff ' $feature=="gene" { print $0 } ' > genes.gff

# Interval manipulation: create intervals that flank the genes.
bedtools flank -i genes.gff -g genome.txt -l 10 -r 0 -s > flank.gff

# Extract the sequences that correspond to the intervals in genes.gff
bedtools getfasta -fi ~/refs/852/KM034562.fa -bed genes.gff -fo genes.fa

# Look at the resulting file.
head genes.fa

Lecture 26 - Bedtools Tutorial

# Go to lecture 23 where we computed alignments of the Ebola short read data
cd ~/edu/lec23

# Create an interval file called region.bed
# KM034562	1000	2000
# We can use this interval file to intersect the data with.

bedtools intersect -a bam/SRR1553593.bam -b region.bed > region.bam
samtools index region.bam

# We can output the overlaps in BED format as well. Take care not to
# overwrite the region.bed file that we are intersecting with!
bedtools intersect -a bam/SRR1553593.bam -b region.bed -bed > overlap.bed

# Or alternatively we can turn the bamfile into a bedfile
bedtools bamtobed -i bam/SRR1553595.bam > SRR1553595.bed

# We will follow the bedtools tutorial.
mkdir ~/edu/lec26
cd ~/edu/lec26

# Get all the data then unpack it
curl -O
curl -O
curl -O
curl -O
curl -O
curl -O

# This dataset contains many more samples: DnaseI data from fetal
# brain, heart, intestine, kidney, lung, muscle, skin, and stomach tissue.
tar -zxvf maurano.dnaseI.tgz

# We will follow the tutorial in class.

# A few tips for the homework.

# Select the promoters from the annotation file
cat hesc.chromHmm.bed | grep  Promoter > promoters.bed

# Find the closest promoter to each exon.
# The -d flag adds a column with the distance to the closest feature.
bedtools closest -a exons.bed -b promoters.bed  -d | head -1

# Find the CpG island that is covered by the most exons
bedtools intersect -a cpg.bed -b exons.bed -c | sort -k5,5nr | head -1

# Cover the human genome with windows of 50kb size
bedtools makewindows -g genome.txt -w 50000 > windows.bed
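
# We could then count how many exons fall into each window (a sketch;
# the -c flag appends the count as a new last column):
bedtools intersect -a windows.bed -b exons.bed -c | sort -k4,4nr | head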

Lecture 27 - Introduction to RNA-Seq

# Install TopHat
cd ~/src

# Mac OSX
curl -OL

# Linux
# curl -OL

# Unpack and install it
tar zxvf tophat-2.0.13.OSX_x86_64.tar.gz
ln -s ~/src/tophat-2.0.13.OSX_x86_64/tophat ~/bin

# Following the RNA-Seq bootcamp by Stephen Turner

# Download genomic data from Ensembl
# Make a directory for the human genome
mkdir -p ~/refs/hg38

# Get the genome sequence
curl -L > ~/refs/hg38/chr4.fa.gz

# Get the annotations
curl -L > ~/refs/hg38/hg38.gtf.gz

# Unzip the files in the reference
gunzip ~/refs/hg38/*.gz

# We can build a simpler GTF file that only contains the data for chromosome 4
cat ~/refs/hg38/hg38.gtf | awk ' $1=="4" { print $0 } ' > chr4.gtf
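
# Sanity check: only chromosome 4 should remain.
cut -f 1 chr4.gtf | sort | uniq -c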

# You can also extract parts of the annotation file, say exons.
cat chr4.gtf | awk '$3=="exon" { print $0 } ' > exons.gtf

# Build the bowtie2 reference
bowtie2-build ~/refs/hg38/chr4.fa  ~/refs/hg38/chr4

# Build the bwa reference
bwa index ~/refs/hg38/chr4.fa

# Get the and unpack the data
curl -OL
tar zxvf rnaseq-data.tar.gz

# Let's align one dataset first
# Shortcut to convert and sort a SAM file into a BAM  file.
alias tobam='samtools view -b - | samtools sort -o - tmp '

time bwa mem ~/refs/hg38/chr4.fa  data/ctl1.fastq.gz | tobam > ctl1.bwa.bam
samtools index ctl1.bwa.bam

# What is the highest covered region?
bedtools coverage -abam ctl1.bwa.bam -b chr4.gtf | more

# Put it into a file
bedtools coverage -abam ctl1.bwa.bam -b chr4.gtf > coverages.txt

# Sort by column 10 (the column that contains the number of reads mapped)
# Tab delimited files need to get the construct below to work
# otherwise sort will split on any space.
cat coverages.txt | sort -t$'\t' -rn -k10,10 | head

# Load your data into IGV
# Find gene RPS3A

# Let's align the same data with a splice aware aligner: TopHat
tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz

# Index the results
samtools index tophat_out/accepted_hits.bam
# Investigate the results.
# with IGV

Lecture 28 - Running the Tuxedo suite

# Get and install cuffdiff and cufflinks
cd ~/src

# Download and install the Mac version
curl -OL
tar zxvf cufflinks-2.1.1.OSX_x86_64.tar.gz
ln -fs ~/src/cufflinks-2.1.1.OSX_x86_64/cufflinks ~/bin
ln -fs ~/src/cufflinks-2.1.1.OSX_x86_64/cuffdiff ~/bin

# On Linux download a different version. Note the change in the name of directory.
# Adjust the commands above accordingly.
# curl -OL

# Check that it works.

# We need to generate alignments with tophat then quantify these with cuffdiff
# We assume the genome has been downloaded and indexed as shown in lecture 27
curl -OL
tar zxvf rnaseq-data.tar.gz
rm rnaseq-data.tar.gz

# Get this only if you did not already
# curl -L > ~/refs/hg38/hg38.gtf.gz
#gunzip ~/refs/hg38/*.gz

cat ~/refs/hg38/hg38.gtf | awk ' $1=="4" { print $0 } ' > chr4.gtf

# Run tophat on each of the files
tophat -G chr4.gtf -o data/ctl1-tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz

# Copy the resulting file to the bam directory under a different name
cp data/ctl1-tophat/accepted_hits.bam bam/ctl1.bam
samtools index bam/ctl1.bam

# Now repeat each step.
# You could create a script file, either one that lists all commands or
# a more automatic one (see the tophat alignment script below).
# bash

# Now let's look for differential expression with cuffdiff

# Run cuffdiff and place the results into the bam/cuffdiff_out folder
cd bam
cuffdiff -o cuffdiff_out chr4.gtf ctl1.bam,ctl2.bam,ctl3.bam uvb1.bam,uvb2.bam,uvb3.bam

# Look at the files that cuffdiff produced
cd cuffdiff_out

# Open the gene_exp.diff file, sort it by the 10th column and show only a few columns
cat gene_exp.diff | grep yes | sort -k10n | cut -f 3,8,9,10 | head

Tophat alignment script:

# Runs tophat on each data file.

# Stop on errors.
set -ue

# The reference genome index built in lecture 27.
REF=~/refs/hg38/chr4

# This folder will contain the alignments
mkdir -p bam

for name in ctl1 ctl2 ctl3 uvb1 uvb2 uvb3
do
	# Create the fastq file name.
	FASTQ=data/$name.fastq.gz

	# Output directory name.
	OUTPUT=data/$name-tophat

	# Run tophat and put the results in OUTPUT directory.
	tophat -G chr4.gtf -o $OUTPUT $REF $FASTQ

	# Move the resulting accepted_hits.bam file to a different name.
	mv -f $OUTPUT/accepted_hits.bam bam/$name.bam

	# Index the new bam file.
	samtools index bam/$name.bam
done


Lecture 29 - Comparing feature counting software

# Download and install eXpress
cd ~/src
curl -OL
tar zxvf express-1.5.1-macosx_x86_64.tgz
ln -fs ~/src/express-1.5.1-macosx_x86_64/express ~/bin

# For Linux
# curl -OL

# Check that the eXpress program works.

# Install the subread package
cd ~/src
curl -OL
tar zxvf subread-1.4.6-source.tar.gz
cd subread-1.4.6-source/src
make -f Makefile.MacOS
ln -fs ~/src/subread-1.4.6-source/bin/featureCounts ~/bin
ln -fs ~/src/subread-1.4.6-source/bin/subread-align ~/bin
ln -fs ~/src/subread-1.4.6-source/bin/subread-buildindex ~/bin

# On Linux
# make -f Makefile.Linux

# Check that the featureCounts program works

# Data preparation.
# Get the data if you don't already have it
curl -OL
tar zxvf rnaseq-data.tar.gz
rm -f rnaseq-data.tar.gz

# Get the GTF file if you don't already have it
curl -L | gunzip -c > ~/refs/hg38/hg38.gtf

# Make the GTF file for just chromosome 4
# Since we know our data only contains data that aligns to this chromosome.
cat ~/refs/hg38/hg38.gtf | awk ' $1=="4" { print $0 } ' > chr4.gtf

# Quantify with TopHat
# --------------------
tophat -G chr4.gtf -o data/ctl1-tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz
tophat -G chr4.gtf -o data/ctl2-tophat ~/refs/hg38/chr4 data/ctl2.fastq.gz

# Compare two control samples ... see what happens. Should anything be differentially expressed?
cuffdiff -o diff chr4.gtf data/ctl1-tophat/accepted_hits.bam data/ctl2-tophat/accepted_hits.bam
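
# Few genes, if any, should be called significant here since both samples
# are controls. A quick check on the output in the diff folder:
cat diff/gene_exp.diff | grep yes | wc -l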

# Quantify with eXpress
# ---------------------

# eXpress quantifies alignments made against transcripts.
# We can download the transcript sequences from a data source
# or generate them from the known annotations.
# There is a GFF to FASTA converter packaged with Cufflinks.
# It makes use of the reference FASTA and the chr4.gtf file.
ln -sf ~/src/cufflinks-2.1.1.OSX_x86_64/gffread ~/bin

# Run it to see the help
gffread -h

# Extract sequences for the exonic coordinates.
gffread chr4.gtf -t exon -g ~/refs/hg38/chr4.fa -w ~/refs/hg38/chr4-transcripts.fa

# We could also get the transcripts directly from Ensembl.
# gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz

# eXpress needs an alignment file to quantify.
# The alignment file needs to be sorted by read name.
# Create an alignment file via bwa. You could use other aligners as well.
mkdir -p bwa
bwa index ~/refs/hg38/chr4-transcripts.fa

# Note how this BAM file will not be sorted by position.
# It cannot be visualized in IGV!
bwa mem ~/refs/hg38/chr4-transcripts.fa data/ctl1.fastq.gz | samtools view -b - > bwa/ctl1.bam
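
# bwa writes the alignments in the order of the input reads, which is what
# eXpress expects. If a file were already coordinate sorted we could re-sort
# it by read name (a sketch, using the old samtools sort syntax):
# samtools sort -n bwa/ctl1.bam bwa/ctl1.byname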

# Quantify with express
express ~/refs/hg38/chr4-transcripts.fa bwa/ctl1.bam -o bwa

# Quantify with SubRead
# ---------------------
# For this we need alignments against genome!

# Create an alignment with tophat
tophat -G chr4.gtf -o data/ctl1-tophat ~/refs/hg38/chr4 data/ctl1.fastq.gz

# Count features with featureCounts
mkdir -p sub
featureCounts -a chr4.gtf -g transcript_id data/ctl1-tophat/accepted_hits.bam  -o sub/ctl1.tophat.txt

# Let's compare the outputs

# The counts produced by featureCounts with TopHat
cat sub/ctl1.tophat.txt | sort -rn -k 7,7 | cut -f 1,6,7 | head | cat -n

# The counts produced by eXpress on alignments made with bwa
cat bwa/results.xprs | sort -rn -k5,5 | cut -f 2,3,5 | head | cat -n

# To get FPKM from the featureCounts output you need to divide by length
# and normalize to the library size: 60683
cat sub/ctl1.tophat.txt | grep "ENST" |  awk '{ print $1, $7/$6/60683 * 10^9 } ' | sort -rn -k2 | head

Lecture 30 - Gene Ontology

# Gene Ontologies. Making sense of data.

# Work on the results of TopHat.

# What columns do we have?

cat gene_exp.diff | head -1

# We have options to sort by various columns:
# largest or smallest fold change, coverage in sample 1 or sample 2, etc.
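
# For example, show the largest fold changes (in gene_exp.diff column 10
# is the log2 fold change and column 13 the q-value):
cat gene_exp.diff | sort -k10,10nr | cut -f 3,10,13 | head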

# We typically want to compile a list of gene names with p-values.

# Sometimes you need to separate the genes by a certain property (upregulated,
# downregulated) and perform the enrichment on each gene set separately.

#  Genes with positive foldchange.
cat gene_exp.diff |  grep yes | awk '$10 > 0 { print $0 }' | cut -f 3,13 > positive.txt

# Genes with negative foldchange.
cat gene_exp.diff |  grep yes | awk '$10 < 0 { print $0 }' | cut -f 3,13 > negative.txt

# There are other tools that need all datasets as they compute
# the background distribution from all gene values. In this case
# we cut column 2 (the Ensembl ID).
cat gene_exp.diff | cut -f 2,13 > all-ensemble.txt

# Take these lists and paste them into GO ontology search programs.

# Install ErmineJ

cd ~/src

curl -OL
ln -sf ~/src/ermineJ-3.0.2/bin/ermineJ.sh ~/bin

# Seems that we need to make it executable.
chmod +x ~/bin/ermineJ.sh

# Run it with the graphical user interface:
# ermineJ.sh --gui

Ubuntu Linux on a Chromebook

It is possible to install Linux on a low cost Chromebook ($239) and thus bypass the need to boot from a USB drive, dual boot or virtualize Linux. This is what you need:

  1. Get an Intel processor based Chromebook. Choose a model that is known to run Linux. After some research I chose the Acer C720 Chromebook from Amazon. I picked the 32GB storage version.
  2. Linux can be enabled via a tool named Crouton (Chromium OS Universal Chroot Environment). The installation is an extremely simple three step process.
  3. Apply the software setup commands listed below.

Linux setup commands. Check back for changes as we proceed with the course.

# Install a number of required libraries.
sudo apt-get install -y curl build-essential ncurses-dev byacc zlib1g-dev python-dev git cmake

# Install java and python libraries.
sudo apt-get install -y unzip default-jre python-pip

# Install the required perl modules.
sudo perl -MCPAN -e 'install "LWP::Simple"'
sudo perl -MCPAN -e 'install "HTML::Entities"'


ChromeBook Screenshot

Bash Customization

To customize your bash environment edit the file named `.profile` (on a Mac) or `.bashrc` (on Linux). Make sure tab-completion works. Never type full filenames. Instead, type them partially and then tap the TAB key; if nothing seems to happen, tap TAB twice to see all the options.

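# A color prompt that shows the user, host and current working directory.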
export PS1='\[\e]0;\w\a\]\n\[\e[32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '

# Aliases are a shorter way to invoke more complex commands.
# On Linux use: alias ls='ls -h --color'
alias ls='ls -hG'

# Safer versions of the default commands.
alias rm='rm -i'
alias mv='mv -i'
alias cp='cp -i'

# Extend the path.
export PATH=~/bin/:$PATH

Created by Istvan Albert • Last updated on Thursday, December 11, 2014 • Site powered by PyBlue