Lectures will appear below as they are presented. Homeworks are specified in each handout.
Lecture 1 - slides, handouts.
Course information, homework and project information, introduction to computing, setting up your computer,
basic unix command line usage, organizing your projects, homework 1.
Lecture 2 - slides, handouts,
Biological databases, the GFF format,
sequence ontologies, basic Unix commands:
input and output streams, piping commands, processing a tabular file with UNIX tools, homework 2.
Lecture 3 - slides, handouts,
Inside the data factory: how the Ebola paper was written, the GenBank format, core concepts for
the Short Read Archive (SRA), automated download of data from NCBI, installing and using Entrez Direct, homework 3.
Lecture 4 - slides, handouts,
Installing and using the SRA toolkit, setting up paths, installing a proper text editor, using the
SRA tools to download project-wide data, homework 4.
Lecture 5 - slides, handouts,
FASTA format, accession numbers, fetching subsequences from NCBI,
creating scripts and reusable components, bash programming, homework 5.
Lecture 6 - slides, handouts,
An overview of single end sequencing, quality values, encodings, the Phred encoding,
FASTQ format, homework 6.
Lecture 7 - slides, handouts,
Dealing with compressed files and file archives. Using gzip, gunzip and tar.
Installing and running FastQC, interpreting the FastQC outputs, homework 7.
Lecture 8 - slides, handouts,
Base quality trimming, installing tools, evaluating the results of quality control,
paired end sequencing concepts, homework 8.
- Code for lecture 8: Quality controls and corrections
- Biostar Question of the Day: Fastq Quality Control Shootout
- Suggested reading: Illumina sequencing explained.
- Software tools for adapter trimming:
- CutAdapt application note in Embnet Journal, 2011
- fastq-mcf published in The Open Bioinformatics Journal, 2013
- PrinSeq application note in Bioinformatics, 2011
- Trimmomatic application note in Nucleic Acids Research, 2012, web server issue
- Trim Galore - a wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries
- NGS Toolkit published in PLoS ONE, 2012
- Fastx Toolkit: collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing - one of the first tools
- BioPieces - a suite of programs for sequence preprocessing
- Scythe - a Bayesian adapter trimmer
- FlexBar, Flexible barcode and adapter removal published in Biology, 2012
- SeqPrep - a tool for stripping adaptors and/or merging paired reads with overlap into single reads.
- Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads.
- TagDust published in Bioinformatics, 2009
- TagCleaner published in BMC Bioinformatics, 2010
- Libraries via R (Bioconductor): PIQA, ShortRead
Lecture 9 - slides, handouts,
Advanced pattern matching, regular expressions,
detecting and trimming adaptor sequences, homework 9.
Lecture 10 - slides, handouts,
The basics of alignments, global, local and semiglobal alignments,
scoring matrices, pairwise alignments, homework 10.
Lecture 11 - slides, handouts,
Installing and using BLAST, search strategies, BLAST settings and configuration.
Lecture 12 - slides, handouts,
BLAST cookbook, short usage examples, tips and tricks, homework 12.
Lecture 13 - slides, handouts,
installing tools, short read aligners, installing and running bwa, homework 13.
Lecture 14 - slides, handouts,
the SAM (Sequence Alignment Map) format, homework 14.
Lecture 15 - slides, handouts,
the SAM/BAM formats and samtools, filtering and selecting data, homework 15.
Lecture 16 - slides, handouts,
genomic data visualization, IGV, IGB, converting formats, homework 16.
Lecture 17 - slides, handouts,
some programming required, introduction to the AWK programming language, tabular file processing,
filtering by feature types.
Lecture 18 - slides, handouts,
the origins of genomic variation, a case study,
comparing and evaluating alignment tools, homework 18.
Lecture 19 - slides, handouts,
sequencing coverages, pileups and the variant call format, homework 19.
Lecture 20 - slides, handouts,
aligner evaluation, computing coverages, the pileup formats, introduction to VCF formats, homework 20.
Lecture 21 - slides, handouts,
the variant call format, generating variant calls with samtools, homework 21.
Lecture 22 - slides, handouts,
bioinformatics survival toolkit: bioawk, seqtk, tabix, tabtk, aligning
two genomes, annotating the effect of SNPs with snpEff, homework 22.
Lecture 23 - slides, handouts,
automating data processing, building an entire SNP calling pipeline, homework 23.
Lecture 24 - slides, handouts,
interval datatypes, BED, GFF2, GTF, GFF3, specifying hierarchical relationships, homework 24.
Lecture 25 - slides, handouts,
interval handling, extending and flanking intervals with bedtools, extracting sequences, homework 25.
Lecture 26 - slides, handouts,
intersecting and querying interval data, homework 26.
Lecture 27 - slides, handouts, rnaseq-data.tar.gz
introduction to RNA-Seq, approaches, splice-aware alignments, homework 27.
Lecture 28 - slides, handouts, rnaseq-data.tar.gz
running the Tuxedo suite: tophat, cuffdiff, cufflinks, homework 28.
Lecture 29 - slides, handouts, rnaseq-data.tar.gz
comparing different RNA-Seq methodologies: eXpress, featureCounts, homework 29.
Lecture 30 - slides, handouts
the Gene Ontology, homework 30.
The purpose of this course is to introduce students to the
various applications of high-throughput sequencing, including ChIP-Seq,
RNA-Seq, SNP calling, metagenomics, de novo assembly and others.
The course material will concentrate on presenting complete data analysis scenarios
for each of these domains of applications and will introduce students to a wide
variety of existing tools and techniques. We expect that by the end of the
course work students will:
Access to a Mac or Linux computer is necessary to perform the homework.
Only Mac OS X (Tiger/Leopard) and Linux operating systems are supported.
This course will have a total of 30 homeworks. One is given out at the end of each lecture,
and homeworks are due by the first lecture (Tuesday) of each week. The final, 30th homework will be
a more complex project that requires more effort than a regular homework.
The final grade will be a weighted average of the grades obtained on the homeworks
(the last homework has a weight of 5, the rest have a weight of 1).
For more details please refer to the information presented during the first lecture.
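As an illustration of the weighting described above, the following is a minimal Python sketch of how such a weighted average could be computed. The function name `final_grade`, the 0-100 grading scale and the example grades are assumptions made for illustration only; the official grading details are those presented during the first lecture.

```python
def final_grade(homework_grades, final_weight=5, regular_weight=1):
    """Weighted average in which the last homework counts final_weight times.

    homework_grades: list of grades in lecture order; the last entry is
    the final project (hypothetical 0-100 scale).
    """
    if not homework_grades:
        raise ValueError("at least one homework grade is required")
    # Weight 1 for every regular homework, weight 5 for the last one.
    weights = [regular_weight] * (len(homework_grades) - 1) + [final_weight]
    total = sum(g * w for g, w in zip(homework_grades, weights))
    return total / sum(weights)


# Made-up example: 29 regular homeworks graded 90, final project graded 80.
grades = [90] * 29 + [80]
print(round(final_grade(grades), 2))  # (29*90 + 5*80) / 34 -> 88.53
```

With 30 homeworks the weights sum to 34, so under this reading the final project contributes 5/34, roughly 15%, of the final grade.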