Fall 2011, Analyzing High Throughput Sequencing Data


The purpose of this course is to introduce students to the various applications of high-throughput sequencing, including ChIP-Seq, RNA-Seq, SNP calling, metagenomics, de novo assembly and others. The course material will concentrate on presenting complete data analysis scenarios for each of these application domains and will introduce students to a wide variety of existing tools and techniques. We expect that by the end of the course students will:

  • understand common bioinformatics data formats and standards
  • become familiar with the practice of analyzing short-read sequencing data from various instruments:
    • Illumina HiSeq sequencer
    • ABI SOLiD sequencer
    • Roche 454 platforms
  • develop the computationally oriented thinking that is necessary to take on large-scale data analysis projects
  • understand data analysis principles of methodologies such as:
    • short read and long read alignments
    • ChIP-Seq analysis and peak calling
    • interval query and manipulation
    • SNP calling and genomic variation detection
    • genome assembly with open source tools
    • metagenomics analysis
  • filter, extract and combine data with scripting languages
  • automate tasks with shell scripts to create reusable data pipelines
  • plot and visualize results with R and other packages

A laptop with enough battery power for 25 minutes of work may be required to perform data analysis tasks in class. We will be able to provide support for Mac OS X (Tiger/Leopard), Windows (XP/Vista) and Linux operating systems.


Practical data analysis for life scientists 
BMMB 597D - Bio Data Analysis (2 cr.)
Schedule #398704
Tuesday/Thursday 2:30-3:20 in 120 Thomas Building
Limit of 25 students.   
Office hours: MW 2-3pm 502B Wartik

Lecture Notes

Lectures will appear below as they are presented. Each week we will cover a certain topic over two lectures. Homework assignments are included in the handouts. Due to space considerations, the links to the datasets for each lecture will be distributed via the class email list and are not included on this website.

  • Lecture 1 - slides: course information, homework and project information, introduction to computing, introduction to the UNIX operating system, homework 1.

  • Lecture 2 - slides: the [GFF format], [sequence ontologies], UNIX input and output streams, piping commands, processing a tabular file with UNIX tools, homework 2.

  • Lecture 3 - slides: quality control, sequencing read file formats, FASTA, color space FASTA, FASTQ, using the [FastQC package], homework 3.

  • Lecture 4 - slides: quality filtering, writing shell scripts, elements of bash programming, using the [Fastx toolkit], homework 4.

  • Lecture 5 - slides: sequence alignment concepts, general features and characteristics of short read aligners, using the [BWA aligner], homework 5.

  • Lecture 6 - slides: the SAM - Sequence Alignment/Map format, understanding the [SAM specification], generating a SAM file with the [BWA aligner], homework 6.

  • Lecture 7 - slides: SAM file filtering, using the [Samtools] software suite, generating BAM files (binary SAM), sorting and indexing BAM files, filtering alignment files, depth of coverage tools, querying SAM files, homework 7.

  • Lecture 8 - slides: sequence coverage concepts, paired-end and mate-pair sequencing, aligning and filtering paired-end data, homework 8.

  • Lecture 9 - slides: genome visualization tools, using the [Integrative Genomics Viewer], creating custom genomes for IGV, visualizing paired end alignments, homework 9.

  • Lecture 10 - slides: text parsing and processing, introduction to the [AWK programming] language, homework 10.

  • Lecture 11 - slides: genomic coordinate systems, [BED, GFF and WIG formats], converting between formats, homework 11.

  • Lecture 12 - slides: interval datatypes, coordinate systems, BED and GFF formats, interval operations, intersect, genomic coverage computation with the [BEDTools] package, homework 12.

  • Lecture 13 - slides: more interval operations, flanking, extending, merging intervals with the [BEDTools] package, homework 13.

  • Lecture 14 - slides: compressed files and archives, how to install tools, the [tabix] software tool, the [Penn State High Performance Computing Systems], homework 14.

  • Lecture 15 - slides: human genomic variation, the [Variant Call Format], introduction to SNP calling, homework 15.

  • Lecture 16 - slides: dealing with data duplication, continuing the overview of SNP calling tools, the [Genome Analysis Toolkit], the [inGAP software], homework 16.

  • Lecture 17 - slides: midterm project instructions, introduction to ChIP-Seq analysis, DNA fragment sequencing, comparing bound locations of short and long footprints, homeworks 17 and 18 (midterm project).

  • Midterm Projects: the list of proposed midterm project ideas, see Lecture 17 for instructions

  • Lecture 18 - slides: more strategies for ChIP-Seq analysis, samtools pileup output and computing coverage measures, creating an indexed, queryable coverage file, homework 18 (midterm)

  • Lecture 19 - slides: code repositories, the [bioawk] and [chipexo] repositories, peak calling concepts, peak calling with GeneTrack, homework 19

  • Lecture 20 - slides: ChIP-Seq fragment size estimation ([bioawk]), the [Chip-Seq Challenge], running peak callers, evaluating and comparing the output of [MACS], [sissrs], [SWEMBL] and GeneTrack ([chipexo]), homework 20

  • Lecture 21 - slides: peak prediction with GeneTrack ([chipexo]), the [Cis-regulatory Element Annotation System], generating your custom profiles with [bioawk], homework 21

  • Lecture 22 - slides: p-values and statistical significance, p-value interpretation: [problems and pitfalls], simple strategies for statistical estimation, the [filo package]: groupBy and stats programs, homework 22

  • Lecture 23 - slides: an introduction to genome assembly, the [AMOS toolset], the [Velvet assembler], the [MUMmer aligner], homework 23

  • Lecture 24 - slides: the [ReadSeq utility], the [NCBI short read archive], read mapping quality evaluation with [wgsim], NGS mapper [ROC curves], comparing the [BWA] and [bowtie and bowtie2] mappers, homework 24

  • Lecture 25 - slides: data analysis with [R] and [RStudio], data types, vectors, factors, data frames, indexing and filtering R objects, homework 25

  • Lecture 26 - slides: visualizing high-dimensional datasets, using the [ggplot2] software, example plots, generating histograms of distances around 5' feature start sites, homework 26

  • Lecture 27 - slides: introduction to metagenomics, methods of metagenomics, phylotyping and OTU based approaches, resources, running BLAST

  • Lecture 28 - slides: final project information, analysis of metagenomics data, read classification via the MetaGenome Analyzer and the [RDP Multiclassifier]

  • Dataset for the final project: see project description at the beginning of lecture 28

  • Lecture 29 - slides: metagenomics data analysis, the [QIIME] and [mothur] packages, the [NAST] algorithm, example of data analysis with [mothur]

  • Lecture 30 - slides: quality filtering for metagenomics data analysis, trimming flows and sequences with [mothur], basic workflow, rarefaction curves

Grading and Homework

The final grade will be an average of the grades obtained on homework and two projects. Please refer to the information in [Lecture 1] for more details on the projects.

Homework will be handed out in most lectures in the form of exercises that will need to be turned in at the beginning of each week. Note that many of these may be solved in class during the exercise session.

We want to emphasize that the primary goal of this course is to improve students' ability to handle and interpret data sets. Therefore the evaluation process is relative to each student's initial aptitude. We aim to focus on developing permanent skills and talents that are not just immediately useful but also provide the foundation for a further, more in-depth understanding of informatics in general.

Created by Istvan Albert • Last updated on Tuesday, March 31, 2015 • Site powered by PyBlue