Analyzing Next-Gen Sequencing Data 2013


Practical data analysis for life scientists
BMMB 597D - Bio Data Analysis (2 cr.)
Schedule #231958
Tuesday/Thursday 2:30-3:20 in 316 Wagner Bldg
Limit of 25 students.
Office hours: MW 1-2:30pm 502 Wartik

Supporting Materials

Lecture Notes

Lectures will appear below as they are presented. Homework is specified in each handout.

  1. Lecture 1 - slides, handouts, course information, homework and project information, introduction to computing, setting up your computer, basic Unix command line usage, organizing your projects, homework 1.

  2. Lecture 2 - slides, handouts, the GFF format, sequence ontologies, basic Unix commands: wc, grep, cut, sort, redirecting input and output streams, piping commands, processing a tabular file with Unix tools, homework 2.

  3. Lecture 3 - slides, handouts, programming languages, downloading and installing a proper editor, introduction to the AWK programming language, tabular file processing, filtering by feature types, AWK one-liners explained, another collection of AWK one-liners, homework 3.

  4. Lecture 4 - slides, handouts, sequencing technologies, sequence representations, the FASTA format, processing FASTA files at the command line, homework 4.

  5. Lecture 5 - slides, handouts, string matching, edit distances, regular expressions, local and global alignments, homework 5.

  6. Lecture 6 - slides, handouts, introduction to using blast, legacy blast and blast+, preparing blast databases, performing a blastn query, formatting blast output, homework 6.

  7. Lecture 7 - slides, handouts, using blast, formatting databases, using the blastdbcmd tool, extracting sequences, batch operations, formatting blast queries, homework 7.

  8. Lecture 8 - slides, handouts, blast scores and E-values, search strategies, usage examples for blastn, blastp, blastx, tblastn, and tblastx, homework 8.

  9. Lecture 9 - slides, handouts, quality encodings, phred scales, the FASTQ format, homework 9.

  10. Lecture 10 - slides, handouts, file compression, gzip, zip, bz2, file archives, tarbombs, plotting FASTQ qualities, homework 10.

  11. Lecture 11 - slides, handouts, installing tools, quality control, adapter trimming, error correction

  12. Lecture 12 - slides, handouts, paired-end sequencing, quality control for paired-end sequencing, the bioawk language

  13. Lecture 13 - slides, handouts, paired-end sequencing, read stitching, automating tasks with shell scripts

  14. Lecture 14 - slides, handouts, short read alignments, bwa, bowtie and other tools

  15. Lecture 15 - slides, handouts, the Sequence Alignment/Map (SAM) format

  16. Lecture 16 - slides, handouts, the SAM/BAM format, sorting and indexing BAM files, using the samtools program

  17. Lecture 17 - slides, handouts, aligning paired-end reads, comparing and evaluating aligners, simulating sequencing reads with the wgsim tool

  18. Lecture 18 - slides, handouts, read duplication, visualizing alignments with IGV and IGB

  19. Lecture 19, guest lecture by Nicholas Stoler - slides, the variant call format (VCF), calling variants with samtools mpileup

  20. Lecture 20 - slides, handouts, origins of genome variations, more on SNP calling, successes and failures

  21. Lecture 21 - slides, handouts, interval representation, BED and GFF formats, representing data

  22. Lecture 22 - slides, handouts, interval operations: complement, extension, flanking, using the BedTools package

  23. Lecture 23 - slides, handouts, interval operations: intersect, window, selecting closest features

  24. Lecture 24 - slides, handouts, an introduction to genome assembly, using the Velvet assembler, evaluating genome assemblies with QUAST

  25. Lecture 25 - slides, handouts, meta.tar.gz (25MB), an introduction to metagenomics, software packages mothur, QIIME and MetaSim, online tools RDP, MG-RAST

  26. Lecture 26 - slides, handouts, lec26.tar.gz (25MB), an introduction to ChIP-Seq technology, peak calling concepts, preprocessing and peak calling methods (part 1)

  27. Lecture 27 - slides, handouts, ChIP-Seq peak calling software, preprocessing and peak calling methods (part 2)

  28. Lecture 28 - slides, handouts, lec28.tar.gz, basic RNA-Seq data analysis concepts, split read mapping

  29. Lecture 29 - slides, handouts, lec29.tar.gz, RNA-Seq (part 2)

  30. Lecture 30 - slides, handouts, bioinformatics beyond the command line: using R for data analysis

  31. Final Project - final-project, data for final project pony.tar.gz (17MB), BMMB 597D: Final project, 50% of the final grade, due 5pm Saturday, Dec 14th, 2013

Course Syllabus

Instructor: Istvan Albert

Course records: PSU ELion

Course registration: BMMB 597D - Bio Data Analysis

The purpose of this course is to introduce students to the various applications of high-throughput sequencing, including ChIP-Seq, RNA-Seq, SNP calling, metagenomics, de novo assembly and others. The course material will concentrate on presenting complete data analysis scenarios for each of these application domains and will introduce students to a wide variety of existing tools and techniques. We expect that by the end of the coursework students will:

Access to a Mac or Linux computer is necessary to complete the homework. Only Mac OS X (Tiger/Leopard) and Linux operating systems are supported.

Grading and Homework

All Penn State policies regarding ethics and honorable behavior apply to this course.

The final grade will be an average of the grades obtained on homework and a project. Please refer to the information in the first lecture. Homework will be handed out during each lecture in the form of exercises that will need to be turned in at the beginning of each week.

We want to emphasize that the primary goal of this coursework is to improve students' ability to handle and interpret data sets. Therefore, evaluation is relative to each student's initial aptitude. We aim to develop permanent skills that are not just immediately useful but also provide the foundation for a deeper understanding of informatics in general.

Created by Istvan Albert • Last updated on Wednesday, December 03, 2014 • Site powered by PyBlue