Practical data analysis for life scientists
BMMB 597D - Bio Data Analysis (2 cr.)
Schedule #231958
Tuesday/Thursday 2:30-3:20 in 316 Wagner Bldg
Limit of 25 students.
Office hours: MW 1-2:30pm 502 Wartik
Lectures will appear below as they are presented. Homeworks are specified in each handout.
Lecture 1 - slides, handouts. course information, homework and project information, introduction to computing, setting up your computer, basic Unix command line usage, organizing your projects, homework 1.
Lecture 2 - slides, handouts, the GFF format, sequence ontologies, basic Unix commands: wc, grep, cut, sort, redirecting input and output streams, piping commands, processing a tabular file with Unix tools, homework 2.
Lecture 3 - slides, handouts. programming languages, download and install a proper editor, introduction to the AWK programming language, tabular file processing, filtering by feature types, AWK one-liners explained, another collection of AWK one-liners, homework 3.
Lecture 4 - slides, handouts, sequencing technologies, sequence representations, the FASTA format, processing FASTA files at the command line, homework 4.
Lecture 5 - slides, handouts, string matching, edit distances, regular expressions, local and global alignments, homework 5.
Lecture 6 - slides, handouts, introduction to using blast, legacy blast and blast+, preparing blast databases, performing a blastn query, formatting blast output, homework 6.
Lecture 7 - slides, handouts, using blast, formatting databases, using the blastdbcmd, extract sequences, batch operations, formatting blast queries, homework 7.
Lecture 8 - slides, handouts, blast scores and E-values, search strategies, usage examples for blastn, blastp, blastx, tblastn, and tblastx, homework 8.
Applied Bioinformatics at Penn State
Lecture 9 - slides, handouts, quality encodings, phred scales, the FASTQ format, homework 9.
Lecture 10 - slides, handouts, file compression, gzip, zip, bz2, file archives, tarbombs, plotting FASTQ qualities, homework 10.
Lecture 11 - slides, handouts, installing tools, quality control, adapter trimming, error correction
Lecture 12 - slides, handouts, paired end sequencing, quality control for paired end sequencing, the bioawk language
Lecture 13 - slides, handouts, paired end sequencing, read stitching, automating tasks with shell scripts
Lecture 14 - slides, handouts, short read alignments, bwa, bowtie and other tools.
Lecture 15 - slides, handouts, the Sequence Alignment/Map (SAM) format
Lecture 16 - slides, handouts, the SAM/BAM format, sorting and indexing BAM files, using the samtools program
Lecture 17 - slides, handouts, aligning paired end reads, comparing and evaluating aligners, simulating sequencing reads with the wgsim tool
Lecture 18 - slides, handouts, read duplication, visualizing alignments with IGV and IGB
Lecture 19, guest lecture by Nicholas Stoler - slides, the variant call format (VCF), calling variants with samtools mpileup
Lecture 20 - slides, handouts, origins of genome variations, more on SNP calling, successes and failures
Lecture 21 - slides, handouts, interval representation, BED and GFF formats, representing data
Lecture 22 - slides, handouts, interval operations: complement, extension, flanking, using the BedTools package
Lecture 23 - slides, handouts, interval operations: intersect, window, selecting closest features
Lecture 24 - slides, handouts, an introduction to genome assembly, using the velvet assembler, evaluating genome assemblies with QUAST
Lecture 25 - slides, handouts, meta.tar.gz (25MB), an introduction to metagenomics, software packages mothur, QIIME and MetaSim, online tools RDP, MG-RAST
Lecture 26 - slides, handouts, lec26.tar.gz (25MB), an introduction to ChIP-Seq technology, peak calling concepts, preprocessing and peak calling methods (part 1)
Recommended reading: Applications of next-generation sequencing (Nature, resources), the bioawk-tools utilities
Lecture 27 - slides, handouts, ChIP-Seq peak calling software, preprocessing and peak calling methods (part 2)
Lecture 28 - slides, handouts, lec28.tar.gz, basic RNA-Seq data analysis concepts, split read mapping
Lecture 29 - slides, handouts, lec29.tar.gz, RNA-Seq (part 2)
Lecture 30 - slides, handouts, bioinformatics beyond the command line: using R for data analysis
Final Project 30, final-project, data for final project pony.tar.gz (17MB). BMB 597D: Final project, 50% of the final grade, due 5pm Saturday, Dec 14th, 2013
Instructor: Istvan Albert
Course records: PSU ELion
Course registration: BMMB 597D - Bio Data Analysis
The purpose of this course is to introduce students to the various applications of high-throughput sequencing, including ChIP-Seq, RNA-Seq, SNP calling, metagenomics, de novo assembly and others. The course material will concentrate on presenting complete data analysis scenarios for each of these application domains and will introduce students to a wide variety of existing tools and techniques. We expect that by the end of the course students will:
Access to a Mac or Linux computer is necessary to complete the homework. Only Mac OS X (Tiger/Leopard) and Linux operating systems are supported.
All Penn State Policies regarding ethics and honorable behavior apply to this course.
The final grade will be an average of the grades obtained on the homework and the final project. Please refer to the information in the first lecture. Homework will be handed out during each lecture in the form of exercises that must be turned in at the beginning of each week.
We want to emphasize that the primary goal of this course is to improve students' ability to handle and interpret data sets. The evaluation is therefore relative to each student's initial aptitude. We aim to develop lasting skills that are not only immediately useful but also provide the foundation for a deeper understanding of informatics in general.
Created by Istvan Albert • Last updated on Wednesday, December 03, 2014 • Site powered by PyBlue