Huck Institute of the Life Sciences, Penn State

Course: Practical Data Analysis for Life Scientists

The purpose of the course is to introduce life scientists to level-appropriate, computer-based data analysis techniques. It will cover basic informatics training as well as the use of bioinformatics tools and software.

Schedule

Practical data analysis for life scientists
BMMB 597D - Bio Data Analysis (2 cr.)
Schedule #398704
Tuesday/Thursday 2:30-3:20 in 012 Life Sciences Building
Limit of 20 students.

Office hours: MW 2-3pm, 504 Wartik

Lecture Notes

Lectures will appear below as they are presented. Each week we will cover one topic over two lectures. Homework assignments are included in the handouts.

Note

Read the Getting Started page before the first lecture.

  • Week 1 - slides, handout keywords: course information, homework and project information, introduction to computing, names and types, integer and floating point, representation errors, type casting, homework 1
  • Week 2 - slides, handout, dataset keywords: list containers, indexing and slicing, sorting, mapping, reading data from files, function definitions, filtering, map and filter synergy, exercises, homework 2
  • Week 3 - slides, handout, knowledge-master keywords: organizing code, program modularity, functions, modules and packages, creating documentation, import paths, naming conventions, knowledge master slides, homework 3
  • Week 4 - slides, handout, dataset keywords: timing processes, nested lists, small and large datasets, column based files, reading comma separated files, cyclomatic complexity, nesting codeblocks, mapping to a list of lists, processing a realistic microarray dataset, homework 4
  • Week 5 - slides, handout keywords: knowledge master 2, recapitulate map and filter behaviour, namespaces, nested scopes, closures, csv file reader by column name, homework 5
  • Week 6 - slides, handout keywords: string formatting, positional and keyword parameters, creating and appending to files, tuple types, defensive programming, DRY principle, module level scope, knowledge master 3
  • Week 7 - slides, handout keywords: random numbers, shuffling, pigeonhole principle, plotting and charting basics, generating histograms, parallel iteration, flowcharts, zip and map, adding legends to plots, multiplots, homework
  • Week 8 - slides, handout, GSE18455.zip keywords: dataset from Gene Expression Omnibus, absolute and relative paths, catching exceptions, better histograms, explicit loops, functional vs procedural approaches, automating tasks, homework
  • Week 9 - slides, handout, GPL9270.zip keywords: new dataset from Gene Expression Omnibus, usability improvements, container objects, sets, set operations, dictionaries, building and iterating over dictionaries, fill-in-the-blank, homework
  • Week 10 - slides, handout keywords: sorting data, decorate-sort-undecorate pattern, selecting common highly expressed genes from multiple files, collecting replicate data
  • Week 11 - slides, handout keywords: descriptive statistics, normal functions, central limit theorem, t-tests, independent and related samples, chi-square tests, Kolmogorov-Smirnov tests for identical distributions, project summaries
  • Week 12 - slides, handout keywords: biopython, searching Entrez, fasta and genbank formats, sequence records, sequence objects
  • Week 13 - slides, handout, images keywords: image formats, color modes, image processing, crop, resize, merge images, channel operations, image filters, blending and subtracting, color histograms
  • Week 14, ...
  • Week 15, ...
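As a taste of two recurring keywords above (map and filter in Weeks 2 and 5, the decorate-sort-undecorate pattern in Week 10), here is a minimal sketch in Python 3. It is not taken from the course handouts; the gene names, the expression values and the `is_expressed` helper are invented for illustration.

```python
# Hypothetical (gene, expression) pairs, invented for illustration only.
measurements = [("geneA", 2.5), ("geneB", 0.4), ("geneC", 7.1), ("geneD", 1.2)]


def is_expressed(pair, threshold=1.0):
    """Return True when the expression value exceeds the threshold."""
    name, value = pair
    return value > threshold


# filter keeps the pairs that pass the test; map transforms each survivor.
expressed = list(filter(is_expressed, measurements))
names = list(map(lambda pair: pair[0], expressed))

# Decorate-sort-undecorate: put the sort key first, sort, then strip it off.
decorated = [(value, name) for name, value in expressed]
decorated.sort(reverse=True)
ranked = [name for value, name in decorated]

print(names)
print(ranked)
```

In current Python the same ranking can be written as `sorted(expressed, key=lambda pair: pair[1], reverse=True)`; the explicit three-step form is shown only because the pattern is named in the lecture keywords.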

Note

A list of recommended resources.

Syllabus

The purpose of this class is to introduce life science students to programming concepts that will allow them to process, analyze, visualize and interpret the information encoded in the large datasets that modern life science facilities produce.

We expect that by the end of the course work all students will be able to:

  • read data from arbitrarily large datasets
  • filter rows or columns by certain conditions
  • extract the information of interest
  • combine disparate data spread across multiple files
  • automate the tasks above to operate on hundreds of files
  • plot and visualize results in a meaningful way
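The first three goals in the list above (reading, filtering and extracting) can be sketched in a few lines of Python 3. This is an illustrative sketch, not course material: the column names, the sample values and the helpers `read_rows` and `select` are assumptions made up for the example, and an in-memory string stands in for a real file.

```python
import csv
import io

# Invented sample data standing in for a file on disk.
SAMPLE = """gene,sample,expression
geneA,control,2.5
geneB,control,0.4
geneA,treated,5.0
geneB,treated,0.6
"""


def read_rows(handle):
    """Yield one dictionary per row, so the whole file is never held in memory."""
    for row in csv.DictReader(handle):
        row["expression"] = float(row["expression"])
        yield row


def select(rows, min_expression):
    """Keep rows at or above the cutoff and extract only the columns of interest."""
    return [(r["gene"], r["sample"]) for r in rows if r["expression"] >= min_expression]


selected = select(read_rows(io.StringIO(SAMPLE)), min_expression=1.0)
print(selected)
```

For a real dataset the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `open(path, newline="")`. Because `read_rows` is a generator, the same code should work unchanged on files much larger than the sample.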

We will also explore more advanced topics, but we will not require everyone to demonstrate full competency in subjects such as:

  • descriptive statistics concepts
  • numerical methods
  • algorithm development
  • database queries

The purpose of these latter lectures is to expose the audience to the next level of complexity, and help guide those who wish to advance their expertise.

Finally, there will be presentations on the analysis methods related to the data formats produced by the Penn State life science facilities, with special focus on microarray and sequencing technologies.

Grading and Homework

The final grade will be a combination of the grades obtained on homework (60%) and the term project (40%).

Homework will be handed out in most lectures in the form of exercises that must be turned in at the beginning of each week. Note that many of these may be solved in class during the exercise session (see below).

A term project is required, preferably one that uses data from a project that the student is actively pursuing. We recommend involving the student’s advisor in choosing the project and, therefore, the data to be processed.

There is no final exam. Instead, during the final week students are expected to give a short (approximately 10-minute) presentation that describes some of the characteristics of the data produced by the project, as well as the strategies and methodologies that they were able to employ while processing it.

We want to emphasize that the primary goal of this coursework is to improve students’ ability to handle and interpret datasets. Therefore, evaluation is relative to each student’s initial aptitude. We aim to develop lasting skills that are not just immediately useful but also provide the foundation for a deeper understanding of informatics in general.

Requirements

The usual lecture format consists of a 30-minute presentation followed by approximately 20 minutes of in-class experimentation with the programming concepts that have been presented. In-class exercise sheets will be provided.

A laptop with enough battery power for 20 minutes of work will be required during each lecture. We will be able to provide support for Mac OS X (Tiger/Leopard), Windows (XP/Vista) and Linux operating systems.

Prior to coming to the first lecture students will need to install the software packages listed on the Getting Started page.