NAME

dupGene.pl - A program to infer gene duplications and losses based solely on the number of paralogs in extant species. The program is designed for large-scale analyses.


DESCRIPTION

This program infers the number of paralogs at every internal node of a species tree, and the gene duplication/loss events along each branch, using the maximum parsimony method.

Briefly, given a species tree and the number of paralogs of a gene family in each extant species (the leaf nodes of the species tree), one number is assigned to each internal node to represent the inferred number of paralogs at that node. The number of duplications or losses on each branch is then the difference between the number of paralogs at the child node and the number at the parent node: an increase from parent to child counts as duplications, and a decrease counts as losses. Summing the duplications and losses over all branches gives the total number of events (dubbed the cost) for this particular assignment. Because each internal node has more than one candidate value, and the candidates can be combined in many ways across internal nodes, the maximum parsimony method searches for the assignments that give the minimal cost among all possible assignments.
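
The branch-by-branch counting described above can be sketched in a few lines of Python. This is only an illustration and not code taken from dupGene.pl: the PARENT map is a hand-written encoding of the example tree from the INPUT section, the copy numbers are those of the example record in the OUTPUT section, and the function names are made up for this sketch.

```python
# Parent of each node in the example species tree (the root "an" has no parent).
# This hand-written map is an assumption of the sketch, not a dupGene data structure.
PARENT = {
    "tn7": "an", "fn5": "an",
    "tn5": "tn7", "chi": "tn7",
    "tn3": "tn5", "cow": "tn5",
    "hum": "tn3", "mou": "tn3",
    "zeb": "fn5", "fn3": "fn5",
    "tet": "fn3", "med": "fn3",
}


def branch_events(copies, parent=PARENT):
    """Return {node: (duplications, losses)} for the branch leading to each node."""
    events = {}
    for node, par in parent.items():
        diff = copies[node] - copies[par]  # child minus parent
        events[node] = (max(diff, 0), max(-diff, 0))
    return events


def total_cost(copies, parent=PARENT):
    """Total number of duplications and losses over all branches."""
    return sum(d + l for d, l in branch_events(copies, parent).values())


# Copy numbers from the example record in the OUTPUT section.
copies = {"an": 1, "tn7": 2, "tn5": 2, "tn3": 2, "hum": 2, "mou": 2, "cow": 2,
          "chi": 3, "fn5": 1, "zeb": 3, "fn3": 1, "tet": 1, "med": 0}
events = branch_events(copies)
print(events["zeb"])       # (2, 0): two duplications on the branch fn5 -> zeb
print(events["med"])       # (0, 1): one loss on the branch fn3 -> med
print(total_cost(copies))  # 5
```

The per-node pairs correspond to the dups and loss lines of the output format.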

To find the set of values with the minimal cost efficiently, we implemented the dynamic programming algorithm introduced by Durand et al. (2005).
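
To give a feel for how dynamic programming avoids enumerating every combination, here is a minimal Python sketch of a generic bottom-up parsimony recursion. It is not the dupGene.pl implementation and may differ from the algorithm of Durand et al. in its details. It assumes that every duplication and every loss costs 1, that candidate values for internal nodes can be limited to the range spanned by the leaf counts, and it uses a hand-written CHILDREN map for the example tree; all names are hypothetical.

```python
# Hypothetical encoding of the example species tree: internal node -> children.
CHILDREN = {
    "an": ["tn7", "fn5"],
    "tn7": ["tn5", "chi"],
    "tn5": ["tn3", "cow"],
    "tn3": ["hum", "mou"],
    "fn5": ["zeb", "fn3"],
    "fn3": ["tet", "med"],
}


def min_cost(leaf_counts, children=CHILDREN, root="an"):
    """Minimal number of duplications plus losses, assuming unit cost per event."""
    lo, hi = min(leaf_counts.values()), max(leaf_counts.values())
    candidates = range(lo, hi + 1)  # assumption: search only within the leaf range

    def table(node):
        # table(node)[v] = cheapest cost of the subtree below `node`
        # when `node` itself carries v copies.
        if node not in children:  # leaf: its observed count is fixed
            return {leaf_counts[node]: 0}
        child_tables = [table(c) for c in children[node]]
        return {
            v: sum(min(cost + abs(w - v) for w, cost in t.items())
                   for t in child_tables)
            for v in candidates
        }

    return min(table(root).values())


species = "hum mou cow chi zeb tet med".split()
family = dict(zip(species, [2, 2, 1, 1, 3, 2, 2]))  # 2nd family of the example input
print(min_cost(family))  # 3
```

Each node is visited once and each table has one entry per candidate value, so the work grows with the number of nodes rather than with the number of combinations of values. Recovering the assignments themselves would need an additional traceback step, which is omitted here.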


COPYRIGHT

© 2012 by Zhenguo Zhang http://www.personal.psu.edu/zuz17/ and the Pennsylvania State University http://www.psu.edu/.


LICENSE

GNU General Public License version 3 or later


AUTHOR

 Zhenguo Zhang
 Institute of Molecular Evolutionary Genetics
 Department of Biology
 311 Mueller Laboratory
 The Pennsylvania State University
 University Park, PA 16802 USA
 Email: zuz17@psu.edu, fortunezzg@gmail.com


INSTALLATION

The program is written in Perl, so it can run on any platform (Linux/Unix, MacOS and Windows) where Perl is installed. If you have not installed Perl, download and install it from http://www.perl.org/get.html

After Perl is installed, do the following to install dupGene:

1. Download the latest version from Dr. Nei's website https://homes.bio.psu.edu/people/faculty/nei/software.htm

2. Uncompress the files into a local directory, say ./mybin, e.g., with 'tar -xzf dupGene.tar.gz'. You should now see dupGene.pl in ./mybin/

3. Run the program without any arguments, i.e., 'perl ./mybin/dupGene.pl', to see the available options.

For more information, please see the accompanying documents.


SYNOPSIS

perl dupGene.pl -s example.nwk -p example.in >example.out

Run the program without any arguments to see the usage information.


INPUT

The program needs two input files: the species tree in newick format, and the number of paralogs in each extant species (leaf nodes in the species tree) for each gene family (or homologous group).

In the newick tree file, every node, including each internal node, must have a unique id, because these ids are printed in the output to identify the nodes where duplications/losses happened. Example:


 ((((hum, mou)Tn3, cow)Tn5,chi)Tn7,(zeb,(tet,med)Fn3)Fn5)An;

Ids are case insensitive.
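
To check that a tree file meets these requirements before running dupGene, a small reader along the following lines may help. It is a sketch written for this README, not part of dupGene.pl: the function name is hypothetical, ids are lowercased on the assumption that case does not matter, branch lengths after ':' are simply dropped, and quoted names or comments are not supported.

```python
def parse_newick(text):
    """Parse a newick string with labelled internal nodes.

    Returns (root_id, {child_id: parent_id}) with all ids lowercased.
    """
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    parent = {}
    seen = set()
    pos = 0

    def read_node():
        nonlocal pos
        kids = []
        if text[pos] == "(":
            pos += 1  # skip "("
            kids.append(read_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                kids.append(read_node())
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        name = text[start:pos].split(":")[0].lower()  # ignore any branch length
        if not name:
            raise ValueError("every node, including internal nodes, needs an id")
        if name in seen:
            raise ValueError(f"duplicate node id: {name}")
        seen.add(name)
        for kid in kids:
            parent[kid] = name
        return name

    root = read_node()
    return root, parent


root, parent = parse_newick(
    "((((hum, mou)Tn3, cow)Tn5,chi)Tn7,(zeb,(tet,med)Fn3)Fn5)An;")
print(root)           # an
print(parent["hum"])  # tn3
print(parent["fn3"])  # fn5
```

A tree with an unlabelled internal node, such as "((a,b),c)r;", raises ValueError in this sketch.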

The paralog-number file should be tab-delimited. The first line gives the species names, and each following line gives the number of paralogous copies in each species for one gene family. An example input file containing three gene families:


 hum     mou     cow     chi     zeb     tet     med
 4       4       4       4       4       4       4
 2       2       1       1       3       2       2
 1       1       1       1       1       1       1

See example.nwk and example.in for data format.
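
For scripting around dupGene, the paralog-number file can be read with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, it splits on any whitespace (which covers tab-delimited files), and it lowercases species names on the assumption that ids are case insensitive.

```python
def read_paralog_table(lines):
    """Return one {species: copy_number} dict per gene family."""
    rows = [ln.split() for ln in lines if ln.strip()]  # skip blank lines
    species = [name.lower() for name in rows[0]]
    families = []
    for n, row in enumerate(rows[1:], start=1):
        if len(row) != len(species):
            raise ValueError(
                f"family {n}: expected {len(species)} columns, got {len(row)}")
        families.append(dict(zip(species, map(int, row))))
    return families


example = """\
hum\tmou\tcow\tchi\tzeb\ttet\tmed
4\t4\t4\t4\t4\t4\t4
2\t2\t1\t1\t3\t2\t2
1\t1\t1\t1\t1\t1\t1
"""
families = read_paralog_table(example.splitlines())
print(len(families))       # 3
print(families[1]["zeb"])  # 3
```

To read a real file, pass the open file handle instead, e.g. read_paralog_table(open("example.in")).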


OUTPUT

The output is printed to the standard output (usually the screen), but you can redirect it into a file by using '> myoutput'.

The output (tab-delimited) contains the inferred number of paralogs at internal nodes and the duplications and losses along branches. Since a gene family may have more than one equally parsimonious set of values for the internal nodes, each inference is written as a separate record of five lines. For example, the following lines show one result for the 6th gene family in example.in:

 # 6.1
 type    an      tn7     tn5     tn3     hum     mou     cow     chi     fn5     zeb     fn3     tet     med
 copy    1       2       2       2       2       2       2       3       1       3       1       1       0
 dups    0       1       0       0       0       0       0       1       0       2       0       0       0
 loss    0       0       0       0       0       0       0       0       0       0       0       0       1

The 1st line starts with a '#' followed by a num1.num2 string. num1 is the position of the gene family in the input file. num2 is the index of the parsimonious result for this family. When there is more than one set of inferred numbers of paralogs at internal nodes, num2 increases, as in 6.1 and 6.2 in example.out, both of which give the minimal total number of duplications and losses.

The 2nd line specifies the data type of the following lines, followed by the node ids.

The 3rd to 5th lines give, for each node, the number of paralogous copies (copy line), duplications (dups line) and losses (loss line). For duplications/losses, the numbers at a given node represent the number of events that took place on the branch from its parent node to itself.
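
For downstream analysis, one five-line record can be turned into a dictionary as sketched below. This is an illustration rather than a tool shipped with dupGene: the function name is hypothetical, and it assumes the record looks exactly like the example above, with a space after '#' and whitespace-separated (including tab-delimited) fields.

```python
def parse_record(lines):
    """Parse one five-line dupGene output record into a dict."""
    header, type_line, *value_lines = [ln.split() for ln in lines]
    family, solution = (int(x) for x in header[1].split("."))  # "6.1" -> 6, 1
    ids = type_line[1:]  # node ids follow the word "type"
    record = {"family": family, "solution": solution}
    for fields in value_lines:  # the copy, dups and loss lines
        record[fields[0]] = dict(zip(ids, map(int, fields[1:])))
    return record


record = parse_record([
    "# 6.1",
    "type\tan\ttn7\ttn5\ttn3\thum\tmou\tcow\tchi\tfn5\tzeb\tfn3\ttet\tmed",
    "copy\t1\t2\t2\t2\t2\t2\t2\t3\t1\t3\t1\t1\t0",
    "dups\t0\t1\t0\t0\t0\t0\t0\t1\t0\t2\t0\t0\t0",
    "loss\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1",
])
print(record["family"], record["solution"])  # 6 1
print(record["copy"]["zeb"])                 # 3
print(sum(record["dups"].values()))          # 4
print(sum(record["loss"].values()))          # 1
```

Summing the dups and loss values gives the total number of events for the record.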


SUGGESTED CITATION

Sayaka Miura*, Masafumi Nozawa*, Zhenguo Zhang, and Masatoshi Nei. Patterns of Duplication of MicroRNA Genes Support the Hypothesis of Genome Duplication in the Teleost Fish Lineage. (submitted)


REFERENCE

D. Durand, B. V. Halldorsson, and B. Vernot. 2005. A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction. Journal of Computational Biology, 13 (2): 320-335