Home | CV | Databases | IMEG Seminars | Journals
 
MEP-online | People | Publications | Software



Software - Readme File

 

 
ANC-GENE:
Inference of Ancestral Gene Sequences by the Distance-Based Bayesian Method
 
(c) Copyright April, 1998 by Jianzhi Zhang and the Pennsylvania State University. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed. ANC-GENE is distributed free of charge by:
 

Jianzhi Zhang
Institute of Molecular Evolutionary Genetics
Department of Biology
322 Mueller Laboratory
The Pennsylvania State University
University Park, PA 16802 USA
 

Current Address:
Associate Professor of Ecology and Evolutionary Biology
University of Michigan
Ann Arbor, MI
E-mail: jianzhi@umich.edu

Jianzhi Zhang Homepage

 
Suggested citation
    
     Zhang J, Nei M (1997) Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. 44(Suppl. 1):S139-S146.

     Zhang J, Rosenberg HF, Nei M (1998) Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708-3713.
 
Introduction
     ANC-GENE is designed for inference of ancestral nucleotide sequences of protein-coding genes from a set of present-day sequences whose phylogenetic relationships are known. The program first infers the ancestral amino acids by the distance-based Bayesian method, and then infers the underlying nucleotide sequences while keeping the inferred amino acids fixed. The program is written in C and can be used on IBM PC compatible computers running the Windows 95 operating system.
 
Installation
     First make sure that the diskette you have received contains the following files.

anc-gene.c (source code)
anc-gene.exe (executable file)
jtt.pro (JTT substitution matrix)
poisson.pro (Poisson substitution matrix)
rnase.seq (example data file)
manual (this file)
result (output file)

     To install ANC-GENE on your computer's hard disk drive (the "C" drive is used here as an example), first create a directory to hold the files of this package. To do this, type the following command:

c:\md anc-gene (Enter)

     To copy the ANC-GENE files onto your hard disk drive, insert the floppy disk containing the program into your floppy drive (the "A" drive is used here as an example). Then enter the following command:

c:\copy a:*.* c:\anc-gene\*.* (Enter)
 
Input file
     To use the program, you need an input file containing the DNA sequences and the tree topology of these sequences (see rnase.seq for an example). The file begins with two numbers: the number of sequences and the number of nucleotides per sequence (sequence length). The second line is the name of the first sequence, the third line is the first sequence itself, and so on for the remaining sequences. Only A, T, C, G, a, t, c, and g are allowed in the sequences. The sequences must be aligned, and gaps and any other symbols must be removed. The last line of the file is the tree topology of the sequences, in the same format as that used in the PHYLIP package (Felsenstein 1995). Note that the tree is unrooted, so a trifurcation rather than a bifurcation is required at the deepest branching node. For example, the topology of the following tree can be expressed by

(((1,3),2),6,((4,7),(5,8)))
 
             |------ 1
        |-11-|
        |    |------ 3
   |-10-|
   |    |----------- 2
 9 |
   |---------------- 6
   |
   |         |------ 4
   |    |-13-|
   |    |    |------ 7
   |-12-|
        |    |------ 5
        |-14-|
             |------ 8


     Note that in the topology expression, the numbers refer to the order of the present-day sequences given in the input file. Also note that the topology expression contains only numbers, parentheses, and commas, with no spaces.
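
     The rules above can be checked before running the program. The following is a minimal sketch in C, not taken from the ANC-GENE source code; the function name topology_looks_valid is hypothetical. It checks only the rules stated here: the string contains nothing but digits, commas, and parentheses, the parentheses are balanced, and the outermost group has exactly three members (the trifurcation). It does not verify that every sequence number appears exactly once, or that no item between commas is empty.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of ANC-GENE): returns 1 if `topo` passes
 * the basic format rules described in this manual, and 0 otherwise. */
int topology_looks_valid(const char *topo)
{
    int depth = 0;            /* current parenthesis nesting level */
    int top_level_commas = 0; /* commas directly inside the outer pair */
    size_t i;

    if (topo == NULL || topo[0] != '(')
        return 0;

    for (i = 0; topo[i] != '\0'; i++) {
        char ch = topo[i];
        if (ch == '(') {
            depth++;
        } else if (ch == ')') {
            depth--;
            if (depth < 0)
                return 0;     /* closing parenthesis without an opening one */
            if (depth == 0 && topo[i + 1] != '\0')
                return 0;     /* outer pair must close at the very end */
        } else if (ch == ',') {
            if (depth == 1)
                top_level_commas++;
        } else if (!isdigit((unsigned char)ch)) {
            return 0;         /* spaces, letters, etc. are not allowed */
        }
    }
    /* Balanced, and three members in the outermost group (two commas). */
    return depth == 0 && top_level_commas == 2;
}
```

     For instance, topology_looks_valid("(((1,3),2),6,((4,7),(5,8)))") returns 1, whereas a string containing a space or a bifurcating outermost group returns 0. A trailing newline read from the file must be stripped before the check.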

     The tree of the ribonuclease sequences in the example data file is 
(((((1,2),3),4),5),((((6,7),8),9),10),11)
 

                       |------ 1 human-ECP
                  |-16-|
                  |    |------ 2 chimp-ECP
             |-15-|
             |    |----------- 3 gorilla-ECP
        |-14-|
        |    |---------------- 4 orangutan-ECP
   |-13-|
   |    |--------------------- 5 macaque-ECP
12 |
   |                   |------ 6 human-EDN
   |              |-20-|
   |              |    |------ 7 chimp-EDN
   |         |-19-|
   |         |    |----------- 8 gorilla-EDN
   |    |-18-|
   |    |    |---------------- 9 orangutan-EDN
   |-17-|
   |    |--------------------- 10 macaque-EDN
   |
   |-------------------------- 11 tamarin-EDN

     The ancestral nodes are denoted by the numbers N+1, N+2, ..., 2N-2, where N is the number of present-day sequences. Which number corresponds to which node can be worked out by reading the output file.
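
     The numbering N+1, ..., 2N-2 implies N-2 ancestral nodes, which is also the number of parenthesis pairs in a fully resolved topology expression with a trifurcating outermost group. The short C sketch below illustrates this count; it is not taken from the ANC-GENE source code, and the function names count_leaves and count_ancestral_nodes are hypothetical. It assumes the topology string is already well formed.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: number of present-day sequences (N), counted as
 * the number of separate digit runs in the topology string. */
int count_leaves(const char *topo)
{
    int n = 0;
    int in_number = 0;
    for (; *topo != '\0'; topo++) {
        if (isdigit((unsigned char)*topo)) {
            if (!in_number) {
                n++;
                in_number = 1;
            }
        } else {
            in_number = 0;
        }
    }
    return n;
}

/* Hypothetical helper: number of ancestral nodes, counted as the number
 * of opening parentheses (one per parenthesis pair). */
int count_ancestral_nodes(const char *topo)
{
    int k = 0;
    for (; *topo != '\0'; topo++) {
        if (*topo == '(')
            k++;
    }
    return k;
}
```

     For the ribonuclease example, count_leaves gives N = 11 and count_ancestral_nodes gives 9, matching the node labels 12 through 20 (that is, N+1 through 2N-2).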

 
Computation
     To infer the ancestral sequences from the data file, type

c:\anc-gene\anc-gene filename

For example, to try the ribonuclease data, type

c:\anc-gene\anc-gene rnase.seq

     You will be asked to choose one of three search modes. Mode 1 is suggested when the number of sequences is <=8, mode 2 when it is >=9 and <=16, and mode 3 when it is >=17. When mode 3 is chosen, pathway reconstructions are not presented. When the number of pathways is very large (e.g., 10 million), modes 1 and 2 do not work, but mode 3 always works.
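
     The suggested thresholds can be written as a small C function. This is only an illustration of the rule of thumb given above; the function name suggested_search_mode is hypothetical and is not part of ANC-GENE. It looks only at the number of sequences, so it does not cover the case in which the number of pathways is too large for modes 1 and 2.

```c
#include <assert.h>

/* Hypothetical helper: search mode suggested in this manual for a given
 * number of present-day sequences. */
int suggested_search_mode(int num_sequences)
{
    if (num_sequences <= 8)
        return 1;   /* mode 1: up to 8 sequences */
    if (num_sequences <= 16)
        return 2;   /* mode 2: 9 to 16 sequences */
    return 3;       /* mode 3: 17 or more sequences */
}
```

     For the ribonuclease example (11 sequences), the function returns 2.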
     The Poisson-f and JTT-f models of amino acid substitution (see Zhang and Nei 1997) can be chosen for the inference of amino acids, and the Jukes-Cantor model is used for the inference of ancestral nucleotides given the amino acids. In the estimation of ancestral amino acids, branch lengths are estimated from protein distances (gamma distance with alpha=2.4). In the estimation of ancestral nucleotides, branch lengths are estimated from Jukes-Cantor-corrected synonymous distances.
 
Output file
     The output of anc-gene.exe is written to a file named "result". The ancestral sequences are presented in three different formats.
(1) Site by site and pathway by pathway. From this format, one can see the probability of a pathway (amino acids at all ancestral nodes) at a given site.
(2) Site by site and node by node. From this format, one can see the most likely amino acid and its probability for a given node at a given site.
(3) The entire sequence for each node. The average probability of the entire sequence is also given.
 
Limitations of the program
     The program is designed for inferring ancestral amino acid sequences. A program specifically for inferring ancestral protein coding nucleotide sequences is under development.
 
References
     Zhang J, Nei M (1997) Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. 44(Suppl. 1):S139-S146.

     Zhang J, Rosenberg HF, Nei M (1998) Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708-3713.



| Department of Biology  |  Eberly College of Science |
 
| Institute of Molecular Evolutionary Genetics | Penn State |
2002 The Pennsylvania State University
This page was last updated 6/10/09 by M. Ricardo.