


Software - Read me File

 

 
GZ-GAMMA: 
Estimation of the Expected Number of Substitutions at each Amino Acid (Nucleotide) Site and the Parameter for Rate Variation among Sites.
 
(c) Copyright December, 1997 by Jianzhi Zhang and the Pennsylvania State University. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed. It is distributed free of charge by:
 

Jianzhi Zhang
Institute of Molecular Evolutionary Genetics
Department of Biology
322 Mueller Laboratory
The Pennsylvania State University
University Park, PA 16802 USA
 

Current Address:
Associate Professor of Ecology and Evolutionary Biology
University of Michigan
Ann Arbor, MI
E-mail: jianzhi@umich.edu

Jianzhi Zhang Homepage

Xun Gu
Institute of Molecular Evolutionary Genetics
Department of Biology
322 Mueller Laboratory
The Pennsylvania State University
University Park, PA 16802, USA
 

Current Address:
Department of Zoology/Genetics
332 Science II Hall
Iowa State University
Ames, IA 5001
E-mail: xgu@iastate.edu

Xun Gu Homepage

 
Suggested citation:
     Gu, X. and J. Zhang (1997) A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14:1106-1113
 
Introduction
     GZ-gamma estimates the expected number of substitutions at each amino acid (nucleotide) site and the gamma shape parameter for rate variation among sites. It combines ancestral sequence inference with maximum likelihood estimation and requires that the phylogenetic relationships of the homologous sequences be known. The package contains two programs, both written in C: gz-aa.exe for amino acid sequences and gz-DNA.exe for DNA sequences. The programs run on IBM PC compatible computers with the Windows 95 or Windows NT operating system.
 
 Installation
     First make sure that the diskette you have received contains the following files.

    gz-aa.c       (source code)
    gz-DNA.c      (source code)
    gz-aa.exe     (executable file)
    gz-DNA.exe    (executable file)
    jtt.pro       (JTT substitution matrix, for amino acid sequences)
    atp6.aa       (an example data file for amino acid sequences)
    cox1.dna      (an example data file for DNA sequences)
    manual        (this file)
    alpha         (output file from running gz-aa.exe)

     To install GZ-gamma on your computer's hard disk drive (the "C" drive is used here as an example), first create a directory to hold the files of this package. At the c:\ prompt, type the following and press Enter:

    md GZ-gamma

     To copy the GZ-gamma files onto your hard disk drive, insert the floppy disk containing the programs into your floppy drive (the "A" drive is used here as an example). Then type the following command and press Enter:

    copy a:*.* c:\GZ-gamma\*.*
 
Input file
    
To use the program, you need one input file containing the amino acid (or nucleotide) sequences and the tree topology of these sequences (see atp6.aa for an example).  The first line of the file gives two numbers: the number of sequences and the number of amino acid or nucleotide sites (sequence length).  The second line is the name of the first sequence, the third line is the first sequence itself, and so on. Each sequence must occupy a single line with no line breaks. Only the capital letters for the 20 amino acids (or 4 nucleotides) are allowed in the sequences.  Gaps and any other symbols must be removed beforehand. The last line of the file is the tree topology of the sequences.  The tree format is the same as that used in the PHYLIP package (Felsenstein 1995). Note that the tree is unrooted, so trifurcation rather than bifurcation is required at the deepest branching node.  For example, the topology of the following tree can be expressed by

(((1,3),2),6,((4,7),(5,8)))
 

                             11 |----------- 1
                 10 |-----------|
         |----------|           |---------------- 3
         |          |------------------------ 2
         |               |----------------------------- 6
         |---------------|              |---------- 4
                       9 |     |--------|
                         |     |     13 |
                         |     |        |------ 7
                         |-----|     
                            12 |      |---- 5
                               |------|
                                   14 |----- 8


     Note that in the topology expression, the numbers refer to the order of the sequences given in the input file.
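
The topology rules above can be checked before running the program. The following is a minimal sketch, not code from GZ-gamma: the function names `parse_topology`, `leaves` and `check_unrooted` are hypothetical, and the check that every node other than the deepest one is bifurcating is an assumption read into the text rather than a documented requirement.

```python
def parse_topology(text):
    """Parse a topology such as '(((1,3),2),6,((4,7),(5,8)))' into nested tuples.

    Leaves are the 1-based sequence numbers from the input file.
    """
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1                          # consume '('
            children = [node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1                      # consume ','
                children.append(node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                          # consume ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if start == pos:
            raise ValueError(f"expected a sequence number at position {pos}")
        return int(text[start:pos])

    tree = node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """Return the sequence numbers in the order they appear in the topology."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def check_unrooted(tree, n_sequences):
    """Raise ValueError if the topology breaks the rules described above."""
    if isinstance(tree, int) or len(tree) != 3:
        raise ValueError("the deepest node must have exactly three branches")

    def check_inner(node):
        if isinstance(node, int):
            return
        if len(node) != 2:                    # assumption: bifurcating elsewhere
            raise ValueError("inner nodes are expected to be bifurcating")
        for child in node:
            check_inner(child)

    for child in tree:
        check_inner(child)
    if sorted(leaves(tree)) != list(range(1, n_sequences + 1)):
        raise ValueError("each sequence number must appear exactly once")
```

For the eight-sequence example above, `check_unrooted(parse_topology("(((1,3),2),6,((4,7),(5,8)))"), 8)` returns without raising.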
The tree of the atp6 and cox1 sequences in the example data files atp6.aa  and cox1.dna is 

(((1,2),((3,4),(5,6))),(7,8),9)
 

                |------------------1 mouse
          |-----|
      |---|     |------------------2 rat
      |   |
  |---|   |      |----------------3 human
  |   |   |   |--|
  |   |   |---|  |-------------4 gibbon
  |   |       |       |------------5 whale
  |   |       |-------|
  |   |               |------------6 cow
  |   |           |---------------- 7 opossum 
  |   |-----------|
  |               |---------------- 8 wallaroo
  |
  |---------------------------------- 9 platypus
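
The input file layout described in this section can be validated with a short script before the program is run. This is a sketch under stated assumptions, not the parser used by gz-aa or gz-DNA: the name `read_input` is hypothetical, the two numbers are assumed to share the first line, and blank lines are assumed to be ignorable.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 amino acid letters
NUCLEOTIDES = set("ACGT")                   # the 4 nucleotide letters


def read_input(lines, alphabet):
    """Return (names, sequences, topology) from the lines of an input file.

    `lines` is any iterable of strings, for example an open file object.
    """
    rows = [line.strip() for line in lines if line.strip()]
    header = rows[0].split()
    if len(header) != 2:
        raise ValueError("first line must give two numbers")
    n_seq, n_sites = int(header[0]), int(header[1])

    # One header line, a name line and a sequence line per sequence,
    # and a final topology line.
    if len(rows) != 2 * n_seq + 2:
        raise ValueError(f"expected {2 * n_seq + 2} non-blank lines, found {len(rows)}")

    names = rows[1:-1:2]
    sequences = rows[2:-1:2]
    for name, seq in zip(names, sequences):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: length {len(seq)}, expected {n_sites}")
        bad = set(seq) - alphabet
        if bad:
            raise ValueError(f"{name}: symbols not allowed: {sorted(bad)}")
    return names, sequences, rows[-1]
```

A call such as `read_input(open("atp6.aa"), AMINO_ACIDS)` would then report gaps, lowercase letters, or length mismatches before any computation starts.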
 
Computation
     Open an MS-DOS prompt window (Windows 95 or Windows NT). For amino acid sequences, type

    c:\GZ-gamma\gz-aa filename

or, for DNA sequences, type

    c:\GZ-gamma\gz-DNA filename

where filename is the name of the data file. For the atp6.aa data, for example, type

    c:\GZ-gamma\gz-aa atp6.aa

The detailed procedure for the computation is described in Gu and Zhang (1997). First, the ancestral sequence at each node is inferred by the fast Bayesian approach developed by Zhang and Nei (1997); the JTT-f model of amino acid substitution is used for amino acid sequences, and the Kimura two-parameter model is used for DNA sequences. Second, the expected number of substitutions at each site is estimated by the maximum likelihood approach under the Poisson model for amino acids and the Jukes-Cantor model for nucleotides. Third, the ML estimate of the gamma shape parameter (alpha) is obtained from the distribution of the expected number of substitutions. By contrast, the parsimony estimate of alpha is obtained from the distribution of the minimum required number of substitutions.
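
To give a feel for how a distribution of per-site substitution numbers carries information about alpha, here is a rough sketch that uses a method-of-moments estimator. It is not the ML procedure the program uses. It relies on a standard result: if substitutions at a site are Poisson and rates among sites follow a gamma distribution with shape alpha, the per-site numbers follow a negative binomial distribution with mean m and variance m + m^2/alpha, which gives alpha = m^2 / (V - m). The function name `alpha_by_moments` is hypothetical.

```python
from statistics import mean, pvariance


def alpha_by_moments(substitutions):
    """Moment estimate of the gamma shape parameter from per-site numbers.

    Returns infinity when the variance does not exceed the mean, which is
    the pattern expected when all sites evolve at the same rate.
    """
    values = [float(x) for x in substitutions]
    m = mean(values)
    v = pvariance(values)        # population variance; an arbitrary choice here
    if v <= m:
        return float("inf")
    return m * m / (v - m)
```

A small alpha (strong rate variation) shows up as a variance much larger than the mean, with many sites having no substitutions and a few having many.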
 
Output file
     The output of gz-aa.exe or gz-DNA.exe is written to a file named "alpha". The first line gives the estimate of the gamma shape parameter (alpha). In the lines that follow, the first column (#) gives the position number of each amino acid (nucleotide) site; the second column (m') gives the minimum required number of substitutions inferred by the conventional parsimony method (Fitch 1971); the third column (m) gives the minimum required number of substitutions inferred by the method of Zhang and Nei (1997); and the fourth column (k) gives the expected number of substitutions estimated by the method of Gu and Zhang (1997), which is used for estimating alpha.
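
As an illustration of what the m' column represents, the sketch below counts the minimum required number of substitutions at one site with Fitch's set-intersection rule. It is written for this manual and is not taken from the program source. The name `fitch_count` is hypothetical, and the tree is assumed to be given as nested tuples of 1-based sequence numbers, trifurcating at the deepest node and bifurcating elsewhere.

```python
def fitch_count(tree, states):
    """Minimum number of substitutions at one site (Fitch parsimony).

    tree   -- nested tuples, e.g. (((1, 3), 2), 6, ((4, 7), (5, 8)))
    states -- the residue of each sequence at this site, in input-file order
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, int):
            return {states[node - 1]}
        # Combine the children one at a time. For the trifurcating deepest
        # node this treats (A, B, C) as ((A, B), C), which leaves the count
        # unchanged because the parsimony length does not depend on where
        # an unrooted tree is rooted.
        result = state_set(node[0])
        for child in node[1:]:
            other = state_set(child)
            common = result & other
            if common:
                result = common
            else:
                result = result | other
                changes += 1      # an empty intersection costs one substitution
        return result

    state_set(tree)
    return changes
```

For a site where only sequence 3 differs from the rest, the count is 1. An invariant site gives 0.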
 
Usefulness
     The program provides two results: the estimate of the gamma shape parameter (alpha) for rate variation among sites, and the expected number of substitutions at each amino acid (or nucleotide) site. These results are useful in the following areas of molecular evolutionary analysis.

   (1) Distance estimation
   (2) Divergence time dating between genes and species
   (3) Phylogenetic reconstruction
       The estimate of alpha helps rule out the possibility that the inferred phylogenetic tree is misleading because rate variation among sites was neglected. The following iteration is suggested: first, estimate alpha with the current program, using the tree reconstructed under the assumption of a uniform rate among sites. Second, recompute the distance matrix, taking the gamma distribution of rates among sites into account, and infer the phylogenetic tree again.
   (4) Profile of rate variability among sites
       The output file (alpha) can be read by most commercially available software (e.g., Excel), so the profile of rate variability among sites can be presented graphically by plotting k against site position.
   (5) Comparison of evolutionary rates between different regions (domains)
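
For items (1) and (3), the second step of the suggested iteration needs a distance that accounts for gamma-distributed rates. The sketch below uses the standard gamma distances, d = alpha[(1 - p)^(-1/alpha) - 1] under the Poisson model for amino acids and d = (3/4) alpha[(1 - 4p/3)^(-1/alpha) - 1] under the Jukes-Cantor model for nucleotides. Using these two models for the distance step is an assumption made for illustration, and the function names are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def gamma_distance(p, alpha, b=1.0):
    """Gamma distance d = b * alpha * ((1 - p/b) ** (-1/alpha) - 1).

    b = 1.0  gives the Poisson gamma distance (amino acids);
    b = 0.75 gives the Jukes-Cantor gamma distance (nucleotides).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if p >= b:
        raise ValueError("p is too large for this model (saturation)")
    return b * alpha * ((1.0 - p / b) ** (-1.0 / alpha) - 1.0)
```

With alpha taken from the first line of the output file, `gamma_distance(p_distance(s1, s2), alpha)` can be computed for each pair of sequences to rebuild the distance matrix. As alpha grows large, the value approaches the uniform-rate distance.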
 
References
     Gu, X. and J. Zhang (1997) A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14:1106-1113

2002 The Pennsylvania State University
This page was last updated 6/10/09 by M. Ricardo.