Home | CV | Databases | IMEG Seminars | Journals
MEP-online | People | Publications | SoftwareText only version

Software - Readme File


Program package for inferring and testing minimum evolution trees     Version 1.2
(c) Copyright 1993 by Andrey Rzhetsky, Masatoshi Nei, and the Pennsylvania State University.  Permission is granted to copy this document provided that no fee is charged for it and that this statement is not removed. 

Andrey Rzhetsky
Institute of Molecular Evolutionary Genetics
The Pennsylvania State University
328 Mueller Laboratory
University Park, PA 16802

Current Address:

Professor of Medicine and Human Genetics
Computation Institute and
Institute for Genomics and Systems Biology
University of Chicago
920 East 58th Street, CLSC 400A
Chicago, IL 60637

Email: arzhetsk@medicine.bsd.uchicago.edu

Andrey Rzhetsky Homepage

     This program package is intended to find the minimum evolution (ME) tree that has the smallest value of the sum of branch lengths (S) for a set of  sequences, identify a set of trees whose S are not significantly different from that of the ME tree, and print out the resulting trees in a publishable form. The branch lengths for each topology are estimated by the ordinary least-squares method from a matrix of evolutionary distances.  The algorithms for finding and testing the minimum evolution tree have been described by Rzhetsky and Nei (1992; 1993).
     Neighbor-joining (NJ) tree.  Phylogenetic tree obtained by Saitou and Nei's (1987) method. 
Temporary minimum evolution tree.  In the search for the ME tree the program may identify a tree with the minimum S value for the subset of topologies.  Until a new tree with a smaller S value is found, this tree is called "a temporary ME tree". 
S  or S(T).  The sum of branch lengths for the temporary ME tree computed with the ordinary least squares method. 
D, sd(D).  D is the difference S - S , where S is the sum of branch lengths computed for a topology under consideration, and sd(D) is the standard error of D. 
Topological distance.  Topological distance between two trees as measured by the "partition metric" (see Penny and Hendy 1985;  Rzhetsky and Nei 1992).
Files included

TETRAPOD.SET - an example of input file;
ME_TREE.EXE, LS_2.EXE, INVERT.EXE - a set of tree-making programs; TREEVIEW.EXE, TREESHOW.EXE, TREEVIEW.DOC - programs for visualizing trees and a manual for the TREEVIEW program.  These two programs were written and  kindly provided by Koichiro Tamura. 
Input file
     If your input data are in a form of distance matrix
The input file with the distance matrix should have name extension "DIS" (as "CRAB.DIS" or "TETRAPOD.DIS") and contain the following information. 

LINE #1  Title 
LINE #2  The number of sequences, blank symbol, sequence length (number of nucleotides or amino acids used per sequence) 
LINE #3  First species name 
LINE #4  Second species name 
LINE #(N+2)  Last species name
LINE #(N+3)  Distance between species 2 and 1, comma
LINE #(N+4)  Distances between species 3 and 1 (followed by comma), and species 3 and 2 (followed by comma), respectively.
LINE #(2N+2) All distances between species N and all other species (each distance followed by comma). 
See example file "TETRAPOD.DIS".

If your input data are sequences, the input file should have name extension 
"SET" (as "TETRAPOD.SET") and contain the following information. 

LINE #1  Title
LINE #2  The number of sequences, blank symbol, sequence length
LINE #3  First sequence name
LINE #4  Second sequence name
LINE #(N+2)  Last sequence name
LINE #(N+3)  Sequences start from this line in an arbitrary format (end-of-line and blank symbols are ignored).  Sequences could be either in upper case or lower case letters.  For nucleotide sequences, "-" stands for gap, "N", "*", or "?" stand for ambiguous or missing data.  For amino acid sequences the use of "N" for ambiguous data is not allowed. 
See example file "TETRAPOD.SET". 

Starting the program
     Make directory "ME" (or with any other name) on your hard drive.  Copy all files from the diskette into this directory.  Arrange your sequence data into "SET"-file as described above.  Then, type ME_TREE, strike "Enter" and respond to queries. 
     If you select the first line of the menu, program LS_2.EXE will be started.  Selection of the second and third lines invokes TREEVIEW.EXE and TREESHOW.EXE programs, respectively (see the enclosed manual on TREEVIEW.EXE written by Koichiro Tamura). 
     If you choose the first item of the first menu, a prompt appears that allows you to choose either distance matrix data or sequence data input file type (see "Input format" section above).  If you make this decision, another menu appears. 
     Computing NJ-tree with standard errors of branch lengths.  This procedure allows you to compute a NJ-tree and then estimate branch lengths with the ordinary least squares method.  You can repeat this procedure several times for different distance measures.  The trees produced will be saved in different files with name extension "TRE".  Note that the topology of the NJ tree is often identical with that of the ME tree unless the number of sequences is very large (Rzhetsky and Nei 1992, 1993). 
     Quick search for ME tree.  To speed up the search, no covariance matrix and no standard errors of branch lengths are computed in this case.  The program simply examines different tree topologies starting from the NJ-tree with the aim of finding the ME tree.  Only temporary and the final ME trees are saved. 
     Search for ME tree (computing s.e.'s of branch lengths).  In this case the standard errors of branch lengths are computed for each temporary ME tree.  This may slow down the search considerably if the number of temporary ME trees is large.  With respect to other aspects, this option is identical with the previous one. 
     Search for ME tree AND alternative trees.  In this case sd(D) is computed for each alternative topology tested.  If the value of D is not significantly different from 0 (with default significance level 95%), the alternative tree is saved into "TRE"-file. 
     Calculate standard errors of branch lengths for the final ME tree.  If you find the ME tree with the Quick search procedure, you may use this option to compute the standard errors of branch lengths.  This option also can be used to recompute branch lengths for the final ME tree using different distance measures. 
Generating alternative trees
      There are two ways of finding alternative trees in this package.  One is to use topological distance as described by Rzhetsky and Nei (1992; 1993) and the other is to use a bootstrap method. 
     In the case of the "topological distance" method all trees with d  = 2 and d = 4 in the neighborhood of the temporary ME tree are examined.  If one of these trees has an S value smaller than those for the temporary ME tree, this tree becomes the next temporary ME tree.  The process is repeated until no trees with a smaller S values are found for the trees with d  = 2 and 4 in the neighborhood of the temporary ME tree. 
      In the case of the bootstrap method, it is used as a tool to generate tree topologies around the temporary minimum evolution (ME) tree rather than to study the reliability of a particular topology.  The S value for each bootstrap-generated topology is computed by using the entire original data set, and this S is compared with that of the temporary ME tree. 
     Our bootstrap method is different from the standard one and uses a scaling factor, f, to increase or decrease the expected topological distance between the temporary ME tree and a bootstrap-generated topology.  This f may be either < 1 or > 1.  Suppose that we need to analyze a set of sequences where M homologous sites are available for comparison.  In our bootstrap method one should just pick up MWf sites at random (where some sites could be sampled more than once) and compute the distance matrix between the sequences using only these MWf sites.  Then the resulting distance matrix is used to construct NJ tree (see Saitou and Nei 1987).  Obviously, when f = 1 we have the standard bootstrap method.  When f < 1, the distance matrix obtained tends to be more erratic than in the case of the standard bootstrap.  Thus, bootstrap-generated trees tend to have a larger average topological distance from the ME tree as f becomes smaller.  In contrast, when f > 1, the distances between the sequences tend to be more stable than in the usual bootstrapping.  Therefore, a set of topologies closer to the temporary ME tree are examined.
     Depending on the data set analyzed, it might be reasonable to choose either f > 1 or f < 1.  For a relatively small number of sequences, the use of bootstrap with a small f may generate virtually all plausible topologies for a given data set and find all topologies that are "good candidates" for being the true tree.  For a large number of short sequences (that usually provide very unstable distance matrix) the bootstrap technique with f > 1 may facilitate the search for the ME tree (which is typically located in a close topological neighborhood of the NJ tree or coincides with the NJ tree).
Output file
     After using the programs you may find a number of new files in your directory.  The following is a brief description of these files. 

     "TRE"-files (For example, "TETR0000.TRE" or "TETR0001.TRE".)  A file with this name extension contains detailed information about one of the trees.  "TRE"-file has the following format. 

1)  The number of sequences.
2)  Sequence names.
3)  Tree topology and branch length estimates. Normally, the standard errors and significance levels of branch lengths are also included.  In the case of  "Quick search for the ME tree", however, these quantities are not computed.  In this case zero standard errors are written in the file. 
4)  Distance measure used for computing evolutionary distances. 
5)  Sum of branch lengths, S, (for temporary ME tree) or D, sd(D) and the significance level according to the Z-test. 
6)  Tree topology defined by partition codes.  "TRE"-files are used as input files for programs TREEVIEW and TREESHOW and are not intended for manual editing.  You may need to look through these files only to find the S value for a particular tree or the distance measure used. 

     Files with name extensions "000", "002", "004", "006", etc.  These files contain a list of topologies already examined.  For example, file "TETRAPOD.000" contains the topology of the temporary ME tree for the "TETRAPOD.SET" data file, "TETRAPOD.006" contains the list of all topologies that have topological distance 6 from the temporary ME tree.  The topologies of the trees are stored in "partition codes" and are not intended for manual editing. 
     File "LS.TMP" contains the variance-covariance matrix in binary format. 
In files with name extension "ME" (for example, "TETRAPOD.ME") the topology of the temporary ME tree is stored.
     Jukes, T. H. and C.R. Cantor (1969) Evolution of protein molecules. Pp. 21-132  in H. M. Munro, ed. Mammalian protein metabolism.  Academic Press, New York.  

     Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.  

     Penny, D., and M. D. Hendy (1985) The use of tree comparison metrics. Syst. Zool. 34:75-82.  

     Rzhetsky, A. and M. Nei (1992) A simple method for estimating and testing minimum-evolution trees.  Mol. Biol. Evol. 9:945-967.  

     Rzhetsky, A. and M. Nei (1993) Theoretical foundation of the minimum evolution method of phylogenetic inference. Mol. Biol. Evol.  

     Saitou, N. and M. Nei (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
System requirements and "license agreement"
     Programs ME_TREE, LS_2, and INVERT were written for IBM-compatible computers with Intel 80286 processor and 80287 co-processor (or more advanced).  The programming language used is Borland Turbo C (Version 2.0).  The minimum system requirements for the package are dictated by size of the data set that one intends to study.  For a small number of sequences (up to 15) a relatively slow AT-type computer can be used.  However, our programs are able to handle large data sets (more than 100 sequences).  In the latter case a quicker computer (with CPU 386/387 or 486) with a sufficiently large hard drive (more than 40 Mb) is recommended.  Generally, we don't advise to start these programs from a floppy disk, even for a small data set as computation becomes extremely slow and there is always some danger of running out of disk space.  The minimum size of RAM required is 640K. 

     You are free to distribute this program package as long as no fee is charged. 
     We thank Koichiro Tamura for providing programs TREEVIEW and TREESHOW. We would appreciate any comments on the performance and possible improvements of the programs from the users.  

 Home | CV | Databases | IMEG Seminars | Journals
MEP-online | People | Publications | SoftwareText only version

| Department of Biology  |  Eberly College of Science |
| Institute of Molecular Evolutionary Genetics | Penn State |
2002 The Pennsylvania State University
This page was last updated 6/10/09 by M. Ricardo.