fdnadist 
Please help by correcting and extending the Wiki pages.
The program reads in nucleotide sequences and writes an output file containing the distance matrix, or else a table of similarity between sequences. The four models of nucleotide substitution are those of Jukes and Cantor (1969), Kimura (1980), the F84 model (Kishino and Hasegawa, 1989; Felsenstein and Churchill, 1996), and the model underlying the LogDet distance (Barry and Hartigan, 1987; Lake, 1994; Steel, 1994; Lockhart et. al., 1994). All except the LogDet distance can be made to allow for for unequal rates of substitution at different sites, as Jin and Nei (1990) did for the JukesCantor model. The program correctly takes into account a variety of sequence ambiguities, although in cases where they exist it can be slow.
Jukes and Cantor's (1969) model assumes that there is independent change at all sites, with equal probability. Whether a base changes is independent of its identity, and when it changes there is an equal probability of ending up with each of the other three bases. Thus the transition probability matrix (this is a technical term from probability theory and has nothing to do with transitions as opposed to transversions) for a short period of time dt is:
To: A G C T  A  13a a a a From: G  a 13a a a C  a a 13a a T  a a a 13a
where a is u dt, the product of the rate of substitution per unit time (u) and the length dt of the time interval. For longer periods of time this implies that the probability that two sequences will differ at a given site is:
p = 3/4 ( 1  e 4/3 u t)
and hence that if we observe p, we can compute an estimate of the branch length ut by inverting this to get
ut =  3/4 loge ( 1  4/3 p )
The Kimura "2parameter" model is almost as symmetric as this, but allows for a difference between transition and transversion rates. Its transition probability matrix for a short interval of time is:
To: A G C T  A  1a2b a b b From: G  a 1a2b b b C  b b 1a2b a T  b b a 1a2b
where a is u dt, the product of the rate of transitions per unit time and dt is the length dt of the time interval, and b is v dt, the product of half the rate of transversions (i.e., the rate of a specific transversion) and the length dt of the time interval.
The F84 model incorporates different rates of transition and transversion, but also allowing for different frequencies of the four nucleotides. It is the model which is used in DNAML, the maximum likelihood nucelotide sequence phylogenies program in this package. You will find the model described in the document for that program. The transition probabilities for this model are given by Kishino and Hasegawa (1989), and further explained in a paper by me and Gary Churchill (1996).
The LogDet distance allows a fairly general model of substitution. It computes the distance from the determinant of the empirically observed matrix of joint probabilities of nucleotides in the two species. An explanation of it is available in the chapter by Swofford et, al. (1996).
The first three models are closely related. The DNAML model reduces to Kimura's twoparameter model if we assume that the equilibrium frequencies of the four bases are equal. The JukesCantor model in turn is a special case of the Kimura 2parameter model where a = b. Thus each model is a special case of the ones that follow it, JukesCantor being a special case of both of the others.
The Jin and Nei (1990) correction for variation in rate of evolution from site to site can be adapted to all of the first three models. It assumes that the rate of substitution varies from site to site according to a gamma distribution, with a coefficient of variation that is specified by the user. The user is asked for it when choosing this option in the menu.
Each distance that is calculated is an estimate, from that particular pair of species, of the divergence time between those two species. For the Jukes Cantor model, the estimate is computed using the formula for ut given above, as long as the nucleotide symbols in the two sequences are all either A, C, G, T, U, N, X, ?, or  (the latter four indicate a deletion or an unknown nucleotide. This estimate is a maximum likelihood estimate for that model. For the Kimura 2parameter model, with only these nucleotide symbols, formulas special to that estimate are also computed. These are also, in effect, computing the maximum likelihood estimate for that model. In the Kimura case it depends on the observed sequences only through the sequence length and the observed number of transition and transversion differences between those two sequences. The calculation in that case is a maximum likelihood estimate and will differ somewhat from the estimate obtained from the formulas in Kimura's original paper. That formula was also a maximum likelihood estimate, but with the transition/transversion ratio estimated empirically, separately for each pair of sequences. In the present case, one overall preset transition/transversion ratio is used which makes the computations harder but achieves greater consistency between different comparisons.
For the F84 model, or for any of the models where one or both sequences contain at least one of the other ambiguity codons such as Y, R, etc., a maximum likelihood calculation is also done using code which was originally written for DNAML. Its disadvantage is that it is slow. The resulting distance is in effect a maximum likelihood estimate of the divergence time (total branch length between) the two sequences. However the present program will be much faster than versions earlier than 3.5, because I have speeded up the iterations.
The LogDet model computes the distance from the determinant of the matrix of cooccurrence of nucleotides in the two species, according to the formula
D =  1/4(loge(F)  1/2loge(fA1fC1fG1fT1fA2fC2fG2fT2))
Where F is a matrix whose (i,j) element is the fraction of sites at which base i occurs in one species and base j occurs in the other. fji is the fraction of sites at which species i has base j. The LogDet distance cannot cope with ambiguity codes. It must have completely defined sequences. One limitation of the LogDet distance is that it may be infinite sometimes, if there are too many changes between certain pairs of nucleotides. This can be particularly noticeable with distances computed from bootstrapped sequences. Note that there is an assumption that we are looking at all sites, including those that have not changed at all. It is important not to restrict attention to some sites based on whether or not they have changed; doing that would bias the distances by making them too large, and that in turn would cause the distances to misinterpret the meaning of those sites that had changed.
For all of these distance methods, the program allows us to specify that "third position" bases have a different rate of substitution than first and second positions, that introns have a different rate than exons, and so on. The Categories option which does this allows us to make up to 9 categories of sites and specify different rates of change for them.
In addition to the four distance calculations, the program can also compute a table of similarities between nucleotide sequences. These values are the fractions of sites identical between the sequences. The diagonal values are 1.0000. No attempt is made to count similarity of nonidentical nucleotides, so that no credit is given for having (for example) different purines at corresponding sites in the two sequences. This option has been requested by many users, who need it for descriptive purposes. It is not intended that the table be used for inferring the tree.
% fdnadist Nucleic acid sequence distance matrix program Input (aligned) nucleotide sequence set(s): dnadist.dat Distance methods f : F84 distance model k : Kimura 2parameter distance j : JukesCantor distance l : LogDet distance s : Similarity table Choose the method to use [F84 distance model]: Phylip distance matrix output file [dnadist.fdnadist]: Distances calculated for species Alpha .... Beta ... Gamma .. Delta . Epsilon Distances written to file "dnadist.fdnadist" Done. 
Go to the input files for this example
Go to the output files for this example
Nucleic acid sequence distance matrix program Version: EMBOSS:6.6.0.0 Standard (Mandatory) qualifiers: [sequence] seqsetall File containing one or more sequence alignments method menu [F84 distance model] Choose the method to use (Values: f (F84 distance model); k (Kimura 2parameter distance); j (JukesCantor distance); l (LogDet distance); s (Similarity table)) [outfile] outfile [*.fdnadist] Phylip distance matrix output file Additional (Optional) qualifiers (* if not always prompted): * gammatype menu [No distribution parameters used] Gamma distribution (Values: g (Gamma distributed rates); i (Gamma+invariant sites); n (No distribution parameters used)) * ncategories integer [1] Number of substitution rate categories (Integer from 1 to 9) * rate array [1.0] Category rates * categories properties File of substitution rate categories weights properties Weights file * gammacoefficient float [1] Coefficient of variation of substitution rate among sites (Number 0.001 or more) * invarfrac float [0.0] Fraction of invariant sites (Number from 0.000 to 1.000) * ttratio float [2.0] Transition/transversion ratio (Number 0.001 or more) * [no]freqsfrom toggle [Y] Use empirical base frequencies from seqeunce input * basefreq array [0.25 0.25 0.25 0.25] Base frequencies for A C G T/U (use blanks to separate) lower boolean [N] Output as a lower triangular distance matrix humanreadable boolean [@($(method)==s?Y:N)] Output as a humanreadable distance matrix printdata boolean [N] Print data at start of run [no]progress boolean [Y] Print indications of progress of run Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "sequence" associated qualifiers sbegin1 integer Start of each sequence to be used send1 integer End of each sequence to be used sreverse1 boolean Reverse (if DNA) sask1 boolean Ask for begin/end/reverse snucleotide1 boolean Sequence is nucleotide sprotein1 boolean Sequence is protein slower1 boolean Make lower case supper1 boolean Make upper case scircular1 boolean Sequence is circular squick1 boolean Read id and sequence only sformat1 string Input sequence format iquery1 string Input query fields or ID list ioffset1 integer Input start position offset sdbname1 string Database name sid1 string Entryname ufo1 string UFO features fformat1 string Features format fopenfile1 string Features file name "outfile" associated qualifiers odirectory2 string Output directory General qualifiers: auto boolean Turn off prompts stdout boolean Write first file to standard output filter boolean Read first file from standard input, write first file to standard output options boolean Prompt for standard and additional values debug boolean Write debug output to program.dbg verbose boolean Report some/full command line options help boolean Report command line options and exit. More information on associated and general qualifiers can be found with help verbose warning boolean Report warnings error boolean Report errors fatal boolean Report fatal errors die boolean Report dying program messages version boolean Report version number and exit 
Qualifier  Type  Description  Allowed values  Default  

Standard (Mandatory) qualifiers  
[sequence] (Parameter 1) 
seqsetall  File containing one or more sequence alignments  Readable sets of sequences  Required  
method  list  Choose the method to use 

F84 distance model  
[outfile] (Parameter 2) 
outfile  Phylip distance matrix output file  Output file  <*>.fdnadist  
Additional (Optional) qualifiers  
gammatype  list  Gamma distribution 

No distribution parameters used  
ncategories  integer  Number of substitution rate categories  Integer from 1 to 9  1  
rate  array  Category rates  List of floating point numbers  1.0  
categories  properties  File of substitution rate categories  Property value(s)  
weights  properties  Weights file  Property value(s)  
gammacoefficient  float  Coefficient of variation of substitution rate among sites  Number 0.001 or more  1  
invarfrac  float  Fraction of invariant sites  Number from 0.000 to 1.000  0.0  
ttratio  float  Transition/transversion ratio  Number 0.001 or more  2.0  
[no]freqsfrom  toggle  Use empirical base frequencies from seqeunce input  Toggle value Yes/No  Yes  
basefreq  array  Base frequencies for A C G T/U (use blanks to separate)  List of floating point numbers  0.25 0.25 0.25 0.25  
lower  boolean  Output as a lower triangular distance matrix  Boolean value Yes/No  No  
humanreadable  boolean  Output as a humanreadable distance matrix  Boolean value Yes/No  @($(method)==s?Y:N)  
printdata  boolean  Print data at start of run  Boolean value Yes/No  No  
[no]progress  boolean  Print indications of progress of run  Boolean value Yes/No  Yes  
Advanced (Unprompted) qualifiers  
(none)  
Associated qualifiers  
"sequence" associated seqsetall qualifiers  
sbegin1 sbegin_sequence 
integer  Start of each sequence to be used  Any integer value  0  
send1 send_sequence 
integer  End of each sequence to be used  Any integer value  0  
sreverse1 sreverse_sequence 
boolean  Reverse (if DNA)  Boolean value Yes/No  N  
sask1 sask_sequence 
boolean  Ask for begin/end/reverse  Boolean value Yes/No  N  
snucleotide1 snucleotide_sequence 
boolean  Sequence is nucleotide  Boolean value Yes/No  N  
sprotein1 sprotein_sequence 
boolean  Sequence is protein  Boolean value Yes/No  N  
slower1 slower_sequence 
boolean  Make lower case  Boolean value Yes/No  N  
supper1 supper_sequence 
boolean  Make upper case  Boolean value Yes/No  N  
scircular1 scircular_sequence 
boolean  Sequence is circular  Boolean value Yes/No  N  
squick1 squick_sequence 
boolean  Read id and sequence only  Boolean value Yes/No  N  
sformat1 sformat_sequence 
string  Input sequence format  Any string  
iquery1 iquery_sequence 
string  Input query fields or ID list  Any string  
ioffset1 ioffset_sequence 
integer  Input start position offset  Any integer value  0  
sdbname1 sdbname_sequence 
string  Database name  Any string  
sid1 sid_sequence 
string  Entryname  Any string  
ufo1 ufo_sequence 
string  UFO features  Any string  
fformat1 fformat_sequence 
string  Features format  Any string  
fopenfile1 fopenfile_sequence 
string  Features file name  Any string  
"outfile" associated outfile qualifiers  
odirectory2 odirectory_outfile 
string  Output directory  Any string  
General qualifiers  
auto  boolean  Turn off prompts  Boolean value Yes/No  N  
stdout  boolean  Write first file to standard output  Boolean value Yes/No  N  
filter  boolean  Read first file from standard input, write first file to standard output  Boolean value Yes/No  N  
options  boolean  Prompt for standard and additional values  Boolean value Yes/No  N  
debug  boolean  Write debug output to program.dbg  Boolean value Yes/No  N  
verbose  boolean  Report some/full command line options  Boolean value Yes/No  Y  
help  boolean  Report command line options and exit. More information on associated and general qualifiers can be found with help verbose  Boolean value Yes/No  N  
warning  boolean  Report warnings  Boolean value Yes/No  Y  
error  boolean  Report errors  Boolean value Yes/No  Y  
fatal  boolean  Report fatal errors  Boolean value Yes/No  Y  
die  boolean  Report dying program messages  Boolean value Yes/No  Y  
version  boolean  Report version number and exit  Boolean value Yes/No  N 
5 13 Alpha AACGTGGCCACAT Beta AAGGTCGCCACAC Gamma CAGTTCGCCACAA Delta GAGATTTCCGCCT Epsilon GAGATCTCCGCCC 
If the option to print out the data is selected, the output file will precede the data by more complete information on the input and the menu selections. The output file begins by giving the number of species and the number of characters, and the identity of the distance measure that is being used.
If the C (Categories) option is used a table of the relative rates of expected substitution at each category of sites is printed, and a listing of the categories each site is in.
There will then follow the equilibrium frequencies of the four bases. If the JukesCantor or Kimura distances are used, these will necessarily be 0.25 : 0.25 : 0.25 : 0.25. The output then shows the transition/transversion ratio that was specified or used by default. In the case of the JukesCantor distance this will always be 0.5. The transitiontransversion parameter (as opposed to the ratio) is also printed out: this is used within the program and can be ignored. There then follow the data sequences, with the base sequences printed in groups of ten bases along the lines of the Genbank and EMBL formats.
The distances printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0 if there are multiple categories of sites. This means that whether or not there are multiple categories of sites, the expected fraction of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes may occur in the same site and overlie or even reverse each other. The branch lengths estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the nucleotide sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.
One problem that can arise is that two or more of the species can be so dissimilar that the distance between them would have to be infinite, as the likelihood rises indefinitely as the estimated divergence time increases. For example, with the JukesCantor model, if the two sequences differ in 75% or more of their positions then the estimate of dovergence time would be infinite. Since there is no way to represent an infinite distance in the output file, the program regards this as an error, issues an error message indicating which pair of species are causing the problem, and stops. It might be that, had it continued running, it would have also run into the same problem with other pairs of species. If the Kimura distance is being used there may be no error message; the program may simply give a large distance value (it is iterating towards infinity and the value is just where the iteration stopped). Likewise some maximum likelihood estimates may also become large for the same reason (the sequences showing more divergence than is expected even with infinite branch length). I hope in the future to add more warning messages that would alert the user the this.
If the similarity table is selected, the table that is produced is not in a format that can be used as input to the distance matrix programs. it has a heading, and the species names are also put at the tops of the columns of the table (or rather, the first 8 characters of each species name is there, the other two characters omitted to save space). There is not an option to put the table into a format that can be read by the distance matrix programs, nor is there one to make it into a table of fractions of difference by subtracting the similarity values from 1. This is done deliberately to make it more difficult for the use to use these values to construct trees. The similarity values are not corrected for multiple changes, and their use to construct trees (even after converting them to fractions of difference) would be wrong, as it would lead to severe conflict between the distant pairs of sequences and the close pairs of sequences.
5 Alpha 0.000000 0.303900 0.857544 1.158927 1.542899 Beta 0.303900 0.000000 0.339727 0.913522 0.619671 Gamma 0.857544 0.339727 0.000000 1.631729 1.293713 Delta 1.158927 0.913522 1.631729 0.000000 0.165882 Epsilon 1.542899 0.619671 1.293713 0.165882 0.000000 
Program name  Description 

distmat  Create a distance matrix from a multiple sequence alignment 
ednacomp  DNA compatibility algorithm 
ednadist  Nucleic acid sequence distance matrix program 
ednainvar  Nucleic acid sequence invariants method 
ednaml  Phylogenies from nucleic acid maximum likelihood 
ednamlk  Phylogenies from nucleic acid maximum likelihood with clock 
ednapars  DNA parsimony algorithm 
ednapenny  Penny algorithm for DNA 
eprotdist  Protein distance algorithm 
eprotpars  Protein parsimony algorithm 
erestml  Restriction site maximum likelihood method 
eseqboot  Bootstrapped sequences algorithm 
fdiscboot  Bootstrapped discrete sites algorithm 
fdnacomp  DNA compatibility algorithm 
fdnainvar  Nucleic acid sequence invariants method 
fdnaml  Estimate nucleotide phylogeny by maximum likelihood 
fdnamlk  Estimates nucleotide phylogeny by maximum likelihood 
fdnamove  Interactive DNA parsimony 
fdnapars  DNA parsimony algorithm 
fdnapenny  Penny algorithm for DNA 
fdolmove  Interactive Dollo or polymorphism parsimony 
ffreqboot  Bootstrapped genetic frequencies algorithm 
fproml  Protein phylogeny by maximum likelihood 
fpromlk  Protein phylogeny by maximum likelihood 
fprotdist  Protein distance algorithm 
fprotpars  Protein parsimony algorithm 
frestboot  Bootstrapped restriction sites algorithm 
frestdist  Calculate distance matrix from restriction sites or fragments 
frestml  Restriction site maximum likelihood method 
fseqboot  Bootstrapped sequences algorithm 
fseqbootall  Bootstrapped sequences algorithm 
Please report all bugs to the EMBOSS bug team (embossbug © emboss.openbio.org) not to the original author.
Converted (August 2004) to an EMBASSY program by the EMBOSS team.
None