LineageEvolver - Usage

A more usable LineageEvolver front-end is currently under development. Until it is released, running LineageEvolver is more an exercise in program development and modification than in software usage. In short, LineageEvolver is not yet ready for the faint of heart: expect to do some Java editing in order to customize LineageEvolver parameters.

For those of you simply looking to run a generic evolution simulation on an input genome using default parameters (see 'Defaults' below for details), there is a much simpler solution that requires no programming experience whatsoever. The Enhanced LineageEvolver Configuration Tool, or ELECT, can be used to generate input suitable for LineageEvolver. Currently, ELECT is only capable of specifying root genome data; all other inputs to LineageEvolver are left as defaults.

Note: This usage guide assumes basic shell/terminal knowledge.

Using ELECT to Generate Input

Because ELECT is web-based and makes very few assumptions about the browsing environment (no JavaScript, no cookies, etc.), it should be usable from virtually any internet-capable computer.

To use ELECT, choose a unique name for your genome to begin editing (there is no password protection, so making your genome's name unique is important). Adding or removing genes and editing environmental variables is as easy as filling out the necessary forms. Once the genome has been configured to your satisfaction, output the genome using the form at the bottom of the "Genome Details" page, and save the output to disk. This is the file to be used as input for LineageEvolver.

In summary, the following steps are necessary to generate input suitable for running LineageEvolver:

  1. Choose a unique name for the new genome.
  2. Add genes to the new genome.
  3. Set the environmental variables for the new genome.
  4. Output the new genome, saving it as a file on your computer.

Please note that ELECT is still a beta product and is far from complete. Currently, ELECT can specify only root genome data, namely gene amino acid sequences with among-site rate variation (ASRV) data, and environmental variables.

Manually Providing Input

Coming soon...

Running LineageEvolver

Once a suitable configuration file has been generated and saved on your computer (such as this sample configuration file), LineageEvolver can be run by executing 'java -jar /path/to/LineageEvolver-version.jar /path/to/ELECT-your-genome-name.xml' in a terminal. Replace '/path/to/LineageEvolver-version.jar' with the filesystem path to your downloaded copy of the LineageEvolver jar file, and '/path/to/ELECT-your-genome-name.xml' with the filesystem path to your ELECT-generated configuration file. All LineageEvolver output is displayed within the terminal (and should look something like this sample output). You may want to redirect the output to a file using your shell's output redirection operator ('>'). The following example invocation redirects all output to the 'LineageEvolver.out.txt' file.

$ java \
    -jar /usr/local/bin/LineageEvolver-0.9_beta1.jar \
    /home/john/ELECT-Ecoli\ Sample.xml \
    > /home/john/LineageEvolver.out.txt
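
The invocation above can also be wrapped in a small shell function. The following is only a sketch: the function name run_lineage_evolver and its three arguments are made up for this example and are not part of LineageEvolver. It checks that both input files exist before starting Java, and it uses tee so that the output is shown in the terminal and saved to a file at the same time.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with LineageEvolver): validate the inputs,
# then run the jar, showing the output in the terminal and saving a copy.
run_lineage_evolver() {
    local jar="$1" config="$2" outfile="$3"

    if [ ! -f "$jar" ]; then
        echo "Jar file not found: $jar" >&2
        return 1
    fi
    if [ ! -f "$config" ]; then
        echo "Configuration file not found: $config" >&2
        return 1
    fi

    # Quoting the variables keeps paths that contain spaces intact.
    # The return status is that of tee unless 'set -o pipefail' is enabled.
    java -jar "$jar" "$config" | tee "$outfile"
}

# Example call, using the paths from the invocation above:
# run_lineage_evolver /usr/local/bin/LineageEvolver-0.9_beta1.jar \
#     "/home/john/ELECT-Ecoli Sample.xml" /home/john/LineageEvolver.out.txt
```

Quoting the configuration path replaces the backslash-escaped space used in the plain invocation; both forms pass the same single argument to Java.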

Defaults

Because ELECT is currently able to specify only the root genome of the simulation, LineageEvolver must rely on its defaults for all other simulation parameters and inputs. Changing these currently requires modifying LineageEvolver's source code. For the adventurous, this is certainly an option, as source code is freely available via the LineageEvolver project page.

Currently, LineageEvolver uses the following default parameters and inputs during execution.

Interpreting LineageEvolver Output

LineageEvolver output is divided into two sections: information relating to the simulation process, and complete extant genome information. At the top of the output file, you will find information relating to the evolutionary process, such as how many time intervals of substitutions were processed on each genome, when horizontal gene transfer was processed, and which genomes and genes were involved.

A sample of an output file is provided below.

Processed 2 time intervals of substitutions on 'internal_E-F' at time 20.
Processed 2 time intervals of substitutions on 'internal_G-H' at time 20.
Processed HGT With Gene8
Successfully Processed HGT between Genomes internal_C-D (index 1) and internal_A-B (index 0)
Processed HGT at time 22.
Processed 8 time intervals of substitutions on 'internal_A-B' at time 22.
Processed 8 time intervals of substitutions on 'internal_C-D' at time 22.
Processed 8 time intervals of substitutions on 'internal_E-F' at time 22.
Processed 8 time intervals of substitutions on 'internal_G-H' at time 22.
Processed ForkEvent on 'level1_0->level2_0' at time 30.
Processed ForkEvent on 'level1_0->level2_1' at time 30.

In this portion of the simulation, we see that at time 20, two time intervals of substitutions were processed on the ancestor of extant genomes E and F and on the ancestor of extant genomes G and H. Afterwards, horizontal gene transfer occurred between the ancestor of extant genomes C and D and the ancestor of extant genomes A and B. The gene transferred was gene number 8. The time of the horizontal transfer event was 22 time interval units after the start of the program. Following the horizontal transfer event, eight time intervals of substitutions were processed on each of the four internal genomes, and then a "ForkEvent", or bifurcation, was processed at time 30. This indicates that the simulation has moved from processing the portion of the tree between levels 1 and 2 to processing the portion of the tree between levels 2 and 3. Note that, in this case, level 3 represents the leaf nodes, whereas level 0 represents the root of the tree.
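
The event log described above lends itself to quick summaries with grep and sed. The following is a minimal sketch, assuming the log lines are worded exactly as in the sample; the function name summarize_events is made up for this example.

```shell
#!/bin/bash
# Summarise the event log at the top of a LineageEvolver output file.
# Assumes log lines are worded exactly as in the sample output.
summarize_events() {
    local infile="$1"

    # One "Processed HGT at time" line is printed per transfer event.
    printf 'HGT events: %s\n' "$(grep -c '^Processed HGT at time' "$infile")"

    # Genes named in the "Processed HGT With" lines.
    grep '^Processed HGT With ' "$infile" \
        | sed 's/^Processed HGT With /Transferred: /'

    # Times at which ForkEvents (bifurcations) were processed, without repeats.
    grep '^Processed ForkEvent' "$infile" \
        | sed 's/.* at time \([0-9]*\)\.$/Fork at time: \1/' \
        | sort -u
}

# Example call:
# summarize_events LineageEvolver.out.txt
```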

The second portion of the simulation output contains information on the extant genome sequences produced, and starts below the following section divider:

---- Tree Processing Complete ----
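
The divider makes it possible to separate the two sections of a saved output file. The following is a minimal sketch, assuming the divider appears on a line of its own with no surrounding whitespace; the function name split_output and the two output file names are made up for this example.

```shell
#!/bin/bash
# Split a saved LineageEvolver output file at the section divider.
# Assumes the divider line matches exactly, with no extra whitespace.
split_output() {
    local infile="$1" logfile="$2" genomefile="$3"
    local divider='---- Tree Processing Complete ----'

    # Lines before the divider: the simulation event log.
    awk -v d="$divider" '$0 == d { exit } { print }' "$infile" > "$logfile"

    # Lines after the divider: the extant genome data.
    awk -v d="$divider" 'found { print } $0 == d { found = 1 }' "$infile" > "$genomefile"
}

# Example call:
# split_output LineageEvolver.out.txt events.txt genomes.txt
```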

Individual genome information is provided as follows:

  1. Genome ID
  2. Environmental Category values
  3. Individual Gene Information:
    1. Gene ID:Paralog ID,Relative Gene Selection Frequency
    2. Sequence
    3. ASRV values
    4. Number of Existing Paralogs

For example:

Genome:
Environmental Variables: (1.2890124,1.3413042,1.3713064,1.4086154,1.2573907,1.4615619,1.6519208,2.4669125,2.9307246,1.937844)
Genes:
(0:1,0.17388159,(MESKNKLKRG...),(0.4286,1.7181,2.7706,2.7706,1.7181,1.7181,0.0875,1.221,0.4286,1.7181,...) Paralogs: 1

The first line in this example marks the beginning of information on a new genome. It is followed by ten environmental category values, given in this order:

  1. Relative Genome Size
  2. G/C Content
  3. Carbon Utilization
  4. Oxygen Utilization
  5. Maximum Growth Temperature
  6. Minimum Growth Temperature
  7. Optimum Growth Temperature
  8. Salinity
  9. pH
  10. log10(Pressure)
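
Since the ten values are identified only by position, it can help to print them next to their category names. The following is a minimal sketch, assuming each "Environmental Variables:" line holds exactly ten comma-separated values in the order listed above; lines with any other number of fields are reported on standard error rather than labeled. The function name label_environment is made up for this example.

```shell
#!/bin/bash
# Print each "Environmental Variables:" line as labeled values.
# Records are numbered in order of appearance; a line that does not contain
# exactly ten fields is reported on stderr and skipped.
label_environment() {
    awk '
        BEGIN {
            split("Relative Genome Size;G/C Content;Carbon Utilization;" \
                  "Oxygen Utilization;Maximum Growth Temperature;" \
                  "Minimum Growth Temperature;Optimum Growth Temperature;" \
                  "Salinity;pH;log10(Pressure)", name, ";")
        }
        /^Environmental Variables: / {
            record++
            line = $0
            sub(/^Environmental Variables: *\(/, "", line)  # drop prefix and "("
            sub(/\).*$/, "", line)                          # drop ")" and the rest
            n = split(line, value, ",")
            if (n != 10) {
                printf("record %d: expected 10 values, found %d\n", record, n) > "/dev/stderr"
                next
            }
            printf("Genome record %d\n", record)
            for (i = 1; i <= n; i++)
                printf("  %s = %s\n", name[i], value[i])
        }
    ' "$1"
}

# Example call:
# label_environment LineageEvolver.out.txt
```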

The gene line, beginning with (0:1,0.17388159, specifies that the gene number is 0 (gene numbers start at 0), the paralog number is 1 (indicating that this is the first paralog of this gene in the genome), and that this gene is selected for substitution and horizontal gene transfer events at a relative frequency of 0.17388159. This number should be interpreted with caution, as the classification of a gene as slow-, medium- or fast-evolving depends entirely on the relative selection frequencies of the other genes in the organism. In this case, the highest frequency was 0.9780598 and the lowest was 0.06384802 (neither is shown in the excerpt above), indicating that this particular gene (Gene 0) is evolving at a slow rate. The remainder of the record, up to the specification of the next gene, gives the sequence and the full ASRV values of the gene (both abbreviated with "..." in the excerpt above). The final value, after the closing parenthesis of the ASRV data, specifies that only one paralog (the current gene) exists for this gene.
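
Because a selection frequency only means something relative to the others, it is useful to list the frequencies side by side. The following is a minimal sketch, assuming gene lines start with "(" followed by "geneID:paralogID,frequency," as in the excerpt, and that frequencies are written as plain decimals; the function name list_gene_frequencies is made up for this example.

```shell
#!/bin/bash
# Print "frequency geneID:paralogID" for every gene line, lowest frequency first.
# Assumes gene lines look like "(0:1,0.17388159,(SEQUENCE),(ASRV...) Paralogs: 1".
list_gene_frequencies() {
    grep -E '^\([0-9]+:[0-9]+,' "$1" \
        | awk -F , '{ sub(/^\(/, "", $1); print $2, $1 }' \
        | LC_ALL=C sort -n
}

# Example call:
# list_gene_frequencies LineageEvolver.out.txt
```

The function lists every gene line in the file it is given. To compare genes within a single organism, pass it a file that contains only that genome's records.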

Extracting/Processing LineageEvolver Output

The following script serves as an example of what can be done with the LineageEvolver output via command-line automation. While it is not intended as a be-all-end-all script for LineageEvolver output parsing and processing, it is good at what it does: the script parses the LineageEvolver output, generates a FASTA-formatted sequence file for each gene family encountered, aligns each family with clustalw, and then generates consensus trees via seqboot, protpars and consense, as well as a tree via phyml.

#!/bin/bash
#
# Extracts gene families from LineageEvolver output and generates consensus
# trees via phyml and seqboot/protpars/consense.
#

INFILE="LineageEvolver.out"         # saved LineageEvolver output; adjust to your file
GENES="0 1 2 3 4 5 6 7 8 9 10"      # IDs of the gene families to process
for gene in $GENES; do
    # Collect every sequence of this gene family into a FASTA file.
    rm -f "$gene.faa"
    genome=0
    grep "^($gene:" "$INFILE" | awk -F , '{ print $3 }' | cut -d "(" -f 2 | cut -d ")" -f 1 | while read -r seq; do
        genome=$((genome + 1))
        printf '>%s %s\n%s\n' "$genome" "$gene" "$seq" >> "$gene.faa"
    done

    # Sequence Alignment
    clustalw -infile="$gene.faa" -outfile="$gene.phy" -align -output=phylip

    # Parsimony Analysis (the echoed lines answer the PHYLIP menu prompts)
    echo -n "Running seqboot on $gene... "
        echo -e "$gene.phy\ny\n17\n" | seqboot > /dev/null
        mv outfile "$gene.phy.seqboot.outfile"
    echo "Done."
    echo -n "Running protpars on $gene... "
        echo -e "$gene.phy.seqboot.outfile\nj\n17\n2\nm\nd\n100\ny\n" | protpars > /dev/null
        mv outfile "$gene.phy.seqboot.protpars.outfile"
        mv outtree "$gene.phy.seqboot.protpars.trees"
    echo "Done."
    echo -n "Running consense on $gene... "
        echo -e "$gene.phy.seqboot.protpars.trees\ny\n" | consense > /dev/null
        rm outfile
        mv outtree "$gene.phy.seqboot.protpars.consense.tree"
    echo "Done."

    # ML Analysis
    echo -n "Running phyml on $gene... "
        phyml "$gene.phy" 1 i 1 100 JTT e 8 e BIONJ y y > "$gene.phyml.out" 2>&1
    echo "Done."
done
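
Before running the alignment and tree-building steps, it may be worth checking that each gene family file contains the number of sequences you expect. The following is a minimal sketch; the function name count_fasta_records is made up for this example, and it relies only on the fact that every FASTA record starts with a ">" header line.

```shell
#!/bin/bash
# Print "file count" for each FASTA file given, where count is the number
# of ">" header lines (one per sequence) in that file.
count_fasta_records() {
    local f
    for f in "$@"; do
        printf '%s %s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Example call, for the gene family files written by the script:
# count_fasta_records *.faa
```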

Troubleshooting

Both LineageEvolver and ELECT are currently beta products. Please report problems and errors to the bug tracker on the LineageEvolver project page.