LineageEvolver - How It Works

This page outlines how LineageEvolver implements each evolutionary process it simulates. For more detail on any part of the program, consult the JavaDoc or the source code (released via the project page).

Amino Acid Substitution

When an amino acid substitution event is processed, a gene is chosen by weighted random selection (see 'Per-Gene Evolution Rates' below), and a site within that gene is chosen by a second, separate weighted random selection (see 'Among Site Rate Variation (ASRV)' below). The amino acid at that site is then replaced with an amino acid drawn at random, weighted according to a module-based substitution matrix.

Per-Gene Evolution Rates

Each gene is assigned a mutable evolution rate that governs its selection frequency relative to all other genes within the same genome. Genes are selected at random according to this relative weighting scheme when an amino acid substitution event is processed.
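
The relative weighting described above can be illustrated with a minimal Python sketch. It is not taken from the LineageEvolver source; the function name pick_gene, the gene names, and the rate values are all hypothetical. It assumes only that a gene's chance of selection is its rate divided by the sum of all rates in the genome.

```python
import random

def pick_gene(rates, rng):
    """Return a gene name chosen with probability rate / sum(all rates)."""
    names = list(rates)
    weights = [rates[name] for name in names]
    # random.Random.choices performs weighted sampling with replacement.
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical genome: geneB evolves three times as fast as geneA,
# and geneC six times as fast.
gene_rates = {"geneA": 1.0, "geneB": 3.0, "geneC": 6.0}

rng = random.Random(42)  # seeded so that a run is repeatable
counts = {name: 0 for name in gene_rates}
for _ in range(10_000):
    counts[pick_gene(gene_rates, rng)] += 1
```

Over many substitution events, the tallies in counts should approach the 1:3:6 ratio of the rates, so faster-evolving genes accumulate proportionally more substitutions.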

Among Site Rate Variation (ASRV)

Each amino acid site is assigned a mutable evolution rate that governs its selection frequency relative to all other sites within the same gene. The sequence of these rates for a given gene is called the ASRV for that gene, and is specified via input. ASRV values should be specified as relative rates, not as rate categories.

Note: These values can be obtained using a program such as TREE-PUZZLE.
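
To show why relative rates are what matters, the following Python sketch converts a hypothetical ASRV into per-site selection probabilities. The function names and the rate values are assumptions made for illustration, not part of LineageEvolver.

```python
import random

def site_probabilities(asrv):
    """Convert relative per-site rates into selection probabilities."""
    total = sum(asrv)
    return [rate / total for rate in asrv]

def pick_site(asrv, rng):
    """Return a 0-based site index chosen in proportion to its relative rate."""
    return rng.choices(range(len(asrv)), weights=asrv, k=1)[0]

# Hypothetical ASRV for a gene of three sites.
asrv = [0.5, 1.0, 2.5]
probabilities = site_probabilities(asrv)

# Multiplying every rate by the same constant leaves the probabilities unchanged.
rescaled = site_probabilities([rate * 2 for rate in asrv])

site = pick_site(asrv, random.Random(7))
```

Here probabilities and rescaled are identical, which is the sense in which the rates are relative. A rate category index, by contrast, is a label for a bin rather than a rate, so using it as a weight would distort the selection frequencies.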

ASRV Mutation: A Covarion Model

The ASRV for each gene mutates automatically with amino acid substitutions and genome duplication events.

Module-Based Substitution Matrices

Implemented as a swappable, module-based subsystem, substitution matrices govern the choice of a replacement amino acid given a current state. Once a particular site has been selected for substitution, its current state (amino acid) is used as a parameter for a weighted random selection process governed by a substitution matrix. By default, the JTT substitution matrix is used. Because of the modular design of the system, other substitution matrices can be constructed easily and require no changes to interface code.
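
The modular design can be sketched as follows. This is an illustration in Python, not the LineageEvolver interface: the SubstitutionMatrix protocol, the ToyMatrix class, and the substitute function are hypothetical names, and the weights are invented for a three-letter alphabet (they are not JTT values).

```python
import random
from typing import Dict, Protocol

class SubstitutionMatrix(Protocol):
    """Anything that can report replacement weights for a current residue."""
    def replacement_weights(self, current: str) -> Dict[str, float]: ...

class ToyMatrix:
    """Illustrative three-letter matrix; the weights are made up, not JTT."""
    _weights = {
        "A": {"G": 0.7, "V": 0.3},
        "G": {"A": 0.9, "V": 0.1},
        "V": {"A": 0.5, "G": 0.5},
    }

    def replacement_weights(self, current: str) -> Dict[str, float]:
        return self._weights[current]

def substitute(sequence, site, matrix, rng):
    """Replace the residue at `site` using weights taken from `matrix`."""
    weights = matrix.replacement_weights(sequence[site])
    residues = list(weights)
    new_residue = rng.choices(
        residues, weights=[weights[r] for r in residues], k=1
    )[0]
    return sequence[:site] + new_residue + sequence[site + 1:]

mutated = substitute("AGV", 1, ToyMatrix(), random.Random(0))
```

Because substitute depends only on the replacement_weights method, a different matrix can be supplied without touching the calling code, which mirrors the swappable design described above.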

Horizontal Gene Transfer (HGT)

Also known as 'Lateral Gene Transfer', this process involves the transfer of genetic material between two otherwise unrelated genomes. Once an HGT event has been triggered (see 'Scheduling' below), a weighted, random selection mechanism based on environmental variables is used to find two participant genomes (see 'Environmental Variables' below). Once two genomes have been selected for HGT, the actual transfer of genetic material commences (see 'Recombination' below).

Scheduling

HGT is scheduled using total time intervals along the tree as a whole. Because LineageEvolver synchronizes evolutionary events using time intervals rather than numbers of substitutions (so as to avoid assuming a global molecular clock), HGT events can be synchronized against the entire tree. Because the genomes affected are selected at runtime, HGT events cannot be attached to any particular branch of the tree. Instead, these events are pinned to total elapsed time (the equivalent of, for example, attaching them to a global 'timeline' branch).

As an example, assume the tree ((A:2,B:2):3,(C:2,D:2):3) (branch lengths specified in time intervals, not numbers of substitutions). Suppose HGT events are scheduled at time intervals 2 and 4. One HGT event must then be processed between (a) the common ancestor of genomes A and B and (b) the common ancestor of genomes C and D, since only those two lineages exist at time 2. The other HGT event must be processed between any two of A, B, C, and D, since all four lineages exist at time 4.
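
The example can be checked with a short Python sketch that lists the lineages present at a given time, measured from the root. The tuple representation of the tree, the function name lineages_at, and the labels "AB" and "CD" for the two ancestral branches are assumptions made for illustration. The tree is read as ((A:2,B:2):3,(C:2,D:2):3).

```python
def lineages_at(node, time, start=0.0):
    """Return the names of all branches that span `time` since the root.

    A node is a tuple (name, branch_length, children). A branch covers the
    half-open interval (start, start + branch_length].
    """
    name, length, children = node
    end = start + length
    found = []
    if start < time <= end:
        found.append(name)
    for child in children:
        found.extend(lineages_at(child, time, end))
    return found

# ((A:2,B:2):3,(C:2,D:2):3) with hypothetical labels for the ancestors.
tree = ("root", 0.0, [
    ("AB", 3.0, [("A", 2.0, []), ("B", 2.0, [])]),
    ("CD", 3.0, [("C", 2.0, []), ("D", 2.0, [])]),
])

at_time_2 = lineages_at(tree, 2.0)
at_time_4 = lineages_at(tree, 4.0)
```

At time 2 only the two ancestral lineages (the ancestor of A and B, and the ancestor of C and D) exist, so they are the only possible HGT participants. At time 4 the candidates are A, B, C, and D.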

Environmental Variables

The selection of genomes for horizontal transfer is not entirely random; it depends on a number of environmental and cellular factors. As suggested by James Lake, these factors are divided into the ten variables summarized below (growth temperature accounts for three of them). Each variable is weighted according to its significance level as determined by Lake.

Note that there are no predefined ranges for environmental variables. Initial values may be arbitrary, as they will mutate automatically. Input values need not reflect actual measurements. Selection of genomes for HGT is based on similarities between corresponding variables in different genomes, not on the precise value of the variables themselves.

For instance, suppose three genomes have oxygen utilization values of 1, 1.5 and 10. The genomes with oxygen utilization values of 1 and 1.5 are most likely to undergo HGT based on this parameter alone. This scenario would be unchanged if the oxygen utilization values for each genome were 0.1, 0.15 and 1.
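
The text above does not state which similarity measure is used. One measure with the scale invariance shown in this example is the absolute log-ratio of two values, and the Python sketch below uses it purely as an assumption. The function names and genome labels are hypothetical, and the measure requires strictly positive values.

```python
import math
from itertools import combinations

def log_ratio_distance(x, y):
    """Distance that depends only on the ratio x / y (assumed measure)."""
    return abs(math.log(x / y))

def closest_pair(values):
    """Return the pair of genome names whose values are most similar."""
    return min(
        combinations(sorted(values), 2),
        key=lambda pair: log_ratio_distance(values[pair[0]], values[pair[1]]),
    )

# Oxygen utilization values from the example, with hypothetical genome names.
original = {"g1": 1.0, "g2": 1.5, "g3": 10.0}
rescaled = {"g1": 0.1, "g2": 0.15, "g3": 1.0}

pair_original = closest_pair(original)
pair_rescaled = closest_pair(rescaled)
```

Both calls pick out the same pair, g1 and g2, because dividing every value by ten leaves all ratios unchanged.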

Relative Genome Size

Genome size has a strong positive associative influence on genome selection. This may be a side-effect of the size difference between heterotrophic and autotrophic genomes.

G/C Content

Nucleotide preferences can affect regulatory signals, favoring incorporation of genes with similar G/C ratios.

Carbon Utilization

Heterotrophs are more likely to exchange genes with other heterotrophs, while autotrophs are more likely to exchange genes with other autotrophs.

Oxygen Utilization

Many enzymes are sensitive to oxygen levels.

Maximum/Minimum/Optimum Growth Temperature (Three Separate Variables)

A mesophilic protein may be inactivated at high temperatures, whereas a thermophilic protein may need those high temperatures for enzyme catalysis. Therefore transfer of genes from a mesophilic organism to a thermophilic organism, or vice versa, may result in an inactive protein.

Salinity

Salinity has a slight positive associativity. The lack of significant associativity may be due to internal salt concentrations not being strongly correlated with external environments. This characteristic is nevertheless included for the sake of completeness and is weighted according to its lower significance.

pH

Optimum pH value of the growth medium has little effect on HGT, as the pH in any given environment can vary.

log10(Pressure)

The pressure at which organisms live has the least influence on HGT of the characteristics accounted for, although there is some positive associativity. As with salinity and pH, it has been included for the sake of completeness and weighted appropriately to its decreased significance.

Recombination

In its current implementation, recombination consists of unidirectional transfer of genetic material between gene segments. Once two genomes have been chosen for HGT, a process identical to the gene selection mechanism for amino acid substitution is used to select genes for recombination. Currently, only orthologous genes are allowed to recombine. After selecting genes for recombination, a recombination length (specified as a sequence length) is chosen using a weighted random number generator; both genes are then searched at random for identical 'padding' regions separated by a sequence of this length. The sequence between these regions is then subject to recombination, overwriting the destination gene region with the corresponding region from the origin gene. A gene's status as origin or destination is determined randomly at runtime.
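
The overwrite step can be sketched in Python as follows. This is a simplified illustration, not the LineageEvolver algorithm: it assumes that the padding regions have a fixed width (the hypothetical parameter pad), that they must match at the same positions in both genes, and that the recombination length has already been chosen. The function name recombine is likewise hypothetical.

```python
import random

def recombine(origin, destination, length, pad, rng):
    """Copy one `length`-long segment from `origin` into `destination`.

    A segment qualifies only if the `pad` residues on each side of it are
    identical in both genes. The destination is returned unchanged when no
    segment qualifies.
    """
    span = pad + length + pad
    candidates = []
    for start in range(min(len(origin), len(destination)) - span + 1):
        left = slice(start, start + pad)
        right = slice(start + pad + length, start + span)
        if (origin[left] == destination[left]
                and origin[right] == destination[right]):
            candidates.append(start + pad)  # first position of the segment
    if not candidates:
        return destination
    seg = rng.choice(candidates)  # pick one qualifying segment at random
    return destination[:seg] + origin[seg:seg + length] + destination[seg + length:]

# Hypothetical orthologous sequences that differ at positions 2, 3 and 6.
result = recombine("MKAYLLQ", "MKGHLLT", length=2, pad=2, rng=random.Random(0))
```

In this example the only qualifying segment is flanked by "MK" and "LL", so the destination's "GH" is overwritten with the origin's "AY", while the unrelated difference at the last position is left alone.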

Gene Deletion

Note: The algorithms involved in this process are currently being refined. As such, the following information is subject to change.

Currently, gene deletion events are associated with a given branch in the input topology; the genome from which a gene is deleted must therefore be specified. A gene is selected for deletion in the same manner as genes are selected for amino acid substitutions.

Gene Duplication

Note: The algorithms involved in this process are currently being refined. As such, the following information is subject to change.

Currently, gene duplication events are associated with a given branch in the input topology; the genome in which a gene is duplicated must therefore be specified. A gene is selected for duplication in the same manner as genes are selected for amino acid substitutions. Paralogs are tracked via gene identification numbers as well as paralog identification numbers and counts. This tracking is visible in the LineageEvolver output.
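
One way to picture this bookkeeping is the Python sketch below. The Gene record, its field names, and the numbering scheme (paralog identifiers starting at 0 and incrementing per copy) are assumptions made for illustration; they do not describe the actual LineageEvolver data structures or output format.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    gene_id: int      # shared by all copies descended from one gene
    paralog_id: int   # distinguishes the copies within one genome
    sequence: str

def duplicate(genome, index):
    """Append a copy of genome[index] with the next free paralog id."""
    source = genome[index]
    family = [g for g in genome if g.gene_id == source.gene_id]
    next_id = max(g.paralog_id for g in family) + 1
    copy = Gene(source.gene_id, next_id, source.sequence)
    genome.append(copy)
    return copy

def paralog_count(genome, gene_id):
    """Number of copies of a gene currently present in the genome."""
    return sum(1 for g in genome if g.gene_id == gene_id)

# Hypothetical two-gene genome; gene 1 is duplicated once.
genome = [Gene(1, 0, "MKT"), Gene(2, 0, "AGV")]
new_copy = duplicate(genome, 0)
```

After the call, gene 1 has two copies with paralog identifiers 0 and 1, and gene 2 still has one.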