Recombination estimation by HMM and PAC

Last updated: 2018-03-07

Code version: 18c3071

Brief review

In this paper, using a novel HMM-based approach, researchers compare sequences of var gene DBL$\alpha$ domains from two divergent isolates in P.falciparum: 3D7 and HB3, and its sister taxa P.reichenowi. Results demonstrate this two species share similar gene size, and gene structure. To more specifically, they both have more than 51 var genes in a genome, and they both have a series of “homology blocks”. In addition, through simulation, they found recombination occurs almost every residue in DBL$\alpha$ domain which appears to be unusual, and no hotspot structure for this type of intradomain recombination even though interdomain hotspot structure could be considered in previous Rask’s study.

High level of sequence diversity in the PfEMP1 proteins, encoded by 50 to 60 var genes, express on the surface of infected red cells, provoking an immune response, and are known virulence factors. Previous studies show that recombination, combined with point mutation, is the mechanism of var gene revolution. Another noteworthy point is that since host’s immune system is more effective against antigens, as a result, parasites expressing less-common proteins avoid detection more effectively.

DBL$\alpha$ domain, averaging 1.8kb in length, is the only functional domain in P.falciparum and P.reichenowi, and it has stable location as well.

Using the tBLASTx algorithm and the prototype DBL$\alpha$ domain, researchers are able to extract reads, then using Clean Data assembly algorithm in Sequencer(GeneCodes), 51 unique DBL$\alpha$ regions are recovered, which is within the range of P.falciparum, thus, demonstrating that the family is equally large in both species. In further phylogenetic analysis of the var gene DBL$\alpha$ domains, P.reichenowi are not clustered together, indicating var genes likely arose as an entire family before the P.falciparum-P.reichenowi speciation event 2.5-6 million years ago.

Before introduction of method part, there are two pictures showing us the difference of gene conversion and crossover.

This graphic is adapted from Chen et al.(2007)

This graphic is adapted from webbook

Obviously, gene conversion in second figure belongs to the third type (Interallelic gene conversion)shown in first figure.

Core Method

New method is necessary to identify the relationship between species. In this paper, Tesserae program written in C language is used and implemented. HMM is employed to find homology, global alignment algorithm (Needleman-Wunsch) is to detect mosaic recombination, product of approximately conditionals (PAC) likelihood is utilized to estimate the recombination parameter. This paper aim to reconstruct each sequence in dataset as a mosaic of one or more donor sequences, allowing substitution, indel and recombination. The precise steps are as follows:

Initially, recombination parameter set to zero, transition and emission probabilities , indels and mutations are estimated using Baum-Welch algorithm, then fix these parameters, a likelihood surface is constructed for the recombination parameter. Once MLE for recombination parameter is found, Viterbi path is computed for each sequence, this path provide the mosaic alignments.

For simulated sequences, they construct 10 gene families, each family has 60 genes, each gene is composed of 150 amino acid residues in length, and a table of input parameters are needed. Each gene family is used for 8 different sets of parameters, namely, different levels of recombination and conversion. Each gene family’s simulation could viewed as two groups, one is indels without recombination, another one is coalescent with recombination, since there is currently no available program for joint estimation about recombination and indel.

Here we have to note that the coalescent is likely to be an inaccurate description of the true var gene evolution. However, basic coalescent processes of coancestry and allelic recombination may represent var gene duplication and non-allelic recombination, thus, making aspects of coalescent represent several features of var gene family evolution. Here authors employ calibration method, allowing them to make comparisons between the $\rho$ and recombination parameter in coalescent models.

Firstly, comparisons of the exact values show a high level of accuracy in the estimated recombination rates, through computing the difference of likelihood with and without recombination for each sequence, the statistical significance of the improvement as p less than $1*10^{-32}$.

In order to test false positives further, they uses a non-recombining data from P.falciparum, the program is able to recover all known recombination event, while finding no recombination history, as expected.

In further examination of the P.falciparum and P.reichenowi DBL$\alpha$ domain homology regions, researchers find multiple regions pf particular high homology, classified into two groups: core motifs and conserved peptides. The former one is between 18 and 28 residues, corresponding to HB1-5(termed ``homology blocks" by Rask et al.), among them, HB2 motif are he most frequently ones, then is HB3 and HB5. The latter one is between 24 and 140 residues, and between 80% and 100% similarity.

Lastly, the most important analysis result is recombination is uniform throughout the DBL$\alpha$ domain and does not show a hot- or coldspot structure, that means recombination breaks at almost every residue. From the following table, high variance of block length indicate the lack of hot - or coldspots of recombination.

This table is adapted from original paper

Introduction three programs for generating sequences

One basic introduction I would add is the programs generating sequences , they are Seq-Gen, ms and Rose, which are all shown in the first column of Table S2.

Seq-Gen simulates sequences given a substitution model along a phylogeny. Ms, is an all-time classic engine for coalescent sequence simulations under a Wright-Fisher neutral model. and Rose, simulates sequences given a substitution model along a phylogeny incorporating indels. This introduction is originally from a web link. My own practice is here

Summary about the transition matrix

Here is a summary of talk between Heejung and me on Monday and Tuesday afternoon in the first week of March 2018. Our talk focus on the explanation of the following transition matrix allowing insertion and deletion events.

This table is adapted from original paper

Firstly we have $n$ destination sequences and one target sequence, assuming this target sequence as a mosaic of segments from those destination sequences. In this matrix, subscript “x” represents the target sequence and “k” represents any other destination sequence in the dataset.

Firstly, this matrix is not a symmetric matrix. For example, elements in the first column are not the same with elements in the first row.

The meaning of $M_{x},I_{x},D_{x},M_{k},I_{k},D_{k}$ are shown in the following picture:

If we look at the two destination sequences in this figure, there are $6+3=9$ positions, at first, the start position uniformly come from this nine positions, therefore, every position has the same probability 1/9, representing $1/|Y|$ in the transition matrix. Next, let’s look at this matrix row by row, I will use $T_{ij}$ represent the element in row i and column j.

In the first row, $T_{11}$ and $T_{18} $are both 0, which are sensible, the target sequence’s start point come from and only come from Match or insert, not Delete. So $T_{14}$and $T_{17}$ are 0, moreover, the start point may come from one of sequence x or sequence k , so my understanding is the first four elements are independent of the last four elements. Lastly, if start position comes from sequence x, it is match or insertion, so I suppose:

$\pi_{M} + \pi_{I}=1$ *

In the second row, gap open probability is both $\delta$m if we jump to another sequence k, then will add recombination probability. Here we should notice that $T_{27}$ is 0, which means the start point in target sequence should not be a gap when we jump to another sequence.

In the third row, $T_{34}=T_{37}=0$, here we can suppose there is no allowance to transition from insertion to deletion, gap extension is allowed, of course.

In the fourth row, There is no allowance to transition from deletion to insertion. gap extension is allowed, of course.

The last four rows also show the same rule with the first four rows. Therefore, we should notice that the last two positions shown in picture should not happen in practice!

Last point I would add is gap also exist in target sequence. In the simulation, these gaps is sampled from the statonary distribution of the emission matrix.

This R Markdown site was created with workflowr

Recombination estimation by HMM and PAC

Qian Feng

2018-03-05

Brief review

Core Method

Introduction three programs for generating sequences

Summary about the transition matrix