Objective
To propose a new sequence alignment method based on neural networks.
Problem
Given a set S of sequence pairs.
Each sequence pair < ${Seq}_A$, ${Seq}_B$ > in S has a best alignment $Align({Seq}_A,{Seq}_B)$.
For example, for the pair < ${Seq}_A$, ${Seq}_B$ >:
LDRCVPKFWTLHKN
LEKCWTMNMCQKN
the best alignment form is:
LDRCVPKFWTL---HKN
LEKC----WTMNMCQKN
For a new sequence pair not in S, how do we predict its best alignment?
Further Formalization of the Problem
We recast this problem in another setting.
There are two queues, QueueA and QueueB; SeqA is placed in QueueA and SeqB in QueueB. Three actions may be applied to the queues: M (dequeue the front symbol of both queues, aligning the two symbols), I (dequeue from QueueA only, leaving a gap in SeqB), and D (dequeue from QueueB only, leaving a gap in SeqA).
At each step, one action is chosen.
When both queues are empty, the actions taken form a strategy sequence.
The problem can then be restated as follows.
Given a set S of sequence pairs.
Each sequence pair < ${Seq}_A$, ${Seq}_B$ > in S has a best dequeue strategy $Align({Seq}_A,{Seq}_B)$.
For the ${Seq}_A$ and ${Seq}_B$ above, the best dequeue strategy is:
MMMMIIIIMMMDDDMMM
For a new sequence pair not in S, how do we predict its best strategy sequence?
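To make the queue formulation concrete, here is a small Python sketch that converts a gapped alignment into an M/I/D strategy string and back. The `-` gap character and the I/D convention (I consumes from QueueA only, D from QueueB only) are assumptions for illustration:

```python
def alignment_to_strategy(aligned_a, aligned_b):
    """Convert a gapped alignment into an M/I/D strategy string.
    Convention assumed: M consumes from both queues,
    I consumes from QueueA only (gap in SeqB),
    D consumes from QueueB only (gap in SeqA)."""
    assert len(aligned_a) == len(aligned_b)
    out = []
    for a, b in zip(aligned_a, aligned_b):
        if a != '-' and b != '-':
            out.append('M')
        elif b == '-':
            out.append('I')  # symbol dequeued from QueueA only
        else:
            out.append('D')  # symbol dequeued from QueueB only
    return ''.join(out)

def strategy_to_alignment(seq_a, seq_b, strategy):
    """Replay a strategy on the two queues to recover the gapped alignment."""
    qa, qb = list(seq_a), list(seq_b)
    row_a, row_b = [], []
    for action in strategy:
        row_a.append(qa.pop(0) if action in 'MI' else '-')
        row_b.append(qb.pop(0) if action in 'MD' else '-')
    return ''.join(row_a), ''.join(row_b)

# alignment_to_strategy("AC-GT", "A-TGT")          -> "MIDMM"
# strategy_to_alignment("ACGT", "ATGT", "MIDMM")   -> ("AC-GT", "A-TGT")
```

The two functions are inverses of each other, which is exactly why predicting the strategy sequence is equivalent to predicting the alignment.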
Motivation
The strengths of deep learning are feature extraction, metric-space mapping, and large-scale data processing. Given the complexity of sequence alignment, a simple deep learning method alone is not enough.
The seq2seq model provides a simple end-to-end architecture for multi-step decision problems, and it can be easily modified and extended.
By formalizing sequence alignment as a strategy sequence, we can use a seq2seq neural network to model the sequence alignment problem.
Method
DataSet
SCOPe 2.06
database | proteins | alignments_family | alignments_superfamily | alignments_fold |
---|---|---|---|---|
scope20_2.06 | 7659 | 16194 | 66229 | 133900 |
scope40_2.06 | 13760 | 98558 | 305553 | 526136 |
scope70_2.06 | 21132 | 314039 | 907374 | 1470385 |
scope95_2.06 | 28010 | 1099425 | 2963970 | 4170984 |
scope100_2.06 | 69590 | 6060416 | 16688439 | 23215716 |
scopeall_2.06 | 216248 | 69574992 | 172157027 | 239906177 |
Model
Textual Encoder
Given two input sequences
$X_1=(a_1,a_2,…,a_{T_1}),X_2=(b_1,b_2,…,b_{T_2})$
a multi-layer bidirectional LSTM (Bi-LSTM) encoder produces encodings of the input sequences.
From another perspective, the encoder outputs can be viewed as a memory of the input.
$A^{X_1}=Encoder(a_1,a_2,…,a_{T_1})=\{h_1^{X_1},h_2^{X_1},…,h_{T_1}^{X_1}\}$
$A^{X_2}=Encoder(b_1,b_2,…,b_{T_2})=\{h_1^{X_2},h_2^{X_2},…,h_{T_2}^{X_2}\}$
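As an illustration only (not the actual training code), a bidirectional LSTM encoder can be sketched in NumPy; the gate ordering, the parameter names `W`, `U`, `b`, and the dimensions are assumptions:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates i, f, o and candidate g are stacked in W, U, b."""
    z = W @ x + U @ h + b
    H = h.size
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * H:(k + 1) * H])) for k in range(3))
    g = np.tanh(z[3 * H:])
    c = f * c + i * g
    return o * np.tanh(c), c

def bilstm_encode(xs, params_fwd, params_bwd):
    """Memory A = {h_1, ..., h_T}: each h_t concatenates forward and backward states."""
    H = params_fwd[2].size // 4
    def run(seq, params):
        h, c, states = np.zeros(H), np.zeros(H), []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            states.append(h)
        return states
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

# Tiny demo: T=7 steps of 5-dim inputs, hidden size 4 -> memory of 8-dim vectors.
rng = np.random.default_rng(0)
params = lambda: (rng.normal(size=(16, 5)), rng.normal(size=(16, 4)), np.zeros(16))
memory = bilstm_encode(list(rng.normal(size=(7, 5))), params(), params())
```

Stacking several such layers (feeding each layer's memory as the next layer's input) gives the multi-layer variant.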
Joint multimodal representation
Multimodal Vector Concat
As a first attempt, we concatenate the two input memories to obtain a joint representation.
Bilinear pooling.
Bilinear models (Tenenbaum and Freeman, 2000) take the outer product of two vectors $x ∈ R^{n_1}$ and $q ∈ R^{n_2}$ and learn a (linear) model $W$, i.e.
$z = W [x ⊗ q]$, where $⊗$ denotes the outer product $(xq^T)$ and $[\,]$ denotes flattening the matrix into a vector.
However, the high dimensional representation leads to an infeasible number of parameters to learn in $W$.
We thus need a method that projects the outer product to a lower dimensional space and also avoids computing the outer product directly.
Multimodal Compact Bilinear Pooling (MCB)
The count sketch projection $Ψ(v, h, s)$ (Charikar et al., 2002) maps a vector $v ∈ R^n$ into $R^d$ using a random hash function $h$ and random signs $s$. This allows us to project the outer product to a lower-dimensional space, which reduces the number of parameters in $W$. To avoid computing the outer product explicitly, Pham and Pagh (2013) showed that the count sketch of the outer product of two vectors can be expressed as the convolution of both count sketches: $Ψ(x ⊗ q, h, s) = Ψ(x, h, s) ∗ Ψ(q, h, s)$, where $∗$ is the convolution operator.
By the convolution theorem, convolution in the time domain is equivalent to element-wise product in the frequency domain, so the convolution $x'∗ q'$ can be rewritten as ${FFT}^{−1}(FFT(x') \centerdot FFT(q'))$, where $\centerdot$ denotes element-wise product. These ideas are formalized in the Tensor Sketch algorithm of Pham and Pagh (2013), which we invoke with $x = A^{X_1}$ and $q = A^{X_2}$. Note that this extends easily, and remains efficient, to more than two multimodal inputs, since the combination is an element-wise product.
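A minimal NumPy sketch of the count-sketch / FFT identity above; the hash functions and signs are drawn at random here, and this is a check of the math rather than the trained model:

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Psi(v, h, s): y[h[i]] += s[i] * v[i], projecting v into R^d."""
    y = np.zeros(d)
    np.add.at(y, h, s * v)
    return y

def mcb(x, q, hx, sx, hq, sq, d):
    """Compact bilinear pooling: FFT^-1(FFT(Psi(x)) . FFT(Psi(q)))."""
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fq = np.fft.fft(count_sketch(q, hq, sq, d))
    return np.real(np.fft.ifft(fx * fq))

# Sanity check: the FFT route equals the count sketch of the outer product
# taken with combined hash h(i,j) = (hx[i] + hq[j]) % d and sign sx[i]*sq[j].
rng = np.random.default_rng(0)
n1, n2, d = 8, 6, 32
x, q = rng.normal(size=n1), rng.normal(size=n2)
hx, hq = rng.integers(0, d, n1), rng.integers(0, d, n2)
sx, sq = rng.choice([-1.0, 1.0], n1), rng.choice([-1.0, 1.0], n2)
z = mcb(x, q, hx, sx, hq, sq, d)
h_joint = ((hx[:, None] + hq[None, :]) % d).ravel()
s_joint = (sx[:, None] * sq[None, :]).ravel()
assert np.allclose(z, count_sketch(np.outer(x, q).ravel(), h_joint, s_joint, d))
```

The outer product ($n_1 n_2$ entries) is never materialized in `mcb`; the cost is two $O(n)$ sketches plus three FFTs of length $d$.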
Attention
At every time step, the attention mechanism computes a context vector $c_t$, given the current decoder state $s_t$ and a memory set $A$.
First, we compute attention weights $α_t = softmax(v^T\tanh(W_1A + W_2s_t + b))$.
Then, the context vector is obtained with the following weighted sum : $c_t =\sum\limits ^{|A|}_{i=1} α_{ti}h_i$
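The two attention equations can be sketched directly in NumPy; all dimensions and parameter names below are illustrative, not the actual model's:

```python
import numpy as np

def attention(A, s_t, W1, W2, v, b):
    """A: memory of shape (T, H); s_t: decoder state of shape (S,).
    Returns (alpha_t, c_t) following the two equations above."""
    scores = np.tanh(A @ W1.T + (W2 @ s_t + b)) @ v   # v^T tanh(W1 h_i + W2 s_t + b)
    e = np.exp(scores - scores.max())                 # numerically stable softmax
    alpha = e / e.sum()
    c_t = alpha @ A                                   # weighted sum of the h_i
    return alpha, c_t

# Demo with T=5 memory slots of size H=4, decoder state size S=3, score dim d=6.
rng = np.random.default_rng(1)
T, H, S, d = 5, 4, 3, 6
A = rng.normal(size=(T, H))
s = rng.normal(size=S)
alpha, c = attention(A, s, rng.normal(size=(d, H)), rng.normal(size=(d, S)),
                     rng.normal(size=d), np.zeros(d))
```

The weights $α_t$ form a distribution over the $|A|$ memory slots, so $c_t$ is a convex combination of the encoder states.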
Decoder
The decoder produces an output sequence $Y = (y_1, y_2, …, y_{T_0})$ and is initialized by $s_0 = \tanh(W_{init}h^t_T + b_{init})$, where $h^t_T$ is the textual encoder's last state.
The next decoder states are obtained as follows:
$s_t, o_t = Decoder(s_{t−1}, W_{in}[y_{t−1}: c_{t−1}])$
During training, $y_{t−1}$ is the ground-truth symbol in the target sequence, while $c_{t−1}$ is the previous context vector computed by the attention model. The current context vector $c_t$, concatenated with the output $o_t$, is used to compute a new vector $o'_t = W_{proj}[o_t; c_t] + b_{proj}$.
The probability distribution over the target vocabulary is computed by the equation :
$p(y_t|y_{t−1}, s_{t−1}, c_{t−1}, A^{X_1}, A^{X_2}) = softmax(W_{out}o’_t + b_{out})$
Loss Function
Cross Entropy
Cross entropy evaluates the difference between the predicted probability distribution and the true distribution for the current training example.
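For one decoding step this is the negative log-likelihood of the ground-truth symbol under the predicted softmax distribution; a minimal NumPy sketch:

```python
import numpy as np

def step_cross_entropy(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

# Uniform logits over the 3 actions {M, I, D} give a loss of log(3) ≈ 1.0986.
loss = step_cross_entropy(np.zeros(3), 0)
```

The sequence loss is the sum (or mean) of this quantity over all decoding steps.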
Optimizer
Adam
Adam (Adaptive Moment Estimation) uses first- and second-order moment estimates of the gradient to dynamically adjust the learning rate of each parameter.
It is commonly used with RNNs.
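A minimal sketch of the Adam update rule (default hyperparameters as in Kingma and Ba, 2015; the demo problem is an arbitrary quadratic, not the alignment model):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and its
    square (v), bias correction, then a per-parameter scaled step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Demo: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
```

Because the effective step size is normalized by $\sqrt{\hat{v}}$, Adam copes well with the widely varying gradient magnitudes typical of RNN training.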
Results
Experiment 1
DataSet
Sequences from SCOPe100_2.06, aligned at the family level; all sequences are shorter than 50 residues.
Training Set: 50952
Evaluation Set: 6369
Testing Set: 6369
Training:Evaluation:Testing=8:1:1
Data Format
Input1: _AG AGK GKK KKL KLF LFK FKC KCN CNE NEC ECK CKK KKT KTF TFT FTQ TQS QSS SSS SSL SLT LTV TVH VHQ HQR QRI RIH IHT HTG TGE GEK EK_
Input2: _VK VKP KPY PYG YGC GCS CSE SEC ECG CGK GKA KAF AFR FRS RSK SKS KSY SYL YLI LII IIH IHM HMR MRT RTH TH_
Target: D M D M M M M M M M M M M M M M M M M M M M M M M M M M D D D D
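The input lines above appear to be overlapping 3-mers with `_` padding at both ends; assuming that format, they can be generated as follows:

```python
def to_trigrams(seq):
    """Overlapping 3-mers with '_'-padded boundary tokens (assumed input format)."""
    return (['_' + seq[:2]]
            + [seq[i:i + 3] for i in range(len(seq) - 2)]
            + [seq[-2:] + '_'])

# e.g. ' '.join(to_trigrams("AGKK")) -> "_AG AGK GKK KK_"
```

This yields exactly one token per residue, so input and target lengths stay in step.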
Multimodal Joint Style
Multimodal Vector Concat
Concatenate the two input memories to obtain a joint representation.
Reason: easy to implement, and it uses less memory.
Effect
Horizontal axis: training steps
Vertical axis: loss
Analysis:
- The fluctuation of the loss shows that different batches have different characteristics that the current model has not yet captured.
- The overall decline of both training and evaluation loss shows that the model is learning from the training dataset.
Test Result:
- CASE 1:
I D M M M M M M M M M M M M M M M M M M M M M M M M M M M
I D M M M M M M M M M M M M M M M M M M M M M M M M M M M
- CASE 2:
I I I M M M M M M M M M M M M M M I M M M M M M M M M M M M M M M D M M M M M M M M M M M
I I I M M M M M M M M M M M M M M I M M M M M M M M M M M M M M M M D M M M M M M M M M M
- CASE 3:
D D D D D D D I I I I M M I M I I I M M M M I M M M M M I M M M M M M M M M M M M D D M M M M I I I I I I
D I I I M M D D D D M M M M I I I M M M M M M I I M M M M M M M M M M M M M M D M M M M M I I I I I
- CASE 4:
D I I M M M D D D M M M D D M M M M M D D D D D I I M M M M M M M I I M M M M I M M M M M D M M D D D M
I I I M M M D D D D D M M M D M M M D D D D D D M I I I M M M M M M M M M M M M I I M M M M M M D M D D D
Evaluation
Easy cases work.
Very long sequences do not work well.
The relationship between the difference of two IDM-style (strategy) results and the difference of the corresponding directly aligned results is not straightforward.
Problems
a. How to evaluate results is still open: a loss value alone is not enough, or a better loss may exist.
- add metrics for validation results
- build evaluation metrics for final results
- From the alignment results on the testing data, build structures, and compare them with structures built from the alignments produced by TM-align.
b. Memory footprint and data loading.
Improvement
Add metrics to the validation process.
The metrics measure the difference between two sequence alignments.
metric | formula |
---|---|
Modeller score | $S_{mod}=N_{correct}/L_{align}$ |
MaxSub score | $S_{maxsub}=N_{correct}/\min(L_q,L_p)$ |
balanced score | $S_{bal}=(S_{mod}+S_{maxsub})/2$ |

Here $N_{correct}$ is the number of correctly aligned residue pairs, $L_{align}$ is the alignment length, and $L_q$, $L_p$ are the lengths of the two sequences.
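Assuming $N_{correct}$ counts residue pairs aligned identically in the predicted and reference alignments, and taking $L_{align}$ as the number of aligned (non-gap) columns of the reference (an assumption), the scores can be sketched in pure Python over gapped strings with `-`:

```python
def aligned_pairs(row_a, row_b):
    """Set of (i, j) residue-index pairs matched by a gapped alignment."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_a, row_b):
        if a != '-' and b != '-':
            pairs.add((i, j))
        if a != '-':
            i += 1
        if b != '-':
            j += 1
    return pairs

def modeller_score(pred, ref):
    """S_mod = N_correct / L_align (L_align: aligned columns in the reference)."""
    ref_pairs = aligned_pairs(*ref)
    return len(aligned_pairs(*pred) & ref_pairs) / len(ref_pairs)

def maxsub_score(pred, ref):
    """S_maxsub = N_correct / min(L_q, L_p) (ungapped sequence lengths)."""
    n_correct = len(aligned_pairs(*pred) & aligned_pairs(*ref))
    lq = len(ref[0].replace('-', ''))
    lp = len(ref[1].replace('-', ''))
    return n_correct / min(lq, lp)
```

An alignment identical to the reference then scores 1.0 on the Modeller metric, and the balanced score is simply the mean of the two functions above.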