Research Plan

Objective

To propose a new sequence alignment method based on neural networks.

Problem

Given a set S of sequence pairs.

In S, each sequence pair <${Seq}_A$, ${Seq}_B$> has a known best alignment $Align({Seq}_A,{Seq}_B)$.

For example, for the pair <${Seq}_A$, ${Seq}_B$>:

LDRCVPKFWTLHKN

LEKCWTMNMCQKN

the best alignment form is:

LDRCVPKFWTL--HKN

LEKC--WTMNMCQKN

For a new sequence pair not in S, how do we predict its best alignment?

Further Formalizing the Problem

We recast this problem in another scenario.

There are two queues, QueueA and QueueB. SeqA is placed in QueueA and SeqB in QueueB. Three actions can be applied to the two queues: M, I, and D, which mean:

M: dequeue QueueA and QueueB
I: only dequeue QueueA
D: only dequeue QueueB

At each step, one of these actions is adopted.

When both queues are empty, we obtain a strategy sequence.

The problem then changes to the following statement.

Given a set S of sequence pairs.

In S, each sequence pair <${Seq}_A$, ${Seq}_B$> has a known best dequeue strategy $Align({Seq}_A,{Seq}_B)$.

For the SeqA and SeqB above, the best dequeue strategy is:

MMMMIIIIMMMMMDDDDMMM

For a new sequence pair not in S, how do we predict its best strategy sequence?
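
As a concrete illustration, here is a minimal Python sketch that converts an aligned pair (with '-' gap characters) into an M/I/D strategy string. The helper name and the toy alignment are illustrative, not taken from the dataset.

```python
def alignment_to_strategy(aligned_a: str, aligned_b: str, gap: str = "-") -> str:
    """Convert one pairwise alignment into an M/I/D dequeue-strategy string.

    M: both rows have a residue in this column (dequeue QueueA and QueueB)
    I: only SeqA has a residue, gap in SeqB (dequeue QueueA only)
    D: only SeqB has a residue, gap in SeqA (dequeue QueueB only)
    """
    assert len(aligned_a) == len(aligned_b), "alignment rows must be equally long"
    ops = []
    for ca, cb in zip(aligned_a, aligned_b):
        if ca != gap and cb != gap:
            ops.append("M")
        elif cb == gap:
            ops.append("I")
        else:
            ops.append("D")
    return "".join(ops)

# toy usage (hypothetical alignment):
print(alignment_to_strategy("AC-GT", "ACTG-"))  # -> "MMDMI"
```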

Motivation

The strengths of deep learning are feature extraction, metric-space mapping, and large-scale data processing. Considering the complexity of the sequence alignment problem, a simple deep learning method alone is not enough.

The seq2seq model provides a simple end-to-end architecture for multi-step decision problems. Furthermore, a seq2seq model can be easily modified and extended.

By formalizing sequence alignment as strategy-sequence prediction, we adopt a seq2seq neural network to model the sequence alignment problem.

Method

DataSet

SCOPe 2.06

| database | proteins | alignments_family | alignments_superfamily | alignments_fold |
|---|---|---|---|---|
| scope20_2.06 | 7659 | 16194 | 66229 | 133900 |
| scope40_2.06 | 13760 | 98558 | 305553 | 526136 |
| scope70_2.06 | 21132 | 314039 | 907374 | 1470385 |
| scope95_2.06 | 28010 | 1099425 | 2963970 | 4170984 |
| scope100_2.06 | 69590 | 6060416 | 16688439 | 23215716 |
| scopeall_2.06 | 216248 | 69574992 | 172157027 | 239906177 |

Model

Textual Encoder

Given two input sequences

$X_1=(a_1,a_2,…,a_{T_1}),\; X_2=(b_1,b_2,…,b_{T_2})$,

we obtain encodings of the input sequences using a Bi-LSTM / multi-layer LSTM encoder.
From another perspective, we can regard the encoder outputs as memories of the inputs.

$A^{X_1}=Encoder(a_1,a_2,…,a_{T_1})=\{h_1^{X_1},h_2^{X_1},…,h_{T_1}^{X_1}\}$

$A^{X_2}=Encoder(b_1,b_2,…,b_{T_2})=\{h_1^{X_2},h_2^{X_2},…,h_{T_2}^{X_2}\}$
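
A minimal PyTorch sketch of such an encoder (layer sizes and names are assumptions, not the exact configuration used here):

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """Bi-LSTM encoder: maps an embedded residue sequence to a memory of hidden states."""

    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden: int = 128, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)

    def forward(self, tokens):             # tokens: (batch, T)
        x = self.embed(tokens)             # (batch, T, emb_dim)
        memory, (h_n, c_n) = self.lstm(x)  # memory: (batch, T, 2*hidden) = {h_1, ..., h_T}
        return memory, h_n                 # the memory serves as A^{X} above
```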

Joint multimodal representation
Multimodal Vector Concat

As a first attempt, we concatenate the two input memories to obtain a joint representation.
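
For example, with memories shaped like the encoder sketch above, the join could be a single concatenation along the time axis (a sketch; the joining dimension is a design choice):

```python
import torch

# mem_a: memory of SeqA (batch, T1, H); mem_b: memory of SeqB (batch, T2, H)
mem_a, mem_b = torch.randn(1, 30, 256), torch.randn(1, 25, 256)
joint_memory = torch.cat([mem_a, mem_b], dim=1)  # joint memory: (batch, T1 + T2, H)
```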

Bilinear pooling.

Bilinear models (Tenenbaum and Freeman, 2000) take the outer product of two vectors $x ∈ R^{n_1}$ and $q ∈ R^{n_2}$ and learn a model $W$ (here linear), i.e.
$z = W [x ⊗ q]$, where $⊗$ denotes the outer product $(xq^T)$ and $[\cdot]$ denotes linearizing the matrix into a vector.
However, the high dimensional representation leads to an infeasible number of parameters to learn in $W$.
We thus need a method that projects the outer product to a lower dimensional space and also avoids computing the outer product directly.
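
A small numeric sketch of why the naive formulation is infeasible (the dimensions 2048 and 1000 are illustrative assumptions):

```python
import torch

n1, n2, d_out = 2048, 2048, 1000
x, q = torch.randn(n1), torch.randn(n2)

outer = torch.outer(x, q).reshape(-1)  # linearized x ⊗ q: n1 * n2 dimensions
print(outer.numel())                   # 4,194,304 dims per example
print(d_out * outer.numel())           # ~4.2e9 parameters needed for W
```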

Multimodal Compact Bilinear Pooling (MCB)


Following Fukui et al. (2016), we rely on the Count Sketch projection $Ψ(v, h, s)$ (Charikar et al., 2002), which projects a vector $v ∈ R^n$ to $y ∈ R^d$ using a fixed random hash $h$ and random signs $s$. This allows us to project the outer product to a lower-dimensional space, which reduces the number of parameters in $W$. To avoid computing the outer product explicitly, Pham and Pagh (2013) showed that the count sketch of the outer product of two vectors can be expressed as the convolution of both count sketches: $Ψ(x ⊗ q, h, s) = Ψ(x, h, s) ∗ Ψ(q, h, s)$, where $∗$ is the convolution operator. Additionally, the convolution theorem states that convolution in the time domain is equivalent to element-wise product in the frequency domain. The convolution $x’∗ q’$, with $x’ = Ψ(x, h, s)$ and $q’ = Ψ(q, h, s)$, can therefore be rewritten as ${FFT}^{−1}(FFT(x’) \centerdot FFT(q’))$, where $\centerdot$ refers to element-wise product. These ideas are formalized in Algorithm 1 of Fukui et al. (2016), which is based on the Tensor Sketch algorithm of Pham and Pagh (2013). We invoke the algorithm with $x = A^{X_1}$ and $q = A^{X_2}$. We note that this easily extends and remains efficient for more than two multimodal inputs, since the combination happens as an element-wise product.
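
A compact PyTorch sketch of this Tensor Sketch / MCB idea (dimensions are illustrative; in practice the random $h$ and $s$ would be sampled once and kept fixed, not resampled per call as done here for brevity):

```python
import torch

def count_sketch(v: torch.Tensor, h: torch.Tensor, s: torch.Tensor, d: int) -> torch.Tensor:
    """Ψ(v, h, s): project v (dim n) to d dims; h maps indices to buckets, s holds ±1 signs."""
    sketch = torch.zeros(d)
    sketch.index_add_(0, h, s * v)
    return sketch

def mcb(x: torch.Tensor, q: torch.Tensor, d: int = 1024) -> torch.Tensor:
    """Multimodal compact bilinear pooling of two vectors via Tensor Sketch."""
    n1, n2 = x.numel(), q.numel()
    h1, s1 = torch.randint(0, d, (n1,)), torch.randint(0, 2, (n1,)).float() * 2 - 1
    h2, s2 = torch.randint(0, d, (n2,)), torch.randint(0, 2, (n2,)).float() * 2 - 1
    x_sk, q_sk = count_sketch(x, h1, s1, d), count_sketch(q, h2, s2, d)
    # convolution of the two sketches == element-wise product in the frequency domain
    z = torch.fft.irfft(torch.fft.rfft(x_sk) * torch.fft.rfft(q_sk), n=d)
    return z

z = mcb(torch.randn(2048), torch.randn(2048))  # compact joint representation, 1024-d
```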

Attention

At every time step, the attention mechanism computes a context vector $c_t$, given the current decoder state $s_t$ and the memory $A$ (the joint representation of $A^{X_1}$ and $A^{X_2}$).
First, we compute attention weights $α_t = \mathrm{softmax}(v^T \tanh(W_1 A + W_2 s_t + b))$.
Then, the context vector is obtained as the weighted sum $c_t =\sum\limits ^{|A|}_{i=1} α_{ti}h_i$.
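
A PyTorch sketch of this additive attention step (shapes and names are assumptions):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Computes α_t = softmax(v^T tanh(W1 A + W2 s_t + b)) and c_t = Σ α_ti h_i."""

    def __init__(self, mem_dim: int, state_dim: int, attn_dim: int = 128):
        super().__init__()
        self.W1 = nn.Linear(mem_dim, attn_dim, bias=False)
        self.W2 = nn.Linear(state_dim, attn_dim, bias=True)   # bias plays the role of b
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, memory, s_t):
        # memory: (batch, |A|, mem_dim), s_t: (batch, state_dim)
        scores = self.v(torch.tanh(self.W1(memory) + self.W2(s_t).unsqueeze(1)))  # (batch, |A|, 1)
        alpha = torch.softmax(scores, dim=1)                                      # attention weights
        c_t = (alpha * memory).sum(dim=1)                                         # context vector
        return c_t, alpha.squeeze(-1)
```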

Decoder

The decoder produces an output sequence $Y = (y_1, y_2, …, y_{T_0})$ and is initialized with $s_0 = \tanh(W_{init}h^t_T + b_{init})$, where $h^t_T$ is the textual encoder's last state.
The next decoder states are obtained as follows:
$s_t, o_t = Decoder(s_{t−1}, W_{in}[y_{t−1}; c_{t−1}])$
During training, $y_{t−1}$ is the ground-truth symbol in the target sequence, while $c_{t−1}$ is the previous attention vector computed by the attention model. The current attention vector $c_t$, concatenated with the output $o_t$, is used to compute a new vector $o’_t = W_{proj} [o_t; c_t] + b_{proj}$.
The probability distribution over the target vocabulary is computed by:
$p(y_t|y_{t−1}, s_{t−1}, c_{t−1}, A^{X_1}, A^{X_2}) = \mathrm{softmax}(W_{out}o’_t + b_{out})$
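
A PyTorch sketch of one decoder step, reusing the attention sketch above (dimensions, names, and the choice of an LSTM cell are assumptions):

```python
import torch
import torch.nn as nn

class StrategyDecoder(nn.Module):
    """One decoding step over the strategy vocabulary (M, I, D plus special tokens)."""

    def __init__(self, vocab: int, emb: int = 64, hidden: int = 256, ctx: int = 256):
        super().__init__()
        # ctx must equal the dimensionality of the encoder memory vectors h_i
        self.embed = nn.Embedding(vocab, emb)
        self.cell = nn.LSTMCell(emb + ctx, hidden)
        self.proj = nn.Linear(hidden + ctx, hidden)  # o'_t = W_proj [o_t; c_t] + b_proj
        self.out = nn.Linear(hidden, vocab)          # logits for softmax(W_out o'_t + b_out)

    def step(self, y_prev, c_prev, state, memory, attn):
        # y_prev: (batch,) previous symbol ids; c_prev: (batch, ctx) previous context vector
        inp = torch.cat([self.embed(y_prev), c_prev], dim=-1)
        h, c = self.cell(inp, state)                 # new decoder state s_t; output o_t = h
        c_t, _ = attn(memory, h)                     # current context from the attention module
        o_proj = self.proj(torch.cat([h, c_t], dim=-1))
        logits = self.out(o_proj)                    # apply softmax to obtain p(y_t | ...)
        return logits, (h, c), c_t
```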

Loss Function
Cross Entropy

The cross-entropy loss evaluates the difference between the predicted probability distribution and the true distribution at each decoding step during training.
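
In PyTorch this is typically a single call (a sketch; the padding index and tensor shapes are assumptions):

```python
import torch
import torch.nn as nn

PAD_ID = 0                                    # assumed id of the padding symbol
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

# logits: (batch, T_out, vocab) decoder outputs; targets: (batch, T_out) gold strategy ids
logits, targets = torch.randn(2, 10, 5), torch.randint(0, 5, (2, 10))
loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```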

Optimizer
Adam

Adam (Adaptive Moment Estimation) uses first- and second-order moment estimates of the gradients to dynamically adjust the learning rate of each parameter.
It is commonly used together with RNNs.
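
A minimal usage sketch in PyTorch (the learning rate and beta values shown are the library defaults; the model and loss are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 3)                         # placeholder for the seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(8, 4)).pow(2).mean()   # placeholder loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```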

Results

Experiment 1
DataSet

Sequences are taken from scope100_2.06 and aligned at the family level. Furthermore, only sequences of length less than 50 are used.

Training Set: 50952
Evaluation Set: 6369
Testing Set: 6369
Training:Evaluation:Testing=8:1:1

Data Format

Input1: _AG AGK GKK KKL KLF LFK FKC KCN CNE NEC ECK CKK KKT KTF TFT FTQ TQS QSS SSS SSL SLT LTV TVH VHQ HQR QRI RIH IHT HTG TGE GEK EK_

Input2: _VK VKP KPY PYG YGC GCS CSE SEC ECG CGK GKA KAF AFR FRS RSK SKS KSY SYL YLI LII IIH IHM HMR MRT RTH TH_

Target: D M D M M M M M M M M M M M M M M M M M M M M M M M M M D D D D
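
One plausible reading of this input format is overlapping 3-mers over the residue sequence, padded with '_' at both ends. A small sketch (the helper name is hypothetical):

```python
def to_trigrams(seq: str, pad: str = "_") -> list[str]:
    """Turn a residue sequence into overlapping 3-mers, padded at both ends.

    'AGKKLFK' -> ['_AG', 'AGK', 'GKK', 'KKL', 'KLF', 'LFK', 'FK_']
    """
    padded = pad + seq + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(" ".join(to_trigrams("AGKKLFK")))  # _AG AGK GKK KKL KLF LFK FK_
```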

Multimodal Joint Style

Multimodal Vector Concat
Concatenate the two input memories to obtain a joint representation.
Reason: easy to implement and uses less memory.

Effect


[Figure: training and evaluation loss curves. Horizontal axis: steps; vertical axis: loss.]

Analysis:

  1. The fluctuation of the loss illustrates that different batches have different characteristics which the current model has not yet captured.
  2. The overall decline of the training and evaluation losses illustrates that the current model is learning something from the training data.

Test Result:

  • CASE 1:
    I D M M M M M M M M M M M M M M M M M M M M M M M M M M M
    I D M M M M M M M M M M M M M M M M M M M M M M M M M M M
  • CASE 2:
    I I I M M M M M M M M M M M M M M I M M M M M M M M M M M M M M M D M M M M M M M M M M M
    I I I M M M M M M M M M M M M M M I M M M M M M M M M M M M M M M M D M M M M M M M M M M
  • CASE 3:
    D D D D D D D I I I I M M I M I I I M M M M I M M M M M I M M M M M M M M M M M M D D M M M M I I I I I I
    D I I I M M D D D D M M M M I I I M M M M M M I I M M M M M M M M M M M M M M D M M M M M I I I I I
  • CASE 4:
    D I I M M M D D D M M M D D M M M M M D D D D D I I M M M M M M M I I M M M M I M M M M M D M M D D D M
    I I I M M M D D D D D M M M D M M M D D D D D D M I I I M M M M M M M M M M M M I I M M M M M M D M D D D
Evaluation

Easy cases work.
Sequences that are too long do not work well.

The relationship between differences measured on the M/I/D-style alignment results and differences measured on the straightforward (gapped-sequence) alignment results is not so straightforward.

Problems

a. How to evaluate the results is still a problem: a loss value alone is not enough. Or is there a better loss?

  • Add some metrics for validation results.
  • Build evaluation metrics for the final results.
    • From the alignment results on the testing data, build structures and compare them with structures built from the alignment results of TM-align.

b. Memory footprint and data loading.

Improvement
Add metrics for the validation process.

These metrics measure the difference between two sequence alignments.

Metric formulas
Modeller score: $S_{mod}=N_{correct}/L_{align}$
MaxSub score: $S_{maxsub}=N_{correct}/\min(L_q,L_p)$
Balance score: $S_{bal}=(S_{mod}+S_{maxsub})/2$
Here $N_{correct}$ is the number of correctly aligned residue pairs (relative to the reference alignment), $L_{align}$ is the length of the alignment, and $L_q$, $L_p$ are the lengths of the two sequences.
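
A sketch of how these scores could be computed from predicted and reference M/I/D strategy strings; this reading of $N_{correct}$ and $L_{align}$ (correct pairs shared with the reference, normalized by the predicted alignment's matched pairs) is an assumption:

```python
def matched_pairs(strategy: str) -> set[tuple[int, int]]:
    """Residue pairs (i, j) aligned by the M operations of an M/I/D strategy string."""
    pairs, i, j = set(), 0, 0
    for op in strategy:
        if op == "M":
            pairs.add((i, j)); i += 1; j += 1
        elif op == "I":
            i += 1
        else:  # "D"
            j += 1
    return pairs

def alignment_scores(pred: str, ref: str, len_a: int, len_b: int) -> dict[str, float]:
    p, r = matched_pairs(pred), matched_pairs(ref)
    n_correct = len(p & r)
    s_mod = n_correct / len(p) if p else 0.0     # S_mod = N_correct / L_align
    s_maxsub = n_correct / min(len_a, len_b)     # S_maxsub = N_correct / min(L_q, L_p)
    return {"mod": s_mod, "maxsub": s_maxsub, "bal": (s_mod + s_maxsub) / 2}

# toy usage with hypothetical strategy strings for two length-5 sequences
print(alignment_scores("MMDMIM", "MMDMMI", len_a=5, len_b=5))
```
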
Results

Conclusion