Research Plan

Objective

To propose a new sequence alignment method based on neural networks.

Problem

Given a set S of sequence pairs.

In S, each sequence pair <${Seq}_A$, ${Seq}_B$> has a known best alignment $Align({Seq}_A,{Seq}_B)$.

For example, for the pair <${Seq}_A$, ${Seq}_B$>:

LDRCVPKFWTLHKN

LEKCWTMNMCQKN

the best alignment form is:

LDRCVPKFWTL--HKN

LEKC--WTMNMCQKN

For a new sequence pair not in S, how do we predict its best alignment?

Further Formalizing the Problem

We recast this problem in another scenario.

There are two queues, QueueA and QueueB. SeqA is placed in QueueA and SeqB in QueueB. Three actions can be applied to the two queues: M, I, and D, which mean:

M: dequeue QueueA and QueueB
I: only dequeue QueueA
D: only dequeue QueueB

At each step, one of these actions is adopted.

When both queues are empty, we obtain a strategy sequence.

The problem then changes to the following statement.

Given a set S of sequence pairs.

In S, each sequence pair <${Seq}_A$, ${Seq}_B$> has a known best dequeue strategy $Align({Seq}_A,{Seq}_B)$.

For the SeqA and SeqB above, the best dequeue strategy is:

MMMMIIIIMMMMMDDDDMMM

For a new sequence pair not in S, how do we predict its best strategy sequence?
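
As a concrete illustration, here is a minimal Python sketch that converts an aligned pair (with '-' gap characters) into an M/I/D strategy string. The helper name and the toy alignment are illustrative, not taken from the dataset.

```python
def alignment_to_strategy(aligned_a: str, aligned_b: str, gap: str = "-") -> str:
    """Convert one pairwise alignment into an M/I/D dequeue-strategy string.

    M: both rows have a residue in this column (dequeue QueueA and QueueB)
    I: only SeqA has a residue, gap in SeqB (dequeue QueueA only)
    D: only SeqB has a residue, gap in SeqA (dequeue QueueB only)
    """
    assert len(aligned_a) == len(aligned_b), "alignment rows must be equally long"
    ops = []
    for ca, cb in zip(aligned_a, aligned_b):
        if ca != gap and cb != gap:
            ops.append("M")
        elif cb == gap:
            ops.append("I")
        else:
            ops.append("D")
    return "".join(ops)

# toy usage (hypothetical alignment):
print(alignment_to_strategy("AC-GT", "ACTG-"))  # -> "MMDMI"
```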

Motivation

The strengths of deep learning are feature extraction, metric-space mapping, and large-scale data processing. Considering the complexity of the sequence alignment problem, a simple deep learning method alone is not enough.

The seq2seq model provides a simple end-to-end architecture for multi-step decision problems. Furthermore, a seq2seq model can be easily modified and extended.

By formalizing sequence alignment as strategy-sequence prediction, we adopt a seq2seq neural network to model the sequence alignment problem.

Method

DataSet

SCOPe 2.06

| database | proteins | alignments_family | alignments_superfamily | alignments_fold |
|---|---|---|---|---|
| scope20_2.06 | 7659 | 16194 | 66229 | 133900 |
| scope40_2.06 | 13760 | 98558 | 305553 | 526136 |
| scope70_2.06 | 21132 | 314039 | 907374 | 1470385 |
| scope95_2.06 | 28010 | 1099425 | 2963970 | 4170984 |
| scope100_2.06 | 69590 | 6060416 | 16688439 | 23215716 |
| scopeall_2.06 | 216248 | 69574992 | 172157027 | 239906177 |

Model

Textual Encoder

Given two input sequences

$X_1=(a_1,a_2,…,a_{T_1}),\; X_2=(b_1,b_2,…,b_{T_2})$,

we obtain encodings of the input sequences using a Bi-LSTM / multi-layer LSTM encoder.
From another perspective, we can regard the encoder outputs as memories of the inputs.

$A^{X_1}=Encoder(a_1,a_2,…,a_{T_1})=\{h_1^{X_1},h_2^{X_1},…,h_{T_1}^{X_1}\}$

$A^{X_2}=Encoder(b_1,b_2,…,b_{T_2})=\{h_1^{X_2},h_2^{X_2},…,h_{T_2}^{X_2}\}$
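
A minimal PyTorch sketch of such an encoder (layer sizes and names are assumptions, not the exact configuration used here):

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """Bi-LSTM encoder: maps an embedded residue sequence to a memory of hidden states."""

    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden: int = 128, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)

    def forward(self, tokens):             # tokens: (batch, T)
        x = self.embed(tokens)             # (batch, T, emb_dim)
        memory, (h_n, c_n) = self.lstm(x)  # memory: (batch, T, 2*hidden) = {h_1, ..., h_T}
        return memory, h_n                 # the memory serves as A^{X} above
```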

Joint multimodal representation
Multimodal Vector Concat

As a first attempt, we concatenate the two input memories to obtain a joint representation.
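
For example, with memories shaped like the encoder sketch above, the join could be a single concatenation along the time axis (a sketch; the joining dimension is a design choice):

```python
import torch

# mem_a: memory of SeqA (batch, T1, H); mem_b: memory of SeqB (batch, T2, H)
mem_a, mem_b = torch.randn(1, 30, 256), torch.randn(1, 25, 256)
joint_memory = torch.cat([mem_a, mem_b], dim=1)  # joint memory: (batch, T1 + T2, H)
```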

Bilinear pooling.

Bilinear models (Tenenbaum and Freeman, 2000) take the outer product of two vectors $x ∈ R^{n_1}$ and $q ∈ R^{n_2}$ and learn a model $W$ (here linear), i.e.
$z = W [x ⊗ q]$, where $⊗$ denotes the outer product $(xq^T)$ and $[\cdot]$ denotes linearizing the matrix into a vector.
However, the high dimensional representation leads to an infeasible number of parameters to learn in $W$.
We thus need a method that projects the outer product to a lower dimensional space and also avoids computing the outer product directly.
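
A small numeric sketch of why the naive formulation is infeasible (the dimensions 2048 and 1000 are illustrative assumptions):

```python
import torch

n1, n2, d_out = 2048, 2048, 1000
x, q = torch.randn(n1), torch.randn(n2)

outer = torch.outer(x, q).reshape(-1)  # linearized x ⊗ q: n1 * n2 dimensions
print(outer.numel())                   # 4,194,304 dims per example
print(d_out * outer.numel())           # ~4.2e9 parameters needed for W
```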

Multimodal Compact Bilinear Pooling (MCB)


Following Fukui et al. (2016), we rely on the Count Sketch projection $Ψ(v, h, s)$ (Charikar et al., 2002), which projects a vector $v ∈ R^n$ to $y ∈ R^d$ using a fixed random hash $h$ and random signs $s$. This allows us to project the outer product to a lower-dimensional space, which reduces the number of parameters in $W$. To avoid computing the outer product explicitly, Pham and Pagh (2013) showed that the count sketch of the outer product of two vectors can be expressed as the convolution of both count sketches: $Ψ(x ⊗ q, h, s) = Ψ(x, h, s) ∗ Ψ(q, h, s)$, where $∗$ is the convolution operator. Additionally, the convolution theorem states that convolution in the time domain is equivalent to element-wise product in the frequency domain. The convolution $x’∗ q’$, with $x’ = Ψ(x, h, s)$ and $q’ = Ψ(q, h, s)$, can therefore be rewritten as ${FFT}^{−1}(FFT(x’) \centerdot FFT(q’))$, where $\centerdot$ refers to element-wise product. These ideas are formalized in Algorithm 1 of Fukui et al. (2016), which is based on the Tensor Sketch algorithm of Pham and Pagh (2013). We invoke the algorithm with $x = A^{X_1}$ and $q = A^{X_2}$. We note that this easily extends and remains efficient for more than two multimodal inputs, since the combination happens as an element-wise product.
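
A compact PyTorch sketch of this Tensor Sketch / MCB idea (dimensions are illustrative; in practice the random $h$ and $s$ would be sampled once and kept fixed, not resampled per call as done here for brevity):

```python
import torch

def count_sketch(v: torch.Tensor, h: torch.Tensor, s: torch.Tensor, d: int) -> torch.Tensor:
    """Ψ(v, h, s): project v (dim n) to d dims; h maps indices to buckets, s holds ±1 signs."""
    sketch = torch.zeros(d)
    sketch.index_add_(0, h, s * v)
    return sketch

def mcb(x: torch.Tensor, q: torch.Tensor, d: int = 1024) -> torch.Tensor:
    """Multimodal compact bilinear pooling of two vectors via Tensor Sketch."""
    n1, n2 = x.numel(), q.numel()
    h1, s1 = torch.randint(0, d, (n1,)), torch.randint(0, 2, (n1,)).float() * 2 - 1
    h2, s2 = torch.randint(0, d, (n2,)), torch.randint(0, 2, (n2,)).float() * 2 - 1
    x_sk, q_sk = count_sketch(x, h1, s1, d), count_sketch(q, h2, s2, d)
    # convolution of the two sketches == element-wise product in the frequency domain
    z = torch.fft.irfft(torch.fft.rfft(x_sk) * torch.fft.rfft(q_sk), n=d)
    return z

z = mcb(torch.randn(2048), torch.randn(2048))  # compact joint representation, 1024-d
```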

Attention

At every time step, the attention mechanism computes a context vector $c_t$, given the current decoder state $s_t$ and the memory $A$ (the joint representation of $A^{X_1}$ and $A^{X_2}$).
First, we compute attention weights $α_t = \mathrm{softmax}(v^T \tanh(W_1 A + W_2 s_t + b))$.
Then, the context vector is obtained as the weighted sum $c_t =\sum\limits ^{|A|}_{i=1} α_{ti}h_i$.
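
A PyTorch sketch of this additive attention step (shapes and names are assumptions):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Computes α_t = softmax(v^T tanh(W1 A + W2 s_t + b)) and c_t = Σ α_ti h_i."""

    def __init__(self, mem_dim: int, state_dim: int, attn_dim: int = 128):
        super().__init__()
        self.W1 = nn.Linear(mem_dim, attn_dim, bias=False)
        self.W2 = nn.Linear(state_dim, attn_dim, bias=True)   # bias plays the role of b
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, memory, s_t):
        # memory: (batch, |A|, mem_dim), s_t: (batch, state_dim)
        scores = self.v(torch.tanh(self.W1(memory) + self.W2(s_t).unsqueeze(1)))  # (batch, |A|, 1)
        alpha = torch.softmax(scores, dim=1)                                      # attention weights
        c_t = (alpha * memory).sum(dim=1)                                         # context vector
        return c_t, alpha.squeeze(-1)
```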

Decoder

The decoder produces an output sequence $Y = (y_1, y_2, …, y_{T_0})$ and is initialized with $s_0 = \tanh(W_{init}h^t_T + b_{init})$, where $h^t_T$ is the textual encoder's last state.
The next decoder states are obtained as follows:
$s_t, o_t = Decoder(s_{t−1}, W_{in}[y_{t−1}; c_{t−1}])$
During training, $y_{t−1}$ is the ground-truth symbol in the target sequence, while $c_{t−1}$ is the previous attention vector computed by the attention model. The current attention vector $c_t$, concatenated with the output $o_t$, is used to compute a new vector $o’_t = W_{proj} [o_t; c_t] + b_{proj}$.
The probability distribution over the target vocabulary is computed by:
$p(y_t|y_{t−1}, s_{t−1}, c_{t−1}, A^{X_1}, A^{X_2}) = \mathrm{softmax}(W_{out}o’_t + b_{out})$
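
A PyTorch sketch of one decoder step, reusing the attention sketch above (dimensions, names, and the choice of an LSTM cell are assumptions):

```python
import torch
import torch.nn as nn

class StrategyDecoder(nn.Module):
    """One decoding step over the strategy vocabulary (M, I, D plus special tokens)."""

    def __init__(self, vocab: int, emb: int = 64, hidden: int = 256, ctx: int = 256):
        super().__init__()
        # ctx must equal the dimensionality of the encoder memory vectors h_i
        self.embed = nn.Embedding(vocab, emb)
        self.cell = nn.LSTMCell(emb + ctx, hidden)
        self.proj = nn.Linear(hidden + ctx, hidden)  # o'_t = W_proj [o_t; c_t] + b_proj
        self.out = nn.Linear(hidden, vocab)          # logits for softmax(W_out o'_t + b_out)

    def step(self, y_prev, c_prev, state, memory, attn):
        # y_prev: (batch,) previous symbol ids; c_prev: (batch, ctx) previous context vector
        inp = torch.cat([self.embed(y_prev), c_prev], dim=-1)
        h, c = self.cell(inp, state)                 # new decoder state s_t; output o_t = h
        c_t, _ = attn(memory, h)                     # current context from the attention module
        o_proj = self.proj(torch.cat([h, c_t], dim=-1))
        logits = self.out(o_proj)                    # apply softmax to obtain p(y_t | ...)
        return logits, (h, c), c_t
```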

Loss Function
Cross Entropy

The cross-entropy loss evaluates the difference between the predicted probability distribution and the true distribution at each decoding step during training.
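
In PyTorch this is typically a single call (a sketch; the padding index and tensor shapes are assumptions):

```python
import torch
import torch.nn as nn

PAD_ID = 0                                    # assumed id of the padding symbol
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

# logits: (batch, T_out, vocab) decoder outputs; targets: (batch, T_out) gold strategy ids
logits, targets = torch.randn(2, 10, 5), torch.randint(0, 5, (2, 10))
loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```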

Optimizer
Adam

Adam (Adaptive Moment Estimation) uses first- and second-order moment estimates of the gradients to dynamically adjust the learning rate of each parameter.
It is commonly used together with RNNs.
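
A minimal usage sketch in PyTorch (the learning rate and beta values shown are the library defaults; the model and loss are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 3)                         # placeholder for the seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(8, 4)).pow(2).mean()   # placeholder loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```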

Results

Experiment 1
DataSet

Sequences are taken from scope100_2.06 and aligned at the family level. Furthermore, only sequences of length less than 50 are used.

Training Set: 50952
Evaluation Set: 6369
Testing Set: 6369
Training:Evaluation:Testing=8:1:1

Data Format

Input1: _AG AGK GKK KKL KLF LFK FKC KCN CNE NEC ECK CKK KKT KTF TFT FTQ TQS QSS SSS SSL SLT LTV TVH VHQ HQR QRI RIH IHT HTG TGE GEK EK_

Input2: _VK VKP KPY PYG YGC GCS CSE SEC ECG CGK GKA KAF AFR FRS RSK SKS KSY SYL YLI LII IIH IHM HMR MRT RTH TH_

Target: D M D M M M M M M M M M M M M M M M M M M M M M M M M M D D D D
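
One plausible reading of this input format is overlapping 3-mers over the residue sequence, padded with '_' at both ends. A small sketch (the helper name is hypothetical):

```python
def to_trigrams(seq: str, pad: str = "_") -> list[str]:
    """Turn a residue sequence into overlapping 3-mers, padded at both ends.

    'AGKKLFK' -> ['_AG', 'AGK', 'GKK', 'KKL', 'KLF', 'LFK', 'FK_']
    """
    padded = pad + seq + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(" ".join(to_trigrams("AGKKLFK")))  # _AG AGK GKK KKL KLF LFK FK_
```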

Multimodal Joint Style

Multimodal Vector Concat
Concatenate the two input memories to obtain a joint representation.
Reason: easy to implement and uses less memory.

Effect


[Figure: training and evaluation loss curves. Horizontal axis: steps; vertical axis: loss.]

Analysis:

  1. The fluctuation of the loss illustrates that different batches have different characteristics which the current model has not yet captured.
  2. The overall decline of the training and evaluation losses illustrates that the current model is learning something from the training data.

Test Result:

  • CASE 1:
    I D M M M M M M M M M M M M M M M M M M M M M M M M M M M
    I D M M M M M M M M M M M M M M M M M M M M M M M M M M M
  • CASE 2:
    I I I M M M M M M M M M M M M M M I M M M M M M M M M M M M M M M D M M M M M M M M M M M
    I I I M M M M M M M M M M M M M M I M M M M M M M M M M M M M M M M D M M M M M M M M M M
  • CASE 3:
    D D D D D D D I I I I M M I M I I I M M M M I M M M M M I M M M M M M M M M M M M D D M M M M I I I I I I
    D I I I M M D D D D M M M M I I I M M M M M M I I M M M M M M M M M M M M M M D M M M M M I I I I I
  • CASE 4:
    D I I M M M D D D M M M D D M M M M M D D D D D I I M M M M M M M I I M M M M I M M M M M D M M D D D M
    I I I M M M D D D D D M M M D M M M D D D D D D M I I I M M M M M M M M M M M M I I M M M M M M D M D D D
Evaluation

Easy cases work.
Sequences that are too long do not work well.

The relationship between differences measured on the M/I/D-style alignment results and differences measured on the straightforward (gapped-sequence) alignment results is not so straightforward.

Problems

a. How to evaluate the results is still a problem: a loss value alone is not enough. Or is there a better loss?

  • Add some metrics for validation results.
  • Build evaluation metrics for the final results.
    • From the alignment results on the testing data, build structures and compare them with structures built from the alignment results of TM-align.

b. Memory footprint and data loading.

Improvement
Add metrics for the validation process.

These metrics measure the difference between two sequence alignments.

Metric formulas
Modeller score: $S_{mod}=N_{correct}/L_{align}$
MaxSub score: $S_{maxsub}=N_{correct}/\min(L_q,L_p)$
Balance score: $S_{bal}=(S_{mod}+S_{maxsub})/2$
Here $N_{correct}$ is the number of correctly aligned residue pairs (relative to the reference alignment), $L_{align}$ is the length of the alignment, and $L_q$, $L_p$ are the lengths of the two sequences.
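
A sketch of how these scores could be computed from predicted and reference M/I/D strategy strings; this reading of $N_{correct}$ and $L_{align}$ (correct pairs shared with the reference, normalized by the predicted alignment's matched pairs) is an assumption:

```python
def matched_pairs(strategy: str) -> set[tuple[int, int]]:
    """Residue pairs (i, j) aligned by the M operations of an M/I/D strategy string."""
    pairs, i, j = set(), 0, 0
    for op in strategy:
        if op == "M":
            pairs.add((i, j)); i += 1; j += 1
        elif op == "I":
            i += 1
        else:  # "D"
            j += 1
    return pairs

def alignment_scores(pred: str, ref: str, len_a: int, len_b: int) -> dict[str, float]:
    p, r = matched_pairs(pred), matched_pairs(ref)
    n_correct = len(p & r)
    s_mod = n_correct / len(p) if p else 0.0     # S_mod = N_correct / L_align
    s_maxsub = n_correct / min(len_a, len_b)     # S_maxsub = N_correct / min(L_q, L_p)
    return {"mod": s_mod, "maxsub": s_maxsub, "bal": (s_mod + s_maxsub) / 2}

# toy usage with hypothetical strategy strings for two length-5 sequences
print(alignment_scores("MMDMIM", "MMDMMI", len_a=5, len_b=5))
```
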
Results

Conclusion