Embeddings in Biological Sequences

Objective

Construct a distributed representation of biological sequences.

Word2vec

CBOW
Skip-gram

Doc2vec

Doc2vec-DBOW

Doc2vec-DM

Effect

In practice, the resulting embeddings are unstable.

BioVec: ProtVec + GeneVec (PLOS ONE)

Objective

Our goal is to construct a distributed representation of biological sequences.

Method

Here, a biological sequence is treated like a sentence in a text corpus, while the k-mers derived from the sequence are treated like words and given as input to the embedding algorithm.
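As a minimal sketch of this idea (the function name is illustrative, not from the paper), a sequence can be turned into a "sentence" of overlapping k-mer "words":

```python
def kmer_sentence(seq, k=3):
    """Split a biological sequence into overlapping k-mers ("words")."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A short protein fragment becomes a sentence of 3-mer words:
print(kmer_sentence("MKTAYIAK"))
# ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']
```

Lists of such sentences are what a word2vec-style trainer consumes in place of text.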

Data Preprocess

NN Model

Skip-gram Model
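A rough sketch of what skip-gram trains on (not the paper's implementation): each k-mer is paired with the k-mers in a window around it, and the model learns to predict context from target.

```python
def skipgram_pairs(words, window=2):
    """Generate (target, context) training pairs for a skip-gram model."""
    pairs = []
    for i, target in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target itself
                pairs.append((target, words[j]))
    return pairs

sentence = ["QWE", "WER", "ERT", "RTY"]
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

These pairs are the supervision signal; the embedding itself is the learned input-layer weight for each k-mer.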

Evaluation
Protein Family Classification

To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot, belonging to 7,027 protein families.
Database: Swiss-Prot
Method: SVM
Results: an average family classification accuracy of 93% ± 0.06% is obtained, outperforming existing family classification methods.

Disordered Proteins Classification

In addition, we use ProtVec representation to predict disordered proteins from structured proteins.
Database: the DisProt database as well as FG-Nups (a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats).
Method: support vector machine (SVM) classifiers.
Results: FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy.

Conclusion
  • By providing only sequence data for various proteins to this model, accurate information about protein structure can be determined.
  • Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest.
  • Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics.

Seq2vec (DTMBio)

Objective

Constructing a distributed representation of biological sequences.

Method

Our algorithm is based on the doc2vec approach, which is an extension of the original word2vec algorithm.

Data Preprocess

Two approaches are used.

Non-overlapping Process:

QWERTYQWERTY
->
Seq 1: QWE RTY QWE RTY
Seq 2: WER TYQ WER
Seq 3: ERT YQW ERT

Overlapping Process:

QWERTYQWERTY
->
QWE WER ERT RTY TYQ YQW QWE WER ERT RTY
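The two preprocessing schemes above can be sketched as follows (helper names are illustrative; partial k-mers at the end of a frame are dropped, matching the example):

```python
def nonoverlapping_splits(seq, k=3):
    """Produce k shifted sequences of non-overlapping k-mers, one per reading frame."""
    return [
        [seq[i:i + k] for i in range(shift, len(seq) - k + 1, k)]
        for shift in range(k)
    ]

def overlapping_split(seq, k=3):
    """Slide a window of size k one position at a time over the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(nonoverlapping_splits("QWERTYQWERTY"))
# [['QWE', 'RTY', 'QWE', 'RTY'], ['WER', 'TYQ', 'WER'], ['ERT', 'YQW', 'ERT']]
print(overlapping_split("QWERTYQWERTY"))
# ['QWE', 'WER', 'ERT', 'RTY', 'TYQ', 'YQW', 'QWE', 'WER', 'ERT', 'RTY']
```

The non-overlapping scheme yields k shorter "documents" per sequence, while the overlapping scheme yields one longer document.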

Evaluation

Protein Classification.
First, we use the protein vectors learned using seq2vec and ProtVecs as features and compare them for the task of protein classification using SVMs.

Next, since distributed representations embed similar sequences in proximity to each other, we use k-nearest neighbors (kNN) to retrieve the k nearest sequences in the vector space and see how successfully the family of a test sequence can be predicted by majority vote.
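A minimal sketch of the majority-vote retrieval step, assuming sequence vectors have already been learned (the toy embeddings and family labels below are invented for illustration; cosine similarity stands in for whatever distance the authors used):

```python
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def knn_predict(query, train, k=3):
    """Predict a family label by majority vote over the k most similar vectors."""
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy embeddings: two families clustered in different directions.
train = [([1.0, 0.1], "kinase"), ([0.9, 0.2], "kinase"), ([0.1, 1.0], "globin")]
print(knn_predict([0.95, 0.15], train, k=3))  # "kinase" wins the vote 2-1
```

The quality of the vote depends entirely on how well the embedding places same-family sequences near each other.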