The purpose of this repository is to explore text classification methods in NLP with deep learning.
It contains a range of baseline models for text classification.
It also supports multi-label classification, where multiple labels are associated with a sentence or document.
Many of these models are simple and may not get you to the top of a task's leaderboard, but some of them are classics, so they serve well as baseline models.
Each model has a test function under its model class.
We also explore two seq2seq models (seq2seq with attention, and the Transformer from "Attention Is All You Need") for text classification. These two models can also be used for sequence generation and other tasks. If your task is multi-label classification, you can cast the problem as sequence generation.
We implement one memory network, the Recurrent Entity Network, which tracks the state of the world. It uses blocks of key-value pairs as memory that run in parallel, and it achieved a new state of the art. It can be used to model question answering with context (or history). For example, you can let the model read some sentences (as context), ask a question (as query), and then ask the model to predict an answer. If you feed in the same text as both story and query, it can do a classification task.
If you want to know more about a text classification dataset, or about a task these models can be used for, one option is: https://biendata.com/competition/zhihu/
- fastText
- TextCNN
- TextRNN
- RCNN
- Hierarchical Attention Network
- seq2seq with attention
- Transformer ("Attention Is All You Need")
- EntityNetwork: tracking state of the world
And other models: BiLstmTextRelation; twoCNNTextRelation; BiLstmTextRelationTwoRNN
(Multi-label prediction task: predict the top 5 labels; 3 million training examples; full mark: 0.5)
Model | fastText | TextCNN | TextRNN | RCNN | HierAtteNetwork | Seq2seqWithAttention | EntityNetwork |
---|---|---|---|---|---|---|---|
Score | 0.362 | 0.405 | 0.358 | 0.395 | 0.398 | 0.322 | 0.400 |
Training | 10 minutes | 2 hours | 10 hours | 2 hours | 2 hours | 3 hours | 3 hours |
Note: 'HierAtteNetwork' means Hierarchical Attention Network
- the model is in xxx_model.py
- run python xxx_train.py to train the model
- run python xxx_predict.py to do inference (test)
Each model has a test method under the model class. You can run the test method first to check whether the model works properly.
python 2.7+ tensorflow 1.1
(Most of the models should also work with other TensorFlow versions, since we use very few features bound to a particular version.)
Some utility functions are in data_util.py. A typical input line looks like "x1 x2 x3 x4 x5 label 323434", where 'x1', 'x2' are words and '323434' is the label. It also has a function to load pretrained word embeddings and assign them to the model, where the embeddings are pretrained with word2vec or fastText.
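As a rough illustration of that input format, here is a minimal sketch of splitting one line into words and labels. It is not the code in data_util.py: the function name `parse_line` and the `marker` parameter are hypothetical, and it assumes the literal token shown in the example line separates the words from the label IDs.

```python
def parse_line(line, marker="label"):
    """Split one training line into (words, labels).

    Assumption: everything before the LAST occurrence of `marker` is a word,
    and everything after it is a label ID. Raises ValueError if the marker
    is missing.
    """
    tokens = line.split()
    # Search from the right so a word equal to the marker inside the
    # sentence does not confuse the split.
    idx = len(tokens) - 1 - tokens[::-1].index(marker)
    return tokens[:idx], tokens[idx + 1:]


words, labels = parse_line("x1 x2 x3 x4 x5 label 323434")
```

Returning the labels as a list keeps the same function usable for multi-label lines, if several label IDs follow the marker.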
Implementation of Bag of Tricks for Efficient Text Classification
- use bi-grams and/or tri-grams
- use NCE loss to speed up the softmax computation (instead of the hierarchical softmax used in the original paper). Result: performance is as good as in the paper, and it is also very fast.
check: p5_fastTextB_model.py
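The n-gram trick can be sketched as follows. This is an illustrative helper, not the repository's code: the name `add_ngrams` and the choice of joining tokens with an underscore are assumptions.

```python
def add_ngrams(tokens, max_n=2):
    """Return the unigrams followed by all n-grams for n = 2..max_n.

    Each n-gram is joined with '_' so it can be treated as one extra
    vocabulary entry by a bag-of-words model.
    """
    features = list(tokens)
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append("_".join(tokens[start:start + n]))
    return features


bigram_features = add_ngrams(["a", "b", "c"], max_n=2)
```

Adding n-grams gives an otherwise order-free model some local word-order information, at the cost of a larger vocabulary.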
Implementation of Convolutional Neural Networks for Sentence Classification
Structure: embedding ---> conv ---> max pooling ---> fully connected layer ---> softmax
check: p7_TextCNN_model.py
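To make the conv and max pooling steps concrete, here is a minimal pure-Python sketch (the real model uses TensorFlow ops). The function names and the ReLU non-linearity are assumptions made for illustration.

```python
def conv_max_pool(embedded, filt, bias=0.0):
    """Slide one filter over the sentence and keep the strongest response.

    embedded: list of word vectors (one per token).
    filt: list of k weight vectors, one per position in the window.
    Returns a single feature (max over time of ReLU(conv)).
    """
    k = len(filt)
    responses = []
    for start in range(len(embedded) - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * x for w, x in zip(filt[offset], embedded[start + offset]))
        responses.append(max(0.0, total))  # ReLU (assumed non-linearity)
    # A sentence shorter than the filter produces no window at all.
    return max(responses) if responses else 0.0


def text_cnn_features(embedded, filters):
    """One pooled feature per filter; these feed the fully connected layer."""
    return [conv_max_pool(embedded, f) for f in filters]


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]],  # window size 2
           [[1.0, 1.0]]]              # window size 1
features = text_cnn_features(sentence, filters)
```

Because each filter is reduced to one number by max pooling, the feature vector has a fixed length regardless of sentence length.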
To get very good results with TextCNN, you should also read this paper carefully: A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification. It gives insight into the things that can affect performance, although you will need to change some settings according to your specific task.
Structure: embedding ---> bi-directional LSTM ---> concat output ---> average ---> softmax
check: p8_TextRNN_model.py
The structure is the same as TextRNN, but the input is specially designed, e.g. input: "how much is the computer? EOS price of laptop", where 'EOS' is a special token that separates question1 and question2.
check: p9_BiLstmTextRelation_model.py
Structure: first use two different convolutional layers to extract features from the two sentences, then concatenate the two feature vectors. A linear transform layer projects the result to the target labels, followed by softmax.
check: p9_twoCNNTextRelation_model.py
Structure: one bi-directional LSTM for one sentence (giving output1), and another bi-directional LSTM for the other sentence (giving output2). Then: softmax(output1 * M * output2)
check: p9_BiLstmTextRelationTwoRNN_model.py
For more detail, see: Deep Learning for Chatbots, Part 2 – Implementing a Retrieval-Based Model in Tensorflow
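One way to read the output1, M, output2 product is as a bilinear form, where M is a learned matrix that scores how well the two sentence vectors match. The sketch below assumes that reading; the function name `bilinear_score` is hypothetical and the final softmax step is left out.

```python
def bilinear_score(output1, M, output2):
    """Compute output1^T * M * output2 for two sentence vectors.

    M is a (len(output1) x len(output2)) matrix given as a list of rows.
    """
    m_times_out2 = [sum(m * v for m, v in zip(row, output2)) for row in M]
    return sum(a * b for a, b in zip(output1, m_times_out2))


identity = [[1.0, 0.0], [0.0, 1.0]]
score = bilinear_score([1.0, 2.0], identity, [3.0, 4.0])
```

With M set to the identity matrix this reduces to a plain dot product; a learned M lets the model compare the two sentences in a transformed space.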
Recurrent convolutional neural network for text classification
Implementation of Recurrent Convolutional Neural Network for Text Classification
Structure: 1) recurrent structure (convolutional layer) 2) max pooling 3) fully connected layer + softmax
It learns a representation of each word in the sentence or document together with its left-side and right-side context:
representation of current word = [left_side_context_vector, current_word_embedding, right_side_context_vector].
For the left-side context, it uses a recurrent structure: a non-linear transform of the previous word and the previous left-side context. The right-side context is computed similarly.
check: p71_TextRCNN_model.py
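The word representation described above can be sketched in pure Python as follows. This is not the code in the model file: the helper names, the tanh non-linearity, and the zero vectors used as the context of the first and last word are assumptions.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def context_step(w_ctx, w_emb, prev_ctx, prev_emb):
    """Non-linear transform of the neighbouring context and neighbouring word."""
    return [math.tanh(a + b)
            for a, b in zip(matvec(w_ctx, prev_ctx), matvec(w_emb, prev_emb))]


def rcnn_representations(embeddings, w_l, w_sl, w_r, w_sr, ctx_dim):
    """Return [left_context, embedding, right_context] for every word."""
    n = len(embeddings)
    left = [[0.0] * ctx_dim for _ in range(n)]   # assumed zero start context
    right = [[0.0] * ctx_dim for _ in range(n)]  # assumed zero end context
    for i in range(1, n):                        # scan left to right
        left[i] = context_step(w_l, w_sl, left[i - 1], embeddings[i - 1])
    for i in range(n - 2, -1, -1):               # scan right to left
        right[i] = context_step(w_r, w_sr, right[i + 1], embeddings[i + 1])
    return [left[i] + embeddings[i] + right[i] for i in range(n)]


eye = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reps = rcnn_representations(words, eye, eye, eye, eye, ctx_dim=2)
```

Each representation has length ctx_dim + embedding_dim + ctx_dim; max pooling over the words then gives a fixed-size vector for the classifier.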
Implementation of Hierarchical Attention Networks for Document Classification
Structure:
1) embedding
2) Word Encoder: word-level bi-directional GRU to get a rich representation of words
3) Word Attention: word-level attention to pick out the important information in a sentence
4) Sentence Encoder: sentence-level bi-directional GRU to get a rich representation of sentences
5) Sentence Attention: sentence-level attention to pick out the important sentences
6) FC + Softmax
Input of data:
Generally speaking, the input of this model should be several sentences rather than a single sentence. The shape is [None, sentence_length], where None stands for the batch size.
In my training data, each example has four parts of equal length. I concatenate the four parts to form one single sentence. The model splits that sentence back into four parts to form a tensor of shape [None, num_sentence, sentence_length], where num_sentence is the number of sentences (4 in my setting).
check:p1_HierarchicalAttention_model.py
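The splitting step for a single example can be sketched like this (the model does the equivalent on a whole batch tensor). The function name is hypothetical, and it assumes the total length is an exact multiple of the number of sentences.

```python
def split_into_sentences(token_ids, num_sentences=4):
    """Split one concatenated example into equal-length sentences."""
    if len(token_ids) % num_sentences != 0:
        raise ValueError("length must be a multiple of num_sentences")
    per_sentence = len(token_ids) // num_sentences
    return [token_ids[i * per_sentence:(i + 1) * per_sentence]
            for i in range(num_sentences)]


parts = split_into_sentences(list(range(8)), num_sentences=4)
```

Applied to every example in a batch, this turns a [None, total_length] input into [None, num_sentence, sentence_length].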
Implementation of seq2seq with attention, derived from Neural Machine Translation by Jointly Learning to Align and Translate
I. Structure:
1) embedding 2) bi-GRU to get a rich representation of the source sentences (forward & backward) 3) decoder with attention.
II. Input of data:
There are three kinds of inputs: 1) encoder inputs, which is a sentence; 2) decoder inputs, which is a list of labels with fixed length; 3) target labels, which is also a list of labels.
For example, if the labels are "L1 L2 L3 L4", then the decoder inputs will be [_GO,L1,L2,L3,L4,_PAD] and the target labels will be [L1,L2,L3,L4,_END,_PAD]. The length is fixed to 6: any excess labels are truncated, and padding is added if there are not enough labels to fill the list.
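A minimal sketch of building the two decoder-side lists from a label list. The function name is hypothetical, and truncating to length - 1 labels (so that _GO and _END always fit) is an assumption about how excess labels are handled.

```python
GO, END, PAD = "_GO", "_END", "_PAD"


def build_decoder_io(labels, length=6):
    """Return (decoder_input, target), both padded/truncated to `length`."""
    kept = labels[:length - 1]        # leave one slot for _GO / _END
    decoder_input = [GO] + kept       # decoder sees the previous label
    target = kept + [END]             # and must predict the next one
    decoder_input += [PAD] * (length - len(decoder_input))
    target += [PAD] * (length - len(target))
    return decoder_input, target


decoder_input, target = build_decoder_io(["L1", "L2", "L3", "L4"])
```

The target is the decoder input shifted left by one position, which is what lets the decoder learn to predict the next label at each step.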
III. Attention Mechanism:
- take the list of encoder inputs and the hidden state of the decoder
- calculate the similarity of the hidden state with each encoder input, to get a probability distribution over the encoder inputs
- compute the weighted sum of the encoder inputs based on that probability distribution
- run the RNN cell on this weighted sum together with the decoder input to get the new hidden state
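The similarity, distribution, and weighted-sum steps can be sketched as below. Dot-product similarity is used here only as a simple stand-in (the paper this model derives from uses a learned additive score), and the function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden, encoder_states):
    """Return (weights, context) for one decoder step."""
    # 1) similarity of the decoder hidden state with each encoder state
    scores = [sum(h * e for h, e in zip(hidden, state)) for state in encoder_states]
    # 2) probability distribution over encoder positions
    weights = softmax(scores)
    # 3) weighted sum of encoder states
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


weights, context = attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
```

The returned context vector is what gets combined with the decoder input before the RNN cell computes the next hidden state.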
IV. How Vanilla Encoder Decoder Works:
The source sentence is encoded by an RNN into a fixed-size vector (the "thought vector"). Then, in the decoder:
- During training, another RNN tries to produce a word using this "thought vector" as its initial state, taking input from the decoder input at each timestep. The decoder starts from the special token "_GO". After each step we get a new hidden state, and together with the next input we continue this process until we reach the special token "_END". The loss is the cross-entropy between the logits and the target labels. The logits come from a projection layer applied to the hidden state (that is, to the output of the decoder step; in a GRU we can use the decoder hidden state directly as the output).
- During testing, there are no labels, so we feed in the output from the previous timestep and continue the process until we reach the "_END" token.
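The test-time loop can be sketched as a greedy decoder. Here `step_fn` is a hypothetical stand-in for one decoder step (RNN cell plus projection plus argmax), and `toy_step` is a fake step used only to make the example runnable.

```python
def greedy_decode(step_fn, init_state, max_len=6, go="_GO", end="_END"):
    """Feed each predicted token back in until _END or max_len is reached."""
    state, token, output = init_state, go, []
    for _ in range(max_len):
        state, token = step_fn(state, token)  # previous output is next input
        if token == end:
            break
        output.append(token)
    return output


def toy_step(state, token):
    """Fake decoder step: the 'state' is just the list of labels still to emit."""
    if not state:
        return state, "_END"
    return state[1:], state[0]


predicted = greedy_decode(toy_step, ["L1", "L2"])
```

The max_len cap matters in practice, because an untrained decoder may never emit the end token.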
V. Notices:
- Here I use two vocabularies: one for words, used by the encoder; the other for labels, used by the decoder.
- For the label vocabulary, I insert three special tokens: "_GO", "_END", "_PAD". "_UNK" is not used, since all labels are pre-defined.
Status:
The main part is just finished: it is able to output the reverse order of its sequences, and does so in parallel. Layer normalization, residual connections, and masks are also used in the model.
For every building block, we include a test function in each file, and we have tested each small piece successfully.
Sequence to sequence with attention is a typical model for sequence generation problems such as translation and dialogue systems. Most of the time it uses RNNs as the building block for these tasks. More recently, people have also applied convolutional neural networks to sequence-to-sequence problems. The Transformer, however, performs these tasks solely with attention mechanisms. It is fast and achieves new state-of-the-art results.
It also has two main parts: encoder and decoder. Below is the description from the paper:
Encoder:
6 layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism; the second is a position-wise fully connected feed-forward network. Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)). All dimensions = 512.
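The LayerNorm(x + Sublayer(x)) wrapper can be sketched for a single position vector as follows. This simplified version omits the learned gain and bias of layer normalization, and the function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """LayerNorm(x + Sublayer(x)) for any sublayer that preserves the size."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


# A do-nothing sublayer, just to show the wrapper in isolation.
normalized = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
```

The residual connection requires the sub-layer output to have the same size as its input, which is why all dimensions are kept equal.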
Decoder:
- The decoder is composed of a stack of N= 6 identical layers.
- In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
- Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
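The masking of subsequent positions can be sketched as below: scores for disallowed positions are set to negative infinity before the softmax, so they receive zero weight. The function names are hypothetical, and this is a pure-Python illustration rather than the model's tensor code.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked positions."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    peak = max(masked)                       # finite: position i always sees itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
row_weights = masked_softmax([1.0, 1.0, 1.0], mask[1])
```

Because the decoder inputs are shifted right by one position, letting position i attend to inputs 0..i still means it only sees outputs before position i.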
Main takeaways from this model:
- multi-head self-attention: use self-attention, apply linear transforms multiple times to get projections of the keys and values, then do ordinary attention
- some tricks to improve performance (residual connections, position encoding, position-wise feed-forward, label smoothing, masks to ignore the things we want to ignore)
For details of the model, please check: a2_transformer.py
- Recurrent Entity Network
Input: 1. story: multiple sentences, as context. 2. query: a sentence, which is a question. 3. answer: a single label.
Model Structure:
- Input encoding: use bag of words to encode the story (context) and the query (question); take position into account by using a position mask.
  By using a bi-directional RNN to encode the story and query instead, performance improves from 0.392 to 0.398, an increase of 1.5%.
- Dynamic memory:
  a. compute the gate by using the 'similarity' of the keys and values with the input of the story.
  b. get the candidate hidden state by transforming each key, value and the input.
  c. combine the gate and the candidate hidden state to update the current hidden state.
- Output module (uses an attention mechanism):
  a. get a probability distribution by computing the 'similarity' of the query and the hidden states.
  b. get the weighted sum of the hidden states using the probability distribution.
  c. apply a non-linear transform of the query and hidden state to get the predicted label.
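The dynamic memory step (gate, candidate, update) can be sketched for one story input as follows. This is an illustrative pure-Python version, not the code in the model file: the names U, V, W for the transform matrices, the tanh non-linearity, and the final length normalization are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def update_memory(keys, hiddens, s, U, V, W):
    """Update every memory block with one encoded story sentence `s`."""
    new_hiddens = []
    for key, hidden in zip(keys, hiddens):       # blocks are independent
        # a. gate from the similarity of the input with value and key
        gate = 1.0 / (1.0 + math.exp(-(dot(s, hidden) + dot(s, key))))
        # b. candidate state from value, key and input
        candidate = [math.tanh(a + b + c) for a, b, c in
                     zip(matvec(U, hidden), matvec(V, key), matvec(W, s))]
        # c. gated update, then normalize to unit length (assumed step)
        updated = [h + gate * c for h, c in zip(hidden, candidate)]
        norm = math.sqrt(dot(updated, updated)) or 1.0
        new_hiddens.append([u / norm for u in updated])
    return new_hiddens


eye = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
memory = update_memory(keys, [[0.0, 0.0], [0.0, 0.0]], [1.0, 0.0], eye, eye, eye)
```

Since no block reads another block's state inside the loop, the per-block updates can be computed in parallel.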
Main takeaways from this model:
- It uses blocks of keys and values that are independent of each other, so they can run in parallel.
- It models context and question together: memory tracks the state of the world, and a non-linear transform of the hidden state and the question (query) makes the prediction.
- A simple model can also achieve very good performance, even with an encoding as simple as bag of words.
For details of the model, please check: a3_entity_network.py
Under this model there is a test function, which asks the model to count numbers in both the story (context) and the query (question), but the weight of the story is smaller than that of the query.
1. Character-level Convolutional Networks for Text Classification
2. Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level
3. Very Deep Convolutional Networks for Text Classification
4. Adversarial Training Methods for Semi-Supervised Text Classification
1. Bag of Tricks for Efficient Text Classification
2. Convolutional Neural Networks for Sentence Classification
3. Deep Learning for Chatbots, Part 2 – Implementing a Retrieval-Based Model in Tensorflow, from www.wildml.com
4. Recurrent Convolutional Neural Network for Text Classification
5. Hierarchical Attention Networks for Document Classification
6. Neural Machine Translation by Jointly Learning to Align and Translate
7. Attention Is All You Need
8. Tracking the World State with Recurrent Entity Networks
To be continued. For any problems, contact [email protected]