## Implementing four neural networks for multi-class text classification problems

In this post, I will show four different neural networks that can be applied to multi-class text classification problems.

- CNN, Convolutional Neural Networks
- LSTM, Recurrent Neural Networks / Long Short Term Memory
- BLSTM, Bidirectional LSTM
- CLSTM, Convolutional LSTM

I will implement them in TensorFlow.

**I. Implement a CNN for text classification**

The CNN model looks like this:

As far as I'm concerned, we can view a CNN as a super n-gram model. For example, if we slide a window of length 2 over the embedded word vectors, we visit every 2-gram in the document. Unlike a traditional n-gram model, a CNN represents these n-grams in a low-dimensional space and is able to extract high-level features from them. For more details, check this article.
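To make the n-gram analogy concrete, here is a minimal sketch in plain Python (no TensorFlow) of what a window of length 2 sees as it slides over a tokenized document. The function name `sliding_windows` and the sample sentence are illustrative assumptions, not part of the model code.

```python
def sliding_windows(tokens, window):
    """Return every contiguous run of `window` tokens (the n-grams)."""
    return [tuple(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]


tokens = ["the", "movie", "was", "great"]  # hypothetical example document
bigrams = sliding_windows(tokens, 2)
print(bigrams)  # [('the', 'movie'), ('movie', 'was'), ('was', 'great')]
```

A convolution filter of height 2 visits the same positions, but it operates on the embedded vectors of those tokens rather than on the raw tokens.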

Let's implement this model in TensorFlow. Before constructing the network, we need to define the hyperparameters:

    import tensorflow as tf

    class cnn_clf(object):
        def __init__(self, config):
            self.max_length = config.max_length          # Max document length
            self.num_classes = config.num_classes        # Number of classes
            self.vocab_size = config.vocab_size          # Vocabulary size
            self.embedding_size = config.embedding_size  # Embedding size
            self.filter_sizes = config.filter_sizes      # Lengths of different windows
            self.num_filters = config.num_filters        # Number of filters per window size
            self.l2_reg_lambda = config.l2_reg_lambda    # L2 regularization lambda

Ok, now we are going to define placeholders for network inputs:

    self.input_x = tf.placeholder(dtype=tf.int32, shape=[None, self.max_length])
    self.input_y = tf.placeholder(dtype=tf.int64, shape=[None])
    self.keep_prob = tf.placeholder(dtype=tf.float32)

For the first two tensors input_x and input_y, None means the length of the first dimension could be any number. In this case, the first dimension will be batch size. And the second dimension of input_x means that all our input vectors should have the same length. We can use zero padding to achieve this.
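As a rough illustration of the zero padding mentioned above, the following sketch truncates or right-pads each document to a fixed length. It assumes that word id 0 is reserved for padding; the helper name `pad_to_length` and the sample ids are made up for this example.

```python
def pad_to_length(ids, max_length, pad_id=0):
    """Truncate or right-pad a list of word ids to exactly max_length."""
    return ids[:max_length] + [pad_id] * max(0, max_length - len(ids))


docs = [[4, 9, 2], [7, 1, 5, 3, 8, 6], [3]]  # hypothetical word-id sequences
batch = [pad_to_length(d, max_length=5) for d in docs]
print(batch)  # [[4, 9, 2, 0, 0], [7, 1, 5, 3, 8], [3, 0, 0, 0, 0]]
```

Every row now has the same length, so the batch fits a placeholder of shape [None, max_length].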

We also need to keep track of the L2 loss. L2 regularization penalizes large weights, which keeps our model from becoming too complicated:

    self.l2_loss = tf.constant(0.0)

Let's define our first layer, the embedding layer:

    with tf.device('/cpu:0'), tf.name_scope('embedding'):
        embedding = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0), name="embedding")
        embed = tf.nn.embedding_lookup(embedding, self.input_x)
        inputs = tf.expand_dims(embed, -1)

This layer embeds words into low-dimensional vectors. Note that we expanded the embedded vectors by one dimension. This is because TensorFlow's conv2d expects a 4-dimensional tensor, but the shape of the embedded vectors is [batch_size, max_length, embedding_size]. We need to expand it to [batch_size, max_length, embedding_size, channels]. Since we only have one channel, the final shape is [batch_size, max_length, embedding_size, 1].

Convolution and max-pooling layers:

    pooled_outputs = []
    for i, filter_size in enumerate(self.filter_sizes):
        with tf.variable_scope("conv-maxpool-%s" % filter_size):
            # Convolution
            filter_shape = [filter_size, self.embedding_size, 1, self.num_filters]
            conv_w = tf.get_variable("weights", filter_shape, initializer=tf.truncated_normal_initializer(stddev=0.1))
            conv_b = tf.get_variable("biases", [self.num_filters], initializer=tf.constant_initializer(0.0))
            conv = tf.nn.conv2d(inputs, conv_w, strides=[1, 1, 1, 1], padding='VALID', name='conv')
            # Activation function
            h = tf.nn.relu(tf.nn.bias_add(conv, conv_b), name='relu')
            # Max-pooling
            pooled = tf.nn.max_pool(h, ksize=[1, self.max_length - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name='pool')
            pooled_outputs.append(pooled)

    num_filters_total = self.num_filters * len(self.filter_sizes)
    h_pool = tf.concat(pooled_outputs, 3)
    h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])

The code above defines a convolution layer with filters (windows) of several different lengths in parallel. Max-pooling is then applied to the output of each filter in order to select its strongest feature. Finally, the pooled outputs from all the filters are concatenated.
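The shape bookkeeping in this step is easy to get wrong, so here is a small sketch that only does the arithmetic: a 'VALID' convolution with stride 1 over max_length positions yields max_length - filter_size + 1 positions, and max-pooling over all of them leaves one value per filter. The helper name `cnn_feature_shapes` and the hyperparameter values are assumptions chosen for illustration.

```python
def cnn_feature_shapes(max_length, filter_sizes, num_filters):
    """Per-example tensor shapes after the conv and max-pool steps."""
    shapes = {}
    for filter_size in filter_sizes:
        conv_len = max_length - filter_size + 1  # 'VALID' padding, stride 1
        shapes[filter_size] = {
            "conv": (conv_len, 1, num_filters),   # filter spans the full embedding width
            "pooled": (1, 1, num_filters),        # max over all conv_len positions
        }
    total = num_filters * len(filter_sizes)       # size of the flattened feature vector
    return shapes, total


shapes, num_filters_total = cnn_feature_shapes(
    max_length=50, filter_sizes=[3, 4, 5], num_filters=128)
print(shapes[3]["conv"], num_filters_total)  # (48, 1, 128) 384
```

With these example values, each document ends up as a vector of 384 features regardless of which window lengths produced them.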

Dropout is one of the most popular regularization methods. It can improve our model's performance by forcing neurons to be "independent" of one another. So don't forget to add this layer!

    h_drop = tf.nn.dropout(h_pool_flat, keep_prob=self.keep_prob)
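For intuition, the sketch below mimics in plain Python what tf.nn.dropout does to a feature vector: each element is kept with probability keep_prob and the kept elements are scaled by 1 / keep_prob so that the expected sum stays the same. The helper name `dropout`, the seed, and the sample values are assumptions for illustration.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and rescale the
    survivors by 1 / keep_prob so the expected sum is unchanged."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)             # fixed seed so the sketch is repeatable
features = [1.0] * 10
dropped = dropout(features, keep_prob=0.5, rng=rng)  # entries are 0.0 or 2.0
unchanged = dropout(features, keep_prob=1.0, rng=rng)
print(dropped)
print(unchanged == features)       # True: keep_prob=1.0 disables dropout
```

This is why keep_prob is a placeholder: a value below 1.0 is typically fed during training and 1.0 during validation and testing.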

Next, the softmax output layer:

    with tf.name_scope('softmax'):
        softmax_w = tf.Variable(tf.truncated_normal([num_filters_total, self.num_classes], stddev=0.1), name='softmax_w')
        softmax_b = tf.Variable(tf.constant(0.1, shape=[self.num_classes]), name='softmax_b')

        # Add L2 regularization to output layer
        self.l2_loss += tf.nn.l2_loss(softmax_w)
        self.l2_loss += tf.nn.l2_loss(softmax_b)

        self.logits = tf.matmul(h_drop, softmax_w) + softmax_b
        predictions = tf.nn.softmax(self.logits)
        self.predictions = tf.argmax(predictions, 1)

Finally, it's time to calculate the loss and accuracy.

    with tf.name_scope('loss'):
        losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
        # Add L2 losses
        self.cost = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss

    with tf.name_scope('accuracy'):
        correct_predictions = tf.equal(self.predictions, self.input_y)
        self.correct_num = tf.reduce_sum(tf.cast(correct_predictions, tf.float32))
        self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name='accuracy')
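To show what the loss and accuracy ops compute, here is a plain-Python sketch of sparse softmax cross-entropy (labels given as class indices rather than one-hot vectors) followed by argmax-based accuracy. The logits and labels are made-up values, and the helper names are assumptions for this example.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=lambda i: xs[i])


batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.5, 0.0]]  # hypothetical
batch_labels = [0, 2, 0]

losses = [sparse_softmax_cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
predictions = [argmax(z) for z in batch_logits]
accuracy = sum(p == y for p, y in zip(predictions, batch_labels)) / len(batch_labels)
print(predictions, accuracy)  # [0, 2, 1] and 2/3: the third example is misclassified
```

The full cost in the model adds l2_reg_lambda times the accumulated L2 loss to this mean.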

This is our first model. Since we will use the same optimizer for all four models, we will talk about it later.

Thanks to the powerful TensorFlow, we can view our network structure using TensorBoard:

**II. Implement an RNN/LSTM for text classification**

Unlike a CNN, an RNN differs from ordinary feed-forward networks in that it uses loops to handle its inputs:

The above picture shows the RNN loop unrolled over time. As we can see, at each step the RNN receives two inputs: 1) the output from the previous step, and 2) the current input. In this way, the RNN is able to "remember" previous information and combine it with the current input. This makes RNNs suitable for tasks that involve sequences and lists, and in particular a good choice for language modeling. In practice, however, RNNs run into a problem called "vanishing gradients", which makes it hard for them to learn "long-term dependencies".
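The recurrence described above can be sketched with a scalar toy RNN, where the new state is tanh of a weighted sum of the current input and the previous state. The weights and inputs below are arbitrary assumptions chosen only to show that the state carries information forward.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Unroll the loop over a sequence and collect the state at every step."""
    h = 0.0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Only the first input is non-zero, yet the later states are still positive: the network "remembers" it. The memory also shrinks at every step, which hints at why long-range information is hard for a plain RNN to keep.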

In order to solve this problem, the LSTM was introduced. Briefly, an LSTM uses three gates to control the flow of information: an input gate, a forget gate, and an output gate. An LSTM also has two states: a cell state and a hidden state. First, the forget gate decides what information we're going to throw away from the cell state, based on the hidden state h_{t-1} from the previous step and the current input x_{t}. Then the input gate decides what information we're going to add to the cell state. Finally, the output gate outputs a filtered version of the cell state, which is called the hidden state. This is only a rough explanation of the LSTM. For more information, please make sure to check this article and this article.
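The gate logic can be written out as a scalar toy LSTM step. This is a sketch of the standard LSTM equations under simplifying assumptions (one unit, scalar weights), not the internals of any particular TensorFlow cell; the parameter layout `p` and all numeric values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM. `p` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to admit
    g = math.tanh(pre("candidate"))  # candidate values to add to the cell state
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    c_t = f * c_prev + i * g         # updated cell state
    h_t = o * math.tanh(c_t)         # hidden state: filtered view of the cell state
    return h_t, c_t


params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
params["forget"] = (0.5, 0.5, 1.0)   # positive forget bias favours remembering early on

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated additively (f * c_prev + i * g), information can pass through many steps with little change whenever the forget gate stays close to 1.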

Talk is cheap, let's code!

Likewise, first define the hyperparameters:

    import tensorflow as tf

    class rnn_clf(object):
        def __init__(self, config):
            self.num_classes = config.num_classes      # Number of classes
            self.vocab_size = config.vocab_size        # Vocabulary size
            self.hidden_size = config.hidden_size      # Hidden size
            self.num_layers = config.num_layers        # Number of layers
            self.l2_reg_lambda = config.l2_reg_lambda  # L2 regularization lambda

hidden_size defines the number of units in each LSTM cell. num_layers defines the number of stacked LSTM cells.

Placeholders:

    self.batch_size = tf.placeholder(dtype=tf.int32, shape=[])
    self.input_x = tf.placeholder(dtype=tf.int32, shape=[None, None])
    self.input_y = tf.placeholder(dtype=tf.int64, shape=[None])
    self.keep_prob = tf.placeholder(dtype=tf.float32, shape=[])
    self.sequence_length = tf.placeholder(dtype=tf.int32, shape=[None])

Unlike in the CNN model, the input vectors no longer all need to have the same length. We only need to make sure that the input vectors in the same batch have the same length; the lengths of vectors from different batches can vary. This is why the second dimension of the tensor input_x is set to None.

You may also notice that I used a placeholder to receive batch_size. This is because we may use different batch sizes in the training and validation stages, and a placeholder is a convenient way to pass this parameter to our network. You can also achieve this by using a placeholder to receive the initial cell states. sequence_length holds the real lengths of the sentences in one batch. With this parameter, the LSTM knows where to stop unrolling for each sentence, so that the padded zeros are not fed into the network.
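A batch for this model could be prepared roughly as follows: pad every document to the length of the longest document in that batch and record the real lengths alongside. The helper name `make_batch` and the use of 0 as the padding id are assumptions for this sketch.

```python
def make_batch(docs, pad_id=0):
    """Pad a batch to the length of its longest document and record real lengths."""
    lengths = [len(d) for d in docs]
    width = max(lengths)
    padded = [d + [pad_id] * (width - len(d)) for d in docs]
    return padded, lengths


input_x, sequence_length = make_batch([[4, 9, 2], [7, 1], [3, 8, 6, 5]])
print(input_x)          # [[4, 9, 2, 0], [7, 1, 0, 0], [3, 8, 6, 5]]
print(sequence_length)  # [3, 2, 4]
```

These two values, together with len(input_x) for batch_size, are what would be fed to the corresponding placeholders. A different batch may have a different width, which the [None, None] shape allows.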

Same as before, the L2 loss and the word embedding layer:

    self.l2_loss = tf.constant(0.0)

    with tf.device('/cpu:0'), tf.name_scope('embedding'):
        embedding = tf.get_variable('embedding', shape=[self.vocab_size, self.hidden_size], dtype=tf.float32)
        inputs = tf.nn.embedding_lookup(embedding, self.input_x)

For an LSTM, dropout should be added to the non-recurrent connections in order to reduce over-fitting. This paper discusses the reasons for this dropout strategy. So first, we will add dropout to the inputs:

    self.inputs = tf.nn.dropout(inputs, keep_prob=self.keep_prob)

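Define the LSTM cell and add dropout to its outputs. The cell is built inside a small helper function so that every layer gets its own cell object:

    def lstm_cell():
        cell = tf.contrib.rnn.LSTMCell(self.hidden_size,
                                       forget_bias=1.0,
                                       state_is_tuple=True,
                                       reuse=tf.get_variable_scope().reuse)
        # Add dropout to cell outputs
        return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob)

Stack the LSTM cells. Note that [cell] * self.num_layers would place the same cell object in every layer, so we create a fresh cell for each layer instead:

    cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(self.num_layers)],
                                       state_is_tuple=True)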

Define the initial cell states:

    self._initial_state = cell.zero_state(self.batch_size, dtype=tf.float32)

Unroll the network using dynamic_rnn:

    with tf.variable_scope('LSTM'):
        _, state = tf.nn.dynamic_rnn(cell,
                                     inputs=self.inputs,
                                     initial_state=self._initial_state,
                                     sequence_length=self.sequence_length)

    self.final_state = state

tf.nn.dynamic_rnn returns a pair (outputs, state). Because the cells are stacked, state holds one (c, h) tuple per layer, in which c is the final cell state and h is the final hidden state of that layer. We will use the final hidden state of the top layer, which is also its last output, for our classification task. Compared with tf.nn.static_rnn, dynamic_rnn builds the recurrence as a loop instead of unrolling it for a fixed number of steps, which makes graph construction faster and lets different batches have different lengths.
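The effect of sequence_length on the final state can be illustrated with a scalar toy RNN in plain Python: once a sequence's real length is reached, the state is simply carried through the padded steps and the outputs there are zero. This is a sketch of that behaviour under simplified assumptions (one unit, arbitrary weights), not TensorFlow's implementation.

```python
import math


def run_with_length(xs, length, w_x=0.5, w_h=0.8):
    """Scalar tanh RNN that stops updating its state after `length` steps."""
    h = 0.0
    outputs = []
    for t, x_t in enumerate(xs):
        if t < length:
            h = math.tanh(w_x * x_t + w_h * h)
            outputs.append(h)
        else:
            outputs.append(0.0)  # padded step: emit zero, carry the state through
    return outputs, h


padded = [1.0, 2.0, 0.0, 0.0]  # two real steps followed by two padded steps
outputs, final_state = run_with_length(padded, length=2)
print(outputs, final_state)
```

The final state equals the output at the last real step (index 1 here), not the output at the last padded position. Running the same sequence with length=4 would give a different final state, because the padding would be treated as real input.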

Softmax output layer:

    with tf.name_scope('softmax'):
        softmax_w = tf.get_variable('softmax_w', shape=[self.hidden_size, self.num_classes], dtype=tf.float32)
        softmax_b = tf.get_variable('softmax_b', shape=[self.num_classes], dtype=tf.float32)

        # L2 regularization for output layer
        self.l2_loss += tf.nn.l2_loss(softmax_w)
        self.l2_loss += tf.nn.l2_loss(softmax_b)

        self.logits = tf.matmul(self.final_state[self.num_layers - 1].h, softmax_w) + softmax_b
        predictions = tf.nn.softmax(self.logits)
        self.predictions = tf.argmax(predictions, 1)

I used self.final_state[self.num_layers - 1].h to retrieve the final hidden state of the last (top) LSTM layer.

Loss:

    with tf.name_scope('loss'):
        tvars = tf.trainable_variables()

        # L2 regularization for LSTM weights
        for tv in tvars:
            if 'kernel' in tv.name:
                self.l2_loss += tf.nn.l2_loss(tv)

        losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
        self.cost = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss

I added L2 regularization to LSTM weights.
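As a small worked example of this penalty, the sketch below selects variables by name and sums half of their squared values, which is the quantity tf.nn.l2_loss computes for a tensor. The variable names, weight values, data loss, and lambda are all hypothetical and exist only to make the arithmetic visible.

```python
def l2_loss(weights):
    """Half the sum of squares, the quantity tf.nn.l2_loss computes."""
    return sum(w * w for w in weights) / 2.0


# Hypothetical variable names and flattened values, for illustration only.
variables = {
    "LSTM/rnn/multi_rnn_cell/cell_0/lstm_cell/kernel:0": [0.5, -1.0, 2.0],
    "LSTM/rnn/multi_rnn_cell/cell_0/lstm_cell/bias:0": [3.0, 3.0],
    "softmax_w:0": [1.0, 1.0],
}

# Only variables whose name contains 'kernel' are penalized in this loop.
penalty = sum(l2_loss(v) for name, v in variables.items() if "kernel" in name)

l2_reg_lambda = 0.001  # assumed value
data_loss = 0.7        # assumed mean cross-entropy
cost = data_loss + l2_reg_lambda * penalty
print(penalty, cost)   # penalty is 2.625; cost is about 0.702625
```

The name filter skips the LSTM biases. In the model, the softmax weights and biases are penalized separately in the output layer.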

Accuracy:

    with tf.name_scope('accuracy'):
        correct_predictions = tf.equal(self.predictions, self.input_y)
        self.correct_num = tf.reduce_sum(tf.cast(correct_predictions, tf.float32))
        self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name='accuracy')

The network structure:

**III. Implement a BLSTM for text classification**

A standard RNN is suitable for processing sequences because it can take past information into account, but it cannot consider future information. As a variant of the RNN, the bidirectional RNN (BRNN) can reach future information from the current state by adding a backward layer. The structure of the bidirectional RNN looks like this:

The two hidden layers run in opposite directions and are not connected to each other, but their outputs are connected to the same output layer. A bidirectional RNN is very useful when we need to consider context on both sides of a position, so BRNNs have been widely applied to tasks such as sequence labeling.
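The idea of two independent passes whose outputs are combined per position can be sketched with two scalar toy RNNs, one reading the sequence left to right and one right to left. The weights, the inputs, and the choice to pair the two outputs as a tuple are assumptions for illustration.

```python
import math


def run_direction(xs, w_x, w_h):
    """Scalar tanh RNN over xs; returns the state at every step."""
    h, states = 0.0, []
    for x_t in xs:
        h = math.tanh(w_x * x_t + w_h * h)
        states.append(h)
    return states


def bidirectional(xs):
    """Pair the forward and backward states at each position."""
    forward = run_direction(xs, w_x=0.5, w_h=0.8)
    # Run over the reversed sequence, then flip back to the original time order.
    backward = run_direction(xs[::-1], w_x=0.3, w_h=0.6)[::-1]
    return list(zip(forward, backward))


outputs = bidirectional([1.0, 0.0, 0.0, 2.0])
for t, (f, b) in enumerate(outputs):
    print(t, round(f, 4), round(b, 4))
```

At position 1 the input is zero, yet the backward component is positive because it has already seen the non-zero input at position 3. That is the "future information" a one-directional RNN cannot provide.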

References:

1. Understanding Convolutional Neural Networks for NLP.

2. Understanding LSTM networks.

3. Recurrent Neural Network Regularization.
