Implementing four neural networks for multi-class text classification problems

In this post, I will show four different neural networks which could be applied to multi-class text classification problems. 

  • CNN, Convolutional Neural Networks
  • LSTM, Recurrent Neural Networks / Long Short Term Memory
  • BLSTM, Bidirectional LSTM
  • CLSTM, Convolutional LSTM

I will implement all of them in TensorFlow.

I. Implement a CNN for text classification

The CNN model looks like this:


In my view, a CNN can be seen as a super n-gram model. For example, if we slide a window of length 2 over the embedded word vectors, we obtain all the 2-grams in the document. Unlike a normal n-gram model, a CNN works in a low-dimensional space and is able to extract high-level features. For more details, see reference 1, Understanding Convolutional Neural Networks for NLP.
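
To make the n-gram view concrete, here is a minimal pure-Python sketch of the sliding window. It works on raw tokens rather than embedded vectors, and the helper name sliding_windows and the sample sentence are assumptions made for illustration only:

```python
def sliding_windows(tokens, window_size):
    """Return every contiguous run of `window_size` tokens, in order."""
    last_start = len(tokens) - window_size
    return [tuple(tokens[i:i + window_size]) for i in range(last_start + 1)]


tokens = ["the", "movie", "was", "great"]
bigrams = sliding_windows(tokens, 2)
print(bigrams)
# [('the', 'movie'), ('movie', 'was'), ('was', 'great')]
```

A convolution filter of length 2 visits exactly these positions, but it scores each window with learned weights instead of counting it.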

Let's implement this model in TensorFlow. Before constructing the network, we need to define the hyperparameters:

class cnn_clf(object):
    def __init__(self, config):
        self.max_length = config.max_length  # Max document length
        self.num_classes = config.num_classes  # Number of classes
        self.vocab_size = config.vocab_size  # Vocabulary size
        self.embedding_size = config.embedding_size  # Embedding size
        self.filter_sizes = config.filter_sizes  # Lengths of different windows
        self.num_filters = config.num_filters  # Number of filters per window size
        self.l2_reg_lambda = config.l2_reg_lambda  # L2 regularization lambda

Ok, now we are going to define placeholders for network inputs:

self.input_x = tf.placeholder(dtype=tf.int32, shape=[None, self.max_length])
self.input_y = tf.placeholder(dtype=tf.int64, shape=[None])
self.keep_prob = tf.placeholder(dtype=tf.float32)

For the first two tensors input_x and input_y, None means the length of the first dimension could be any number. In this case, the first dimension will be batch size. And the second dimension of input_x means that all our input vectors should have the same length. We can use zero padding to achieve this. 
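
As a rough sketch of the zero padding step, the helper below brings every document to the same length before it is fed to input_x. The name pad_documents, the use of id 0 as the padding symbol, and the truncation of over-long documents are assumptions for illustration, not part of the model code:

```python
def pad_documents(docs, max_length, pad_id=0):
    """Truncate or right-pad each list of word ids to exactly max_length."""
    padded = []
    for doc in docs:
        clipped = doc[:max_length]
        padded.append(clipped + [pad_id] * (max_length - len(clipped)))
    return padded


docs = [[4, 9, 2], [7], [3, 1, 8, 5, 6]]
batch = pad_documents(docs, max_length=4)
print(batch)
# [[4, 9, 2, 0], [7, 0, 0, 0], [3, 1, 8, 5]]
```

With this convention, id 0 should be reserved for padding when the vocabulary is built.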

We also need to keep track of the L2 loss. By applying L2 regularization, we can prevent our model from being too complicated:

self.l2_loss = tf.constant(0.0)

Let's define our first layer - embedding layer:

with tf.device('/cpu:0'), tf.name_scope('embedding'):
    embedding = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0), name="embedding")
    embed = tf.nn.embedding_lookup(embedding, self.input_x)
    inputs = tf.expand_dims(embed, -1)

This layer embeds words into low-dimensional vectors. Note that we expanded the embedded vectors by one dimension. This is because TensorFlow's conv2d expects a 4-dimensional tensor, but the shape of the embedded vectors is [batch_size, max_length, embedding_size]. We need to expand it to [batch_size, max_length, embedding_size, channels]. Since we only have one channel, the final shape is [batch_size, max_length, embedding_size, 1].

Convolution and max-pooling layers:

pooled_outputs = []
for i, filter_size in enumerate(self.filter_sizes):
    with tf.variable_scope("conv-maxpool-%s" % filter_size):
        # Convolution
        filter_shape = [filter_size, self.embedding_size, 1, self.num_filters]
        conv_w = tf.get_variable("weights", filter_shape, initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv_b = tf.get_variable("biases", [self.num_filters], initializer=tf.constant_initializer(0.0))

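        # 'VALID' padding: the filter only visits complete windows
        conv = tf.nn.conv2d(inputs,
                            conv_w,
                            strides=[1, 1, 1, 1],
                            padding='VALID')

        # Activation function
        h = tf.nn.relu(tf.nn.bias_add(conv, conv_b), name='relu')

        # Max-pooling over all window positions
        pooled = tf.nn.max_pool(h,
                                ksize=[1, self.max_length - filter_size + 1, 1, 1],
                                strides=[1, 1, 1, 1],
                                padding='VALID')
        pooled_outputs.append(pooled)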

num_filters_total = self.num_filters * len(self.filter_sizes)
h_pool = tf.concat(pooled_outputs, 3)
h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])

From the code above, we defined a convolution layer with different lengths of filters (windows) in parallel. We then applied max-pooling to each filter in order to select features. Finally, we concatenated the pooled outputs from different filters.
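
To make the shapes concrete, here is a minimal pure-Python sketch of one filter sliding over one document, followed by max-pooling over all positions. It is not the TensorFlow code above: the function name conv_relu_maxpool, the toy embeddings, and the filter weights are assumptions for illustration, and it uses a single filter per window size:

```python
def conv_relu_maxpool(embedded, filt, bias=0.0):
    """Slide one filter over the document, apply ReLU, and keep the max."""
    filter_size = len(filt)
    positions = len(embedded) - filter_size + 1  # only complete windows
    feature_map = []
    for start in range(positions):
        total = bias
        for offset in range(filter_size):
            word_vec = embedded[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        feature_map.append(max(0.0, total))  # ReLU
    return feature_map, max(feature_map)


# Toy document: max_length = 4 words, embedding_size = 2.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One hypothetical filter for each window size (2 and 3).
filters = {
    2: [[1.0, 0.0], [0.0, 1.0]],
    3: [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
}

map_lengths = []
features = []
for filter_size, filt in sorted(filters.items()):
    feature_map, pooled = conv_relu_maxpool(embedded, filt)
    map_lengths.append(len(feature_map))
    features.append(pooled)

print(map_lengths, features)
```

Each feature map has max_length - filter_size + 1 entries, which is why the pooling window in the model uses that size, and the concatenated vector ends up with num_filters * len(filter_sizes) values per document.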

Dropout is one of the most popular regularization methods. It can improve generalization by randomly disabling neurons during training, which forces them to be "independent" instead of relying on one another. So don't forget to add this layer!

h_drop = tf.nn.dropout(h_pool_flat, keep_prob=self.keep_prob)
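
As an illustration of what this layer does, the pure-Python sketch below keeps each activation with probability keep_prob and scales the survivors by 1 / keep_prob, which is the behaviour tf.nn.dropout documents. The function name and the sample activations are assumptions for illustration:

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and rescale the survivors."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, keep_prob=0.5, rng=rng)
unchanged = dropout(activations, keep_prob=1.0, rng=rng)
print(dropped, unchanged)
```

With keep_prob set to 1.0 nothing is dropped, so one way to use the keep_prob placeholder is to feed a value below 1.0 during training and 1.0 during validation.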

Next, the softmax output layer:

with tf.name_scope('softmax'):
    softmax_w = tf.Variable(tf.truncated_normal([num_filters_total, self.num_classes], stddev=0.1), name='softmax_w')
    softmax_b = tf.Variable(tf.constant(0.1, shape=[self.num_classes]), name='softmax_b')

    # Add L2 regularization to output layer
    self.l2_loss += tf.nn.l2_loss(softmax_w)
    self.l2_loss += tf.nn.l2_loss(softmax_b)

    self.logits = tf.matmul(h_drop, softmax_w) + softmax_b
    predictions = tf.nn.softmax(self.logits)
    self.predictions = tf.argmax(predictions, 1)

Finally, it's time to calculate the loss and accuracy. 

with tf.name_scope('loss'):
    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
    # Add L2 losses
    self.cost = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss
with tf.name_scope('accuracy'):
    correct_predictions = tf.equal(self.predictions, self.input_y)
    self.correct_num = tf.reduce_sum(tf.cast(correct_predictions, tf.float32))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name='accuracy')
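
The output layer, loss, and accuracy can be sketched in plain Python as follows. This is only an illustration of the arithmetic, with made-up logits and labels; the helper names softmax and argmax are assumptions and do not call TensorFlow:

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest entry in the row."""
    return max(range(len(row)), key=lambda i: row[i])


batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 1]  # integer class ids, like input_y

probs = [softmax(row) for row in batch_logits]
predictions = [argmax(row) for row in probs]
# Sparse cross entropy: negative log-probability of the true class.
losses = [-math.log(p[y]) for p, y in zip(probs, labels)]
mean_loss = sum(losses) / len(losses)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, mean_loss, accuracy)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly would give the same predictions.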

This is our first model. Since we will use the same optimizer for all four models, we will talk about it later.

Thanks to the powerful TensorFlow, we can view our network structure using TensorBoard:

II. Implement an RNN/LSTM for text classification

Unlike a CNN, an RNN differs from normal feed-forward neural networks in that it uses loops to carry information from one input to the next:


The above picture shows the unrolled steps of the RNN loop. As we can see, at each step the RNN receives two types of inputs: 1) the output of the previous step, and 2) the current input. In this way, the RNN is able to "remember" previous information and combine it with the current input. This makes RNNs suitable for tasks that involve sequences and lists. In other words, an RNN is a good choice for language modeling. In practice, however, RNNs run into a problem called the "vanishing gradient", which makes it hard to learn "long-term dependencies".

To solve this problem, the LSTM was introduced. Briefly, an LSTM uses three gates to control the flow of information: the input gate, the forget gate, and the output gate. An LSTM also has two states: the cell state and the hidden state. First, the forget gate decides what information we are going to throw away from the cell state, based on the hidden state h_{t-1} from the previous step and the current input x_t. Then the input gate decides what information we are going to add to the cell state. Finally, the output gate outputs a filtered version of the cell state, which is called the hidden state. This is only a rough explanation of the LSTM. For more information, please make sure to check this article and this article.
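
The gate logic described above can be sketched as a single LSTM step with scalar input and state. This follows the commonly used LSTM formulation, which also computes a tanh "candidate" value alongside the three gates; the function names and all weights below are hypothetical and chosen only for illustration:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps a gate name to (w_x, w_h, b)."""
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("forget"))       # what to drop from the cell state
    i_t = sigmoid(pre_activation("input"))        # how much new information to admit
    g_t = math.tanh(pre_activation("candidate"))  # candidate values to add
    o_t = sigmoid(pre_activation("output"))       # how much of the cell state to expose

    c_t = f_t * c_prev + i_t * g_t  # new cell state
    h_t = o_t * math.tanh(c_t)      # new hidden state
    return h_t, c_t


# Hypothetical weights, for illustration only.
params = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.5, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

A real LSTM cell does the same thing with weight matrices and vectors of length hidden_size instead of scalars.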

Talk is cheap, let's code!

Likewise, first define the hyperparameters:

class rnn_clf(object):
    def __init__(self, config):
        self.num_classes = config.num_classes  # Number of classes
        self.vocab_size = config.vocab_size  # Vocabulary size
        self.hidden_size = config.hidden_size  # Hidden size
        self.num_layers = config.num_layers  # Number of layers
        self.l2_reg_lambda = config.l2_reg_lambda  # L2 regularization lambda

hidden_size defines the number of units in each LSTM cell. num_layers defines the number of stacked LSTM layers.

Next, define the placeholders for the network inputs:

self.batch_size = tf.placeholder(dtype=tf.int32, shape=[])
self.input_x = tf.placeholder(dtype=tf.int32, shape=[None, None])
self.input_y = tf.placeholder(dtype=tf.int64, shape=[None])
self.keep_prob = tf.placeholder(dtype=tf.float32, shape=[])
self.sequence_length = tf.placeholder(dtype=tf.int32, shape=[None])

Unlike in the CNN model, the input vectors no longer all need to have the same length. We only need the input vectors within one batch to share a length; the lengths can vary from batch to batch. As a result, the second dimension of the tensor input_x is set to None.

What's more, you may notice that I used a placeholder to receive batch_size. This is because we may use different batch sizes in the training stage and the validation stage, and a placeholder is a convenient way to pass this parameter to our network. You can also achieve this by using a placeholder to receive the initial cell states. The sequence_length tensor holds the real lengths of the sentences in one batch. By passing it to our network, the LSTM knows where to stop unrolling each sentence, so that the padded zeros are not fed into the network.
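
As a sketch of how such a batch could be prepared, the helper below pads each batch only to its own longest document and records the true lengths. The name make_batch and the use of id 0 for padding are assumptions for illustration:

```python
def make_batch(docs, pad_id=0):
    """Pad a batch to its own longest document and record the true lengths."""
    sequence_length = [len(doc) for doc in docs]
    width = max(sequence_length)
    input_x = [doc + [pad_id] * (width - len(doc)) for doc in docs]
    return input_x, sequence_length


batch_a = make_batch([[5, 2, 9], [7]])
batch_b = make_batch([[3, 1], [4, 4, 8, 6, 2]])
print(batch_a)
# ([[5, 2, 9], [7, 0, 0]], [3, 1])
print(batch_b)
# ([[3, 1, 0, 0, 0], [4, 4, 8, 6, 2]], [2, 5])
```

The two return values would be fed to the input_x and sequence_length placeholders, and the number of documents to batch_size.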

Same as before, the L2 loss and the word embedding layer:

self.l2_loss = tf.constant(0.0)
with tf.device('/cpu:0'), tf.name_scope('embedding'):
    embedding = tf.get_variable('embedding', shape=[self.vocab_size, self.hidden_size], dtype=tf.float32)
    inputs = tf.nn.embedding_lookup(embedding, self.input_x)

For an LSTM, dropout should be added to the non-recurrent connections in order to reduce over-fitting. Reference 3, Recurrent Neural Network Regularization, discusses the reason for this dropout strategy. So first, we will add dropout to the inputs:

self.inputs = tf.nn.dropout(inputs, keep_prob=self.keep_prob)

Define the LSTM cell:

cell = tf.contrib.rnn.LSTMCell(self.hidden_size, forget_bias=1.0, state_is_tuple=True, reuse=tf.get_variable_scope().reuse)

Add dropout to cell outputs:

cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob)

Stack the LSTM cells:

cell = tf.contrib.rnn.MultiRNNCell([cell] * self.num_layers, state_is_tuple=True)

Define the initial cell states:

self._initial_state = cell.zero_state(self.batch_size, dtype=tf.float32)

Unroll the network using dynamic_rnn:

with tf.variable_scope('LSTM'):
    _, state = tf.nn.dynamic_rnn(cell, inputs=self.inputs, initial_state=self._initial_state, sequence_length=self.sequence_length)

self.final_state = state

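tf.nn.dynamic_rnn returns a pair (outputs, state). Because the cell is a MultiRNNCell, state holds one tuple (c, h) per layer, in which c is the final cell state and h is the final hidden state. We will use the final hidden state of the top layer, which is also the last valid output, for our classification task. Compared with tf.nn.static_rnn, dynamic_rnn builds its graph much faster and accepts batches of different lengths, because it loops at run time instead of unrolling a fixed number of steps.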
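
The effect of sequence_length can be illustrated with a toy recurrence in plain Python. As documented for dynamic_rnn, steps beyond a sequence's real length emit zero outputs and the state is copied through unchanged. The recurrence, the function name run_rnn, and the numbers below are assumptions for illustration and are far simpler than an LSTM:

```python
def run_rnn(sequence, length, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t, frozen after `length` steps."""
    state = 0.0
    outputs = []
    for t, x_t in enumerate(sequence):
        if t < length:
            state = decay * state + x_t
            outputs.append(state)
        else:
            outputs.append(0.0)  # steps past the real length emit zeros
    return outputs, state


padded = [2.0, 4.0, 0.0, 0.0]  # two real steps followed by two pads
outputs, final_state = run_rnn(padded, length=2)
_, unmasked_state = run_rnn(padded, length=len(padded))
print(outputs, final_state, unmasked_state)
```

The final state matches the output at the last real step, while the last column of outputs is zero for padded sentences. This is one reason to classify from the final state rather than from the last output column.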

Softmax output layer:

with tf.name_scope('softmax'):
    softmax_w = tf.get_variable('softmax_w', shape=[self.hidden_size, self.num_classes], dtype=tf.float32)
    softmax_b = tf.get_variable('softmax_b', shape=[self.num_classes], dtype=tf.float32)

    # L2 regularization for output layer
    self.l2_loss += tf.nn.l2_loss(softmax_w)
    self.l2_loss += tf.nn.l2_loss(softmax_b)

    self.logits = tf.matmul(self.final_state[self.num_layers - 1].h, softmax_w) + softmax_b

    predictions = tf.nn.softmax(self.logits)
    self.predictions = tf.argmax(predictions, 1)

I used self.final_state[self.num_layers - 1].h to retrieve the final hidden state of the last LSTM cell. 


with tf.name_scope('loss'):
    tvars = tf.trainable_variables()

    # L2 regularization for LSTM weights
    for tv in tvars:
        if 'kernel' in tv.name:
            self.l2_loss += tf.nn.l2_loss(tv)

    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y,
                                                            logits=self.logits)
    self.cost = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss

I added L2 regularization to LSTM weights.
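
For reference, tf.nn.l2_loss computes half the sum of squares of a tensor. The arithmetic behind the cost can be sketched in plain Python as below; the function names, the per-example losses, and the flattened weights are made up for illustration:

```python
def l2_loss(values):
    """Half the sum of squares, matching the definition of tf.nn.l2_loss."""
    return sum(v * v for v in values) / 2.0


def total_cost(example_losses, weight_groups, l2_reg_lambda):
    """Mean data loss plus the scaled L2 penalty over every weight group."""
    data_loss = sum(example_losses) / len(example_losses)
    penalty = sum(l2_loss(group) for group in weight_groups)
    return data_loss + l2_reg_lambda * penalty


example_losses = [0.2, 0.6]
weight_groups = [[1.0, -2.0], [3.0]]  # hypothetical flattened weights
cost = total_cost(example_losses, weight_groups, l2_reg_lambda=0.1)
print(cost)
```

A larger l2_reg_lambda pushes the weights toward zero more strongly, at the price of fitting the training data less closely.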


with tf.name_scope('accuracy'):
    correct_predictions = tf.equal(self.predictions, self.input_y)
    self.correct_num = tf.reduce_sum(tf.cast(correct_predictions, tf.float32))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name='accuracy')

The network structure:

III. Implement a BLSTM for text classification

A normal RNN is suitable for processing sequences because it takes past information into account, but it cannot consider future information. As a variant of the RNN, the bidirectional RNN can reach future information from the current state by adding a backward layer. The structure of the bidirectional RNN looks like this:

The two hidden layers with opposite directions will not connect to each other, but their states' outputs will be connected to the same output. Bidirectional RNN is very useful when we need to consider context information, so BRNN has been widely applied to tasks such as sequence labeling. 
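
The idea can be sketched with the same kind of toy recurrence run in both directions. This is only an illustration in plain Python; the function names and the decay-based recurrence are assumptions, not the LSTM cells a real BLSTM would use:

```python
def recurrence(sequence, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t; returns the state at every step."""
    state, states = 0.0, []
    for x_t in sequence:
        state = decay * state + x_t
        states.append(state)
    return states


def bidirectional(sequence):
    """Pair each step's forward (past) state with its backward (future) state."""
    forward = recurrence(sequence)
    backward = recurrence(sequence[::-1])[::-1]  # run right-to-left, then realign
    return list(zip(forward, backward))


states = bidirectional([1.0, 2.0, 4.0])
print(states)
```

The two passes never exchange information; they are only combined at the output, so each step ends up with a summary of everything before it and everything after it.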


1. Understanding Convolutional Neural Networks for NLP.

2. Understanding LSTM networks.

3. Recurrent Neural Network Regularization.
