In this post, I will show four different neural networks that can be applied to multiclass text classification problems:

CNN, Convolutional Neural Networks

LSTM, Recurrent Neural Networks / Long Short Term Memory

BLSTM, Bidirectional LSTM

CLSTM, Convolutional LSTM
And I will implement them in Tensorflow. (GitHub repo)
I. Implement a CNN for text classification
The CNN model looks like this:
In my view, we can think of a CNN as a super n-grams model. For example, if we slide a window of length 2 over the embedded word vectors, we obtain all the 2-grams in the document. Unlike ordinary n-grams, a CNN builds the model in a low-dimensional space and is able to extract high-level features. For more details, check this article.
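To make the n-gram analogy concrete, here is a small sketch in plain Python (no Tensorflow). The helper name ngrams and the toy sentence are my own illustration, not part of the model code:

```python
def ngrams(tokens, n):
    """Return every window of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "movie", "was", "great"]
bigrams = ngrams(tokens, 2)
print(bigrams)

# A window of width n over a document of length L has L - n + 1 positions,
# which is also the number of outputs of a 'VALID' convolution of that width.
print(len(bigrams) == len(tokens) - 2 + 1)
```

The convolution filter visits exactly these windows, but instead of counting them it computes a learned weighted sum over the word vectors inside each one.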
Let's implement this model in Tensorflow. Before constructing the network, we need to define the hyperparameters:
import tensorflow as tf  # TensorFlow 1.x API


class cnn_clf(object):
    def __init__(self, config):
        self.max_length = config.max_length          # Max document length
        self.num_classes = config.num_classes        # Number of classes
        self.vocab_size = config.vocab_size          # Vocabulary size
        self.embedding_size = config.embedding_size  # Embedding size
        self.filter_sizes = config.filter_sizes      # Lengths of different windows
        self.num_filters = config.num_filters        # Number of filters per window size
        self.l2_reg_lambda = config.l2_reg_lambda    # L2 regularization lambda
Ok, now we are going to define placeholders for network inputs:
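        # Word ids, zero-padded to max_length: [batch_size, max_length]
        self.input_x = tf.placeholder(dtype=tf.int32, shape=[None, self.max_length])
        # Integer class labels: [batch_size]
        self.input_y = tf.placeholder(dtype=tf.int64, shape=[None])
        # Dropout keep probability
        self.keep_prob = tf.placeholder(dtype=tf.float32)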
For the first two tensors input_x and input_y, None means the length of the first dimension could be any number. In this case, the first dimension will be batch size. And the second dimension of input_x means that all our input vectors should have the same length. We can use zero padding to achieve this.
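As a rough sketch of the zero padding step, the helper below pads or truncates every document to max_length. The function name pad_documents, the choice of id 0 for padding, and the decision to truncate long documents are my assumptions for illustration:

```python
def pad_documents(docs, max_length, pad_id=0):
    """Truncate or right-pad each list of word ids to exactly max_length."""
    padded = []
    for doc in docs:
        clipped = doc[:max_length]
        padded.append(clipped + [pad_id] * (max_length - len(clipped)))
    return padded


docs = [[4, 9, 2], [7], [3, 1, 8, 5, 6]]
batch = pad_documents(docs, max_length=4)
for row in batch:
    print(row)
```

Every row now has the same length, so the batch can be fed to input_x.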
We also need to keep track of the L2 loss. L2 regularization penalizes large weights, which helps prevent our model from becoming too complicated:
        self.l2_loss = tf.constant(0.0)
Let's define our first layer, the embedding layer:
        with tf.device('/cpu:0'), tf.name_scope('embedding'):
            # Initialize the embeddings uniformly in [-1, 1)
            embedding = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0), name="embedding")
            embed = tf.nn.embedding_lookup(embedding, self.input_x)
            # Add a trailing channel dimension for conv2d
            inputs = tf.expand_dims(embed, -1)
This layer embeds words into low-dimensional vectors. Note that we expanded the embedded vectors by one dimension. This is because Tensorflow's conv2d expects a 4-dimensional tensor, but the shape of the embedded vectors is [batch_size, max_length, embedding_size]. We need to expand it to [batch_size, max_length, embedding_size, channels]. Since we only have one channel, the final shape is [batch_size, max_length, embedding_size, 1].
Convolution and max-pooling layers:
        pooled_outputs = []
        for i, filter_size in enumerate(self.filter_sizes):
            with tf.variable_scope("conv-maxpool-%s" % filter_size):
                # Convolution
                filter_shape = [filter_size, self.embedding_size, 1, self.num_filters]
                conv_w = tf.get_variable("weights", filter_shape, initializer=tf.truncated_normal_initializer(stddev=0.1))
                conv_b = tf.get_variable("biases", [self.num_filters], initializer=tf.constant_initializer(0.0))
                conv = tf.nn.conv2d(inputs,
                                    conv_w,
                                    strides=[1, 1, 1, 1],
                                    padding='VALID',
                                    name='conv')
                # Activation function
                h = tf.nn.relu(tf.nn.bias_add(conv, conv_b), name='relu')
                # Max-pooling over all window positions
                pooled = tf.nn.max_pool(h,
                                        ksize=[1, self.max_length - filter_size + 1, 1, 1],
                                        strides=[1, 1, 1, 1],
                                        padding='VALID',
                                        name='pool')
                pooled_outputs.append(pooled)
        num_filters_total = self.num_filters * len(self.filter_sizes)
        h_pool = tf.concat(pooled_outputs, 3)
        h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])
In the code above, we defined a convolution layer with filters (windows) of different lengths in parallel. We then applied max-pooling to each filter in order to select features. Finally, we concatenated the pooled outputs from the different filters.
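To see what one filter does to one document, here is a plain-Python sketch of the convolution, ReLU, and max-pooling steps. The toy embeddings and filter weights are made up for illustration, and I use a single filter per window size, whereas the real model uses num_filters of each:

```python
def conv_relu_maxpool(embedded, filt, bias=0.0):
    """Slide one filter over the document, apply ReLU, then keep the max."""
    filter_size = len(filt)
    positions = len(embedded) - filter_size + 1    # 'VALID' padding
    feature_map = []
    for start in range(positions):
        total = bias
        for offset in range(filter_size):
            word_vec = embedded[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        feature_map.append(max(0.0, total))        # ReLU
    return feature_map, max(feature_map)           # max-pooling over time


# Toy document: max_length = 4 words, embedding_size = 2.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical filters, keyed by window size.
filters = {2: [[1.0, 0.0], [0.0, 1.0]],
           3: [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]}

pooled = []
for size, filt in sorted(filters.items()):
    feature_map, best = conv_relu_maxpool(embedded, filt)
    print(size, feature_map, best)
    pooled.append(best)
print(pooled)  # concatenated features, one value per filter
```

Each filter produces max_length - filter_size + 1 activations, and max-pooling reduces them to a single number, which is why the pooling window in the model has exactly that height.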
Dropout is one of the most popular regularization methods. It can improve our model's performance by randomly disabling neurons during training, which forces them to be "independent" of one another. So don't forget to add this layer!
        h_drop = tf.nn.dropout(h_pool_flat, keep_prob=self.keep_prob)
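Here is a minimal sketch of what dropout does to a feature vector, written in plain Python. Like tf.nn.dropout, it scales the surviving values by 1 / keep_prob; the function name and the toy values are my own:

```python
import random


def dropout(values, keep_prob, rng):
    """Inverted dropout: zero each value with probability 1 - keep_prob and
    scale the survivors by 1 / keep_prob so the expected sum is unchanged."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
features = [1.0, 2.0, 3.0, 4.0]
print(dropout(features, keep_prob=0.5, rng=rng))  # training: some units zeroed
print(dropout(features, keep_prob=1.0, rng=rng))  # keep_prob of 1.0: unchanged
```

This is why keep_prob is a placeholder: we feed a value below 1.0 while training and 1.0 when evaluating.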
Next, the softmax output layer:
        with tf.name_scope('softmax'):
            softmax_w = tf.Variable(tf.truncated_normal([num_filters_total, self.num_classes], stddev=0.1), name='softmax_w')
            softmax_b = tf.Variable(tf.constant(0.1, shape=[self.num_classes]), name='softmax_b')
            # Add L2 regularization to output layer
            self.l2_loss += tf.nn.l2_loss(softmax_w)
            self.l2_loss += tf.nn.l2_loss(softmax_b)
            self.logits = tf.matmul(h_drop, softmax_w) + softmax_b
            predictions = tf.nn.softmax(self.logits)
            self.predictions = tf.argmax(predictions, 1)
Finally, it's time to calculate the loss and accuracy.
        with tf.name_scope('loss'):
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
            # Add L2 losses
            self.cost = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss
        with tf.name_scope('accuracy'):
            correct_predictions = tf.equal(self.predictions, self.input_y)
            self.correct_num = tf.reduce_sum(tf.cast(correct_predictions, tf.float32))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name='accuracy')
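The cost and accuracy computed above can be checked by hand on a tiny batch. The sketch below uses plain Python, and the logits, labels, weights, and lambda value are invented for illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax for one example."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, label):
    """Negative log-probability of the true class, given its integer id."""
    return -math.log(softmax(logits)[label])


def l2_loss(weights):
    """Same convention as tf.nn.l2_loss: sum of squares divided by 2."""
    return sum(w * w for w in weights) / 2.0


batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
l2_reg_lambda = 0.01               # illustrative value
output_weights = [0.3, -0.2, 0.1]  # stand-in for softmax_w and softmax_b

losses = [sparse_cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
cost = sum(losses) / len(losses) + l2_reg_lambda * l2_loss(output_weights)
predictions = [max(range(len(z)), key=z.__getitem__) for z in batch_logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(cost, predictions, accuracy)
```

Note that the labels are integer class ids rather than one-hot vectors, which is what the "sparse" variant of the cross-entropy function expects.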
That completes our first model. Since all four models use the same optimizer, we will talk about it later.
Thanks to Tensorflow, we can view our network structure using TensorBoard:
II. Implement an RNN/LSTM for text classification
Unlike a CNN, an RNN differs from ordinary feed-forward networks in that it uses loops to handle its inputs:
The picture above shows the RNN loop unrolled over time. At each step, the RNN receives two inputs: 1) the output from the previous step, and 2) the current input. This lets the RNN "remember" previous information and combine it with the current input, which makes it well suited to sequences and lists. In other words, an RNN is a good choice for language modeling. In practice, however, a plain RNN suffers from the "vanishing gradient" problem, which makes it hard to learn long-term dependencies.
LSTM was introduced to solve this problem. Briefly, an LSTM uses three gates to control the flow of information: an input gate, a forget gate, and an output gate. It also has two states: a cell state and a hidden state. First, the forget gate decides what information we are going to throw away from the cell state, based on the hidden state h_{t-1} from the previous step and the current input x_{t}. Then the input gate decides what information we are going to add to the cell state. Finally, the output gate outputs a filtered version of the cell state, which is called the hidden state. This is only a rough explanation of the LSTM. For more information, please make sure to check this article and this article.
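The gate description above can be written out as a single LSTM step. This is a simplified sketch with scalar input and scalar states so the arithmetic stays readable; the weights are hypothetical, and the candidate value (usually computed with tanh) is the "new information" the input gate admits:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar input, hidden state, and cell state."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x_t + w_h * h_prev + b)

    f_t = gate("forget", sigmoid)       # how much of c_prev to keep
    i_t = gate("input", sigmoid)        # how much new information to admit
    g_t = gate("candidate", math.tanh)  # the new information itself
    o_t = gate("output", sigmoid)       # how much of the cell to expose

    c_t = f_t * c_prev + i_t * g_t      # updated cell state
    h_t = o_t * math.tanh(c_t)          # hidden state = filtered cell state
    return h_t, c_t


# Hypothetical weights (w_x, w_h, bias) for each gate.
params = {"forget": (0.5, 0.1, 1.0), "input": (0.6, -0.2, 0.0),
          "candidate": (0.9, 0.3, 0.0), "output": (0.4, 0.2, 0.0)}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

In a real LSTM the scalars become vectors and the weights become matrices, but the order of operations is the same.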
Talk is cheap, let's code!
Likewise, first define the hyperparameters:
class rnn_clf(object):
    def __init__(self, config):
        self.num_classes = config.num_classes      # Number of classes
        self.vocab_size = config.vocab_size        # Vocabulary size
        self.hidden_size = config.hidden_size      # Hidden size
        self.num_layers = config.num_layers        # Number of layers
        self.l2_reg_lambda = config.l2_reg_lambda  # L2 regularization lambda
hidden_size defines the number of units in each LSTM cell. num_layers defines the number of LSTM cells stacked on top of each other.
Placeholders:
        self.batch_size = tf.placeholder(dtype=tf.int32, shape=[])
        self.input_x = tf.placeholder(dtype=tf.int32, shape=[None, None])
        self.input_y = tf.placeholder(dtype=tf.int64, shape=[None])
        self.keep_prob = tf.placeholder(dtype=tf.float32, shape=[])
        self.sequence_length = tf.placeholder(dtype=tf.int32, shape=[None])
Unlike the CNN model, this time the input vectors don't all have to have the same length. We just need to make sure that the input vectors in the same batch have the same length; the lengths of vectors from different batches can vary. As a result, the second dimension of the tensor input_x is set to None.
What's more, you may notice that I used a placeholder to receive batch_size. This is because we may use different batch sizes in the training stage and the validation stage, and it is more convenient to pass this parameter to our network through a placeholder. You can also achieve this by using a placeholder to receive the initial cell states. sequence_length represents the real lengths of the sentences in one batch. By passing this parameter to our network, the LSTM will know where to stop the unrolling steps, so that the padded zeros will not be fed into the network.
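Here is a rough sketch of how such batches could be prepared: each batch is padded only to the length of its own longest document, and the real lengths are recorded alongside it. The helper name make_batches and the use of id 0 for padding are my assumptions:

```python
def make_batches(docs, batch_size, pad_id=0):
    """Yield (input_x, sequence_length) pairs, padding each batch only to the
    length of its own longest document."""
    for start in range(0, len(docs), batch_size):
        chunk = docs[start:start + batch_size]
        lengths = [len(doc) for doc in chunk]
        width = max(lengths)
        input_x = [doc + [pad_id] * (width - len(doc)) for doc in chunk]
        yield input_x, lengths


docs = [[5, 2], [8, 1, 4, 7], [3], [6, 9, 2]]
for input_x, sequence_length in make_batches(docs, batch_size=2):
    # len(input_x) is the value that would be fed to the batch_size placeholder.
    print(len(input_x), input_x, sequence_length)
```

The first batch here is 4 steps wide and the second only 3, which is exactly the flexibility that the None in the second dimension of input_x allows.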
Same as before, the L2 loss and the word embedding layer:
        self.l2_loss = tf.constant(0.0)
        with tf.device('/cpu:0'), tf.name_scope('embedding'):
            embedding = tf.get_variable('embedding', shape=[self.vocab_size, self.hidden_size], dtype=tf.float32)
            inputs = tf.nn.embedding_lookup(embedding, self.input_x)
For LSTM, dropout should be added to non-recurrent connections in order to reduce overfitting. This paper discusses the reason for this dropout strategy. So first, we will add dropout to the inputs:
        self.inputs = tf.nn.dropout(inputs, keep_prob=self.keep_prob)
Define the LSTM cell:
        cell = tf.contrib.rnn.LSTMCell(self.hidden_size, forget_bias=1.0, state_is_tuple=True, reuse=tf.get_variable_scope().reuse)
Add dropout to cell outputs:
        cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob)
Stack the LSTM cells:
        cell = tf.contrib.rnn.MultiRNNCell([cell] * self.num_layers, state_is_tuple=True)
Define the initial cell states:
        self._initial_state = cell.zero_state(self.batch_size, dtype=tf.float32)
Unroll the network using dynamic_rnn:
        with tf.variable_scope('LSTM'):
            _, state = tf.nn.dynamic_rnn(cell, inputs=self.inputs, initial_state=self._initial_state, sequence_length=self.sequence_length)
            self.final_state = state
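tf.nn.dynamic_rnn returns a pair (outputs, state). Because we stacked the cells with MultiRNNCell, state holds one tuple (c, h) per layer, where c is the final cell state and h is the final hidden state. We will use the final hidden state of the last layer, which is also the last valid output of each sequence, for our classification task. Compared with tf.nn.static_rnn, dynamic_rnn unrolls the network at run time instead of building one copy of the cell per time step, so the graph is built faster and batches of different lengths are supported.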
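To illustrate why the final state is safe to use with padded batches, here is a plain-Python sketch of the behaviour dynamic_rnn has when it is given sequence_length: the state stops updating once a sequence ends, and later outputs are zero. The toy recurrence is a hypothetical stand-in for an LSTM cell:

```python
def run_rnn(batch, sequence_length, step, initial_state=0.0):
    """Unroll `step` over each padded sequence, freezing the state once the
    true length is reached. Returns per-step outputs and the final states."""
    all_outputs, final_states = [], []
    for sequence, length in zip(batch, sequence_length):
        state, outputs = initial_state, []
        for t, x_t in enumerate(sequence):
            if t < length:
                state = step(x_t, state)
                outputs.append(state)
            else:
                outputs.append(0.0)  # outputs past the end are zeroed
        all_outputs.append(outputs)
        final_states.append(state)
    return all_outputs, final_states


# Hypothetical toy recurrence standing in for an LSTM cell.
step = lambda x_t, state: 0.5 * state + x_t

batch = [[1.0, 2.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
lengths = [2, 4]
outputs, finals = run_rnn(batch, sequence_length=lengths, step=step)
print(outputs)
print(finals)
```

For the first sequence the final state equals the output at step 2, not the zeroed output at step 4, so taking outputs[:, -1] would be wrong for padded sequences while the final state is not.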
Softmax output layer:
        with tf.name_scope('softmax'):
            softmax_w = tf.get_variable('softmax_w', shape=[self.hidden_size, self.num_classes], dtype=tf.float32)
            softmax_b = tf.get_variable('softmax_b', shape=[self.num_classes], dtype=tf.float32)
            # L2 regularization for output layer
            self.l2_loss += tf.nn.l2_loss(softmax_w)
            self.l2_loss += tf.nn.l2_loss(softmax_b)
            self.logits = tf.matmul(self.final_state[self.num_layers - 1].h, softmax_w) + softmax_b
            predictions = tf.nn.softmax(self.logits)
            self.predictions = tf.argmax(predictions, 1)
I used self.final_state[self.num_layers - 1].h to retrieve the final hidden state of the last LSTM cell.
Loss:
        with tf.name_scope('loss'):
            tvars = tf.trainable_variables()
            # L2 regularization for LSTM weights
            for tv in tvars:
                if 'kernel' in tv.name:
                    self.l2_loss += tf.nn.l2_loss(tv)
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y,
                                                                    logits=self.logits)
            self.cost = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss
I added L2 regularization to the LSTM weights. The 'kernel' name filter selects the LSTM weight matrices and skips their biases.
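The name-based filter can be illustrated without Tensorflow. In the sketch below, the variable names and values are hypothetical; the real names depend on the Tensorflow version and on the variable scopes in use:

```python
def l2_loss(values):
    """Same convention as tf.nn.l2_loss: sum of squares divided by 2."""
    return sum(v * v for v in values) / 2.0


# Hypothetical variable names and values, for illustration only.
variables = {
    "embedding:0": [0.5, -0.5],
    "LSTM/multi_rnn_cell/cell_0/lstm_cell/kernel:0": [1.0, 2.0],
    "LSTM/multi_rnn_cell/cell_0/lstm_cell/bias:0": [3.0],
    "softmax_w:0": [2.0],
}

regularized = sorted(name for name in variables if "kernel" in name)
penalty = sum(l2_loss(variables[name]) for name in regularized)
print(regularized, penalty)
```

Only the variable whose name contains 'kernel' contributes to the penalty here; the embedding, the bias, and the output weights are left out by this particular loop.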
Accuracy:
        with tf.name_scope('accuracy'):
            correct_predictions = tf.equal(self.predictions, self.input_y)
            self.correct_num = tf.reduce_sum(tf.cast(correct_predictions, tf.float32))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name='accuracy')
The network structure:
III. Implement a BLSTM for text classification
A normal RNN is suitable for processing sequences because it can take past information into account, but it fails to consider future information. As a variant of the RNN, the bidirectional RNN is capable of reaching future information from the current state by adding a backward layer. The structure of the bidirectional RNN looks like this:
The two hidden layers run in opposite directions and are not connected to each other, but their outputs feed into the same output layer. A bidirectional RNN is very useful when we need to consider context on both sides of a position, so BRNNs have been widely applied to tasks such as sequence labeling.
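The idea of two independent passes whose states are combined per position can be sketched in plain Python. The two toy recurrences below are hypothetical stand-ins for the forward and backward LSTM layers:

```python
def run_direction(sequence, step, initial_state=0.0):
    """Return the state after each step of a left-to-right pass."""
    state, states = initial_state, []
    for x_t in sequence:
        state = step(x_t, state)
        states.append(state)
    return states


def bidirectional(sequence, step_fw, step_bw):
    """Pair each position's forward state with its backward state."""
    forward = run_direction(sequence, step_fw)
    # Run the backward layer on the reversed input, then flip its states
    # back so that index t lines up with the forward pass.
    backward = run_direction(sequence[::-1], step_bw)[::-1]
    return [(f, b) for f, b in zip(forward, backward)]


# Hypothetical toy recurrences standing in for the two LSTM layers.
step_fw = lambda x_t, state: state + x_t       # accumulates the past
step_bw = lambda x_t, state: state + 10 * x_t  # accumulates the future

print(bidirectional([1, 2, 3], step_fw, step_bw))
```

At every position the first element of the pair summarizes the inputs up to and including that position, and the second summarizes the inputs from that position to the end, without the two passes ever exchanging information.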
References:
1. Understanding Convolutional Neural Networks for NLP.
2. Understanding LSTM networks.
3. Recurrent Neural Network Regularization.
Implementing four neural networks for multiclass text classification problems
July 13, 2017
Machine Learning, Tensorflow
zachy