This post shows how to build a sentiment classifier for short text strings using deep learning with Keras.
The classifier is trained from scratch using labelled data from Twitter.
The data set can be obtained from:
Inside the file training.1600000.processed.noemoticon.csv, the CSV data looks like the following:
"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"
"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"
"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "
...
The first value is the sentiment label ("0" is negative, "4" is positive).
To start, we need to convert the file, removing characters that are not valid UTF-8:
$ iconv -c training.1600000.processed.noemoticon.csv > training-data.csv
The -c option tells iconv to silently discard characters that cannot be converted, rather than aborting on them.
The file training-data.csv is now the input for the next step.
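If iconv is not available, a similar cleanup can be sketched in Python by decoding with errors="ignore", which drops byte sequences that are not valid UTF-8. This is a minimal illustration (the byte string below is invented for the example), not part of the original pipeline:

```python
# \xff and \xfe are not valid anywhere in UTF-8, so decoding with
# errors="ignore" silently drops them, like iconv -c does.
raw = b'valid text \xff\xfe more text'
cleaned = raw.decode("utf-8", errors="ignore")
print(cleaned)  # 'valid text  more text'
```

The same idea applies to a whole file: open it in binary mode, decode each chunk with errors="ignore", and write the result back out.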
The following script cleans the data to extract only the parts we need: the strings and the labelled sentiment values.
prepare-training-data.py:
import random

inputFile = open("training-data.csv")
lines = inputFile.readlines()
inputFile.close()
random.shuffle(lines)
outputLines = []
for line in lines:
  parts = line.split(",")
  sentiment = parts[0]
  # The tweet text is the sixth field; rejoin it in case it contains commas.
  text = ",".join(parts[5:])
  outputLine = text.strip() + " , " + sentiment + "\n"
  outputLines.append(outputLine)
outputFile = open("cleaned-sentiment-data.csv", "w")
outputFile.writelines(outputLines)
outputFile.close()
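One subtlety worth noting: the tweet text itself can contain commas, so a naive split(",") produces more than six fields, and taking a single field would truncate the tweet at its first comma. Rejoining everything from the sixth field onward recovers the full text. A quick illustration (the sample line is invented):

```python
# A tweet containing commas splits into more than six fields;
# joining everything from index 5 onward recovers the whole text.
line = '"0","123","Mon Apr 06","NO_QUERY","user","hello, world, again"'
parts = line.split(",")
text = ",".join(parts[5:])
print(text)  # '"hello, world, again"'
```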
Run the script to generate the cleaned file:
$ python prepare-training-data.py
The cleaned training file contains only two fields: the text and the sentiment label.
The file looks like this:
cleaned-sentiment-data.csv:
"@realdollowner Today is a better day.  Overslept , "4"
"@argreen Boo~ , "0"
"Just for the people I don't know  x" , "4"
...
We can now use this file to train our model.
The following script implements the training process.
sentiment-train.py:
import pickle
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer 
from numpy import array
trainFile = open("cleaned-sentiment-data.csv", "r")
allLines = trainFile.readlines()
trainFile.close()
labels = []
wordVectors = []
# Take a subset of the data.
lines = allLines[:600000]
for line in lines:
  # Split on the last comma only: the tweet text may itself contain commas.
  parts = line.rsplit(",", 1)
  string = parts[0].strip()
  sentiment = parts[1].strip()
  if sentiment == "\"4\"": # Positive.
    wordVectors.append(string)
    labels.append(array([1, 0]))
  elif sentiment == "\"0\"": # Negative.
    wordVectors.append(string)
    labels.append(array([0, 1]))
labels = array(labels)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(wordVectors)
# Save tokenizer to file; will be needed for categorization script.
with open("tokenizer.pickle", "wb") as handle:
  pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
sequences = tokenizer.texts_to_sequences(wordVectors)
paddedSequences = pad_sequences(sequences, maxlen=60)
model = Sequential()
# Embedding layer: number of possible words, size of the embedding vectors.
model.add(Embedding(10000, 60))
model.add(LSTM(15, dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam',
  loss='categorical_crossentropy',
  metrics=['accuracy']
)
model.summary()
model.fit(paddedSequences, labels, epochs=5, batch_size=128)
model.save("sentiment-model.h5")
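To make the preprocessing step concrete, here is a simplified pure-Python sketch of what Tokenizer and pad_sequences do: words are ranked by frequency and mapped to integer indices (with 0 reserved for padding), and each sequence is left-padded with zeros to a fixed length. This illustrates the idea only; it is not Keras's exact implementation:

```python
from collections import Counter

def fit_tokenizer(texts, num_words):
    # Rank words by frequency; index 0 is reserved for padding.
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(num_words))}

def texts_to_padded(texts, index, maxlen):
    # Map known words to their indices, then left-pad with zeros
    # (Keras pads on the left by default).
    seqs = [[index[w] for w in t.lower().split() if w in index] for t in texts]
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]

index = fit_tokenizer(["a good day", "a bad day"], num_words=10)
padded = texts_to_padded(["good day"], index, maxlen=5)
print(padded)  # [[0, 0, 0, 3, 2]]
```

The padded integer matrix is what the Embedding layer consumes: each integer selects one of the 10,000 learned 60-dimensional vectors.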
The training output will look something like this:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 60)          600000
_________________________________________________________________
lstm (LSTM)                  (None, 15)                4560
_________________________________________________________________
dense (Dense)                (None, 2)                 32
=================================================================
Total params: 604,592
Trainable params: 604,592
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
4688/4688 [==============================] - 134s 29ms/step - loss: 0.4844 - accuracy: 0.7660
Epoch 2/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4488 - accuracy: 0.7879
Epoch 3/5
4688/4688 [==============================] - 110s 24ms/step - loss: 0.4342 - accuracy: 0.7961
Epoch 4/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4226 - accuracy: 0.8026
Epoch 5/5
4688/4688 [==============================] - 127s 27ms/step - loss: 0.4128 - accuracy: 0.8079
The trained model is saved to the file sentiment-model.h5.
To classify new phrases with the model, use the following script:
sentiment-classify.py:
import pickle
from keras import models
from keras.preprocessing.sequence import pad_sequences
with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
userInput = input("Enter a phrase: ")
inputSequence = tokenizer.texts_to_sequences([userInput])
# Pad to the same length used during training.
paddedSequence = pad_sequences(inputSequence, maxlen=60)
model = models.load_model("sentiment-model.h5")
predictions = model.predict(paddedSequence)
print(predictions[0])
if predictions[0][0] > predictions[0][1]:
  print("Positive")
else:
  print("Negative")
Example usage:
$ python sentiment-classify.py 
Enter a phrase: what a great day!
[0.8984171  0.10158285]
Positive
$ python sentiment-classify.py
Enter a phrase: yesterday was terrible
[0.13580368 0.86419624]
Negative
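The printed vector is the two-element softmax output, which sums to 1: index 0 is the positive score and index 1 the negative score, matching the label order used during training. Picking the larger entry is just an argmax over the two classes; a minimal sketch of that final step:

```python
def to_label(prediction):
    # Index 0 = positive, index 1 = negative (label order from training).
    return "Positive" if prediction[0] > prediction[1] else "Negative"

print(to_label([0.8984171, 0.10158285]))  # Positive
print(to_label([0.13580368, 0.86419624]))  # Negative
```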