This post shows how to build a sentiment classifier for text strings using deep learning, specifically Keras.
The classifier is trained from scratch using labelled data from Twitter.
The data set is the Sentiment140 corpus of 1.6 million labelled tweets, which can be obtained from the Sentiment140 project.
The CSV data looks like the following, inside the file:
training.1600000.processed.noemoticon.csv:
"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!"
"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds"
"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "
...
Each line has six fields: the sentiment, a tweet id, the date, a query flag, the user, and the tweet text. The first value is the sentiment (“0” is negative, “4” is positive).
To start, we need to strip characters from the file that cannot be converted to valid UTF-8:
$ iconv -c training.1600000.processed.noemoticon.csv > training-data.csv
The -c option tells iconv to silently discard characters that cannot be converted, instead of stopping with an error.
The file training-data.csv is now the input for the next step.
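If iconv is not available, a rough Python equivalent is sketched below (an assumption on my part, not part of the original workflow: it simply drops byte sequences that are not valid UTF-8, which is what iconv -c does when converting to UTF-8):

# Read the raw bytes of the original file.
with open("training.1600000.processed.noemoticon.csv", "rb") as f:
    raw = f.read()

# Decode, silently discarding undecodable byte sequences.
text = raw.decode("utf-8", errors="ignore")

with open("training-data.csv", "w", encoding="utf-8") as f:
    f.write(text)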
The following script cleans the data, extracting only the parts we need: the tweet text and the labelled sentiment value.
prepare-training-data.py:
import numpy as np

with open("training-data.csv") as inputFile:
    lines = inputFile.readlines()

# Shuffle so positive and negative examples are interleaved
# (the original file lists all negative examples first).
np.random.shuffle(lines)

outputLines = []
for line in lines:
    parts = line.split(",")
    sentiment = parts[0]
    # The tweet text is the sixth field; rejoin it in case the
    # tweet itself contains commas.
    text = ",".join(parts[5:])
    outputLine = text.strip() + " , " + sentiment + "\n"
    outputLines.append(outputLine)

with open("cleaned-sentiment-data.csv", "w") as outputFile:
    outputFile.writelines(outputLines)
Run the script to generate the cleaned file:
$ python prepare-training-data.py
The cleaned training file contains only two fields per line: the text and the sentiment.
The file looks like this:
cleaned-sentiment-data.csv:
"@realdollowner Today is a better day. Overslept , "4"
"@argreen Boo~ , "0"
"Just for the people I don't know x" , "4"
...
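Before training, it is worth a quick sanity check that both labels survived the cleaning in roughly equal numbers. A minimal standalone sketch against the file generated above:

positive = negative = 0
with open("cleaned-sentiment-data.csv") as f:
    for line in f:
        # The label is everything after the last comma.
        sentiment = line.rsplit(",", 1)[-1].strip()
        if sentiment == "\"4\"":
            positive += 1
        elif sentiment == "\"0\"":
            negative += 1
print("positive:", positive, "negative:", negative)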
We can now use this file to train our model.
The following script performs the training.
sentiment-train.py:
import pickle
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from numpy import array

with open("cleaned-sentiment-data.csv", "r") as trainFile:
    allLines = trainFile.readlines()

# Take a subset of the data.
lines = allLines[:600000]

labels = []
wordVectors = []
for line in lines:
    # The sentiment label is everything after the last comma;
    # the text itself may contain commas.
    text, _, sentiment = line.rpartition(",")
    sentiment = sentiment.strip()
    if sentiment == "\"4\"":    # Positive.
        labels.append(array([1, 0]))
    elif sentiment == "\"0\"":  # Negative.
        labels.append(array([0, 1]))
    else:
        continue  # Skip malformed lines so texts and labels stay aligned.
    wordVectors.append(text.strip())
labels = array(labels)

# Keep only the 10,000 most frequent words.
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(wordVectors)

# Save tokenizer to file; it will be needed by the classification script.
with open("tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Convert each tweet to a sequence of word indices, padded to length 60.
sequences = tokenizer.texts_to_sequences(wordVectors)
paddedSequences = pad_sequences(sequences, maxlen=60)

model = Sequential()
# Embedding layer: number of possible words, size of the embedding vectors.
model.add(Embedding(10000, 60))
model.add(LSTM(15, dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
model.fit(paddedSequences, labels, epochs=5, batch_size=128)
model.save("sentiment-model.h5")
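To make the preprocessing concrete, here is a small standalone sketch of what Tokenizer and pad_sequences produce (the exact indices depend on the fitted vocabulary):

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(["what a great day", "yesterday was terrible"])

# Each word is replaced by its integer index in the vocabulary.
sequences = tokenizer.texts_to_sequences(["what a great day"])
print(sequences)  # e.g. [[1, 2, 3, 4]]

# Sequences are left-padded with zeros to a fixed length.
print(pad_sequences(sequences, maxlen=8))  # e.g. [[0 0 0 0 1 2 3 4]]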
The training process output will look something like this:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 60) 600000
_________________________________________________________________
lstm (LSTM) (None, 15) 4560
_________________________________________________________________
dense (Dense) (None, 2) 32
=================================================================
Total params: 604,592
Trainable params: 604,592
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
4688/4688 [==============================] - 134s 29ms/step - loss: 0.4844 - accuracy: 0.7660
Epoch 2/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4488 - accuracy: 0.7879
Epoch 3/5
4688/4688 [==============================] - 110s 24ms/step - loss: 0.4342 - accuracy: 0.7961
Epoch 4/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4226 - accuracy: 0.8026
Epoch 5/5
4688/4688 [==============================] - 127s 27ms/step - loss: 0.4128 - accuracy: 0.8079
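The parameter counts follow directly from the layer sizes: the embedding holds 10,000 × 60 = 600,000 weights, the LSTM has 4 × ((60 + 15) × 15 + 15) = 4,560 (four gates, each with input, recurrent, and bias weights), and the dense layer has 15 × 2 + 2 = 32.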
The trained model is saved to the file sentiment-model.h5.
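The fit call above uses every example for training. If you want a rough check for overfitting, Keras's fit accepts a validation_split argument that holds out a trailing fraction of the data; a sketch of a variant fit line for sentiment-train.py:

# Hold out the last 10% of the (already shuffled) data for validation.
model.fit(paddedSequences, labels, epochs=5, batch_size=128,
          validation_split=0.1)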
To use the model, run the following script:
sentiment-classify.py:
import pickle
from keras import models
from keras.preprocessing.sequence import pad_sequences

# Load the tokenizer fitted during training.
with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

userInput = input("Enter a phrase: ")

# Convert the phrase to word indices and pad to the training length.
inputSequence = tokenizer.texts_to_sequences([userInput])
paddedSequence = pad_sequences(inputSequence, maxlen=60)

model = models.load_model("sentiment-model.h5")
predictions = model.predict(paddedSequence)
print(predictions[0])

# The first output is the positive score, the second the negative score.
if predictions[0][0] > predictions[0][1]:
    print("Positive")
else:
    print("Negative")
Example usage:
$ python sentiment-classify.py
Enter a phrase: what a great day!
[0.8984171 0.10158285]
Positive
$ python sentiment-classify.py
Enter a phrase: yesterday was terrible
[0.13580368 0.86419624]
Negative
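The same model can also score many phrases in a single predict call, which is much faster than running the script once per phrase. A standalone sketch reusing the tokenizer and model files saved above:

import pickle
from keras import models
from keras.preprocessing.sequence import pad_sequences

with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
model = models.load_model("sentiment-model.h5")

phrases = ["what a great day!", "yesterday was terrible"]
padded = pad_sequences(tokenizer.texts_to_sequences(phrases), maxlen=60)

# predict returns one [positive, negative] row per phrase.
for phrase, scores in zip(phrases, model.predict(padded)):
    print(phrase, "->", "Positive" if scores[0] > scores[1] else "Negative")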