This post shows how to build a sentiment classifier for strings using Deep Learning, specifically Keras.
The classifier is trained from scratch using labelled data from Twitter.
The data set is the Sentiment140 corpus of labelled tweets, which can be obtained from: http://help.sentiment140.com/for-students
The CSV data inside the file looks like this:
training.1600000.processed.noemoticon.csv:
"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D" "0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!" "0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds" "0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire " ...
The first value is the sentiment label (“0” is negative, “4” is positive); the last value is the tweet text.
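Before any cleanup, it can help to confirm the label distribution in the raw file. This is a minimal sketch, not part of the pipeline; it assumes the file is in the working directory and reads it as latin-1, since the raw bytes are not valid UTF-8:

import csv
from collections import Counter

# Count the sentiment labels (first field) in the raw file. The raw
# bytes are not valid UTF-8, so latin-1 is used as a byte-preserving
# fallback encoding.
counts = Counter()
with open("training.1600000.processed.noemoticon.csv",
          encoding="latin-1", newline="") as f:
    for row in csv.reader(f):
        counts[row[0]] += 1
print(counts)  # expected to be roughly balanced, e.g. {'0': 800000, '4': 800000}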
To start, we need to convert the file to remove byte sequences that are not valid UTF-8:
$ iconv -c training.1600000.processed.noemoticon.csv > training-data.csv
The -c option tells iconv to silently discard characters that cannot be converted.
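If iconv is not available, the same cleanup can be approximated in Python. This is a sketch under the assumption that errors="ignore" is an acceptable stand-in for iconv's -c:

# Decode the raw bytes, dropping anything that is not valid UTF-8
# (approximates iconv -c).
with open("training.1600000.processed.noemoticon.csv", "rb") as src:
    data = src.read().decode("utf-8", errors="ignore")
with open("training-data.csv", "w", encoding="utf-8") as dst:
    dst.write(data)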
The file training-data.csv is now the input for the next step.
The following script cleans the data, extracting only the parts we need: the tweet text and the labelled sentiment values. It also shuffles the lines so that, when we later train on a subset of them, both classes are represented.
prepare-training-data.py:
import numpy as np

inputFile = open("training-data.csv")
lines = inputFile.readlines()
# Shuffle the lines so that both sentiment classes are spread
# throughout the output file.
np.random.shuffle(lines)
outputLines = []
for line in lines:
    # Split on at most 5 commas, so commas inside the tweet text
    # (the sixth field) are preserved.
    parts = line.split(",", 5)
    sentiment = parts[0]
    text = parts[5]
    outputLine = text.strip() + " , " + sentiment + "\n"
    outputLines.append(outputLine)
outputFile = open("cleaned-sentiment-data.csv", "w")
outputFile.writelines(outputLines)
Run the script to generate the cleaned file:
$ python prepare-training-data.py
The cleaned training file contains only two fields per line: the tweet text and the sentiment label.
The file looks like this:
cleaned-sentiment-data.csv:
"@realdollowner Today is a better day. Overslept , "4" "@argreen Boo~ , "0" "Just for the people I don't know x" , "4" ...
We can now use this file to train our model.
The following script performs the training.
sentiment-train.py:
import pickle

from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from numpy import array

trainFile = open("cleaned-sentiment-data.csv", "r")
labels = []
wordVectors = []
allLines = trainFile.readlines()
# Take a subset of the data.
lines = allLines[:600000]
for line in lines:
    # Split on the last comma only, since the tweet text itself may
    # contain commas.
    parts = line.rsplit(",", 1)
    string = parts[0].strip()
    sentiment = parts[1].strip()
    # Append the text only when a label is recognised, so that
    # wordVectors and labels stay aligned.
    if sentiment == "\"4\"":  # Positive.
        wordVectors.append(string)
        labels.append(array([1, 0]))
    if sentiment == "\"0\"":  # Negative.
        wordVectors.append(string)
        labels.append(array([0, 1]))
labels = array(labels)

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(wordVectors)
# Save the tokenizer to file; it will be needed by the classification script.
with open("tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

sequences = tokenizer.texts_to_sequences(wordVectors)
paddedSequences = pad_sequences(sequences, maxlen=60)

model = Sequential()
# Embedding layer: number of possible words, size of the embedding vectors.
model.add(Embedding(10000, 60))
model.add(LSTM(15, dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
model.fit(paddedSequences, labels, epochs=5, batch_size=128)
model.save("sentiment-model.h5")
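As an aside, the Tokenizer and pad_sequences steps are easier to understand in isolation. A tiny illustration (the exact indices depend on word frequencies in the fitted texts):

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(["what a great day", "a terrible day"])
# Each word is mapped to an integer index, most frequent words first.
seqs = tokenizer.texts_to_sequences(["a great great day"])
print(seqs)                           # e.g. [[1, 4, 4, 2]]
# pad_sequences left-pads with zeros to a fixed length.
print(pad_sequences(seqs, maxlen=6))  # e.g. [[0 0 1 4 4 2]]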
The training output will be similar to the following:
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 60) 600000 _________________________________________________________________ lstm (LSTM) (None, 15) 4560 _________________________________________________________________ dense (Dense) (None, 2) 32 ================================================================= Total params: 604,592 Trainable params: 604,592 Non-trainable params: 0 _________________________________________________________________ Epoch 1/5 4688/4688 [==============================] - 134s 29ms/step - loss: 0.4844 - accuracy: 0.7660 Epoch 2/5 4688/4688 [==============================] - 111s 24ms/step - loss: 0.4488 - accuracy: 0.7879 Epoch 3/5 4688/4688 [==============================] - 110s 24ms/step - loss: 0.4342 - accuracy: 0.7961 Epoch 4/5 4688/4688 [==============================] - 111s 24ms/step - loss: 0.4226 - accuracy: 0.8026 Epoch 5/5 4688/4688 [==============================] - 127s 27ms/step - loss: 0.4128 - accuracy: 0.8079
The trained model is saved to the file sentiment-model.h5.
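Since training only used the first 600,000 lines, the rest of the cleaned file can serve as a rough held-out set. The following evaluation sketch is not part of the original pipeline; it assumes cleaned-sentiment-data.csv, tokenizer.pickle, and sentiment-model.h5 are all present:

import pickle

from keras import models
from keras.preprocessing.sequence import pad_sequences
from numpy import array

with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
model = models.load_model("sentiment-model.h5")

texts, labels = [], []
with open("cleaned-sentiment-data.csv") as f:
    # Lines beyond the first 600000 were never seen during training.
    for line in list(f)[600000:620000]:
        text, sentiment = line.rsplit(",", 1)
        texts.append(text.strip())
        if sentiment.strip() == "\"4\"":
            labels.append(array([1, 0]))
        else:
            labels.append(array([0, 1]))

padded = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=60)
loss, accuracy = model.evaluate(padded, array(labels), batch_size=128)
print("held-out accuracy:", accuracy)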
To use the model for classification, we can run the following script:
sentiment-classify.py:
import pickle

from keras import models
from keras.preprocessing.sequence import pad_sequences

# Load the tokenizer that was fitted during training.
with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

userInput = input("Enter a phrase: ")
inputSequence = tokenizer.texts_to_sequences([userInput])
# Pad to the same length that was used during training.
paddedSequence = pad_sequences(inputSequence, maxlen=60)

model = models.load_model("sentiment-model.h5")
predictions = model.predict(paddedSequence)
print(predictions[0])
# Index 0 holds the positive score, index 1 the negative score
# (matching the one-hot labels used in training).
if predictions[0][0] > predictions[0][1]:
    print("Positive")
else:
    print("Negative")
Example usage:
$ python sentiment-classify.py
Enter a phrase: what a great day!
[0.8984171  0.10158285]
Positive

$ python sentiment-classify.py
Enter a phrase: yesterday was terrible
[0.13580368 0.86419624]
Negative
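To score many phrases at once, a batch variant avoids reloading the model for every phrase. A minimal sketch along the same lines as sentiment-classify.py:

import pickle

from keras import models
from keras.preprocessing.sequence import pad_sequences

with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
model = models.load_model("sentiment-model.h5")

phrases = ["what a great day!", "yesterday was terrible"]
padded = pad_sequences(tokenizer.texts_to_sequences(phrases), maxlen=60)
# A single predict() call scores the whole batch.
for phrase, scores in zip(phrases, model.predict(padded)):
    label = "Positive" if scores[0] > scores[1] else "Negative"
    print(phrase, "->", label, scores)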