Sentiment Analysis with Keras based on Twitter Training Data

This post shows how to build a sentiment classifier for strings using Deep Learning, specifically Keras.
The classifier is trained from scratch using labelled data from Twitter.
The data set can be obtained from:
Inside the file, the CSV data looks like the following:


"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"
"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"
"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "
The first value is the sentiment label (“0” is negative, “4” is positive).
To start, we need to convert the file to remove incompatible UTF-8 characters:
$ iconv -c training.1600000.processed.noemoticon.csv > training-data.csv
The -c option tells iconv to silently discard any characters that cannot be converted, instead of stopping with an error.
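If iconv is not available, the same cleanup can be done in Python by decoding with errors="ignore", which drops any bytes that are not valid UTF-8. A small standalone sketch (the sample line here is made up):

```python
# Mimic `iconv -c`: decoding with errors="ignore" drops invalid UTF-8 bytes.
# The 0xE9 byte below is Latin-1 encoded and is not valid UTF-8 on its own.
raw = b'"0","1467810369","Mon Apr 06","NO_QUERY","user","caf\xe9 tweet"\n'

cleaned = raw.decode("utf-8", errors="ignore")
print(cleaned.strip())
```

For the real file, the same decode would be applied to the bytes read from training.1600000.processed.noemoticon.csv before writing training-data.csv.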
The file training-data.csv is now the input for the next step.
The following script cleans the data to extract only the parts we need: the strings and the labelled sentiment values.

inputFile = open("training-data.csv")
lines = inputFile.readlines()
inputFile.close()

outputLines = []

for line in lines:
  parts = line.split(",")
  sentiment = parts[0]
  # The tweet text is the sixth field; rejoin the remaining parts in case
  # the text itself contains commas.
  text = ",".join(parts[5:])
  outputLine = text.strip() + " , " + sentiment + "\n"
  outputLines.append(outputLine)

outputFile = open("cleaned-sentiment-data.csv", "w")
outputFile.writelines(outputLines)
outputFile.close()
Run the script to generate the cleaned file:
$ python
The cleaned training file contains only two fields per line: the text and the sentiment.
The file looks like this:


"@realdollowner Today is a better day.  Overslept , "4"
"@argreen Boo~ , "0"
"Just for the people I don't know  x" , "4"
We can now use this file to train our model.
The following script performs the training.

import pickle

from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from numpy import array

trainFile = open("cleaned-sentiment-data.csv", "r")

labels = []
wordVectors = []

allLines = trainFile.readlines()
trainFile.close()

# Take a subset of the data.
lines = allLines[:600000]

for line in lines:
  # Split on the last comma so commas inside the tweet text are preserved.
  parts = line.rsplit(",", 1)
  string = parts[0].strip()
  sentiment = parts[1].strip()

  if sentiment == "\"4\"":  # Positive.
    labels.append(array([1, 0]))
    wordVectors.append(string)
  elif sentiment == "\"0\"":  # Negative.
    labels.append(array([0, 1]))
    wordVectors.append(string)

labels = array(labels)

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(wordVectors)

# Save the tokenizer to file; it will be needed later by the categorization script.
with open("tokenizer.pickle", "wb") as handle:
  pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

sequences = tokenizer.texts_to_sequences(wordVectors)

paddedSequences = pad_sequences(sequences, maxlen=60)

model = Sequential()
# Embedding layer: number of possible words, size of the embedding vectors.
model.add(Embedding(10000, 60))
model.add(LSTM(15, dropout=0.2))
model.add(Dense(2, activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(paddedSequences, labels, epochs=5, batch_size=128)
model.save("sentiment-model.h5")
The training process output will be something similar to:
Model: "sequential"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, None, 60)          600000
lstm (LSTM)                  (None, 15)                4560
dense (Dense)                (None, 2)                 32
Total params: 604,592
Trainable params: 604,592
Non-trainable params: 0
Epoch 1/5
4688/4688 [==============================] - 134s 29ms/step - loss: 0.4844 - accuracy: 0.7660
Epoch 2/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4488 - accuracy: 0.7879
Epoch 3/5
4688/4688 [==============================] - 110s 24ms/step - loss: 0.4342 - accuracy: 0.7961
Epoch 4/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4226 - accuracy: 0.8026
Epoch 5/5
4688/4688 [==============================] - 127s 27ms/step - loss: 0.4128 - accuracy: 0.8079
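The parameter counts in the summary can be verified by hand from the layer formulas:

```python
# Embedding: vocabulary size x embedding dimension.
embedding_params = 10000 * 60
print(embedding_params)  # 600000

# LSTM: 4 gates, each with (input_dim + units) * units weights plus units biases.
lstm_params = 4 * ((60 + 15) * 15 + 15)
print(lstm_params)  # 4560

# Dense: inputs * outputs weights plus one bias per output.
dense_params = 15 * 2 + 2
print(dense_params)  # 32

print(embedding_params + lstm_params + dense_params)  # 604592
```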
The trained model is saved to the file sentiment-model.h5.
To categorize new text with the model, we can use the following script:

import pickle
from keras import models
from keras.preprocessing.sequence import pad_sequences

with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

userInput = input("Enter a phrase: ")

inputSequence = tokenizer.texts_to_sequences([userInput])

# Pad to the same length that was used during training.
paddedSequence = pad_sequences(inputSequence, maxlen=60)

model = models.load_model("sentiment-model.h5")

predictions = model.predict(paddedSequence)
print(predictions[0])

if predictions[0][0] > predictions[0][1]:
  print("Positive")
else:
  print("Negative")
Example usage:
$ python 
Enter a phrase: what a great day!

[0.8984171  0.10158285]
Positive

$ python
Enter a phrase: yesterday was terrible

[0.13580368 0.86419624]
Negative
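The two softmax outputs sum to one, and the larger entry determines the class: index 0 corresponds to positive and index 1 to negative, matching the one-hot labels used during training. A standalone illustration using the first example's output:

```python
# Softmax output from the "what a great day!" example above.
prediction = [0.8984171, 0.10158285]

# Index 0 was encoded as positive, index 1 as negative.
label = "Positive" if prediction[0] > prediction[1] else "Negative"
print(label)  # Positive
```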

Test XPath in the Terminal

macOS includes a handy command-line tool, xpath, for running and testing XPath expressions in the terminal.

Assuming we have an XML file test.xml:

  <book id="123">
    <title>Book A</title>
  <book id="456">
    <title>Book B</title>

To test an XPath statement on this file, we can use:

$ xpath test.xml "//book[@id=456]"


Found 1 nodes:
-- NODE --
<book id="456">
  <title>Book B</title>

Note that the file argument is first and the XPath expression second, in double-quotes.
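The same kind of query can also be tested programmatically. Python's standard xml.etree.ElementTree module supports a limited XPath subset; a sketch, with the document from test.xml embedded inline and a <books> root element assumed:

```python
import xml.etree.ElementTree as ET

# The same document as test.xml, embedded inline for a self-contained example.
xml = """
<books>
  <book id="123"><title>Book A</title></book>
  <book id="456"><title>Book B</title></book>
</books>
"""

root = ET.fromstring(xml)
# ElementTree's XPath subset requires the attribute value to be quoted.
matches = root.findall('.//book[@id="456"]')
for book in matches:
    print(book.find("title").text)  # Book B
```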