This post shows how to build a sentiment classifier for strings using Deep Learning, specifically Keras.
The classifier is trained from scratch using labelled data from Twitter.
The data set is the Sentiment140 corpus of labelled tweets, which can be obtained from: http://help.sentiment140.com/for-students
The CSV data inside the file looks like this:
training.1600000.processed.noemoticon.csv:
"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D" "0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!" "0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds" "0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire " ...
The first value is the sentiment label (“0” is negative, “4” is positive); the last value is the tweet text.
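Before any cleanup, it can help to confirm the label distribution in the raw file. This is a minimal sketch, not part of the pipeline; it assumes the file is in the working directory and reads it as latin-1, since the raw bytes are not valid UTF-8:

import csv
from collections import Counter

# Count the sentiment labels (first field) in the raw file. The raw
# bytes are not valid UTF-8, so latin-1 is used as a byte-preserving
# fallback encoding.
counts = Counter()
with open("training.1600000.processed.noemoticon.csv",
          encoding="latin-1", newline="") as f:
    for row in csv.reader(f):
        counts[row[0]] += 1
print(counts)  # expected to be roughly balanced, e.g. {'0': 800000, '4': 800000}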
To start, we need to convert the file to remove byte sequences that are not valid UTF-8:
$ iconv -c training.1600000.processed.noemoticon.csv > training-data.csv
The -c option tells iconv to silently discard characters that cannot be converted.
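If iconv is not available, the same cleanup can be approximated in Python. This is a sketch under the assumption that errors="ignore" is an acceptable stand-in for iconv's -c:

# Decode the raw bytes, dropping anything that is not valid UTF-8
# (approximates iconv -c).
with open("training.1600000.processed.noemoticon.csv", "rb") as src:
    data = src.read().decode("utf-8", errors="ignore")
with open("training-data.csv", "w", encoding="utf-8") as dst:
    dst.write(data)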
The file training-data.csv is now the input for the next step.
The following script cleans the data, extracting only the parts we need: the tweet text and the labelled sentiment values. It also shuffles the lines so that, when we later train on a subset of them, both classes are represented.
prepare-training-data.py:
import numpy as np

inputFile = open("training-data.csv")
lines = inputFile.readlines()
# Shuffle the lines so that both sentiment classes are spread
# throughout the output file.
np.random.shuffle(lines)
outputLines = []
for line in lines:
    # Split on at most 5 commas, so commas inside the tweet text
    # (the sixth field) are preserved.
    parts = line.split(",", 5)
    sentiment = parts[0]
    text = parts[5]
    outputLine = text.strip() + " , " + sentiment + "\n"
    outputLines.append(outputLine)
outputFile = open("cleaned-sentiment-data.csv", "w")
outputFile.writelines(outputLines)
Run the script to generate the cleaned file:
$ python prepare-training-data.py
The cleaned training file contains only two fields per line: the tweet text and the sentiment label.
The file looks like this:
cleaned-sentiment-data.csv:
"@realdollowner Today is a better day. Overslept , "4" "@argreen Boo~ , "0" "Just for the people I don't know x" , "4" ...
We can now use this file to train our model.
The following script performs the training.
sentiment-train.py:
import pickle

from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from numpy import array

trainFile = open("cleaned-sentiment-data.csv", "r")
labels = []
wordVectors = []
allLines = trainFile.readlines()
# Take a subset of the data.
lines = allLines[:600000]
for line in lines:
    # Split on the last comma only, since the tweet text itself may
    # contain commas.
    parts = line.rsplit(",", 1)
    string = parts[0].strip()
    sentiment = parts[1].strip()
    # Append the text only when a label is recognised, so that
    # wordVectors and labels stay aligned.
    if sentiment == "\"4\"":  # Positive.
        wordVectors.append(string)
        labels.append(array([1, 0]))
    if sentiment == "\"0\"":  # Negative.
        wordVectors.append(string)
        labels.append(array([0, 1]))
labels = array(labels)

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(wordVectors)
# Save the tokenizer to file; it will be needed by the classification script.
with open("tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

sequences = tokenizer.texts_to_sequences(wordVectors)
paddedSequences = pad_sequences(sequences, maxlen=60)

model = Sequential()
# Embedding layer: number of possible words, size of the embedding vectors.
model.add(Embedding(10000, 60))
model.add(LSTM(15, dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
model.fit(paddedSequences, labels, epochs=5, batch_size=128)
model.save("sentiment-model.h5")
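As an aside, the Tokenizer and pad_sequences steps are easier to understand in isolation. A tiny illustration (the exact indices depend on word frequencies in the fitted texts):

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(["what a great day", "a terrible day"])
# Each word is mapped to an integer index, most frequent words first.
seqs = tokenizer.texts_to_sequences(["a great great day"])
print(seqs)                           # e.g. [[1, 4, 4, 2]]
# pad_sequences left-pads with zeros to a fixed length.
print(pad_sequences(seqs, maxlen=6))  # e.g. [[0 0 1 4 4 2]]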
The training output will be similar to the following:
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 60) 600000 _________________________________________________________________ lstm (LSTM) (None, 15) 4560 _________________________________________________________________ dense (Dense) (None, 2) 32 ================================================================= Total params: 604,592 Trainable params: 604,592 Non-trainable params: 0 _________________________________________________________________ Epoch 1/5 4688/4688 [==============================] - 134s 29ms/step - loss: 0.4844 - accuracy: 0.7660 Epoch 2/5 4688/4688 [==============================] - 111s 24ms/step - loss: 0.4488 - accuracy: 0.7879 Epoch 3/5 4688/4688 [==============================] - 110s 24ms/step - loss: 0.4342 - accuracy: 0.7961 Epoch 4/5 4688/4688 [==============================] - 111s 24ms/step - loss: 0.4226 - accuracy: 0.8026 Epoch 5/5 4688/4688 [==============================] - 127s 27ms/step - loss: 0.4128 - accuracy: 0.8079
The trained model is saved to the file sentiment-model.h5.
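Since training only used the first 600,000 lines, the rest of the cleaned file can serve as a rough held-out set. The following evaluation sketch is not part of the original pipeline; it assumes cleaned-sentiment-data.csv, tokenizer.pickle, and sentiment-model.h5 are all present:

import pickle

from keras import models
from keras.preprocessing.sequence import pad_sequences
from numpy import array

with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
model = models.load_model("sentiment-model.h5")

texts, labels = [], []
with open("cleaned-sentiment-data.csv") as f:
    # Lines beyond the first 600000 were never seen during training.
    for line in list(f)[600000:620000]:
        text, sentiment = line.rsplit(",", 1)
        texts.append(text.strip())
        if sentiment.strip() == "\"4\"":
            labels.append(array([1, 0]))
        else:
            labels.append(array([0, 1]))

padded = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=60)
loss, accuracy = model.evaluate(padded, array(labels), batch_size=128)
print("held-out accuracy:", accuracy)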
To use the model for classification, we can run the following script:
sentiment-classify.py:
import pickle

from keras import models
from keras.preprocessing.sequence import pad_sequences

# Load the tokenizer that was fitted during training.
with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

userInput = input("Enter a phrase: ")
inputSequence = tokenizer.texts_to_sequences([userInput])
# Pad to the same length that was used during training.
paddedSequence = pad_sequences(inputSequence, maxlen=60)

model = models.load_model("sentiment-model.h5")
predictions = model.predict(paddedSequence)
print(predictions[0])
# Index 0 holds the positive score, index 1 the negative score
# (matching the one-hot labels used in training).
if predictions[0][0] > predictions[0][1]:
    print("Positive")
else:
    print("Negative")
Example usage:
$ python sentiment-classify.py
Enter a phrase: what a great day!
[0.8984171  0.10158285]
Positive

$ python sentiment-classify.py
Enter a phrase: yesterday was terrible
[0.13580368 0.86419624]
Negative
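To score many phrases at once, a batch variant avoids reloading the model for every phrase. A minimal sketch along the same lines as sentiment-classify.py:

import pickle

from keras import models
from keras.preprocessing.sequence import pad_sequences

with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
model = models.load_model("sentiment-model.h5")

phrases = ["what a great day!", "yesterday was terrible"]
padded = pad_sequences(tokenizer.texts_to_sequences(phrases), maxlen=60)
# A single predict() call scores the whole batch.
for phrase, scores in zip(phrases, model.predict(padded)):
    label = "Positive" if scores[0] > scores[1] else "Negative"
    print(phrase, "->", label, scores)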