Sentiment Analysis with Keras based on Twitter Training Data

This post shows how to build a sentiment classifier for strings using Deep Learning, specifically Keras.
The classifier is trained from scratch using labelled data from Twitter.
The data set can be obtained from:
The CSV data looks like the following, inside the file:

training.1600000.processed.noemoticon.csv:

"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"
"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"
"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "
...
The first value is the sentiment (“0” is Negative, “4” is positive).
To start, we need to convert the file to remove incompatible UTF-8 characters:
$ iconv -c training.1600000.processed.noemoticon.csv > training-data.csv
The -c option specifies to ignore lines which cause errors.
The file training-data.csv is now the input for the next step.
The following script cleans the data to extract only the parts we need: the strings and the labelled sentiment values.

prepare-training-data.py:

import numpy as np

inputFile = open("training-data.csv")
lines = inputFile.readlines()

np.random.shuffle(lines)

outputLines = []

for line in lines:
  parts = line.split(",")
  sentiment = parts[0]
  text = parts[5]
  outputLine = text.strip() + " , " + sentiment + "\n"
  outputLines.append(outputLine)

outputFile = open("cleaned-sentiment-data.csv", "w")
outputFile.writelines(outputLines)
Run the script to generate the cleaned file:
$ python prepare-training-data.py
The cleaned training file has only Text, Sentiment.
The file looks like this:

cleaned-sentiment-data.csv:

"@realdollowner Today is a better day.  Overslept , "4"
"@argreen Boo~ , "0"
"Just for the people I don't know  x" , "4"
...
We can now use this file to train our model.
The following script is the training process.

sentiment-train.py:

import pickle

from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer 
from numpy import array

trainFile = open("cleaned-sentiment-data.csv", "r")

labels = []
wordVectors = []

allLines = trainFile.readlines()

# Take a subset of the data.
lines = allLines[:600000]

for line in lines:
  parts = line.split(",")
  string = parts[0].strip()
  wordVectors.append(string)

  sentiment = parts[1].strip()
  if sentiment == "\"4\"": # Positive.
    labels.append(array([1, 0]))
  if sentiment == "\"0\"": # Negative.
    labels.append(array([0, 1]))

labels = array(labels)

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(wordVectors)

# Save tokenizer to file; will be needed for categorization script.
with open("tokenizer.pickle", "wb") as handle:
  pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

sequences = tokenizer.texts_to_sequences(wordVectors)

paddedSequences = pad_sequences(sequences, maxlen=60)

model = Sequential()
# Embedding layer: number of possible words, size of the embedding vectors.
model.add(Embedding(10000, 60))
model.add(LSTM(15, dropout=0.2))
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='adam',
  loss='categorical_crossentropy',
  metrics=['accuracy']
)

model.summary()

model.fit(paddedSequences, labels, epochs=5, batch_size=128)

model.save("sentiment-model.h5")
The training process output will be something similar to:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 60)          600000
_________________________________________________________________
lstm (LSTM)                  (None, 15)                4560
_________________________________________________________________
dense (Dense)                (None, 2)                 32
=================================================================
Total params: 604,592
Trainable params: 604,592
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
4688/4688 [==============================] - 134s 29ms/step - loss: 0.4844 - accuracy: 0.7660
Epoch 2/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4488 - accuracy: 0.7879
Epoch 3/5
4688/4688 [==============================] - 110s 24ms/step - loss: 0.4342 - accuracy: 0.7961
Epoch 4/5
4688/4688 [==============================] - 111s 24ms/step - loss: 0.4226 - accuracy: 0.8026
Epoch 5/5
4688/4688 [==============================] - 127s 27ms/step - loss: 0.4128 - accuracy: 0.8079
The model created is saved to the file: sentiment-model.h5
To use the model, we can use the following script:

sentiment-classify.py:

import pickle
from keras import models
from keras.preprocessing.sequence import pad_sequences

with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

userInput = input("Enter a phrase: ")

inputSequence = tokenizer.texts_to_sequences([userInput])

paddedSequence = pad_sequences(inputSequence)

model = models.load_model("sentiment-model.h5")

predictions = model.predict(paddedSequence)

print(predictions[0])

if predictions[0][0] > predictions[0][1]:
  print("Positive")
else:
  print("Negative")
Example usage:
$ python sentiment-classify.py 
Enter a phrase: what a great day!

[0.8984171  0.10158285]
Positive

$ python sentiment-classify.py
Enter a phrase: yesterday was terrible

[0.13580368 0.86419624]
Negative

Test XPath in the Terminal

There is a handy command-line tool to run XPath expressions and test them in the terminal on macOS.
The command is: xpath.

Assuming we have an XML file test.xml:

<books>
  <book id="123">
    <title>Book A</title>
  </book>
  <book id="456">
    <title>Book B</title>
  </book>
</books>

To test an XPath statement on this file, we can use:

$ xpath test.xml "//book[@id=456]"

Result:

Found 1 nodes:
-- NODE --
<book id="456">
  <title>Book B</title>
</book>

Note that the file argument is first and the XPath expression second, in double-quotes.

 

Merge Standard Error with Standard Output when Using a Pipe

When piping the output of one terminal command to another on a Unix-based system, by default only the Standard Output (stdout) of the first command is piped to the Standard Input (stdin) of the second command.
We can merge streams to make sure stderr is piped to the second command.
The general syntax is:

$ cmd1 2>&1 | cmd2

For a specific example: the “No such file” error below is sent to Standard Error.
The word count command receives an empty input.

$ cat invalid-file | wc
cat: invalid-file: No such file or directory
    0     0     0

Now, if we merge streams, Standard Error is piped to word-count and the number of characters in the error message is counted and printed:

$ cat invalid-file 2>&1 | wc
    1     7     45

This shows that en error output was sent through the pipe to the second command.

 

Run a Large Language Model Locally in the Terminal

We can run a Large Language Model (LLM) – although not quite as good as ChatGPT – on a local machine.
One of the easiest to run is Alpaca, a fine-tuning of LLaMA.
The following works on an Apple M1 Mac.

Clone and build the repo:

$ git clone https://github.com/antimatter15/alpaca.cpp

$ cd alpaca.cpp/

$ make chat

Download the pre-trained model weights:

$ wget -O ggml-alpaca-7b-q4.bin -c https://gateway.estuary.tech/gw/ipfs/QmQ1bf2BTnYxq73MFJWu1B7bQ2UD6qG7D7YDCxhTndVkPC

(See the source repo below for alternatives if this fails).

Run the model:

$ ./chat

Output:

main: seed = 1679968451
llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291

== Running in chat mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.

> What is the age of the universe?
The current estimate for when our Universe was created, 
according to modern cosmology and astronomy, 
is 13.798 billion years ago (±0.2%).
>

References

https://github.com/antimatter15/alpaca.cpp

When npm prestart is Executed

The package.json file allows us to specify a “prestart” script.
When does the npm prestart script run?
The entry under “prestart” will run first when “start” is called, before the script specified by “start“.

The example below demonstrates this.

package.json:

{
  "name": "example-repo",
  "scripts": {
    "prestart": "echo 'In prestart...'",
    "start": "node index.js"
  }
}

Running start:

$ npm run start

Output:

> prestart
> echo 'In prestart...'

In prestart...

> start
> node index.js

This can be useful to run scripts required as a pre-requisite for running a build, for example.

 

REST API Design: Endpoint to Delete a Large Number of Items

In a proper REST API design, we use the DELETE HTTP verb to delete items by ID:

DELETE /resources/{id}

Deleting a list of IDs could look like:

DELETE /resources/{id1},{id2},{id3}

Or as a list of query string parameters:

DELETE /resources?ids=id1,id2,id3

What if the list of items to delete is really large? Meaning, what if clients need to delete items by specifying thousands of IDs?
We then run into URI length limits, so the above options will not be enough.

One way of dealing with this would be to avoid the issue using deletion by some criteria, e.g. tags.

DELETE /resources?tag=items-to-delete

But what if we really have no choice but to delete a very large ad-hoc unpredictable list specified by the client?

The following design can work well. Since DELETE does not have a body, we can use POST.

Because the verb POST no longer properly reflects the action being performed by the API, we can add a sub-resource named /bulk-deletion under resources:

POST /resources/bulk-deletion

{
  "idsToDelete": [ id1, id2, id3, id4, ..., idN ]
}

As a variation, the sub-resource could be /deletion-list or even /ids-to-delete:

POST /resources/ids-to-delete

{
  "ids": [ id1, id2, id3, id4, ..., idN ]
}

This way our limit on the number of IDs is the POST body size limit, so the design can handle a very long list.
The path makes the operation unambiguous and also keeps the route name as a noun so we avoid action names in the URI.

 

Install golang-migrate Inside a Docker Container

To run database migrations using golang-migrate in a pipeline for a Go application, we need the binary migrate command available in the container.

Installing the database migration utility golang-migrate for Go inside a Linux container can be accomplished with the following.
Add this to the Dockerfile (note: the full path):

RUN go install -tags 'postgres' github.com/golang-migrate/migrate/v4/cmd/migrate@latest
RUN ln -s /go/bin/linux_amd64/migrate /usr/local/bin/migrate

After running docker build and run, SSH into the container to see if everything is correct.
Use docker ps to get the container ID. The following commands accomplish this:

$ docker build -t my-tag .

$ docker run -d -p 5000:5000 my-tag

$ docker ps
CONTAINER ID IMAGE ...
8e402138e4b2 my-tag ...

$ docker exec -it 8e402138e4b2 /bin/sh

Now we should have a shell inside the container.
Check if the migrate command installed correctly:

/# which migrate

This should show a path to the command, e.g.:

/go/bin/migrate

Check the version of the command:

/# migrate -version

This should show, e.g.:

dev

The container should now be able to run Go database migrations, preferably by running a startup script when the container starts.

 

Perform Speech Recognition in the Terminal with Whisper

OpenAI Whisper is a state-of-the-art speech recognition model that we can run from the command line.

This post assumes macOS with Python >= 3.7 installed.

First we need to install FFmpeg for audio processing.

$ brew install ffmpeg

Install Whisper:

$ pip install openai-whisper

This will also install a binary command: whisper

Now, record a piece of audio using QuickTime or similar.

Save the file to file.m4a, for example.

Then, to run the speech recognition:

$ whisper file.m4a --model small

The output will look something like this:

Detecting language using up to the first 30 seconds. 
Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:01.420] Hello there.

Notes

We can also use the specific repo URI if brew does not work on a system:

$ pip install git+https://github.com/openai/whisper.git

We can use the medium or large models if the small model is not sufficiently accurate:

$ whisper file.m4a --model medium
$ whisper file.m4a --model large

 

Run TypeScript File from the Command Line

To run a script written in TypeScript from the terminal, outside of the browser, we cannot use the regular Node.js binary.
We need to install the ts-node command as below:

$ npm install -g ts-node

If TypeScript itself is not installed, install it with:

$ npm install -g typescript

Then, we can run the TypeScript file using:

$ ts-node file.ts

 

GraphQL Mesh Gateway Health Check Endpoint

When running the GraphQL Mesh server in a cloud environment in a container, perhaps using a platform like Kubernetes or Elastic Container Service on AWS, we need to specify an HTTP endpoint to act as a health check for the container.

Though it is not well-documented, the Mesh server does have a healthcheck endpoint available.

The route is simply: /healthcheck

Here is an example calling the healthcheck of a locally running Mesh server:

$ curl "http://localhost:4000/healthcheck" -v
* Connected to localhost (127.0.0.1) port 4000 (#0)
> GET /healthcheck HTTP/1.1
> Host: localhost:4000
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Mon, 19 Dec 2022 03:06:17 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
< Content-Length: 0
<

Note that just the root response (“/”) returns an HTTP 302 Found and so /healthcheck is ideal for a 200 OK health status response.