Visualization of Self-Attention Network for Sentiment Analysis

I am a self-taught Python aficionado, dancing in the realms of AI and ML. What started as a curious exploration soon turned into a revelation: the unsung heroes behind the AI symphony are linear algebra, probability, and statistics. Astonishingly, these mathematical wizards not only power the algorithms but also surpass human problem-solving finesse.

I use a self-attention network for sentiment analysis. The model is equivalent to a single head of a multi-head Transformer encoder.

I use the IMDB dataset, which contains 50K movie reviews, each labeled as positive or negative in sentiment.

Architecture

The inputs are tokenized, and each token is then represented by a vector.

Since self-attention has no way of knowing the position of words in the input, positional information is added to the inputs.
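The two steps above can be sketched in NumPy as follows. This is only an illustration, not the model's actual code: the post does not say which kind of positional information is used, so the fixed sinusoidal encoding here is an assumption (a learned position embedding would work just as well), and the sizes, token ids, and the random `embedding_table` are all hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings.

    Assumes d_model is even, so sine and cosine columns pair up.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)             # even columns
    encoding[:, 1::2] = np.cos(angles)             # odd columns
    return encoding

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 8                       # hypothetical sizes
embedding_table = rng.normal(size=(vocab_size, d_model))  # stands in for learned embeddings

token_ids = np.array([5, 17, 5, 42])               # hypothetical ids; id 5 appears twice
x = embedding_table[token_ids] + sinusoidal_positions(len(token_ids), d_model)
```

Token id 5 appears at positions 0 and 2, yet the two rows of `x` differ, which is exactly the positional signal that self-attention cannot produce on its own.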

The vectors are projected into query, key, and value matrices.

Self-attention processes them and the output is averaged over the tokens. What does self-attention actually do? Since each token communicates with every other token, it can be seen as the step where information is gathered.
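A minimal NumPy sketch of the projection, attention, and averaging steps might look like this. It uses the standard scaled dot-product formulation from the Transformer, which I assume matches the model here; the weight matrices are random placeholders for learned parameters, and all sizes are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Returns the attended vectors and the (seq_len, seq_len) weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project into query, key, value
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise token-to-token scores
    weights = softmax(scores, axis=-1)             # row i: how token i attends to every token
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4                 # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))            # embedded inputs with positions added
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

attended, weights = self_attention(x, w_q, w_k, w_v)
pooled = attended.mean(axis=0)                     # average over tokens: one vector per review
```

Each row of `weights` sums to 1, so it can be read as how one token distributes its attention over all tokens in the review.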

Now that the information has been gathered, the model needs to process it, so it is passed through a dense layer.

Finally, the output of the dense layer is fed to the output layer, which uses a sigmoid activation to classify the review as positive or negative.

In short, the averaged attention output passes through a dense layer, whose output is then sent to the output layer with sigmoid activation.
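The classification head described above can be sketched as below. The post does not name the dense layer's activation or its width, so the ReLU, the sizes, the 0.5 threshold, and the choice of "positive" as the label for high probabilities are all assumptions, and the random weights only stand in for trained ones.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(pooled, w_hidden, b_hidden, w_out, b_out):
    """Dense layer followed by a single sigmoid unit; returns a probability."""
    hidden = relu(pooled @ w_hidden + b_hidden)    # dense layer (ReLU is an assumption)
    logit = hidden @ w_out + b_out                 # single output unit
    return float(sigmoid(logit))

rng = np.random.default_rng(0)
d_head, d_hidden = 4, 16                           # hypothetical sizes
pooled = rng.normal(size=d_head)                   # averaged self-attention output
w_hidden = rng.normal(size=(d_head, d_hidden)) * 0.1
b_hidden = np.zeros(d_hidden)
w_out = rng.normal(size=d_hidden) * 0.1
b_out = 0.0

probability = classify(pooled, w_hidden, b_hidden, w_out, b_out)
label = "positive" if probability >= 0.5 else "negative"
```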

Visualizations

Visualization for a Negative Sentiment

Input text: 'Horrible acting with the worst special f/x I've ever bore witness too. It's bad enough I wasted $3 to watch this crummy pile of crap, but it's the hour and a half time I lost that I could've been doing anything else like getting a root canal or volunteering for jury duty. Getting drunk couldn't even help this video.

To put it bluntly, I sincerely believe I actually lost a few IQ points during the course of watching this idiotic piece of mind-numbing "work"! Perhaps I should have followed my own advice this time. Never expect a decent film if it's written, directed and produced by the same person, and never EVER expect anything of value from Jeff Fahey.'

The input text is preprocessed. Stopwords and less common words are removed.
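As a rough sketch of this preprocessing, assuming a simple lowercase word tokenizer and a frequency cut-off, the filtering could look like this. The tiny `STOPWORDS` set, the `max_words` parameter, and the helper names are hypothetical; a real pipeline would use a full stopword list and a vocabulary built from the training reviews.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "i", "it", "to", "this", "with", "of", "and", "was"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(texts, max_words):
    """Keep the max_words most frequent non-stopword tokens."""
    counts = Counter(
        tok for text in texts for tok in tokenize(text) if tok not in STOPWORDS
    )
    return {word for word, _ in counts.most_common(max_words)}

def preprocess(text, vocab):
    """Drop stopwords and rare words by keeping only in-vocabulary tokens."""
    return [tok for tok in tokenize(text) if tok in vocab]

corpus = [
    "Horrible acting with the worst effects",
    "The acting was horrible",
    "Worst acting, horrible plot",
]
vocab = build_vocab(corpus, max_words=3)
cleaned = preprocess("The acting was horrible and the plot was thin", vocab)
# cleaned == ['acting', 'horrible']
```

Here "plot" and "thin" are dropped because they are too rare in the toy corpus, and "the", "was", and "and" are dropped as stopwords.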

Attention Visualization

Interpretation: Pay attention to the words highlighted in red, such as horrible, worst, and wasted. These are the words the model gives the most weight to when classifying the input text as negative.

It is worth paying attention to the words highlighted in brown and black which also help the model in classification.

Visualization for a Positive Sentiment

Input text: 'I used to have a fascination with the cartoon back in college when it was being made. It had much the charm of "Get Smart". While it admittedly had its faults, it was rather enjoyable.

Naturally I was very interested in seeing the film version. That was before I saw it. Afterwords I wished it had never been made.

Besides being miscast all around (who on Earth though Broderick was even close to the role?) it just didn't make the grade.

The effects were reasonable and perhaps the ONLY thing I liked about the movie; seeing a live-action version of the gadgets in action! What was missing was a story and treatment which made it funny or charming or interesting.

The original was a wacky cartoon with a very lighthearted attitude. It was FUN. The motion picture became murky and took itself FAR too seriously. If it had seriously had a great plot or went crazy enough to make it seem like a "cartoon on film" it might have been enjoyable.

As it exists it doesn't deserve to be considered part of the "Gadget Legacy".'

Attention Visualization

Interpretation: Look at the words highlighted in red, such as charm, liked, great, plot, and enough. These are the words that help the model classify the input as positive.

Other words highlighted in brown and black also help the model.

The visualizations above show which words the self-attention network pays the most attention to when classifying a review as positive or negative.

What's the intuition behind the visualization?

In self-attention, each token talks to every other token and assigns a score that tells how important that token is. If those scores can be extracted, they can be visualized.
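One way to turn the attention scores into a highlight map is sketched below. The post does not specify how the per-token score is computed, so averaging the attention each token receives, the hand-written `weights` matrix, and the 0.8 and 0.4 cut-offs for red, brown, and black are all assumptions made for illustration.

```python
import numpy as np

def token_importance(weights):
    """Average attention each token receives across all query positions."""
    return weights.mean(axis=0)

def color_for(relative_score):
    """Map a score in [0, 1] to a highlight color (cut-offs are hypothetical)."""
    if relative_score >= 0.8:
        return "red"
    if relative_score >= 0.4:
        return "brown"
    return "black"

tokens = ["horrible", "acting", "worst", "effects"]
# Hand-written attention weights for illustration; row i is how token i attends to each token.
weights = np.array([
    [0.6, 0.1, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.4, 0.1, 0.4, 0.1],
    [0.5, 0.1, 0.3, 0.1],
])

importance = token_importance(weights)             # approximately [0.5, 0.1, 0.3, 0.1]
relative = importance / importance.max()           # scale so the top token scores 1.0
colors = {tok: color_for(score) for tok, score in zip(tokens, relative)}
# colors == {'horrible': 'red', 'acting': 'black', 'worst': 'brown', 'effects': 'black'}
```

In a trained model, `weights` would come from the attention layer itself rather than being written by hand.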
