Coursera Natural Language Processing Specialization notes
Jan 17, 2021 00:00 · 834 words · 4 minute read
Simple notes for the Coursera NLP Specialization.
Course1. Natural Language Processing with Classification and Vector Spaces
Week1. Logistic Regression
- Supervised ML and Sentiment Analysis:
- Input -> labels: learn the mapping from inputs to labels by optimizing a cost function
- Sentiment analysis: given a paragraph, we classify it as positive or negative.
- Vocabulary and feature extraction:
- Vocabulary: all unique words in the text of interest
- Feature extraction: convert a sentence into a vector by setting the entry for each vocabulary word to 1 if that word appears in the sentence, and to 0 otherwise
- Drawback: the vectors are as large as the vocabulary and mostly zeros, which means large vectors and long training times
- Negative and positive frequency:
- Map each (word, class) pair to its frequency in the training corpus
- Feature extraction: x_m = [1, sum of positive frequencies, sum of negative frequencies], where the leading 1 is the bias term (see the sketch after this list)
- Preprocessing: stop words, punctuation, stemming, and lowercasing
- Use a logistic regression model to predict sentiment
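A minimal sketch of this feature extraction in Python; the freqs dictionary and tokens below are made-up stand-ins for the counts built from a training set:

    # Frequency-based features x_m = [1, sum_pos, sum_neg].
    # `freqs` maps (word, class) pairs to counts; values here are invented.
    freqs = {("happi", 1): 3, ("sad", 0): 2, ("learn", 1): 1, ("learn", 0): 1}

    def extract_features(tokens, freqs):
        pos = sum(freqs.get((w, 1), 0) for w in tokens)
        neg = sum(freqs.get((w, 0), 0) for w in tokens)
        return [1, pos, neg]  # the leading 1 is the bias term

    print(extract_features(["happi", "learn"], freqs))  # [1, 4, 1]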
Week2. Naive Bayes
- Bayes' rule: P(X|Y)P(Y) = P(Y|X)P(X)
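A small sketch of how Naive Bayes can score sentiment from (word, class) counts, using Laplace smoothing and the log ratio of class-conditional probabilities; the counts are made up:

    import math

    # (word, class) counts as in Week 1; class 1 = positive, 0 = negative.
    freqs = {("happi", 1): 3, ("happi", 0): 1, ("sad", 1): 1, ("sad", 0): 3}
    vocab = {w for (w, _) in freqs}
    n_pos = sum(c for (w, cls), c in freqs.items() if cls == 1)
    n_neg = sum(c for (w, cls), c in freqs.items() if cls == 0)

    def log_likelihood(word):
        # Laplace smoothing: add 1 to each count, add |V| to the denominator.
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + len(vocab))
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + len(vocab))
        return math.log(p_w_pos / p_w_neg)

    # Positive total score => predict positive sentiment.
    print(sum(log_likelihood(w) for w in ["happi"]))  # ~0.69 > 0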
Week3. Vector space models
- Represent words and documents as vectors
- Representations that capture relative meaning
- Word-by-word design: count how many times two words co-occur within a distance k of each other
- Word-by-document design: count how many times a word occurs in documents of a given category
- Cosine similarity: measure similarity as the cosine of the angle between two vectors (see the sketch after this list)
- PCA:
- Eigenvectors give the direction of uncorrelated features
- Eigenvalues are the variance of the new features
- Dot product gives the projection on uncorrelated features
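A rough sketch of cosine similarity and of PCA via the eigen-decomposition of the covariance matrix; the data matrix X is made up:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    X = np.array([[1.0, 2.0, 1.5], [2.0, 4.1, 3.0],
                  [0.9, 1.8, 1.4], [2.1, 3.9, 3.1]])
    print(cosine_similarity(X[0], X[1]))

    # PCA: center the data, take eigenvectors of the covariance matrix,
    # and project onto the directions with the largest eigenvalues (variance).
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    X_2d = Xc @ eigvecs[:, order[:2]]      # dot product = projection
    print(X_2d.shape)                      # (4, 2)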
Week4. Machine translation and document search
- Translate X to Y by learning a transformation matrix R such that XR ≈ Y (R can be found by minimizing the Frobenius norm of XR - Y with gradient descent)
- Locality sensitive hashing (LSH): an algorithm for solving the approximate or exact Nearest Neighbor Search in high dimensional spaces. LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity.
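A minimal sketch of the random-hyperplane hashing behind LSH; the number of planes and the vector dimension are arbitrary choices here:

    import numpy as np

    # Vectors on the same side of every plane land in the same bucket,
    # so nearby vectors tend to share a hash value.
    rng = np.random.default_rng(0)
    planes = rng.normal(size=(5, 300))   # 5 random planes for 300-d vectors

    def hash_value(v, planes):
        signs = (planes @ v >= 0).astype(int)        # side of each plane: 0 or 1
        return int(sum(s * 2**i for i, s in enumerate(signs)))

    v = rng.normal(size=300)
    print(hash_value(v, planes))         # bucket id in [0, 2**5)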
Course2. Natural Language Processing with Probabilistic Models
Week1. Autocorrect and minimum edit distance
How it works:
- Identify a misspelled word
- Find strings n edit distance away
- Filter candidates
- Calculate word probabilities
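A sketch of minimum edit distance with dynamic programming; the costs (1 for insert/delete, 2 for replace) follow the convention used in the course assignment, as far as I remember:

    def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
        m, n = len(source), len(target)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            D[i][0] = D[i - 1][0] + del_cost
        for j in range(1, n + 1):
            D[0][j] = D[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                r = 0 if source[i - 1] == target[j - 1] else rep_cost
                D[i][j] = min(D[i - 1][j] + del_cost,
                              D[i][j - 1] + ins_cost,
                              D[i - 1][j - 1] + r)
        return D[m][n]

    print(min_edit_distance("play", "stay"))  # 4: two replacements at cost 2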
Week2. Part of speech tagging and Hidden Markov Models
Week3. Autocomplete and Language Models
- An N-gram is a sequence of N words.
- Probability of a bigram: P(y|x) = C(x, y)/C(x)
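A tiny worked example of the bigram estimate on a toy corpus:

    from collections import Counter

    # P(y|x) = C(x, y) / C(x)
    corpus = "i am happy because i am learning".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(x, y):
        return bigrams[(x, y)] / unigrams[x]

    print(bigram_prob("i", "am"))      # C(i, am)=2, C(i)=2 -> 1.0
    print(bigram_prob("am", "happy"))  # C(am, happy)=1, C(am)=2 -> 0.5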
Week4. Word embedding with neural networks
Basic word embedding methods:
- word2vec (Google, 2013)
- continuous bag of words (CBOW)
- continuous skip-gram / skip-gram with negative sampling
- Global Vectors (GloVe)(Stanford, 2014)
- fastText (Facebook, 2016)
- supports out-of-vocabulary (OOV) words
Advanced word embedding methods:
- BERT (Google, 2018)
- ELMo (Allen Institute for AI, 2018)
- GPT-2 (OpenAI, 2019)
Cleaning and tokenizing matters:
- letter case
- punctuation
- numbers
- special characters
- special words
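A possible cleaning and tokenizing pass covering the points above; the <NUM> placeholder and the regexes are just one choice:

    import re

    def tokenize(text):
        text = text.lower()                        # letter case
        text = re.sub(r"\d+", "<NUM>", text)       # numbers -> special word
        text = re.sub(r"[^\w<>\s]", " ", text)     # punctuation / special chars
        return text.split()

    print(tokenize("I bought 2 books, and they cost $30!"))
    # ['i', 'bought', '<NUM>', 'books', 'and', 'they', 'cost', '<NUM>']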
Course3. Natural Language Processing with Sequence Models
Week1. Neural network with sentiment analysis
Classes in Python
    class MyClass:
        def __init__(self, y):
            self.y = y

        def my_method(self, x):
            return x + self.y

        def __call__(self, x):
            # calling an instance forwards to my_method
            return self.my_method(x)

    f = MyClass(7)
    print(f(3))  # prints 10
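Calling f(3) invokes __call__, which forwards to my_method and prints 10; Trax layers, used later in the course, follow the same callable pattern.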
Week2. Recurrent Neural Network for language modeling
- RNNs model relationships among distant words in a sequence
- In RNNs, the computations at every time step share the same parameters (see the sketch below)
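A sketch of a vanilla RNN step in NumPy; the same Wh, Wx, and b are reused at every time step, which is what parameter sharing means here (sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 16, 8
    Wh = rng.normal(size=(hidden, hidden)) * 0.1
    Wx = rng.normal(size=(hidden, emb)) * 0.1
    b = np.zeros(hidden)

    def rnn_step(h, x):
        return np.tanh(Wh @ h + Wx @ x + b)

    h = np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):  # a sequence of 5 word embeddings
        h = rnn_step(h, x)               # same parameters at every step
    print(h.shape)                       # (16,)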
Week3. LSTM and NER
RNN disadvantages:
- struggles with longer sequences
- prone to vanishing or exploding gradients
LSTM: a memorable solution
- Learns when to remember and when to forget
- Basic anatomy (see the sketch after this list):
- a cell state
- a hidden state with three gates
- loop back again at the end of each time step
- forget gate decides what to keep
- input gate decides what to add
- output gate decides what the next hidden state will be
- Gates allow gradients to flow unchanged
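A rough NumPy sketch of one LSTM step with the three gates acting on the cell state c and hidden state h; weight shapes and values are made up:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def lstm_step(x, h, c, W, b):
        z = np.concatenate([h, x])
        f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to keep
        i = sigmoid(W["i"] @ z + b["i"])   # input gate: what to add
        o = sigmoid(W["o"] @ z + b["o"])   # output gate: next hidden state
        c_new = f * c + i * np.tanh(W["c"] @ z + b["c"])
        h_new = o * np.tanh(c_new)
        return h_new, c_new

    hidden, emb = 4, 3
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(hidden, hidden + emb)) for k in "fioc"}
    b = {k: np.zeros(hidden) for k in "fioc"}
    h, c = np.zeros(hidden), np.zeros(hidden)
    h, c = lstm_step(rng.normal(size=emb), h, c, W, b)
    print(h.shape, c.shape)                # (4,) (4,)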
Named Entity Recognition (NER):
- locates and extracts predefined entities from text
- places, organizations, names, time and dates
Applications of NER systems:
- search engine efficiency
- recommendation engines
- customer service
- automatic trading
Week4. Siamese Network
What do Siamese Networks learn?
- Identify the similarity between two inputs, e.g., whether two questions ask the same thing
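A sketch of the similarity idea: both inputs are encoded by the same network and compared with cosine similarity, and a triplet-style loss (as I understand the course setup) pushes similar pairs together and dissimilar pairs apart:

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def triplet_loss(anchor, positive, negative, margin=0.25):
        return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

    rng = np.random.default_rng(0)
    a, p, n = rng.normal(size=(3, 8))   # stand-ins for encoded questions
    print(triplet_loss(a, p, n))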
Course4. Natural Language Processing with Attention Models
Week1. Neural Machine Translation
Seq2Seq model:
- Introduced by Google in 2014
- Maps variable-length sequences to fixed-length memory
- LSTMs and GRUs are typically used to overcome the vanishing gradient problem
Seq2Seq shortcomings:
- As sequence size increases, model performance decreases.
- Solution: focus attention in the right places
- prevent sequence overload by giving the model a way to focus on the likeliest words at each step.
- do this by providing the information specific to each input word.
Inside the attention layer: a query Q is scored against the keys K, and the resulting weights combine the corresponding values V (see the sketch below)
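A minimal NumPy sketch of scaled dot-product attention; the shapes are arbitrary:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of queries and keys
        return softmax(scores) @ V               # weighted sum of values

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
    K = rng.normal(size=(5, 4))   # 5 key-value pairs
    V = rng.normal(size=(5, 4))
    print(attention(Q, K, V).shape)  # (2, 4)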
Week2. Text Summarization
Transformer: an encoder-decoder architecture built from attention layers
RNNs shortcomings:
- parallel computing is difficult to implement
- there is loss of information for long sequences in RNNs
- there is the problem of vanishing gradient in RNNs
- Transformers help with all of the above
State-of-the-art Transformers:
- GPT-2: Generative Pre-trained Transformer 2
- BERT: Bidirectional Encoder Representations from Transformers
- T5: Text-To-Text Transfer Transformer
Week3. Question Answering
Transfer learning:
- Pretraining: train on a general task, e.g., sentiment classification
- Then train on the downstream task, e.g., question answering
Week4. Chatbot
Transformer issues:
- Attention on a sequence of length L takes L^2 time and memory (see the sketch after this list)
- N layers take N times as much memory
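A quick illustration of the L^2 growth: the attention score matrix compares every position with every other position, so its size is L x L:

    import numpy as np

    L, d = 1024, 64
    Q = np.random.default_rng(0).normal(size=(L, d))
    scores = Q @ Q.T / np.sqrt(d)   # self-attention scores
    print(scores.shape)             # (1024, 1024) -> grows as L**2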
Memory with N layers:
- Activations need to be stored for backprop
- Big models are getting bigger
- Compute vs memory tradeoff
What does attention do? It selects the keys K nearest to the query Q and returns the corresponding values V.