Coursera Natural Language Processing Specialization notes
Jan 17, 2021 00:00 · 834 words · 4 minute read
Simple notes for the Coursera NLP Specialization.
Course1. Natural Language Processing with Classification and Vector Spaces
Week1. Logistic Regression
- Supervised ML and Sentiment Analysis:
- Input -> labels: learn the mapping from inputs to labels by optimizing a cost function
- Sentiment analysis: given a paragraph, we classify it as positive or negative.
- Vocabulary and feature extraction:
- Vocabulary: all unique words in the text of interest
- Feature extraction: convert a sentence into a vector by setting the entry for each vocabulary word to 1 if that word appears in the sentence, and to 0 otherwise
- Drawback: the vectors are as large as the vocabulary and mostly zeros, which means large vectors and long training times
- Negative and positive frequency:
- Map each (word, class) pair to its frequency in the training corpus
- Feature extraction: x_m = [1, sum of positive frequencies, sum of negative frequencies], where the leading 1 is the bias term (see the sketch after this list)
- Preprocessing: stop words, punctuation, stemming, and lowercasing
- Use a logistic regression model to predict sentiment
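A minimal sketch of this feature extraction in Python; the freqs dictionary and tokens below are made-up stand-ins for the counts built from a training set:

    # Frequency-based features x_m = [1, sum_pos, sum_neg].
    # `freqs` maps (word, class) pairs to counts; values here are invented.
    freqs = {("happi", 1): 3, ("sad", 0): 2, ("learn", 1): 1, ("learn", 0): 1}

    def extract_features(tokens, freqs):
        pos = sum(freqs.get((w, 1), 0) for w in tokens)
        neg = sum(freqs.get((w, 0), 0) for w in tokens)
        return [1, pos, neg]  # the leading 1 is the bias term

    print(extract_features(["happi", "learn"], freqs))  # [1, 4, 1]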
Week2. Naive Bayes
- Bayes' rule: P(X|Y)P(Y) = P(Y|X)P(X)
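A small sketch of how Naive Bayes can score sentiment from (word, class) counts, using Laplace smoothing and the log ratio of class-conditional probabilities; the counts are made up:

    import math

    # (word, class) counts as in Week 1; class 1 = positive, 0 = negative.
    freqs = {("happi", 1): 3, ("happi", 0): 1, ("sad", 1): 1, ("sad", 0): 3}
    vocab = {w for (w, _) in freqs}
    n_pos = sum(c for (w, cls), c in freqs.items() if cls == 1)
    n_neg = sum(c for (w, cls), c in freqs.items() if cls == 0)

    def log_likelihood(word):
        # Laplace smoothing: add 1 to each count, add |V| to the denominator.
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + len(vocab))
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + len(vocab))
        return math.log(p_w_pos / p_w_neg)

    # Positive total score => predict positive sentiment.
    print(sum(log_likelihood(w) for w in ["happi"]))  # ~0.69 > 0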
Week3. Vector space models
- Represent words and documents as vectors
- Representations that capture relative meaning
- Word-by-word design: count how many times two words co-occur within a distance k of each other
- Word-by-document design: count how many times a word occurs in documents of a given category
- Cosine similarity: measure similarity as the cosine of the angle between two vectors (see the sketch after this list)
- PCA:
- Eigenvectors give the direction of uncorrelated features
- Eigenvalues are the variance of the new features
- Dot product gives the projection on uncorrelated features
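A rough sketch of cosine similarity and of PCA via the eigen-decomposition of the covariance matrix; the data matrix X is made up:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    X = np.array([[1.0, 2.0, 1.5], [2.0, 4.1, 3.0],
                  [0.9, 1.8, 1.4], [2.1, 3.9, 3.1]])
    print(cosine_similarity(X[0], X[1]))

    # PCA: center the data, take eigenvectors of the covariance matrix,
    # and project onto the directions with the largest eigenvalues (variance).
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    X_2d = Xc @ eigvecs[:, order[:2]]      # dot product = projection
    print(X_2d.shape)                      # (4, 2)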
Week4. Machine translation and document search
- Translate X to Y by learning a transformation matrix R such that XR ≈ Y (R can be found by minimizing the Frobenius norm of XR - Y with gradient descent)
- Locality sensitive hashing (LSH): an algorithm for solving the approximate or exact Nearest Neighbor Search in high dimensional spaces. LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity.
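A minimal sketch of the random-hyperplane hashing behind LSH; the number of planes and the vector dimension are arbitrary choices here:

    import numpy as np

    # Vectors on the same side of every plane land in the same bucket,
    # so nearby vectors tend to share a hash value.
    rng = np.random.default_rng(0)
    planes = rng.normal(size=(5, 300))   # 5 random planes for 300-d vectors

    def hash_value(v, planes):
        signs = (planes @ v >= 0).astype(int)        # side of each plane: 0 or 1
        return int(sum(s * 2**i for i, s in enumerate(signs)))

    v = rng.normal(size=300)
    print(hash_value(v, planes))         # bucket id in [0, 2**5)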
Course2. Natural Language Processing with Probabilistic Models
Week1. Autocorrect and minimum edit distance
How it works:
- Identify a misspelled word
- Find strings n edit distance away
- Filter candidates
- Calculate word probabilities
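A sketch of minimum edit distance with dynamic programming; the costs (1 for insert/delete, 2 for replace) follow the convention used in the course assignment, as far as I remember:

    def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
        m, n = len(source), len(target)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            D[i][0] = D[i - 1][0] + del_cost
        for j in range(1, n + 1):
            D[0][j] = D[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                r = 0 if source[i - 1] == target[j - 1] else rep_cost
                D[i][j] = min(D[i - 1][j] + del_cost,
                              D[i][j - 1] + ins_cost,
                              D[i - 1][j - 1] + r)
        return D[m][n]

    print(min_edit_distance("play", "stay"))  # 4: two replacements at cost 2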
Week2. Part of speech tagging and Hidden Markov Models
Week3. Autocomplete and Language Models
- An N-gram is a sequence of N words.
- Probability of a bigram: P(y|x) = C(x, y)/C(x)
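A tiny worked example of the bigram estimate on a toy corpus:

    from collections import Counter

    # P(y|x) = C(x, y) / C(x)
    corpus = "i am happy because i am learning".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(x, y):
        return bigrams[(x, y)] / unigrams[x]

    print(bigram_prob("i", "am"))      # C(i, am)=2, C(i)=2 -> 1.0
    print(bigram_prob("am", "happy"))  # C(am, happy)=1, C(am)=2 -> 0.5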
Week4. Word embedding with neural networks
Basic word embedding methods:
- word2vec (Google, 2013)
- continuous bag of words (CBOW)
- continuous skip-gram / skip-gram with negative sampling
- Global Vectors (GloVe)(Stanford, 2014)
- fastText (Facebook, 2016)
- supports out-of-vocabulary (OOV) words
Advanced word embedding methods:
- BERT (Google, 2018)
- ELMo (Allen Institute for AI, 2018)
- GPT-2 (OpenAI, 2019)
Cleaning and tokenizing matters:
- letter case
- punctuation
- numbers
- special characters
- special words
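A possible cleaning and tokenizing pass covering the points above; the <NUM> placeholder and the regexes are just one choice:

    import re

    def tokenize(text):
        text = text.lower()                        # letter case
        text = re.sub(r"\d+", "<NUM>", text)       # numbers -> special word
        text = re.sub(r"[^\w<>\s]", " ", text)     # punctuation / special chars
        return text.split()

    print(tokenize("I bought 2 books, and they cost $30!"))
    # ['i', 'bought', '<NUM>', 'books', 'and', 'they', 'cost', '<NUM>']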
Course3. Natural Language Processing with Sequence Models
Week1. Neural network with sentiment analysis
Classes in Python
    class MyClass:
        def __init__(self, y):
            self.y = y

        def my_method(self, x):
            return x + self.y

        def __call__(self, x):
            # calling an instance forwards to my_method
            return self.my_method(x)

    f = MyClass(7)
    print(f(3))  # prints 10
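Calling f(3) invokes __call__, which forwards to my_method and prints 10; Trax layers, used later in the course, follow the same callable pattern.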
Week2. Recurrent Neural Network for language modeling
- RNNs model relationships among distant words in a sequence
- In RNNs, the computations at every time step share the same parameters (see the sketch below)
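A sketch of a vanilla RNN step in NumPy; the same Wh, Wx, and b are reused at every time step, which is what parameter sharing means here (sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, emb = 16, 8
    Wh = rng.normal(size=(hidden, hidden)) * 0.1
    Wx = rng.normal(size=(hidden, emb)) * 0.1
    b = np.zeros(hidden)

    def rnn_step(h, x):
        return np.tanh(Wh @ h + Wx @ x + b)

    h = np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):  # a sequence of 5 word embeddings
        h = rnn_step(h, x)               # same parameters at every step
    print(h.shape)                       # (16,)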
Week3. LSTM and NER
RNN disadvantages:
- struggles with longer sequences
- prone to vanishing or exploding gradients
LSTM: a memorable solution
- Learns when to remember and when to forget
- Basic anatomy (see the sketch after this list):
- a cell state
- a hidden state with three gates
- loop back again at the end of each time step
- forget gate decides what to keep
- input gate decides what to add
- output gate decides what the next hidden state will be
- Gates allow gradients to flow unchanged
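A rough NumPy sketch of one LSTM step with the three gates acting on the cell state c and hidden state h; weight shapes and values are made up:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def lstm_step(x, h, c, W, b):
        z = np.concatenate([h, x])
        f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to keep
        i = sigmoid(W["i"] @ z + b["i"])   # input gate: what to add
        o = sigmoid(W["o"] @ z + b["o"])   # output gate: next hidden state
        c_new = f * c + i * np.tanh(W["c"] @ z + b["c"])
        h_new = o * np.tanh(c_new)
        return h_new, c_new

    hidden, emb = 4, 3
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(hidden, hidden + emb)) for k in "fioc"}
    b = {k: np.zeros(hidden) for k in "fioc"}
    h, c = np.zeros(hidden), np.zeros(hidden)
    h, c = lstm_step(rng.normal(size=emb), h, c, W, b)
    print(h.shape, c.shape)                # (4,) (4,)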
Named Entity Recognition (NER):
- locates and extracts predefined entities from text
- places, organizations, names, time and dates
Applications of NER systems:
- search engine efficiency
- recommendation engines
- customer service
- automatic trading
Week4. Siamese Network
What do Siamese Networks learn?
- Identify the similarity between two inputs, e.g., whether two questions ask the same thing
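A sketch of the similarity idea: both inputs are encoded by the same network and compared with cosine similarity, and a triplet-style loss (as I understand the course setup) pushes similar pairs together and dissimilar pairs apart:

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def triplet_loss(anchor, positive, negative, margin=0.25):
        return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

    rng = np.random.default_rng(0)
    a, p, n = rng.normal(size=(3, 8))   # stand-ins for encoded questions
    print(triplet_loss(a, p, n))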
Course4. Natural Language Processing with Attention Models
Week1. Neural Machine Translation
Seq2Seq model:
- Introduced by Google in 2014
- Maps variable-length sequences to fixed-length memory
- LSTMs and GRUs are typically used to overcome the vanishing gradient problem
Seq2Seq shortcomings:
- As sequence size increases, model performance decreases.
- Solution: focus attention in the right places
- prevent sequence overload by giving the model a way to focus on the likeliest words at each step.
- do this by providing the information specific to each input word.
Inside the attention layer: a query Q is scored against the keys K, and the resulting weights combine the corresponding values V (see the sketch below)
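A minimal NumPy sketch of scaled dot-product attention; the shapes are arbitrary:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of queries and keys
        return softmax(scores) @ V               # weighted sum of values

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
    K = rng.normal(size=(5, 4))   # 5 key-value pairs
    V = rng.normal(size=(5, 4))
    print(attention(Q, K, V).shape)  # (2, 4)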
Week2. Text Summarization
Transformer: an encoder-decoder architecture built from attention layers
RNNs shortcomings:
- parallel computing is difficult to implement
- there is loss of information for long sequences in RNNs
- there is the problem of vanishing gradient in RNNs
- Transformers help with all of the above
State-of-the-art Transformers:
- GPT-2: Generative Pre-trained Transformer 2
- BERT: Bidirectional Encoder Representations from Transformers
- T5: Text-To-Text Transfer Transformer
Week3. Question Answering
Transfer learning:
- Pretraining: train on a general task, e.g., sentiment classification
- Then train on the downstream task, e.g., question answering
Week4. Chatbot
Transformer issues:
- Attention on a sequence of length L takes L^2 time and memory (see the sketch after this list)
- N layers take N times as much memory
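A quick illustration of the L^2 growth: the attention score matrix compares every position with every other position, so its size is L x L:

    import numpy as np

    L, d = 1024, 64
    Q = np.random.default_rng(0).normal(size=(L, d))
    scores = Q @ Q.T / np.sqrt(d)   # self-attention scores
    print(scores.shape)             # (1024, 1024) -> grows as L**2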
Memory with N layers:
- Activations need to be stored for backprop
- Big models are getting bigger
- Compute vs memory tradeoff
What does attention do? It selects the keys K nearest to the query Q and returns the corresponding values V.