Coursera Natural Language Processing Specialization notes

Jan 17, 2021 · NLP

Simple notes for the Coursera NLP specialization.

Course1. Natural Language Processing with Classification and Vector Spaces

Week1. Logistic Regression

  • Supervised ML and sentiment analysis:
    • inputs are mapped to labels by optimizing a cost function
  • Sentiment analysis: given a paragraph, we classify it as positive or negative.
  • Vocabulary and feature extraction:
    • Vocabulary: all unique words in the corpus of interest
    • Feature extraction: convert a sentence to a vector by setting the entry for each vocabulary word to 1 if that word appears in the sentence and 0 otherwise
    • Drawback: vectors as large as the vocabulary and long training time
  • Positive and negative frequencies:
    • map each (word, class) pair to its count in the training corpus
  • Feature extraction: x_m = [1, sum of positive frequencies, sum of negative frequencies], where 1 is the bias term (see the sketch after this list)
  • Preprocessing: stop words, punctuation, stemming, and lowercasing
  • Use a logistic regression model on these features to predict sentiment
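
A minimal sketch of the frequency-based feature extraction above (the freqs table and function name here are illustrative, not taken from the course code):

import numpy as np

# toy (word, class) -> count table; class 1 = positive, 0 = negative
freqs = {("happy", 1): 12, ("happy", 0): 1, ("sad", 1): 1, ("sad", 0): 9}

def extract_features(words, freqs):
    x = np.zeros(3)
    x[0] = 1.0  # bias term
    for w in words:
        x[1] += freqs.get((w, 1), 0)  # sum of positive frequencies
        x[2] += freqs.get((w, 0), 0)  # sum of negative frequencies
    return x

print(extract_features(["happy", "sad"], freqs))  # [ 1. 13. 10.]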

Week2. Naive Bayes

  • Bayes’ rule: P(X|Y)P(Y) = P(Y|X)P(X)
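
For sentiment classification, the course builds a Naive Bayes score from log ratios (summarized here from memory, with Laplacian smoothing; V is the vocabulary size and N_class the total word count in a class):

  • P(w | class) = (count(w, class) + 1) / (N_class + V)
  • log_prior = log(P(pos) / P(neg))
  • score(document) = log_prior + sum over words w of log(P(w | pos) / P(w | neg))
  • predict positive if score > 0, negative otherwise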

Week3. Vector space models

  • Represent words and documents as vectors
  • Representations that capture relative meaning
  • Word-by-word design: count how often two words co-occur within a distance k
  • Word-by-document design: count how often a word occurs in documents of a given category
  • Cosine similarity (see the sketch after this list)
  • PCA:
    • Eigenvectors give the direction of uncorrelated features
    • Eigenvalues are the variance of the new features
    • Dot product gives the projection on uncorrelated features
  • Word translation as a linear map: learn a matrix R such that XR ≈ Y, where the rows of X are source-language embeddings and the rows of Y are the corresponding target-language embeddings
  • Locality-sensitive hashing (LSH): an algorithm for approximate (or exact) nearest-neighbor search in high-dimensional spaces. LSH refers to a family of hash functions (LSH families) that place data points into buckets so that points near each other land in the same bucket with high probability, while points far apart are likely to land in different buckets. This makes it easier to identify observations with various degrees of similarity.
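
A small sketch of cosine similarity on word vectors (the vectors here are made up for illustration):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); close to 1 means similar direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([1.0, 3.0, 2.0])
queen = np.array([1.2, 2.8, 2.1])
print(cosine_similarity(king, queen))  # close to 1 for similar words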

Course2. Natural Language Processing with Probabilistic Models

Week1. Autocorrect and minimum edit distance

How it works:

  • Identify a misspelled word
  • Find strings n edit distance away
  • Filter candidates
  • Calculate word probabilities
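
A minimal dynamic-programming sketch of minimum edit distance; the course (as I recall) uses cost 1 for insert/delete and 2 for replace:

def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    m, n = len(source), len(target)
    # D[i][j] = cost of turning source[:i] into target[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i][j] = min(D[i - 1][j] + del_cost,
                          D[i][j - 1] + ins_cost,
                          D[i - 1][j - 1] + r)
    return D[m][n]

print(min_edit_distance("play", "stay"))  # 4: two replacements at cost 2 each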

Week2. Part of speech tagging and Hidden Markov Models

[Figures: POS tagging example; transition matrix]

Week3. Autocomplete and Language Models

  • An N-gram is a sequence of N words.
  • Probability of a bigram: P(y|x) = C(x, y)/C(x)

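A tiny sketch of the bigram estimate P(y|x) = C(x, y)/C(x) on a toy corpus (corpus and names are illustrative):

from collections import Counter

corpus = "i am happy because i am learning".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(x, y):
    # P(y | x) = C(x, y) / C(x)
    return bigrams[(x, y)] / unigrams[x]

print(bigram_prob("i", "am"))  # 1.0: every "i" in the corpus is followed by "am"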

Week4. Word embedding with neural networks

Basic word embedding methods:

  • word2vec (Google, 2013)
    • continuous bag of words (CBOW) (see the sketch after this list)
    • continuous skip-gram / skip-gram with negative sampling
  • Global Vectors (GloVe)(Stanford, 2014)
  • fastText (Facebook, 2016)
    • supports out-of-vocabulary (OOV) words
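
As a rough illustration of how CBOW training examples are formed (my own sketch, not course code): with a context half-size C, each center word is predicted from the C words on either side.

def cbow_pairs(words, C=2):
    # yield (context_words, center_word) pairs for CBOW training
    for i in range(C, len(words) - C):
        context = words[i - C:i] + words[i + 1:i + C + 1]
        yield context, words[i]

sentence = "i am happy because i am learning".split()
for context, center in cbow_pairs(sentence):
    print(context, "->", center)
# e.g. ['i', 'am', 'because', 'i'] -> happy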

Advanced word embedding methods:

  • BERT (Google, 2018)
  • ELMo (Allen Institute for AI, 2018)
  • GPT-2 (OpenAI, 2019)

Cleaning and tokenization matters to handle:

  • letter case
  • punctuation
  • numbers
  • special characters
  • special words

Course3. Natural Language Processing with Sequence Models

Week1. Neural networks for sentiment analysis

Classes in Python

class MyClass:
  def __init__(self, y):
    # store a value on the instance
    self.y = y

  def my_method(self, x):
    return x + self.y

  def __call__(self, x):
    # __call__ makes the instance usable like a function: f(3)
    return self.my_method(x)

f = MyClass(7)
print(f(3))  # prints 10

Week2. Recurrent Neural Network for language modeling

  • RNNs model relationships among distant words
  • In RNNs a lot of computations share parameters

[Figure: cross-entropy loss]
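
The standard cross-entropy loss for a language model over T steps and a vocabulary of size V (reproduced here from memory):

J = -(1/T) * sum over t = 1..T, sum over j = 1..V of y_(t,j) * log(yhat_(t,j))

where y_(t,j) is 1 for the true word at step t (0 otherwise) and yhat_(t,j) is the model’s predicted probability.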

Week3. LSTM and NER

RNN disadvantages:

  • struggles with longer sequences
  • prone to vanishing or exploding gradients

LSTM: a memorable solution

  • Learns when to remember and when to forget
  • Basic anatomy:
    • a cell state
    • a hidden state with three gates
    • loop back again at the end of each time step
    • forget gate decides what to keep
    • input gate decides what to add
    • output gate decides what the next hidden state will be
  • Gates allow gradients to flow unchanged
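
A hedged summary of the usual LSTM gate equations (sigma is the sigmoid, * is elementwise multiplication, [h_(t-1); x_t] is concatenation, and the W, b are learned):

  • forget gate: f_t = sigma(W_f [h_(t-1); x_t] + b_f)
  • input gate: i_t = sigma(W_i [h_(t-1); x_t] + b_i)
  • candidate cell state: c~_t = tanh(W_c [h_(t-1); x_t] + b_c)
  • cell state: c_t = f_t * c_(t-1) + i_t * c~_t
  • output gate: o_t = sigma(W_o [h_(t-1); x_t] + b_o)
  • hidden state: h_t = o_t * tanh(c_t)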

Named Entity Recognition (NER):

  • locates and extracts predefined entities from text
  • places, organizations, names, times and dates

Applications of NER systems:

  • search engine efficiency
  • recommendation engines
  • customer service
  • automatic trading

Week4. Siamese Network

What do Siamese Networks learn?

  • Identify similarity between things

[Figures: Siamese network architecture; triplet loss]
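
The triplet loss used with Siamese networks (summarized from memory, using cosine similarity s, an anchor A, a positive example P, a negative example N, and a margin alpha):

loss = max(0, s(A, N) - s(A, P) + alpha)

so the loss is zero once the positive pair is more similar than the negative pair by at least the margin.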

Course4. Natural Language Processing with Attention Models

Week1. Neural Machine Translation

Seq2Seq model:

  • Introduced by Google in 2014
  • Maps variable-length sequences to fixed-length memory
  • LSTMs and GRUs are typically used to overcome the vanishing gradient problem

Seq2Seq shortcomings:

  • As sequence size increases, model performance decreases.
  • Solution: focus attention in the right places
    • prevent sequence overload by letting the model focus on the most relevant input words at each step
    • do this by providing the decoder with information specific to each input word

Inside the attention layer: a query is compared against key-value pairs; the attention weights over the keys are used to form a weighted sum of the values.
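
A minimal NumPy sketch of (scaled) dot-product attention; the matrices here are random placeholders:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    # weights = softmax(Q K^T / sqrt(d_k)); output = weights V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

Q = np.random.randn(2, 4)   # 2 queries of dimension 4
K = np.random.randn(3, 4)   # 3 keys of dimension 4
V = np.random.randn(3, 8)   # 3 values of dimension 8
print(dot_product_attention(Q, K, V).shape)  # (2, 8)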

Week2. Text Summarization

Transformer: encoder-decoder with attention layers

RNNs shortcomings:

  • parallel computing is difficult to implement
  • there is loss of information for long sequences in RNNs
  • there is the problem of vanishing gradient in RNNs
  • Transformers help with all of the above

State-of-the-art Transformers:

  • GPT-2: Generative Pre-trained Transformer 2
  • BERT: Bidirectional Encoder Representations from Transformers
  • T5: Text-To-Text Transfer Transformer

Week3. Question Answering

Transfer learning:

  • Pretraining: e.g., sentiment classification
  • Training on the downstream task: e.g., question answering

[Figure: transfer learning (pretraining, then fine-tuning on a downstream task)]

Week4. Chatbot

Transformer issues:

  • Attention on a sequence of length L takes O(L^2) time and memory
  • N layers take N times as much memory

Memory with N layers:

  • Activations need to be stored for backprop
  • Big models are getting bigger
  • Compute vs memory tradeoff

What does attention do? It selects the nearest neighbors of the query Q among the keys K and returns the corresponding values V.

[Figures: attention as nearest-neighbor lookup; LSH attention]
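
Tying back to the LSH definition from Course 1: a common LSH family for cosine similarity hashes a vector by the signs of its dot products with a set of random planes, and LSH attention uses this kind of bucketing so each query only attends to keys in its own bucket. A rough sketch of the hashing step only (not the full attention mechanism):

import numpy as np

def hash_vector(v, planes):
    # one bit per random plane: which side of the plane the vector falls on
    bits = (planes @ v >= 0).astype(int)
    return int("".join(map(str, bits)), 2)  # bucket id

rng = np.random.default_rng(0)
planes = rng.standard_normal((4, 8))   # 4 planes -> up to 16 buckets
q = rng.standard_normal(8)
k = q + 0.01 * rng.standard_normal(8)  # a key very close to the query
print(hash_vector(q, planes), hash_vector(k, planes))  # very likely the same bucket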
