The lectures constitute a coherent sequence, where later sections often assume concepts and material from earlier sections. They are organized into three main sections: warming up, learning with raw data, and learning with annotated data.

I am maintaining a document of known issues, including feedback from other researchers and instructors. It is recommended to consult this document if using this material. I have not reviewed the issue document in depth, so I cannot vouch for everything in it. However, I plan to review it and address the issues in the next iteration of the class (usually: next spring).

Want to grab all the lecture content at once? Just clone the GitHub repository.

Warming Up

This section quickly brings students up to speed on the basics, with the goal of preparing them for the first assignment. Beyond a quick introduction, it covers data basics, the linear perceptron, and the multi-layer perceptron.

Introduction .key .pdf
A very brief introduction to the class, including laying out the main challenges of working with natural language and the history of the field.
Text Classification, Data Basics, and Perceptrons .key .pdf
The basics of text classification and working with data splits. We introduce the linear perceptron, starting from the binary case and generalizing to multi-class.
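To make the perceptron update concrete, here is a minimal sketch of the binary case (illustrative only; the slides may use different notation and conventions, e.g., for handling the bias):

```python
import numpy as np

def train_binary_perceptron(X, y, epochs=10):
    """Binary perceptron. X: (n, d) feature matrix; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:  # mistake (or on the boundary): update
                w += y_i * x_i
                b += y_i
    return w, b
```

The multi-class generalization keeps one weight vector per class and, on a mistake, adds the features to the gold class's weight vector and subtracts them from the predicted class's.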
Neural Network Basics .key .pdf
A very quick introduction to the basics of neural networks and defining the multi-layer perceptron.
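As a concrete reference point, a one-hidden-layer MLP forward pass looks roughly like this (a sketch; the choice of ReLU and the shapes are mine, not necessarily the lecture's):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: affine map, elementwise nonlinearity, affine map.
    x: (d,), W1: (h, d), b1: (h,), W2: (k, h), b2: (k,)."""
    hidden = np.maximum(0.0, W1 @ x + b1)  # ReLU nonlinearity
    return W2 @ hidden + b2                # unnormalized class scores (logits)
```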

Learning with Raw Data

This section focuses on representation learning from raw data (i.e., without any annotation or user labor). It is divided into three major parts: word embeddings, next-word-prediction language modeling, and masked language modeling. Through these subsections we introduce many of the fundamental technical concepts and methods of the field.

Word Embeddings .key .pdf
Introduction to lexical semantics. We start with discrete word senses and WordNet, transition to distributional semantics, and then introduce word2vec. We use dependency contexts to briefly introduce syntactic structures.
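For reference, the skip-gram-with-negative-sampling objective at the core of word2vec, for a word w with observed context c and k negative contexts drawn from a noise distribution P_n (one standard notation; the slides may differ):

```latex
\log \sigma\left(\mathbf{v}_c^{\top} \mathbf{v}_w\right)
  + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(c)}\left[\log \sigma\left(-\mathbf{v}_{c_i}^{\top} \mathbf{v}_w\right)\right]
```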
N-gram Language Models .key .pdf
We introduce language models through the noisy channel model. We gradually build up n-gram language models, discuss evaluation and basic smoothing techniques, and briefly touch on the unknown-word problem.
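As one concrete example of the smoothing covered, the add-one (Laplace) smoothed bigram estimate, with V the vocabulary (one standard variant; the lecture discusses others as well):

```latex
P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i) + 1}{\mathrm{count}(w_{i-1}) + |V|}
```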
Tokenization .key .pdf
We discuss the handling of unknown words and build from there to sub-word tokenization. We go over the BPE algorithm in detail.
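For reference, here is a minimal sketch of the BPE merge-learning loop over a word-frequency dictionary (illustrative only; real implementations apply merges far more efficiently and handle word-boundary markers):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """corpus: dict mapping a word (tuple of symbols) to its frequency.
    Returns the learned merge rules in order."""
    merges = []
    vocab = dict(corpus)
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new symbol
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():  # apply the merge everywhere
            new_word, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = merged.get(tuple(new_word), 0) + freq
        vocab = merged
    return merges

# Hypothetical usage with end-of-word markers:
# bpe_merges({("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2}, 10)
```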
Neural Language Models and Transformers .key .pdf
This lecture gradually builds up neural language models (LMs), starting from n-gram models and concluding with the Transformer decoder architecture, which we define in detail. We present attention as a weighted sum of items, in this case previous tokens.
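To make the "weighted sum of previous tokens" view concrete, here is a minimal single-head causal self-attention sketch in NumPy (illustrative only; projection names and shapes are my own choices, and the full architecture adds multiple heads, residual connections, and more):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """X: (T, d) token representations; Wq, Wk, Wv: (d, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # pairwise similarities, (T, T)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                # block attention to future tokens
    weights = softmax(scores, axis=-1)                      # each row sums to 1
    return weights @ V                                      # weighted sum of value vectors
```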
Decoding LMs .key .pdf
This lecture briefly surveys decoding techniques for LMs, focusing mostly on sampling methods but also covering beam search.
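As a small illustration, here is a sketch of sampling from an LM's next-token distribution with temperature and top-k truncation (parameter names and defaults are mine):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from unnormalized next-token logits."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-min(top_k, len(logits))]   # k-th largest score
        logits = np.where(logits < cutoff, -np.inf, logits)  # drop everything below it
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```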
Scaling up to LLMs .key .pdf
We discuss what it takes to scale LMs to contemporary LLMs, including data challenges, scaling laws, and some of the societal challenges and impacts LLMs bring about. This lecture focuses on pre-training only.
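As a point of reference for the scaling-laws discussion, a commonly cited simplified form (following Kaplan et al., 2020; not necessarily the slides' presentation) expresses test loss as a power law in model size N, with analogous laws for data and compute:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```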
Masked Language Models and BERT .key .pdf
This section introduces BERT and its training. We use this opportunity to introduce the encoder variant of the Transformer.
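To make the masked-LM training signal concrete, here is a minimal sketch of BERT-style input masking: roughly 15% of positions become prediction targets, of which 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged (a sketch; real implementations operate on subword ids and exclude special tokens):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """tokens: list of token strings; vocab: list of token strings to sample from."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random token
            # else: 10%: keep the original token unchanged
    return inputs, targets
```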
Pretraining Encoder-decoders .key .pdf
This section introduces the BART and T5 models. In the most recent version of the class, it came late in the semester, so it does not define the encoder-decoder Transformer architecture in detail; that content should be migrated here from the later tasks slide deck.
Working with Raw Data Recap .key .pdf
A very short recap of the first half of the class.

Learning with Annotated Data

This section focuses on learning with annotated data. It introduces the task as a framework to structure solution development, through a review of several prototypical NLP tasks. For each task, we discuss the problem, data, and modeling decisions, and formulate a technical approach to address it. This section takes a broad view of annotated data, including language model alignment using annotated data (i.e., instruction tuning and RLHF).

Prototypical NLP Tasks .key .pdf
Defining the task as a conceptual way to think about problems in NLP. We discuss several prototypical tasks: named-entity recognition as a tagging problem, extractive question answering as span extraction, machine translation as a language generation problem, and code generation as a structured output generation problem. We use these tasks to introduce general modeling techniques and to discuss different evaluation methods and challenges. We conclude with multi-task benchmark suites. This section currently also defines the encoder-decoder Transformer architecture as part of the machine translation discussion; this content should be migrated earlier, to the encoder-decoder pre-training lecture.
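As a small illustration of NER as tagging, a BIO-style encoding assigns one tag per token (a standard convention; the exact tag set and example are mine, not from the lecture):

```python
# B- begins an entity, I- continues it, O marks tokens outside any entity.
tokens = ["Cornell", "University", "is", "in", "Ithaca"]
tags   = ["B-ORG",   "I-ORG",      "O",  "O",  "B-LOC"]
```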
Aligning LLMs .key .pdf
The process of training LLMs past the initial pre-training stage. We discuss instruction tuning and RLHF. We provide a basic introduction to reinforcement learning, including PPO. We also describe DPO.
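For reference, the DPO objective in its standard form, where y_w and y_l are the preferred and dispreferred responses, \pi_ref is the reference (pre-DPO) model, and \beta is a scaling hyperparameter (following the original paper's notation, not necessarily the slides'):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```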
Working with LLMs: Prompting .key .pdf
This lecture covers the most common prompting techniques, including zero-shot, in-context learning, and chain-of-thought.
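As a small illustration of in-context learning, here is a hypothetical few-shot prompt for sentiment classification (the task, examples, and formatting are illustrative and not taken from the lecture):

```python
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
query = "The plot dragged, but the acting saved it."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"  # the LM is expected to continue with the label
```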