AMS691.01 · Spring 2026

Foundations and Frontiers of Large Language Models

Tuesdays: 3:30 PM - 6:20 PM
Psychology A 137, West Campus

Course Overview

Instructor: Jiawei (Joe) Zhou, Assistant Professor, Stony Brook University

Welcome to this graduate-level seminar on Large Language Models. The course explores both foundational principles and cutting-edge developments in LLM research through presentations and collaborative discussions.

Format: The instructor will introduce key topics and concepts through lectures. Students will present research papers from our curated collection and engage in critical discussions. Additionally, students will use LLM-based AI tools to develop a mini software application, such as an agentic productivity workflow, web app, mobile app, Chrome extension, chatbot interface, or another innovative software project. Each student will also complete a research-oriented course project throughout the semester.

Total: 145 papers across all topics

Papers marked with a star are particularly recommended for selection.

Papers are listed in chronological order by publication year.

Module I: LLM Fundamentals
Pre-training (9 papers)
(2017)
(2017)
(2018)
RoBERTa: A Robustly Optimized BERT Pretraining Approach
(2019)
(2018)
(2019)
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
(2019)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
(2019)
(2020)
Fine-tuning (7 papers)
(2019)
(2021)
(2021)
(2021)
P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
(2022)
Towards a Unified View of Parameter-Efficient Transfer Learning
(2022)
Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
(2023) adapterhub.ml
In-context Learning and Prompting 1 (6 papers)
(2020)
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
(2021)
How Many Data Points is a Prompt Worth?
(2021)
(2022)
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
(2021)
What Makes Good In-Context Examples for GPT-3?
(2021)
In-context Learning and Prompting 2: Reasoning (5 papers)
(2023)
(2022)
(2022)
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
(2023)
ReAct: Synergizing Reasoning and Acting in Language Models
(2022)
Scaling Law (7 papers)
Scaling Laws for Neural Language Models
(2020)
Scaling Laws for Autoregressive Generative Modeling
(2020)
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
(2021)
Training Compute-Optimal Large Language Models (Chinchilla)
(2022)
Scaling Scaling Laws with Board Games
(2021) Not directly LLM
Scaling Data-Constrained Language Models
(2023)
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
(2024)
Instruction Tuning (7 papers)
(2021)
Scaling Instruction-Finetuned Language Models
(2022)
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
(2022)
Multitask Prompted Training Enables Zero-Shot Task Generalization
(2022)
(2022)
InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
(2022)
Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections
(2021) IBM Blog
RLHF: Learning from Human Feedback (4 papers)
(2022)
Learning to summarize from human feedback
(2020)
Fine-Tuning Language Models from Human Preferences
(2019)
MemPrompt: Memory-assisted Prompt Editing with User Feedback
(2022)
Alignment (6 papers)
(2023)
Alignment of Language Agents
(2021)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
(2022)
Constitutional AI: Harmlessness from AI Feedback
(2022)
LaMDA: Language Models for Dialog Applications
(2022)
(2023)
Module II: LLM Evaluation, Augmentation, and Improvements
Data and Evaluation (11 papers)
(2020)
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
(2020)
Red Teaming Language Models with Language Models
(2022)
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
(2021)
(2021)
(2021)
Deduplicating Training Data Makes Language Models Better
(2022)
(2023)
(2023)
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
(2023)
(2024)
Efficiency and Optimization (7 papers)
(2022)
(2022)
REST: Retrieval-Based Speculative Decoding
(2023)
(2023) Quantization
(2023)
(2020)
(2023)
Knowledge Understanding and Manipulation (8 papers)
Language Models as Knowledge Bases?
(2019)
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
(2020)
Knowledge Neurons in Pretrained Transformers
(2022)
Fast Model Editing at Scale
(2022)
Locating and Editing Factual Associations in GPT
(2022)
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
(2023)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
(2024)
Chunk-Distilled Language Modeling
(2024)
Retrieval Augmented Generation (RAG) 1 (5 papers)
Dense Passage Retrieval for Open-Domain Question Answering
(2020) Dense Retrieval
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
(2020) Initial RAG
REALM: Retrieval-Augmented Language Model Pre-Training
(2020)
Improving language models by retrieving from trillions of tokens (RETRO)
(2021)
Generalization through Memorization: Nearest Neighbor Language Models (kNN-LM)
(2020)
Retrieval Augmented Generation (RAG) 2 (7 papers)
Atlas: Few-shot Learning with Retrieval Augmented Language Models
(2022)
In-Context Retrieval-Augmented Language Models
(2023)
REPLUG: Retrieval-Augmented Black-Box Language Models
(2023)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
(2023)
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
(2020)
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
(2024)
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
(2024)
Module III: Advanced LLM Capabilities & Applications
Multimodal Models 1 (8 papers)
Learning Transferable Visual Models From Natural Language Supervision
(2021)
Multimodal Few-Shot Learning with Frozen Language Models
(2021)
CoCa: Contrastive Captioners are Image-Text Foundation Models
(2022)
Flamingo: A Visual Language Model for Few-Shot Learning
(2022)
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
(2023)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
(2023)
Visual Instruction Tuning (LLaVA)
(2023)
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
(2025)
Multimodal Models 2 (10 papers)
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
(2021)
ClipCap: CLIP Prefix for Image Captioning
(2021)
CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation
(2023)
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
(2023)
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
(2023)
Visual Programming: Compositional visual reasoning without training
(2023)
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
(2024)
Denoising Diffusion Probabilistic Models (DDPM)
(2020)
Zero-Shot Text-to-Image Generation
(2021) DALL-E
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
(2024)
Reasoning 1 (5 papers)
Show Your Work: Scratchpads for Intermediate Computation with Language Models
(2021)
(2022)
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
(2023)
Let's Verify Step by Step
(2023)
Reinforced Self-Training (ReST) for Language Modeling
(2023)
Reasoning 2 (5 papers)
Stream of Search (SoS): Learning to Search in Language
(2024)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
(2024)
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
(2024)
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
(2024)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
(2025)
Agent: Tool Usage, Frameworks, Benchmarks (11 papers)
TALM: Tool Augmented Language Models
(2022)
Toolformer: Language Models Can Teach Themselves to Use Tools
(2023)
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
(2023)
(2023)
Improving Factuality and Reasoning in Language Models through Multiagent Debate
(2023)
WebArena: A Realistic Web Environment for Building Autonomous Agents
(2023)
Mind2Web: Towards a Generalist Agent for the Web
(2023)
(2023)
On the Tool Manipulation Capability of Open-source Large Language Models
(2023)
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
(2024)
GPT-4V(ision) is a Generalist Web Agent, if Grounded
(2024)
Robotics, Embodied AI (8 papers)
(2022)
(2022)
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
(2022)
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
(2023)
SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning
(2023)
Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks
(2024)
HourVideo: 1-Hour Video-Language Understanding
(2024)
OpenEQA: Embodied Question Answering in the Era of Foundation Models
(2024)
Alternative Architectures (9 papers)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
(2022)
(2020)
(2022)
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
(2022)
A Review of Sparse Expert Models in Deep Learning
(2022)
Retentive Network: A Successor to Transformer for Large Language Models
(2023)
RWKV: Reinventing RNNs for the Transformer Era
(2023)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
(2024)
MiniMax-01: Scaling Foundation Models with Lightning Attention
(2025)