The Hundred-Page Machine Learning Book
Self-Supervised Learning: Word Embeddings
Recommended Reading
word2vec Parameter Learning Explained
by Xin Rong (2016) (a minimal skip-gram sketch follows this reading list)
Language Models, Word2Vec, and Efficient Softmax Approximations
by Rohan Varma (2017)
Attention Is All You Need
by Vaswani et al. (2017), a state-of-the-art sequence-to-sequence model, plus an illustrated guide and an annotated paper with code.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
by Devlin et al. (2018)
Improving Language Understanding by Generative Pre-Training
by Radford et al. (2018) (the GPT paper)
RoBERTa: A Robustly Optimized BERT Pretraining Approach
by Liu et al. (2019)
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
by Lewis et al. (2019)
Language Models are Few-Shot Learners
by Brown et al. (2020) (the GPT-3 paper)
Large Language Models are Zero-Shot Reasoners
by Kojima et al. (2022) (the zero-shot chain-of-thought (CoT) prompting paper).
Training Compute-Optimal Large Language Models
by Hoffmann et al. (2022) (the Chinchilla paper).
Understanding Large Language Models
by Sebastian Raschka.
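For readers who want to see the mechanics behind the first reading, below is a minimal sketch of skip-gram with negative sampling, the word2vec variant analyzed in Rong (2016). The toy corpus, embedding dimension, and hyperparameters are illustrative assumptions chosen for brevity, not values taken from any of the readings above.

```python
# Minimal sketch of skip-gram with negative sampling (plain numpy).
# Toy corpus and all hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                        # vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))    # center-word ("input") embeddings
W_out = rng.normal(scale=0.1, size=(V, D))   # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

window, k, lr = 2, 5, 0.05                   # context window, negatives per pair, learning rate

for epoch in range(200):
    for pos, center in enumerate(corpus):
        c = word2id[center]
        for ctx_pos in range(max(0, pos - window), min(len(corpus), pos + window + 1)):
            if ctx_pos == pos:
                continue
            # One observed (center, context) pair plus k uniformly drawn negatives.
            # (Negatives may occasionally collide with the true context word;
            # ignored here for brevity.)
            targets = [word2id[corpus[ctx_pos]]] + list(rng.integers(0, V, size=k))
            labels = np.array([1.0] + [0.0] * k)
            vecs = W_out[targets]                            # (k+1, D)
            v_c = W_in[c].copy()
            grad = (sigmoid(vecs @ v_c) - labels)[:, None]   # logistic-loss gradient
            np.subtract.at(W_out, targets, lr * grad * v_c)  # handles repeated indices
            W_in[c] -= lr * (grad * vecs).sum(axis=0)

# Nearest neighbours by cosine similarity of the learned center-word vectors.
def most_similar(word, topn=3):
    v = W_in[word2id[word]]
    sims = W_in @ v / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(v) + 1e-9)
    return [vocab[i] for i in np.argsort(-sims)[1:topn + 1]]

print(most_similar("quick"))
```

On such a tiny corpus the nearest neighbours are not meaningful, but the same update rule, run over a large corpus, is what produces the embeddings discussed in the readings above.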