Introduction
In the earlier days of AI, classical machine learning techniques such as Support Vector Machines and Decision Trees were used for almost every use case. Their advantage was that one could change a parameter and quickly build an intuition for how that change affected the model's behavior and performance. With the arrival of neural networks, training time and the number of parameters increased, so building that intuition gradually became slower. Initially, neural networks performed poorly. So what changed to make them outperform the earlier techniques? Advances in computing (GPUs) and the availability of large amounts of data are the major reasons. With the arrival of the Transformer architecture, performance improved further. Another big change came when ChatGPT launched in November 2022, and AI exceeded the expectations of many.
Development in the last decade
The following figure gives a snapshot of the development in the last decade. For example, given an image, a model can identify the object in it; given an audio signal, it can transcribe it into text; given text in one language, it can translate it into another; and given an image, it can generate a description of it.
With the advent of Generative AI, the outputs in the figure above can become inputs, and data can be generated as output. For example, given text, a model can generate images, audio, video, etc.
Pace of development
In the following figure, you can see the progress of image classification on the ImageNet dataset. Accuracy increased from around 50% in 2011 to 92% in 2024.
You can see the progress of Speech Recognition on LibriSpeech test-other in the following figure.
A Brief History of LLMs
Large Language Models (LLMs) belong to a broader category called Artificial Intelligence Generated Content (AIGC), which comprises AI models that create content such as images, music, and written language. The following figure shows an example of AIGC in image generation.
Types of Generative AI models
1. Unimodal Models: Unimodal models receive instructions in the same modality as the content they generate.
2. Multimodal Models: Multimodal models accept cross-modal instructions and produce results in a different modality, for example generating an image from a text prompt.
Foundation Models
The Transformer is the backbone architecture for many state-of-the-art models, such as GPT, DALL-E, and Codex.
Traditional models like RNNs had limitations in dealing with variable-length sequences and context. The self-attention mechanism is at the core of the Transformer; it allows the model to focus on different parts of the input sequence.
The following figure shows the encoder-decoder structure of the Transformer architecture, taken from the paper “Attention Is All You Need”.
The encoder processes the input sequence to create hidden representations, whereas the decoder generates the output sequence.
Each encoder and decoder layer includes multi-head attention and feed-forward neural networks. Multi-head attention assigns weights to tokens based on relevance.
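The attention weighting described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration of scaled dot-product attention, not a full multi-head layer. It assumes that queries, keys, and values are all the raw token vectors (a real Transformer first applies learned projection matrices), and the three 2-dimensional "embeddings" are made-up toy values.

```python
import math

def softmax(row):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance score of every token (key) for the current token (query).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # non-negative, sums to 1
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy embeddings for a 3-token sequence (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Simplifying assumption: Q = K = V = x, with no learned projections.
outputs, weights = self_attention(x, x, x)

for i, row in enumerate(weights):
    print(f"token {i} attends with weights", [round(w, 3) for w in row])
```

Each row of `weights` shows how much one token draws from every other token. Multi-head attention runs several such computations in parallel on different learned projections and concatenates the results.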
The Transformer’s parallelizability and weak inductive biases make it well suited to large-scale pre-training and to adaptation to different downstream tasks.
There are two main types of pre-trained language models based on the training tasks:
1. Masked language modeling (e.g. BERT): Predict masked tokens within a sentence.
2. Autoregressive language modeling (e.g. GPT-3): Predict the next token given previous ones.
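To make the difference between the two objectives concrete, the sketch below builds training examples for each from the same toy sentence. The helper names (`masked_lm_example`, `autoregressive_examples`) and the `[MASK]` placeholder string are illustrative choices, and the masked positions are picked by hand here, whereas real pre-training selects them at random and works on subword tokens rather than whole words.

```python
MASK = "[MASK]"  # placeholder token, used here purely for illustration

def masked_lm_example(tokens, masked_positions):
    """Build one masked-language-modeling example.

    Returns the corrupted input and a {position: original token} dict
    containing the tokens the model has to recover.
    """
    positions = set(masked_positions)
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return inputs, targets

def autoregressive_examples(tokens):
    """Build (context, next_token) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()

# Masked LM: the model sees context on both sides of each masked token.
mlm_inputs, mlm_targets = masked_lm_example(tokens, masked_positions=[1, 4])
print(mlm_inputs)
print(mlm_targets)

# Autoregressive LM: the model sees only the tokens to the left.
ar_pairs = autoregressive_examples(tokens)
for context, target in ar_pairs:
    print(context, "->", target)
```

The masked objective uses context on both sides of a gap, which suits understanding tasks, while the autoregressive objective only looks leftward, which matches how text is generated one token at a time.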
The following are the three main categories of pre-trained models:
1. Encoder models (BERT)
2. Decoder models (GPT)
3. Encoder-decoder models (T5, BART)
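One common way to picture the difference between these categories is through the self-attention mask each one uses. The sketch below is an illustration under that framing, not a description of any specific model's implementation: encoder models typically let every token attend to every other token, decoder models use a causal mask so a token only sees itself and earlier positions, and encoder-decoder models combine a full mask in the encoder with a causal mask in the decoder. The function names are hypothetical.

```python
def full_mask(n):
    """Bidirectional mask: every token may attend to every token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Causal mask: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def show(name, mask):
    print(name)
    for row in mask:
        print(" ", " ".join(str(v) for v in row))

n = 4  # toy sequence length
show("encoder-style (bidirectional):", full_mask(n))
show("decoder-style (causal):", causal_mask(n))
```

A 1 in row i, column j means token i is allowed to attend to token j; the lower-triangular pattern of the causal mask is what keeps a decoder from looking at tokens it has not generated yet.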
References
- https://www.youtube.com/live/Sy1psHS3w3I
- https://llmagents-learning.org/slides/Burak_slides.pdf
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
- https://mallahyari.github.io/rag-ebook/intro.html
- https://arxiv.org/abs/1706.03762