What is trending in Generative AI (GenAI)?

Introduction

Let us look back to the early days of artificial intelligence, when machine learning methods like support vector machines and decision trees were widely accepted and applied to practically every use case. These models were largely explainable: it was possible to alter a model’s parameters and understand how those changes would impact its behavior and performance. The introduction of neural networks brought a significant increase in both the number of parameters and training time, so building that kind of intuition became much slower. Neural networks also did not perform well at first. What, then, caused neural networks to begin outperforming all other options?

The main drivers were the availability of large amounts of data and advances in computing power (GPUs). Another key innovation of recent years has been the Transformer architecture. When ChatGPT was introduced in November 2022, AI exceeded many people’s expectations.

Development in the last decade

The following figure gives an overview of the developments of the past ten years. Given an image, for instance, a model can recognize the object in it; given an audio input, it can convert it to text; it can translate text between languages; and given an image, it can produce a description of that image.

Fig 1. A decade of amazing progress in what computers can do [1]

With the advent of Generative AI, the outputs in the figure above can become inputs, and new data can be generated as output. For example, given text, a model can generate images, audio, video, and more.

Fig 2. A decade of amazing progress in what computers can do [1]

Pace of development

In the following figure, you can see the progress of image classification on the ImageNet dataset. Accuracy increased from around 50% in 2011 to about 92% in 2024.

Fig 3. Image Classification on ImageNet [7]

You can see the progress of Speech Recognition on LibriSpeech test-other in the following figure.

Fig 4. Speech Recognition on LibriSpeech test-other [6]

A Brief History of LLMs

Large Language Models (LLMs) belong to the broader category of Artificial Intelligence Generated Content (AIGC), which consists of AI models that produce written text, music, graphics, and other types of content. The following figure illustrates AIGC in image generation.

Fig 5. Example of AIGC in image generation [3]

Generative AI models fall into two types: unimodal models and multimodal models. Unimodal models receive instructions in the same modality as the content they generate, whereas multimodal models accept cross-modal instructions and produce outputs across different modalities. The following figure illustrates this in detail.

Fig 6. Illustration of unimodal models and multimodal models [3]

The generation process typically has two main parts: extracting intent information from human instructions and producing content that matches the extracted intent. Modern AIGC has advanced so much compared to earlier efforts mainly because larger, more complex generative models are now trained on bigger datasets, using larger foundation-model architectures and abundant computational resources. To find the best answer for a given instruction, ChatGPT uses reinforcement learning from human feedback (RLHF), which gradually increases the model’s accuracy and reliability. Generative diffusion models, such as Stable Diffusion from Stability AI, generate high-resolution images by regulating the trade-off between exploration and exploitation, resulting in a harmonious blend of variation in the generated images and resemblance to the training data.

Generative models, such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), have been developed since the 1950s, but they did not see major performance improvements until the advent of deep learning in the 2010s. Traditional methods for generating sentences with N-gram language modeling are limited to short contexts, while recurrent neural networks (RNNs) and their variants can model longer dependencies and attend to around 200 tokens.
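To make the N-gram limitation concrete, here is a toy bigram language model in Python; the corpus and sampling loop are purely illustrative (not from the article). Each generated word depends only on the single previous word, so any longer-range context is invisible to the model.

```python
# A toy bigram (2-gram) language model, illustrating why N-gram models
# only capture short-range context: the next word depends on the previous
# word alone, so anything further back cannot influence generation.
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def generate(start="the", length=8):
    """Sample a short sequence: each step looks at only ONE previous token."""
    out = [start]
    for _ in range(length - 1):
        followers = bigram_counts[out[-1]]
        if not followers:
            break
        words, counts = zip(*followers.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate())
```

RNNs relax this constraint by carrying a hidden state across many steps, which is how they manage to attend to on the order of hundreds of tokens.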

In computer vision, traditional algorithms relied on texture synthesis and texture mapping techniques based on hand-designed features. In 2014, Generative Adversarial Networks (GANs) were first proposed, bringing significant improvements in generating diverse images. Variational Autoencoders (VAEs) and diffusion models have also been developed, offering finer control over the generation process and higher-quality images.

The Transformer architecture, introduced by Vaswani et al. in 2017, has become the foundation for generative models across various domains, including NLP (BERT, GPT) and computer vision (CV). In vision, combining visual components with Transformer-based modeling enables multimodal tasks that integrate text and image information. The Vision Transformer (ViT) and the Swin Transformer advance this concept further in CV, allowing large-scale training on visual and multimodal data.

Researchers are now exploring new techniques built on the Transformer architecture, such as few-shot prompting in NLP, which uses only a handful of examples to help the model better understand the task requirements. In computer vision, models combine modality-specific approaches with self-supervised learning objectives to learn more robust representations. These advancements have the potential to further advance AIGC and lead to more efficient technologies.
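As a concrete illustration of few-shot prompting, the sketch below assembles a prompt in Python. The sentiment-labelling task and the example reviews are invented for illustration; the resulting string would be sent to whichever LLM API you use.

```python
# Minimal few-shot prompting sketch: the "training" happens entirely inside
# the prompt, via a handful of worked examples followed by a new query.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
    ("The plot was thin but the acting saved it.", "positive"),
]

def build_few_shot_prompt(query: str) -> str:
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model is expected to complete this line
    return "\n".join(lines)

print(build_few_shot_prompt("The soundtrack alone is worth the ticket."))
```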

Foundation Models

The Transformer is the backbone architecture for many state-of-the-art models, such as GPT, DALL-E, and Codex.

Traditional models like RNNs had limitations in dealing with variable-length sequences and long-range context. The self-attention mechanism is at the core of the Transformer: it allows the model to focus on different parts of the input sequence when processing each token.
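The following is a minimal NumPy sketch of single-head self-attention, following the scaled dot-product formulation in “Attention Is All You Need” [5]. The dimensions and random weight matrices are arbitrary, chosen purely for illustration.

```python
# Single-head self-attention: every token builds a query, key, and value,
# and attends to every other token via softmax-normalised dot-product scores.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model). Returns (seq_len, d_v) context vectors."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
out = self_attention(X,
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))
print(out.shape)  # (5, 8)
```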

The following figure shows the encoder-decoder structure of the Transformer architecture, taken from the paper “Attention Is All You Need” [5].

The encoder processes the input sequence to create hidden representations, whereas the decoder generates the output sequence.

Each encoder and decoder layer includes multi-head attention and feed-forward neural networks. Multi-head attention assigns weights to tokens based on relevance.
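Below is a compact sketch of multi-head attention under the same illustrative assumptions as the earlier example: the model dimension is split across several heads, each head runs scaled dot-product attention over its own slice, and the results are concatenated and projected back. Masking, residual connections, and layer normalization are omitted for brevity.

```python
# Multi-head attention: several attention heads run in parallel on slices of
# the model dimension, each free to weight tokens by a different notion of relevance.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # (seq_len, d_model) each

    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice of the dimensions
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])   # (seq_len, d_head)

    return np.concatenate(heads, axis=-1) @ W_o   # project back to (seq_len, d_model)

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 6, 32, 4
X = rng.normal(size=(seq_len, d_model))
W = lambda: rng.normal(size=(d_model, d_model))
print(multi_head_attention(X, W(), W(), W(), W(), num_heads).shape)  # (6, 32)
```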

The Transformer is inherently parallelizable and imposes few inductive biases, making it well suited to large-scale pre-training and adaptation to a wide range of downstream tasks.

There are two main types of pre-trained language models, based on their training tasks (both are illustrated in the short sketch after this list):
1. Masked language modeling (e.g. BERT): Predict masked tokens within a sentence.
2. Autoregressive language modeling (e.g. GPT-3): Predict the next token given previous ones.
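The following sketch contrasts the two objectives using the Hugging Face transformers pipelines; it assumes that library and the public bert-base-uncased and gpt2 checkpoints are available.

```python
# Masked vs. autoregressive language modeling with Hugging Face pipelines.
from transformers import pipeline

# 1) Masked language modeling (BERT-style): predict a masked token in context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Generative AI can produce text, images, and [MASK].")[:3]:
    print("MLM:", pred["token_str"], round(pred["score"], 3))

# 2) Autoregressive language modeling (GPT-style): predict the next tokens
#    one at a time, conditioned only on what came before.
generate = pipeline("text-generation", model="gpt2")
print("AR:", generate("Generative AI can produce", max_new_tokens=10)[0]["generated_text"])
```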

The following are the three main categories of pre-trained models:
1. Encoder models (BERT)
2. Decoder models (GPT)
3. Encoder-decoder models (T5, BART)

References

  1. https://www.youtube.com/live/Sy1psHS3w3I
  2. https://llmagents-learning.org/slides/Burak_slides.pdf
  3. A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT – Link
  4. https://mallahyari.github.io/rag-ebook/intro.html
  5. Vaswani et al., “Attention Is All You Need” – https://arxiv.org/abs/1706.03762
  6. Speech Recognition on LibriSpeech test-other – Link
  7. Image Classification on ImageNet – Link