What is trending in Generative AI (GenAI)?

1. Introduction

Let us look back to the earlier days of artificial intelligence, when machine learning methods like support vector machines and decision trees were widely accepted and applied to practically every use case. Those models were explainable: you could alter a model’s parameters and understand how the changes would affect its behavior and performance. The introduction of neural networks brought a significant increase in both the number of parameters and the training time, so building that kind of intuition became much slower. Neural networks also did not perform well at first. What, then, caused neural networks to begin outperforming all other options?

The main drivers were the availability of large amounts of data and the growth in computing power (GPUs). Another key innovation of recent years has been the Transformer architecture. When ChatGPT was introduced in November 2022, AI exceeded many people’s expectations.

1.1 Development in the last decade

The following figure provides an overview of the developments of the past ten years. Given an image, for instance, a model can recognize the objects in it; given audio input, it can transcribe it to text; it can translate text between languages; and given an image, it can produce a description of it.

Fig 1. A decade of amazing progress in what computers can do [1]

With the advent of Generative AI, the outputs in the figure above can become inputs, and new data can be generated as output. For example, given text, a model can generate images, audio, video, and more.

Fig 2. A decade of amazing progress in what computers can do [1]

1.2 Pace of development

In the following figure, you can see the progress of image classification on the ImageNet dataset: accuracy increased from around 50% in 2011 to about 92% in 2024.

Fig 3: Image Classification on ImageNet [7]

You can see the progress of Speech Recognition on LibriSpeech test-other in the following figure.

Fig 4: Speech Recognition on LibriSpeech test-other [6]

2. A Brief History of LLMs

Large Language Models (LLMs) are part of the broader category of Artificial Intelligence Generated Content (AIGC), which covers AI models that produce written text, music, graphics, and other types of content. The following figure illustrates AIGC in image generation.

Fig 5: Example of AIGC in image generation [3]

There are two types of generative AI models: unimodal and multimodal. Unimodal models receive instructions in the same modality as the content they generate, while multimodal models accept cross-modal instructions and produce outputs in different modalities. The following figure illustrates this in detail.

Fig 6. Illustration of unimodal models and multimodal models [3]

The generation process typically has two main parts: extracting intent information from human instructions and producing content based on the extracted intent. Modern AIGC has advanced so far beyond earlier efforts mainly because more complex generative models are trained on bigger datasets, using larger foundation model architectures and a wealth of computational resources. To find the best answer for a given instruction, ChatGPT uses reinforcement learning from human feedback (RLHF), which gradually increases the model’s accuracy and reliability. Generative diffusion models such as Stable Diffusion, proposed by Stability AI, generate high-resolution images by regulating the trade-off between exploration and exploitation, resulting in a harmonious blend of variation in the generated images and resemblance to the training data.

Generative models such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) have been developed since the 1950s, but deep generative models did not see significant performance improvements until the advent of deep learning in the 2010s. Traditional methods for generating sentences with N-gram language modeling are limited to short sequences, while recurrent neural networks (RNNs) and their variants can model longer dependencies and attend to around 200 tokens.
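As a toy illustration of why N-gram models only capture short-range context, here is a minimal bigram language model sketch in Python; the corpus and counts are made up for the example, and this is not meant as any production implementation:

```python
# Minimal sketch of N-gram language modeling (here a bigram model):
# next-word probabilities come purely from counting adjacent word pairs
# in a tiny toy corpus, which is why such models only see short-range context.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(word):
    """Probability of each word following `word`, estimated from counts."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```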

In computer vision, traditional algorithms relied on texture synthesis and texture mapping techniques based on hand-designed features. In 2014, Generative Adversarial Networks (GANs) were first proposed, bringing significant improvements in generating diverse images. Later, Variational Autoencoders (VAEs) and diffusion models were developed, offering finer control over image generation and higher-quality images.

The Transformer architecture, introduced by Vaswani et al. in 2017, has become the foundation for generative models across domains, including NLP (BERT, GPT) and computer vision (CV). In CV, combining visual components with transformer-based modeling enables multimodal tasks that integrate text and image information. The Vision Transformer (ViT) and the Swin Transformer further advanced this idea, allowing large-scale training on multimodal data.

Researchers are now exploring new techniques built on the Transformer architecture, such as few-shot prompting in NLP, which uses only a few examples to help the model better understand task requirements. In computer vision, models combine modality-specific approaches with self-supervised learning objectives to obtain more robust representations. These advancements have the potential to push AIGC further and lead to more efficient technologies.

3. Foundation Models

Transformer: The Transformer architecture [5], proposed by Vaswani et al. in 2017, is the backbone of many state-of-the-art generative models such as GPT-3 and DALL-E 2. It uses self-attention mechanisms to attend to different parts of an input sequence. The model consists of two components: an encoder, which generates hidden representations of the input, and a decoder, which produces the output sequence. Each layer includes a multi-head attention mechanism and a feed-forward neural network. The architecture is highly parallelizable, which allows efficient pre-training on large datasets and adapts well to different downstream tasks.
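To make the self-attention mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation inside each Transformer layer. The sequence length, dimensions, and random weights are illustrative assumptions, not taken from any particular model:

```python
# Minimal sketch of single-head scaled dot-product self-attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head) projection matrices."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # project each token to query/key/value vectors
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)        # attention weights over the sequence sum to 1
    return weights @ v                        # each output is a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8           # illustrative sizes, not from any real model
x = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings
Wq, Wk, Wv = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
print(self_attention(x, Wq, Wk, Wv).shape)    # (5, 8)
```

A full Transformer layer repeats this with several heads in parallel (multi-head attention) and follows it with a feed-forward network, residual connections, and normalization.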

Pre-trained Language Models: Because of its parallelism and learning capabilities, the Transformer architecture has become the standard backbone for natural language processing. Pre-trained language models fall into two main categories: masked language modeling, which uses context on both sides to estimate the likelihood of a masked token (e.g., BERT and RoBERTa), and autoregressive language modeling, which predicts the next token from the preceding ones (e.g., the GPT family). XLNet is an autoregressive model that additionally uses permutation operations over the token order to learn information across tokens in both directions. Autoregressive models are more appropriate for generation tasks than masked language models.
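The difference between the two pre-training styles can be pictured with attention masks: a masked-language model sees the whole sequence (apart from the masked tokens), while an autoregressive model may only look at earlier positions. The toy sequence below is only an assumption for illustration:

```python
# Toy illustration of the two pre-training setups via attention masks.
import numpy as np

tokens = ["the", "cat", "[MASK]", "on", "the", "mat"]   # made-up example sequence
n = len(tokens)

# Masked language modeling (BERT-style): full bidirectional visibility;
# the model predicts the token hidden behind "[MASK]" from both sides.
bidirectional_mask = np.ones((n, n), dtype=int)

# Autoregressive modeling (GPT-style): lower-triangular (causal) mask;
# position i may only attend to positions <= i, so each token is
# predicted from the tokens that precede it.
causal_mask = np.tril(np.ones((n, n), dtype=int))

print("bidirectional:\n", bidirectional_mask)
print("causal:\n", causal_mask)
```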

Reinforcement Learning from Human Feedback: AIGC output may not always align with human preferences, such as usefulness and truthfulness. To improve this alignment, reinforcement learning from human feedback (RLHF) has been applied to fine-tune models such as Sparrow, InstructGPT, and ChatGPT.

The three-step process involves (step 2 is sketched in code after the list):

  1. Pre-training a language model on large-scale datasets
  2. Training a reward model to encode human preferences for the pre-trained model
  3. Fine-tuning the language model with reinforcement learning
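As a rough sketch of step 2, the snippet below trains a toy reward model with the pairwise ranking loss commonly used in RLHF, loss = -log sigmoid(r_chosen - r_rejected), on synthetic "response representations". The feature dimension, data, and linear model are hypothetical stand-ins, not the actual InstructGPT or ChatGPT setup:

```python
# Toy reward-model training with the pairwise preference loss used in RLHF.
import torch
import torch.nn as nn

d = 32                                   # hypothetical feature dimension
reward_model = nn.Linear(d, 1)           # maps a response representation to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Synthetic preference pairs: each row pairs the representation of a
# human-preferred response with that of a rejected one.
torch.manual_seed(0)
chosen = torch.randn(64, d) + 0.5        # shifted so "chosen" is separable from "rejected"
rejected = torch.randn(64, d) - 0.5

for step in range(200):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Push the reward of preferred responses above that of rejected ones.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.4f}")
```

In step 3, the trained reward model scores the language model's outputs, and a reinforcement learning algorithm updates the language model to produce higher-reward responses.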

Recent hardware advancements, including large increases in computing power, the adoption of GPUs and TPUs, distributed training, and cloud computing, have significantly improved the efficiency of training large-scale neural network models, enabling more complex and accurate results while opening up new possibilities for AI research and applications.

References

    1. https://www.youtube.com/live/Sy1psHS3w3I
    2. https://llmagents-learning.org/slides/Burak_slides.pdf
    3. A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT – Link
    4. https://mallahyari.github.io/rag-ebook/intro.html
    5. Attention Is All You Need – Link
    6. Speech Recognition on LibriSpeech test-other – Link
    7. Image Classification on ImageNet – Link