How to Record, Save and Play audio in Python?

Libraries

The following are the required libraries.

  • PortAudio is a free, cross-platform, open-source, audio I/O library. It lets you write simple audio programs in ‘C’ or C++ that will compile and run on many platforms including Windows, Macintosh OS X, and Unix (OSS/ALSA).
  • PyAudio provides Python bindings for PortAudio. Following is the pip command.
    pip install pyaudio
  • wave module of Python3.
  • simpleaudio to play the saved wave audio file. Following is the pip command:
    pip install simpleaudio

Mic check

Check if you have a working microphone on your system. Following is the code snippet you can use.

import pyaudio
import pprint 

p = pyaudio.PyAudio()
pp = pprint.PrettyPrinter(indent=4)

try:
    pp.pprint(p.get_default_input_device_info())
except IOError:
    print("No mics available")

Example output: the call prints a dictionary describing the default input device (its index, name, maxInputChannels, defaultSampleRate, default latencies, and so on); the exact values depend on your system and drivers.
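If no default input device is reported, it can help to list every device PyAudio can see. The following is a small optional sketch (not part of the original snippet):

import pyaudio

# Enumerate all audio devices and show how many input channels each offers.
p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    print(i, info["name"], info["maxInputChannels"])
p.terminate()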

Record the audio

Following is the code snippet to record the audio.

import pyaudio
import wave

# Recording parameters (the same values are used later when saving the file)
CHUNK = 512                       # frames per buffer
FORMAT = pyaudio.paInt16          # 16-bit samples
CHANNELS = 1                      # mono
RATE = 44100                      # samples (frames) per second
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "myaudio.wav"

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK) #buffer

print("* recording")
frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data) # 2 bytes(16 bits) per channel

stream.stop_stream()
stream.close()
p.terminate()

pyaudio.PyAudio() acquires system resources for PortAudio. pyaudio.PyAudio.open() sets up a pyaudio.PyAudio.Stream to play or record audio. pyaudio.PyAudio.Stream.read() reads audio data from the stream. In the code above, all the audio frames are collected in the frames list; these frames are used to save the audio file later in the post.

Meaning of parameters to the function open:

  1. FORMAT: PortAudio provides samples in raw PCM (Pulse-Code Modulation) format. That means each sample is an amplitude to be given to the DAC (digital-to-analog converter) in your sound card. For paInt16, this is a value from -32768 to 32767; for paFloat32, it is a floating-point value from -1.0 to 1.0. The sound card converts these values to a proportional voltage that then drives your audio equipment. The available formats are paFloat32, paInt32, paInt24, paInt16, paInt8, paUInt8, and paCustomFormat.
  2. CHANNELS is the number of samples per frame (1 for mono, 2 for stereo).
  3. RATE is the sampling rate, i.e. the number of frames per second.
  4. CHUNK is the number of frames per buffer, i.e. how many frames are read from the stream at a time. The value is chosen somewhat arbitrarily.

Finally, the stream should be closed and all acquired resources released.
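To make the relationship between these parameters concrete, here is a small sketch (using the same constants as the snippets in this post) showing how much audio each chunk covers and how many reads the recording loop performs:

RATE = 44100                                     # frames per second
CHUNK = 512                                      # frames per stream.read() call
RECORD_SECONDS = 5

seconds_per_chunk = CHUNK / RATE                 # about 0.0116 s of audio per read
num_reads = int(RATE / CHUNK * RECORD_SECONDS)   # 430 reads for 5 seconds
print(seconds_per_chunk, num_reads)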

Saving the audio

Following is the code snippet to save the audio.

# Save the recorded audio file
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()

The wave module of Python 3 is used for this purpose. The parameters are set to the same values that were used while recording the audio.
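As an optional sanity check (not part of the original snippet), the saved file can be reopened with the same wave module to confirm that its header matches the recording parameters:

# Reopen the file and print channels, sample width, rate, and duration.
with wave.open(WAVE_OUTPUT_FILENAME, 'rb') as rf:
    duration = rf.getnframes() / rf.getframerate()
    print(rf.getnchannels(), rf.getsampwidth(), rf.getframerate(), round(duration, 2))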

Play the audio

Following is the code snippet to play the audio. I have used the simpleaudio library for this purpose; there are other libraries you can try, but I found simpleaudio simple enough.

import simpleaudio as sa

# Play the recorded audio file
wave_obj = sa.WaveObject.from_wave_file(WAVE_OUTPUT_FILENAME)
play_obj = wave_obj.play()
play_obj.wait_done()
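If you prefer not to add an extra dependency, PyAudio itself can also play the file back through an output stream. The following is a minimal sketch (error handling omitted), not part of the original example:

import pyaudio
import wave

# Open the saved file and stream it to the default output device.
wf = wave.open("myaudio.wav", 'rb')
p = pyaudio.PyAudio()
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True)

data = wf.readframes(1024)
while data:
    stream.write(data)
    data = wf.readframes(1024)

stream.stop_stream()
stream.close()
p.terminate()
wf.close()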

Complete Code

import pyaudio
import wave
import simpleaudio as sa

CHUNK = 512 
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "myaudio.wav"

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK) #buffer

print("* recording")
frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data) # 2 bytes(16 bits) per channel

stream.stop_stream()
stream.close()
p.terminate()

# Save the recorded audio file
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()

# Play the recorded audio file
wave_obj = sa.WaveObject.from_wave_file("myaudio.wav")
play_obj = wave_obj.play()
play_obj.wait_done()

What is trending in Generative AI as of today?

Introduction

In the history of AI, machine learning techniques such as Support Vector Machines, Decision Trees, etc. were used for almost all of the use cases of the time. One could change different parameters and build an intuition about how the model worked and performed based on those changes. With the arrival of neural networks, training times and the number of parameters increased, so building that intuition gradually became slower. Initially, neural networks performed poorly, so what changed to make them start performing better than anything else? Advances in computing (GPUs) and the availability of large amounts of data are the major reasons. With the arrival of the Transformer architecture, performance became remarkable. Another big change came when ChatGPT launched in November 2022; AI exceeded the expectations of many.

Development in the last decade

The following figure gives a snapshot of the development in the last decade. For example, given an image, a model can identify the objects in it; given an audio signal, it can transcribe it to text; it can translate text from one language to another; and given an image, it can generate a description of it.

With the advent of Generative AI, the outputs in the figure above can become inputs and new data can be generated as output. For example, given text, a model can generate images, audio, video, etc.

Pace of development

In the following figure, you can see the progress of image classification on the ImageNet dataset. Accuracy increased from around 50% in 2011 to 92% in 2024.

You can see the progress of Speech Recognition on LibriSpeech test-other in the following figure.

A Brief History of LLMs

Large Language Models (LLMs) fall under a broader category called Artificial Intelligence Generated Content (AIGC), which comprises AI models that create content such as images, music, written language, etc. The following figure shows an example of AIGC in image generation.

Types of Generative AI models

1. Unimodal Models: Unimodal models receive instructions in the same modality as the content they generate.
2. Multimodal Models: Multimodal models accept cross-modal instructions and produce results in a different modality.

Foundation Models

Transformer is the backbone architecture for many state-of-the-art models, such as GPT, DALL-E, Codex, and so on.

Traditional models like RNNs had limitations in dealing with variable-length sequences and long-range context. The self-attention mechanism is at the core of the Transformer; it allows the model to focus on different parts of the input sequence when processing each token.
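As a rough illustration of the idea, the following is a toy NumPy sketch of single-head scaled dot-product self-attention (random weights and illustrative shapes, not the setup of any particular model). Each token's output is a weighted average of all tokens' value vectors, with weights derived from query-key similarity:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project tokens to queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Similarity of every token with every other token, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over the sequence dimension gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (4, 8)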

Following is the encoder-decoder structure of the Transformer architecture, taken from the paper “Attention Is All You Need”.

The encoder processes the input sequence to create the hidden representations whereas the decoder generates the output sequence.

Each encoder and decoder layer includes multi-head attention and feed-forward neural networks. Multi-head attention assigns weights to tokens based on relevance.

The transformer’s inherent parallelizability minimizes inductive biases, making it ideal for large-scale pre-training and adaptability to different downstream tasks.

There are two main types of pre-trained language models based on the training tasks:
1. Masked language modeling (e.g. BERT): Predict masked tokens within a sentence.
2. Autoregressive language modeling (e.g. GPT-3): Predict the next token given previous ones.

The following are the three main categories of pre-trained models:
1. Encoder models (BERT)
2. Decoder models (GPT)
3. Encoder-decoder models (T5, BART)
