The following questions should be running through your mind when choosing among competing open-source products:
Are there any other users out there?
Is it the most popular in this category?
Is this technology in decline?
The popularity and traction of open-source projects can be inferred from their GitHub star histories. You can use star-history.com to compare projects on this metric over time. Refer to the tutorial for details.
Learn about Large Language Models (LLMs), their installation, access through HTTP API, the Ollama framework, etc.
Introduction of Retrieval-Augmented Generation (RAG)
Learn about the Data Ingestion Pipeline for the Qdrant vector database
Learn about the RAG Pipeline
Access the prototype using audio-based input/output (audio bot).
Audio bot using speech-to-text and text-to-speech
Creating a Qdrant client and making queries
Creating context from the documents retrieved from the Qdrant database
Using Llama 3.2 as the Large Language Model (LLM) via the Ollama and LangChain frameworks
Creating a prompt template from an instruction and the retrieved context, and making a query using the template and LangChain
Using Llama to generate the answer to the question using the given context
2. Large Language Model
A Large Language Model (LLM) is an artificial intelligence (AI) model with billions of parameters, trained on huge amounts of data to comprehend human-generated text and to produce language similar to a human’s. LLMs learn the linguistic structures, relationships, and patterns in human-generated data. They also gain a huge amount of internal knowledge (encoded in the model weights) from the data on which they are trained.
Transformer architecture is often the foundation of large language models, allowing them to:
process sequential data, which makes them suitable for tasks like text generation, translation, and question-answering
learn contextual relationships such as word meanings, syntax, and semantics
generate human-like language, i.e., produce coherent, context-specific text that resembles human-generated content
Some key characteristics of large language models include:
Vast training data: they are trained on huge corpora of text.
Scalability: they can handle long sequences and complex tasks.
Parallel processing: the transformer architecture processes tokens in parallel rather than one at a time.
Figure 1
LLMs are divided into three categories: 1) encoder-only, 2) decoder-only, and 3) encoder-decoder models. Examples are BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer), respectively. General use cases are shown in Figure 1 above.
Major concerns regarding LLMs are:
Data bias: LLMs have the potential to reinforce biases found in the training data.
Interpretability: It can be difficult to comprehend the decision-making process of an LLM.
Security: Adversarial attacks can target LLMs.
2.1 Llama
Llama is a popular open-source LLM from Facebook. Being open-source, we can fine-tune, distill, and deploy it for our own use cases (provided the above-mentioned concerns are taken care of). Llama is a decoder-only language model, which means that it uses a transformer architecture with only decoder layers to generate text.
At the time of writing this article, current versions are available in three flavors:
Llama 3.1: Multilingual, available in two versions: 1) 8B with 8 billion parameters and 2) 405B with 405 billion parameters
Llama 3.2: Lightweight and multimodal, available as 1) lightweight 1B and 3B models and 2) multimodal 11B and 90B models
Llama 3.3: Multilingual, with 70B parameters
In the current prototype, I have used Llama 3.2.
2.2 Install, run, and different ways to access Llama
3. Retrieval-Augmented Generation (RAG)
Large language models are locked in time: an LLM knows only what was available up to the time it was trained and released. These models are trained on Internet-scale open data, so they give very good answers to general questions, but they may fail to answer, or may hallucinate, if you get very specific about your personal or enterprise data. The reason is that the model usually lacks the right context, which is very specific to your application.
Retrieval-augmented generation (RAG) combines the strengths of generative AI and retrieval techniques. It provides the right context to the LLM along with the question being asked, so the LLM generates more accurate and relevant content. This is a cost-effective way to improve the output of LLMs without retraining them. The following diagram depicts the RAG architecture:
Fig1: Conceptual flow of using RAG with LLM
There are two primary components of RAG:
Retrieval: This component is responsible for searching and retrieving relevant information related to the user’s query from various knowledge sources such as documents, articles, databases, etc.
Generation: This component does an excellent job of crafting coherent and contextually rich responses to user queries.
A question submitted by the user is routed to the retrieval component. Using the embedding model, the retrieval component converts the query text into an embedding vector. It then searches the vector database for a small number of vectors that match the query text and satisfy the threshold requirements for the similarity score and distance metric. The matched vectors are mapped back to their source text, which is used as the context. This context, along with the instruction and the query, is put into the prompt template and sent to the LLM, which returns generated text that is more correct and relevant to the user’s query.
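As a minimal sketch of this flow, using the same Qdrant and LangChain pieces that appear in the implementation later in this article (the collection name and the 0.8 similarity threshold here are illustrative):

from qdrant_client import QdrantClient
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

query = "What is democracy?"

# Retrieval: embed the query and fetch similar chunks from the vector database
client = QdrantClient(url="http://localhost:6333")
results = client.query(collection_name="my-collection", query_text=query)
context = " ".join(r.document for r in results if r.score >= 0.8)

# Generation: put the context and query into a prompt template and ask the LLM
prompt = ChatPromptTemplate.from_template(
    "Instruction: Answer the question from the context below.\n"
    "Context: {context}\nQuery: {query}"
)
chain = prompt | OllamaLLM(model="llama3.2")
print(chain.invoke({"context": context, "query": query}))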
4. RAG (Data Ingestion Pipeline)
In order for the retrieval component to have a searchable index of preprocessed data, we must first build a data ingestion pipeline. The diagram in Fig 2 depicts this pipeline. Knowledge sources can be web pages, text documents, PDF documents, etc., from which the text must be extracted. I am using PDF documents as the only knowledge source for this prototype.
Text Extraction: To extract the text from a PDF, various Python libraries can be used, such as PyPDF2, pdf2txt, PDFMiner, etc. If the PDF is a scanned document, libraries such as unstructured, pdf2image, and pytesseract can be utilized. The quality of the text can be maintained by performing cleanups such as removing extraneous characters, fixing formatting issues, normalizing whitespace, special characters, and punctuation, spell checking, etc. Language detection may also be required if the knowledge sources contain text in multiple languages, or if a single document mixes languages.
Handling Multiple Pages: Maintaining the context across pages is important. It is recommended that the document be segmented into logical units, such as paragraphs or sections, to preserve the context. Extracting metadata such as document titles, authors, page numbers, creation dates, etc., is crucial for improving searchability and answering user queries (see the sketch below).
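For example, the `add()` call used in the ingestion script below accepts optional `metadata` and `ids` lists (left commented out there). A sketch with illustrative fields:

# One metadata dict per chunk; the fields here are illustrative
metadata = [{"source": file_name, "chunk_no": n} for n in range(len(documents))]

qdrant_client.add(
    collection_name="ix-sst-ncert-democratic-politics",
    documents=documents,
    metadata=metadata,
)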
Fig2: RAG data ingestion pipeline
Note: I have manually downloaded the PDFs of all the chapters of the book “Democratic Politics” of class IX of the NCERT curriculum. These PDFs will be the knowledge source for our application.
4.1 Implementation step by step
Step 1: Install the necessary libraries
pip install pdfminer.six
pip install langchain-ollama
pip install "qdrant-client[fastembed]"  # Qdrant client with the FastEmbed embedding library
Imports:
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient
Step 2: Load the pdf file and extract the text from it
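This step appears in the complete script in section 4.2; in isolation it looks like this (the file path here is illustrative):

from langchain_community.document_loaders import PDFMinerLoader

# Load the pdf document and extract text from it
loader = PDFMinerLoader("ix-sst-ncert-democratic-politics/chapter-1.pdf")
pdf_content = loader.load()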
Step 3: Split the text into smaller chunks with overlap
CHUNK_SIZE = 1000   # chunk size not greater than 1000 chars
CHUNK_OVERLAP = 30  # a bit of overlap is required for continued context

text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
docs = text_splitter.split_documents(pdf_content)

# Make a list of split docs
documents = []
for doc in docs:
    documents.append(doc.page_content)
Step 4: Embed and store the documents in the vector database
FastEmbed is a lightweight, fast Python library built for embedding generation. The Qdrant vector database client uses this embedding library by default. Following is the code snippet for inserting into the vector database.
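Extracted from the complete script in section 4.2 (`add()` embeds the chunks with FastEmbed under the hood and creates the collection if it does not exist):

# Create vector database (qdrant) client
qdrant_client = QdrantClient(url="http://localhost:6333")

# Add document chunks in vectordb
qdrant_client.add(
    collection_name="ix-sst-ncert-democratic-politics",
    documents=documents,
)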
Step 5: Make a query to verify the ingestion

# 5. Make a query from the vectordb (qdrant)
search_results = qdrant_client.query(
    collection_name="ix-sst-ncert-democratic-politics",
    query_text="What is democracy?"
)
for search_result in search_results:
    print(search_result.document, search_result.score)
4.2 Complete Code data_ingestion.py
###############################################################
# Data ingestion pipeline
# 1. Taking the input pdf file
# 2. Extracting the content
# 3. Divide into chunks
# 4. Use embeddings model to convert to the embedding vector
# 5. Store the embedding vectors to the qdrant (vector database)
###############################################################
import os

from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient

path = "ix-sst-ncert-democratic-politics"
filenames = next(os.walk(path))[2]

for i, file_name in enumerate(filenames):
    print(f"Data ingestion for the chapter: {i}")

    # 1. Load the pdf document and extract text from it
    loader = PDFMinerLoader(path + "/" + file_name)
    pdf_content = loader.load()
    print(pdf_content)

    # 2. Split the text into small chunks
    CHUNK_SIZE = 1000   # chunk size not greater than 1000 chars
    CHUNK_OVERLAP = 30  # a bit of overlap is required for continued context
    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
    docs = text_splitter.split_documents(pdf_content)

    # Make a list of split docs
    documents = []
    for doc in docs:
        documents.append(doc.page_content)

    # 3. Create vector database (qdrant) client
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 4. Add document chunks in vectordb
    qdrant_client.add(
        collection_name="ix-sst-ncert-democratic-politics",
        documents=documents,
        # metadata=metadata,
        # ids=ids
    )

    # 5. Make a query from the vectordb (qdrant)
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text="What is democracy?"
    )
    for search_result in search_results:
        print(search_result.document, search_result.score)
5. RAG (Information Retrieval and Generation) – Audio Bot
I am making an audio bot that will answer questions from the chapters of the book “Democratic Politics” of class IX of the NCERT (India) curriculum. If you want to learn about making an audio bot, you can read my article “Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB“.
5.1 Audio Bot Implementation
The following diagram depicts the overall flow of the audio bot and how it interacts with the RAG system. The user speaks into the microphone, which captures the speech audio signal and passes it to the speech-to-text library (I am using faster_whisper). The resulting text is passed to the RAG system as the query. When the RAG system comes up with the response text, that text is passed to the text-to-speech library (I am using pyttsx3), which converts it to audio that is played through the speaker so the user can listen to the response.
faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
Installation: pip install faster-whisper
Save the following code in a Python file, say "speech-to-text.py", and run python speech-to-text.py
from faster_whisper import WhisperModel

model_size = "small.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

# Transcribe
transcription = model.transcribe(
    audio="basic_output1.wav",
    language="en",
)
seg_text = ''
for segment in transcription[0]:
    seg_text += segment.text  # accumulate all segments
print(seg_text)
Sample input audio file:
Output text: “Please ask me something. I’m listening now”
5.1.3 Text-to-Speech
The best offline text-to-speech library that works on resource-constrained devices is “pyttsx3“.
Installation: pip install pyttsx3
Save the following code in a Python file, say "text-to-speech.py", and run python text-to-speech.py
Code Snippet:
import pyttsx3

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')

text_to_speak = "I got your question. Please bear " \
                "with me while I retrieve the answer."
engine.say(text_to_speak)

# Following line is optional: use it if you also
# want to save the audio to a file
engine.save_to_file(text_to_speak, 'speech.wav')
engine.runAndWait()
Sample input text: “I got your question. Please bear with me while I retrieve the answer.”
5.2 Using the LLM to Form the Answer
The following code snippet creates a template; the template is used to build the prompt; a reference to the Llama model is created; a chain (a LangChain pipeline for executing the query) is composed from the prompt and the model; and finally the chain is invoked to execute the query and get the response formed by the LLM from the retrieved context.
# 4. Using LLM for forming the answer
template = """Instruction: {instruction}
Context: {context}
Query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.2")  # Using llama3.2 as llm model
chain = prompt | model
bot_response = chain.invoke({
    "instruction": "Answer the question based on the context below. "
                   "If you cannot answer the question with the given context, "
                   "answer with \"I don't know.\"",
    "context": context,
    "query": query
})
5.3 Complete Code audiobot.py
Following is the code snippet for the audio bot. Save the file as audiobot.py
import pyaudio
import wave
import pyttsx3
from qdrant_client import QdrantClient
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from faster_whisper import WhisperModel

# Load the Speech to Text Model (faster-whisper: pip install faster-whisper)
whisper_model_size = "small.en"
whisper_model = WhisperModel(whisper_model_size, device="cpu", compute_type="int8")

CHUNK = 512
FORMAT = pyaudio.paInt16  # paInt8
CHANNELS = 1
RATE = 44100  # sample rate
RECORD_SECONDS = 7
WAVE_OUTPUT_FILENAME = "pyaudio-output.wav"

def speak(text_to_speak):
    engine = pyttsx3.init()
    engine.setProperty('volume', 1)
    engine.setProperty('rate', 130)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.setProperty('voice', 'english+f3')
    engine.say(text_to_speak)
    engine.runAndWait()

speak("I am an AI bot. I have learned the book \"democratic politics\" of class 9 "
      "published by N C E R T. You can ask me questions from this book.")

while True:
    speak("I am listening now for your question.")

    # Record the question from the microphone
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)  # buffer
    print("* recording")
    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)  # 2 bytes (16 bits) per channel
    stream.stop_stream()
    stream.close()
    p.terminate()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

    # Transcribe
    transcription = whisper_model.transcribe(
        audio=WAVE_OUTPUT_FILENAME,
        language="en",
    )
    seg_text = ''
    for segment in transcription[0]:
        seg_text += segment.text
    print(f'\nUser: {seg_text}')

    if seg_text == '':
        speak("Probably you did not say anything.")
        continue
    else:
        text_to_speak = "I got your question. Please bear with me " \
                        + "while I retrieve the answer."
        speak(text_to_speak)

    # 1. Create vector database (qdrant) client
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 2. Make a query to the vectordb (qdrant)
    # query = "explain democracy in estonia?"
    query = seg_text
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text=query
    )

    # 3. Build the context from the top matching documents
    context = ""
    no_of_docs = 2
    count = 1
    for search_result in search_results:
        if search_result.score >= 0.8:
            print(f"Retrieved document: {search_result.document}, "
                  f"Similarity score: {search_result.score}")
            context = context + search_result.document
        if count >= no_of_docs:
            break
        count = count + 1
    # print(f"Retrieved Context: {context}")

    if context == "":
        print("Context is blank. Could not find any relevant information "
              "in the given sources.")
        speak("I did not find anything in the book about the question.")
        continue

    # 4. Using LLM for forming the answer
    template = """Instruction: {instruction}
Context: {context}
Query: {query}
"""
    prompt = ChatPromptTemplate.from_template(template)
    model = OllamaLLM(model="llama3.2")  # Using llama3.2 as llm model
    chain = prompt | model
    bot_response = chain.invoke({
        "instruction": "Answer the question based on the context below. "
                       "If you cannot answer the question with the given context, "
                       "answer with \"I don't know.\"",
        "context": context,
        "query": query
    })
    print(f'\nBot: {bot_response}')
    speak(bot_response)
6. Libraries Used
Following is the list of libraries used in the prototype implementation. They can be installed with the Python pip command.
Activate the virtual environment first: on Windows, run env1\Scripts\activate; on Linux, the activate script is in the environment's bin directory.
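A typical sequence (a sketch assuming a virtual environment named env1, as above):

python -m venv env1
# Windows
env1\Scripts\activate
# Linux/macOS
source env1/bin/activate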
python -m pip install -r requirements.txt
python data_ingestion.py
python audiobot.py
8. My Conversation with the Audio Bot
9. Further Improvement
In the current prototype, the chunk size is fixed (CHUNK_SIZE = 1000 and CHUNK_OVERLAP = 30). For further improvement, the document could be split into logical units, such as paragraphs or sections, to maintain better context, as sketched below.
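One possible direction is LangChain's RecursiveCharacterTextSplitter, which prefers paragraph and sentence boundaries before falling back to single characters. A sketch, as a drop-in for the CharacterTextSplitter in data_ingestion.py (the separator list and sizes are illustrative):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],  # paragraph, line, sentence, word
    chunk_size=1000,
    chunk_overlap=30,
)
docs = text_splitter.split_documents(pdf_content)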
10. References
A Practical Approach to Retrieval Augmented Generation Systems by Mehdi Allahyari and Angelina Yang
Install, run, and access Llama using Ollama. – link
How to Record, Save, and Play Audio in Python? – link
Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB – link
In this article, I describe my experiment of making a talking bot using the Large Language Model Llama3.2:1b and running it successfully on a Raspberry Pi 4 Model-B with 4GB RAM. Llama3.2:1b is Facebook's quantized, 1-billion-parameter version of the Llama model, intended for resource-constrained devices. I have kept this bot primarily in question-answering mode to keep things simple. The bot is supposed to answer any question that llama3.2:1b can answer from the knowledge learned in the model. The objective is to run this completely offline without needing the Internet.
My Setup
The following picture describes my setup: a Raspberry Pi hosting the LLM (llama3.2:1b), a mic for asking questions, and a pair of speakers to play the bot's answers. I used the Internet during installation, but the bot itself works offline.
Following is the overall design explaining the end-to-end flow.
The user asks the question in the external microphone connected to the Raspberry Pi. This audio signal captured by the microphone is converted to text using a speech-to-text library. Text is sent to the Llama model running on the Raspberry Pi. The Llama model answers the question in the form of text that is sent to the text-to-speech library. The output of the text-to-speech is audio that is played and can be listened to by the user on the speaker.
Following are the steps of the setup:
Install, run, and access Llama
Installation and accessing Speech-to-text library
Installation of text-to-speech library
Record, Save, and Play audio
Running the code (the complete code)
1. Install, run, and access Llama using the API
The Llama model is the core of this bot product. So before we move further, this should be installed and running. Please refer to the separate post on the topic, “Install, run, and access Llama using Ollama“. This post also describes the details of how to access the running model using the API.
2. Installation of the speech-to-text library and how to use it
I tried many speech-to-text libraries and finally settled on “faster-whisper“. faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, a fast inference engine for Transformer models. Its performance on the Raspberry Pi was satisfactory, and it works offline.
Installation: pip install faster-whisper
Save the following code in a Python file, say "speech-to-text.py", and run python speech-to-text.py
Code Snippet:
from faster_whisper import WhisperModel

model_size = "small.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

# Transcribe
transcription = model.transcribe(
    audio="basic_output1.wav",
    language="en",
)
seg_text = ''
for segment in transcription[0]:
    seg_text += segment.text  # accumulate all segments
print(seg_text)
Sample input audio file:
Output text: “Please ask me something. I’m listening now”
3. Installation of the text-to-speech library and how to use it
The best offline text-to-speech library that works on resource-constrained devices is “pyttsx3“.
Installation: pip install pyttsx3
Save the following code in a Python file, say "text-to-speech.py", and run python text-to-speech.py
Code Snippet:
import pyttsx3

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')

text_to_speak = "I got your question. Please bear " \
                "with me while I retrieve the answer."
engine.say(text_to_speak)

# Following line is optional: use it if you also
# want to save the audio to a file
engine.save_to_file(text_to_speak, 'speech.wav')
engine.runAndWait()
Sample input text: “I got your question. Please bear with me while I retrieve the answer.”
How do you install the Llama model using the Ollama framework?
Running Llama models
Different ways to access the Ollama model
Access the deployed models using the Page Assist plugin in web browsers
Access the Llama model using HTTP API in Python Language
Access the Llama model using the Langchain Library
1. What is Ollama?
Ollama is an open-source tool/framework that facilitates users in running large language models (LLMs) on their local computers, such as PCs, edge devices like Raspberry Pi, etc.
2. How to install it?
Visit https://ollama.com/download. You can download and install it based on your PC’s OS, such as Mac, Linux, and Windows.
3. Running Llama 3.2
Four versions of Llama 3.2 models are available: 1B, 3B, 11B, and 90B. ‘B’ indicates billions: for example, 1B means that the model has 1 billion parameters. 1B and 3B are text-only models, whereas 11B and 90B are multimodal (text and images).
Run the 1B model: ollama run llama3.2:1b
Run the 3B model: ollama run llama3.2
After running these models on the terminal, we can interact with the model using the terminal.
4. Access the deployed models using Web Browsers
Page Assist is an open-source browser extension that provides a sidebar and web UI for your local AI model. It allows you to interact with your model from any webpage.
5. Access the Llama model using HTTP API in Python Language
Before running the following code, you should have performed steps 1 and 2 above; that is, you have installed Ollama and are running the Llama model under it. When you run Llama under Ollama, it provides access to the model in two ways: 1) through the command line and 2) through an HTTP API on port 11434. The following Python code accesses the model through the HTTP API.
import json
import requests

data = '{}'
data = json.loads(data)
data["model"] = "llama3.2:1b"
data["stream"] = False
data["prompt"] = "What is Newton's law of motion?" + " Answer in short."

# Send to chatbot
r = requests.post('http://127.0.0.1:11434/api/generate', json=data)
response_data = r.json()

# Print User and Bot Message
print(f'\nUser: {data["prompt"]}')
bot_response = response_data['response']
print(f'\nBot: {bot_response}')
6. Access the Llama model using the Langchain Library
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

query = "What is Newton's law of motion?"
template = """Instruction: {instruction}
Query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.2")  # Using llama3.2 as llm model
chain = prompt | model
bot_response = chain.invoke({
    "instruction": "Answer the question. If you cannot answer the question, "
                   "answer with \"I don't know.\"",
    "query": query
})
print(f'\nUser: {query}')
print(f'\nBot: {bot_response}')
7. Running Ollama for remote access
By default, the Ollama service runs locally and is not accessible remotely. To make Ollama remotely accessible, we need to set the following environment variables:
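For example, on Linux the API can be bound to all interfaces by setting OLLAMA_HOST, which is Ollama's standard environment variable for this (the address shown is an example; adjust host and port to your network):

export OLLAMA_HOST=0.0.0.0:11434
ollama serve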
Ever needed to add voice recording to your Python app? Maybe you’re building a transcription tool, a voice note feature, or just want to capture audio for analysis. Whatever the reason, working with audio in Python turns out to be surprisingly straightforward once you know which libraries to use.
I recently needed to implement audio recording for a project, and after trying several approaches, I found PyAudio to be the most reliable option. It’s not perfect – the installation can be a pain, especially on Linux – but it works consistently across platforms and gives you fine-grained control over the recording process.
This guide walks through the complete workflow: checking if your microphone works, recording audio, saving it as a WAV file, and playing it back. I’ll show you the actual code I use, along with explanations of what each parameter does and why it matters.
What you’ll need:
Python 3.6 or higher
A working microphone
About 10 minutes to get everything set up
What you’ll learn:
Installing PyAudio and handling its dependencies
Recording audio from your microphone
Understanding audio parameters (sample rate, bit depth, channels)
Saving recordings as WAV files
Playing back audio files
The code examples are meant to be copy-paste ready. I’ve kept them minimal to focus on the core concepts, but I’ll also show you how to handle common edge cases like missing microphones or buffer overflows.
2. Audio Recording Workflow
The diagram below shows the complete audio recording pipeline we’ll be implementing. Here’s what happens at each stage:
Speak – You talk into your microphone, which converts sound waves into an electrical signal
Audio Signal – The microphone sends this analog signal to your computer’s sound card
Record Audio – PyAudio captures the digital audio stream and stores it in memory as chunks of data
Save Audio – The Python wave module writes these chunks to a WAV file on your disk
Play Audio – SimpleAudio (or other libraries) reads the WAV file and sends it back through your speakers
Each arrow represents data flowing through your system. We’ll write Python code to control steps 3, 4, and 5 – the actual recording, saving, and playback.
Figure 1: Audio Recording Workflow
2.1 Why This Workflow Matters
Understanding this pipeline helps you debug issues more effectively. If recording fails, you’ll know whether the problem is with hardware (microphone), drivers (sound card), your code (PyAudio), or file handling (wave module). Each stage has different failure modes and solutions.
Let’s start with the installation steps you’ll need.
3. Installation
Estimated time: 5-10 minutes
Before we jump into the code, we need to get PyAudio installed. Fair warning: this is usually the trickiest part of working with audio in Python. PyAudio requires PortAudio, a cross-platform audio library, to be installed at the system level before you can install the Python package.
3.1 Why Installation Can Be Tricky
PyAudio is a Python wrapper around PortAudio (a C library). When you run pip install pyaudio, pip tries to compile the C extension, which fails if PortAudio isn’t already on your system. This catches a lot of people off guard because most Python packages “just work” with pip.
3.2 System Dependencies First
Install the system-level audio libraries based on your operating system:
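For example (assuming Debian/Ubuntu and Homebrew; package names can differ on other distributions):

# Debian/Ubuntu
sudo apt-get install portaudio19-dev python3-dev

# macOS (Homebrew)
brew install portaudio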
Once the system dependencies are in place, install the required Python libraries:
pip install pyaudio wave simpleaudio
Note: The wave module is actually built into Python’s standard library, so you don’t need to install it separately. I’m listing it here for completeness.
3.4 Verify Installation
Let’s make sure everything works. Run this quick check:
import pyaudio
import pprint

p = pyaudio.PyAudio()
pp = pprint.PrettyPrinter(indent=4)
try:
    device_info = p.get_default_input_device_info()
    pp.pprint(device_info)
    print(f"\n✓ Microphone available: {device_info.get('name', 'Unknown')}")
except IOError as e:
    print(f"✗ No microphone available: {e}")
    print("Please check your audio input devices and try again.")
except Exception as e:
    # Catch any other unexpected errors
    print(f"✗ Unexpected error: {e}")
finally:
    p.terminate()  # Always cleanup, no matter what happens
If you see your microphone details printed out, you’re good to go. If you get an error, double-check that:
Your microphone is physically connected
Your OS recognizes the microphone (check system settings)
PyAudio installed correctly (try pip list | grep -i pyaudio)
3.5 Common Installation Issues
"command 'gcc' failed with exit status 1"
This means the C compiler can’t find PortAudio headers. Go back and install the system dependencies for your OS.
"No module named '_portaudio'"
The PyAudio installation didn’t complete properly. Try uninstalling and reinstalling:
"jack server is not running or cannot be started" warnings (Linux)
This is just a warning, not an error. PyAudio tries to connect to JACK (a professional audio server) but falls back to ALSA if it’s not available. You can safely ignore this.
ImportError on simpleaudio
On Linux, simpleaudio needs ALSA development files:
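For example, on Debian/Ubuntu:

sudo apt-get install libasound2-dev
pip install simpleaudio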
If that runs without errors, you’re ready to start recording audio.
4. Verifying Your Microphone (Mic Check)
Now that we have PyAudio installed, let’s make sure your microphone is recognized and working properly. This step is crucial because audio issues are much easier to debug before you start recording.
4.1 Check Default Microphone
First, let’s verify that PyAudio can detect your default microphone:
import pyaudio
import pprint

p = pyaudio.PyAudio()
pp = pprint.PrettyPrinter(indent=4)
try:
    device_info = p.get_default_input_device_info()
    pp.pprint(device_info)
    print(f"\n✓ Microphone available: {device_info.get('name', 'Unknown')}")
except IOError as e:
    print(f"✗ No microphone available: {e}")
    print("Please check your audio input devices and try again.")
finally:
    p.terminate()
Example Output
If everything is working, you’ll see a dictionary of device properties printed. The key fields:
name: Your microphone’s name (useful if you have multiple devices)
defaultSampleRate: The default recording rate (usually 44100 Hz)
maxInputChannels: Maximum channels supported (1=mono, 2=stereo)
4.2 List All Available Devices
If you have multiple microphones or want to choose a specific one, use this code to list all available input devices:
import pyaudio

p = pyaudio.PyAudio()
try:
    print("Available Audio Input Devices:")
    print("=" * 70)
    device_count = p.get_device_count()
    if device_count == 0:
        print("✗ No audio devices found")
    else:
        input_devices_found = False
        for i in range(device_count):
            try:
                info = p.get_device_info_by_index(i)
                if info['maxInputChannels'] > 0:
                    input_devices_found = True
                    print(f"\nDevice Index: {i}")
                    print(f"  Name: {info['name']}")
                    print(f"  Max Input Channels: {info['maxInputChannels']}")
                    print(f"  Default Sample Rate: {int(info['defaultSampleRate'])} Hz")
                    print(f"  Host API: {info['hostApi']}")
                    print("-" * 70)
            except Exception as e:
                print(f"Warning: Could not get info for device {i}: {e}")
                continue
        if not input_devices_found:
            print("\n✗ No input devices found. Please connect a microphone.")
except Exception as e:
    print(f"✗ Error listing devices: {e}")
finally:
    p.terminate()
Why This Matters
If you see multiple devices, you’ll need to specify which one to use when recording. Make note of the Device Index number – you’ll use this later if you want to record from a specific microphone.
For example, if you have:
Device 0: Built-in Microphone
Device 2: USB Microphone
Device 4: Bluetooth Headset
You can choose Device 2 (USB Microphone) by passing input_device_index=2 when opening the audio stream.
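For instance (a sketch; the other parameters mirror the recording code later in this guide):

import pyaudio

p = pyaudio.PyAudio()
stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=44100,
    input=True,
    input_device_index=2,   # Use Device 2 (USB Microphone)
    frames_per_buffer=1024
)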
4.3 Test Recording Capability
Why test recording capability? The previous checks only verify that PyAudio can see your microphone. This test ensures it can actually capture audio data. Some issues (like incorrect permissions, driver problems, or conflicting applications) only appear when you try to record.
Here’s a quick test to verify your microphone can actually capture audio (not just that it exists):
import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100

p = pyaudio.PyAudio()
stream = None
try:
    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK
    )
    print("Testing microphone...")
    print("Speak into your microphone for 2 seconds...")
    for i in range(0, int(RATE / CHUNK * 2)):
        data = stream.read(CHUNK, exception_on_overflow=False)
    print("✓ Microphone test successful!")
    print("Your microphone is working and ready to record.")
except IOError as e:
    print(f"✗ Microphone test failed: {e}")
    print("\nPossible issues:")
    print("- Microphone is not connected properly")
    print("- Another application is using the microphone")
    print("- Sample rate not supported by your device")
except Exception as e:
    print(f"✗ Unexpected error during microphone test: {e}")
finally:
    # Cleanup stream if it was opened
    if stream is not None:
        try:
            if stream.is_active():
                stream.stop_stream()
            stream.close()
        except:
            pass
    # Always cleanup PyAudio
    p.terminate()
This test actually attempts to record audio, which catches issues that won’t show up in a simple device query.
4.4 Troubleshooting Tips
No Devices Found
If you don’t see any input devices:
Check physical connection: Make sure your microphone is plugged in
Check OS settings:
macOS: System Preferences → Sound → Input
Windows: Settings → System → Sound → Input
Linux: Run arecord -l to list recording devices
Restart your application: Sometimes the OS needs to refresh device list
Multiple Devices Listed
If you see devices you don’t recognize:
Some sound cards appear as multiple devices (different modes/configs)
Virtual audio devices (like Loopback or Soundflower) also show up
When in doubt, test each device index to find the right one
Sample Rate Errors
If you get errors about unsupported sample rates, check which rates your device supports:
import pyaudio

p = pyaudio.PyAudio()
device_index = 0  # Change this to your device
try:
    print(f"Testing supported sample rates for device {device_index}...")
    # Verify device exists
    try:
        device_info = p.get_device_info_by_index(device_index)
        print(f"Device: {device_info['name']}\n")
    except IOError:
        print(f"✗ Device {device_index} not found")
        print("Run the device listing code to see available devices.")
    else:
        for rate in [8000, 16000, 22050, 44100, 48000, 96000]:
            try:
                supported = p.is_format_supported(
                    rate,
                    input_device=device_index,
                    input_channels=1,
                    input_format=pyaudio.paInt16
                )
                if supported:
                    print(f"✓ {rate} Hz - Supported")
            except ValueError:
                print(f"✗ {rate} Hz - Not supported")
            except Exception as e:
                print(f"✗ {rate} Hz - Error testing: {e}")
except Exception as e:
    print(f"✗ Error during sample rate testing: {e}")
finally:
    p.terminate()
Most modern microphones support 44100 Hz (CD quality) and 48000 Hz (professional audio), so stick with one of those.
4.5 Ready to Record?
If your microphone shows up in the device list and the test recording runs without errors, you’re all set! The next step is understanding the audio parameters before we write the actual recording code.
5. Record the audio
Now for the main event – actually recording audio! The process is straightforward: open an audio stream, read data in chunks, and store those chunks in memory.
5.1 How Recording Works
Think of it like filling a bucket with water from a tap:
The microphone is the tap (constant stream of audio)
The CHUNK size is your bucket (how much you grab at once)
The frames list is your storage tank (where you keep everything)
The following is the code snippet to record the audio:
import pyaudio
import wave

CHUNK = 512
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 5

p = pyaudio.PyAudio()
stream = None
try:
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    print("* Recording started...")
    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        try:
            data = stream.read(CHUNK, exception_on_overflow=False)
            frames.append(data)
        except IOError as e:
            print(f"Warning: Buffer overflow - {e}")
            continue
    print("* Recording complete")
except IOError as e:
    print(f"✗ Recording failed: {e}")
    print("Check that your microphone is connected and not in use.")
    frames = []
except KeyboardInterrupt:
    print("\n✗ Recording interrupted by user")
    frames = []
except Exception as e:
    print(f"✗ Unexpected error: {e}")
    frames = []
finally:
    # Cleanup stream
    if stream is not None:
        try:
            if stream.is_active():
                stream.stop_stream()
            stream.close()
        except:
            pass
    # Cleanup PyAudio
    p.terminate()
5.2 Understanding the Recording Loop
The heart of the recording code is this loop:
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK, exception_on_overflow=False)
    frames.append(data)
This calculates how many chunks we need to record for the desired duration:
RATE (44100) / CHUNK (512) = ~86 chunks per second
86 chunks/sec × 5 seconds = 430 total chunks to record
Each iteration grabs one chunk (512 samples) and adds it to our frames list.
5.3 Why `exception_on_overflow=False`?
This parameter tells PyAudio not to crash if the buffer overflows (when your CPU is too busy to process audio fast enough). Instead, it just drops some frames and continues. For most applications, losing a few milliseconds is better than crashing.
5.4 Parameters Explained
Let’s break down what each parameter means:
CHUNK = 512
Buffer size: how many audio samples to process at once
Smaller values = lower latency, higher CPU usage
512-2048 is typical for recording
Too small: choppy audio, high CPU
Too large: delayed response, memory issues
FORMAT = pyaudio.paInt16
Sample format: how audio data is represented
`paInt16` = 16-bit integers (range: -32768 to 32767)
This is CD-quality and works for most applications
5.5 How the Stream Works
`p.open()` tells your sound card to start capturing audio
The sound card continuously fills an internal buffer with audio data
`stream.read(CHUNK)` grabs the oldest CHUNK samples from that buffer
We store those samples in our `frames` list
Repeat until we have enough audio
`stream.stop_stream()` tells the sound card to stop
5.6 Error Handling Explained
The code handles three types of errors:
IOError: Microphone issues (disconnected, in use, unsupported format)
KeyboardInterrupt: User presses Ctrl+C to stop recording
Exception: Any unexpected error
All errors result in an empty `frames` list, which the save function will detect and handle appropriately.
6. Saving the audio
After recording, we have a list of raw audio chunks in memory. Now we need to save them to a file that other programs can open. We’ll use Python’s built-in `wave` module to create a WAV file.
6.1 Why WAV Format?
WAV (Waveform Audio File Format) is uncompressed audio:
✅ Universal compatibility (plays everywhere)
✅ No quality loss
✅ Simple to work with
❌ Large file sizes (~10MB per minute of CD-quality audio)
For production applications, you might want to compress to MP3 or OGG afterwards, but WAV is perfect for recording because it’s simple and reliable.
Here’s the saving code:
WAVE_OUTPUT_FILENAME = "myaudio.wav"
wf = None
try:
    if not frames:
        print("✗ No audio data to save")
    else:
        # Save the recorded audio file
        wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
        print(f"✓ Audio saved to {WAVE_OUTPUT_FILENAME}")
        # Calculate and display file info
        duration = len(frames) * CHUNK / RATE
        print(f"  Duration: {duration:.1f} seconds")
except IOError as e:
    print(f"✗ Failed to save audio file: {e}")
    print("Check file permissions and disk space.")
except Exception as e:
    print(f"✗ Unexpected error saving file: {e}")
finally:
    if wf is not None:
        try:
            wf.close()
        except:
            pass
6.2 Understanding the WAV File Structure
A WAV file has two parts:
1. Header (metadata):
Number of channels (mono/stereo)
Sample width (bytes per sample)
Sample rate (samples per second)
Total number of frames
2. Data (the actual audio):
All the audio samples concatenated together
For our recording: `b''.join(frames)` combines all chunks
6.3 What Each Line Does
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')  # Open file in write-binary mode
wf.setnchannels(CHANNELS)                   # 1 channel (mono)
wf.setsampwidth(p.get_sample_size(FORMAT))  # 2 bytes (16-bit)
wf.setframerate(RATE)                       # 44100 samples/second
wf.writeframes(b''.join(frames))            # Write all audio data
6.4 File Size Calculation
The code calculates and displays file information:
duration = len(frames) * CHUNK / RATE
# len(frames) = number of chunks recorded
# CHUNK = samples per chunk
# RATE = samples per second
# Result: duration in seconds

file_size = os.path.getsize(WAVE_OUTPUT_FILENAME) / 1024  # Get file size in KB
6.5 Why Check for Empty Frames?
If recording failed or was interrupted, `frames` might be empty. Attempting to save an empty WAV file would create a corrupt file that can’t be played. The check prevents this:
if not frames:
    print("✗ No audio data to save")
6.6 Error Handling
The code handles file system errors:
IOError: Permission denied, disk full, invalid filename
Exception: Unexpected errors during file writing
The `finally` block ensures the file is properly closed even if an error occurs, preventing data corruption.
7. Play the audio
Now that we’ve recorded and saved our audio, let’s play it back to verify everything worked. We’ll use the `simpleaudio` library, which lives up to its name – it’s refreshingly simple.
7.1 Why SimpleAudio?
There are several audio playback libraries for Python:
simpleaudio: Simple, lightweight, no dependencies
pygame: Full game library (overkill for just audio)
pyaudio: Can play audio but requires more setup
playsound: Even simpler but less control
I chose simpleaudio because it’s the sweet spot between simplicity and functionality.
Here’s the playback code:
import simpleaudio as sa

filename = "myaudio.wav"
try:
    print(f"▶ Playing: {filename}")
    wave_obj = sa.WaveObject.from_wave_file(filename)
    play_obj = wave_obj.play()
    play_obj.wait_done()
    print("✓ Playback complete")
except FileNotFoundError:
    print(f"✗ File not found: {filename}")
    print("Make sure the file exists and the path is correct.")
except Exception as e:
    print(f"✗ Playback failed: {e}")
    print("Check that your audio output device is working.")
7.2 How Playback Works
The process is straightforward:
1. Load the WAV file:
wave_obj = sa.WaveObject.from_wave_file(filename)
This reads the file and creates a playback object in memory.
2. Start playback:
play_obj = wave_obj.play()
This sends the audio to your speakers and returns immediately (non-blocking).
3. Wait for completion:
play_obj.wait_done()
This blocks until the audio finishes playing.
7.3 Why Non-blocking Playback?
The `play()` method returns immediately without waiting for audio to finish. This allows you to:
Show a progress bar while audio plays
Allow users to stop playback early
Play multiple sounds simultaneously
Continue other work while audio plays
If you don’t need these features, just call `wait_done()` immediately.
7.4 Error Handling
The code handles two specific errors:
FileNotFoundError: The WAV file doesn’t exist
Check the filename and path
Make sure the recording/saving steps completed successfully
General Exception: Audio device issues
Sound card not working
File format corrupted
Output device in use by another application
7.5 Alternative: Play Without Installing simpleaudio
If you can’t install simpleaudio, PyAudio can also play audio:
import pyaudio
import wave

# Open the WAV file
wf = wave.open("myaudio.wav", 'rb')

# Initialize PyAudio
p = pyaudio.PyAudio()

# Open stream
stream = p.open(
    format=p.get_format_from_width(wf.getsampwidth()),
    channels=wf.getnchannels(),
    rate=wf.getframerate(),
    output=True
)

# Play audio
data = wf.readframes(1024)
while data:
    stream.write(data)
    data = wf.readframes(1024)

# Cleanup
stream.close()
p.terminate()
wf.close()
However, simpleaudio’s API is much cleaner for basic playback.
8. Complete Code
Here’s everything put together in one production-ready script. This code includes all the error handling, resource cleanup, and user feedback we’ve discussed. You can copy-paste this directly and it will work.
8.1 What This Script Does
Records 5 seconds of audio from your default microphone
Saves it as “myaudio.wav” in the current directory
Displays recording progress and file information
Handles errors gracefully with helpful messages
Cleans up all resources properly
8.2 Customization Options
Before running, you can modify these parameters at the top:
CHUNK: 512 works well; try 1024 for lower CPU usage
CHANNELS: Change to 2 for stereo recording
RATE: Use 48000 for higher quality or 16000 for smaller files
RECORD_SECONDS: Any positive number
WAVE_OUTPUT_FILENAME: Change the output filename
import os
import pyaudio
import wave

CHUNK = 512
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "myaudio.wav"

p = pyaudio.PyAudio()
stream = None
wf = None
try:
    # Open audio stream
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    print("* Recording started...")
    print(f"  Duration: {RECORD_SECONDS} seconds")
    frames = []

    # Record audio
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        try:
            data = stream.read(CHUNK, exception_on_overflow=False)
            frames.append(data)
        except IOError:
            print("Warning: Buffer overflow - continuing...")
            continue
    print("* Recording complete")

    # Save the recorded audio file
    if frames:
        wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))

        # Calculate file info
        duration = len(frames) * CHUNK / RATE
        file_size = os.path.getsize(WAVE_OUTPUT_FILENAME) / 1024  # KB
        print(f"✓ Saved to {WAVE_OUTPUT_FILENAME}")
        print(f"  Duration: {duration:.1f} seconds")
        print(f"  Size: {file_size:.1f} KB")
    else:
        print("✗ No audio data recorded")
except IOError as e:
    print(f"✗ Recording failed: {e}")
    print("Check that your microphone is connected and not in use.")
except KeyboardInterrupt:
    print("\n✗ Recording interrupted by user")
except Exception as e:
    print(f"✗ Unexpected error: {e}")
finally:
    # Cleanup stream
    if stream is not None:
        try:
            if stream.is_active():
                stream.stop_stream()
            stream.close()
        except:
            pass
    # Cleanup PyAudio
    if p is not None:
        try:
            p.terminate()
        except:
            pass
    # Cleanup wave file
    if wf is not None:
        try:
            wf.close()
        except:
            pass
9. Troubleshooting
9.2 Recording Problems
Recording uses the wrong microphone
Problem: Audio is captured from the wrong input device
Solution: pass an explicit device index when opening the stream:
stream = p.open(..., input_device_index=0)  # Try 0, 1, 2, etc.
Frequent buffer overflows
Problem: Audio is choppy with many “Buffer overflow” warnings
Solutions:
# Increase CHUNK size
CHUNK = 2048  # or 4096

# Lower sample rate
RATE = 22050  # instead of 44100

# Use callback-based recording (more advanced)
Recording is silent or too quiet
Problem: Audio file plays but volume is very low
Solutions:
Check microphone input level in system settings
Try a different microphone or input device
Boost volume after recording (requires additional processing; see the sketch below)
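For example, with pydub (a sketch; the +10 dB gain and filenames are illustrative):

from pydub import AudioSegment

# Boost the recording by 10 dB and save it to a new file
sound = AudioSegment.from_wav("myaudio.wav")
louder = sound.apply_gain(10)
louder.export("myaudio-louder.wav", format="wav")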
9.3 Playback Problems
“No such file or directory”
Problem: WAV file wasn’t created or is in wrong location
Solutions:
import os

# Check if file exists
if os.path.exists("myaudio.wav"):
    print("File found!")
else:
    print("File not found. Check recording step.")

# Use absolute path
WAVE_OUTPUT_FILENAME = "/full/path/to/myaudio.wav"
“wave_obj.play() does nothing”
Problem: Playback doesn’t work
Solutions:
# Make sure you wait for playback
play_obj = wave_obj.play()
play_obj.wait_done()  # Don't forget this!

# Check output device
# Try playing in VLC or other player to verify file is good
9.4 Permission Issues (Linux)
"Permission denied" accessing audio device
Problem: User doesn’t have audio permissions
Solution:
# Add user to audio group
sudo usermod -a -G audio $USER

# Log out and log back in

# Verify membership
groups | grep audio
10.5 File Size Optimization
WAV files are large; converting them to MP3 with pydub shrinks them dramatically:

from pydub import AudioSegment

# Convert to MP3 (much smaller)
sound = AudioSegment.from_wav("recording.wav")
sound.export("recording.mp3", format="mp3", bitrate="128k")
# Reduces file size by ~90%
10.6 Testing
Write tests for audio functionality:
import unittest
import os

class TestAudioRecording(unittest.TestCase):
    # record_audio() is assumed to be a helper that records
    # `duration` seconds of audio to test.wav
    def test_recording_creates_file(self):
        record_audio(duration=1)
        self.assertTrue(os.path.exists("test.wav"))

    def test_file_size_reasonable(self):
        record_audio(duration=1)
        size = os.path.getsize("test.wav")
        # 1 second at 44100 Hz, 16-bit, mono ≈ 88KB
        self.assertGreater(size, 80000)
        self.assertLess(size, 100000)
10.7 Logging
Add logging instead of print statements:
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Instead of print()
logger.info("Recording started")
logger.warning("Buffer overflow detected")
logger.error("Failed to save file")
10.8 Memory Management
For very long recordings, write to disk incrementally:
wf = wave.open(filename, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)

# Write as you record (no frames list needed)
for i in range(total_chunks):
    data = stream.read(CHUNK)
    wf.writeframes(data)  # Write immediately
wf.close()