Large Language Models (LLMs) Inference and Serving

You might have encountered performance issues while running LLM (Large Language Model) models in a production environment. In that case, consider using an inference engine or server that handles many such issues off the shelf for you. What follows is an overview of the landscape of LLM inference and serving.

Several LLM inference engines and servers are available to deploy and serve LLMs in production. The following are some of the most prominent among them:

  1. vLLM
  2. LightLLM
  3. LMDeploy
  4. SGLang
  5. OpenLLM
  6. Triton Inference Server with TensorRT-LLM
  7. Ray Serve
  8. Hugging Face – Text Generation Inference (TGI)
  9. DeepSpeed-MII
  10. CTranslate2
  11. BentoML
  12. MLC LLM

Throughput vs. Latency

Throughput and latency are two important metrics for evaluating LLM inference and serving. Throughput refers to the number of output tokens an LLM can generate per second. Latency refers to the time it takes for a large language model (LLM) to process an input and generate a response. For latency, the important metric is “Time to First Token” (TTFT) which refers to the amount of time it takes for a language model to generate the first token of its response after receiving a prompt.
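
To make these metrics concrete, here is a minimal sketch of how TTFT and throughput could be measured on a token stream. The stream_tokens generator is a hypothetical stand-in for whatever streaming client your inference engine or server provides.

import time

def stream_tokens(prompt):
    # Hypothetical stand-in for your engine's streaming API; it just yields canned tokens.
    for token in ["Paris", " is", " the", " capital", " of", " France", "."]:
        time.sleep(0.05)  # simulate per-token generation delay
        yield token

start = time.perf_counter()
ttft = None
n_tokens = 0
for token in stream_tokens("What is the capital of France?"):
    if ttft is None:
        ttft = time.perf_counter() - start  # Time to First Token
    n_tokens += 1
total = time.perf_counter() - start

print(f"TTFT: {ttft:.3f} s")
print(f"Throughput: {n_tokens / total:.1f} tokens/s")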

LLM inference engines and servers are intended to optimise LLM memory utilisation and performance in the production environment. They assist you with achieving high throughput and low latency, guaranteeing that your LLMs can handle a huge number of requests while responding rapidly. Based on your specific use case, you may also have additional factors that would influence your decision to select a particular engine/server.

Inference Engine vs. Inference Server

Inference engines run the models and are in charge of everything needed for the generation process. Inference servers handle incoming and outgoing HTTP and gRPC requests from the end users of your application, as well as metrics for measuring the deployment performance of your LLM.
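
To make the server side concrete, here is a minimal sketch of sending a request to an OpenAI-compatible HTTP endpoint such as the one vLLM exposes. The base URL, port, and model name below are assumptions and depend on how you start the server.

import requests

# Assumed endpoint of a locally running OpenAI-compatible server (e.g. vLLM's default port).
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed name; use whatever model you served
    "prompt": "Explain the difference between throughput and latency in one sentence.",
    "max_tokens": 64,
}
response = requests.post(f"{BASE_URL}/completions", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["text"])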

Techniques/Terminologies used across these Frameworks

  1. KV cache (see the sketch after this list)
  2. PagedAttention
  3. Batching
  4. Support for quantisation
  5. LoRA: Low-Rank Adaptation: It is a technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
  6. Tool calling: Tool calling, also known as function calling, is a technique that enables Large Language Models (LLMs) to request information from external tools. This enables LLMs to obtain information for which they were not trained or perform actions that are beyond their own capacities.
  7. Reasoning models: e.g. DeepSeek R1
  8. Structured Outputs:
    • outlines: Structured Text Generation
    • LMFE (Language Model Format Enforcer): Enforce the output format (JSON Schema, Regex etc) of a language model.
    • xgrammar: Efficient, Flexible and Portable Structured Generation
  9. Automatic Prefix Caching (APC): APC is a technique that speeds up Large Language Models (LLMs) by reusing cached results for similar queries.
  10. Speculative Decoding [4][5]
  11. Chunked Prefill
  12. Prompt Adapter
  13. Beam Search
  14. Guided decoding
  15. AsyncLM: It improves LLM’s operational efficiency by enabling LLMs to generate and execute function calls concurrently.
  16. Prompt logprobs (logarithm of probability)
  17. kserve: Standardized Serverless ML Inference Platform on Kubernetes
  18. kubeai: AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
  19. llama-stack: Composable building blocks to build Llama Apps

Many of the terms and techniques above have been left without elaboration; they are pointers for further exploration.
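
As one example, the KV cache mentioned in the list above can be illustrated with a minimal sketch using the Hugging Face transformers API (GPT-2 is used only because it is small): once the prompt has been processed, each new token is generated by feeding only that token plus the cached key/value tensors, instead of re-running attention over the whole prefix.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

generated = tok("The capital of France is", return_tensors="pt").input_ids
past_key_values = None  # this is the KV cache

with torch.no_grad():
    for _ in range(10):
        if past_key_values is None:
            # First pass: process the whole prompt and fill the cache.
            out = model(input_ids=generated, use_cache=True)
        else:
            # Later passes: feed only the newest token; attention reuses the cached keys/values.
            out = model(input_ids=generated[:, -1:], past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tok.decode(generated[0]))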

References

  1. Best LLM Inference Engines and Servers to deploy LLMs in Production
  2. Efficient Memory Management for Large Language Model Serving with PagedAttention
  3. LoRA: Low-Rank Adaptation of Large Language Models
  4. Fast Inference from Transformers via Speculative Decoding
  5. Looking back at speculative decoding

How do you choose among competing open-source products? Example comparison of open-source vector databases.

The following are the questions that should be running through your mind when choosing among competing open-source products:

  1. Are there any other users out there?
  2. Is it the most popular in this category?
  3. Is this technology in decline?

The popularity and traction of GitHub projects can be inferred from their star histories. You can use star-history.com to compare projects on these two metrics. Refer to the tutorial for details.

Following is the comparison of vector databases: qdrant, chroma, weaviate, marqo, milvus, and vespa.

Star History

Hands-on Tutorial on Making an Audio Bot using LLM, and RAG

1. Learning Outcome

  1. Learn about Large Language Models (LLMs), their installation, access through HTTP API, the Ollama framework, etc.
  2. Introduction of Retrieval-Augmented Generation (RAG)
  3. Learn about the Data Ingestion Pipeline for the Qdrant vector database
  4. Learn about the RAG Pipeline
  5. Access the prototype using audio-based input/output (audio bot).
  6. Audio bot using speech-to-text and text-to-speech
  7. Making Qdrant client for making queries
  8. Creating context using documents retrieved from the Qdrant database
  9. Using Llama 3.2 as a Large Language Model (LLM) using Ollama and Langchain framework
  10. Create a prompt template using an instruction and context, and make a query using the template and LangChain
  11. Using Llama to generate the answer to the question using the given context

2. Large Language Model

Large Language Model (LLM) is an artificial intelligence (AI) model trained to comprehend and produce language similar to a human’s. It can learn linguistic structures, relationships, and patterns since it has been trained on enormous volumes of text data. Transformer architecture is often the foundation of large language models, allowing them to

  1. process sequential data, which makes them suitable for tasks like text generation, translation, and question-answering
  2. learn contextual relationships such as word meanings, syntax, and semantics.
  3. generate human-like language, i.e. produce coherent, context-specific text that resembles human-generated content

Some key characteristics of large language models include:

  1. Trained on vast amounts of training data
  2. Scalability: They can handle long sequences and complex tasks.
  3. Parallel processing

Some of the popular examples of Large Language Models are BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Pretraining Approach), XLNet, T5 (Text-to-Text Transfer Transformer), Llama, etc.

Some of the popular use cases are:

  1. Language Translation
  2. Text summarization
  3. Sentiment analysis
  4. Chatbots and virtual assistants
  5. Question answering
  6. Content generation

Major concerns regarding LLMs are:

  1. Data bias: LLMs have the potential to reinforce biases found in the training data.
  2. Interpretability: It can be difficult to comprehend the decision-making process of an LLM.
  3. Security: Adversarial attacks can target LLMs.

2.1 Llama

Llama is an open-source AI model you can fine-tune, distill, and deploy anywhere. Current versions are available in three flavors:

  1. Llama 3.1: With Multilingual capability and available in two versions 1) 8B with 8 billion parameters and 2) 405B with 405 billion parameters
  2. Llama 3.2: Lightweight and Multimodal and available in 1) Lightweight 1B and 3B 2) Multimodal 11B and 90B
  3. Llama 3.3: Multilingual with 70B parameters

In the current prototype, I have used Llama 3.2.

2.2 Install, run, and different ways to access Llama

Please refer to my separate post on this topic, Install, run, and access Llama using Ollama.

3. Retrieval-Augmented Generation (RAG)

Large language models are locked in time: they have learned only the knowledge that was available up to the point they were trained and released. These models are trained on Internet-scale open data, so when you ask general questions they give very good answers, but they may fail to answer, or may hallucinate, when you get very specific about your personal or enterprise data. The reason is that they usually do not have the right context for requirements that are specific to your application.

Retrieval-augmented generation (RAG) combines the strength of generative AI and retrieval techniques. It helps in providing the right context for the LLM along with the question being asked. This way we get the LLM to generate more accurate and relevant content. This is a cost-effective way to improve the output of LLMs without retraining them. The following diagram depicts the RAG architecture:

Fig1: Conceptual flow of using RAG with LLM

There are two primary components of the RAG:

  1. Retrieval: This component is responsible for searching and retrieving relevant information related to the user’s query from various knowledge sources such as documents, articles, databases, etc.
  2. Generation: This component does an excellent job of crafting coherent and contextually rich responses to user queries.

A question submitted by the user is routed to the retrieval component. Using the embedding model, the retrieval component converts the query text into an embedding vector. It then searches the vector database to locate a small number of vectors that match the query and satisfy the threshold requirements for the similarity score and distance metric. The matched vectors are mapped back to their text chunks, which are used as the context. This context, along with the instruction and query, is placed in the prompt template and sent to the LLM. The LLM returns generated text that is more accurate and relevant to the user's query.
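
To illustrate the similarity search at the heart of the retrieval step, the following is a toy sketch with made-up vectors; a real system uses an embedding model and a vector database such as Qdrant, as in the implementation below.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings for three stored chunks (real ones have hundreds of dimensions).
chunks = {
    "Democracy is a form of government ...": np.array([0.9, 0.1, 0.0]),
    "The Himalayas are a mountain range ...": np.array([0.0, 0.8, 0.6]),
    "Elections are held every five years ...": np.array([0.7, 0.2, 0.1]),
}
query_vector = np.array([0.8, 0.15, 0.05])  # made-up embedding of the user's question

THRESHOLD = 0.7
context = [text for text, vec in chunks.items()
           if cosine_similarity(query_vector, vec) >= THRESHOLD]
print(context)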

4. RAG (Data Ingestion Pipeline)

In order for the retrieval component to have a searchable index of preprocessed data, we must first build a data ingestion pipeline. The diagram in Fig2 depicts this pipeline. Knowledge sources can be web pages, text documents, PDF documents, etc., and the text needs to be extracted from these sources. I am using PDF documents as the only knowledge source for this prototype.

  1. Text Extraction: To extract the text from a PDF, various Python libraries can be used, such as PyPDF2, pdf2txt, PDFMiner, etc. If the PDF is a scanned one, libraries such as unstructured, pdf2image, and pytesseract can be utilized. The quality of the text can be maintained by performing cleanups such as removing extraneous characters and fixing formatting issues, whitespace, special characters, and punctuation, along with spell checking. Language detection may also be required if the knowledge sources contain text in multiple languages, or if a single document contains multiple languages.
  2. Handling Multiple Pages: Maintaining the context across pages is important. It is recommended that the document be segmented into logical units, such as paragraphs or sections, to preserve the context. Extracting metadata such as document titles, authors, page numbers, creation dates, etc., is crucial for improving searchability and answering user queries.

Fig2: RAG data ingestion pipeline

Note: I have manually downloaded the PDFs of all the chapters of the book “Democratic Politics” of class IX of the NCERT curriculum. These PDFs will be the knowledge source for our application.

4.1 Implementation step by step

Step 1: Install the necessary libraries

  1. pip install pdfminer.six
  2. pip install langchain-ollama

Imports:

from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient

Step 2: Load the pdf file and extract the text from it

loader = PDFMinerLoader(path + "/" + file_name)
pdf_content = loader.load()

Step 3: Split the text into smaller chunks with overlap

CHUNK_SIZE = 1000 # chunk size not greater than 1000 chars
CHUNK_OVERLAP = 30 # a bit of overlap is required for continued context

text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
docs = text_splitter.split_documents(pdf_content)

# Make a list of split docs
documents = []
for doc in docs:
    documents.append(doc.page_content)

Step 4: Embed and store the documents in the vector database

FastEmbed is a lightweight, fast Python library built for embedding generation, and the Qdrant vector database client uses it by default. Following is the code snippet for inserting documents into the vector database.

# 3. Create vectordatabase(qdrant) client 
qdrant_client = QdrantClient(url="http://localhost:6333")

# 4. Add document chunks in vectordb
qdrant_client.add(
    collection_name="ix-sst-ncert-democratic-politics",
    documents=documents,
    #metadata=metadata,
    #ids=ids
)
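
Under the hood, qdrant_client.add() calls FastEmbed to embed the document chunks before inserting them. If you want to generate the embeddings explicitly, the following is a minimal sketch; it assumes a recent fastembed release that exposes the TextEmbedding class, which downloads a small default embedding model on first use.

from fastembed import TextEmbedding

# Assumed API of recent fastembed releases; a small English embedding model is used by default.
embedding_model = TextEmbedding()

documents = [
    "Democracy is a form of government in which the rulers are elected by the people.",
    "Elections are a mechanism by which people can choose their representatives.",
]

# embed() returns a generator of numpy vectors, one per document.
embeddings = list(embedding_model.embed(documents))
print(len(embeddings), len(embeddings[0]))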

Step 5: Making a sample query

# 5. Make a query from the vectordb(qdrant)
search_results = qdrant_client.query(
    collection_name="ix-sst-ncert-democratic-politics",
    query_text="What is democracy?"
)

for search_result in search_results:
    print(search_result.document, search_result.score)
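
Section 4 mentioned extracting metadata such as titles and page numbers. If you want to store such metadata alongside each chunk, the following is a hedged sketch; it assumes the metadata and ids keywords of qdrant_client.add() that are shown commented out above, with file_name taken from the ingestion loop in data_ingestion.py.

import uuid

# Per-chunk metadata, parallel to the documents list.
metadata = [{"source": file_name, "chunk": i} for i in range(len(documents))]
# Explicit ids; UUID strings avoid collisions across files.
ids = [str(uuid.uuid4()) for _ in documents]

qdrant_client.add(
    collection_name="ix-sst-ncert-democratic-politics",
    documents=documents,
    metadata=metadata,
    ids=ids,
)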

4.2 Complete Code data_ingestion.py

###############################################################
# Data ingestion pipeline 
# 1. Taking the input pdf file
# 2. Extracting the content
# 3. Divide into chunks
# 4. Use embeddings model to convert to the embedding vectors
# 5. Store the embedding vectors to the qdrant (vector database)
################################################################
import os
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient

path = "ix-sst-ncert-democratic-politics"
filenames = next(os.walk(path))[2]

for i, file_name in enumerate(filenames):
    print(f"Data ingestion for the chapter: {i}")

    # 1. Load the pdf document and extract text from it
    loader = PDFMinerLoader(path + "/" + file_name)
    pdf_content = loader.load()
    print(pdf_content)

    # 2. Split the text into small chunks
    CHUNK_SIZE = 1000 # chunk size not greater than 1000 chars
    CHUNK_OVERLAP = 30 # a bit of overlap is required for continued context

    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
    docs = text_splitter.split_documents(pdf_content)

    # Make a list of split docs
    documents = []
    for doc in docs:
        documents.append(doc.page_content)

    # 3. Create vectordatabase(qdrant) client 
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 4. Add document chunks in vectordb
    qdrant_client.add(
        collection_name="ix-sst-ncert-democratic-politics",
        documents=documents,
        #metadata=metadata,
        #ids=ids
    )

    # 5. Make a query from the vectordb(qdrant)
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text="What is democracy?"
    )

    for search_result in search_results:
        print(search_result.document, search_result.score)

5. RAG (Information Retrieval and Generation) – Audio Bot

I am making an audio bot that will answer questions from the chapters of the book “Democratic Politics” of class IX of the NCERT(India) curriculum. If you want to learn about making an audio bot, you can read my article on the topic “Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB“.

5.1 Audio Bot Implementation

The following diagram depicts the overall flow of the audio bot and how it interacts with the RAG system. A user interacts with the audio bot through the microphone. The microphone captures the speech audio signal and passes it to the speech-to-text library (I am using faster_whisper), which converts it into a text query that is passed to the RAG system. When the RAG system comes up with the response text, this text is passed to the text-to-speech library (I am using pyttsx3), which converts it to audio that is then played on the speaker so the user can listen to the response.

5.1.1 Recording Audio from Microphone

I have a detailed blog on this topic. Please refer: How to Record, Save and Play Audio in Python?

5.1.2 Speech-to-Text

faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

Installation: pip install faster-whisper

Save the following code in a Python file, say "speech-to-text.py", and run python speech-to-text.py

from faster_whisper import WhisperModel

model_size = "small.en"
model = WhisperModel(model_size, device="cpu", 
                     compute_type="int8")

# Transcribe
transcription = model.transcribe(
    audio="basic_output1.wav",
    language="en",
)

seg_text = ''
for segment in transcription[0]:
    seg_text += segment.text

print(seg_text)

Sample input audio file:

Output text: “Please ask me something. I’m listening now”

5.1.3 Text-to-Speech

The best offline text-to-speech library that works on resource-constrained devices is “pyttsx3“.

Installation: pip install pyttsx3

Save the following code in a Python file say "text-to-speech.py" and run python text-to-speech.py

Code Snippet:

import pyttsx3

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')
text_to_speak = "I got your question. Please bear " \
    "with me while I retrieve the answer."
engine.say(text_to_speak)
# Following line is optional: use it if you
# also want to save the audio to a file
engine.save_to_file(text_to_speak, 'speech.wav') 
engine.runAndWait()

Sample input text: “I got your question. Please bear with me while I retrieve the answer.”

5.1.4 Play Audio on the Speaker

I have a detailed blog on this topic. Please refer: How to Record, Save and Play Audio in Python?

5.2 RAG (Generation) using LLM (Llama3.2)

The following code snippet makes a query to Qdrant, retrieves relevant documents, and keeps the documents whose similarity score crosses the threshold.

search_results = qdrant_client.query(
    collection_name="ix-sst-ncert-democratic-politics",
    query_text=query
)
contaxt = ""
for search_result in search_results:
    if search_result.score >= 0.7:
        contaxt + search_result.document

The following code snippet creates a template, builds the prompt from it, creates a reference to the Llama model, composes a chain (the LangChain pipeline for executing the query) from the prompt and the model, and finally invokes the chain to get the response formed by the LLM using the retrieved context.

# 4. Using LLM for forming the answer
template = """Instruction: {instruction}
Contaxt: {contaxt}
Query: {query}
"""

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="llama3.2") # Using llama3.2 as llm model

chain = prompt | model

bot_response = chain.invoke({"instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"", 
        "contaxt": contaxt,
        "query": query
      })

5.3 Complete Code audiobot.py

Following is the code snippet for the audio bot. Save the file as audiobot.py

import pyaudio
import wave
import pyttsx3
from qdrant_client import QdrantClient
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from faster_whisper import WhisperModel

# Load the Speech to Text Model (faster-whisper: pip install faster-whisper)
whishper_model_size = "small.en"
whishper_model = WhisperModel(whishper_model_size, device="cpu", 
                     compute_type="int8")

CHUNK = 512 
FORMAT = pyaudio.paInt16 #paInt8
CHANNELS = 1
RATE = 44100 #sample rate
RECORD_SECONDS = 7
WAVE_OUTPUT_FILENAME = "pyaudio-output.wav"

def speak(text_to_speak):
    engine = pyttsx3.init()
    engine.setProperty('volume', 1)
    engine.setProperty('rate', 130)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.setProperty('voice', 'english+f3')
    engine.say(text_to_speak)
    engine.runAndWait()

speak("I am an AI bot. I have learned the book \"democratic politics\" of class 9 published by N C E R T. You can ask me questions from this book.")

while True:
    speak("I am listening now for you question.")

    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK) #buffer

    print("* recording")
    frames = []

    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data) # 2 bytes(16 bits) per channel

    stream.stop_stream()
    stream.close()
    p.terminate()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

    # Transcribe
    transcription = whishper_model.transcribe(
        audio=WAVE_OUTPUT_FILENAME,
        language="en",
    )
    seg_text = ''
    for segment in transcription[0]:
        seg_text += segment.text
    
    print(f'\nUser: {seg_text}')

    if seg_text == '':
        speak("Probably you did not say anything.")
        continue
    else:
        text_to_speak = "I got your question. Please bear with me " \
            + "while I retrieve about the answer."
        speak(text_to_speak)

    # 1. Create vector database(qdrant) client 
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 2. Make a query to the vectordb (qdrant)
    #query = "explain democracy in estonia?"
    query = seg_text

    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text=query
    )

    context = ""
    no_of_docs = 2
    count = 1
    for search_result in search_results:
        if search_result.score >= 0.8:
            print(f"Retrieved document: {search_result.document}, Similarity score: {search_result.score}")
            context = context + search_result.document
        if count >= no_of_docs:
            break
        count = count + 1

    #print(f"Retrieved Context: {context}")

    if context == "":
        print("Context is blank. Could not find any relevant information in the given sources.")
        speak("I did not find anything in the book about the question.")
        continue

    # 4. Using LLM for forming the answer
    template = """Instruction: {instruction}
    Context: {context}
    Query: {query}
    """

    prompt = ChatPromptTemplate.from_template(template)

    model = OllamaLLM(model="llama3.2") # Using llama3.2 as llm model

    chain = prompt | model

    bot_response = chain.invoke({"instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"", 
                "context": context,
                "query": query
                })

    print(f'\nBot: {bot_response}')
 
    speak(bot_response)

6. Libraries Used

Following is the list of libraries used in the prototype implementation. These can be installed with the Python pip command (a sample requirements.txt is shown after this list).

  1. qdrant-client
  2. pdfminer.six
  3. fastembed
  4. pyaudio
  5. pyttsx3
  6. langchain-ollama
  7. faster-whisper
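
Based on this list, a minimal requirements.txt for the steps in the next section could look like the following (version pins are omitted; add them as needed):

qdrant-client
pdfminer.six
fastembed
pyaudio
pyttsx3
langchain-ollama
faster-whisper
# Depending on your LangChain version, langchain and langchain-community may also be
# needed for the PDFMinerLoader and CharacterTextSplitter used in data_ingestion.py.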

7. Steps

  1. Create env python -m venv env1
  2. Activate it: run env1\Scripts\activate on Windows, or source env1/bin/activate on Linux (the activate script lives in the bin directory there).
  3. python -m pip install -r requirements.txt
  4. python data_ingestion.py
  5. python audiobot.py

8. My Conversation with the Audio Bot

9. Further Improvement

In the current prototype, the chunk size is of fixed length, CHUNK_SIZE = 1000 and CHUNK_OVERLAP = 30. For further improvement, the document can be split into logical units, such as paragraphs/sections, to maintain a better context.
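
For instance, here is a hedged sketch of paragraph-aware splitting using LangChain's RecursiveCharacterTextSplitter, which prefers paragraph and sentence boundaries before falling back to raw character counts, as a drop-in replacement for the CharacterTextSplitter used in data_ingestion.py:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=30,
    # Try paragraph breaks first, then line breaks, then spaces, then raw characters.
    separators=["\n\n", "\n", " ", ""],
)
docs = text_splitter.split_documents(pdf_content)  # pdf_content as loaded in data_ingestion.py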

10. References

  1. A Practical Approach to Retrieval Augmented Generation Systems by Mehdi Allahyari and Angelina Yang
  2. Install, run, and access Llama using Ollama. – link
  3. How to Record, Save, and Play Audio in Python? – link
  4. Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB – link
  5. fastembed library – link
  6. Qdrant – link
  7. pyttsx3 – link

Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB

Introduction

In this article, I describe my experiment of making a talking bot using the Large Language Model Llama3.2:1b and running it successfully on a Raspberry Pi 4 Model-B with 4GB RAM. Llama3.2:1b is the 1-billion-parameter version of Meta's Llama 3.2 model, small enough in its quantized form for resource-constrained devices. I have kept this bot primarily in question-answering mode to keep things simple. The bot is supposed to answer all the questions that llama3.2:1b can answer from the knowledge learned in the model. The objective is to run this completely offline without needing the Internet.

My Setup

The following picture describes my setup, which consists of a Raspberry Pi to host the LLM (llama3.2:1b), a mic for asking questions, and a pair of speakers to play the answers from the bot. I used the Internet during installation, but the bot itself works in offline mode.

Following is the overall design explaining the end-to-end flow.

The user asks the question in the external microphone connected to the Raspberry Pi. This audio signal captured by the microphone is converted to text using a speech-to-text library. Text is sent to the Llama model running on the Raspberry Pi. The Llama model answers the question in the form of text that is sent to the text-to-speech library. The output of the text-to-speech is audio that is played and can be listened to by the user on the speaker.

Following are the steps of the setup:

  1. Install, run, and access Llama
  2. Installation and accessing Speech-to-text library
  3. Installation of text-to-speech library
  4. Record, Save, and Play audio
  5. Running the code (the complete code)

1. Install, run, and access Llama using the API

The Llama model is the core of this bot product. So before we move further, this should be installed and running. Please refer to the separate post on the topic, “Install, run, and access Llama using Ollama“. This post also describes the details of how to access the running model using the API.

2. Installation of speech-to-text library and how to use

I tried many speech-to-text libraries and finally settled on "faster-whisper". faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, a fast inference engine for Transformer models. The performance of this library on the Raspberry Pi was also satisfactory, and it works offline.

Installation: pip install faster-whisper

Save the following code in a Python file, say "speech-to-text.py", and run python speech-to-text.py

Code Snippet:

from faster_whisper import WhisperModel

model_size = "small.en"
model = WhisperModel(model_size, device="cpu", 
                     compute_type="int8")

# Transcribe
transcription = model.transcribe(
    audio="basic_output1.wav",
    language="en",
)

seg_text = ''
for segment in transcription[0]:
    seg_text += segment.text

print(seg_text)

Sample input audio file:

Output text: “Please ask me something. I’m listening now”

3. Installation of text-to-speech library and how to use

The best offline text-to-speech library that works on resource-constrained devices is “pyttsx3“.

Installation: pip install pyttsx3

Save the following code in a Python file say "text-to-speech.py" and run python text-to-speech.py

Code Snippet:

import pyttsx3

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')
text_to_speak = "I got your question. Please bear " \
    "with me while I retrieve the answer."
engine.say(text_to_speak)
# Following line is optional: use it if you
# also want to save the audio to a file
engine.save_to_file(text_to_speak, 'speech.wav') 
engine.runAndWait()

Sample input text: “I got your question. Please bear with me while I retrieve the answer.”

4. Record, Save, and Play audio

For recording, saving, and playing the audio, I have a separate post. Please refer “How to Record, Save and Play audio in Python?“.

5. The complete code

Following is the complete code. Save the following code in say “llama-bot.py” and run python llama-bot.py

Code Snippet:

import requests
import pyaudio
import wave
import json
import pyttsx3
from subprocess import call
from faster_whisper import WhisperModel
# Load Model
model_size = "small.en"
model = WhisperModel(model_size, device="cpu", 
                     compute_type="int8")

#p = pyaudio.PyAudio()

CHUNK = 512 
FORMAT = pyaudio.paInt16 #paInt8
CHANNELS = 1
RATE = 44100 #sample rate
RECORD_SECONDS = 7
WAVE_OUTPUT_FILENAME = "pyaudio-output.wav"

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')
engine.say("I am an AI bot. You can ask me questions.")
engine.runAndWait()

while True:
    engine = pyttsx3.init()
    engine.setProperty('volume', 1)
    engine.setProperty('rate', 130)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.setProperty('voice', 'english+f3')
    engine.say("I am listening now. Please ask.")
    engine.runAndWait()

    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK) #buffer

    print("* recording")
    frames = []

    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data) # 2 bytes(16 bits) per channel

    stream.stop_stream()
    stream.close()
    p.terminate()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()
    
    engine = pyttsx3.init()
    engine.setProperty('volume', 1)
    engine.setProperty('rate', 130)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.setProperty('voice', 'english+f3')
    text_to_say = "I got your question. Please bear with me " \
        + "while I retrieve about the answer."
    engine.say(text_to_say)
    engine.runAndWait()

    # Transcribe
    transcription = model.transcribe(
        audio=WAVE_OUTPUT_FILENAME,
        language="en",
    )
    seg_text = ''
    for segment in transcription[0]:
        seg_text += segment.text
    #print(seg_text)

    # Call llama
    data = '{}'
    data = json.loads(data)
    data["model"] = "llama3.2:1b"
    data["stream"] = False
    if seg_text == '':
        seg_text = 'Tell about yourself and how you can help.'
    data["prompt"] = seg_text + " Answer in one sentence."

    r = requests.post('http://127.0.0.1:11434/api/generate', 
                      json=data)
    data1 = json.loads(json.dumps(r.json()))

    # Print User and Bot Message
    print(f'\nUser: {seg_text}')
    bot_response = data1['response']
    print(f'\nBot: {bot_response}')
 
    # Text to Speech
    engine = pyttsx3.init()
    engine.setProperty('volume', 1)
    engine.setProperty('rate', 130)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.setProperty('voice', 'english+f3')
    engine.say(bot_response)
    engine.runAndWait()

Sample of the conversation with the bot:

Install, run, and access Llama using Ollama

Learning Outcome

In this post, we will learn about:

  1. What is Ollama?
  2. How do you install the Llama model using the Ollama framework?
  3. Running Llama models
  4. Different ways to access the Ollama model
    • Access the deployed models using Page Assist plugin in the Web Browsers
    • Access the Llama model using HTTP API in Python Language
    • Access the Llama model using the Langchain Library

1. What is Ollama?

Ollama is an open-source tool/framework that facilitates users in running large language models (LLMs) on their local computers, such as PCs, edge devices like Raspberry Pi, etc.

2. How to install it?

Downloads and installations are available for Mac, Linux, and Windows. Visit https://ollama.com/download for instructions.

3. Running Llama 3.2

Four versions of Llama 3.2 models are available: 1B, 3B, 11B, and 90B. 'B' indicates billions of parameters; for example, 1B means that the model has 1 billion parameters. 1B and 3B are text-only models, whereas 11B and 90B are multimodal (text and images).

Run 1B model: ollama run llama3.2:1b

Run 3B model: ollama run llama3.2

After running these models on the terminal, we can interact with the model using the terminal.

4. Access the deployed models using Web Browsers

Page Assist is an open-source browser extension that provides a sidebar and web UI for your local AI model. It allows you to interact with your model from any webpage.

5. Access the Llama model using HTTP API in Python Language

import json
import requests

data = '{}'
data = json.loads(data)
data["model"] = "llama3.2:1b"
data["stream"] = False
data["prompt"] = "What is Newton's law of motion?" + " Answer in short."

# Sent to Chatbot
r = requests.post('http://127.0.0.1:11434/api/generate', json=data)
response_data = json.loads(json.dumps(r.json()))

# Print User and Bot Message
print(f'\nUser: {data["prompt"]}')
bot_response = response_data['response']
print(f'\nBot: {bot_response}')

6. Access the Llama model using the Langchain Library

Dependent library installation: pip install langchain-ollama

from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

query = "What is Newton's law of motion?"

template = """Instruction: {instruction}
Query: {query}
"""

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="llama3.2") # Using llama3.2 as llm model

chain = prompt | model

bot_response = chain.invoke({"instruction": "Answer the question. If you cannot answer the question, answer with \"I don't know.\"", 
                "query": query
                })

print(f'\nUser: {query}')
print(f'\nBot: {bot_response}')

How to Record, Save and Play audio in Python?

Libraries

The following are the required libraries.

  • PortAudio is a free, cross-platform, open-source, audio I/O library. It lets you write simple audio programs in ‘C’ or C++ that will compile and run on many platforms including Windows, Macintosh OS X, and Unix (OSS/ALSA).
  • PyAudio provides Python bindings for PortAudio. Following is the pip command.
    pip install pyaudio
  • wave module of Python3.
  • simpleaudio to play the saved wave audio file. Following is the pip command:
    pip install simpleaudio

Mic check

Check if you have a working microphone on your system. Following is the code snippet you can use.

import pyaudio
import pprint 

p = pyaudio.PyAudio()
pp = pprint.PrettyPrinter(indent=4)

try:
    pp.pprint(p.get_default_input_device_info())
except Exception:
    print("No mics available")

Example output:

Record the audio

Following is the code snippet to record the audio.

import pyaudio
import wave

# Recording parameters (explained below); the same values are used in the complete code.
CHUNK = 512
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "myaudio.wav"

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK) #buffer

print("* recording")
frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data) # 2 bytes(16 bits) per channel

stream.stop_stream()
stream.close()
p.terminate()

The pyaudio.PyAudio() call acquires system resources for PortAudio. PyAudio.open() sets up a stream to play or record audio, and Stream.read() reads audio data from the stream. In the above code, all the audio frames are collected in the frames list; these frames are used for saving the audio file later in the code.

Meaning of parameters to the function open:

  1. FORMAT: PortAudio provides samples in raw PCM (Pulse-Code Modulation) format. That means each sample is an amplitude to be given to the DAC (digital-to-analog converter) in your sound card. For paInt16, this is a value from -32768 to 32767; for paFloat32, it is a floating-point value from -1.0 to 1.0. The sound card converts these values to a proportional voltage that then drives your audio equipment. Available formats are paFloat32, paInt32, paInt24, paInt16, paInt8, paUInt8, and paCustomFormat.
  2. CHANNELS is the number of samples per frame (1 for mono, 2 for stereo).
  3. RATE is the sampling rate (the number of frames per second)
  4. CHUNK is the number of frames read from the stream at a time (the buffer size). This value is chosen somewhat arbitrarily.

Finally, the stream should be closed and all the acquired resources must be released.
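
As a quick sanity check with the values used in the complete code below (RATE = 44100, CHUNK = 512, RECORD_SECONDS = 5): the loop runs int(44100 / 512 * 5) = 430 times, and each read returns 512 frames x 1 channel x 2 bytes = 1024 bytes, so the recording holds 430 x 512 = 220,160 frames (about 4.99 seconds of audio) and roughly 440 KB of raw PCM data.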

Saving the audio

Following is the code snippet to save the audio.

# Save the recorded audio file
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()

The wave module of Python 3 is used for this purpose. The parameters are set to the same values that were used while recording the audio.

Play the audio

Following is the code snippet to play the audio. I have used the simpleaudio library for this purpose; other libraries are available and can be tried, but I found simpleaudio simple enough.

import simpleaudio as sa

# Play the recorded audio file
wave_obj = sa.WaveObject.from_wave_file(WAVE_OUTPUT_FILENAME)
play_obj = wave_obj.play()
play_obj.wait_done()

Complete Code

import pyaudio
import wave
import simpleaudio as sa

CHUNK = 512 
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "myaudio.wav"

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK) #buffer

print("* recording")
frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data) # 2 bytes(16 bits) per channel

stream.stop_stream()
stream.close()
p.terminate()

# Save the recorded audio file
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()

# Play the recorded audio file
wave_obj = sa.WaveObject.from_wave_file("myaudio.wav")
play_obj = wave_obj.play()
play_obj.wait_done()
