Prompt Engineering

A prompt is the text passed to a generative AI model as input. Given the prompt, the model responds with generated text. A prompt can consist of questions, statements, or instructions. Prompt engineering is the practice of designing prompts to improve the generated output. It also serves as 1) a tool for evaluating the output of a model and 2) a building block for safety-mitigation methods. There is no single perfect prompt design; prompt optimization and experimentation are done iteratively. Figure 1 below depicts a very basic example of a prompt.

Figure 1: A basic example of a prompt

Controlling Model Output by Adjusting Model Parameters

The temperature and top_p parameters control the randomness of the output. Before a large language model (LLM) generates a token, it assigns likelihoods to many candidate tokens: some are more likely, others less. For these two parameters to take effect, the do_sample parameter should be set to True, i.e. do_sample=True, meaning the next token is sampled from the set of likely tokens instead of always taking the single most likely one.

temperature defines how likely the model is to choose less likely tokens. temperature=0 means the model will generate the same response every time, because it always picks the most likely token. As the temperature increases, the model gives less likely tokens a greater chance of being selected; at high values the distribution flattens, so the probable tokens become closer to equally likely. For example, a value of 0.8 produces more diverse output, whereas a value of 0.2 produces more focused, nearly deterministic output. In short, temperature induces stochastic behaviour.

top_p, also known as nucleus sampling, restricts the model to a subset of the likely tokens: candidates are considered from most to least likely, and the model stops adding tokens to the sampling pool as soon as their cumulative probability reaches top_p. A value of 1 means all tokens are considered.

The top_k parameter restricts sampling to exactly the k most likely next tokens, where k is its value.

We set these parameters based on the requirements of the use case, looking for the right balance between random/diverse and deterministic/focused/coherent outputs. A sketch of how they might be passed to a model follows.
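
The following is a minimal sketch of setting these parameters with the Hugging Face transformers library; the model name, prompt, and parameter values are placeholders, not recommendations.

```python
from transformers import pipeline

# "gpt2" is just a small placeholder model.
generator = pipeline("text-generation", model="gpt2")

output = generator(
    "The future of AI is",
    do_sample=True,       # enable sampling so the parameters below take effect
    temperature=0.8,      # higher -> more diverse, lower -> more deterministic
    top_p=0.9,            # nucleus sampling: keep tokens until cumulative prob reaches 0.9
    top_k=50,             # consider only the 50 most likely next tokens
    max_new_tokens=40,
    return_full_text=False,
)
print(output[0]["generated_text"])
```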

Instruction-Based Prompting

Providing a large language model (LLM) with clear, specific, and structured instructions to guide its response is referred to as instruction-based prompting. The most basic prompt consists of two components:

  1. the instruction itself and
  2. the data required for the instruction.

The following diagram depicts a basic instruction prompt. Note the instruction and data parts of the prompt.

Figure 2: Instruction Prompt

To make the model more specific about the output, for example when we want it to be either “positive” or “negative”, we can use output indicators. The following diagram depicts an instruction prompt with an output indicator, and a small sketch of such a prompt follows it.

Figure 3: Instruction prompt with output indicators
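
As a minimal sketch, such a prompt could be assembled as follows; the review text is a made-up example.

```python
review = "The battery died within a week."

prompt = (
    "Classify the sentiment of the following review.\n"  # instruction
    f"Review: {review}\n"                                # data
    "Sentiment (positive/negative):"                     # output indicator
)
print(prompt)
```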

Different types of tasks require different formats of the prompt. The following diagram illustrates example formats for summarization, classification, and named-entity recognition.

Figure 4: Prompt format for summarization, classification and NER task

The following is a non-exhaustive list of prompting techniques for improving the quality of the output.

  1. Specificity: Describe exactly what you want to achieve, rather than leaving the model to guess.
  2. Hallucination: LLMs can generate incorrect information with high confidence, which is called hallucination. To mitigate it, instruct the model to respond with “I don’t know” when it does not know the answer (see the sketch after this list).
  3. Order: Place the instruction at either the beginning or the end of the prompt. LLMs focus most on the two ends of a prompt (the primacy effect at the beginning and the recency effect at the end) and tend to lose the middle part of a long prompt.
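
The following sketch combines these techniques: a specific instruction, an explicit “I don’t know” escape hatch, and the instruction repeated at the end of the prompt. The context and question are made-up examples.

```python
context = "The Eiffel Tower is 330 metres tall and was completed in 1889."
question = "Who designed the Eiffel Tower?"

prompt = (
    "Answer the question using only the context below. "
    'If the answer is not in the context, reply with "I don\'t know".\n\n'
    f"Context: {context}\n"
    f"Question: {question}\n\n"
    # Instruction repeated at the end to exploit the recency effect.
    'Remember: use only the context, or reply "I don\'t know".'
)
print(prompt)
```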

As we saw above, the common components of a prompt are the instruction, the data, and output indicators. However, prompts are not limited to these components; we can build up a prompt that is as complex as we want. Other common components are

  1. Persona
  2. Instruction
  3. Context
  4. Format
  5. Audience
  6. Tone
  7. Data

The following is an example from the book [1] that uses the above prompt components. It demonstrates the modular nature of prompting: we can experiment by adding or removing components to see their effect, as in the sketch after the figure.

Figure 5: Example of prompt showing use of the various components.
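
A minimal sketch of assembling such a modular prompt is shown below; every component string is illustrative.

```python
persona = "You are an expert travel writer."
instruction = "Write a short introduction for a city guide."
context = "The guide covers budget travel."
data = "City: Lisbon. Season: spring."
output_format = "Answer in two short paragraphs."
audience = "Written for first-time visitors."
tone = "Use a friendly, practical tone."

# Components can be added or removed to see their effect on the output.
prompt = "\n".join(
    [persona, instruction, context, data, output_format, audience, tone]
)
print(prompt)
```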

In-Context Learning – Providing examples

Giving examples to the LLM greatly influences its output. This is referred to as in-context learning. Zero-shot prompting uses no examples, one-shot prompting uses one example, and few-shot prompting uses two or more examples. The following diagram illustrates examples of in-context learning.

Figure 6: Examples of in-context learning

When giving examples, the user and the assistant turns should be clearly differentiated by marking each role as user or assistant, as in the sketch below. Examples let us describe the desired behaviour to the model more clearly, but the model can still ignore the instruction through random sampling.
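
As a sketch, a few-shot prompt can be expressed as an OpenAI-style chat message list, where each example is a user turn followed by an assistant turn so the roles are explicit; the reviews here are made up.

```python
messages = [
    {"role": "user", "content": "Review: Loved it! Sentiment?"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Total waste of money. Sentiment?"},
    {"role": "assistant", "content": "negative"},
    # The final user turn is the actual question the model should answer.
    {"role": "user", "content": "Review: Exceeded my expectations. Sentiment?"},
]
```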

Chain Prompting: Breaking up the Problem

We already know that we can break a prompt into modular components to enhance the output of LLMs. The next level of this strategy is to break the problem or task itself into subproblems or subtasks. We write a separate prompt for each subtask and then chain the prompts in a sequence, passing the output of one prompt as input to the next, creating a continuous chain of interactions that solves the problem. This is called prompt chaining. Prompt chaining can help to

  1. achieve better performance,
  2. boost the transparency of an LLM application,
  3. increase controllability and reliability,
  4. debug problems with model responses more easily,
  5. improve performance in the specific stages that need it,
  6. build LLM-powered conversational assistants, and
  7. improve the personalization and user experience of your application.

Use cases include

  1. Response validation: We can ask the LLM to validate its previously generated output, or the output of another LLM.
  2. Parallel prompts: In some use cases we run multiple prompts in parallel and then merge their outputs.
  3. Writing stories.

The following example is from the reference book [1]. It illustrates a prompt chain that first creates a product name, then uses this name together with the product features to create a slogan, and finally uses the features, product name, and slogan to create a sales pitch. A code sketch of the same chain follows the figure.

Figure 7: Example of prompt chain
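
The following is a minimal sketch of the same three-step chain, reusing the transformers pipeline from earlier; the model, product features, and prompt wording are all placeholders.

```python
from transformers import pipeline

llm = pipeline("text-generation", model="gpt2")  # placeholder model

def generate(prompt: str) -> str:
    out = llm(prompt, max_new_tokens=50, do_sample=True,
              temperature=0.7, return_full_text=False)
    return out[0]["generated_text"].strip()

features = "lightweight, waterproof, solar-charged"

# Step 1: name from features; step 2: slogan from name; step 3: pitch from all three.
name = generate(f"Suggest one short name for a backpack with these features: {features}\nName:")
slogan = generate(f"Write a one-line slogan for a product called {name}.\nSlogan:")
pitch = generate(f"Write a short sales pitch for {name} ({features}). Use the slogan: {slogan}\nPitch:")
print(pitch)
```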

Reasoning with Generative Models

Reasoning is an important trait of human intelligence. Today's LLMs imitate this reasoning behaviour through memorization of training data and pattern matching. With prompt engineering we can lead them to mimic reasoning processes more closely and thereby enhance their output.

System 1 and System 2 Thinking Process by Daniel Kahneman

Daniel Kahneman, in his famous book “Thinking, Fast and Slow”, introduced the concept of the System 1 and System 2 thinking processes in humans. According to him, System 1 represents our fast, automatic, and intuitive mode of thinking, while System 2 is our slower, more deliberate and conscious mode, which requires effort and attention; essentially, System 1 is “thinking fast” and System 2 is “thinking slow”.

Figure 8: System 1 and System 2 thinking

Inducing System 1 and System 2 Thinking in LLMs

The majority of LLMs today rely on System 1 thinking, but researchers are working on techniques to encourage more System 2-type behaviour, using prompting methods like chain-of-thought to elicit intermediate reasoning steps before arriving at a final response.

Chain-of-Thought: Thinking Before Answering

The main aim of chain-of-thought is to push the model towards System 2 thinking, i.e. thinking before answering, and to allow the model to spend more compute on the reasoning process. Here, the intermediate reasoning steps are referred to as thoughts.

A chain of thought, a series of intermediate reasoning steps, significantly improves the ability of large language models to perform complex reasoning [3]. Prompting using a chain of thought is called chain-of-thought prompting. This technique enables LLMs to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. The chain-of-thought reasoning process is highlighted in the following example, taken from the paper [3]; a sketch of such a prompt follows the figure.

Figure 9: Chain-of-thought example; reasoning process is highlighted – source [3]
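
As a sketch, a few-shot chain-of-thought prompt looks like the following. The worked example is the tennis-ball problem from the paper [3]; the new question is made up.

```python
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n"
    "Q: A library had 120 books. It bought 3 boxes of 15 books each. "
    "How many books does it have now?\n"
    "A:"  # the model is expected to continue with its own reasoning steps
)
print(prompt)
```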

Chain-of-thought normally requires one or more examples of reasoning. However, there is also zero-shot chain-of-thought, which can be achieved by simply adding the phrase “Let’s think step-by-step” to the prompt. The phrase does not need to be exactly this; small variations are fine. The following is an example of zero-shot chain-of-thought, with a sketch after the figure.

Figure 10: Example of zero-shot chain-of-thought – source[1]
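
A sketch of the same idea in code; the question is a made-up example.

```python
question = "If a train travels 60 km in 40 minutes, how far does it go in 2 hours?"

# No worked example is given; the trigger phrase alone elicits the reasoning.
prompt = f"Q: {question}\nA: Let's think step-by-step."
print(prompt)
```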

Self-Consistency: Sampling Outcomes

The paper [4] describes self-consistency as follows: “It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.”

We first prompt the language model with chain-of-thought prompting; then, instead of greedily decoding the single optimal reasoning path, a ‘sample-and-marginalize’ decoding procedure is followed:

  1. prompt the language model with chain-of-thought (CoT) prompting,
  2. replace the ‘greedy decode’ in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths, and
  3. marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.

The following diagram, from the paper [4], illustrates the concept; a code sketch follows it.

Figure 11: Example of self-consistency in CoT[4]
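
A minimal sketch of self-consistency follows: sample several reasoning paths at a non-zero temperature, extract a final answer from each, and take a majority vote. The model and the answer-extraction heuristic are placeholders.

```python
import re
from collections import Counter
from transformers import pipeline

llm = pipeline("text-generation", model="gpt2")  # placeholder model

def final_answer(text: str) -> str:
    # Crude heuristic: take the last number in the reasoning path as its answer.
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else ""

prompt = ("Q: I have 3 apples and buy 2 more. How many apples do I have?\n"
          "A: Let's think step-by-step.")

# Sample five diverse reasoning paths instead of one greedy decode.
paths = llm(prompt, do_sample=True, temperature=0.7, num_return_sequences=5,
            max_new_tokens=60, return_full_text=False)

answers = [final_answer(p["generated_text"]) for p in paths]
print(Counter(answers).most_common(1))  # the most consistent answer wins
```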

Tree of Thoughts: Deliberate Problem Solving

This is another effort in the direction of pushing the model towards the System 2 level of human thinking. The following is a quote from Newell et al. [6]:

“A genuine problem-solving process involves the repeated use of available information to initiate exploration, which discloses, in turn, more information until a way to attain the solution is finally discovered.”

The paper [5] explains Tree-of-Thoughts (ToT) as follows: ToT generalizes over the popular chain-of-thought approach to prompting language models and enables exploration over coherent units of text (“thoughts”) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.

The following diagram from the paper [5] illustrates various approaches to problem solving with LLMs; each rectangular box represents a thought. A schematic sketch of the search loop follows the figure.

Figure 12: Various approaches to problem solving with LLMs.
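
The following is a schematic sketch of a breadth-first Tree-of-Thoughts loop, not the paper’s full algorithm; propose() and score() are hypothetical helpers that would each prompt an LLM in a real system, and are stubbed here so the sketch runs.

```python
def propose(state: str, k: int = 3) -> list[str]:
    # Would ask the LLM for k candidate next thoughts extending `state`.
    return [f"{state} -> thought {i}" for i in range(k)]

def score(state: str) -> float:
    # Would ask the LLM to self-evaluate how promising a partial solution is.
    return float(len(state))

def tree_of_thoughts(problem: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [problem]
    for _ in range(depth):
        # Explore multiple reasoning paths from every kept state.
        candidates = [c for s in frontier for c in propose(s)]
        # Deliberate search: keep only the `beam` most promising thoughts,
        # which is where pruning/backtracking happens.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

print(tree_of_thoughts("Use 4, 9, 10, 13 to make 24"))
```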

Output Verification

It is important to verify and control the output of the model to avoid breakdowns in production and to create a robust AI system. Reasons for validating the output include

  1. Structured output: For example, we may need the output in JSON format.
  2. Valid output: Even if we restrict the output to a few choices, the model may still come up with a new one.
  3. Ethics: The output should be free of profanity, personally identifiable information (PII), bias, cultural stereotypes, etc.
  4. Accuracy: Checking that the output is factually accurate, coherent, and free from hallucination.

Apart from controlling the parameters temperature and top_p, the following are three ways to control the output of a generative model:

  1. Examples: Provide a number of examples of the expected output.
  2. Grammar: Control the token-selection process.
  3. Fine-tuning: Tune the model on data that contains the expected output.

Providing examples

To control the structure of the output, e.g. JSON format, we can provide a few examples in that format to guide the model towards the desired output, as in the sketch below. Still, the model is not guaranteed to comply; some models are better than others at following instructions.
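
A sketch of steering a model towards JSON with few-shot examples; the reviews and labels are made up.

```python
import json

examples = [
    {"review": "Great battery life!", "label": "positive"},
    {"review": "Stopped working after a day.", "label": "negative"},
]
shots = "\n".join(json.dumps(e) for e in examples)

prompt = (
    "Classify the review and answer with a JSON object like the examples:\n"
    f"{shots}\n"
    # Starting the answer for the model nudges it to complete valid JSON.
    '{"review": "Fast shipping and solid build.", "label":'
)
print(prompt)
```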

Grammar: Constrained Sampling

Libraries have been developed to constrain and validate the output of generative models, such as:

  1. Guidance: An efficient programming paradigm for steering language models. With Guidance, you can control how output is structured and get high-quality output for your use case, while reducing latency and cost compared with conventional prompting or fine-tuning. It allows users to constrain generation (e.g. with regexes and CFGs) as well as to interleave control (conditionals, loops, tool use) and generation seamlessly.
  2. Guardrails: A Python framework that helps build reliable AI applications by performing two key functions:
    • Guardrails runs Input/Output Guards in your application that detect, quantify, and mitigate the presence of specific types of risk. For the full suite of risks, see Guardrails Hub.
    • Guardrails helps you generate structured data from LLMs.
  3. LMQL: A programming language for LLMs that enables robust and modular prompting using types, templates, constraints, and an optimizing runtime.

Another approach is to define a grammar or set of rules that the LLM must follow when choosing the next token. For example, in llama-cpp-python we can specify response_format as a JSON object if we want the output in JSON format, as in the sketch below.
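
A minimal sketch with llama-cpp-python is shown below; the model path is a placeholder for any chat-capable GGUF model on disk.

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf")  # placeholder path

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Name three primary colors as JSON."}],
    response_format={"type": "json_object"},  # constrain decoding to valid JSON
)
print(response["choices"][0]["message"]["content"])
```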

References

  1. Jay Alammar and Maarten Grootendorst, Hands-On Large Language Models: Language Understanding and Generation, O'Reilly.
  2. Prompt Engineering Guide, https://www.promptingguide.ai/
  3. Jason Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", Google Research, Brain Team.
  4. Xuezhi Wang et al., "Self-Consistency Improves Chain-of-Thought Reasoning in Language Models", Google Research, Brain Team.
  5. Shunyu Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", NeurIPS 2023.
  6. A. Newell et al., "Report on a General Problem Solving Program", IFIP Congress 1959.
