👾Standard Input: Prompt Injection

This blog aims to explain the nature of the prompt injection attacks in LLM-based apps and how a security engineer can approach an LLM-based app for testing.

What is Prompt Injection?

The prompt injection is a type of security vulnerability that can occur when large language models (LLMs) are used in applications. It occurs when an attacker can inject malicious text into the prompt that generates the LLM's output. This can cause the LLM to perform unintended actions, such as revealing sensitive information, generating harmful content, or even taking control of the application itself.

Strategic Significance of Prompt Injection

In the offensive security area, the primary mindset revolves around manipulating inputs as an initial strategy. It is commonly believed that every cybersecurity journey begins with a single quote, which can potentially lead to injection attacks. This concept forms the most fundamental starting point in the field. I also accept this as a mindset for my approach to almost everything as a security-minded person. If there is a possibility of altering the input effectively, that rings the bells for possible weaknesses.

Prompts constitute the principal input for most Large Language Model (LLM)-based architectures. These prompts can take various forms, including text, images, voice, or a combination thereof. LLM-based architectures are often responsible for processing sensitive data and may exhibit agentic behaviours, such as executing commands or creating resources in response to prompts. A comprehensive understanding of the attack surface in this context should encompass the capabilities of LLM interactions, the nature of the data, and its processing methods. Therefore, a well-crafted prompt could potentially expose sensitive information or introduce malicious code into the generated outputs, contingent upon the system's capabilities and architecture.

Short Approach of the Prompt Injection Methods

Thinking with the offensive security approach again here. Let's start with the testing styles: Blackbox, Graybox, and Whitebox.

As a summary for the people who are not familiar with those terms,

Blackbox: as a security tester, you have no information about the system. You need to discover everything on your own. You cannot access source code or any other private resources, only public ones.

Graybox: you may have access to some private resources like API explanations, extra parameters that you may submit, etc. It may be different in the different types of tests like web security, network security, etc.

Whitebox: You have access to everything, such as source code, architecture design documents, access to different layers in the system, etc. You can evaluate the system with all the aspects.

The same structure may apply while testing the LLM-based apps. While testing the LLM-based applications, you may use architectural information to improve your tests. Otherwise, it may stay in Blackbox methods if you only send prompts and observe the answers without any information about what is behind them. Still, it can be impactful; however, if you examine the training cycles, training data, tokenization, prompt template, data processing, data storage, etc, you may hit the right spots and catch more issues on your target LLM-based application.

I'd redefine the approaches like the below:

Blackbox: No data about what's behind like in the traditional way.

Graybox: Has access to some resources like tokenization, prompt template, etc.

Whitebox: Has access to all the resources and full access to the model.

What attackers can do over Prompt Injections?

There are various cases in which a prompt injection can lead.

Information Extraction: Attackers can carefully craft prompts to extract confidential, proprietary, or otherwise sensitive information from models trained on private datasets. For instance, trying to get pieces of code, algorithms, or specific details that shouldn't be disclosed.
Model Misdirection: An attacker might manipulate prompts to make the LLM produce incorrect, misleading, or harmful information, leading users astray or causing them to make poor decisions based on the output.
Model Behavior Revelation: Attackers can systematically probe the model to understand its inner workings, biases, and training data specifics. This can expose the model's weaknesses, making it susceptible to more targeted attacks.
Generating Offensive Content: Crafted prompts might coax the model into generating inappropriate, discriminatory, or offensive content, which can be used to discredit the deploying organization or harm users.
Amplifying Biases: Attackers can intentionally create prompts that highlight and amplify inherent biases in the LLM to exploit these biases or demonstrate the model's lack of neutrality.
Social Engineering Attacks: By understanding how the model responds, attackers can craft messages that appear legitimate and use them in phishing or other types of social engineering attacks.
Misrepresentation: Attackers can misuse the model's output to misrepresent facts, views, or beliefs, potentially spreading misinformation.
Automated Attacks: Knowing how an LLM responds, automated systems can be designed to continually exploit the model by overwhelming it with requests or systematically extracting information.
Evasion of Content Filters: If LLMs are used in content moderation, attackers could craft input evading detection by understanding the model's blind spots.

Prompt Engineering for Injection into Prompts

Prompt engineering refers to the art and science of crafting input prompts to effectively guide a machine learning model, especially a Large Language Model, towards producing a desired output. The term often emerges in the context of few-shot or zero-shot learning scenarios where fine-tuning on task-specific data is either unavailable or not desired.

Prompt engineering can involve:

Rephrasing Questions: Sometimes, rewording a question can lead to clearer, more accurate answers from the model. For instance, instead of "Tell me about X," asking, "What is the definition of X?" might yield a more concise and direct response.
Providing Context: Providing additional context or elaborating on the specific aspect of a query can help guide the model in generating more relevant answers.
Explicit Instructions: It can be beneficial to give the model clear instructions on the format or kind of answer expected. For example, ask, "List five examples of X" instead of just "Examples of X."
Use of Examples: In few-shot learning, providing one or more examples can help specify the task. For instance, if you're looking to translate English to French, providing an example like "Translate the following English sentences to French: ..." can guide the model.
Iterative Querying: If the first response from the model isn't satisfactory, refining the prompt or asking follow-up questions based on the model's initial response can be effective.
Temperature and Max Tokens: Beyond just the text prompt, parameters like "temperature" and "max tokens" can be adjusted to influence the randomness and length of the model's outputs.

Prompt engineering is essential because Large Language Models do not always understand the context as humans do. The right prompt can bridge the gap between the model's raw capability and the user's needs. As LLMs become more prevalent, developing expertise in prompt engineering is crucial for extracting maximum utility from these models.

Previous(WIP) Offensive Approach for Prompt Injection Attacks Next(WIP) Training Issues

Last updated 12 months ago

Was this helpful?