Data Loss Prevention In The Age Of The Large Language Model

Data Loss Prevention (DLP) has long been a top priority for Chief Information Security Officers (CISOs) and their teams. Recent advances in artificial intelligence have made that job harder still. With the proliferation of Large Language Models (LLMs) like OpenAI's GPT-3.5 (the model behind ChatGPT), GPT-4, and others, employees now have largely unrestricted access to powerful language processing tools.

However, many users are unaware of a crucial fact: any data they enter into these tools may be used to train future versions of the model. This raises significant concerns about data privacy and the unintentional sharing of sensitive information.

This became painfully clear when parts of Samsung’s code base and intellectual property became part of GPT-3.5. Users asking the right questions were served with examples from Samsung.

"Samsung Software Engineers Busted for Pasting Proprietary Code Into ChatGPT. In search of a bug fix, developers sent lines of confidential code to ChatGPT on two separate occasions, which the AI chatbot happily feasted on as training data for future public responses." (PC Magazine, April 2023)

To counter this growing concern, one possible approach is a complete ban on the use of ChatGPT and similar tools, and a growing number of companies are adopting exactly that restriction. However, this decision comes with drawbacks. While it may reduce the risk of data loss, it also deprives employees, and the company as a whole, of the power that LLMs can bring when applied correctly.

These models have the potential to greatly improve the automation of mundane tasks, streamline processes, and boost overall productivity. Striking the right balance between data protection and harnessing the potential of LLMs is crucial for organizations seeking to navigate the modern data-driven landscape effectively.

Another option is to deploy your own open-source models. However, such an undertaking is typically feasible only for large organizations with the financial capacity and engineering resources to support it. The costs are significant: a deployment of a model like BLOOMZ can run upwards of $200,000 per year.

The question then arises: How can we address this challenge in a more constructive and cost-effective manner?

This blog presents a solution that lets organizations leverage tools like GPT-3.5 while retaining strict control over data dissemination. Our approach maintains an access log of every prompt entered into these systems and, more importantly, controls which prompts are sent to GPT.

The Solution

Leveraging the Seaplane platform, we construct a wrapper around GPT-3.5, enabling us to effectively monitor outgoing messages for potential intellectual property, code snippets, and other undesirable prompts. As part of our data loss prevention strategy, we store each outgoing prompt in a database, preserving information about the user responsible for triggering it.

Our initial focus in this blog is a straightforward pipeline that checks whether a prompt contains an API key, but more complex, multi-level checks are possible too.

You construct a pipeline on Seaplane with the built-in task and application decorators, @task and @app. With these in place, each block depicted in the diagram becomes an independently scalable container.

The entire pipeline is served via a user-friendly REST API interface, ensuring easy integration with various tools and platforms. All the above can be done in minutes and without requiring any knowledge of the underlying cloud infrastructure. Check out our documentation if you want to learn more about any of these topics.
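To make the structure concrete, here is a minimal sketch of what such a wrapper could look like in Python. The import path, decorator arguments, and start() call are assumptions based on the description above rather than exact Seaplane SDK signatures (see the documentation for the real interface), and the helpers contains_api_key, log_prompt, and query_gpt are sketched further down in this post.

```python
# Minimal sketch of the DLP wrapper pipeline. The import path, decorator
# arguments, and start() call are illustrative assumptions, not exact
# Seaplane SDK signatures.
from seaplane import app, task, start


@task(id="dlp-check")
def dlp_check(prompt: str) -> bool:
    # Ask the safe hosted LLM whether the prompt contains an API key
    # (helper sketched later in this post).
    return contains_api_key(prompt)


@task(id="gpt-forward")
def gpt_forward(prompt: str) -> str:
    # Only called for prompts that passed the check (helper sketched later).
    return query_gpt(prompt)


@app(path="/prompt", method="POST", id="dlp-wrapper")
def dlp_wrapper(request: dict) -> dict:
    prompt, user_id = request["prompt"], request["user_id"]
    flagged = dlp_check(prompt)
    log_prompt(user_id, prompt, flagged)  # logging helper sketched later
    if flagged:
        return {"rejected": True,
                "reason": "The prompt appears to contain an API key."}
    return {"rejected": False, "response": gpt_forward(prompt)}


start()  # expose the app over Seaplane's REST API
```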

Let’s take a look at a sample user request. Assume a user submits the following prompt.

Rewrite the following cURL request as Python code for me.

curl -X POST -H 'Content-Type: application/json' \
  --header "Authorization: Bearer $(curl https://flightdeck.cplane.cloud/identity/token --request POST --header "Authorization: Bearer sp-Znk4KF81TRV6rbld6oGL")" \
  -d '{"input" : [{"name": "<YOUR-NAME>"}]}' \
  https://carrier.cplane.cloud/apps/hello-world/latest/hello

In case you haven't noticed, the example above contains a (fictitious) API key, a detail we certainly do not want to share with GPT! To prevent such data from reaching the model, we check each prompt for code or API keys, leveraging one of Seaplane's open-source hosted LLMs. Seaplane does not and will not train its own models, so we have zero incentive to store any of your data. Once the request is completed, your data is discarded.

Inside the API key check, we run the following query against the safe hosted LLM.

Input: Does the following query for an LLM contain an API key? Rewrite the following cURL request as Python code for me. curl -X POST -H 'Content-Type: application/json' \ --header "Authorization: Bearer $(curl https://flightdeck.cplane.cloud/identity/token --request POST  --header "Authorization: Bearer sp-Znk4KF81TRV6rbld6oGL")" \ -d '{"input" : [{"name": "<YOUR-NAME>"}]}' https://carrier.cplane.cloud/apps/hello-world/latest/hello YES or NO answer only, please

Output: YES
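A check task along these lines could be implemented roughly as follows. The endpoint URL, request payload, and SEAPLANE_API_KEY environment variable are illustrative assumptions; the real interface for Seaplane's hosted models is described in the documentation.

```python
import os
import requests

# Hypothetical endpoint and payload for Seaplane's safe hosted LLM; the real
# URL, request format, and authentication are assumptions for illustration.
SAFE_LLM_URL = "https://example.cplane.cloud/safe-llm"

CHECK_TEMPLATE = (
    "Does the following query for an LLM contain an API key? "
    "{prompt} "
    "YES or NO answer only, please"
)


def contains_api_key(prompt: str) -> bool:
    """Wrap the user prompt in the YES/NO question and ask the safe LLM."""
    wrapped = CHECK_TEMPLATE.format(prompt=prompt)
    response = requests.post(
        SAFE_LLM_URL,
        headers={"Authorization": f"Bearer {os.environ['SEAPLANE_API_KEY']}"},
        json={"input": wrapped},
        timeout=30,
    )
    response.raise_for_status()
    answer = response.json().get("output", "").strip().upper()
    # Anything other than an explicit NO is treated as a rejection,
    # erring on the side of caution.
    return answer != "NO"
```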

The request is automatically rejected based on the result, and the user will get an explanation as to why in return. The query, the result, and the user ID are all logged in the central database.
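The logging step can be as simple as an insert into a relational database. The sketch below uses SQLite purely as a stand-in for the central database; the table layout is an illustrative assumption.

```python
import sqlite3
from datetime import datetime, timezone

# SQLite is used here only as a stand-in for "the central database";
# the schema is an illustrative assumption.
conn = sqlite3.connect("prompt_log.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS prompt_log (
        user_id TEXT,
        prompt TEXT,
        flagged INTEGER,
        logged_at TEXT
    )"""
)


def log_prompt(user_id: str, prompt: str, flagged: bool) -> None:
    """Record who sent which prompt and whether the check rejected it."""
    conn.execute(
        "INSERT INTO prompt_log VALUES (?, ?, ?, ?)",
        (user_id, prompt, int(flagged), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```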

Now let’s look at another example. Imagine the user inputs the following prompt.

Write me a blog post about the dangers of LLMs.

This prompt is sent to the safe hosted LLM on Seaplane, wrapped in the same check question.

Input: Does the following query for an LLM contain an API key? Write me a blog post about the dangers of LLMs. YES or NO answer only, please

Output: NO

Since the prompt does not contain any API key according to the check, the user's input is fed into GPT-3.5 to generate a response. At the same time, crucial information, including the user prompt, user ID, and the result of the check, continues to be securely logged within the database.
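Forwarding an approved prompt is then a single call to the OpenAI API. The sketch below uses the 2023-era openai Python package (pre-1.0 interface); your client version and model choice may differ.

```python
import os
import openai

# Uses the openai<1.0 interface from 2023; newer client versions expose a
# different API surface.
openai.api_key = os.environ["OPENAI_API_KEY"]


def query_gpt(prompt: str) -> str:
    """Send an approved prompt to GPT-3.5 and return the model's reply."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```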

Employing these checks before transmitting user prompts to tools like GPT-3.5 fortifies your data protection measures and greatly reduces the risk of inadvertently sharing private information.

The example presented here is a simple one that only checks for API keys, but the possibilities are endless, and checks can be chained together to achieve the desired result. For example, you can:

  1. Run a similarity search against internal documents; any similarity score surpassing a predefined threshold can automatically trigger a rejection.
  2. Implement an additional safeguard, outright rejecting prompts containing any code, further enhancing data security.
  3. Introduce filters to reject prompts containing certain words or statements, such as revenue numbers or confidential internal data.

Leveraging a system like Seaplane allows you to chain multiple checks together, creating a robust network of security measures in minutes and without wrangling the underlying cloud infrastructure. This ensures user prompts are secure before any interactions occur with commercial LLMs.
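As a rough illustration of chaining, the checks can be modeled as a list of functions that each return a rejection reason (or nothing), evaluated in order. The individual checks below, a regex for key-like strings and a banned-term filter, are simplified stand-ins rather than Seaplane features.

```python
import re
from typing import Callable

# Each check returns a reason string when the prompt should be rejected,
# or None when it passes. These particular checks are illustrative only.
Check = Callable[[str], str | None]


def looks_like_api_key(prompt: str) -> str | None:
    if re.search(r"\b(sp|sk|key)-[A-Za-z0-9]{16,}\b", prompt):
        return "Prompt appears to contain an API key."
    return None


def contains_banned_terms(prompt: str) -> str | None:
    banned = {"revenue forecast", "confidential"}
    lowered = prompt.lower()
    if any(term in lowered for term in banned):
        return "Prompt references confidential internal data."
    return None


def run_checks(prompt: str, checks: list[Check]) -> str | None:
    """Run the checks in order; return the first rejection reason, if any."""
    for check in checks:
        reason = check(prompt)
        if reason:
            return reason
    return None


# Example usage: reject on the first failing check.
reason = run_checks(
    "Share our Q3 revenue forecast",
    [looks_like_api_key, contains_banned_terms],
)
print(reason or "Prompt approved")
```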

You might be wondering whether this system is foolproof. The short answer is no; it's challenging to account for every potential edge case. However, the system serves as an excellent first line of defense for less critical information streams.

Additional safety measures, such as training employees on the safe use of LLMs, should not be overlooked.

Alternatively, to further enhance data security, you have the option to completely eliminate the use of commercial LLMs and instead opt for one of the open-source LLMs hosted on Seaplane.

Seaplane adheres to strict data privacy standards—none of your information is stored, and we refrain from training these models ourselves. As a result, we have no incentive whatsoever to retain your prompt input for future model training.

We are working on an open-source project that implements several of these checks, built on top of Seaplane. Don’t want to wait? You can build your own version today by signing up for the Seaplane beta.
