Data Privacy for RAG Chatbots

By Devavrat Mahajan
|
May 8, 2024

Intro

Building a RAG chatbot for a startup focused on improving productivity and building one for an enterprise that must guarantee employee and customer data security are two very different ball games.
For a startup with no data privacy constraints, building a RAG chatbot today doesn’t demand any coding skills; no-code tools like Langflow and CustomGPT will do the job. Using these tools, a 10th grader can easily build a bot that answers queries from various data sources.
However, when you use these tools, you are effectively handing the documents the bot answers from, as well as the text users type into the query field, over to Langflow and CustomGPT. If any of this data is sensitive, you risk exposing it if these third parties mishandle it or suffer a cyber attack.
This scenario is unacceptable to enterprises, so certain guardrails need to be put in place before a RAG chatbot is deployed to production.

What is RAG?

Retrieval-Augmented Generation (RAG) is an advanced AI framework and technique in natural language processing that combines elements of both retrieval-based and generation-based approaches. It aims to enhance the quality and relevance of generated text by incorporating information retrieved from a knowledge base. This method gained attention in 2020 with the publication of the research paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

When do we use RAG?

RAG, alongside advanced LLMs, is an engineering marvel. Instead of relying on simple keyword or semantic search to retrieve information from your documents, the RAG + LLM combination lets you query your knowledge base in natural language and receive the answer in natural language as well. Both retrieval and content understanding become much easier for a human this way.
Contrary to popular belief, RAG is not a method to prevent hallucinations. LLMs can and will hallucinate even when provided with clear instructions or detailed documents. RAG does, however, let a user see the sources the LLM cites and verify whether the information provided is correct.
Because of these capabilities, RAG chatbots are a popular product within enterprises, as they make it easy to share knowledge among employees.

Need for Data Privacy

Enterprises have huge knowledge bases, and much of the information held within them is either confidential (company trade secrets) or critical (customer data). Ensuring the privacy of this data is not just a legal obligation, such as complying with the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA), but also a crucial aspect of maintaining user trust.

What are the Data Privacy Concerns in implementing RAG Chatbots?

Whenever sensitive or confidential information flows through an information system, there is a concern that it may be exposed to the parties operating that system.
Here is the typical architecture of a RAG Chatbot:
[Figure: Architecture of a RAG Chatbot, showing three AI models: an embedding model, a search model, and an LLM]
As you can see, three (or more) AI models are used to generate a response to a user query:
[1] Embedding Model: Converts Text into Vectors
[2] Search Model: Searches for the relevant documents in the vector database
[3] Large Language Model: Generates response based on information fetched from the vector database
All three models get direct access to the information stored in the knowledge base, as well as to the user query. If these models are managed by a third party, that party receives this information too. The third party may have data privacy policies that are not in line with your organization’s, or a breach on the vendor’s side may expose your data.
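To make the exposure concrete, here is a minimal sketch of the RAG data flow in Python. The embed() and generate() functions are hypothetical stand-ins for whichever embedding model and LLM you use; the point is simply that raw documents and raw user queries pass through each of them, and in a third-party setup every one of those calls leaves your infrastructure.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding model. In a real deployment, the full text is
    sent to the embedding provider."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    vec = np.random.default_rng(seed).standard_normal(384)
    return vec / np.linalg.norm(vec)

def generate(prompt: str) -> str:
    """Stand-in LLM. In a real deployment, the prompt, including the
    retrieved documents, is sent to the LLM provider."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

# 1. Indexing: every document is handed to the embedding model.
documents = ["Q3 revenue forecast ...", "Customer escalation notes ..."]
doc_vectors = np.stack([embed(d) for d in documents])

# 2. Retrieval: the raw user query is also handed to the embedding model.
query = "What did the customer complain about?"
query_vector = embed(query)
scores = doc_vectors @ query_vector          # cosine similarity (unit vectors)
top_doc = documents[int(np.argmax(scores))]

# 3. Generation: the query AND the retrieved document are handed to the LLM.
prompt = f"Answer using this context:\n{top_doc}\n\nQuestion: {query}"
print(generate(prompt))
```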
OpenAI itself advises users not to share sensitive information in their prompts, since conversation data may be used to train its models. Employees ignoring this guidance is exactly what led to the widely reported data leak at Samsung.

What Steps Can we Take to Prevent Data Leakage in RAG?

To prevent leakage of sensitive information, we can take several approaches. I’m listing a few below, roughly in increasing order of implementation complexity:
[1] Data Minimization and Governance: Enterprises can deploy chatbots only over information that is neither confidential nor critical. Additionally, data retention policies should ensure the timely deletion of data once it has served its purpose within the system. Furthermore, an incident management plan needs to be in place for the scenario in which a data breach is detected, including the ability to immediately delete all confidential data from the system.
[2] Consent Mechanism: Enterprises can obtain consent from chatbot users acknowledging that information entered in the chat interface may be used for training the model, and that employees should not input any information they want to remain private. Users should also have the option to withdraw their consent at any time.
[3] Partner and 3P Management: Ensure that the third parties involved in your product’s architecture are GDPR or CCPA compliant. Although this does not guarantee data privacy, it dramatically increases the chances of your data remaining secure.
[4] Access Control: Malicious actors can exist both inside and outside the enterprise. If all employees are given equal access to information, they may be able to extract data that should not be accessible to them. It is therefore important to enforce access controls for the chatbot, including identity verification and data access management before the employee even starts typing a query in the interface (see the retrieval-filtering sketch after this list).
[5] AI Gatekeeping: As seen earlier, RAG chatbots use multiple AI models to generate a response from a user query. These models can either be third-party or in-house. If they are third-party, point [3] should be checked rigorously. Even if all third-party AI models are privacy-law compliant, there is still a chance of leakage due to malicious actors. To prevent this, we add a layer of security between our data and the AI model, explained in further detail in the next section.
[6] Using In-House Models throughout: The last and safest option is to ring-fence the system, keeping third parties out entirely. For this, you will need to use offline vector embedding, search, and large language models, or develop them yourself. You will also need to host these models on your on-prem servers (safer) or within your own cloud instances.
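As a minimal illustration of point [4], here is a sketch of access-controlled retrieval. The group labels and the allowed_groups metadata field are hypothetical; the idea is simply that documents are filtered by the authenticated user’s entitlements before any similarity search or LLM call happens.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    # Hypothetical metadata: which groups are allowed to see this document.
    allowed_groups: set = field(default_factory=set)

@dataclass
class User:
    username: str
    groups: set

def retrieve_for_user(user: User, query: str, corpus: list) -> list:
    """Filter the corpus by entitlements BEFORE ranking, so documents the
    user cannot see never reach the embedding/search/LLM pipeline."""
    visible = [d for d in corpus if d.allowed_groups & user.groups]
    # A real implementation would run the usual vector search over `visible`;
    # for this sketch we just return the permitted subset.
    return visible

corpus = [
    Document("Engineering onboarding guide", {"engineering", "hr"}),
    Document("Executive compensation review", {"hr-leadership"}),
]
alice = User("alice", {"engineering"})
print([d.text for d in retrieve_for_user(alice, "how do I onboard?", corpus)])
# -> only the onboarding guide; the compensation document is never exposed.
```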

What is AI Gatekeeping, and is it good enough?

AI Gatekeeping involves adding a layer of security around your data before it is sent to an AI model for use. There are several methods that can be used to add such security:
[1] Data Masking / Anonymization: This involves replacing sensitive values with altered ones. For instance, in a chatbot handling HR inquiries, employee names or IDs in the retrieved data can be masked with pseudonyms or generic IDs to prevent real identities from being disclosed in the chat interface (a combined masking/tokenization sketch follows this list).
[2] Tokenization / Encryption: This technique replaces sensitive data with non-sensitive placeholders or tokens. These tokens can be mapped back to the original data only through a secure tokenization system. For example, credit card numbers or personal health information could be tokenized to maintain functionality without exposing actual sensitive data.
[3] Pseudonymization: Similar to anonymization, this technique involves replacing sensitive data with artificial identifiers. Unlike anonymization, pseudonymization allows the possibility of re-identification if additional information is provided. This method is useful in scenarios where the chatbot may need to maintain some form of user continuity without revealing true identities.
[4] Noise Addition: Adding noise to data or slightly altering it in a way that does not significantly impact the utility of the data for legitimate purposes but does prevent the derivation of exact information. This is particularly useful in statistical datasets.
[5] LLM Guardrails: Adding guardrails to your LLM through system prompts and input/output filters mitigates risks such as prompt injection attacks and off-topic use, and helps detect malicious actors.
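As a minimal sketch of points [1] and [2], assuming a simple regex-based approach: sensitive values are replaced with tokens before the text is sent to a third-party model, and the mapping is kept locally so the response can be re-hydrated. The patterns below are illustrative only; real deployments would use a dedicated PII-detection service rather than hand-written regexes.

```python
import re

# Hand-written patterns for the sketch only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "EMP_ID": re.compile(r"\bEMP-\d{5}\b"),
}

def tokenize(text: str):
    """Replace sensitive values with placeholder tokens before the text
    leaves your infrastructure; keep the mapping locally."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            token = f"<{label}_{i}>"
            mapping[token] = match
            text = text.replace(match, token)
    return text, mapping

def detokenize(text: str, mapping: dict) -> str:
    """Re-hydrate the model's response locally, after it comes back."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

query = "Why was EMP-48210 (jane.doe@example.com) denied leave?"
safe_query, mapping = tokenize(query)
print(safe_query)   # Why was <EMP_ID_0> (<EMAIL_0>) denied leave?
# safe_query is what gets sent to the third-party LLM.
answer = "Leave for <EMP_ID_0> was denied due to a staffing freeze."
print(detokenize(answer, mapping))
```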
Using AI gatekeeping on top of privacy-compliant third-party AI models is generally a good trade-off for enterprises. To further enhance confidence in third-party models, enterprises may choose cloud provider-managed search and LLM services such as Azure AI Search, Amazon Kendra, Azure OpenAI Service, and Amazon Bedrock. Cloud providers also offer data centers in multiple regions, so you can ensure that your data never leaves your country, a data-residency requirement in many jurisdictions.
AWS and Azure have been securely providing cloud services to organizations for well over a decade and can reasonably be trusted with the security of your data.

What do I do if I don’t want 3P involvement in the Chatbot?

In rare scenarios, enterprises dealing with extremely sensitive information require that all data stay on their on-prem servers and never leave their premises for any operation. In such cases, you can deploy offline models on your own servers. This way, you get complete control over your data, can customize security for your use case, and ensure data localization.
However, this option is far more expensive than using a cloud provider. Not only are the fixed server costs high, but ensuring server security and maintenance is a significant ongoing overhead. It should be reserved for extremely critical data and avoided in most cases.
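For reference, here is a minimal sketch of a fully local pipeline, assuming the open-source sentence-transformers and FAISS libraries and a locally hosted LLM; the model names are examples, not recommendations. Nothing in this flow leaves your own infrastructure.

```python
# pip install sentence-transformers faiss-cpu
import faiss
from sentence_transformers import SentenceTransformer

# Local embedding model: runs entirely on your own hardware.
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # example model

documents = [
    "Expense reports must be filed within 30 days.",
    "The VPN requires hardware tokens for remote access.",
]

# Build an in-memory FAISS index over normalized embeddings
# (inner product on unit vectors == cosine similarity).
doc_vectors = embedder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

# Retrieval stays local too.
query = "How long do I have to submit expenses?"
query_vector = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(query_vector, 1)
context = documents[int(ids[0][0])]

# The final step would pass `context` and `query` to an LLM hosted on your
# own servers (e.g. via llama.cpp, vLLM, or Ollama); shown here as a prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```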

Conclusion

Building and deploying RAG chatbots in enterprises involves significant considerations around data privacy and security, especially when sensitive information is at play. While no-code platforms may offer simplicity and accessibility for startups, enterprises must navigate complex privacy concerns and implement robust security measures such as data minimization, consent mechanisms, and AI gatekeeping. Ultimately, the choice between using third-party services or maintaining in-house operations depends on the specific security needs and resources of the enterprise. Properly implemented, RAG chatbots can significantly enhance productivity and information accessibility while safeguarding sensitive data.
