What Happens to Data in Popular LLM Services

OpenAI ChatGPT: Control in the User's Hands

OpenAI’s policy appears fairly transparent, though there are nuances. By default, all conversations with ChatGPT are stored on the company’s servers, and moderators may review them if the system flags a violation of the usage policies. In the free version, input data may also be used to train future versions of the model.

This may sound concerning, but there is good news. OpenAI provides real tools for user control. All content rights remain with the user—meaning the company does not claim ownership of the information. Data is only used to the extent necessary for the service to function.

Since 2023, OpenAI has changed its default settings. Data sent through APIs or corporate products is no longer used for training models without explicit consent. Business users can completely disable chat history storage—conversations are then retained for a maximum of 30 days and do not enter the training dataset.

For corporate clients, extended guarantees are in place, including contractual commitments not to use their data for model training.

Anthropic Claude: Focus on Ethics

Claude follows similar principles, and in some respects stricter ones. Anthropic states that it does not use user data to retrain the model without explicit permission. By default, interactions with Claude do not enter the training dataset, especially when the paid API is used.

The company emphasizes AI ethics and minimizes personal data retention. Requests and responses are stored for a limited time (up to two years for security purposes) but are not used for training. Users have the option to fully opt out of data retention.

Perplexity AI: An Extra Layer of Protection

Perplexity presents an interesting case. This AI-powered search assistant operates on top of models from OpenAI and Anthropic but adds its own layer of protection. The company has agreements ensuring that user data is not shared with base models for training purposes.

When you make a query through Perplexity, OpenAI or Anthropic receives it solely for generating a response but does not retain it for future training. Perplexity itself may use query history to improve the service, but there’s an option called AI Data Retention in the settings to disable this feature.

This creates a double layer of protection—external models don’t train on your data, and the service allows you to opt out of query history usage.

DeepSeek: Red Flags for Security

The situation with the Chinese model DeepSeek is entirely different. This increasingly popular service raises serious concerns among cybersecurity experts. DeepSeek collects a wide range of user information and transfers it to servers in China.

This creates multiple issues. First, the data falls under Chinese jurisdiction, where personal data regulations are more lenient and government agencies can request access to information stored on local servers. Second, confidentiality isn’t guaranteed: the data is stored outside the reach of protections such as GDPR.

Free services ultimately make you pay for their "free" status with your data. For corporate use, this is extremely risky.

How to Build a Corporate AI Usage Policy

Once a company understands the differences between services, it must establish clear guidelines for AI usage. Without a coherent policy, employees will act at their own discretion, which inevitably leads to security incidents.

Identifying Approved Services
The first step is to inventory popular AI tools and identify which ones meet corporate security requirements. You can ban services with unreliable privacy policies and permit only those offering sufficient guarantees.
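
As an illustration, an approved-services inventory can be kept in machine-readable form so that a proxy or browser plugin can consult it before a request leaves the network. The sketch below is a hypothetical Python example; the service names, policy fields, and verdicts are placeholders a security team would adapt to its own review results.

```python
# Hypothetical sketch: a minimal allowlist of AI services approved by the
# security team, plus a helper that a proxy or browser extension could call
# before letting a request through. Service names and policy fields are
# illustrative, not taken from any vendor documentation.

APPROVED_AI_SERVICES = {
    "chat.openai.com": {"tier": "Enterprise only", "training_opt_out": True},
    "claude.ai":       {"tier": "Team/Enterprise", "training_opt_out": True},
    "perplexity.ai":   {"tier": "Pro",             "training_opt_out": True},
}

BLOCKED_AI_SERVICES = {"deepseek.com"}  # example: fails the privacy review


def check_ai_destination(hostname: str) -> str:
    """Return a coarse policy verdict for an outbound AI request."""
    if hostname in BLOCKED_AI_SERVICES:
        return "block"
    if hostname in APPROVED_AI_SERVICES:
        return "allow"
    return "review"  # unknown tools go to the security team first


if __name__ == "__main__":
    for host in ("claude.ai", "deepseek.com", "some-new-ai.app"):
        print(host, "->", check_ai_destination(host))
```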

Restrictions on Confidential Information
The most critical rule is a categorical ban on entering any data considered a trade secret into public LLMs. Personal customer data, internal communications, source code, financial information—all of this must be excluded from use in external AI tools.

In the rare cases where mentioning certain details is unavoidable, employees should replace real data with anonymized placeholders, such as asterisks or [REDACTED] labels. All sensitive categories of information should be explicitly listed in the policy.
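
One way to enforce the placeholder rule is to scrub prompts automatically before they are pasted into an external LLM. The following is a minimal, hypothetical Python sketch; the regular expressions cover only obvious identifiers (emails, phone numbers, IBANs) and would need tuning for real corporate data.

```python
import re

# Hypothetical pre-submission scrubber: replaces obvious personal identifiers
# with [REDACTED] placeholders before a prompt is sent to an external LLM.
# The patterns are deliberately simple and illustrative only.

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "iban":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def redact(text: str) -> str:
    """Replace matches of each pattern with a labeled [REDACTED] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


if __name__ == "__main__":
    prompt = "Draft a reply to jane.doe@acme.com, phone +1 202 555 0143."
    print(redact(prompt))
```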

Rules for Handling Documents and Code
Special attention should be given to working with files and program code. You cannot simply upload entire documents or code snippets into cloud-based AI without checking them for sensitive information. This is especially relevant for developers who often use AI for analysis or debugging.

Before sending code to an AI, remove all passwords, API keys, server addresses, and other sensitive details. It is also worth requiring that code be reviewed for potential leaks before it is shared with an AI assistant for programming help.
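
Such a pre-flight check can be partially automated. Below is a hedged Python sketch that flags common credential patterns (hard-coded passwords, AWS-style access keys, private IP addresses) before a snippet leaves the developer's machine; the patterns are illustrative, not exhaustive.

```python
import re

# Hypothetical pre-flight check for code snippets: flags strings that look
# like credentials before the snippet is sent to an external AI assistant.
# Covers only common cases and should be extended for real use.

SECRET_PATTERNS = [
    ("hard-coded password", re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.I)),
    ("api key assignment",  re.compile(r"api[_-]?key\s*=\s*['\"][^'\"]+['\"]", re.I)),
    ("aws access key id",   re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("private ip address",  re.compile(r"\b192\.168\.\d{1,3}\.\d{1,3}\b")),
]


def find_secrets(code: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs for anything that looks sensitive."""
    findings = []
    for label, pattern in SECRET_PATTERNS:
        findings.extend((label, m.group(0)) for m in pattern.finditer(code))
    return findings


if __name__ == "__main__":
    snippet = 'db_password = "s3cret"\nhost = "192.168.1.20"'
    for label, value in find_secrets(snippet):
        print(f"Blocked: {label} -> {value}")
```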

Employee Training and Oversight
Technical measures alone won’t solve the problem—working with people is essential. Mandatory training on safe AI usage should explain the risks of careless handling of LLMs and why the company imposes restrictions.

Data Classification: What Can Be Trusted to AI?

Not all information is equally critical to a company. Reasonable classification helps determine what can be processed via external LLMs and what must remain strictly within internal boundaries.

Publicly Available Information – Green Light: This includes data available in open sources or that holds no value for attackers. Draft press releases without specific figures, marketing texts intended for publication, or educational project code—such information can be processed through ChatGPT or similar tools without significant concern.
Internal Information – Handle with Care: This category includes data that isn’t entirely public but also isn’t critically important. For example, results of internal employee surveys or analytical reports based on open-source information. Such data can be processed but with precautions.
Confidential Data – Only Internally: Personal customer data, financial reports, product plans, source code, any information marked as “trade secret”—none of this should ever be entered into external cloud-based LLMs. The risk is too high, and the consequences of a leak could be catastrophic.
Such data can only be processed using AI on internal infrastructure—either through a privately deployed model or an isolated vendor solution where it’s guaranteed that the information doesn’t leave the system.
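
To make the classification actionable, it helps to encode the tiers and their allowed destinations in the tooling itself. The sketch below is a minimal Python example; the tier names mirror this article, while the destination mapping is an assumption each security team would adjust.

```python
from enum import Enum

# Minimal sketch of the three-tier classification described above, with a
# helper answering the key question: may this data go to an external cloud
# LLM, or must it stay on internal infrastructure? The mapping itself is an
# assumption to be adapted by the security team.

class DataTier(Enum):
    PUBLIC = "publicly available"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


ALLOWED_DESTINATIONS = {
    DataTier.PUBLIC:       {"external_llm", "internal_llm"},
    DataTier.INTERNAL:     {"external_llm", "internal_llm"},  # with precautions
    DataTier.CONFIDENTIAL: {"internal_llm"},                  # never leaves the company
}


def may_send(tier: DataTier, destination: str) -> bool:
    """True if data of this tier may be processed at the given destination."""
    return destination in ALLOWED_DESTINATIONS[tier]


if __name__ == "__main__":
    print(may_send(DataTier.PUBLIC, "external_llm"))        # True
    print(may_send(DataTier.CONFIDENTIAL, "external_llm"))  # False
```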

Practical Measures for Data Protection

Even when using approved services and following data classification, it’s crucial to apply additional security measures. A comprehensive approach helps minimize the remaining risks.

Minimization and Anonymization: The golden rule of security is to input as little specific information as possible into LLMs. Frame queries so that unnecessary details aren’t disclosed. If a task can be solved using hypothetical data, do so.

Privacy Settings: Many platforms provide tools for data protection, but users often ignore them. First and foremost, disable history saving wherever possible. In ChatGPT, you can turn off chat storage; conversations are then deleted after 30 days and are not used for training.

Choosing Reliable Providers: When there’s a choice between services, prefer those offering stronger security guarantees. OpenAI’s business version undergoes SOC 2 audits and promises not to use client data for training. Anthropic publishes detailed policies on data storage and deletion.

Local Solutions: The most radical but effective approach is to deploy your own LLM within the company. If you regularly work with confidential data, investing in a private model is worthwhile. There are decent open-source solutions such as Llama or Qwen that can be fine-tuned for your needs.
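
For a sense of what a local deployment looks like, here is a minimal Python sketch assuming the Hugging Face transformers library and PyTorch are installed and the machine has enough memory or GPU for the chosen checkpoint; the Qwen model ID is one example and can be swapped for a Llama checkpoint. Once the weights are cached, generation runs entirely on your own hardware, so prompts and documents never leave the company network.

```python
# Minimal sketch of running an open-source model locally so that prompts and
# documents never leave your infrastructure. Assumes `transformers` and
# PyTorch are installed and the hardware fits the chosen checkpoint.

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # example checkpoint, downloaded once and cached locally
    device_map="auto",                 # place weights on GPU if available
)

prompt = "Summarize the key risks of sending confidential data to public LLMs."
result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```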

Protecting Intellectual Property from Becoming Part of a Dataset
In an era where every user session potentially feeds new knowledge into models, an unexpected risk arises. Your unique development or original text might suddenly “surface” in AI responses to others.

How can you avoid this?

Principle of Irreversibility: Once you send proprietary text to ChatGPT, remember that it is stored on OpenAI’s servers and might be used to train future versions. You cannot recall this data; it has permanently left your control.

No-Training Modes: Many services offer paid or corporate options where training on user data is disabled. Use these for working with valuable content. If you need to ask about a patent description, do it through ChatGPT Enterprise, not the free chat.

Watermarks and Traces: If you absolutely must run your own content through a model, an advanced technique is to mark it with special markers first. For example, insert invisible characters or deliberately falsified details that can later identify the source of a leak.
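
One way to implement the invisible-marker idea is to encode a short owner ID as zero-width characters appended to the text. The Python sketch below is hypothetical; the marker survives copy-paste but not aggressive reformatting, so treat it as one tracing signal among several, not proof of a leak.

```python
# Hypothetical sketch of the "invisible marker" idea: encode a short owner ID
# as zero-width characters and embed it in the text before sharing. If the
# text later resurfaces, the marker can be extracted to trace the leak.

ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner


def embed_marker(text: str, marker_id: str) -> str:
    """Append marker_id as an invisible bit sequence at the end of the text."""
    bits = "".join(f"{byte:08b}" for byte in marker_id.encode("utf-8"))
    invisible = "".join(ZW1 if bit == "1" else ZW0 for bit in bits)
    return text + invisible


def extract_marker(text: str) -> str:
    """Recover a previously embedded marker, or return '' if none is found."""
    bits = "".join("1" if ch == ZW1 else "0" for ch in text if ch in (ZW0, ZW1))
    if not bits:
        return ""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits) - 7, 8))
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    marked = embed_marker("Internal product description v2", "doc-4711")
    print(extract_marker(marked))  # -> doc-4711
```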

Internal Training: If you have a large volume of proprietary intellectual property and want to apply AI to it, it is better to train or fine-tune a model locally on that data. You then get all the benefits of AI while your "dataset" stays internal.

Choose reliable services, follow internal rules, and avoid oversharing. Then artificial intelligence will become an assistant rather than a source of leaks.

By Claire Whitmore
July 31, 2025

