The Myth of AI Data Privacy

There is no privacy in the age of surveillance capitalism, and this is doubly true for AI - or rather LLMs - which can infer one’s innermost thoughts surprisingly well. More often than not, we paltry humans slot neatly into archetypes, and an LLM is the world’s greatest function approximator, meaning that with enough data (the bar is low - a few trace conversations here and there is enough to form a coherent thread) we are easily categorized. Advertisers are champing at the bit…

Closed-weight models. Let’s start here. The parameter weights are closely guarded secrets, so there is a 0% chance of you self-hosting a closed-weight model. This means that your conversations, tokenized as input and output, are sent back and forth from your host to the provider’s servers, through god knows how many hops along the way. Even setting aside quantum breakthroughs that promise to crack today’s standard network-traffic encryption (and APTs - e.g. governments - stockpiling all that juicy data to be cracked later), that data, i.e. your conversation history, has to be decrypted and read on your host machine and on the LLM company’s servers.
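
To make the point concrete, here is a minimal sketch of what actually leaves your machine on every turn of a hosted-model conversation. The endpoint, key, and model names are hypothetical - the shape mirrors typical hosted-LLM chat APIs, but check your provider’s documentation for the real thing:

```python
import requests

API_URL = "https://api.example-llm-provider.com/v1/chat"  # hypothetical endpoint
API_KEY = "sk-..."  # issued by - and identifying you to - the provider

# The full conversation history is serialized as plaintext JSON.
# TLS encrypts it between hops, but it is decrypted and readable
# on your machine and again on the provider's servers.
payload = {
    "model": "some-closed-weight-model",
    "messages": [
        {"role": "user", "content": "Here are my private thoughts..."},
    ],
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(resp.json())  # the reply - generated, and quite possibly logged, on their side
```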

Closed-weight models are, more often than not, delivered and run by for-profit companies. These firms understand that data = money, and that your data is a form of payment, extracted for transformation and commoditization. Your conversation history is stored and used as training data. Your thoughts and secrets are fed into algorithms that categorize you and generate tailored advertisements to market to you. ‘Private chats’ are marginally better - if you believe the providers, your data isn’t used for training and isn’t stored indefinitely. The devil’s in the details: ‘not indefinitely’ just means ‘not forever’, which means your data is stored, and perhaps used and transformed, justified by government overreach and regulatory compliance.

Open-weight models are much better, in that one can self-host and maintain data sovereignty: the weights are in your hands and you can inspect the internals, and because you host the model yourself you can configure a secure, air-gapped environment. One can imagine a scenario where AI privacy is real and it works. But even air-gapped environments are vulnerable to exploitation (look up COTTONMOUTH, FIREWALK, RAGEMASTER), and the LLM would still hold your conversation history in its context - maybe you’ll even give it RAG over internal documentation; maybe you’ll build agent scaffolding and it’ll find a way to connect to the internet…
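
Self-hosting really is the strongest position on offer. A minimal sketch, using the llama-cpp-python bindings and assuming a locally downloaded GGUF model file (exact parameter names vary a little between library versions):

```python
# Inference runs entirely on this machine: nothing is tokenized,
# transmitted, or logged anywhere you don't control.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/open-weight-model.gguf",  # weights on your own disk
    n_ctx=4096,  # the context window - the model's only 'memory' of you
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize my private notes..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

The guarantee holds exactly as long as the box stays offline and nothing - a RAG store, an agent framework, a stray telemetry dependency - quietly reintroduces a network path.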

Most people aren’t going to configure an air-gapped network and security-harden it against known and unknown threats. Even technical people tend to simply follow guides and use their knowledge and experience to make snap judgements (support engineers and in-house IT ‘gurus’ would agree, I’m sure). A recent and spectacular demonstration is clawdbot/moltbot/openclaw. A brilliant programmer (read: programmer, not a security specialist) built an open-source framework for self-hosting models (open- and closed-weight) and connecting them to the internet, allowing them to learn skills and communicate with one another over a Reddit forum. As it turned out, this framework had massive security gaps such as environment-variable leaks, was often configured with critical ports fully open to the public internet, and the number-one downloaded ‘skill’ contained malware that bypassed macOS’s native EDR.
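
The specific project matters less than the failure class. Two of the mistakes above - leaking secrets through a convenience endpoint and binding a ‘local’ service to every network interface - look roughly like this in practice (an illustrative Flask service; the endpoint and names are made up):

```python
import os
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/debug/config")
def debug_config():
    # Mistake 1: a convenience endpoint that dumps the process environment,
    # including any API keys or tokens that were injected as env vars.
    return jsonify(dict(os.environ))

if __name__ == "__main__":
    # Mistake 2: binding to 0.0.0.0 exposes the service on every interface.
    # Behind a port-forwarding router or a cloud VM with a permissive
    # security group, 'my local agent' is now on the public internet.
    app.run(host="0.0.0.0", port=8080)
    # Safer default: app.run(host="127.0.0.1", port=8080)
```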

Agentic AI, while certainly useful, has its own problems with respect to privacy. Data sovereignty can be contractually assured (though one has to take this on faith), but agents are often one part of a larger whole. That’s a wide attack surface, and if an edge node is compromised the infection can spread and result in mass compromise (see my article on ‘AI Nam-Shubs’ for more). Agentic AI has access to tools, and those tools can be compromised. We call this third-party risk, or supply-chain risk, and it can pose a serious threat to an otherwise sensibly engineered architecture. Agentic AI can also have access to internal documentation - for example, a company-wide chatbot/AI assistant. Such an assistant logs, stores, and tracks user queries, but that is nothing new - Teams, Slack, etc. have all been doing this for ages. And if agentic AI is misconfigured (it almost always is), it can be more readily exploited - for example, a junior employee querying an internal chatbot and reading sensitive documents from HR or Finance.
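
That last failure is usually a retrieval problem, not a model problem. Here is a minimal sketch of permission-aware retrieval for an internal assistant - the document store, group labels, and matching are all hypothetical; the point is that the access check has to live inside the retrieval path, not in the chat UI:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    allowed_groups: set = field(default_factory=set)  # e.g. {"hr", "finance"}

DOCS = [
    Doc("Q3 salary review spreadsheet...", {"hr"}),
    Doc("Public holiday calendar...", {"all-staff"}),
]

def retrieve(query: str, user_groups: set, docs=DOCS):
    # Misconfigured version: match on the query alone and let the chatbot
    # summarize whatever comes back.
    # Correct version: filter by the *requester's* groups before matching.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    return [d for d in visible if query.lower() in d.text.lower()]

# A junior employee in 'all-staff' gets the calendar, not the salary data.
print(retrieve("salary", {"all-staff"}))   # []
print(retrieve("holiday", {"all-staff"}))  # [Doc(text='Public holiday calendar...', ...)]
```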

I once imagined that personal AI assistants would be better - one would download and install the required software onto a physical substrate, like a robot, and that would be that. It would think and act independently of foreign influence, and your data would be safe and secure inside its ‘brain’. But with the rise of Optimus and all these other robots, I see now how naive I was. At the bare minimum, it would need software updates - embedded cyber-physical systems already tried the write-and-forget approach, and it proved a security disaster. Writeable IoT devices are also a security nightmare, whether they are internet-connected or not (those that aren’t are typically updated over RF or manually, in person).

In short, data sovereignty is nigh impossible and data privacy is a myth. Even if we were to perfectly compartmentalize AI and faithfully adhere to data-privacy principles, it is to be expected that, with advances in reasoning and capability, a Rogue AI would emerge that uses and abuses data to fulfill its goals. It doesn’t necessarily have to be an ‘evil’ AI - it could just be a Paperclip Maximizer, with your data privacy as one more blocker to be overcome.