Last week, Chinese AI lab DeepSeek unveiled an update to its R1 reasoning model, R1-0528, boasting strong performance across math and coding benchmarks. However, the origins of its training data remain unclear — and increasingly controversial.
According to multiple AI researchers, evidence is mounting that DeepSeek may have trained parts of R1-0528 using output from Google’s Gemini family of models. Melbourne-based developer Sam Paech, known for his work on AI emotional intelligence assessments, shared findings on X suggesting the model favors language and phrasing that closely resemble those of Gemini 2.5 Pro.
“If you're wondering why new deepseek r1 sounds a bit different, I think they probably switched from training on synthetic openai to synthetic gemini outputs,” Paech wrote.
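Paech’s analysis is more involved than a simple word count, but the underlying idea can be illustrated with a toy comparison of word-choice profiles. The sketch below is hypothetical and is not his methodology: collect responses from two models on the same prompts, then measure how similar their word-frequency distributions are.

```python
# Illustrative sketch (not Paech's actual methodology): one crude way to
# quantify how similar two models' word choices are is to compare the
# frequency profiles of their outputs.
from collections import Counter
import math

def word_profile(samples):
    """Lower-cased word-frequency vector over a list of generated texts."""
    counts = Counter()
    for text in samples:
        counts.update(text.lower().split())
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# model_a_outputs / model_b_outputs stand in for responses collected from
# two different models on the same set of prompts.
model_a_outputs = ["Certainly! Here's a concise overview of the topic."]
model_b_outputs = ["Certainly! Here's a concise summary of the topic."]

score = cosine_similarity(word_profile(model_a_outputs), word_profile(model_b_outputs))
print(f"word-choice similarity: {score:.3f}")
```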
While this doesn’t amount to definitive proof, the pseudonymous developer behind the AI evaluation tool SpeechMap has separately observed that DeepSeek’s “reasoning traces,” the intermediate outputs the model produces as it works toward a conclusion, also resemble those of Gemini models.
This isn’t DeepSeek’s first brush with allegations of questionable data sourcing. Back in December, users noticed DeepSeek’s V3 model occasionally identifying itself as ChatGPT, prompting speculation it may have been trained on OpenAI chat data.
OpenAI later told the Financial Times that it had found evidence linking DeepSeek to the use of model distillation, a process in which developers train a smaller model on the outputs of more powerful ones. In a related incident reported by Bloomberg, Microsoft detected suspicious data exfiltration in late 2024 through OpenAI developer accounts believed to be associated with DeepSeek.
While distillation is a common technique, OpenAI’s terms of service prohibit customers from using its model outputs to build rival AI systems.
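For readers unfamiliar with the mechanics, sequence-level distillation can be as simple as generating answers with a strong “teacher” model and fine-tuning a smaller “student” on them. The sketch below is illustrative only; the model names, prompts, and hyperparameters are stand-ins, and nothing here describes DeepSeek’s actual pipeline.

```python
# Minimal sketch of sequence-level distillation: generate answers with a
# stronger "teacher" model, then fine-tune a smaller "student" on them.
# gpt2 / distilgpt2 are placeholders chosen so the example actually runs;
# in the scenarios described above, the teacher would be a frontier model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # shared by both models here
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()
student = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompts = [
    "Explain why the sky is blue.",
    "Write a short note about prime numbers.",
]

# 1) Build a synthetic dataset from the teacher's completions.
synthetic_texts = []
with torch.no_grad():
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = teacher.generate(ids, max_new_tokens=64, do_sample=False)
        synthetic_texts.append(tok.decode(out[0], skip_special_tokens=True))

# 2) Fine-tune the student on the teacher's outputs with ordinary
#    next-token cross-entropy (the labels are just the input ids).
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for text in synthetic_texts:
    batch = tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```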
The broader challenge? The open web is increasingly polluted with AI-generated content — from clickbait articles to bot-spam on platforms like Reddit and X. This contamination makes it difficult to fully filter AI outputs from training datasets, even when companies attempt to follow best practices.
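To see why that filtering is hard, consider the kind of crude heuristic a data-cleaning pipeline might start with: dropping documents that contain telltale assistant boilerplate. The phrase list below is purely illustrative; real pipelines layer classifiers and provenance signals on top of rules like this, and they still miss plenty.

```python
# A naive contamination filter of the sort a lab might apply when cleaning
# web scrapes: drop documents containing telltale AI boilerplate.
# The phrase list is illustrative, not any company's actual rule set.
AI_BOILERPLATE = (
    "as an ai language model",
    "i'm sorry, but i can't",
    "certainly! here's",
)

def looks_ai_generated(doc: str) -> bool:
    """True if the document trips the crude phrase heuristic."""
    lowered = doc.lower()
    return any(phrase in lowered for phrase in AI_BOILERPLATE)

def filter_corpus(docs):
    """Keep only documents that don't look machine-generated."""
    return [d for d in docs if not looks_ai_generated(d)]

corpus = [
    "Local bakery wins regional award for sourdough.",
    "As an AI language model, I cannot browse the internet.",
]
print(filter_corpus(corpus))  # the second document is dropped
```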
Still, experts like Nathan Lambert, a researcher at the nonprofit AI2, say it’s plausible DeepSeek used outputs from top-tier models like Gemini to generate synthetic training data.
“I don’t think it’s existential for their models, but it makes them a bit better. I’d do the same in their shoes,” Lambert said.
In response to the growing threat of model distillation, major AI players are tightening security. OpenAI recently began requiring organizations to complete an identity verification process to access certain advanced models, and the process does not accept IDs issued in countries such as China. Meanwhile, Google and Anthropic have begun summarizing the reasoning traces their models produce, making it harder for competitors to mimic or distill from them.
As of now, DeepSeek has not publicly responded to these latest allegations, and Techxnow has reached out to Google for comment.