DeepSeek May Have Used Google Gemini Data for AI Model

Last week, Chinese AI lab DeepSeek unveiled an update to its R1 reasoning model, R1-0528, boasting strong performance across math and coding benchmarks. However, the origins of its training data remain unclear — and increasingly controversial.


According to multiple AI researchers, evidence is mounting that DeepSeek may have trained parts of R1-0528 on output from Google's Gemini family of models. Melbourne-based developer Sam Paech, known for his work on AI emotional intelligence assessments, shared findings on X suggesting the model favors words and expressions that closely resemble those of Gemini 2.5 Pro.

While this doesn't amount to definitive proof, the pseudonymous developer behind the AI evaluation tool SpeechMap adds that DeepSeek's "reasoning traces" — the intermediate outputs the model generates when forming conclusions — also resemble those of Gemini models.

This isn’t DeepSeek’s first brush with allegations of questionable data sourcing. Back in December, users noticed DeepSeek’s V3 model occasionally identifying itself as ChatGPT, prompting speculation it may have been trained on OpenAI chat data.

OpenAI later told the Financial Times that it had found evidence linking DeepSeek to the use of model distillation — a process in which developers train a smaller model on the outputs of a more powerful one. In a related incident reported by Bloomberg, Microsoft detected suspicious data exfiltration in late 2024 via OpenAI developer accounts believed to be associated with DeepSeek.

While distillation is a common practice in the industry, OpenAI's terms of service prohibit customers from using its model outputs to build rival AI systems.
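To make the concept concrete: at its core, distillation means training a student model to reproduce a teacher's output distribution rather than the original ground-truth labels. The sketch below is a deliberately toy illustration of that idea (it is not DeepSeek's or anyone's actual pipeline): a "student" vector of logits is nudged by gradient descent until its softmax matches a fixed "teacher" distribution, by minimizing the KL divergence between them.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Fixed "teacher" output distribution over four hypothetical tokens.
teacher_probs = softmax(np.array([2.0, 0.5, -1.0, 0.1]))

# The "student" starts from random logits.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=4)

# Gradient descent: for KL(p || softmax(s)), the gradient w.r.t. the
# student logits s is simply (softmax(s) - p).
learning_rate = 0.5
for _ in range(500):
    q = softmax(student_logits)
    student_logits -= learning_rate * (q - teacher_probs)

print(kl_divergence(teacher_probs, softmax(student_logits)))  # near zero
```

In a real system the "teacher outputs" would be token probabilities (or sampled text) from a large model, and the student would be a full neural network — but the objective is the same: pull the student's distribution toward the teacher's.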

The broader challenge? The open web is increasingly polluted with AI-generated content — from clickbait articles to bot-spam on platforms like Reddit and X. This contamination makes it difficult to fully filter AI outputs from training datasets, even when companies attempt to follow best practices.

Still, experts like Nathan Lambert, a researcher at the nonprofit AI2, say it’s plausible DeepSeek used outputs from top-tier models like Gemini to generate synthetic training data.

In response to the growing threat of model distillation, major AI players are tightening security. OpenAI recently implemented identity verification requirements for access to its advanced models, excluding countries like China. Meanwhile, Google and Anthropic have begun summarizing the traces their models produce, making it harder for competitors to mimic or distill from them.

As of now, DeepSeek has not publicly responded to these latest allegations, and Techxnow has reached out to Google for comment.

Source: TechCrunch
