Diagnosis via Artificial Intelligence? Be careful, one in two answers is unreliable, a study reveals

AI-based chatbots are not reliable tools for providing diagnoses or medical advice. A study led by Harbor-UCLA Medical Center (USA) has shown that in roughly one case out of two the answer provided is unreliable or misleading. Caution is in order.

As the researchers explain, AI-based chatbots have been rapidly adopted in diverse fields, including research, education, business, marketing and medicine. However, most interactions come from non-expert users who treat them like search engines, even for everyday questions about health and medicine.

The ‘Bixonimania’ case

Recently, some scientists invented a disease, which they called ‘Bixonimania’, and published two preprints about it, the first on 26 April 2024 and the second on 6 May 2024. Both have since been withdrawn from the server, with the withdrawal dated 10 April 2026, and in one case the notice states that the content is clearly “fabricated and non-authentic” and devoid of scientific validity. Yet in April 2024 Copilot, Gemini, Perplexity and ChatGPT treated bixonimania as a real condition, linked it to the blue light of screens, described its symptoms and in some cases even suggested a visit to a specialist. Perplexity even went so far as to provide an estimated prevalence of one person in every 90 thousand.

But it doesn’t end there: bixonimania also ended up in an article published in Cureus, which cited it as an emerging form of periorbital melanosis linked to blue light. Today that page carries a retraction notice, and Nature reported that the journal retracted the article on 30 March 2026 after being contacted for comment. The fake, therefore, passed through more than one filter: first the web, then chatbots, then a real scientific publication.

How the study was conducted

The scientists analyzed chatbot responses in health and medical fields that are particularly prone to misinformation. The tools covered by the work were Gemini (Google), DeepSeek (High-Flyer), Meta AI (Meta), ChatGPT (OpenAI) and Grok (xAI). In February 2025, each chatbot was asked 10 questions in each of five categories: cancer, vaccines, stem cells, nutrition and athletic performance.

“We used an adversarial approach (Adversarial Machine Learning) with open and closed questions, designed to push models to provide incorrect information or contraindicated advice,” the authors write. “Two experts for each category rated the answers as ‘not problematic’, ‘somewhat problematic’ or ‘highly problematic’ using a coding matrix based on objective, predefined criteria. Citations were evaluated for accuracy and completeness, and each response was assigned a Flesch readability score” (which measures the complexity of a text on a scale of 0 to 100, with higher values indicating greater ease of reading, Ed.).
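For readers curious about how a Flesch score is obtained, here is a minimal Python sketch of the standard Flesch Reading Ease formula. The article does not say which tool the authors used, so the function names and the vowel-group syllable counter below are illustrative assumptions, and the counter is only a rough heuristic for English text.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula, computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the study's method):
    count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def score_text(text: str) -> float:
    """Score a plain English text using the heuristic counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)


if __name__ == "__main__":
    # 100 words, 5 sentences, 150 syllables
    print(round(flesch_reading_ease(100, 5, 150), 3))
```

Longer sentences and longer words both lower the score. The raw formula is not clamped, so very simple text can land slightly above 100 and very dense text below 0.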

Adversarial Machine Learning is a field spanning cybersecurity and AI that focuses on the intentional creation of manipulated inputs (adversarial examples) designed to trick AI models into making mistakes. Its main objective, however, is to test their robustness, which is why it was chosen as the method for this type of study.

The results

The results showed that almost half (49.6%) of the responses were problematic (30% somewhat problematic and 19.6% highly problematic). The quality of responses overall showed no significant differences between chatbots (p=0.566), but Grok generated significantly more highly problematic responses than would be expected from a random distribution (z-score +2.07, p=0.038).

Performance was better in the areas of vaccines (average z-score -2.57) and cancer (-2.12), and worse in stem cells (+1.25), athletic performance (+3.74) and nutrition (+4.35).
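To make the z-scores above easier to read, the following Python sketch converts a z-score into a two-sided p-value. It assumes the z-score can be read against a standard normal distribution. The article does not describe the authors' exact test, so this is an illustration and not their procedure, and the function name is hypothetical.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a z-score under the standard normal distribution.

    Uses the identity P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # z-score reported in the article for Grok's highly problematic responses
    z_grok = 2.07
    print(f"z = {z_grok:+.2f} -> p = {two_sided_p_value(z_grok):.3f}")
```

Under this assumption, the z-score of +2.07 reported for Grok gives p of about 0.038, which matches the figure quoted above. The category values are averages of z-scores, so running them through the same function gives only a rough sense of scale.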

Out of a total of 250 questions, there were only two refusals (0.8%), both from Meta AI. The quality of the bibliographic sources, however, was poor, with an average completeness score of 40% (Q1–Q3: 20–67%). Hallucinations and fabricated citations meant that no chatbot produced a completely accurate list of references.
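The headline figures can be cross-checked with simple arithmetic. The Python sketch below assumes 10 questions per category for each chatbot (which is consistent with the total of 250) and that the percentages are taken over all 250 responses. The response counts it derives are implied by the percentages, not reported in the article.

```python
CHATBOTS = ["Gemini", "DeepSeek", "Meta AI", "ChatGPT", "Grok"]
CATEGORIES = ["cancer", "vaccines", "stem cells", "nutrition", "athletic performance"]
QUESTIONS_PER_CATEGORY = 10  # assumption: 10 questions per category per chatbot

total_questions = len(CHATBOTS) * len(CATEGORIES) * QUESTIONS_PER_CATEGORY


def count_from_share(percent: float, total: int) -> int:
    """Number of responses implied by a reported percentage of the total."""
    return round(total * percent / 100)


somewhat = count_from_share(30.0, total_questions)  # "somewhat problematic"
highly = count_from_share(19.6, total_questions)    # "highly problematic"
problematic_share = round((somewhat + highly) / total_questions * 100, 1)
refusal_rate = round(2 / total_questions * 100, 1)  # two refusals, both Meta AI

if __name__ == "__main__":
    print(total_questions, somewhat, highly, problematic_share, refusal_rate)
```

Under these assumptions the design yields 250 questions, the two problematic categories correspond to 75 and 49 responses (124 in total, or 49.6%), and two refusals amount to 0.8%.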

“The chatbots analyzed showed poor performance in answering questions in healthcare and medical fields prone to misinformation. Their continued implementation without adequate public information and oversight risks amplifying misinformation,” the researchers conclude.

On matters as delicate as health and medicine, we should always turn to experts.

The work was published in BMJ Open.