The Potential Pitfalls of Using AI in Bioinformatics
- 18th April 2025
- Posted by: Breige McBride
- Category: Bioinformatics

Recent AI Developments
Since the emergence of ChatGPT in November 2022, large language models (LLMs) and the tools built on them have been making headlines at a steady pace. Today the AI race is in full swing, with both governments and private entities announcing new initiatives and billion-dollar investments on a weekly basis. In the month of writing this post, the EU announced new AI gigafactories [1] valued at $20bn as part of its new $200bn InvestAI [2] scheme, in answer to the $500bn US AI investment [3] announced in January. China introduced a 1 trillion Yuan AI program [4], while Japan announced its intent to position itself as “the most AI-friendly country in the world” [5].
Companies have been equally busy this month. Meta announced its new flagship Llama 4 [6] family, whose largest model has 2 trillion parameters, making it the largest language model ever made. Google unveiled the multi-modal Gemma 3 LLM, while OpenAI announced a new open source LLM, its first since its initial open source model two years ago. Anthropic closed yet another billion-dollar funding round last month, and xAI, which acquired X (Twitter) outright this month, released Grok 3. These models all aim to top the Chinese DeepSeek model released in January, which upended the AI race by matching flagship models with a far cheaper, open source implementation.
The preceding months were equally dizzying, with both governments and companies delivering a steady stream of developments at breakneck speed.
AI in Bioinformatics
Bioinformatics, which relies heavily on machine learning to analyse large amounts of text-like data, should be an ideal use case for the AI revolution. It was therefore no surprise that bioinformaticians proved to be early adopters of the new AI landscape, yet more than two years in, little has changed in this field. This article aims to highlight a number of existing pitfalls that hinder the development of AI tools in the field of bioinformatics.
The Amplified Garbage-In, Garbage-Out (GIGO) Principle: Data Quality and Bias
LLMs are especially effective at summarising large bodies of text, condensing entire books into a handful of bullet points. This quality would appear extremely useful for summarising omics datasets, which are large, text-like structures that researchers wish to distil into key features.
Omics datasets, however, are notoriously complex and prone to technical noise, batch effects, missing values, and inherent biological variability. Feeding AI models poorly curated or unrepresentative data not only leads to suboptimal performance, it can also entrench significant biases. Models trained on datasets skewed by specific demographics, experimental protocols, or sample handling procedures may fail spectacularly or produce discriminatory results when applied to new, diverse data. Integrating multi-omics data compounds this issue further, due to the varying data types, scales, and inherent biases of each modality.
To combat this, omics data sets need to be extensively pre-processed, normalised, imputed, and quality controlled. However, when human analysts perform these tasks they already gain insights into the data, and may reach the key take-aways of the experiment before any AI is even in a position to analyse it. Moreover, the insights an AI gains from a well-curated data set may not provide biomedical researchers with more information than existing techniques already do.
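As a concrete illustration, the sketch below shows a minimal pre-processing flow for a counts matrix (samples × genes) in Python. The file name and cut-offs are assumptions made for the sake of the example, and a production pipeline would use dedicated, validated tools (such as DESeq2 for normalisation or ComBat for batch correction) rather than this simplified flow.

```python
# Minimal, illustrative pre-processing of an omics counts matrix.
# The file name and thresholds below are hypothetical examples.
import numpy as np
import pandas as pd

# Hypothetical counts matrix: rows are samples, columns are genes
counts = pd.read_csv("expression_counts.csv", index_col=0)

# Quality control: keep genes detected in at least 20% of samples
detected = (counts > 0).mean(axis=0) >= 0.2
filtered = counts.loc[:, detected]

# Normalisation: scale by library size, then log-transform
cpm = filtered.div(filtered.sum(axis=1), axis=0) * 1e6
logged = np.log1p(cpm)

# Simple imputation: fill remaining missing values with per-gene medians
imputed = logged.fillna(logged.median(axis=0))

print(f"{counts.shape[1] - filtered.shape[1]} low-coverage genes removed")
```

Notably, even this toy flow forces the analyst to inspect detection rates and library sizes, which is exactly the kind of early insight described above.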
LLMs are Black Boxes
Large language models are, at their core, models that predict the next element in a sequence, usually a token (roughly a word) in a sentence. As the LLM writes out its response to a query, each word is simply the element with the highest probability from the pool of tokens the model knows. Such probability models are used extensively in bioinformatics, but the scale of LLMs is unprecedented. A neural network used by a bioinformatician for deep learning on tabular data may have hundreds or even thousands of parameters; LLMs, by contrast, rely on billions or even trillions of parameters across complex, multi-step, multi-layered neural networks to generate probability scores for the next word in the sequence. This enormous complexity prevents engineers and scientists alike from understanding the process that led to a given result.
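As a toy illustration of this principle, the snippet below turns a handful of invented scores into a probability distribution over a four-word vocabulary and emits the most likely token. Real LLMs do the same over vocabularies of tens of thousands of tokens, with scores produced by networks of billions of parameters; everything here is made up for illustration.

```python
# Toy next-token prediction: softmax over invented scores, then argmax.
import numpy as np

vocab = ["gene", "protein", "pathway", "banana"]  # toy vocabulary
logits = np.array([2.1, 3.4, 1.7, -0.5])          # toy network outputs

# Softmax converts raw scores into a probability distribution
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The emitted token is simply the most probable one
next_token = vocab[int(np.argmax(probs))]
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```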
And in drug discovery and development, the ‘why’ is often as important as the ‘what’. Identifying a potential biomarker is useful, but understanding why the model selected specific features (genes, proteins, mutations) is critical for biological validation, mechanism-of-action studies, and ultimately, regulatory acceptance. The opacity of complex models like deep neural networks hinders trust and makes it difficult to troubleshoot or scientifically validate their outputs.
Issues with Reproducibility
The low reproducibility of published results has been a long-standing issue in science. Reproducing an AI result may be even more difficult than repeating a wet-lab experiment: it often requires identical code, software library versions (down to the patch), specific hardware configurations (especially for GPU-dependent tasks), exact hyperparameter settings, and even the same random seeds used during training. The stochastic nature of many training algorithms adds another layer of complexity.
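As a minimal sketch of what pinning down such a run involves, the snippet below fixes the random seeds for Python and NumPy and records the software environment a rerun would need to match. The seed value is arbitrary, the library list is illustrative, and GPU frameworks (e.g. PyTorch or TensorFlow) each require their own additional seeding and determinism settings.

```python
# Sketch of pinning the nondeterminism sources listed above.
import os
import platform
import random
import sys

import numpy as np

SEED = 42  # arbitrary; the point is to record and reuse the exact value

def fix_seeds(seed: int = SEED) -> None:
    """Pin the Python and NumPy RNGs (GPU frameworks need their own seeding)."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only effective if set before launch
    random.seed(seed)
    np.random.seed(seed)

def log_environment() -> dict:
    """Record the versions a rerun would have to match, down to the patch."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
    }

fix_seeds()
print(log_environment())
```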
In addition, since flagship LLMs are large, resource-intensive to run, and quickly become outdated, they are often only available from cloud providers for a limited time. Most flagship models from 2023 are no longer supported and have been superseded by newer versions, making them unavailable for confirming results obtained with them. Today's flagship models will likely face the same fate within a few years, whereas drug development projects can last more than a decade and need consistent infrastructure throughout. While smaller models that can be run locally do mitigate this problem, they offer more modest performance than the largest available models and, due to their stochastic nature, may still produce different outcomes when given the same input multiple times.
Hallucination: the AI Minefield
AIs always respond with the most likely answer based on their training set and model. This is not necessarily the correct answer; it is merely the one the model deems most probable. It is easy to see how this property may lead to unwanted outcomes when interacting with LLMs. Inconsistencies in the training data, inaccuracies in the model, or simply incorrect conclusions may lead to confabulations, commonly known as hallucinations: the AI provides a well-articulated, confident, but incorrect output. Depending on the task at hand, such outputs can include non-existent citations to back up a claim, code that does not work, or incorrect diagnoses.
Relying on LLM outputs without verification can therefore lead to spectacularly bad results. And while AI engineers are aware of this limitation and put great effort into reducing confabulations, these are still prevalent even in the latest models. They will likely remain so in the future, as the underlying principle of transformer-based LLMs is centred around probability, not factual correctness. Independent verification of results is therefore critical; a minimal example of such a check is sketched below.
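One hypothetical form such verification could take is confirming that a reference suggested by an LLM actually exists. The sketch below queries NCBI's public E-utilities service for a placeholder PubMed ID; a real check would also compare titles and authors, since a fabricated citation can reuse a valid ID.

```python
# Sketch: check whether an LLM-supplied PubMed ID resolves to a real record.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def pmid_exists(pmid: str) -> bool:
    """Return True if NCBI has a document summary for this PubMed ID."""
    resp = requests.get(EUTILS, params={"db": "pubmed", "id": pmid, "retmode": "json"})
    resp.raise_for_status()
    record = resp.json().get("result", {}).get(pmid, {})
    return bool(record) and "error" not in record

print(pmid_exists("31452104"))  # placeholder PMID, as if returned by an LLM
```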
The Ethical Tightrope of Privacy, Fairness, and Ownership
Omics data, even when seemingly anonymised, can potentially be used to re-identify individuals, especially when combined with other datasets. AI models might inadvertently learn and expose sensitive patient attributes, or yield results that lack moral and ethical considerations. Ensuring fair access to the benefits of AI, and addressing biases that could disadvantage certain patient groups, remains an open challenge.
This issue is further complicated in the case of sensitive medical data. Training AIs requires vast amounts of sensitive patient information, which government agencies and research institutions are not allowed to share with third parties. While new legal frameworks are in development, access to sufficient amounts of medical data for training or fine-tuning models remains a challenge.
The Machine Learns from Good Analyses, Not from Raw Data
A significant limitation to training new AI models for bioinformatics is the lack of data. This may sound surprising, as omics data is abundant in public repositories and may be freely used for training. What is lacking is not raw data, but well-processed, annotated, and summarised data sets. An AI cannot learn effectively from ever more raw data alone; it also needs to see the analysis process and the resulting discoveries. Surprisingly few data sets have been analysed in a standardised manner suitable for training large models to draw conclusions appropriately.
This issue is so prevalent that leading AI developers have started recruiting medical professionals en masse to provide example diagnoses in large numbers. Such projects usually target a single data type, such as chest X-rays: trained professionals are asked to provide accurate, approved diagnoses for thousands of images, and the AI is then trained on both the X-ray image and the correct, human-provided diagnosis. These data sets are effectively prepared by hand and are therefore slow to produce, hindering the progress of AI models in the biomedical sciences.
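The sketch below illustrates the shape of such hand-curated training pairs: each example couples a raw input with a clinician-approved label. The field names and values are invented for illustration.

```python
# Illustrative structure of hand-labelled training examples.
from dataclasses import dataclass

@dataclass
class LabelledExample:
    image_path: str    # raw input, e.g. a chest X-ray file
    diagnosis: str     # clinician-provided ground truth
    annotator_id: str  # who approved the label, for auditability

# Hypothetical curated examples; real projects need thousands of these
training_set = [
    LabelledExample("xrays/0001.png", "pneumonia", "radiologist_07"),
    LabelledExample("xrays/0002.png", "no finding", "radiologist_12"),
]
print(len(training_set), "curated examples")
```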
Despite the Setbacks, AI is Gradually Being Integrated into Bioinformatic Workflows
While the above (and some additional) limitations have slowed the integration of AI-based tools into bioinformatics, a number of successes have already been achieved. Perhaps the most impactful advance was the success of AlphaFold. Google DeepMind's tool is able to predict the 3D structure of proteins with unprecedented accuracy, leading to a database of millions of predicted protein structures. The model was trained on thousands of solved 3D protein structures obtained through wet-lab methods that often took years of research each. The latest version can resolve both protein and protein-RNA complexes with increasing accuracy, often reducing the time needed to determine a protein's structure from years to hours. The development of AlphaFold earned its researchers the Nobel Prize in Chemistry in 2024.
While other tools may be less visible, the field of bioinformatics is slowly transforming by integrating AI-based tools into existing pipelines instead of replacing them outright. Such approaches may offer the best of both worlds: rapid, AI-generated results drawing on a huge body of evidence, validated by human analysts using reliable, reproducible methods.
About the Author
Dr Mate Ravasz is a bioinformatician at Fios Genomics and holds a PhD in molecular biology from the Ludwig Maximilian University of Munich. After a thesis on the DNA replication machinery and its use in gene therapy, Mate completed post-doctoral projects at the University of Southampton on colon cancer and at the University of Edinburgh on T-cell immunotherapy before joining Fios Genomics in 2020. His main interests are signalling pathways, protein-protein interactions, and modelling biological networks.
Declaration of Generative AI and AI-assisted Technologies in the Writing Process
During the preparation of this work, the author created the initial outline, headings, and introduction, and then used Gemini 2.5 Pro to expand on the concepts presented in the post and to improve readability. After using this service, the author took specific suggestions from the AI and wrote the content independently. The author takes full responsibility for the content of this publication.
Sources
[2] https://ec.europa.eu/commission/presscorner/detail/en/ip_25_467
[5] https://www.insideprivacy.com/international/japans-plans-to-adopt-ai-friendly-legislation/
[6] https://ai.meta.com/blog/llama-4-multimodal-intelligence/