
Bridging Drugs and Diseases: Large Language Models Tackle Drug-Indication Translation

7 minute read

Jan 01, 2026

Introduction

The journey from a chemical compound to a life-saving medication is a complex and expensive endeavor. Drug discovery, the process of identifying chemical entities with therapeutic potential, is a significant area of scientific research. A crucial aspect of this process involves understanding a drug's approved indications – the specific diseases, conditions, or symptoms it is intended to treat, prevent, mitigate, cure, relieve, or diagnose. The ability to efficiently link drug molecules with their indications, or vice versa, holds the promise of more targeted disease treatment and a substantial reduction in the cost of developing new medicines, potentially transforming the field.

In recent years, Large Language Models (LLMs), such as GPT-3, GPT-4, LLaMA, and Mixtral, have emerged as powerful tools in artificial intelligence. These models, trained on vast amounts of text data, excel at various Natural Language Processing tasks, including text generation and translation. Their capabilities extend beyond general language understanding, showing promise in diverse scientific domains. The challenge lies in adapting these text-based models to scientific concepts, particularly when dealing with molecular structures that are typically represented visually.

Translating Molecules and Medicine

To bridge the gap between molecular structures and textual information, researchers utilize methods like the Simplified Molecular-Input Line-Entry System (SMILES). SMILES strings provide a textual representation of molecules, capturing their atoms, bonds, and structural features. This textual format allows LLMs to process and reason about molecules. The paper explores the viability of using LLMs to translate between drug molecules, represented by SMILES strings, and their corresponding indications. The study focuses on two primary tasks: "drug-to-indication," where the goal is to generate indications from a drug's SMILES string, and "indication-to-drug," where the aim is to generate a SMILES string for a drug that treats a given set of indications. Successfully achieving this translation could pave the way for finding treatments for currently untreatable diseases.
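As a concrete illustration of what a SMILES string is (not an example from the paper), the minimal sketch below parses a hand-picked molecule, aspirin, with the open-source RDKit toolkit and prints its canonical SMILES; the molecule and the library choice are assumptions made purely for demonstration.

```python
# A minimal sketch (not from the paper): a SMILES string is plain text that fully
# encodes a molecule. Aspirin and the RDKit toolkit are used purely for illustration.
from rdkit import Chem

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"        # aspirin (acetylsalicylic acid)
mol = Chem.MolFromSmiles(smiles)           # parse the string into a molecule object

if mol is not None:
    print("Heavy atoms:", mol.GetNumAtoms())
    print("Canonical SMILES:", Chem.MolToSmiles(mol))
```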

Existing research has already demonstrated the potential of AI in drug discovery and molecular design. Approaches include graph neural networks and generative AI models. Some efforts have employed GPT-based models to design molecules with desired properties, while others have used the T5 architecture for tasks like reaction prediction and converting between molecular descriptions and SMILES strings. Additional work involves generating new molecules from gene expression signatures or using recurrent neural networks and graph neural networks to predict drugs and their indications. These advancements highlight a strong foundation for leveraging AI in molecular design and drug discovery.

Evaluating LLM Capabilities for Drug-Indication Translation

This research specifically evaluates the capabilities of MolT5, a T5-based model, in performing the drug-to-indication and indication-to-drug translation tasks. The experiments utilized drug data from two prominent databases: DrugBank and ChEMBL. For the drug-to-indication task, the input was the SMILES string of an existing drug, and the target output was its associated indications. In the indication-to-drug task, the input was a set of indications, and the model aimed to generate the SMILES string of a drug that could treat those conditions.
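Both tasks are framed as text-to-text generation. The sketch below shows what the drug-to-indication direction could look like with a publicly released MolT5 checkpoint from the Hugging Face Hub; the checkpoint name (laituan245/molt5-small), the bare-SMILES prompt, and the decoding settings are illustrative assumptions rather than the paper's exact setup.

```python
# A hedged sketch of the drug-to-indication direction: feed a SMILES string to a
# T5-style model and decode free text. The checkpoint name, bare-SMILES prompt,
# and decoding settings are illustrative assumptions, not the paper's exact setup.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "laituan245/molt5-small"          # assumed public MolT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"            # example input molecule (aspirin)
inputs = tokenizer(smiles, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping the checkpoint name for the base or large variant reproduces the size comparison, and the indication-to-drug direction simply reverses the roles of the input text and the expected output.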

The study employed all available MolT5 model sizes (small, base, and large) and tested them under three different experimental configurations: evaluating baseline models on the entire dataset, evaluating them on a 20% subset, and fine-tuning the models on 80% of the dataset followed by evaluation on the remaining 20% subset. The findings indicated that larger MolT5 models consistently outperformed smaller ones across all configurations and tasks. Interestingly, fine-tuning the MolT5 models often had a negative impact on performance, suggesting that the pre-trained knowledge might be disrupted by fine-tuning on this specific task.
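The third configuration, fine-tuning on an 80% split and evaluating on the held-out 20%, could be sketched as follows with the Hugging Face Seq2SeqTrainer; the placeholder data, checkpoint name, and hyperparameters are assumptions made for illustration and are not the authors' training recipe.

```python
# A hedged sketch of configuration 3: fine-tune on an 80% split, evaluate on the
# remaining 20%. Dataset contents, checkpoint name, and hyperparameters are
# placeholders, not the authors' recipe.
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration,
                          T5Tokenizer)

model_name = "laituan245/molt5-small"              # assumed MolT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Placeholder drug-indication pairs; the real pairs would come from DrugBank/ChEMBL.
pairs = Dataset.from_dict({
    "smiles": ["CC(=O)OC1=CC=CC=C1C(=O)O"] * 10,
    "indication": ["Used to treat pain, fever, and inflammation."] * 10,
})

def preprocess(batch):
    enc = tokenizer(batch["smiles"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(batch["indication"], truncation=True,
                              max_length=256)["input_ids"]
    return enc

splits = pairs.map(preprocess, batched=True).train_test_split(test_size=0.2, seed=42)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="molt5-drug2ind",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()      # fine-tune on the 80% split (skipped in configurations 1 and 2)
trainer.evaluate()   # score on the held-out 20%
```

Configurations 1 and 2 correspond to skipping the training step and evaluating the baseline checkpoint on the full dataset or on the 20% split, respectively.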

Following these initial experiments, the researchers trained the smallest available MolT5 model from scratch using a custom tokenizer. This custom model showed improved performance on the DrugBank data for the drug-to-indication task compared to the ChEMBL data, possibly due to the richer detail in DrugBank's indication descriptions. Fine-tuning this custom model on 80% of either dataset did not degrade performance for the drug-to-indication task and even led to improvements in some metrics. However, for the indication-to-drug task, fine-tuning did not consistently improve performance on either dataset.
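A minimal sketch of building such a custom tokenizer with the Hugging Face `tokenizers` library is shown below; the BPE model, vocabulary size, special tokens, and corpus file are all illustrative assumptions rather than the authors' published configuration.

```python
# A hedged sketch of training a custom tokenizer over SMILES strings and indication
# text with the Hugging Face `tokenizers` library; the BPE model, vocabulary size,
# special tokens, and corpus file are illustrative assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel()      # byte-level splitting copes with SMILES symbols

trainer = trainers.BpeTrainer(
    vocab_size=8000,                                # assumed; tuned to the corpus in practice
    special_tokens=["<pad>", "</s>", "<unk>"],      # T5-style special tokens
)

# corpus.txt is assumed to hold one SMILES string or indication description per line.
tok.train(files=["corpus.txt"], trainer=trainer)
tok.save("custom_tokenizer.json")
```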

Challenges and Future Directions

Despite promising results, the current performance of these models is not yet satisfactory. The researchers identified a key challenge: the "signal" between SMILES strings and indications is weak. Unlike the original MolT5 task, where similar SMILES strings often had similar textual descriptions (molecular captions), in the drug-indication context, similar SMILES strings can represent different drugs with entirely different indications. Conversely, different SMILES strings might correspond to drugs with similar therapeutic uses. This lack of a direct, consistent relationship makes achieving high performance difficult. The study suggests that an intermediate representation, to which both drugs and indications map, could potentially improve performance. For instance, mapping a SMILES string to its caption and then translating that caption to an indication might be a fruitful avenue for future research.
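A rough sketch of that two-stage idea follows: the first stage uses a released MolT5 SMILES-to-caption checkpoint (name assumed), while the second, caption-to-indication stage is left as a hypothetical model that would still need to be trained.

```python
# A hedged sketch of the suggested two-stage route: SMILES -> caption -> indication.
# The first checkpoint name is assumed; the second stage is hypothetical and would
# still need to be trained on caption-indication pairs.
from transformers import pipeline

smiles2caption = pipeline("text2text-generation",
                          model="laituan245/molt5-small-smiles2caption")
caption = smiles2caption("CC(=O)OC1=CC=CC=C1C(=O)O",
                         max_new_tokens=128)[0]["generated_text"]
print(caption)

# Hypothetical second stage (no such public checkpoint; shown only to illustrate):
# caption2indication = pipeline("text2text-generation", model="path/to/caption2indication")
# indication = caption2indication(caption, max_new_tokens=128)[0]["generated_text"]
```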

Another significant limitation is the scarcity of data. The available datasets, ChEMBL and DrugBank, contain fewer than 10,000 drug-indication pairs in total. This limited data restricts the ability to establish a strong signal between SMILES strings and indications. Future work could focus on methods to enrich this data.

Overall, the experiments consistently showed that larger models tend to perform better. The researchers conclude that by utilizing larger models and accessing more data, or data with a stronger inherent signal between drug indications and SMILES strings, it may become possible to achieve successful translation and facilitate novel drug discovery. The study also acknowledges that its evaluation relied solely on automated metrics, which may not perfectly correlate with human judgment. Future research could explore incorporating human evaluation or using LLMs to assess the quality of generated indications. Additionally, exploring alternative model architectures, such as State Space Models (SSMs) like Mamba, which offer linear scaling with sequence length, could lead to better performance and computational efficiency compared to the current transformer architecture.
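For context on what automated evaluation looks like in practice, the sketch below scores a generated indication against a reference with BLEU and ROUGE via the Hugging Face `evaluate` library; the example strings and the specific metric choices are illustrative assumptions, not the paper's exact evaluation suite.

```python
# A hedged sketch of automated scoring: compare a generated indication against a
# reference with BLEU and ROUGE via the `evaluate` library. The strings and metric
# choices are illustrative, not the paper's exact evaluation suite.
import evaluate

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

predictions = ["Used for the relief of mild pain and fever."]
references = [["Used to treat pain, fever, and inflammation."]]

print("BLEU:", bleu.compute(predictions=predictions, references=references)["score"])
print("ROUGE:", rouge.compute(predictions=predictions,
                              references=[r[0] for r in references]))
```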

Conclusion

This research introduces a novel task: translating between drug molecules and their indications using large language models. By framing this as a text-to-text translation problem, the study explored two directions: generating indications from drug SMILES strings and generating SMILES strings from indications. Experiments with MolT5 models on DrugBank and ChEMBL datasets revealed that larger models generally perform better, while fine-tuning can sometimes hinder performance. The identified challenges of weak signal and data scarcity point towards future research directions, including the use of intermediate representations and data enrichment. The ultimate goal is to harness LLMs to accelerate drug discovery, leading to new treatments for unmet medical needs.


Original source: https://www.nature.com/articles/s41598-024-61124-0

#Large Language Models #Drug Discovery #Artificial Intelligence #Cheminformatics #SMILES #MolT5
