Data Evolution in Pharma: The Spread of Multimodal
Many people are familiar with how large language models (LLMs) work: you ask a program like ChatGPT a question, and it replies using an AI model built on large public datasets. For most people in their daily lives, the impact falls somewhere between a novelty and a handy work tool. But what if the same approach were applied to drug discovery? What if your ELN deployment were AI-enabled?
Today, thousands of chemists working on small molecule programs use Dotmatics ELN to capture the reactions they have performed in the lab. We want to help these scientists, and all scientists, to become much more efficient.
Now, imagine if your drug discovery R&D platform made useful predictions and recommendations on compounds to accelerate drug development. What if those predictions came from neural network models trained on your local private data, augmented with large public compound libraries folded into that private customer data? And what if, thanks to feature engineering, it could offer “magic suggestions” because it truly understands your data and the scientific domains behind the information? That’s exactly what Dotmatics is building.
Dotmatics Luma Now…and What’s Ahead
Earlier this year Dotmatics unveiled Luma, a breakthrough scientific intelligence platform that simplifies collecting and processing instrument data, and helps non-technical users easily gain critical insights directly from data. Luma opens up worlds of possibilities for customers to supercharge their existing ELNs with AI.
Currently in Luma we are building an AI prototype solution that leans on large public datasets of molecules available for purchase. When a compound is sketched, Luma runs algorithms to calculate properties such as molecular formula, molecular mass, and the IUPAC chemical name, and can register the compound to check whether it is novel within that customer’s system.
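Property calculations like these are standard cheminformatics. Luma derives them from the sketched structure itself; purely as an illustration, the sketch below starts from a molecular formula string instead (ignoring brackets, charges, and isotopes) and computes molecular mass as a weighted sum of standard atomic weights:

```python
import re

# Illustrative only: compute molecular mass from a simple formula string
# such as "C2H6O". A real toolkit works from the drawn structure and
# handles brackets, charges, isotopes, and a full periodic table.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

def molecular_mass(formula: str) -> float:
    """Sum atomic masses over element/count pairs parsed from the formula."""
    mass = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return mass

mass = molecular_mass("C2H6O")  # ethanol
```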
In that same prototype, Luma also runs additional algorithms against the same structure. These algorithms search a database of billions of purchasable molecules to show the chemist whether the compound they are considering synthesizing can simply be bought. Luma can also surface similar (close but non-identical) structures from both external and internal sources.
Plus, Luma can predict compound activities using neural network models built on the customer’s local private data. This is an important element of Dotmatics’ AI roadmap, because customer data, and the AI models built on that data, always remain private to that customer. To create a much broader pool of training data, we can also fold large public datasets into those local private datasets, similar to what ChatGPT does.
Best in Breed Technology Under the Hood
The technical implementation of these additional algorithms leans on AWS hardware and Databricks to deliver a scalable, cloud-based solution.
For searching billions of compounds we use a highly scalable AWS in-memory database, which means chemical similarity queries against 10B+ compounds return in under one second.
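The production index lives in an in-memory database, but the core scoring step can be sketched in a few lines. Below is a minimal, pure-Python illustration of Tanimoto similarity over bit-vector fingerprints, the standard chemical similarity metric; the compound IDs and fingerprints are made up, and a real system would precompute and index fingerprints rather than scan a dict:

```python
# Minimal sketch of chemical similarity scoring. Fingerprints are stored as
# integer bitmasks; the "library" is a toy dict standing in for an indexed
# in-memory store of billions of precomputed fingerprints.

def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity: shared on-bits / total on-bits in either fingerprint."""
    shared = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    return shared / total if total else 0.0

def top_matches(query_fp, library, threshold=0.7):
    """Return (compound_id, score) pairs at or above threshold, best first."""
    hits = [(cid, tanimoto(query_fp, fp)) for cid, fp in library.items()]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)

library = {"CMPD-1": 0b101101, "CMPD-2": 0b101100, "CMPD-3": 0b010010}
hits = top_matches(0b101101, library)  # CMPD-1 is exact; CMPD-3 shares no bits
```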
The neural network models used to predict activity are built by tokenizing the molecules with Dotmatics’ proprietary chemical toolkit, then training the models using TensorFlow within Databricks. TensorFlow, originally released to the community by the Google Brain team, is highly scalable and makes effective use of whatever CPU or GPU resources are available on the AWS hardware Databricks runs on, all delivered to Dotmatics customers through the Luma platform.
Feature Engineering is a Must Have
In high-complexity areas like drug discovery, feature engineering is an essential requirement of any AI solution. Feature engineering simply means adding human knowledge to guide the AI towards better models; it is what elevates a generic AI method into an application-specific one. In ChatGPT, feature engineering is done by “tokenizing words.” Tokenization refers to the process of converting a sequence of text into smaller parts, known as tokens. In the context of small molecules within AI by Dotmatics, feature engineering consists of tokenizing molecules.
A number of tokenizer options are available for molecules in Dotmatics, but since most ELN sketches done by chemists are 2D (flat) structures, Morgan-style fingerprints are a reasonable starting point. Additional tokens based on properties like molecular weight or a logP estimate can give the neural network further information to train on.
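The details of Dotmatics’ proprietary toolkit are not public, but the general idea behind a Morgan-style fingerprint can be sketched: iteratively hash each atom’s bonded neighborhood, fold the hashes into a fixed-length bit vector, then append property-derived tokens. Everything below is an illustrative simplification (heavy-atom graph only, no bond orders or stereochemistry, and a hypothetical molecular-weight bucketing scheme):

```python
import zlib

# Toy Morgan-style fingerprint: hash growing circular atom environments and
# fold them into a fixed-length bit vector. Real toolkits (e.g. RDKit) work
# on full molecular graphs with bond orders; this is the bare idea only.

def _h(obj):
    """Deterministic hash (built-in hash() varies between runs)."""
    return zlib.crc32(repr(obj).encode())

def morgan_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """atoms: element symbols; bonds: (i, j) index pairs between atoms."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    env = {i: _h(sym) for i, sym in enumerate(atoms)}  # radius-0 environments
    bits = set()
    for _ in range(radius + 1):
        bits.update(e % n_bits for e in env.values())  # fold into bit positions
        env = {i: _h((env[i], tuple(sorted(env[j] for j in neighbors[i]))))
               for i in env}                           # grow environments by one bond
    fp = [0] * n_bits
    for b in bits:
        fp[b] = 1
    return fp

# Hypothetical property token: append a coarse molecular-weight bucket so the
# network also sees overall molecule size (heavy atoms only here).
ATOMIC_MASS = {"C": 12.011, "N": 14.007, "O": 15.999}

def with_property_tokens(fp, atoms, n_buckets=8, max_mw=800.0):
    mw = sum(ATOMIC_MASS.get(a, 0.0) for a in atoms)
    bucket = min(int(mw / max_mw * n_buckets), n_buckets - 1)
    return fp + [1 if b == bucket else 0 for b in range(n_buckets)]

# Ethanol as a heavy-atom graph: C-C-O
fp = with_property_tokens(morgan_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)]),
                          ["C", "C", "O"])
```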
Neural network models in TensorFlow can be trained taking tokenized molecules as inputs and known activities (e.g., pIC50) as outputs. The model is trained to “see” patterns in the input data; when a molecule of unknown activity is passed through, the model returns a prediction. Showing the compound a chemist is making in the context of similar chemistry within the customer’s intellectual property, plus similar chemistry from external purchasable sets, lets chemists decide whether to continue with the synthesis in the lab or buy the compound instead. If a compound can be purchased from an external supplier, it is highly likely many variants can be purchased too.
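As a toy stand-in for that TensorFlow workflow, the sketch below trains a one-hidden-layer regression network with plain stochastic gradient descent, mapping bit-vector fingerprints to pIC50 values. The fingerprints and activities are invented for illustration; a production model would use TensorFlow on far richer tokenizations:

```python
import math
import random

# Toy activity model: tokenized molecules (4-bit fingerprints) in, pIC50 out.
# One tanh hidden layer, mean-squared-error loss, per-sample SGD.
random.seed(0)

def init_net(n_in, n_hidden):
    w1 = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [random.gauss(0, 0.5) for _ in range(n_hidden)]
    return w1, b1, w2, 0.0

def forward(net, x):
    w1, b1, w2, b2 = net
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    return h, sum(w * hi for w, hi in zip(w2, h)) + b2

def train(net, data, lr=0.05, epochs=500):
    w1, b1, w2, b2 = net
    for _ in range(epochs):
        for x, target in data:
            h, y = forward((w1, b1, w2, b2), x)
            err = y - target                              # d(loss)/dy
            for j in range(len(w2)):
                grad_pre = err * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
                w2[j] -= lr * err * h[j]
                b1[j] -= lr * grad_pre
                for i in range(len(x)):
                    w1[j][i] -= lr * grad_pre * x[i]
            b2 -= lr * err
    return w1, b1, w2, b2

# Hypothetical fingerprint / measured-pIC50 pairs
data = [([1, 0, 1, 0], 7.1), ([1, 1, 1, 0], 7.4),
        ([0, 0, 1, 1], 5.2), ([0, 1, 0, 1], 5.0)]
net = train(init_net(4, 3), data)
pred = forward(net, [1, 0, 1, 1])[1]  # prediction for an unseen compound
```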
This is critical because the ideal goal in drug discovery is to identify a developable series, not simply find one compound that is active. Discovering a family of compounds ensures that if one compound fails there is a supply of active “backup” compounds to go to next. Finding activity in purchasable compounds significantly reduces discovery time, cost and effort.
Small Molecules are Just the Start
Small molecules are the perfect starting point for these types of research programs and functionality pilots within the Luma platform; but this is only the beginning. Consider the possibility of modeling domains of information, such as DNA encoded libraries, formulations, images, etc.
Scientific acumen, world-class technology infrastructure, and domain-specific feature engineering are part of what separates Dotmatics solutions from more general off-the-shelf R&D approaches.
Learn more about Dotmatics Luma and how the scientific intelligence platform enables AI capabilities to accelerate scientific discovery.