DataSAIL enables better evaluation of AI models
Roman Joeres and Professor Olga Kalinina from the Drug Bioinformatics Group at the Helmholtz Institute for Pharmaceutical Research Saarland, together with Professor David Blumenthal from Friedrich-Alexander University in Erlangen, have developed a new method for evaluating AI models more accurately. Their tool “DataSAIL” sets new standards for data splitting, a key step in the development of machine-learning models.
AI models are trained on huge amounts of data and must be tested before being used in practice. The test data used for this purpose should reflect the data encountered in practice as accurately as possible. Data splitting divides the available data into the portions used for the individual development steps, such as training and testing. Depending on which splitting algorithm is used, the resulting test data is more or less similar to the training data. If the training data and test data are very different, the test data is referred to as out-of-distribution data (OOD data).
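To illustrate the difference between an ordinary random split and a split designed to produce dissimilar test data, here is a small toy example. It is not DataSAIL itself; it uses scikit-learn and hypothetical cluster labels that stand in for a measure of similarity between samples.

# Illustrative only: contrast a random split with a group-aware split that keeps
# whole groups of similar samples out of the training set. This is NOT DataSAIL's
# algorithm, just a toy example built on scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))              # 100 samples, 8 features
clusters = rng.integers(0, 10, size=100)   # hypothetical similarity clusters

# Random split: similar samples can land in both train and test (an "easy" test set).
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Group-aware split: each cluster goes entirely to train or to test (harder, OOD-like).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=clusters))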
In the new paper, the three researchers show in a theoretical section that the problem of splitting a data set such that the test data is as different as possible from the training data belongs to a class of particularly hard computational problems. At the same time, they present an algorithm that formulates data splitting as an optimization problem and then solves it heuristically.
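The following is a minimal sketch of the general idea behind such a formulation, not the authors' exact objective function or solver: group similar samples into clusters, then assign whole clusters to the splits so that the resulting sizes stay close to the requested proportions.

# Minimal sketch (not the paper's exact formulation or solver): treat splitting as an
# assignment of similarity clusters to train/test so that the resulting split sizes
# come as close as possible to the requested proportions.
def assign_clusters(cluster_sizes, fractions=(0.8, 0.2)):
    """Greedily assign clusters (largest first) to the split that is
    currently furthest below its target share of the data."""
    total = sum(cluster_sizes.values())
    targets = [f * total for f in fractions]
    filled = [0.0] * len(fractions)
    assignment = {}
    for cid, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        deficits = [t - f for t, f in zip(targets, filled)]
        split = deficits.index(max(deficits))   # split with the largest deficit
        assignment[cid] = split
        filled[split] += size
    return assignment

# Example: cluster IDs mapped to their number of samples.
print(assign_clusters({0: 40, 1: 25, 2: 20, 3: 10, 4: 5}))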
DataSAIL is the first tool to make this approach available for all types of data, not just biological data. It is also the first tool that can automatically split interaction data. Interaction data poses a particular challenge, because similarities between two types of entities must be taken into account when splitting. For example, if one wants to develop a model that predicts interactions between drugs and target proteins from a data set of known interactions, the model must be tested on whether it works for new drugs or new target proteins. DataSAIL sets new standards in a third area by taking similarities and class distributions into account at the same time, for example to ensure a similar proportion of female and male data points in testing as in training. This can prevent a model from working better for one gender than for the other. In their evaluation, the researchers show that DataSAIL generates harder data splits than existing algorithms and thus makes it possible to realistically estimate the performance of AI models in OOD scenarios.
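To see why interaction data is harder to split, consider the following toy example (again not DataSAIL's code, and with made-up drug and protein names): in a split along both entity types, the test interactions should only involve drugs and proteins that never appear in the training interactions, and pairs that mix seen and unseen entities have to be discarded.

# Toy illustration of a two-dimensional split of interaction data.
interactions = [("drugA", "prot1"), ("drugA", "prot2"),
                ("drugB", "prot1"), ("drugC", "prot3"),
                ("drugD", "prot4")]

test_drugs, test_prots = {"drugD"}, {"prot4"}

train, test, dropped = [], [], []
for drug, prot in interactions:
    if drug in test_drugs and prot in test_prots:
        test.append((drug, prot))        # both entities unseen during training
    elif drug not in test_drugs and prot not in test_prots:
        train.append((drug, prot))       # both entities restricted to training
    else:
        dropped.append((drug, prot))     # mixed pairs must be discarded

print(train, test, dropped)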
“To test AI models properly for use on OOD data, it is important to validate them on test data that is as different as possible from the data the models were trained on,” says first author Roman Joeres. “DataSAIL makes it easy for developers of AI models to compute such a split into training and test data.”
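As a rough impression of what such a call could look like, here is a hedged sketch based on our reading of the DataSAIL documentation; the function location, argument names, and technique codes are assumptions and may differ in the released version.

# Hedged sketch of a DataSAIL call (argument names and technique codes are
# assumptions based on the project's documentation and may differ in practice).
from datasail.sail import datasail

e_splits, f_splits, inter_splits = datasail(
    techniques=["C1e"],            # cluster-based split along one entity type
    splits=[8, 2],                 # requested 80/20 train/test proportions
    names=["train", "test"],
    e_type="M",                    # small molecules, given as SMILES strings
    e_data={"mol1": "CCO", "mol2": "c1ccccc1", "mol3": "CC(=O)O"},
)
print(e_splits)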
The researchers plan to further develop and extend the tool over the next few years, reducing the algorithm's runtime and enabling even more precise splits for different practical scenarios.
The work was published in Nature Communications on April 8, 2025 (DOI: 10.1038/s41467-025-58606-8).