
Building a ready-made cancer data library

Researchers built a cancer database combining 4 types of molecular data across 32 cancer types to enable consistent use of machine learning in cancer research.


Image Credit: Photo by Kevin Ku on Unsplash

Computational cancer researchers who use machine learning techniques face a fundamental problem. Massive amounts of data exist for training machine learning models, but this training requires extensive processing due to inconsistencies in format, naming, structure, and other properties of the data files. This means that when scientists use different cancer types and data-cleaning steps, the resulting models behave differently.

Researchers have observed that the gap between available and usable datasets is a barrier for scientists without specialized bioinformatics training. The differences in processing strategies also make it impossible to fairly compare new machine learning methods and select the best-performing one for a given cancer research task, like classifying patient samples as benign or malignant.

Therefore, researchers in Japan and the USA collaborated to create a comprehensive database containing genetic and molecular information from over 8,000 cancer patients, specifically designed for machine learning applications. They named the database MLOmics. Like a well-organized library where books are already sorted, labeled, and ready to read, MLOmics provides cancer data that computer models can immediately use without requiring extensive processing. 

To build MLOmics, they gathered patient samples across 32 cancer types from a publicly available database called The Cancer Genome Atlas. For each patient, they collected 4 types of molecular data: 2 types of RNA molecules copied from DNA, collectively called transcriptomics data; data on DNA regions whose number of copies varies, called copy number variation; and data on chemical tags attached to DNA, called methylation. For the transcriptomics data, the team labeled the experimental source of each measurement, which influences data quality, removed contamination from non-human samples, and removed unlabeled values.

For the copy number variation data, the researchers selected cancer-specific repeats, then identified recurrent abnormal repeats and labeled them with the genes located in those regions. They adjusted the methylation data to remove bias introduced by different experimental platforms. Finally, the team labeled all the molecular data they processed with uniform identifiers to resolve variations in naming conventions.
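
The identifier step can be pictured as a lookup from alternative names to one canonical name. The following is a minimal sketch, not the researchers' code: the alias table, the function names, and the choice of gene symbols as the uniform identifier are all assumptions made for illustration.

```python
# Hypothetical alias table mapping alternative gene names to one canonical symbol.
# The entries are illustrative and are not taken from MLOmics.
ALIAS_TO_CANONICAL = {
    "HER2": "ERBB2",
    "NEU": "ERBB2",
    "P53": "TP53",
}


def unify_identifier(name: str) -> str:
    """Return the canonical symbol for a gene name, ignoring case and outer spaces."""
    key = name.strip().upper()
    return ALIAS_TO_CANONICAL.get(key, key)


def unify_columns(row: dict[str, float]) -> dict[str, float]:
    """Rename the keys of one sample so every data type uses the same symbols."""
    unified: dict[str, float] = {}
    for name, value in row.items():
        canonical = unify_identifier(name)
        if canonical in unified:
            # Two aliases of the same gene in one sample need a manual decision.
            raise ValueError(f"duplicate gene after renaming: {canonical}")
        unified[canonical] = value
    return unified


sample = {"her2": 2.5, " TP53 ": 0.1}
print(unify_columns(sample))
```

Raising an error on duplicates, rather than silently overwriting a value, is one possible design choice; a real pipeline might average or prioritize measurements instead.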

Next, they built a coding pipeline that checked data quality and combined the molecular data types for each patient into a single dataset. This approach is known as multi-omics because it combines multiple molecular measurements. Then, the researchers matched each patient’s samples to their corresponding cancer types to produce organized datasets ready for analysis.
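
To make the idea of combining data types concrete, here is a rough sketch of how per-patient measurements could be merged into one row. It is an illustration under stated assumptions, not the MLOmics pipeline: the function `combine_omics`, the patient IDs, the feature names, the finite-value check, and the policy of keeping only patients present in every data type are all hypothetical.

```python
import math


def combine_omics(tables: dict[str, dict[str, dict[str, float]]]) -> dict[str, dict[str, float]]:
    """Merge per-data-type tables into one feature row per patient.

    `tables` maps a data-type name to {patient_id: {feature: value}}.
    Keeping only patients found in every table is an assumed policy.
    """
    if not tables:
        return {}
    shared = set.intersection(*(set(table) for table in tables.values()))
    combined: dict[str, dict[str, float]] = {}
    for patient in sorted(shared):
        row: dict[str, float] = {}
        for omics_type, table in tables.items():
            for feature, value in table[patient].items():
                # Simple quality check: reject NaN or infinite measurements.
                if not math.isfinite(value):
                    raise ValueError(f"bad value for {patient}/{feature}")
                # Prefix the feature so names from different data types cannot collide.
                row[f"{omics_type}:{feature}"] = value
        combined[patient] = row
    return combined


# Hypothetical toy input: two data types, with patient P3 missing from the second.
tables = {
    "mrna": {"P1": {"GENE_A": 1.5}, "P2": {"GENE_A": 0.3}, "P3": {"GENE_A": 2.0}},
    "methylation": {"P1": {"GENE_A": 0.8}, "P2": {"GENE_A": 0.1}},
}
combined = combine_omics(tables)
print(combined)
```

In this toy run, P3 is dropped because it has no methylation measurement, so every remaining row has the same set of columns.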

The researchers constructed 20 task-ready datasets spanning 3 categories of machine learning problems and provided appropriate metrics to evaluate the models from each category. They aimed to demonstrate how other scientists can use MLOmics for a range of common tasks. 

The first category, classification, included 6 datasets. Scientists could use these datasets to train models to group samples into known classes, such as malignant or benign tumors. The second category, clustering, included 9 datasets. When predefined labels are unavailable, scientists use clustering to see how the samples naturally group together based on their molecular patterns. The final category, data imputation, included 5 datasets that could help scientists address molecular data left incomplete by experimental or technical errors. Missing values are common in real-world datasets, and this category demonstrates how models can estimate and fill them in.
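
The imputation task can be illustrated with a small example: hide some known values, fill them in, and measure how far the estimates land from the truth. This is a sketch under assumptions, not the method or metric used in the study. Column-mean filling is the simplest possible baseline, mean squared error is one common way to score it, and the numbers are made up.

```python
from statistics import fmean
from typing import Optional


def impute_column_means(rows: list[list[Optional[float]]]) -> list[list[float]]:
    """Replace each missing value (None) with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        means.append(fmean(observed))
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)] for row in rows]


def masked_mse(truth, filled, hidden) -> float:
    """Mean squared error computed only at the positions that were hidden."""
    return fmean((truth[i][j] - filled[i][j]) ** 2 for i, j in hidden)


# Made-up complete data: 3 samples (rows) by 2 molecular features (columns).
truth = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
hidden = [(1, 0), (2, 1)]  # positions deliberately removed to simulate missing data

masked = [
    [None if (i, j) in hidden else value for j, value in enumerate(row)]
    for i, row in enumerate(truth)
]
filled = impute_column_means(masked)
print(filled, masked_mse(truth, filled, hidden))
```

Scoring only the hidden positions matters: the values that were never removed are copied through unchanged and would make any method look better than it is.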

The researchers also structured their MLOmics database into 3 sections, with detailed usage guidelines for each. The first section hosts the task-ready cancer multi-omics datasets, stored primarily as comma-separated values (CSV) files. The team chose CSV files because they remain efficient even with large genomic datasets, and programming languages like Python and R have built-in functions to read, write, and analyze them. The second section provides code files to help scientists develop models and apply evaluation metrics. The final section contains links to additional resources that complement the main datasets for other biological analyses and make the database accessible to all interested scientists, regardless of their educational backgrounds.
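
As an illustration of why CSV is convenient, the sketch below reads a small table with Python's built-in `csv` module. The column names, patient IDs, and one-row-per-patient layout are assumptions made for the example; the actual MLOmics files may be organized differently.

```python
import csv
import io

# Hypothetical file content; real MLOmics files may use a different layout.
EXAMPLE_CSV = """patient_id,GENE_A,GENE_B,label
P1,0.5,1.2,type_1
P2,0.7,0.9,type_2
"""


def load_dataset(handle):
    """Read a CSV with one row per patient into feature and label dictionaries."""
    features, labels = {}, {}
    for row in csv.DictReader(handle):
        patient = row.pop("patient_id")
        labels[patient] = row.pop("label")
        # Every remaining column is treated as a numeric molecular feature.
        features[patient] = {name: float(value) for name, value in row.items()}
    return features, labels


features, labels = load_dataset(io.StringIO(EXAMPLE_CSV))
print(features["P1"], labels["P2"])
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the in-memory `io.StringIO` object.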

The researchers concluded that MLOmics is a valuable resource for the cancer research community because it allows researchers to focus on developing better algorithms rather than spending time on data preparation. They emphasized that MLOmics is suitable for non-experts and supports interdisciplinary research and broader biological studies. They committed to continuously updating MLOmics with additional resources and tasks to ensure the database remains current as the field advances. 

Study Information

Original study: MLOmics: Cancer Multi-Omics Database for Machine Learning

Study was published on: May 30, 2025

Study author(s): Ziwei Yang, Rikuto Kotoge, Xihao Piao, Zheng Chen, Lingwei Zhu, Peng Gao, Yasuko Matsubara, Yasushi Sakurai, Jimeng Sun

The study was done at: Kyoto University (Japan), The University of Tokyo (Japan), Osaka University (Japan), University of Illinois Urbana-Champaign (USA)

The study was funded by: Grant-in-Aid for Scientific Research

Raw data availability: Found on GitHub

This summary was edited by: Madeline Taylor