About scMKL

scMKL (single-cell Multiple Kernel Learning) is a binary classifier. It uses Random Fourier Features (RFFs) to construct multiple approximate kernels, which are passed to Group Lasso to make classifications.

Single-cell features are first organized into groupings, such as gene sets for transcriptomics data. The data are then transformed with RFFs to create approximate kernels, which serve as the input to Group Lasso. Approximating the kernels in this way is what allows scMKL to scale to the volume of single-cell data.
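The RFF transformation described above can be illustrated with a minimal sketch. It assumes a Gaussian kernel and uses only the Python standard library; the function names, toy data, bandwidth `sigma`, and number of components are hypothetical and are not taken from the scMKL implementation.

```python
import math
import random


def rff_map(X, n_components, sigma, seed=0):
    """Map each row of X to random Fourier features for a Gaussian kernel."""
    rng = random.Random(seed)
    d = len(X[0])
    # Frequencies are drawn from N(0, 1/sigma^2); phases are uniform on [0, 2*pi).
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(n_components)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    scale = math.sqrt(2.0 / n_components)
    return [
        [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + phase)
         for w, phase in zip(W, b)]
        for x in X
    ]


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel value, used here only for comparison."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical toy data: 3 cells x 4 features belonging to one grouping.
X = [[0.1, 0.4, 0.0, 0.3], [0.2, 0.1, 0.5, 0.3], [0.9, 0.8, 0.7, 0.1]]
Z = rff_map(X, n_components=2000, sigma=1.0)

# The inner product of two transformed cells approximates their kernel value.
approx = sum(a * c for a, c in zip(Z[0], Z[1]))
exact = gaussian_kernel(X[0], X[1], sigma=1.0)
print(f"approximate: {approx:.3f}  exact: {exact:.3f}")
```

The transformed matrix has one row per cell and a fixed number of columns, so downstream computation grows linearly with the number of cells instead of requiring a full cell-by-cell kernel matrix.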

Group Lasso assigns a weight to each grouping based on how predictive that grouping is for distinguishing between two cell classes. Its regularization coefficient lets the user control how many groupings receive nonzero weights in the final model, and it can be tuned for optimal accuracy. Because the model reports which groupings are selected and how strongly they are weighted, the results of scMKL are interpretable.
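The link between the regularization coefficient and the number of nonzero groupings can be sketched with the group-wise soft-thresholding step commonly associated with the Group Lasso penalty. This is an illustration under stated assumptions, not scMKL's solver: the grouping names, weights, and `alpha` values are hypothetical, and the usual group-size scaling of the penalty is omitted for brevity.

```python
import math


def group_soft_threshold(weights_by_group, alpha):
    """Shrink each group's weight vector; groups with norm <= alpha become all zero."""
    shrunk = {}
    for name, w in weights_by_group.items():
        norm = math.sqrt(sum(v * v for v in w))
        factor = max(0.0, 1.0 - alpha / norm) if norm > 0 else 0.0
        shrunk[name] = [factor * v for v in w]
    return shrunk


def selected_groups(weights_by_group):
    """Return the names of groups that still have a nonzero weight."""
    return [g for g, w in weights_by_group.items() if any(v != 0.0 for v in w)]


# Hypothetical unpenalized weights for three groupings.
weights = {
    "GROUP_A": [3.0, 4.0],  # norm 5.0
    "GROUP_B": [0.6, 0.8],  # norm 1.0
    "GROUP_C": [0.0, 0.1],  # norm 0.1
}

# A larger alpha removes more groupings from the model.
results = {a: selected_groups(group_soft_threshold(weights, a)) for a in (0.05, 0.5, 2.0)}
for a, groups in results.items():
    print(a, groups)
```

With these toy values, all three groupings survive at the smallest `alpha`, while only the grouping with the largest norm survives at the largest one.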

This framework gives a straightforward approach to integrating different data types, such as RNA and ATAC data, into a single model.

Experimental Design

Seven single-cell datasets were used to evaluate the performance of scMKL and to compare it with other methods of single-cell analysis, as shown below. scMKL was used to predict cell labels for each dataset.

To obtain robust results, we used 100 different train/test splits. For each split, we used 10 different sparsity arguments, giving a range of group selection for each split. This layout yields 1,000 models in total for each grouping/modality combination.
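The layout of 100 splits by 10 sparsity arguments can be enumerated as in the sketch below. The cell count, the 80/20 train fraction, and the specific sparsity values are assumptions made for illustration; only the counts of 100 and 10 come from the description above.

```python
import itertools
import random


def make_splits(n_cells, n_splits=100, train_fraction=0.8, seed=0):
    """Return a list of (train_indices, test_indices) pairs from random shuffles."""
    rng = random.Random(seed)
    cut = int(train_fraction * n_cells)
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_cells))
        rng.shuffle(idx)
        splits.append((idx[:cut], idx[cut:]))
    return splits


# Hypothetical dataset size and sparsity grid.
splits = make_splits(n_cells=500)
alphas = [round(0.2 * (i + 1), 1) for i in range(10)]

# One model is trained per (split, sparsity argument) pair.
runs = list(itertools.product(range(len(splits)), alphas))
print(len(runs))
```

Repeating this grid for every grouping/modality combination gives the full set of models summarized in the dashboard.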

Citations

Ors, Aysegul, Alex Daniel Chitsazan, Aaron Reid Doe, Ryan M. Mulqueen, Cigdem Ak, Yahong Wen, Syber Haverlack et al. "Estrogen regulates divergent transcriptional and epigenetic cell states in breast cancer." Nucleic acids research 50, no. 20 (2022): 11492-11508.
Identification of a tumor-specific gene regulatory network in human B-cell lymphoma, Single Cell Multiome ATAC + Gene Expression, 10x Genomics, (2021)
Song, Hanbing, Hannah NW Weinstein, Paul Allegakoen, Marc H. Wadsworth, Jamie Xie, Heiko Yang, Ethan A. Castro et al. "Single-cell analysis of human primary prostate cancer reveals the heterogeneity of tumor-associated epithelial cell states." Nature communications 13, no. 1 (2022): 141.
Eksi, Sebnem Ece, Alex Chitsazan, Zeynep Sayar, George V. Thomas, Andrew J. Fields, Ryan P. Kopp, Paul T. Spellman, and Andrew C. Adey. "Epigenetic loss of heterogeneity from low to high grade localized prostate tumours." Nature communications 12, no. 1 (2021): 7292.
Wolf, F. Alexander, Philipp Angerer, and Fabian J. Theis. "SCANPY: large-scale single-cell gene expression data analysis." Genome biology 19 (2018): 1-5.
Fang, Zhuoqing, Xinyuan Liu, and Gary Peltz. "GSEApy: a comprehensive package for performing gene set enrichment analysis in Python." Bioinformatics 39, no. 1 (2023): btac757.
McInnes, Leland, John Healy, and James Melville. "Umap: Uniform manifold approximation and projection for dimension reduction." arXiv preprint arXiv:1802.03426 (2018).
Chen, Tianqi, and Carlos Guestrin. "Xgboost: A scalable tree boosting system." In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785-794. 2016.
Tolstikhin, Ilya O., Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung et al. "Mlp-mixer: An all-mlp architecture for vision." Advances in neural information processing systems 34 (2021): 24261-24272.

scMKL Performance

Selections

Datasets: MCF7, T47D, Lymphoma, Prostate RNA, Prostate ATAC, LUSC, LUAD

Metric

Additional Options

x-axis variable

Select All Datatype - Groupings

Datatype - Grouping

Single-cell Analysis with scanpy

Dataset and Modality

Datasets: MCF7, T47D, Lymphoma, Prostate_RNA, Prostate_ATAC, LUAD, LUSC

Modality: RNA, ATAC

Subsets

Collection

Group Subset

Labels

NOTE: When Group Subset is None, the most variable features are used.

UMAP
Volcano Plot

Gene Set Enrichment Analysis

Gene set enrichment was computed with GSEApy, using the differentially expressed genes calculated by scanpy.

Dataset Selection: MCF7, T47D, Lymphoma, Prostate_RNA, LUAD, LUSC

Feature Groupings

Hallmark Gene Sets

Background

Top Group Feature Overlap

Proportion of Unique Features in Grouping

Cistrome TFBMs

Background

Top Group Feature Overlap

Proportion of Unique Features in Grouping

JASPAR TFBMs

Background

Top Group Feature Overlap

Proportion of Unique Features in Grouping

scMKL Performance

Selections

scMKL vs. Other State-of-the-Art Models

XGBoost uses gradient-boosted decision trees to classify samples.

MLP (multilayer perceptron) uses a layered feedforward neural network to classify samples.

When scMKL is selected, the best-performing alpha is used to plot the results.
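One way to pick a best-performing alpha is sketched below. It assumes that "best" means the highest mean value of the chosen metric across train/test splits; the dashboard does not state the exact criterion, and the scores shown are hypothetical placeholders, not results.

```python
from statistics import mean


def best_alpha(scores_by_alpha):
    """Return the alpha with the highest mean score, plus all mean scores."""
    means = {alpha: mean(scores) for alpha, scores in scores_by_alpha.items()}
    return max(means, key=means.get), means


# Hypothetical metric values (one per split) for three alpha values.
scores = {
    0.1: [0.90, 0.92, 0.91],
    0.5: [0.94, 0.95, 0.93],
    1.0: [0.88, 0.90, 0.89],
}

alpha, means = best_alpha(scores)
print(alpha)
```

Other criteria, such as the median across splits, would be equally easy to substitute in `best_alpha`.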

scMKL Interpretation via Weights and Selection

Selections

Top Group Normalized Weights

Heatmap of scMKL Group Selection Frequency

scMKL Feature Selections

Gene Set Enrichment Analysis

Gene set enrichment was computed with GSEApy, using the differentially expressed genes calculated by scanpy.

GO Biological Process Gene Sets

Hallmark Gene Sets

Links

Sam Kupp

Ian VanGordon

Cigdem Ak