About scMKL

scMKL (single-cell Multiple Kernel Learning) is a binary classifier. It uses Random Fourier Features (RFFs) to build multiple approximate kernels, which are passed to Group Lasso to classify cells.

Single-cell features are organized into groupings, such as gene sets for transcriptomics data. The data for each grouping is then transformed with RFFs to create an approximate kernel, and these kernels are used as the inputs to Group Lasso. Approximating the kernels this way lets scMKL scale to the volume of single-cell data.
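The RFF step can be illustrated with a minimal NumPy sketch. It is not the scMKL implementation: the Gaussian kernel, the bandwidth `sigma`, the number of components, the toy matrix, and the grouping names are all assumptions chosen for illustration. The sketch maps each grouping's columns to random features `Z` so that `Z @ Z.T` approximates the exact kernel without ever forming a cells-by-cells matrix during training.

```python
import numpy as np

def rff_transform(X, n_components=2000, sigma=1.0, seed=0):
    """Map X (cells x features) to random Fourier features Z such that
    Z @ Z.T approximates a Gaussian kernel with bandwidth sigma."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Frequencies are drawn from the Fourier transform of the Gaussian kernel.
    W = rng.normal(scale=1.0 / sigma, size=(n_features, n_components))
    # Random phase shifts, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

def gaussian_kernel(X, sigma=1.0):
    """Exact Gaussian kernel, used here only to check the approximation."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

# Toy data: 50 cells x 20 features, with two hypothetical groupings.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 20))
groups = {"set_a": [0, 1, 2, 3, 4], "set_b": [3, 4, 5, 6, 7, 8]}

# One approximate kernel (as a feature map) per grouping.
Z_by_group = {
    name: rff_transform(X[:, idx], sigma=2.0, seed=i)
    for i, (name, idx) in enumerate(groups.items())
}

# Compare the approximation with the exact kernel for one grouping.
K_exact = gaussian_kernel(X[:, groups["set_a"]], sigma=2.0)
K_approx = Z_by_group["set_a"] @ Z_by_group["set_a"].T
max_abs_error = float(np.max(np.abs(K_exact - K_approx)))
```

Each feature map has shape cells x components, so memory grows linearly with the number of cells, whereas an exact kernel matrix grows quadratically.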

Group Lasso assigns a weight to each grouping based on how predictive that grouping is for distinguishing between two cell classes. Its regularization coefficient controls how many groupings receive nonzero weights in the final model, and it can be tuned for optimal accuracy. This group-level sparsity makes the results of scMKL interpretable.
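The effect of the regularization coefficient on group selection can be sketched with a small proximal-gradient solver. This is an illustrative stand-in, not the solver scMKL uses: the logistic loss, the square-root group-size weighting, the toy data, and the names `fit_group_lasso` and `alpha` are assumptions.

```python
import numpy as np

def group_soft_threshold(w, threshold):
    """Shrink a group's weight vector toward zero; zero it out entirely
    when its norm does not exceed the threshold."""
    norm = np.linalg.norm(w)
    if norm <= threshold:
        return np.zeros_like(w)
    return (1.0 - threshold / norm) * w

def fit_group_lasso(Z, y, group_slices, alpha, n_iter=500):
    """Logistic regression with a group lasso penalty (proximal gradient)."""
    n = Z.shape[0]
    w = np.zeros(Z.shape[1])
    # Step size 1/L, where L = ||Z||_2^2 / (4n) bounds the loss curvature.
    step = 4.0 * n / (np.linalg.norm(Z, 2) ** 2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ w)))
        w = w - step * (Z.T @ (p - y) / n)
        for sl in group_slices.values():
            size = sl.stop - sl.start
            w[sl] = group_soft_threshold(w[sl], step * alpha * np.sqrt(size))
    return w

# Toy data: only the first grouping carries signal about the labels.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 15))
group_slices = {
    "informative": slice(0, 5),
    "noise_1": slice(5, 10),
    "noise_2": slice(10, 15),
}
y = (Z[:, 0:5].sum(axis=1) > 0).astype(float)

# Groupings with nonzero weight at each regularization strength.
selected = {}
for alpha in (0.01, 0.1, 1.0):
    w = fit_group_lasso(Z, y, group_slices, alpha)
    selected[alpha] = [
        g for g, sl in group_slices.items() if np.linalg.norm(w[sl]) > 0
    ]
```

Larger `alpha` values remove more groupings from the model, and a large enough value removes all of them. The norm of each surviving group's weights plays the role of the grouping weight described above.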

This framework gives a straightforward approach to integrating different data types, such as RNA and ATAC data, into a single model.

Experimental Design

Seven single-cell datasets were used to evaluate the performance of scMKL and to compare it with other methods of single-cell analysis, as shown below. scMKL was used to predict cell labels for each dataset.

To obtain robust results, we used 100 different train/test splits. For each split, we used 10 different sparsity arguments, giving a range of group selection for each split. This layout yields 1,000 models in total for each grouping/modality combination.
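The layout of 100 splits by 10 sparsity arguments can be enumerated as follows. The number of cells, the 80/20 train fraction, and the specific sparsity values are placeholders, not the values used in the study.

```python
import random
from itertools import product

n_cells = 1000                                       # hypothetical dataset size
n_splits = 100                                       # train/test splits
alphas = [round(0.1 * k, 1) for k in range(1, 11)]   # 10 placeholder sparsity values

def split_indices(n_cells, seed, train_fraction=0.8):
    """Return shuffled (train, test) cell indices for one seeded split."""
    rng = random.Random(seed)
    idx = list(range(n_cells))
    rng.shuffle(idx)
    cut = int(train_fraction * n_cells)
    return idx[:cut], idx[cut:]

# One model per (split seed, sparsity value) pair.
runs = list(product(range(n_splits), alphas))
print(len(runs))  # 100 splits x 10 sparsity values
```

Seeding each split makes every one of the runs reproducible on its own.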

Citations

  • Ors, Aysegul, Alex Daniel Chitsazan, Aaron Reid Doe, Ryan M. Mulqueen, Cigdem Ak, Yahong Wen, Syber Haverlack et al. "Estrogen regulates divergent transcriptional and epigenetic cell states in breast cancer." Nucleic acids research 50, no. 20 (2022): 11492-11508.
  • Identification of a tumor-specific gene regulatory network in human B-cell lymphoma, Single Cell Multiome ATAC + Gene Expression, 10x Genomics, (2021)
  • Song, Hanbing, Hannah NW Weinstein, Paul Allegakoen, Marc H. Wadsworth, Jamie Xie, Heiko Yang, Ethan A. Castro et al. "Single-cell analysis of human primary prostate cancer reveals the heterogeneity of tumor-associated epithelial cell states." Nature communications 13, no. 1 (2022): 141.
  • Eksi, Sebnem Ece, Alex Chitsazan, Zeynep Sayar, George V. Thomas, Andrew J. Fields, Ryan P. Kopp, Paul T. Spellman, and Andrew C. Adey. "Epigenetic loss of heterogeneity from low to high grade localized prostate tumours." Nature communications 12, no. 1 (2021): 7292.
  • Wolf, F. Alexander, Philipp Angerer, and Fabian J. Theis. "SCANPY: large-scale single-cell gene expression data analysis." Genome biology 19 (2018): 1-5.
  • Fang, Zhuoqing, Xinyuan Liu, and Gary Peltz. "GSEApy: a comprehensive package for performing gene set enrichment analysis in Python." Bioinformatics 39, no. 1 (2023): btac757.
  • McInnes, Leland, John Healy, and James Melville. "UMAP: Uniform manifold approximation and projection for dimension reduction." arXiv preprint arXiv:1802.03426 (2018).
  • Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 785-794. 2016.
  • Tolstikhin, Ilya O., Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung et al. "MLP-Mixer: An all-MLP architecture for vision." Advances in neural information processing systems 34 (2021): 24261-24272.

Feature Groupings

Hallmark Gene Sets

Background

There are 50 Hallmark gene sets, each composed of between 32 and 200 genes. The sizes of these gene sets sum to 7,322 genes. However, the collection contains only 4,384 unique genes, indicating overlap between the groups. To use Hallmark gene sets for ATAC data, features in each ATAC dataset were matched with regions that overlapped with, or were in proximity to, the gene bodies of genes in each gene set.
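Matching ATAC features to a gene set can be sketched as an interval test against gene bodies. The coordinates, gene names, and the 5,000 bp proximity window below are hypothetical. The source does not state how "proximity" was defined, so `window` is an assumption.

```python
def peaks_for_gene_set(peaks, gene_bodies, gene_set, window=5000):
    """Return peaks that overlap, or lie within `window` bp of, the gene
    body of any gene in `gene_set`. Intervals are (chrom, start, end)."""
    selected = []
    for chrom, start, end in peaks:
        for gene in gene_set:
            if gene not in gene_bodies:
                continue  # gene has no annotated body in this toy table
            g_chrom, g_start, g_end = gene_bodies[gene]
            same_chrom = chrom == g_chrom
            near = start <= g_end + window and end >= g_start - window
            if same_chrom and near:
                selected.append((chrom, start, end))
                break  # count each peak once per gene set
    return selected

# Hypothetical gene bodies and ATAC peaks.
gene_bodies = {
    "GENE_A": ("chr1", 10000, 20000),
    "GENE_B": ("chr2", 50000, 60000),
}
peaks = [
    ("chr1", 12000, 12500),  # overlaps GENE_A
    ("chr1", 22000, 22500),  # within 5,000 bp of GENE_A
    ("chr1", 40000, 40500),  # too far from GENE_A
    ("chr2", 12000, 12500),  # same coordinates, different chromosome
]

group_peaks = peaks_for_gene_set(peaks, gene_bodies, {"GENE_A"})
```

For genome-scale data, an interval tree or a sorted sweep would replace the nested loops; the logic of the match stays the same.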

https://www.gsea-msigdb.org

Hallmark RNA Groupings

Top Group Feature Overlap

Proportion of Unique Features in Grouping

Hallmark ATAC Groupings

Top Group Feature Overlap

Proportion of Unique Features in Grouping

Cistrome TFBMs

Background

The Cistrome database contains tissue-specific regions of transcription factor binding motifs (TFBMs). TFBMs were matched with assay features to create groupings of TFBMs by transcription factor for both the MCF7 and T47D cell lines.

http://cistrome.org

Top Group Feature Overlap

Proportion of Unique Features in Grouping

JASPAR TFBMs

Background

The JASPAR database contains regions of transcription factor binding motifs (TFBMs) that are not tissue-specific. Using motifmatchr, we matched ATAC peaks to known transcription factor binding motifs, so that each group contains the peaks associated with a single transcription factor's binding motifs.

Interestingly, there are no unique features in the top JASPAR groupings, which could be an indication of the importance of feature groupings.
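The proportion of unique features in a grouping can be computed by counting how many groupings each feature belongs to. The sketch below uses hypothetical transcription factor names and peak identifiers. It treats a feature as unique when it appears in exactly one grouping, which is an assumption about how the plots define uniqueness.

```python
from collections import Counter

def unique_feature_proportions(groupings):
    """For each grouping, return the fraction of its features that
    appear in no other grouping."""
    membership = Counter(
        feature for features in groupings.values() for feature in set(features)
    )
    proportions = {}
    for name, features in groupings.items():
        features = set(features)
        if not features:
            proportions[name] = 0.0
            continue
        n_unique = sum(1 for f in features if membership[f] == 1)
        proportions[name] = n_unique / len(features)
    return proportions

# Hypothetical motif groupings: every peak in TF_A, TF_B, and TF_C
# is shared with at least one other grouping.
groupings = {
    "TF_A": {"peak1", "peak2", "peak3"},
    "TF_B": {"peak2", "peak3", "peak4"},
    "TF_C": {"peak1", "peak4"},
    "TF_D": {"peak1", "peak5"},
}

proportions = unique_feature_proportions(groupings)
```

In this toy example, `TF_A`, `TF_B`, and `TF_C` have no unique features even though they are distinct groupings, so any difference in their weights comes from how the features are combined rather than from any single feature.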

https://jaspar.elixir.no/

Top Group Feature Overlap

Proportion of Unique Features in Grouping

scMKL Performance

Selections

scMKL vs. Other State-of-the-Art Models

XGBoost uses gradient-boosted decision trees to classify samples.
MLP uses a layered feedforward neural network to classify samples.
When scMKL is selected, the best-performing alpha is used to plot results.
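One way to pick the best-performing alpha is to average a test metric over splits and take the maximum. The metric, the averaging rule, and the scores below are assumptions for illustration, not values from the study.

```python
from collections import defaultdict
from statistics import mean

def best_alpha(results):
    """Return the alpha with the highest mean score across splits."""
    scores_by_alpha = defaultdict(list)
    for record in results:
        scores_by_alpha[record["alpha"]].append(record["score"])
    return max(scores_by_alpha, key=lambda a: mean(scores_by_alpha[a]))

# Hypothetical scores for two splits and three alpha values.
results = [
    {"split": 0, "alpha": 0.1, "score": 0.90},
    {"split": 1, "alpha": 0.1, "score": 0.88},
    {"split": 0, "alpha": 0.5, "score": 0.93},
    {"split": 1, "alpha": 0.5, "score": 0.91},
    {"split": 0, "alpha": 1.0, "score": 0.80},
    {"split": 1, "alpha": 1.0, "score": 0.85},
]

chosen = best_alpha(results)
```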

scMKL Interpretation via Weights and Selection

Selections

Top Group Normalized Weights

Heatmap of scMKL Group Selection Frequency

scMKL Feature Selections

NOTE: Feature selections are unavailable for motif groupings.

Single-cell Analysis with scanpy

Dataset and Modality

Subsets

NOTE: When group subset is None, most variable features are used.

Gene Set Enrichment Analysis

Using the differentially expressed genes calculated by scanpy, gene set enrichment was computed using GSEApy.
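The idea behind testing a list of differentially expressed genes against a gene set can be illustrated with a plain hypergeometric over-representation test. This is a self-contained stand-in, not a call to GSEApy or scanpy, and the source does not say which GSEApy method was used. The counts in the example are hypothetical.

```python
from math import comb

def hypergeom_pvalue(n_background, n_in_set, n_selected, n_overlap):
    """Probability of observing at least `n_overlap` gene-set members among
    `n_selected` genes drawn without replacement from the background."""
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 20 background genes, a gene set of 5 genes,
# 4 differentially expressed genes, 3 of which belong to the gene set.
p_value = hypergeom_pvalue(
    n_background=20, n_in_set=5, n_selected=4, n_overlap=3
)
```

A small p-value suggests that the gene set is over-represented among the differentially expressed genes. In practice, p-values across many gene sets would also need multiple-testing correction.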

GO Biological Process Gene Sets

Hallmark Gene Sets

Sam Kupp

  • Implementation
  • Analysis

Computational Biologist

CEDAR, Oregon Health & Science University

Ian VanGordon

  • Implementation
  • Analysis

Computational Biologist

CEDAR, Oregon Health & Science University

Cigdem Ak

  • Direction
  • Implementation

Postdoctoral Scholar

CEDAR, Oregon Health & Science University