scMKL (single-cell Multiple Kernel Learning) is a binary classifier. It uses Random Fourier Features (RFFs) to construct multiple approximate kernels, which are passed to Group Lasso for classification.
Single-cell features are organized into groupings, such as gene sets for transcriptomics data. The data is then transformed with RFFs to create approximate kernels, which serve as the input to Group Lasso. Approximating the kernels in this way is what allows scMKL to scale to the volume of single-cell data.
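As a rough illustration of the transformation described above, the sketch below builds random Fourier features for a Gaussian kernel, one feature block per grouping. It is not the scMKL implementation: the function name `rff_features`, the kernel choice, the bandwidth `sigma`, the number of random features, and the toy groupings are all assumptions made for the example.

```python
import numpy as np

def rff_features(X, n_features=100, sigma=1.0, seed=0):
    """Random Fourier features approximating a Gaussian kernel (illustrative)."""
    rng = np.random.default_rng(seed)
    n_input = X.shape[1]
    # Frequencies are drawn from the Fourier transform of the Gaussian kernel
    W = rng.normal(scale=1.0 / sigma, size=(n_input, n_features))
    projection = X @ W
    # With cos/sin pairs, Z @ Z.T approximates exp(-||x - y||^2 / (2 * sigma^2))
    Z = np.hstack([np.cos(projection), np.sin(projection)]) / np.sqrt(n_features)
    return Z

# Toy data: 6 cells by 8 features, split into two hypothetical groupings
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))
groups = {"group_a": [0, 1, 2, 3], "group_b": [4, 5, 6, 7]}

# One approximate kernel representation per grouping
Z_by_group = {
    name: rff_features(X[:, idx], n_features=50, sigma=2.0)
    for name, idx in groups.items()
}

# Blocks are concatenated so a group-structured model can weight each one
Z_all = np.hstack(list(Z_by_group.values()))
print(Z_all.shape)
```

Each block has a fixed width regardless of the number of cells, so the cost grows linearly with the number of cells instead of quadratically as it would for an exact kernel matrix.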
Group Lasso assigns a weight to each grouping based on how predictive that grouping is for distinguishing between two cell classes. The regularization coefficient passed to Group Lasso lets the user control how many groupings receive nonzero weights in the final model, and it can be tuned for optimal accuracy. This group-level selection makes the results of scMKL interpretable.
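To make the effect of the regularization coefficient concrete, here is a minimal sketch of group-wise soft-thresholding, the proximal operator of an unweighted Group Lasso penalty. It shows only how a larger coefficient zeroes out more groupings; it is not the solver scMKL uses, and the weights, groups, and coefficient values are made up for the example.

```python
import numpy as np

def group_soft_threshold(weights, groups, lam):
    """Proximal operator of the penalty lam * sum_g ||w_g||_2 (unweighted)."""
    out = np.zeros_like(weights, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(weights[idx])
        # A group survives only if its norm exceeds the threshold
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * weights[idx]
    return out

def n_selected(weights, groups):
    """Count groupings with at least one nonzero weight."""
    return sum(bool(np.any(weights[idx] != 0)) for idx in groups)

# Toy weights for three groupings with norms 5.0, 0.5, and 1.0
weights = np.array([3.0, 4.0, 0.3, 0.4, 0.6, 0.8])
groups = [[0, 1], [2, 3], [4, 5]]

for lam in (0.1, 0.75, 2.0):
    shrunk = group_soft_threshold(weights, groups, lam)
    print(lam, n_selected(shrunk, groups))
```

In this toy case the three coefficients keep three, two, and one groupings respectively: a whole grouping is either kept (and shrunk) or removed, which is what allows the surviving groupings to be read as the ones the model relies on.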
This framework provides a straightforward way to integrate different data types, such as RNA and ATAC data, into a single model.
Seven single-cell datasets were used to evaluate the performance of scMKL and to compare it with other single-cell analysis methods, as shown below. scMKL was used to predict cell labels for each dataset.
To obtain robust results, we used 100 different train/test splits. For each split, we used 10 different sparsity arguments, each yielding a different number of selected groups. This layout yields 1,000 models in total for each grouping/modality combination.
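The evaluation layout above can be sketched as a simple grid over split seeds and sparsity arguments. Only the counts (100 splits, 10 sparsity arguments) come from the text; the dataset size, the train fraction, the sparsity values, and the helper `split_indices` are placeholders chosen for illustration.

```python
import itertools
import numpy as np

n_cells = 200          # hypothetical dataset size
n_splits = 100         # number of train/test splits, as stated in the text
train_fraction = 0.8   # assumed; the split ratio is not given in the text
sparsity_args = np.linspace(0.1, 1.9, 10)  # placeholder values, 10 in total

def split_indices(n, fraction, seed):
    """Return shuffled train and test indices for one seed (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    cut = int(n * fraction)
    return order[:cut], order[cut:]

runs = []
for seed, alpha in itertools.product(range(n_splits), sparsity_args):
    train_idx, test_idx = split_indices(n_cells, train_fraction, seed)
    # A model would be trained on train_idx and scored on test_idx here
    runs.append((seed, float(alpha), len(train_idx), len(test_idx)))

print(len(runs))  # 100 splits x 10 sparsity arguments = 1000
```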
The Hallmark collection contains 50 gene sets, each composed of between 32 and 200 genes. The sizes of these gene sets sum to 7,322 genes, but the collection contains only 4,384 unique genes, indicating overlap between the groups. To use Hallmark gene sets for ATAC data, features in each ATAC dataset were assigned to a gene set if their regions overlapped with, or were in proximity to, the gene body of a gene in that set.
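The two ideas in this paragraph, overlap between gene sets and matching ATAC features to gene bodies, can be illustrated with toy data. Everything below is hypothetical: the gene names, coordinates, the 2,000 bp proximity window, and the helpers `near` and `peaks_for_gene_set` are assumptions for the example, not the values or code used for scMKL.

```python
# Toy gene sets (hypothetical names); some genes belong to more than one set
gene_sets = {
    "SET_A": {"GENE1", "GENE2", "GENE3"},
    "SET_B": {"GENE2", "GENE3", "GENE4"},
    "SET_C": {"GENE4", "GENE5"},
}

# Summed set sizes exceed the number of unique genes when sets overlap
total_memberships = sum(len(genes) for genes in gene_sets.values())
unique_genes = set().union(*gene_sets.values())
print(total_memberships, len(unique_genes))

# Hypothetical gene bodies and ATAC peaks as (chromosome, start, end)
gene_bodies = {
    "GENE1": ("chr1", 1000, 5000),
    "GENE2": ("chr1", 20000, 26000),
    "GENE3": ("chr2", 3000, 9000),
    "GENE4": ("chr2", 50000, 52000),
    "GENE5": ("chr3", 100, 900),
}
peaks = [
    ("chr1", 4800, 5300),    # overlaps GENE1
    ("chr1", 18500, 19000),  # ends 1,000 bp before GENE2 starts
    ("chr2", 30000, 30500),  # far from every gene body
    ("chr3", 950, 1200),     # starts 50 bp after GENE5 ends
]

def near(peak, gene, window):
    """True if the peak overlaps the gene body extended by `window` bp."""
    p_chrom, p_start, p_end = peak
    g_chrom, g_start, g_end = gene
    return (p_chrom == g_chrom
            and p_start <= g_end + window
            and p_end >= g_start - window)

def peaks_for_gene_set(genes, window=2000):
    """Peaks that overlap or lie near any gene body in the gene set."""
    return [p for p in peaks
            if any(near(p, gene_bodies[g], window) for g in genes)]

atac_groupings = {name: peaks_for_gene_set(genes)
                  for name, genes in gene_sets.items()}
```

With a window of 0 the function reduces to strict overlap, so the window size controls how generously "in proximity" is interpreted.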
The Cistrome database contains tissue-specific regions of transcription factor binding motifs (TFBMs). TFBMs were matched with assay features to create groupings of TFBMs by transcription factor for both the MCF7 and T47D cell lines.
The JASPAR database contains regions of transcription factor binding motifs (TFBMs) that are not tissue-specific. Using motifmatchr, we matched ATAC peaks to known transcription factor binding motifs, so that each group contains the peaks associated with a single transcription factor's binding motifs.
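Once peaks have been matched to motifs, turning the matches into groupings is a matter of collecting peaks per transcription factor. The sketch below assumes the matches are available as a boolean peak-by-factor matrix; the peak names, factor names, and matrix values are invented, and the motif matching step itself (done with motifmatchr in R) is not reproduced here.

```python
import numpy as np

# Hypothetical peaks and transcription factors
peaks = np.array(["chr1:100-600", "chr1:900-1400", "chr2:50-550", "chr2:700-1200"])
tf_names = ["TF_A", "TF_B", "TF_C"]

# matches[i, j] is True when peak i contains a motif for factor j (toy values)
matches = np.array([
    [True,  True,  False],
    [True,  False, False],
    [False, True,  True],
    [False, False, False],
])

# One grouping per transcription factor, holding its matched peaks
groupings = {tf: peaks[matches[:, j]].tolist() for j, tf in enumerate(tf_names)}

# A peak can match motifs for several factors, so groupings may share peaks
n_groups_per_peak = matches.sum(axis=1)
shared_peaks = peaks[n_groups_per_peak > 1].tolist()
ungrouped_peaks = peaks[n_groups_per_peak == 0].tolist()

for tf, members in groupings.items():
    print(tf, members)
```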
Interestingly, there are no unique features in the top JASPAR groupings, which could indicate the importance of feature groupings.
NOTE: Motif grouping feature selections are unavailable
NOTE: When group subset is None, most variable features are used.
GitHub: https://github.com/ohsu-cedar-comp-hub/scMKL
PyPI: https://pypi.org/project/scmkl/
API: https://ohsu-cedar-comp-hub.github.io/scMKL/
Conda: https://anaconda.org/ivango17/scmkl
Publication: Coming Soon
Computational Biologist
CEDAR, Oregon Health & Science University
Computational Biologist
CEDAR, Oregon Health & Science University
Postdoctoral Scholar
CEDAR, Oregon Health & Science University