scDetect-Introduction

Summary

scDetect is an ensemble-learning method for classifying cell types in single-cell RNA sequencing (scRNA-seq) data across different platforms. It combines a gene expression rank-based method with a majority-vote ensemble of probability-based machine-learning predictions.

To predict tumor cells in scRNA-seq data more accurately, we developed scDetect-Cancer, a classification framework that incorporates cell copy number information and epithelial origin information into the classification.

Application of scDetect

First, we load the rJava, scDetect, and Seurat packages.

library("rJava")
library("scDetect")
library("Seurat")

We will work with single-cell data from two human pancreas datasets. The “Muraro” dataset was generated on the CEL-Seq2 platform, and the “Xin” dataset was generated on the SMARTer platform.

Read the gene expression data and cell type labels.

# Xin human pancreas dataset #
xin<-counts(xin_test)
xin_lable<-xin_test$label

# Muraro human pancreas dataset #
muraro<-counts(muraro_test)
muraro_lable<-muraro_test$label

Prediction

To make scDetect easy to use, all steps are integrated into a single function, scDetect.

Here, we use the Muraro pancreas dataset as the training dataset to predict the cell types in the Xin pancreas dataset. The two datasets contain different numbers of cell types: 9 in Muraro and 4 in Xin, and all of the Xin cell types are also present in Muraro. We suggest that the training dataset contain more cell types than the dataset to be predicted, so that fewer cells in the prediction dataset are left unclassified or misclassified.
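When labels are available for both datasets, as they are in this tutorial, the coverage recommendation above can be checked before running the prediction. The following is a minimal sketch in base R. The label vectors are toy stand-ins for muraro_lable and xin_lable (an assumption for illustration, not the real data), and the variable names train_labels, test_labels and missing_types are hypothetical.

```r
# Toy label vectors standing in for muraro_lable (training) and xin_lable (test).
# These values are illustrative only and are not the real dataset contents.
train_labels <- c("alpha", "beta", "delta", "gamma", "acinar", "ductal")
test_labels  <- c("alpha", "beta", "beta", "delta", "gamma")

# Cell types that occur in the test labels but not in the training labels.
missing_types <- setdiff(unique(test_labels), unique(train_labels))

# Warn if the training set cannot represent some of the test cell types.
if (length(missing_types) > 0) {
  warning("Cell types missing from the training set: ",
          paste(missing_types, collapse = ", "))
}

missing_types
# character(0): every test cell type is covered by the training set
```

If missing_types is not empty, cells of those types cannot be assigned their true label, so they would be expected to end up as Unknown or misclassified.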

# Using Muraro dataset as the training dataset #
# Prediction #
prediction_results<-scDetect(vali_set_matrix = xin, train_set_matrix = muraro, train_set_lable = muraro_lable,p_value=0.2)

We obtain a table showing the prediction results and detailed information.

The prediction results of scDetect include four columns:

predict_lable: the cell type with the highest predict_score;

predict_score: the highest prediction score, i.e. the score of the cell type in predict_lable;

pvalue: p-value of the prediction score, based on permutation analysis;

final_predict_lable: the final predicted cell type, based on the prediction score and the p-value.

prediction_results[1:20,]
##           predict_lable     predict_score pvalue final_predict_lable
## Sample_1           beta             0.475  0.263             Unknown
## Sample_2           beta             0.525  0.181                beta
## Sample_3           beta 0.666666666666667  0.045                beta
## Sample_4           beta 0.508333333333333    0.2                beta
## Sample_5           beta 0.591666666666667  0.095                beta
## Sample_6           beta              0.55   0.16                beta
## Sample_7           beta             0.625  0.079                beta
## Sample_8           beta 0.416666666666667  0.458             Unknown
## Sample_9           beta             0.625  0.079                beta
## Sample_10          beta 0.516666666666667  0.181                beta
## Sample_11          beta 0.666666666666667  0.045                beta
## Sample_12          beta 0.516666666666667  0.181                beta
## Sample_13          beta 0.633333333333333  0.072                beta
## Sample_14          beta             0.675  0.021                beta
## Sample_15          beta 0.716666666666667  0.021                beta
## Sample_16          beta 0.591666666666667  0.095                beta
## Sample_17          beta             0.625  0.079                beta
## Sample_18          beta 0.591666666666667  0.095                beta
## Sample_19          beta 0.633333333333333  0.072                beta
## Sample_20          beta              0.45  0.346             Unknown
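In the rows above, every cell with a pvalue of at most 0.2 (the p_value passed to scDetect) keeps its predict_lable, and every cell with a larger pvalue becomes Unknown. The sketch below reproduces that pattern on four rows copied from the output. The rule itself is inferred from the printed table and is an assumption, not taken from the scDetect source; p_cutoff and derived_lable are hypothetical names.

```r
# Four rows copied from the prediction output above (Sample_1, 2, 4 and 8).
res <- data.frame(
  predict_lable = c("beta", "beta", "beta", "beta"),
  predict_score = c(0.475, 0.525, 0.508333333333333, 0.416666666666667),
  pvalue        = c(0.263, 0.181, 0.2, 0.458),
  stringsAsFactors = FALSE
)

# Assumed rule: keep the predicted label when pvalue <= cutoff, else "Unknown".
p_cutoff <- 0.2
res$derived_lable <- ifelse(res$pvalue <= p_cutoff, res$predict_lable, "Unknown")

res$derived_lable
# "Unknown" "beta" "beta" "Unknown", matching final_predict_lable for these rows
```

Under this reading, a smaller p_value makes the classifier more conservative: more cells are reported as Unknown rather than being assigned a cell type.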

Evaluation

Evaluate the prediction results.

evaluate_results<-evaluate(xin_lable,prediction_results$final_predict_lable)

Accuracy of the cell type prediction results.

evaluate_results$Acc
## 0.9758491

Confusion matrix of the cell type prediction results.

evaluate_results$Conf
##          pred_lab
##  true_lab acinar alpha beta delta gamma Unknown
##   alpha      1   846    0     0     0      39
##   beta       1     7  352    15     0      97
##   delta      0     2    0    33     0      14
##   gamma      0     6    0     0    62      17
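The accuracy and the confusion matrix above can be related with a short calculation. The sketch below re-enters the printed matrix by hand and shows that 0.9758491 equals the fraction of correct calls among cells that received a label other than Unknown. That this is how evaluate computes Acc is inferred from the numbers, not from the package source. The variable names are hypothetical.

```r
# Confusion matrix re-entered from the output above (rows: true, columns: predicted).
conf <- matrix(
  c(1, 846,   0,  0,  0, 39,
    1,   7, 352, 15,  0, 97,
    0,   2,   0, 33,  0, 14,
    0,   6,   0,  0, 62, 17),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    true_lab = c("alpha", "beta", "delta", "gamma"),
    pred_lab = c("acinar", "alpha", "beta", "delta", "gamma", "Unknown")
  )
)

# Correct calls: true label equals predicted label.
types   <- rownames(conf)
correct <- sum(diag(conf[types, types]))                    # 1293

# Accuracy among cells that were assigned a cell type (Unknown excluded).
classified     <- sum(conf[, colnames(conf) != "Unknown"])  # 1325
acc_classified <- correct / classified                      # 0.9758491

# Accuracy over all cells, counting Unknown as a miss, and the Unknown rate.
acc_all      <- correct / sum(conf)                         # about 0.867
unknown_rate <- sum(conf[, "Unknown"]) / sum(conf)          # about 0.112
```

Reporting acc_classified together with unknown_rate gives a fuller picture, because a high accuracy can coexist with a sizeable share of unclassified cells.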

Application of scDetect-Cancer

For single-cell RNA-seq data from tumor samples, we first load the scDetect and Seurat packages.

library("scDetect")
library("Seurat")

We will work with single-cell data from a test melanoma dataset.

Read the gene expression data and cell type labels.

# Melanoma reference dataset #
mela_ref<-counts(melanoma_ref)
mela_ref_lable<-melanoma_ref$label

# Melanoma test dataset #
mela_test<-counts(melanoma_test)

Prediction

To make scDetect-Cancer easy to use, all steps are integrated into a single function, scDetect_Cancer.

Here, we use a melanoma reference dataset (without tumor cells) as the training dataset to predict the cell types in a melanoma test dataset.

The gene position file used for single-cell copy number variation analysis and the gene list file used for epithelial score analysis can be obtained here.

Create a temporary directory for the output.

output_dir<-tempdir()
# gene_position_file and gene_list must be defined beforehand from the two files described above #
# Prediction #
scDetect_Cancer_results<-scDetect_Cancer(vali_set_matrix = mela_test, train_set_matrix = mela_ref, train_set_lable = mela_ref_lable, gene_position_file, gene_list, output_dir)

We obtain a list that includes the prediction results and detailed information.

The prediction results:

scDetect_Cancer_results$lable[1:20]
##   "Tumor"   "Bcell"   "Bcell"   "Unknown" "Tcell"   "Unknown" "Tumor"   "Unknown" "Tumor"   "Bcell"   "Tcell"   "Tcell"   "Tcell"   "Tumor"   "Unknown" "Tcell" "Tumor"   "Bcell"   "Tumor"   "Tcell"

The detailed information:

scDetect_Cancer_results$detail_info[1:20,]
##                                        CNV_Class CNV_entropy_score anno_file Epithelial_score Epithelial_pvalue Epithelial_class   raw_lable final_lable
##cy81.Bulk.CD45.neg.B04.S112.comb             Tumor          8.072779     Other        0.2969218      2.881267e-51            Tumor  Fibroblast       Tumor
##cy94_cd45pos_4_C09_S33_comb                  Other          8.078587     Other        0.1443036      1.000000e+00            Other       Bcell       Bcell
##cy72.CD45.pos.D04.S904.comb                  Other          8.077199     Other        0.1684391      9.999999e-01            Other       Bcell       Bcell
##CY88CD45POS_2_D09_S429_comb                  Other          8.079058     Other        0.2982854      8.454728e-52            Tumor       Tcell     Unknown
##CY75_1_CD45_CD8_1__S29_comb                  Other          8.078603     Tcell        0.1341299      1.000000e+00            Other       Tcell       Tcell
##Cy80_II_CD45_F08_S932_comb                   Other          8.077468     Other        0.2424822      3.359626e-25            Tumor Endothelial     Unknown
##cy78.CD45.neg.2.C06.S606.comb                Tumor          8.073131     Other        0.3008013      9.071703e-53            Tumor Endothelial       Tumor
##CY94_CD45NEG_CD90POS_2_C02_S26_comb          Other          8.079323     Tcell        0.2200801      3.368614e-12            Tumor       Tcell     Unknown
##cy78.CD45.neg.2.A05.S581.comb                Tumor          8.069823     Other        0.3436408      4.530180e-67            Tumor  Fibroblast       Tumor
##cy79.p3.CD45.pos.PD1.neg.G01.S169.comb       Other          8.078866     Other        0.1801278      9.928248e-01            Other       Bcell       Bcell
##CY75_1_CD45_CD8_3__S106_comb                 Other          8.079064     Other        0.1324336      1.000000e+00            Other       Tcell       Tcell
##cy53.1.CD45.pos.2.E04.S1012.comb             Other          8.078719     Tcell        0.1793396      9.958538e-01            Other       Tcell       Tcell
##CY89A_CD45_POS_10_A04_S196_comb              Other          8.078433     Other        0.2101634      5.537329e-07            Other       Tcell       Tcell
##cy78.CD45.neg.3.A10.S682.comb                Tumor          8.074718     Other        0.2905377      1.049571e-48            Tumor  Fibroblast       Tumor
##cy80.Cd45.pos.PD1.pos.B01.S37.comb           Other          8.079013     Other        0.2456721      5.311377e-27            Tumor       Tcell     Unknown
##CY84_PRIM_POS_All_7_B11_S215_comb            Other          8.078439     Other        0.1528024      1.000000e+00            Other       Tcell       Tcell
##cy79.p4.CD45.neg.PDL1.neg.B09.S1077.comb     Tumor          8.074930     Other        0.2632536      2.483141e-36            Tumor       Bcell       Tumor
##Cy72_CD45_C03_S699_comb                      Other          8.078826     Other        0.1689189      9.999998e-01            Other       Bcell       Bcell
##cy79.p4.CD45.neg.PDL1.pos.C07.S415.comb      Tumor          8.072491     Other        0.2878176      1.405910e-47            Tumor  Fibroblast       Tumor
##CY75_1_CD45_CD8_8__S293_comb                 Other          8.079135     Other        0.1570361      1.000000e+00            Other       Tcell       Tcell
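The 20 rows above follow a simple pattern in how CNV_Class, Epithelial_class and raw_lable relate to final_lable. When both classes are Tumor the cell is called Tumor, when both are Other the raw label is kept, and when only Epithelial_class is Tumor the cell is called Unknown. The sketch below encodes that pattern. It is inferred from the printed rows only and may differ from the actual logic inside scDetect_Cancer. The function name combine_labels is hypothetical, and the combination CNV_Class = Tumor with Epithelial_class = Other does not occur in the rows shown, so it is returned as NA rather than guessed.

```r
# Hypothetical helper reproducing the pattern seen in the detail_info rows above.
combine_labels <- function(cnv_class, epithelial_class, raw_lable) {
  both_tumor <- cnv_class == "Tumor" & epithelial_class == "Tumor"
  both_other <- cnv_class == "Other" & epithelial_class == "Other"
  epi_only   <- cnv_class == "Other" & epithelial_class == "Tumor"

  # Start with NA so that combinations not covered by the rows stay undecided.
  out <- rep(NA_character_, length(raw_lable))
  out[both_tumor] <- "Tumor"                 # both lines of evidence agree on tumor
  out[both_other] <- raw_lable[both_other]   # no tumor evidence: keep the raw label
  out[epi_only]   <- "Unknown"               # conflicting evidence
  out
}

# First six rows of the table above.
cnv <- c("Tumor", "Other", "Other", "Other", "Other", "Other")
epi <- c("Tumor", "Other", "Other", "Tumor", "Other", "Tumor")
raw <- c("Fibroblast", "Bcell", "Bcell", "Tcell", "Tcell", "Endothelial")

derived <- combine_labels(cnv, epi, raw)
derived
# "Tumor" "Bcell" "Bcell" "Unknown" "Tcell" "Unknown", matching final_lable
```

Note that the raw label is overridden whenever both tumor indicators fire, which is why cells with raw labels such as Fibroblast or Endothelial appear as Tumor in the final column.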

SessionInfo

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7600)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.936 
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.3  magrittr_1.5    tools_3.6.3     htmltools_0.4.0
#>  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.6   rmarkdown_2.1  
#>  [9] knitr_1.28      stringr_1.4.0   xfun_0.12       digest_0.6.25  
#> [13] rlang_0.4.5     evaluate_0.14