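On December 23, 2019, the teams of Professor Xianting Ding (丁显廷) and Professor Guan Ning Lin (林关宁) at Shanghai Jiao Tong University jointly published an article online in Genome Biology titled "A Comparison Framework and Guideline of Clustering Methods for Mass Cytometry Data"; Dr. Xiao Liu (刘晓) and PhD candidate Weichen Song (宋炜宸) are the paper's first authors. The article presents an in-depth benchmark of cell clustering methods for CyTOF data along three dimensions: precision, coherence, and stability. Based on the characteristics and application scenarios of each method, as well as the features of the data, the work gives, for the first time, a concrete decision tree for method selection, offering researchers in single-cell mass cytometry practical guidance for data analysis.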
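In this article, the researchers benchmarked and compared in depth the established unsupervised (Accense, Xshift, PhenoGraph, FlowSOM, flowMeans, DEPECHE, and kmeans) and semi-supervised (ACDC and LDA) cell clustering methods on six single-cell omics datasets covering bone marrow cells, muscle tissue, and colon tissue.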
With the expanding applications of mass cytometry in medical research, a wide variety of clustering methods, both semi-supervised and unsupervised, have been developed for data analysis. Selecting the optimal clustering method can accelerate the identification of meaningful cell populations.
Results
To address this issue, we compared three classes of performance measures (precision as external evaluation, coherence as internal evaluation, and stability) for nine methods on six independent benchmark mass cytometry datasets. The methods comprise seven unsupervised methods (Accense, Xshift, PhenoGraph, FlowSOM, flowMeans, DEPECHE, and kmeans) and two semi-supervised methods (Automated Cell-type Discovery and Classification (ACDC) and linear discriminant analysis (LDA)). We compute and compare all defined performance measures against random subsampling, varying sample sizes, and the number of clusters for each method. LDA reproduces the manual labels most precisely but does not rank top in internal evaluation. PhenoGraph and FlowSOM perform better than other unsupervised tools in precision, coherence, and stability. PhenoGraph and Xshift are more robust when detecting refined sub-clusters, whereas DEPECHE and FlowSOM tend to group similar clusters into meta-clusters. The performances of PhenoGraph, Xshift, and flowMeans are impacted by increased sample size, but FlowSOM is relatively stable as sample size increases.
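To make the idea of external evaluation and subsampling-based stability concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the paper's pipeline: the Adjusted Rand Index (ARI) stands in as one common external measure (the abstract does not say which indices were used), the function names `adjusted_rand_index` and `stability_over_subsamples` are hypothetical, and `cluster_fn` is a placeholder for whatever clustering tool is being assessed. Internal (coherence) measures are not covered here.

```python
from collections import Counter
from math import comb
import random
import statistics


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")
    # Count pairs of cells that share a cluster in both, or in each, labeling.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (sum_both - expected) / (max_index - expected)


def stability_over_subsamples(cells, labels_true, cluster_fn,
                              fraction=0.5, repeats=5, seed=0):
    """Re-run cluster_fn on random subsamples; return (mean, std) of the ARI."""
    rng = random.Random(seed)
    n = len(cells)
    k = max(2, int(n * fraction))
    scores = []
    for _ in range(repeats):
        idx = rng.sample(range(n), k)
        predicted = cluster_fn([cells[i] for i in idx])
        reference = [labels_true[i] for i in idx]
        scores.append(adjusted_rand_index(reference, predicted))
    return statistics.mean(scores), statistics.pstdev(scores)


# Toy usage: one marker intensity per cell and a threshold "clustering" (both made up).
cells = [0.10, 0.20, 0.15, 0.05, 0.90, 1.00, 0.95, 0.85]
manual = ["A", "A", "A", "A", "B", "B", "B", "B"]
mean_ari, sd_ari = stability_over_subsamples(
    cells, manual, lambda xs: [int(x > 0.5) for x in xs]
)
```

A small spread of the score across subsamples suggests a stable method; the same loop could be repeated with different `fraction` values to probe sensitivity to sample size.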
Conclusion
Precision, coherence, stability, and clustering resolution should all be considered together when choosing an appropriate tool for cytometry data analysis. We therefore provide decision guidelines based on these characteristics to help the general reader choose the most suitable clustering tool.
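As a rough illustration of how such guidelines might be encoded, the sketch below turns only the findings stated in the abstract above into a small lookup. It is not the decision tree published in the paper: the function name `suggest_methods`, its three boolean inputs, and the order of the checks are assumptions made for this example.

```python
# Methods whose performance the abstract reports as impacted by larger sample sizes.
SIZE_SENSITIVE = {"PhenoGraph", "Xshift", "flowMeans"}


def suggest_methods(has_manual_labels, wants_fine_subclusters, large_sample):
    """Return (candidate methods, candidates to double-check at large sample size)."""
    if has_manual_labels:
        # LDA reproduced manual labels most precisely (semi-supervised setting).
        return ["LDA"], []
    if wants_fine_subclusters:
        # Reported as more robust when detecting refined sub-clusters.
        candidates = ["PhenoGraph", "Xshift"]
    else:
        # Reported as tending to group similar clusters into meta-clusters.
        candidates = ["FlowSOM", "DEPECHE"]
    caveats = [m for m in candidates if large_sample and m in SIZE_SENSITIVE]
    return candidates, caveats


print(suggest_methods(False, True, True))
```

The second element of the result only flags methods that the abstract describes as sensitive to sample size; it does not rule them out.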