scanpy教程:预处理与聚类
男,
一个长大了才会遇到的帅哥,
稳健,潇洒,大方,靠谱。
一段生信缘,一棵技能树,
一枚大型测序工厂的螺丝钉,
一个随机森林中提灯觅食的津门旅客。
scanpy 是一个用于分析单细胞转录组(single cell rna sequencing)数据的python库,文章2018发表在Genome Biology(https://genomebiology.biomedcentral.com/)。其实它的许多分析思路借鉴了以seurat为中心的R语言单细胞转录数据分析生态的,scanpy以一己之力在python生态构建了单细胞转录组数据分析框架。我相信借助python的工业应用实力,其扩展性大于R语言分析工具。当然,选择走一遍scanpy的原因,不是因为它的强大,只是因为喜欢。
在我搬这块砖之前,中文世界关于scanpy的介绍已经很多了,这当然是好事。不知道谁会以怎样的方式遇见谁,所以,还是让我们开始吧。
所做的第一步就是配置好python环境,我建议是用conda来构建,这样软件管理起来很方便。然后是安装scanpy这个库,当然可能会遇到一些问题,但是花点时间总是可以Google掉的。在Windows、mac、linux平台scanpy都是可以运行的。
在学习新的库时,文档是不可不看的。有统计表明,程序员读代码的时间一般三倍于写代码的时间。所以这基本上是一次阅读体验。我们假装可爱的读者朋友,已经配置好scanpy的环境,也许花了两三天的时间。会不会python没关系,英语不好没关系,安装个某道词典就可以了。
三天后。。。
我们打开scanpy的教程官网:https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html ,对,就是这么直白。开始在我们的编辑器里复制粘贴教程代码,并尝试运行,并理解结果。
首先我们导入需要的库,读入10X标准的数据filtered_feature_bc_matrix,当然也可以是h5格式的文件,或者一个二维的表达谱。
读入数据
# -*- coding: utf-8 -*-
"""
Spyder Editor
https://scanpy.readthedocs.io/en/stable/
This is a temporary script file.
"""
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)
sc.logging.print_versions()scanpy==1.4.5.1
anndata==0.7.1
umap==0.3.10
numpy==1.16.5
scipy==1.3.1
pandas==0.25.1
scikit-learn==0.21.3
statsmodels==0.10.1
python-igraph==0.8.0
不用担心这些库,我们都会慢慢熟悉的,这里python-igraph要注意一下,不是igraph。这些库在《利用python进行数据分析》这本书里面都有介绍,特别是pandas和numpy,以后的日子里我们会看到它们几乎是构成scanpy数据结构的核心。
results_file = 'E:\learnscanpy\write\pbmc3k.h5ad'
help(sc.read_10x_mtx)
adata = sc.read_10x_mtx(
'E:/learnscanpy/data/filtered_feature_bc_matrix', # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True)
adata.var_names_make_unique() # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
这样就把数据读进来了,他的名字叫做adata,那么这是个什么东西呢?我认为搞清楚这个基本上学会一半scanpy了。
adata
Out[21]:
AnnData object with n_obs × n_vars = 5025 × 33694
var: 'gene_ids', 'feature_types'
adata.var["gene_ids"]
Out[22]:
RP11-34P13.3 ENSG00000243485
FAM138A ENSG00000237613
OR4F5 ENSG00000186092
RP11-34P13.7 ENSG00000238009
RP11-34P13.8 ENSG00000239945
AC233755.2 ENSG00000277856
AC233755.1 ENSG00000275063
AC240274.1 ENSG00000271254
AC213203.1 ENSG00000277475
FAM231B ENSG00000268674
Name: gene_ids, Length: 33694, dtype: object
adata.var["feature_types"]
Out[23]:
RP11-34P13.3 Gene Expression
FAM138A Gene Expression
OR4F5 Gene Expression
RP11-34P13.7 Gene Expression
RP11-34P13.8 Gene Expression
AC233755.2 Gene Expression
AC233755.1 Gene Expression
AC240274.1 Gene Expression
AC213203.1 Gene Expression
FAM231B Gene Expression
Name: feature_types, Length: 33694, dtype: object
adata.to_df()
Out[24]:
RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
AAACCCAAGCGTATGG-1 0.0 0.0 ... 0.0 0.0
AAACCCAGTCCTACAA-1 0.0 0.0 ... 0.0 0.0
AAACCCATCACCTCAC-1 0.0 0.0 ... 0.0 0.0
AAACGCTAGGGCATGT-1 0.0 0.0 ... 0.0 0.0
AAACGCTGTAGGTACG-1 0.0 0.0 ... 0.0 0.0
... ... ... ... ...
TTTGTTGCAGGTACGA-1 0.0 0.0 ... 0.0 0.0
TTTGTTGCAGTCTCTC-1 0.0 0.0 ... 0.0 0.0
TTTGTTGGTAATTAGG-1 0.0 0.0 ... 0.0 0.0
TTTGTTGTCCTTGGAA-1 0.0 0.0 ... 0.0 0.0
TTTGTTGTCGCACGAC-1 0.0 0.0 ... 0.0 0.0
[5025 rows x 33694 columns]
第一句AnnData object with n_obs × n_vars = 5025 × 33694 ,告诉我们adata是一个AnnData 对象,就像seurat也是一个对象一样。什么叫对象呢?对象就是一个实体、物体,它是一种存在而不是一种动作。当然,我们可以对它做一些操作,一个对象可以通过具体的属性为人们感知。
具体地,anndata(https://anndata.readthedocs.io/en/stable/)是这样一种构象:
单细胞转录组的核心就是一个cell X gene的二维表,但是分群后要存储cell的分群结果,特征选择是要记录gene的信息,降维后要存储降维后的结果。所以,这张表.X的对象cell相关的信息记录在.obs中,属性gene的信息记录在.var中,其他的信息在.uns中。那么每一部分是什么呢?
type(adata.var["gene_ids"])
Out[205]: pandas.core.series.Series
哦,原来是pandas的Series,下面是这个数据结构的详细介绍,这个数据结构能存储的信息一点也不亚于seurat啊。
Type: AnnData
String form:
AnnData object with n_obs × n_vars = 5025 × 33694
var: 'gene_ids', 'feature_types'
Length: 5025
File: d:\program files (x86)\anconda\lib\site-packages\anndata\_core\anndata.py
Docstring:
An annotated data matrix.
:class:`~anndata.AnnData` stores a data matrix :attr:`X` together with annotations
of observations :attr:`obs` (:attr:`obsm`, :attr:`obsp`),
variables :attr:`var` (:attr:`varm`, :attr:`varp`),
and unstructured annotations :attr:`uns`.
.. figure:: https://falexwolf.de/img/scanpy/anndata.svg
:width: 350px
An :class:`~anndata.AnnData` object `adata` can be sliced like a
:class:`~pandas.DataFrame`,
for instance `adata_subset = adata[:, list_of_variable_names]`.
:class:`~anndata.AnnData`’s basic structure is similar to R’s ExpressionSet
[Huber15]_. If setting an `.h5ad`-formatted HDF5 backing file `.filename`,
data remains on the disk but is automatically loaded into memory if needed.
See this `blog post`_ for more details.
.. _blog post: http://falexwolf.de/blog/171223_AnnData_indexing_views_HDF5-backing/
Parameters
----------
X
A #observations × #variables data matrix. A view of the data is used if the
data type matches, otherwise, a copy is made.
obs
Key-indexed one-dimensional observations annotation of length #observations.
var
Key-indexed one-dimensional variables annotation of length #variables.
uns
Key-indexed unstructured annotation.
obsm
Key-indexed multi-dimensional observations annotation of length #observations.
If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
varm
Key-indexed multi-dimensional variables annotation of length #variables.
If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
layers
Key-indexed multi-dimensional arrays aligned to dimensions of `X`.
dtype
Data type used for storage.
shape
Shape tuple (#observations, #variables). Can only be provided if `X` is `None`.
filename
Name of backing file. See :class:`File`.
filemode
Open mode of backing file. See :class:`File`.
See Also
--------
read_h5ad
read_csv
read_excel
read_hdf
read_loom
read_zarr
read_mtx
read_text
read_umi_tools
Notes
-----
:class:`~anndata.AnnData` stores observations (samples) of variables/features
in the rows of a matrix.
This is the convention of the modern classics of statistics [Hastie09]_
and machine learning [Murphy12]_,
the convention of dataframes both in R and Python and the established statistics
and machine learning packages in Python (statsmodels_, scikit-learn_).
Single dimensional annotations of the observation and variables are stored
in the :attr:`obs` and :attr:`var` attributes as :class:`~pandas.DataFrame`\ s.
This is intended for metrics calculated over their axes.
Multi-dimensional annotations are stored in :attr:`obsm` and :attr:`varm`,
which are aligned to the objects observation and variable dimensions respectively.
Square matrices representing graphs are stored in :attr:`obsp` and :attr:`varp`,
with both of their own dimensions aligned to their associated axis.
Additional measurements across both observations and variables are stored in
:attr:`layers`.
Indexing into an AnnData object can be performed by relative position
with numeric indices (like pandas’ :attr:`~pandas.DataFrame.iloc`),
or by labels (like :attr:`~pandas.DataFrame.loc`).
To avoid ambiguity with numeric indexing into observations or variables,
indexes of the AnnData object are converted to strings by the constructor.
Subsetting an AnnData object by indexing into it will also subset its elements
according to the dimensions they were aligned to.
This means an operation like `adata[list_of_obs, :]` will also subset :attr:`obs`,
:attr:`obsm`, and :attr:`layers`.
Subsetting an AnnData object returns a view into the original object,
meaning very little additional memory is used upon subsetting.
This is achieved lazily, meaning that the constituent arrays are subset on access.
Copying a view causes an equivalent “real” AnnData object to be generated.
Attempting to modify a view (at any attribute except X) is handled
in a copy-on-modify manner, meaning the object is initialized in place.
Here’s an example::
batch1 = adata[adata.obs["batch"] == "batch1", :]
batch1.obs["value"] = 0 # This makes batch1 a “real” AnnData object
At the end of this snippet: `adata` was not modified,
and `batch1` is its own AnnData object with its own data.
Similar to Bioconductor’s `ExpressionSet` and :mod:`scipy.sparse` matrices,
subsetting an AnnData object retains the dimensionality of its constituent arrays.
Therefore, unlike with the classes exposed by :mod:`pandas`, :mod:`numpy`,
and `xarray`, there is no concept of a one dimensional AnnData object.
AnnDatas always have two inherent dimensions, :attr:`obs` and :attr:`var`.
Additionally, maintaining the dimensionality of the AnnData object allows for
consistent handling of :mod:`scipy.sparse` matrices and :mod:`numpy` arrays.
.. _statsmodels: http://www.statsmodels.org/stable/index.html
.. _scikit-learn: http://scikit-learn.org/
绘制高表达的基因:
sc.pl.highest_expr_genes(adata, n_top=20)
本着本质的好奇心,我们不禁要问一句:这张图是用什么画的?一定要养成阅读文档的习惯:
help(sc.pl.highest_expr_genes)
Help on function highest_expr_genes in module scanpy.plotting._qc:
highest_expr_genes(adata: anndata._core.anndata.AnnData, n_top: int = 30, show: Union[bool, NoneType] = None, save: Union[str, bool, NoneType] = None, ax: Union[matplotlib.axes._axes.Axes, NoneType] = None, gene_symbols: Union[str, NoneType] = None, log: bool = False, **kwds)
Fraction of counts assigned to each gene over all cells.
Computes, for each gene, the fraction of counts assigned to that gene within
a cell. The `n_top` genes with the highest mean fraction over all cells are
plotted as boxplots.
This plot is similar to the `scater` package function `plotHighestExprs(type
= "highest-expression")`, see `here
<https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-qc.html>`__. Quoting
from there:
*We expect to see the “usual suspects”, i.e., mitochondrial genes, actin,
ribosomal protein, MALAT1. A few spike-in transcripts may also be
present here, though if all of the spike-ins are in the top 50, it
suggests that too much spike-in RNA was added. A large number of
pseudo-genes or predicted genes may indicate problems with alignment.*
-- Davis McCarthy and Aaron Lun
Parameters
----------
adata
Annotated data matrix.
n_top
Number of top
show
Show the plot, do not return axis.
save
If `True` or a `str`, save the figure.
A string is appended to the default filename.
Infer the filetype if ending on {`'.pdf'`, `'.png'`, `'.svg'`}.
ax
A matplotlib axes object. Only works if plotting a single component.
gene_symbols
Key for field in .var that stores gene symbols if you do not want to use .var_names.
log
Plot x-axis in log scale
**kwds
Are passed to :func:`~seaborn.boxplot`.
Returns
-------
If `show==False` a :class:`~matplotlib.axes.Axes`.
可以看出这个功能借鉴了知名R包scater的一个函数:plotHighestExprs,绘图用的是python里面两个响当当的seaborn和matplotlib库。那我们不禁要自己微调一下了。
current_palette = sns.color_palette("dark")
sns.palplot(current_palette)
sc.pl.highest_expr_genes(adata, n_top=20,palette="dark" )
normalizing counts per cell
finished (0:00:00)
借助seaborn的绘图系统图我们还可以这样设置颜色:
with sns.color_palette("PuBuGn_d"):
sc.pl.highest_expr_genes(adata, n_top=20, )
help(sns.set_style)
sns.set_style("white")
sc.pl.highest_expr_genes(adata, n_top=20, )
cell QC
提到cell QC ,我们需要明白在10的体系里面什么是cell。有的同学说了,我上机6000个细胞,为什么cellranger检出的细胞不是6000,有时候差的还不小。因为在cellranger的逻辑里,cell不过是每个umi对barcode的支持程度。它说的cell是基于序列比对的,尽管这个理应和上机的cell保持一致,但是这种一致性是通过序列比对来保持的。其实我们说的cell QC指的是对barcode的质控,使留下的barcode能表征下游分析的cell。
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=0)
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1
参数的意思就是字面的意思,有不明白的可以直接help相应的函数,看它的说明文档。我基本上就是靠说明文档来学习的。下面让我们看看过滤后的adata发生了哪些变化。
adata
Out[9]:
AnnData object with n_obs × n_vars = 4960 × 33694
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'feature_types', 'n_cells'
adata.obs['percent_mito']
Out[11]:
AAACCCAAGCGTATGG-1 0.106588
AAACCCAGTCCTACAA-1 0.056101
AAACCCATCACCTCAC-1 0.532847
AAACGCTAGGGCATGT-1 0.106913
AAACGCTGTAGGTACG-1 0.078654
TTTGTTGCAGGTACGA-1 0.071226
TTTGTTGCAGTCTCTC-1 0.055394
TTTGTTGGTAATTAGG-1 0.183441
TTTGTTGTCCTTGGAA-1 0.081037
TTTGTTGTCGCACGAC-1 0.070987
Name: percent_mito, Length: 4960, dtype: float32
adata.var['n_cells']
Out[12]:
RP11-34P13.3 0
FAM138A 0
OR4F5 0
RP11-34P13.7 43
RP11-34P13.8 0
..
AC233755.2 1
AC233755.1 1
AC240274.1 83
AC213203.1 0
FAM231B 0
Name: n_cells, Length: 33694, dtype: int32
细胞属性obs中有了percent_mito;有了基因属性的var。
像seurat一样,我们看看cell三个属性的小提琴图:
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')
显然这是分别画的两张图,我们能不能把它们组合到一起呢?显然是可以的:
help(sns.FacetGrid)
adata.obs['percent_mito1'] = adata.obs['percent_mito']*10000
adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue')
import matplotlib.pyplot as plt
sns.set(style="ticks")
g = sns.FacetGrid(adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue'),col = 'myVariable')
g.map(plt.scatter,'n_counts',"myValue", alpha=.7)
g.add_legend();
有不少人问这两张图怎么看。就拿n_counts与n_genes的散点图(scatter)来说吧:每个点代表一个细胞,斜率代表随着count的增加gene的增加程度。count和gene一般呈现线性关系,斜率越大也就是较少的count就可以检出较多的gene,说明这批细胞基因的丰度偏低(普遍贫穷);反之,斜率较小,说明这批细胞基因丰度较高(少数富有)。有的时候不是一条可拟合的线,或者是两条可拟合的线,也反映出细胞的异质性。总之,他就是一个散点图,描述的是两个变量的关系。
根据统计的n_genes 和线粒体基因含量percent_mito 进一步过滤:
adata = adata[adata.obs.n_genes < 2500, :]
adata = adata[adata.obs.percent_mito < 0.1, :]
过滤掉基因数大于2500,线粒体基因含量大于0.1的barcode。这个过滤方法就像filter一样,操作的是adata.obs这张表。
adata.obs
Out[82]:
n_genes percent_mito n_counts percent_mito1
AAACGCTGTGTGATGG-1 2348 0.099821 6131.0 998.205811
AAACGCTTCTAGGCAT-1 2470 0.065564 8465.0 655.640869
AAAGAACCACCGTCTT-1 420 0.021944 638.0 219.435730
AAAGAACTCACCTGGG-1 2479 0.074067 10801.0 740.672119
AAAGAACTCGGAACTT-1 2318 0.079713 6837.0 797.133240
... ... ... ...
TTTGGTTTCCGGTAAT-1 2277 0.079324 9291.0 793.240723
TTTGTTGCAGGTACGA-1 1986 0.071226 8101.0 712.257751
TTTGTTGCAGTCTCTC-1 2485 0.055394 9604.0 553.935852
TTTGTTGTCCTTGGAA-1 1998 0.081037 5479.0 810.366882
TTTGTTGTCGCACGAC-1 2468 0.070987 7438.0 709.868225
[2223 rows x 4 columns]
特征提取
cellQC可以看做是对细胞的过滤,即保留可用的细胞。但是每个细胞又有成千的基因(细胞的特征),所以一般需要做特征的选择和提取(也就是降维)。什么意思呢?就是比如每个人都有很多特征,年龄、性别、身高、国籍、学历、生日、爱好、血型。。。我要对一百个人分类,所有的特征都拿来比较,可能得不到准确的划分。而用几个关键的特征可以分出某个国家某个年龄段的人。特征的选择是选择一部分特征,比如选择高变基因;特征提取是提取主要的特征结构,如主成分分析。
在特征提取之前要保证细胞之间是有可比性的,一般用的是归一化的方法,得到高变基因之后,为了使同一个基因在不同细胞之间具有可比性采用标准化。
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
adata
AnnData object with n_obs × n_vars = 2223 × 33694
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells'
uns: 'log1p'sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]
#时刻关注我们数据的变化
adata
Out[117]:
View of AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p'
回归每个细胞的总计数和线粒体基因表达百分比的影响,将数据缩放到单位方差。
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
adata
Out[121]:
AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca'
obsm: 'X_pca'
varm: 'PCs
adata.obsm['X_pca']
adata.varm['PCs']sc.pl.pca(adata, color='CST3')
sc.pl.pca_variance_ratio(adata, log=True)
adata.write(results_file)
特征选择我们用高变基因,特征提取我们用pca。这里就有一个都的问题:如何选择高变基因以及pc的维度?
聚类
scanpy提供leiden和louvain两种图聚类算法,图聚类在开始之前要先找邻居:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
adata
Out[128]:
AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca', 'neighbors'
obsm: 'X_pca'
varm: 'PCs'
sc.tl.leiden(adata)
adata.obs['leiden']
Out[188]:
AAACGCTGTGTGATGG-1 10
AAACGCTTCTAGGCAT-1 6
AAAGAACCACCGTCTT-1 4
AAAGAACTCACCTGGG-1 2
AAAGAACTCGGAACTT-1 6
..
TTTGGTTTCCGGTAAT-1 3
TTTGTTGCAGGTACGA-1 1
TTTGTTGCAGTCTCTC-1 3
TTTGTTGTCCTTGGAA-1 7
TTTGTTGTCGCACGAC-1 3
Name: leiden, Length: 2223, dtype: category
Categories (12, object): [0, 1, 2, 3, ..., 8, 9, 10, 11]
#sc.tl.louvain(adata)
#sc.tl.paga(adata)
和Seurat等人一样,scanpy推荐Traag *等人(2018)提出的Leiden graph-clustering方法(基于优化模块化的社区检测)。注意,Leiden集群直接对cell的邻域图进行聚类,我们在sc.pp.neighbors已经计算过了。
读一下leiden的说明文档:
help(sc.tl.leiden)
Help on function leiden in module scanpy.tools._leiden:
leiden(adata: anndata._core.anndata.AnnData, resolution: float = 1, *, restrict_to: Union[Tuple[str, Sequence[str]], NoneType] = None, random_state: Union[int, mtrand.RandomState, NoneType] = 0, key_added: str = 'leiden', adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, directed: bool = True, use_weights: bool = True, n_iterations: int = -1, partition_type: Union[Type[leidenalg.VertexPartition.MutableVertexPartition], NoneType] = None, copy: bool = False, **partition_kwargs) -> Union[anndata._core.anndata.AnnData, NoneType]
Cluster cells into subgroups [Traag18]_.
Cluster cells using the Leiden algorithm [Traag18]_,
an improved version of the Louvain algorithm [Blondel08]_.
It has been proposed for single-cell analysis by [Levine15]_.
This requires having ran :func:`~scanpy.pp.neighbors` or
:func:`~scanpy.external.pp.bbknn` first.
Parameters
----------
adata
The annotated data matrix.
resolution
A parameter value controlling the coarseness of the clustering.
Higher values lead to more clusters.
Set to `None` if overriding `partition_type`
to one that doesn’t accept a `resolution_parameter`.
random_state
Change the initialization of the optimization.
restrict_to
Restrict the clustering to the categories within the key for sample
annotation, tuple needs to contain `(obs_key, list_of_categories)`.
key_added
`adata.obs` key under which to add the cluster labels.
adjacency
Sparse adjacency matrix of the graph, defaults to
`adata.uns['neighbors']['connectivities']`.
directed
Whether to treat the graph as directed or undirected.
use_weights
If `True`, edge weights from the graph are used in the computation
(placing more emphasis on stronger edges).
n_iterations
How many iterations of the Leiden clustering algorithm to perform.
Positive values above 2 define the total number of iterations to perform,
-1 has the algorithm run until it reaches its optimal clustering.
partition_type
Type of partition to use.
Defaults to :class:`~leidenalg.RBConfigurationVertexPartition`.
For the available options, consult the documentation for
:func:`~leidenalg.find_partition`.
copy
Whether to copy `adata` or modify it inplace.
**partition_kwargs
Any further arguments to pass to `~leidenalg.find_partition`
(which in turn passes arguments to the `partition_type`).
Returns
-------
`adata.obs[key_added]`
Array of dim (number of samples) that stores the subgroup id
(`'0'`, `'1'`, ...) for each cell.
`adata.uns['leiden']['params']`
A dict with the values for the parameters `resolution`, `random_state`,
and `n_iterations`.
seurat的findclusters中聚类的算法亦是有Leiden的,关于leiden和louvain两者的区别,可以看看From Louvain to Leiden: guaranteeing well-connected communities(https://www.nature.com/articles/s41598-019-41695-z)
algorithm
Algorithm for modularity optimization (
1 = original Louvain algorithm;
2 = Louvain algorithm with multilevel refinement;
3 = SLM algorithm;
4 = Leiden algorithm). Leiden requires the leidenalg python.
为了可视化聚类的结果我们还是先做一下umap降维吧,然后看看分群结果。
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
adata
Out[139]:
AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1', 'leiden'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca', 'neighbors', 'leiden', 'paga', 'leiden_sizes', 'umap', 'leiden_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
差异分析
分群后我们得到以数值为标识的亚群编号,我们急于知道每一个数值背后的含义,意义还得从基因上找。差异分析就是为了这目的的。
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
ranking genes
D:\Program Files (x86)\anconda\lib\site-packages\scanpy\tools\_rank_genes_groups.py:252: RuntimeWarning: invalid value encountered in log2
rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
为了使我们的文章图文并茂一些,来看看't-test'检验的每个亚群的差异基因排序:
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.settings.verbosity = 2 # reduce the verbosity
help(sc.tl.rank_genes_groups)
Help on function rank_genes_groups in module scanpy.tools._rank_genes_groups:
rank_genes_groups(adata: anndata._core.anndata.AnnData, groupby: str, use_raw: bool = True, groups: Union[scanpy._compat.Literal_, Iterable[str]] = 'all', reference: str = 'rest', n_genes: int = 100, rankby_abs: bool = False, key_added: Union[str, NoneType] = None, copy: bool = False, method: scanpy._compat.Literal_ = 't-test_overestim_var', corr_method: scanpy._compat.Literal_ = 'benjamini-hochberg', layer: Union[str, NoneType] = None, **kwds) -> Union[anndata._core.anndata.AnnData, NoneType]
Rank genes for characterizing groups.
Parameters
----------
adata
Annotated data matrix.
groupby
The key of the observations grouping to consider.
use_raw
Use `raw` attribute of `adata` if present.
layer
Key from `adata.layers` whose value will be used to perform tests on.
groups
Subset of groups, e.g. [`'g1'`, `'g2'`, `'g3'`], to which comparison
shall be restricted, or `'all'` (default), for all groups.
reference
If `'rest'`, compare each group to the union of the rest of the group.
If a group identifier, compare with respect to this group.
n_genes
The number of genes that appear in the returned tables.
method
The default 't-test_overestim_var' overestimates variance of each group,
`'t-test'` uses t-test, `'wilcoxon'` uses Wilcoxon rank-sum,
`'logreg'` uses logistic regression. See [Ntranos18]_,
`here <https://github.com/theislab/scanpy/issues/95>`__ and `here
<http://www.nxn.se/valent/2018/3/5/actionable-scrna-seq-clusters>`__,
for why this is meaningful.
corr_method
p-value correction method.
Used only for `'t-test'`, `'t-test_overestim_var'`, and `'wilcoxon'`.
rankby_abs
Rank genes by the absolute value of the score, not by the
score. The returned scores are never the absolute values.
key_added
The key in `adata.uns` information is saved to.
**kwds
Are passed to test methods. Currently this affects only parameters that
are passed to :class:`sklearn.linear_model.LogisticRegression`.
For instance, you can pass `penalty='l1'` to try to come up with a
minimal set of genes that are good predictors (sparse solution meaning
few non-zero fitted coefficients).
Returns
-------
**names** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Structured array to be indexed by group id storing the gene
names. Ordered according to scores.
**scores** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Structured array to be indexed by group id storing the z-score
underlying the computation of a p-value for each gene for each
group. Ordered according to scores.
**logfoldchanges** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Structured array to be indexed by group id storing the log2
fold change for each gene for each group. Ordered according to
scores. Only provided if method is 't-test' like.
Note: this is an approximation calculated from mean-log values.
**pvals** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
p-values.
**pvals_adj** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Corrected p-values.
Notes
-----
There are slight inconsistencies depending on whether sparse
or dense data are passed. See `here <https://github.com/theislab/scanpy/blob/master/scanpy/tests/test_rank_genes_groups.py>`__.
Examples
--------
import scanpy as sc
adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, 'bulk_labels', method='wilcoxon')
# to visualize the results
sc.pl.rank_genes_groups(adata)
scanpy差异分析的方法没有seurat丰富了,除了t-test,还有wilcoxon和logreg。
查看差异分析的结果:
sc.get.rank_genes_groups_df(adata, group="0")
Out[148]:
scores names logfoldchanges pvals pvals_adj
0 15.179968 IL7R NaN 1.472705e-43 1.711438e-42
1 13.268551 RGS10 NaN 6.462721e-35 5.595956e-34
2 11.459023 LRRN3 NaN 9.208903e-27 5.465930e-26
3 10.671810 CD7 NaN 3.213348e-24 1.665510e-23
4 8.615908 INTS6 NaN 1.053249e-16 3.856673e-16
.. ... ... ... ... ...
95 1.755937 DUSP16 NaN 7.978160e-02 1.094148e-01
96 1.724632 GALM NaN 8.523017e-02 1.163811e-01
97 1.713933 KCNQ1OT1 NaN 8.719264e-02 1.188403e-01
98 1.706563 PASK NaN 8.851861e-02 1.202764e-01
99 1.696579 SUCO NaN 9.047560e-02 1.227089e-01
[100 rows x 5 columns]
指定哪两组进行差异分析:
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
sc.get.rank_genes_groups_df(adata, group="0")
Out[169]:
scores names logfoldchanges pvals pvals_adj
0 19.284122 EFNB2 2.265986 7.301753e-83 4.030568e-81
1 19.205927 C11orf96 NaN 3.301754e-82 1.350050e-80
2 19.158936 NPDC1 -2.627818 8.152369e-82 2.686632e-80
3 18.702019 EGR4 NaN 4.766577e-78 6.190942e-77
4 18.405581 FBXL8 -0.640331 1.185097e-75 1.104091e-74
.. ... ... ... ... ...
95 11.469543 AP4S1 -3.958649 1.876464e-30 3.019849e-30
96 11.459700 FBXO33 3.057788 2.102432e-30 3.378580e-30
97 11.423295 LDLRAD4 0.052481 3.198732e-30 5.121683e-30
98 11.368316 H1F0 NaN 6.013658e-30 9.587116e-30
99 11.310737 ZCWPW1 0.455119 1.161100e-29 1.836468e-29
[100 rows x 5 columns]
差异基因的可视化
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
{group + '_' + key[:1]: result[key][group]
for group in groups for key in ['names', 'pvals']}).head(5)
0_n 0_p
0 EFNB2 7.301753e-83
1 C11orf96 3.301754e-82
2 NPDC1 8.152369e-82
3 EGR4 4.766577e-78
4 FBXL8 1.185097e-75sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
new_cluster_names = [
'CD4 T', 'CD14 Monocytes',
'B', 'CD8 T',
'NK', 'FCGR3A Monocytes',
'Dendritic', 'Megakaryocytes',
'NewCluster1','NewCluster2','NewCluster3','NewCluster4' ]
adata.rename_categories('leiden', new_cluster_names)
\#Omitting rank_genes_groups/scores as old categories do not match.
\#Omitting rank_genes_groups/names as old categories do not match.
\#Omitting rank_genes_groups/logfoldchanges as old categories do not match.
\#Omitting rank_genes_groups/pvals as old categories do not match.
\#Omitting rank_genes_groups/pvals_adj as old categories do not match.
ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')
ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)
像seurat一样,scanpy也有丰富的数据可视化的方式,太多以至于需要单独再写另一篇文章,这里就不介绍了。
保存分析结果以供下一次调用
adata.write(results_file, compression='gzip') # `compression='gzip'` saves disk space, but slows down writing and subsequent reading
adata.X = None
adata.write('E:\learnscanpy\write\pbmc3k_withoutX.h5ad', compression='gzip')
adata.obs[['n_counts', 'leiden']].to_csv(
'E:\learnscanpy\write\pbmc3k_corrected_lei_groups.csv')
\# Export single fields of the annotation of observations
\# adata.obs[['n_counts', 'louvain_groups']].to_csv(
\# './write/pbmc3k_corrected_louvain_groups.csv')
\# Export single columns of the multidimensional annotation
\# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
\# './write/pbmc3k_corrected_X_pca.csv')
\# Or export everything except the data using `.write_csvs`.
\# Set `skip_data=False` if you also want to export the data.
\# adata.write_csvs(results_file[:-5], )
今天,我们学习了scanpy的一般流程,我们发现不管工具如何变,单细胞转录组的数据分析的大框架是没有变化的,几个分析的工具也是相互借鉴的。但是python的就不值得一学了吗?
SCANPY: large-scale single-cell gene expression data analysis(https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0)