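Expression microarray analysis in 2011 versus 2019
On a whim, I would like my apprentices to take one expression microarray dataset and compare how it was analyzed when it was first published in 2011 with how other researchers re-mined it in 2019. The dataset is: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE27447
First, the 2011 Oncogene paper
The paper is "FZD7 has a critical role in cell proliferation in triple negative breast cancer", Oncogene 2011 Oct 27;30(43):4437-46, PMID: 21532620. This is expression microarray data from back in 2011, and the analysis route at that time was: first run a differential expression analysis to get the statistically significant up- and downregulated genes, then annotate them against databases such as KEGG, pick one pathway, and finally select specific genes within that pathway, such as FZD7, for downstream analysis.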
Identification of differentially expressed genes was first carried out under the criteria of 1.5-fold upregulation in TNBC with a P-value <0.01.
Two hundred and six genes, including 169 annotated genes (Supplementary Figure 1) were identified as being differentially expressed
The Wnt signaling pathway was identified as a pathway that was significantly overexpressed in TNBC (P<0.05)
FZD7, LRP6 and TCF7 were all upregulated along the Wnt signaling pathway
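The first filtering step quoted above (at least 1.5-fold upregulation with P < 0.01) can be sketched in a few lines of Python. This is only an illustration of the thresholding logic, not the paper's actual pipeline: the gene names and numbers in `toy_results` are made up, and the function name `select_upregulated` is hypothetical.

```python
import math

# Thresholds quoted from the 2011 paper: 1.5-fold upregulation, P < 0.01.
FOLD_CUTOFF = 1.5
P_CUTOFF = 0.01


def select_upregulated(results, fold_cutoff=FOLD_CUTOFF, p_cutoff=P_CUTOFF):
    """Return genes whose log2 fold change and P-value pass both cutoffs.

    `results` maps gene name -> (log2 fold change, P-value).
    """
    log2_cutoff = math.log2(fold_cutoff)  # 1.5-fold is about 0.585 on the log2 scale
    return [
        gene
        for gene, (log2_fc, p_value) in results.items()
        if log2_fc >= log2_cutoff and p_value < p_cutoff
    ]


# Toy values for illustration only (NOT taken from GSE27447).
toy_results = {
    "GENE_1": (1.20, 0.002),   # passes both cutoffs
    "GENE_2": (0.40, 0.001),   # significant, but below 1.5-fold
    "GENE_3": (1.50, 0.200),   # large change, but not significant
    "GENE_4": (-1.10, 0.003),  # downregulated
}

selected = select_upregulated(toy_results)
print(selected)
```

Microarray intensities are usually compared on the log2 scale, which is why the fold cutoff is converted with `math.log2` before filtering.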
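So the authors concentrated on experiments to validate the importance of the single gene FZD7. From today's perspective such a study is quite one-sided, but that was how people understood high-throughput microarray expression data back then.
The 2019 Oncology Letters paper
The paper at https://www.spandidos-publications.com/10.3892/ol.2019.9884 analyzes the same dataset; this is what we usually call data mining.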
In total, 14 pre-treated non-triple-negative breast tumors and 5 triple-negative breast tumors were collected based on the GPL6244 (HuGene-1_0-st) Affymetrix Human Gene 1.0 ST Array.
It starts with the same kind of differential expression analysis:
On the basis of the SAM analysis, a total of 132 upregulated and 198 downregulated DEGs were identified.
The GO analysis was subsequently conducted (Table II).
The results demonstrated that the upregulated DEGs, which included CR2, IGHM, PRKCB, CARD11, PLCG2, CD79A, IGKC and CD27, were relative to the immune response, such as lymphocyte activation (P=1.49×10^-11), leukocyte activation (P=4.68×10^-11) and B-cell activation (P=6.02×10^-8) (Table IIA).
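Apprentice homework
Run the standard analysis workflow from my expression matrix tutorial: volcano plot, heatmap, GO/KEGG database annotation, and so on. The video tutorials for these steps are all on Bilibili and GitHub, with the following table of contents:
Lecture 1: GEO, expression microarrays and R
Lecture 2: Downloading data from GEO to obtain an expression matrix
Lecture 3: Analyzing the expression matrix with the GSEA software
Lecture 4: Differential expression analysis based on group information
Lecture 5: GO/KEGG enrichment analysis of the differentially expressed genes with the hypergeometric test
Lecture 6: Boxplots of specified genes by group, and heatmaps of a specified gene list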
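The hypergeometric enrichment test named in Lecture 5 asks: given N background genes of which K belong to a pathway, how likely is it that a list of n differentially expressed genes contains at least k pathway members by chance? Below is a minimal Python sketch of that calculation using only the standard library. It is not the tutorial's own code (the tutorial works in R), the function name `hypergeom_enrichment_p` is hypothetical, and the numbers in the usage example are toy values.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric P-value, P(X >= k).

    k: pathway genes found in the DEG list
    n: size of the DEG list
    K: pathway genes in the background
    N: size of the background
    """
    upper = min(n, K)  # the overlap cannot exceed either set
    total = comb(N, n)  # all ways to draw n genes from the background
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy example: 10 background genes, 3 in the pathway, 3 DEGs, all 3 in the pathway.
p_value = hypergeom_enrichment_p(k=3, n=3, K=3, N=10)
print(p_value)  # 1/120
```

In a real analysis this P-value is computed for every GO term or KEGG pathway and then corrected for multiple testing.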
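If you are interested, read through the series of posts on mining public expression microarray databases.
Then comment on the final biological stories told by these two papers, which analyzed the same dataset.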