Gene symbols change too quickly: the PAM50 example
The PAM50 dataset can be found in the genefu R package:
library(genefu)
data(pam50)
pam50$centroids.map
A quick look at the mapping table:
probe probe.centroids EntrezGene.ID
ACTR3B ACTR3B ACTR3B 57180
ANLN ANLN ANLN 54443
BAG1 BAG1 BAG1 573
BCL2 BCL2 BCL2 596
BIRC5 BIRC5 BIRC5 332
BLVRA BLVRA BLVRA 644
CCNB1 CCNB1 CCNB1 891
CCNE1 CCNE1 CCNE1 898
CDC20 CDC20 CDC20 991
CDC6 CDC6 CDC6 990
CDCA1 CDCA1 CDCA1 83540
CDH3 CDH3 CDH3 1001
CENPF CENPF CENPF 1063
CEP55 CEP55 CEP55 55165
CXXC5 CXXC5 CXXC5 51523
EGFR EGFR EGFR 1956
ERBB2 ERBB2 ERBB2 2064
ESR1 ESR1 ESR1 2099
EXO1 EXO1 EXO1 9156
FGFR4 FGFR4 FGFR4 2264
FOXA1 FOXA1 FOXA1 3169
FOXC1 FOXC1 FOXC1 2296
GPR160 GPR160 GPR160 26996
GRB7 GRB7 GRB7 2886
KIF2C KIF2C KIF2C 11004
KNTC2 KNTC2 KNTC2 10403
KRT14 KRT14 KRT14 3861
KRT17 KRT17 KRT17 3872
KRT5 KRT5 KRT5 3852
MAPT MAPT MAPT 4137
MDM2 MDM2 MDM2 4193
MELK MELK MELK 9833
MIA MIA MIA 8190
MKI67 MKI67 MKI67 4288
MLPH MLPH MLPH 79083
MMP11 MMP11 MMP11 4320
MYBL2 MYBL2 MYBL2 4605
MYC MYC MYC 4609
NAT1 NAT1 NAT1 9
ORC6L ORC6L ORC6L 23594
PGR PGR PGR 5241
PHGDH PHGDH PHGDH 26227
PTTG1 PTTG1 PTTG1 9232
RRM2 RRM2 RRM2 6241
SFRP1 SFRP1 SFRP1 6422
SLC39A6 SLC39A6 SLC39A6 25800
TMEM45B TMEM45B TMEM45B 120224
TYMS TYMS TYMS 7298
UBE2C UBE2C UBE2C 11065
UBE2T UBE2T UBE2T 29089
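When I tried to map these genes to Ensembl database IDs, I found that 3 of them were missing. A search showed that their symbols had since been renamed. Symbols change fairly quickly: within just a few years, 3 of the 50 genes had a new name.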
# CDCA1 --> NUF2   (NUF2, NDC80 Kinetochore Complex Component)
# KNTC2 --> NDC80
# ORC6L --> ORC6   (Origin Recognition Complex Subunit 6)
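A mismatch like this can be spotted with a plain set difference. The following is only a minimal sketch: `pam50_symbols` is a shortened stand-in for the probe column, and `current_symbols` is a hypothetical vector standing in for the symbols found in an up-to-date Ensembl annotation table (both are assumptions for illustration, not real query results).

```r
# Shortened stand-in for pam50$centroids.map$probe (illustrative subset only).
pam50_symbols <- c("CDCA1", "KNTC2", "ORC6L", "ESR1", "MKI67")

# Hypothetical: symbols present in a current annotation table.
current_symbols <- c("NUF2", "NDC80", "ORC6", "ESR1", "MKI67")

# Symbols from the PAM50 list that the current annotation no longer knows.
missing_symbols <- setdiff(pam50_symbols, current_symbols)
print(missing_symbols)
```

Anything returned here is a candidate for a renamed or retired symbol and is worth looking up by hand.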
These symbols need to be updated in code:
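# Keep only the probe (column 1) and EntrezGene.ID (column 3) columns
pam50genes <- pam50$centroids.map[c(1, 3)]
# Guard: if probe were stored as a factor, assigning a new symbol would give NA
pam50genes$probe <- as.character(pam50genes$probe)
pam50genes[pam50genes$probe == 'CDCA1', 1] <- 'NUF2'
pam50genes[pam50genes$probe == 'KNTC2', 1] <- 'NDC80'
pam50genes[pam50genes$probe == 'ORC6L', 1] <- 'ORC6'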
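If more symbols change later, the same replacement could be driven by a lookup vector instead of one line per gene. This is a sketch under assumptions: the small data frame below is a toy stand-in with the same two columns as `pam50genes` (so it runs without genefu), and `renamed` is an illustrative name for the old-to-new mapping listed above.

```r
# Toy stand-in for pam50$centroids.map[c(1, 3)]; only four rows are shown.
pam50genes <- data.frame(
  probe = c("CDCA1", "KNTC2", "ORC6L", "ESR1"),
  EntrezGene.ID = c(83540, 10403, 23594, 2099),
  stringsAsFactors = FALSE
)

# Old symbol -> current symbol, taken from the three renames above.
renamed <- c(CDCA1 = "NUF2", KNTC2 = "NDC80", ORC6L = "ORC6")

# Replace only the probes that appear in the lookup vector.
hit <- pam50genes$probe %in% names(renamed)
pam50genes$probe[hit] <- unname(renamed[pam50genes$probe[hit]])

print(pam50genes)
```

Adding another rename then only means adding one entry to `renamed`; the Entrez IDs are left untouched.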