生信编程6.下载最新版的KEGG信息并解析
有一些五六年前的学生们都成长为了各个生物信息学相关公司的小领导,而且他们都有了自己的公众号,知乎号,也算是一番人物。最近他们跟我反馈面试找不到或者说很难直接考核筛选到认真干活的生信工程师,挺有意思的。让我想起来了早在生信技能树论坛创立之初我为了引流,而规划的200个生信工程师面试题。值得继续分享:
为什么要自己下载?
用R包org.Hs.eg.db
中自带的数据进行KEGG分析十分方便,但是由于R包的更新相比于KEGG官网有延迟,所以我们可能会漏掉一些最新添加的通路。自己下载最新版本的KEGG数据集,可以弥补这一点。
首先访问KEGG官网https://www.genome.jp/kegg-bin/get_htext?hsa00001+3101
以人类的数据为列子

下载结果如下

解析
以A开头的是大类:
cat hsa00001.keg | grep '^A'
结果如下:共有8个大类(2021年1月30日)
A09100 Metabolism
A09120 Genetic Information Processing
A09130 Environmental Information Processing
A09140 Cellular Processes
A09150 Organismal Systems
A09160 Human Diseases
A09180 Brite Hierarchies
A09190 Not Included in Pathway or Brite
以B开头的小类:
cat hsa00001.keg | grep '^B' | sort | uniq -c | wc -l
结果文件如下:共有54个小类
1 B 09101 Carbohydrate metabolism
1 B 09102 Energy metabolism
1 B 09103 Lipid metabolism
1 B 09104 Nucleotide metabolism
1 B 09105 Amino acid metabolism
1 B 09106 Metabolism of other amino acids
1 B 09107 Glycan biosynthesis and metabolism
1 B 09108 Metabolism of cofactors and vitamins
1 B 09109 Metabolism of terpenoids and polyketides
1 B 09110 Biosynthesis of other secondary metabolites
1 B 09111 Xenobiotics biodegradation and metabolism
1 B 09112 Not included in regular maps
以C开头的就是KEGG pathway ID所在的行
到目前为止共有529条通路
(base) [test@mu01 KEGG]$ cat hsa00001.keg | grep '^C' | head
C 00010 Glycolysis / Gluconeogenesis [PATH:hsa00010]
C 00020 Citrate cycle (TCA cycle) [PATH:hsa00020]
C 00030 Pentose phosphate pathway [PATH:hsa00030]
C 00040 Pentose and glucuronate interconversions [PATH:hsa00040]
C 00051 Fructose and mannose metabolism [PATH:hsa00051]
C 00052 Galactose metabolism [PATH:hsa00052]
C 00053 Ascorbate and aldarate metabolism [PATH:hsa00053]
C 00500 Starch and sucrose metabolism [PATH:hsa00500]
C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hsa00520]
C 00620 Pyruvate metabolism [PATH:hsa00620](base) [test@mu01 KEGG]$ cat hsa00001.keg | grep '^C' | sort | uniq | wc -l
529
以D开头的就是对应通路的所有基因
(base) [test@mu01 KEGG]$ cat hsa00001.keg | grep '^D' | head
D 3101 HK3; hexokinase 3 K00844 HK; hexokinase [EC:2.7.1.1]
D 3098 HK1; hexokinase 1 K00844 HK; hexokinase [EC:2.7.1.1]
D 3099 HK2; hexokinase 2 K00844 HK; hexokinase [EC:2.7.1.1]
D 80201 HKDC1; hexokinase domain containing 1 K00844 HK; hexokinase [EC:2.7.1.1]
D 2645 GCK; glucokinase K12407 GCK; glucokinase [EC:2.7.1.2]
D 2821 GPI; glucose-6-phosphate isomerase K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D 5213 PFKM; phosphofructokinase, muscle K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
D 5214 PFKP; phosphofructokinase, platelet K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
D 5211 PFKL; phosphofructokinase, liver type K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
D 2203 FBP1; fructose-bisphosphatase 1 K03841 FBP; fructose-1,6-bisphosphatase I [EC:3.1.3.11]
取出通路ID与匹配到的基因
1 #!/usr/bin/python
2 import re
3 A_db = {}
4 with open('hsa00001.keg','r') as f:
5 for line in f.readlines():
6 if line.startswith('C'):
7 C_id = re.search('C\s+?(\d+)\s*',line).group(1)
8 elif line.startswith('D'):
9 rea = re.search('D\s+?(\d+).*',line)
10 if rea:
11 D_id = rea.group(1)
12 print(C_id+'\t'+D_id)
结果如下:
第一列为KEGG通路的ID,第二列为基因ID
00010 3101
00010 3098
00010 3099
00010 80201
00010 2645
00010 2821
00010 5213
00010 5214
00010 5211
00010 2203
注意:有的通路没有基因
文末友情推荐