生信编程13.根据指定染色体及坐标得到参考碱基
有一些五六年前的学生们都成长为了各个生物信息学相关公司的小领导,而且他们都有了自己的公众号,知乎号,也算是一番人物。最近他们跟我反馈面试找不到或者说很难直接考核筛选到认真干活的生信工程师,挺有意思的。让我想起来了早在生信技能树论坛创立之初我为了引流,而规划的200个生信工程师面试题。值得继续分享:
我们有一个参考基因组,假设是hg38,知道指定的染色体及坐标,假设为"chr5","8397384",那么如何写程序的到这个坐标以及它的上下一个碱基?
批量操作:
我们的VCF文件中记录了所有变异位点的坐标,我们需要知道上下文来做mutation signature的分析。
测试的小fasta文件如下:
>chr_1
ATCGTCGaaAATGAANccNNttGTA
AGGTCTNAAccAAttGggG
>chr_2
ATCGAATGATCGANNNGccTA
AGGTCTNAAAAGG
>chr_3
ATCGTCGANNNGTAATggGA
AGGTCTNAAAAGG
>chr_4
ATCGTCaaaGANNAATGANGgggTAchr1 2
chr1 12
chr2 10
chr3 9
chr4 8
python脚本1:
from collections import OrderedDict
import time
fasta={}
vcf={}
time_start = time.time()
with open('test.fa','r') as f:
for line in f:
if line.startswith('>'):
key=line.strip().lstrip('>')
fasta[key]=[]
else:
fasta[key].append(line.strip().upper())
with open('test.vcf','r') as f:
for line in f:
arr=line.strip().split(' ')
key=arr[0]
if key in fasta:
seq = ''.join(fasta[key])
aim_base = seq[int(arr[1])-2:int(arr[1])+1]
print(key,arr[1],seq,aim_base)
time_end=time.time()
print(time_end-time_start)
结果如下:
chr1 2 ATCGTCGAAAATGAANCCNNTTGTAAGGTCTNAACCAATTGGGG ATC
chr1 12 ATCGTCGAAAATGAANCCNNTTGTAAGGTCTNAACCAATTGGGG ATG
chr2 10 ATCGAATGATCGANNNGCCTAAGGTCTNAAAAGG ATC
chr3 9 ATCGTCGANNNGTAATGGGAAGGTCTNAAAAGG ANN
chr4 8 ATCGTCAAAGANNAATGANGGGGTA AAA
python脚本2:
import os
import pysam
import time
time_start = time.time()
fasta = pysam.FastaFile('test.fa')
print(dir(fasta))
with open('test.vcf','r') as f:
for line in f:
arr = line.strip().split(' ')
print(fasta.fetch(arr[0])[int(arr[1])-2:int(arr[1])+1])
time_end = time.time()
print(time_end-time_start)
结果如下:
ATC
ATG
ATC
ANN
aaa
0.0029845237731933594
python脚本3:
from collections import OrderedDict
import time
time_start = time.time()
gene_seq = OrderedDict()
def get_seq(file,id,pos=0):
with open(file,'rt') as f:
for line in f:
if line.startswith('>'):
ID = line.strip()[1:]
gene_seq[ID] = ''
else:
gene_seq[ID] += line.strip()
if pos:
seq = gene_seq[id][pos-2:pos+1]
else:
seq = gene_seq[id]
print(id,pos,seq)
with open('test.vcf','r') as f:
for line in f:
arr = line.strip().split(' ')
get_seq('test.fa',arr[0],int(arr[1]))
time_end = time.time()
print(time_end-time_start)
结果如下:
chr1 2 ATC
chr1 12 ATG
chr2 10 ATC
chr3 9 ANN
chr4 8 aaa
0.00458073616027832
第一种方法用时最短,但是我粗浅的Python知识可能仍然是献丑了!
文末友情推荐