Bioinformatics Programming Live Series, Problem 12: Formatting JSON Data
For the JSON data, everyone should use the test file I provide. Open it in your browser and download it: http://biotrainee.com/jbrowse/JBrowse-1.12.1/sample_data/json/modencode/modencodeMetaData.json

A sample of the file looks like this:

```json
{
   "types" : {
      "data set" : {
         "pluralLabel" : "data sets"
      }
   },
   "items" : [
      {
         "technique" : "ChIP-chip",
         "factor" : "BEAF-32",
         "target" : "Non TF Chromatin binding factor",
         "principal_investigator" : "White, K.",
         "Tracks" : [
            "fly/White_INSULATORS_WIG/BEAF32"
         ],
         "submission" : "21",
         "label" : "BEAF-32;Embryos 0-12 hr;ChIP-chip",
         "category" : "Other chromatin binding sites",
         "type" : "data set",
         "Developmental-Stage" : "Embryos 0-12 hr",
         "organism" : "D. melanogaster"
      },
      {
         "technique" : "ChIP-chip",
         "factor" : "CP190",
         "target" : "Non TF Chromatin binding factor",
         "principal_investigator" : "White, K.",
         "Tracks" : [
            "fly/White_INSULATORS_WIG/CP190"
         ],
         "submission" : "22",
         "label" : "CP190;Embryos 0-12 hr;ChIP-chip",
         "category" : "Other chromatin binding sites",
         "type" : "data set",
         "Developmental-Stage" : "Embryos 0-12 hr",
         "organism" : "D. melanogaster"
      },
```

Because post length is limited, I have only excerpted part of the file; please download it to see the rest. With the complete JSON you can inspect the structure using an online tool: http://json.parser.online.fr/

If you are not familiar with the JSON format, please look it up yourself. The metadata that TCGA now provides through GDC is in JSON format.

We need to extract the following columns from this JSON file: technique factor target principal_investigator submission label category type Developmental-Stage organism key. This can, of course, be done with regular expressions.

The finished result should look like this: http://biotrainee.com/jbrowse/JBrowse-1.12.1/sample_data/json/modencode/modencodeMetaData.csv
You can likewise open it in the browser, download it, and view it in Excel.
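For readers new to JSON, it may help to see how the structure in the excerpt maps onto Perl data: the top-level object becomes a hash reference, and `items` becomes an array reference of hash references. The following is only a minimal sketch. It assumes the core `JSON::PP` module, and the inline sample is a trimmed copy of the first item from the excerpt rather than the full file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP;    # core module, chosen here as an assumption for illustration

# Inline sample trimmed from the excerpt (one item, a few fields only).
my $json_text = <<'JSON';
{
   "types" : { "data set" : { "pluralLabel" : "data sets" } },
   "items" : [
      {
         "technique" : "ChIP-chip",
         "factor" : "BEAF-32",
         "Tracks" : [ "fly/White_INSULATORS_WIG/BEAF32" ],
         "submission" : "21",
         "organism" : "D. melanogaster"
      }
   ]
}
JSON

my $data = decode_json($json_text);

# The top-level JSON object becomes a hash reference.
my @top_keys = sort keys %$data;

# "items" is an array reference of hash references, one per data set.
my $first       = $data->{items}[0];
my $track_count = scalar @{ $first->{Tracks} };

print "top-level keys: @top_keys\n";
print "first factor: $first->{factor}\n";
print "tracks in first item: $track_count\n";
```

Note that no `key` field appears in the excerpt, so that column has to be derived from the other fields when the table is built.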
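I won't go into much more detail. The main difficulty is understanding JSON. For this assignment I recommend using an existing package; regular expressions can do the job, but they are far too much trouble. Here is one Perl solution:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie ':all';    # ':all' also covers system/exec, which needs IPC::System::Simple
use 5.10.0;            # enables say

use JSON 2;

# Slurp the whole file named on the command line and decode it.
my $data = from_json( do { local $/; open my $f, '<', $ARGV[0]; scalar <$f> } );

# Output columns, in order.
my @fields = qw( technique factor target principal_investigator submission label category type Developmental-Stage organism key );

# Header row, every name wrapped in double quotes.
say join ',', map "\"$_\"", @fields;

for my $item ( @{$data->{items}} ) {
    # Keep the original label in the "key" column.
    $item->{key} = $item->{label};
    # Some items lack some fields; print those as empty strings without warnings.
    no warnings 'uninitialized';
    # One output row per track, with the track path in the "label" column.
    for my $track ( @{$item->{Tracks}} ) {
        $item->{label} = $track;
        say join ',', map "\"$_\"", @{$item}{@fields};
    }
}
```

I hope some of you will come up with fresh approaches rather than staying within the bounds of our assignment. You can download the TCGA metadata yourself and try extracting and formatting it.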
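One detail worth checking when formatting other metadata: the Perl code above wraps each value in double quotes but does not escape a double quote that occurs inside a value. The sketch below shows one way to handle that, by doubling embedded quotes, which is the convention spreadsheet programs generally expect. The helper name `csv_row` and the sample values are hypothetical and not part of the original solution; for real work, the `Text::CSV` module on CPAN covers the same ground.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: quote every field and double any embedded quote.
sub csv_row {
    my @values = @_;
    return join ',', map {
        my $v = defined $_ ? $_ : '';    # a missing field becomes an empty string
        $v =~ s/"/""/g;                  # escape embedded double quotes
        qq("$v");
    } @values;
}

# Made-up item: the label contains quotes, and "category" is absent.
my %item = (
    factor                 => 'BEAF-32',
    principal_investigator => 'White, K.',
    label                  => 'say "hi"',
);
my @fields = qw( factor principal_investigator label category );

my $line = csv_row( @item{@fields} );
print "$line\n";
```

Working on a copy of each value (`$v`) leaves the original item untouched, so the same hash can still be reused for further rows.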