Key genes and hub genes (from a biological network perspective)
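Preface

Like earlier posts, this one draws on several papers and my own notes. It is mainly about key genes and hub genes. Many people assume that hub genes are key genes, and some even assume that differentially expressed genes are key genes. Before the main text, here is how I understand the relationship among the three; if you see it differently, please leave a comment.

Differentially expressed genes (DEGs) are genes whose expression differs with statistical significance between two groups. On a microarray, for example, perhaps about 1,000 of the tens of thousands of probes turn out to be differential (the number varies a lot with the chosen thresholds).

Hub genes are genes with a high degree, that is, high connectivity in a gene expression network. Betweenness and similar measures are not involved. Hub selection is also largely subjective: there is no fixed rule on whether to take the top 5% or the top 10%, although 5% is the usual suggestion. In other words, this is a loose criterion.

Key genes: some people take the top-ranked hubs, others take the DEGs with the smallest p-values. So what actually counts as a key gene? Broadly speaking, if knocking the gene down makes the phenotype clearly disappear, it is certainly a key gene. But how do you pick one from bioinformatics analysis alone? No single method settles this. This post looks only at the expression network angle; I will write a later post on screening key genes from several angles. Key genes form a narrower set than hubs. A hub is not necessarily key, and a key gene is not necessarily a hub.

In short, in number or in scope: DEGs > hubs > key genes (candidate genes).

------------------------------------------------

Now for the main text.

Hub genes

The WGCNA approach typically deals with the identification of gene modules by using the gene expression levels that are highly correlated across samples. This technique has been successfully utilized to detect gene modules in Arabidopsis, rice, maize and poplar for various biotic and abiotic stresses. Further, this approach also leads to construction of Gene Co-expression Network (GCN), a scale free network, where genes are represented as nodes and edges depict associations among genes. In such network, highly connected genes are called hub genes, which are expected to play an important role in understanding the biological mechanism of response under stresses/conditions. Identification of hub genes will also help in mitigating the stress in plants through genetic engineering. The existing approaches have mainly focused on hub gene identification, based only on gene connection degrees in the GCN. Moreover, these techniques select such genes empirically without any statistical criteria.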
Besides, few approaches can be found in the literature for the identification of hub nodes in a scale free network.

From this we can see that hub genes are defined within a scale-free co-expression network (the GCN) and correspond to degree. Existing methods select hubs empirically from connection degree, without a statistical criterion, and few published methods address hub nodes in scale-free networks at all. The authors therefore propose an algorithm and provide a package that assigns a p-value to each hub gene, so that the number of hub genes can be reduced with a p-value cutoff. The package is available here (article link 1, article link 2).

It has been a long-standing goal in systems biology to find relations between the topological properties and functional features of protein networks. However, most of the focus in network studies has been on highly connected proteins (“hubs”). As a complementary notion, it is possible to define bottlenecks as proteins with a high betweenness centrality (i.e., network nodes that have many “shortest paths” going through them, analogous to major bridges and tunnels on a highway map). Bottlenecks are, in fact, key connector proteins with surprising functional and dynamic properties. In particular, they are more likely to be essential proteins. In fact, in regulatory and other directed networks, betweenness (i.e., “bottleneck-ness”) is a much more significant indicator of essentiality than degree (i.e., “hub-ness”). Furthermore, bottlenecks correspond to the dynamic components of the interaction network: they are significantly less well coexpressed with their neighbors than nonbottlenecks, implying that expression dynamics is wired into the network topology.

A network is a graph consisting of a number of nodes with edges connecting them. Recently, network models have been widely applied to biological systems. Here, we are mainly interested in two types of biological networks: the interaction network, where nodes are proteins and edges connect interacting partners; and the regulatory network, where nodes are proteins and edges connect transcription factors and their targets. Betweenness is one of the most important topological properties of a network. It measures the number of shortest paths going through a certain node.
Therefore, nodes with the highest betweenness control most of the information flow in the network, representing the critical points of the network. We thus call these nodes the “bottlenecks” of the network. Here, we focus on bottlenecks in protein networks. We find that, in the regulatory network, where there is a clear concept of information flow, protein bottlenecks indeed have a much higher tendency to be essential genes. In this type of network, betweenness is a good predictor of essentiality. Biological researchers can therefore use the betweenness as one more feature to choose potential targets for detailed analysis.
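To make the hub versus bottleneck distinction concrete, here is a minimal sketch in plain Python. It computes degree and betweenness on a toy undirected network and labels each node by whether it falls in the top fraction of each measure. The toy network, the function names and the 25% cutoff are all assumptions made for illustration; they do not come from the quoted papers, and a real analysis would more likely use a graph library.

```python
from collections import deque


def build_toy_network():
    """Hypothetical network: two 5-member complexes (cliques) joined by connector 'c'."""
    edges = []
    for prefix in ("a", "b"):
        members = [f"{prefix}{i}" for i in range(1, 6)]
        edges += [(u, v) for i, u in enumerate(members) for v in members[i + 1:]]
    # 'c' bridges the two complexes; 'p1' is a peripheral node hanging off 'c'.
    edges += [("a1", "c"), ("c", "b1"), ("c", "p1")]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def betweenness(adj):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


def top_fraction(scores, fraction):
    """Nodes scoring at least the cutoff of the top `fraction`; ties at the cutoff are kept."""
    k = max(1, round(len(scores) * fraction))
    cutoff = sorted(scores.values(), reverse=True)[k - 1]
    return {node for node, value in scores.items() if value >= cutoff}


def classify(adj, fraction=0.25):
    """Label every node as hub/nonhub and bottleneck/nonbottleneck (cutoff is an assumption)."""
    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    hubs = top_fraction(degree, fraction)
    bottlenecks = top_fraction(betweenness(adj), fraction)
    labels = {}
    for node in adj:
        hub = "hub" if node in hubs else "nonhub"
        bottleneck = "bottleneck" if node in bottlenecks else "nonbottleneck"
        labels[node] = f"{hub}-{bottleneck}"
    return labels


labels = classify(build_toy_network())
for node in sorted(labels):
    print(node, labels[node])
```

In this toy network the connector `c` has few partners but carries every path between the two complexes, so it comes out as a nonhub-bottleneck, while the interior members of each complex are well connected but carry no shortest paths. Because ties at the cutoff are kept, the hub set here is larger than the nominal 25%.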
Figure1.png
Figure2.png

The difference between hubs and bottlenecks is explained as follows.

Central complex members have a low betweenness and are hub–nonbottlenecks. Because of the high connectivity inside these complexes, paths can go through them and all their neighbors. On the other hand, hub–bottlenecks tend to correspond to highly central proteins that connect several complexes or are peripheral members of central complexes. The fact that they have a high betweenness suggests that these proteins are not, however, simply members of large protein complexes (which is true for nonbottleneck–hubs), but are those members that connect the complex to the rest of the graph; in a sense, real connectivity bottlenecks. While hub–nonbottlenecks mainly consist of structural proteins, hub–bottlenecks are more likely to be part of signal transduction pathways. Furthermore, hub–bottlenecks are (by construction) the most efficient in disrupting the network upon hub removal. This relates nicely to the date/party-hub concept by Han et al.
: hub–bottlenecks tend to be date-hubs, whereas hub–nonbottlenecks tend to be party-hubs.

(Han's paper makes this clear: date hubs are more likely to be the organizers and maintainers of the overall architecture, the big bosses.) Han's view was published in Nature and is summarized below. The paper: https://www.nature.com/articles/nature02555

In apparently scale-free protein–protein interaction networks, or 'interactome' networks [1,2], most proteins interact with few partners, whereas a small but significant proportion of proteins, the 'hubs', interact with many partners.

Nonhub-bottlenecks are usually less co-expressed with their neighbors than nonhub-nonbottlenecks. This fits the observation that betweenness is an indicator of the average correlation with neighboring proteins. Nonhub-bottleneck proteins are rarely members of complexes; most of them are regulatory proteins and signal transduction machinery.

Any scale-free network, biological or not, is robust to the random removal of nodes but highly sensitive to the removal of hubs. Roughly speaking, in a yeast experiment, knocking out genes that encode hub proteins was about three times as likely to be lethal as knocking out non-hub genes. We found two types of hubs. Party hubs interact with most of their partners at the same time. Date hubs bind different partners at different times or locations.
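One way to make the party/date distinction concrete is to ask how strongly a hub's expression profile tracks the profiles of its partners. The sketch below averages the Pearson correlation between a hub and each partner and applies a cutoff. The averaging rule, the 0.5 cutoff, the gene names and the expression values are all assumptions chosen for illustration, not values from the text above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # Raises ZeroDivisionError for a completely flat profile.
    return cov / (norm_x * norm_y)


def mean_partner_correlation(hub, partners, expression):
    """Average correlation between a hub and each of its interaction partners."""
    values = [pearson(expression[hub], expression[p]) for p in partners]
    return sum(values) / len(values)


def classify_hub(hub, partners, expression, cutoff=0.5):
    """'party' if the hub is co-expressed with its partners on average, else 'date'.

    The cutoff is a hypothetical value; in practice it would be chosen from the data.
    """
    if mean_partner_correlation(hub, partners, expression) >= cutoff:
        return "party"
    return "date"


# Hypothetical expression profiles across six conditions.
expression = {
    "hubP": [1, 2, 3, 4, 5, 6],
    "p1": [2, 4, 6, 8, 10, 12],  # rises together with hubP
    "p2": [1, 2, 3, 5, 5, 6],
    "p3": [0, 1, 3, 3, 5, 7],
    "hubD": [2, 3, 2, 3, 2, 3],
    "d1": [5, 1, 1, 1, 1, 1],    # each partner of hubD peaks in a different condition
    "d2": [1, 1, 5, 1, 1, 1],
    "d3": [1, 1, 1, 1, 1, 5],
}
interactions = {"hubP": ["p1", "p2", "p3"], "hubD": ["d1", "d2", "d3"]}

for hub, partners in interactions.items():
    print(hub, classify_hub(hub, partners, expression))
```

Here `hubP` rises and falls with all of its partners, so it is labeled a party hub, while each partner of `hubD` is active under a different condition, which pulls the average correlation down and gives a date hub.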
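Figure3.png

Thus, hubs in the yeast interaction network can be divided into two classes, date hubs and party hubs, based on the expression profiles of their partners. This distinction suggests a model of modular organization for the yeast proteome in which modules are connected through regulators, mediators or adaptors: these are the date hubs. Party hubs are integral components within individual modules. They are important for the functions those modules mediate (and therefore tend to be essential proteins), and they tend to work at a lower level of proteome organization. (Roughly: date hubs are the big bosses who coordinate and connect, and party hubs are the local managers inside each module.)

We propose that date hubs are required for the global organization of biological modules across the whole proteome network and take part in wide-ranging integrative connections (although some date hubs may simply be shared, mediating local functions within or across modules). Key features of such interaction networks, such as genetic robustness and resilience to environmental change, are easier to understand with this modular organization as a framework.

So the so-called date hubs are the hubs with high betweenness (hub-bottlenecks), while party hubs are more likely to be hubs with low betweenness (hub-nonbottlenecks). This finding may point to a previously unknown link between the dynamic and topological properties of interaction networks. The authors believe that, despite current practical difficulties, betweenness will prove to be a very useful tool for many protein networks, especially those with directed edges (regulatory networks).

In summary, we provide an integrated analysis of two complementary topological network properties that applies to different network types. This integrated approach reveals previously unknown relationships among network topology, protein essentiality and expression dynamics. We believe that integrated approaches like the one presented here will be crucial for future predictive models.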