CV:翻译并解读2019《A Survey of the Recent Architectures of Deep Convolutional Neural Networks》第一章~第三章
导读:人工智能领域,最新计算机视觉文章历史综述以及观察,深度卷积神经网络的最新架构综述。

原作者:Asifullah Khan 1,2,*, Anabia Sohail 1,2, Umme Zahoora 1, and Aqsa Saeed Qureshi 1
1 Pattern Recognition Lab, DCIS, PIEAS, Nilore, Islamabad 45650, Pakistan
2 Deep Learning Lab, Center for Mathematical Sciences, PIEAS, Nilore, Islamabad 45650, Pakistan
asif@pieas.edu.pk

更新中……

相关文章
CV:翻译并解读2019《A Survey of the Recent Architectures of Deep Convolutional Neural Networks》第一章~第三章
CV:翻译并解读2019《A Survey of the Recent Architectures of Deep Convolutional Neural Networks》第四章
CV:翻译并解读2019《A Survey of the Recent Architectures of Deep Convolutional Neural Networks》第五章~第八章
原文下载:https://download.csdn.net/download/qq_41185868/15548439

Abstract
Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art performance on various competitive benchmarks. The powerful learning ability of deep CNN is largely due to the use of multiple feature extraction stages (hidden layers) that can automatically learn representations from the data. Availability of a large amount of data and improvements in the hardware processing units have accelerated the research in CNNs, and recently very interesting deep CNN architectures have been reported. The recent race in developing deep CNNs shows that the innovative architectural ideas, as well as parameter optimization, can improve CNN performance. In this regard, different ideas in the CNN design have been explored, such as the use of different activation and loss functions, parameter optimization, regularization, and restructuring of the processing units. However, the major improvement in the representational capacity of the deep CNN is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is receiving substantial attention. This survey thus focuses on the intrinsic taxonomy present in the recently reported deep CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention. Additionally, this survey also covers the elementary understanding of CNN components and sheds light on its current challenges and applications.

深度卷积神经网络(CNNs)是一种特殊类型的神经网络,在各种竞争性基准测试中表现出了最先进的性能。深度CNN强大的学习能力很大程度上是由于它使用了多个特征提取阶段(隐含层),可以从数据中自动学习表示。大量数据的可用性和硬件处理单元的改进加速了CNN的研究,最近报道了许多非常有意思的深度CNN架构。最近开发深度CNN的竞赛表明,创新的架构思想和参数优化都可以提高CNN的性能。为此,研究者在CNN的设计中探索了不同的思路,如使用不同的激活函数和损失函数、参数优化、正则化以及处理单元的重组。然而,深度CNN表征能力的主要提升是通过处理单元的重组实现的。特别是,使用块(block)而非单个层作为结构单元的想法正受到大量关注。因此,本综述聚焦于最近报道的深度CNN架构中内在的分类体系,并据此将CNN架构的最新创新分为七个不同的类别。这七个类别分别基于空间开发、深度、多路径、宽度、特征图开发、通道提升和注意力机制。此外,本综述还涵盖了对CNN组件的基本理解,并阐明了其当前面临的挑战和应用。

Keywords: Deep Learning, Convolutional Neural Networks, Architecture, Representational Capacity, Residual Learning, and Channel Boosted CNN.

关键词:深度学习,卷积神经网络,架构,表征能力,残差学习,通道提升的CNN

1、Introduction
Machine Learning (ML) algorithms belong to a specialized area in Artificial Intelligence (AI), which endows intelligence to computers by learning the underlying relationships among the data and making decisions without being explicitly programmed. Different ML algorithms have been developed since the late 1990s, for the emulation of human sensory responses such as speech and vision, but they have generally failed to achieve human-level satisfaction [1]–[6].
The challenging nature of Machine Vision (MV) tasks gives rise to a specialized class of Neural Networks (NN), known as Convolutional Neural Network (CNN) [7].

机器学习(ML)算法属于人工智能(AI)的一个专门领域,它通过学习数据之间的基本关系并在没有显式编程的情况下做出决策,从而赋予计算机智能。自20世纪90年代末以来,针对语音、视觉等人类感官反应的仿真,人们开发了各种各样的ML算法,但普遍未能达到人的满意程度[1]-[6]。由于机器视觉(MV)任务的挑战性,产生了一类专门的神经网络(NN),称为卷积神经网络(CNN)[7]。

CNNs are considered as one of the best techniques for learning image content and have shown state-of-the-art results on image recognition, segmentation, detection, and retrieval related tasks [8], [9]. The success of CNN has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have developed active research groups for exploring new architectures of CNN [10]. At present, most of the frontrunners of image processing competitions are employing deep CNN based models.

CNNs被认为是学习图像内容的最佳技术之一,在图像识别、分割、检测和检索相关任务[8]、[9]上已经取得了最先进的成果。CNN的成功吸引了学术界以外的关注。在业界,谷歌、微软、AT&T、NEC、Facebook等公司都建立了活跃的研究小组,探索CNN的新架构[10]。目前,大多数图像处理竞赛的领跑者都在使用基于深度CNN的模型。

The topology of CNN is divided into multiple learning stages composed of a combination of the convolutional layer, non-linear processing units, and subsampling layers [11]. Each layer performs multiple transformations using a bank of convolutional kernels (filters) [12]. Convolution operation extracts locally correlated features by dividing the image into small slices (similar to the retina of the human eye), making it capable of learning suitable features. Output of the convolutional kernels is assigned to non-linear processing units, which not only helps in learning abstraction but also embeds non-linearity in the feature space. This non-linearity generates different patterns of activations for different responses and thus facilitates in learning of semantic differences in images. Output of the non-linear function is usually followed by subsampling, which helps in summarizing the results and also makes the input invariant to geometrical distortions [12], [13].

CNN的拓扑结构分为多个学习阶段,由卷积层、非线性处理单元和子采样层组合而成[11]。每一层使用一组卷积核(滤波器)执行多重变换[12]。卷积操作通过将图像分割成小块(类似于人眼视网膜)来提取局部相关特征,使其能够学习合适的特征。卷积核的输出被送入非线性处理单元,这不仅有助于学习抽象,而且在特征空间中嵌入非线性。这种非线性会为不同的响应产生不同的激活模式,从而有助于学习图像中的语义差异。非线性函数的输出之后通常是子采样,这有助于汇总结果,并使输入对几何畸变保持不变[12],[13]。

The architectural design of CNN was inspired by Hubel and Wiesel's work and thus largely follows the basic structure of primate's visual cortex [14], [15]. CNN first came to limelight through the work of LeCun in 1989 for the processing of grid-like topological data (images and time series data) [7], [16]. The popularity of CNN is largely due to its hierarchical feature extraction ability. Hierarchical organization of CNN emulates the deep and layered learning process of the Neocortex in the human brain, which automatically extracts features from the underlying data [17]. The staging of learning process in CNN shows quite a resemblance with primate's ventral pathway of visual cortex (V1-V2-V4-IT/VTC) [18]. The visual cortex of primates first receives input from the retinotopic area, where multi-scale highpass filtering and contrast normalization is performed by the lateral geniculate nucleus. After this, detection is performed by different regions of the visual cortex categorized as V1, V2, V3, and V4. In fact, the V1 and V2 portions of the visual cortex are similar to convolutional and subsampling layers, whereas the inferior temporal region resembles the higher layers of CNN, which makes inference about the image [19].
During training, CNN learns through the backpropagation algorithm, by regulating the change in weights with respect to the input. Minimization of a cost function by CNN using the backpropagation algorithm is similar to the response based learning of the human brain. CNN has the ability to extract low, mid, and high-level features. High-level features (more abstract features) are a combination of lower and mid-level features. With the automatic feature extraction ability, CNN reduces the need for synthesizing a separate feature extractor [20]. Thus, CNN can learn good internal representation from raw pixels with diminutive processing.

CNN的架构设计灵感来自于Hubel和Wiesel的工作,因此很大程度上遵循了灵长类动物视觉皮层的基本结构[14],[15]。CNN最早是在1989年通过LeCun处理网格状拓扑数据(图像和时间序列数据)的工作而受到关注[7],[16]。CNN的流行很大程度上是由于它的层次特征提取能力。CNN的分层组织模拟了人脑新皮层深层、分层的学习过程,能够自动从底层数据中提取特征[17]。CNN中学习过程的分阶段与灵长类视觉皮层腹侧通路(V1-V2-V4-IT/VTC)非常相似[18]。灵长类动物的视觉皮层首先接收来自视网膜拓扑区(retinotopic area)的输入,在该区域,外侧膝状体核进行多尺度高通滤波和对比度归一化。之后,由视觉皮层中被划分为V1、V2、V3和V4的不同区域进行检测。事实上,视觉皮层的V1和V2部分与卷积层和子采样层相似,而颞下区则类似于CNN的高层,对图像进行推断[19]。

在训练过程中,CNN通过反向传播算法学习,即根据输入来调节权重的变化。CNN利用反向传播算法最小化代价函数的过程,类似于人脑基于响应的学习。CNN能够提取低、中、高级特征。高级特征(更抽象的特征)是低级和中级特征的组合。凭借自动特征提取能力,CNN减少了单独设计特征提取器的需要[20]。因此,CNN只需很少的处理就能从原始像素中学习到良好的内部表示。

The main boom in the use of CNN for image classification and segmentation occurred after it was observed that the representational capacity of a CNN can be enhanced by increasing its depth [21]. Deep architectures have an advantage over shallow architectures when dealing with complex learning problems. Stacking of multiple linear and non-linear processing units in a layer-wise fashion provides deep networks the ability to learn complex representations at different levels of abstraction. In addition, advancements in hardware and thus the availability of high computing resources is also one of the main reasons for the recent success of deep CNNs. Deep CNN architectures have shown significant performance improvements over shallow and conventional vision based models. Apart from its use in supervised learning, deep CNNs have the potential to learn useful representations from large scale unlabeled data. Use of the multiple mapping functions by CNN enables it to improve the extraction of invariant representations and consequently, makes it capable of handling recognition tasks of hundreds of categories. Recently, it is shown that different levels of features including both low and high-level can be transferred to a generic recognition task by exploiting the concept of Transfer Learning (TL) [22]–[24]. Important attributes of CNN are hierarchical learning, automatic feature extraction, multi-tasking, and weight sharing [25]–[27].

CNN用于图像分类和分割的主要兴起,发生在人们观察到CNN的表征能力可以通过增加其深度来增强之后[21]。在处理复杂的学习问题时,深度架构比浅层架构具有优势。以分层方式堆叠多个线性和非线性处理单元,使深层网络能够在不同抽象级别学习复杂表示。此外,硬件的进步以及由此带来的高计算资源的可用性,也是深度CNN最近成功的主要原因之一。深度CNN架构已经显示出比浅层和传统的基于视觉的模型显著更优的性能。除了在监督学习中的应用外,深度CNN还具有从大规模未标记数据中学习有用表示的潜力。CNN使用多重映射函数,提升了不变表示的提取能力,从而能够处理数百个类别的识别任务。近年来的研究表明,利用迁移学习(TL)[22]-[24]的概念,可以将包括低层和高层特征在内的不同层次的特征迁移到一般的识别任务中。CNN的重要特性是分层学习、自动特征提取、多任务处理和权重共享[25]-[27]。

Various improvements in CNN learning strategy and architecture were performed to make CNN scalable to large and complex problems. These innovations can be categorized as parameter optimization, regularization, structural reformulation, etc. However, it is observed that CNN based applications became prevalent after the exemplary performance of AlexNet on the ImageNet dataset [21]. Thus, major innovations in CNN have been proposed since 2012 and were mainly due to restructuring of processing units and designing of new blocks.
Similarly, Zeiler and Fergus [28] introduced the concept of layer-wise visualization of features, which shifted the trend towards extraction of features at low spatial resolution in deep architectures such as VGG [29]. Nowadays, most of the new architectures are built upon the principle of simple and homogenous topology introduced by VGG. On the other hand, the Google group introduced an interesting idea of split, transform, and merge, and the corresponding block is known as the inception block. The inception block for the very first time gave the concept of branching within a layer, which allows abstraction of features at different spatial scales [30]. In 2015, the concept of skip connections introduced by ResNet [31] for the training of deep CNNs got famous, and afterwards, this concept was used by most of the succeeding Nets, such as Inception-ResNet, WideResNet, ResNext, etc. [32]–[34].

在CNN学习策略和体系结构方面进行了各种改进,使CNN能够扩展到大型复杂问题。这些创新可分为参数优化、正则化、结构重构等。然而,据观察,在AlexNet在ImageNet数据集上展现出示范性性能之后,基于CNN的应用才变得普遍[21]。因此,CNN的重大创新是自2012年以来提出的,主要归功于处理单元的重组和新模块(块)的设计。

类似地,Zeiler和Fergus[28]引入了特征逐层可视化的概念,这使研究趋势转向在VGG[29]等深度架构中以较低的空间分辨率提取特征。目前,大多数新的体系结构都是基于VGG提出的简单、同质的拓扑结构原理构建的。另一方面,谷歌团队引入了一个有趣的拆分、变换和合并(split, transform, and merge)思想,相应的块称为inception块。inception块第一次给出了层内分支的概念,允许在不同的空间尺度上抽象特征[30]。2015年,ResNet[31]为训练深层CNN而提出的跳跃连接(skip connections)概念广为流行,此后,这一概念被大多数后续网络所采用,如Inception-ResNet、WideResNet、ResNext等[32]-[34]。

In order to improve the learning capacity of a CNN, different architectural designs such as WideResNet, Pyramidal Net, Xception, etc. explored the effect of multilevel transformations in terms of an additional cardinality and increase in width [32], [34], [35]. Therefore, the focus of research shifted from parameter optimization and connections readjustment towards improved architectural design (layer structure) of the network. This shift resulted in many new architectural ideas such as channel boosting, spatial and channel wise exploitation, and attention based information processing, etc. [36]–[38].

为了提高CNN的学习能力,WideResNet、Pyramidal Net、Xception等不同的结构设计,从增加基数(cardinality)和增加宽度的角度探讨了多级变换的效果[32]、[34]、[35]。因此,研究的重点从参数优化和连接调整转向网络结构设计(层结构)的改进。这种转变催生了许多新的架构思想,如通道提升(channel boosting)、空间和通道利用以及基于注意力的信息处理等[36]-[38]。

In the past few years, different interesting surveys have been conducted on deep CNNs that elaborate the basic components of CNN and their alternatives. The survey reported by [39] has reviewed the famous architectures from 2012-2015 along with their components. Similarly, in the literature, there are prominent surveys that discuss different algorithms of CNN and focus on applications of CNN [20], [26], [27], [40], [41]. Likewise, the survey presented in [42] discussed the taxonomy of CNNs based on acceleration techniques. On the other hand, in this survey, we discuss the intrinsic taxonomy present in the recent and prominent CNN architectures. The various CNN architectures discussed in this survey can be broadly classified into seven main categories, namely: spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention based CNNs. The rest of the paper is organized in the following order (shown in Fig. 1): Section 1 summarizes the underlying basics of CNN, its resemblance with primate's visual cortex, as well as its contribution in MV. In this regard, Section 2 provides the overview on basic CNN components and Section 3 discusses the architectural evolution of deep CNNs. Whereas, Section 4 discusses the recent innovations in CNN architectures and categorizes CNNs into seven broad classes.
Sections 5 and 6 shed light on applications of CNNs and current challenges, whereas Section 7 discusses future work and the last section draws the conclusion.

在过去的几年里,已有多篇关于深度CNN的有趣综述,阐述了CNN的基本组成部分及其替代方案。[39]报告的综述回顾了2012-2015年间的著名架构及其组成部分。类似地,在文献中,有一些著名的综述讨论了CNN的不同算法,并着重于CNN的应用[20]、[26]、[27]、[40]、[41]。同样,[42]中的综述讨论了基于加速技术的CNN分类。而在本综述中,我们讨论的是最近的著名CNN架构中内在的分类体系。本综述中讨论的各种CNN架构大致可分为七大类,即:空间开发、深度、多路径、宽度、特征图开发、通道提升和基于注意力的CNN。论文的其余部分按以下顺序组织(如图1所示):第1节总结了CNN的基本原理、它与灵长类视觉皮层的相似性,以及它在MV中的贡献。在这方面,第2节概述了基本CNN组件,第3节讨论了深度CNN的体系结构演变。第4节讨论了CNN体系结构的最新创新,并将CNN分为七大类。第5节和第6节阐述了CNN的应用和当前面临的挑战,第7节讨论了未来的工作,最后一节得出结论。
Fig. 1: Organization of the survey paper.

2 Basic CNN Components
Nowadays, CNN is considered as the most widely used ML technique, especially in vision related applications. CNNs have recently shown state-of-the-art results in various ML applications. A typical block diagram of an ML system is shown in Fig. 2. Since CNN possesses both good feature extraction and strong discrimination ability, in an ML system it is mostly used for feature extraction and classification.

目前,CNN被认为是应用最广泛的ML技术,尤其是在视觉相关应用中。CNN最近在各种ML应用中展现了最先进的结果。ML系统的典型框图如图2所示。由于CNN同时具有良好的特征提取能力和较强的判别能力,因此在ML系统中,它主要用于特征提取和分类。

A typical CNN architecture generally comprises alternate layers of convolution and pooling followed by one or more fully connected layers at the end. In some cases, the fully connected layer is replaced with a global average pooling layer. In addition to the various learning stages, different regulatory units such as batch normalization and dropout are also incorporated to optimize CNN performance [43]. The arrangement of CNN components plays a fundamental role in designing new architectures and thus achieving enhanced performance. This section briefly discusses the role of these components in CNN architecture.

典型的CNN体系结构通常由交替的卷积层和池化层组成,最后是一个或多个全连接层。在某些情况下,全连接层被替换为全局平均池化层。除了各个学习阶段之外,不同的调节单元(regulatory units),如批量归一化(batch normalization)和dropout,也被纳入其中以优化CNN的性能[43]。CNN组件的排列方式在设计新的体系结构、进而获得更好的性能方面起着基础性的作用。本节简要讨论这些组件在CNN架构中的作用。

2.1 Convolutional Layer
Convolutional layer is composed of a set of convolutional kernels (each neuron acts as a kernel). These kernels are associated with a small area of the image known as a receptive field. It works by dividing the image into small blocks (receptive fields) and convolving them with a specific set of weights (multiplying elements of the filter with the corresponding receptive field elements) [43]. Convolution operation can be expressed as follows:

卷积层由一组卷积核组成(每个神经元充当一个核)。这些核与图像中被称为感受野的一小块区域相关联。它的工作原理是将图像分割成小块(感受野),并用一组特定的权重对其进行卷积(即将滤波器的元素与相应的感受野元素相乘)[43]。卷积运算可以表示为:
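注:原文此处的公式(1)为图片,未能在本文中显示。根据上下文的符号约定,二维卷积的标准形式大致可重构如下(仅为示意,具体记法以原论文为准):

F_l^k(x, y) = (I * K_l^k)(x, y) = \sum_{u}\sum_{v} I(x-u,\, y-v)\, K_l^k(u, v)    ……(1)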
Where the input image is represented by I, (x, y) shows the spatial locality, and K_l^k represents the lth convolutional kernel of the kth layer. Division of the image into small blocks helps in extracting locally correlated pixel values. This locally aggregated information is also known as a feature motif. Different sets of features within the image are extracted by sliding the convolutional kernel on the whole image with the same set of weights. This weight sharing feature of the convolution operation makes CNN parameter efficient as compared to fully connected Nets. Convolution operation may further be categorized into different types based on the type and size of filters, type of padding, and the direction of convolution [44]. Additionally, if the kernel is symmetric, the convolution operation becomes a correlation operation [16].

其中,输入图像用 I 表示,(x, y) 表示空间位置,K_l^k 表示第k层的第l个卷积核。将图像分割成小块有助于提取局部相关的像素值。这种局部聚集的信息也被称为特征图案(feature motif)。通过在整幅图像上以同一组权重滑动卷积核,可以提取图像中不同的特征集。与全连接网络相比,卷积运算的这种权值共享特性使得CNN的参数更为高效。卷积操作还可以基于滤波器的类型和大小、填充的类型以及卷积的方向被进一步分为不同的类型[44]。另外,如果核是对称的,卷积操作就变成了相关操作[16]。

2.2 Pooling Layer
Feature motifs, which result as an output of the convolution operation, can occur at different locations in the image. Once features are extracted, their exact location becomes less important as long as their approximate position relative to others is preserved. Pooling or downsampling, like convolution, is an interesting local operation. It sums up similar information in the neighborhood of the receptive field and outputs the dominant response within this local region [45].

作为卷积运算输出的特征图案可以出现在图像的不同位置。一旦特征被提取,只要其相对于其他特征的大致位置得以保留,其精确位置就变得不那么重要了。与卷积类似,池化或下采样是一种有趣的局部操作。它汇总感受野邻域内的相似信息,并输出该局部区域内的主导响应[45]。
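注:原文此处的公式(2)为图片,未能在本文中显示。结合上下文符号,池化运算的一般形式大致可重构如下(仅为示意):

Z_l = f_p(F_{x,y}^l)    ……(2)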
Equation (2) shows the pooling operation, in which Z_l represents the lth output feature map, F_{x,y}^l shows the lth input feature map, whereas f_p(.) defines the type of pooling operation. The use of pooling operation helps to extract a combination of features, which are invariant to translational shifts and small distortions [13], [46]. Reduction in the size of the feature map to an invariant feature set not only regulates the complexity of the network but also helps in increasing the generalization by reducing overfitting. Different types of pooling formulations such as max, average, L2, overlapping, spatial pyramid pooling, etc. are used in CNN [47]–[49].

等式(2)表示池化操作,其中 Z_l 表示第l个输出特征图,F_{x,y}^l 表示第l个输入特征图,而 f_p(.) 定义池化操作的类型。池化操作有助于提取对平移和小幅畸变保持不变的特征组合[13],[46]。将特征图缩减为不变特征集不仅可以调节网络的复杂度,而且有助于通过减少过拟合来提高泛化能力。CNN中使用了不同类型的池化公式,如最大池化、平均池化、L2池化、重叠池化、空间金字塔池化等[47]-[49]。

2.3 Activation Function
Activation function serves as a decision function and helps in learning a complex pattern. Selection of an appropriate activation function can accelerate the learning process. Activation function for a convolved feature map is defined in equation (3).

激活函数作为一个决策函数,有助于学习复杂的模式。选择合适的激活函数可以加速学习过程。卷积特征图的激活函数在等式(3)中定义。
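注:原文此处的公式(3)为图片,未能在本文中显示。结合上下文符号,其形式大致可重构如下(仅为示意):

T_l^k = f_A(F_l^k)    ……(3)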
In the above equation, F_l^k is the output of a convolution operation, which is assigned to the activation function f_A(.) that adds non-linearity and returns a transformed output T_l^k for the kth layer. In literature, different activation functions such as sigmoid, tanh, maxout, ReLU, and variants of ReLU such as leaky ReLU, ELU, and PReLU [39], [48], [50], [51] are used to inculcate a non-linear combination of features. However, ReLU and its variants are preferred over other activations as they help in overcoming the vanishing gradient problem [52], [53].

在上面的等式中,F_l^k 是卷积运算的输出,它被送入激活函数 f_A(.);该函数引入非线性,并返回第k层经变换后的输出 T_l^k。在文献中,不同的激活函数,如sigmoid、tanh、maxout、ReLU,以及ReLU的变体,如leaky ReLU、ELU和PReLU[39]、[48]、[50]、[51],被用来引入特征的非线性组合。然而,ReLU及其变体比其他激活函数更受欢迎,因为它们有助于克服梯度消失问题[52],[53]。

Fig. 2: Basic layout of a typical ML system. In ML related tasks, initially data is preprocessed and then assigned to a classification system. A typical ML problem follows three steps: stage 1 is related to data gathering and generation, stage 2 performs preprocessing and feature selection, whereas stage 3 is based on model selection, parameter tuning, and analysis. CNN has good feature extraction and strong discrimination ability; therefore, in an ML system it can be used for feature extraction and classification.

图2:典型ML系统的基本布局。在与ML相关的任务中,首先对数据进行预处理,然后将其送入分类系统。一个典型的ML问题包含三个步骤:阶段1与数据收集和生成相关,阶段2执行预处理和特征选择,阶段3则基于模型选择、参数调整和分析。CNN具有良好的特征提取能力和较强的判别能力,因此在ML系统中可用于特征提取和分类。
2.4 Batch Normalization
注:根据博主的经验,此处常为考点!

Batch normalization is used to address the issues related to internal covariance shift within feature maps. The internal covariance shift is a change in the distribution of hidden units' values, which slows down the convergence (by forcing the learning rate to a small value) and requires careful initialization of parameters. Batch normalization for a transformed feature map T_l^k is shown in equation (4).

批量归一化用于解决与特征图内部协方差偏移(internal covariance shift)相关的问题。内部协方差偏移是隐藏单元取值分布的一种变化,它会减慢收敛速度(迫使学习率取较小的值),并且需要谨慎地初始化参数。变换后的特征图 T_l^k 的批量归一化如等式(4)所示。
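注:原文此处的公式(4)为图片,未能在本文中显示。结合上下文符号,批量归一化的标准形式大致可重构如下(其中 ε 为保证数值稳定而加入的小常数,仅为示意):

N_l^k = (F_l^k - μ_B) / \sqrt{σ_B^2 + ε}    ……(4)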
In equation (4), N_l^k represents the normalized feature map, F_l^k is the input feature map, and μ_B and σ_B^2 depict the mean and variance of a feature map for a mini-batch, respectively. Batch normalization unifies the distribution of feature map values by bringing them to zero mean and unit variance [54]. Furthermore, it smoothens the flow of gradient and acts as a regulating factor, which thus helps in improving generalization of the network.

在式(4)中,N_l^k 表示归一化后的特征图,F_l^k 是输入特征图,μ_B 和 σ_B^2 分别表示一个小批量(mini-batch)内特征图的均值和方差。批量归一化通过将特征图的取值变换为零均值、单位方差来统一其分布[54]。此外,它平滑了梯度的流动,起到了调节因子的作用,从而有助于提高网络的泛化能力。

2.5 Dropout
Dropout introduces regularization within the network, which ultimately improves generalization by randomly skipping some units or connections with a certain probability. In NNs, multiple connections that learn a non-linear relation are sometimes co-adapted, which causes overfitting [55]. This random dropping of some connections or units produces several thinned network architectures, and finally one representative network is selected with small weights. This selected architecture is then considered as an approximation of all of the proposed networks [56].

Dropout在网络中引入正则化,通过以一定概率随机跳过某些单元或连接,最终提高泛化能力。在神经网络中,学习非线性关系的多个连接有时会发生协同适应,从而导致过拟合[55]。随机丢弃一些连接或单元会产生若干"变瘦"的网络结构,最终选出一个权重较小的代表性网络。然后,这个被选中的结构可以被看作是所有这些网络的一个近似[56]。

2.6 Fully Connected Layer
Fully connected layer is mostly used at the end of the network for classification purposes. Unlike pooling and convolution, it is a global operation. It takes input from the previous layer and globally analyses the output of all the preceding layers [57]. This creates a non-linear combination of selected features, which is used for the classification of data [58].

全连接层主要用于网络末端的分类。与池化和卷积不同,它是一个全局操作。它接受前一层的输入,并全局地分析所有前面各层的输出[57]。这形成了所选特征的非线性组合,用于数据的分类[58]。
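注:为便于理解第2节各组件如何组合成一个典型的CNN,下面给出一个基于PyTorch的极简示例(原文中没有此代码,网络结构、通道数、dropout概率等参数均为假设值,仅作示意):

import torch
import torch.nn as nn

# A minimal CNN assembling the components discussed in Section 2:
# convolution (2.1), pooling (2.2), ReLU activation (2.3),
# batch normalization (2.4), dropout (2.5), and a fully connected layer (2.6).
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.BatchNorm2d(16),                           # batch normalization
            nn.ReLU(),                                    # activation function
            nn.MaxPool2d(2),                              # pooling / downsampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                            # dropout regularization
            nn.Linear(32 * 8 * 8, num_classes),           # fully connected layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of four 3x32x32 images -> class scores for 10 classes.
logits = SimpleCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])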
Fig. 3: Evolutionary history of deep CNNs.

3 Architectural Evolution of Deep CNN
Nowadays, CNNs are considered as the most widely used algorithms among biologically inspired AI techniques. CNN history begins from the neurobiological experiments conducted by Hubel and Wiesel (1959, 1962) [14], [59]. Their work provided a platform for many cognitive models, almost all of which were later replaced by CNN. Over the decades, different efforts have been carried out to improve the performance of CNNs. This history is pictorially represented in Fig. 3. These improvements can be categorized into five different eras and are discussed below.

目前,CNN被认为是受生物启发的人工智能技术中应用最广泛的算法。CNN的历史始于Hubel和Wiesel(1959、1962年)进行的神经生物学实验[14],[59]。他们的工作为许多认知模型提供了平台,而这些认知模型后来几乎都被CNN所取代。几十年来,人们进行了各种努力来提高CNN的性能。这段历史在图3中以图形方式表示。这些改进可以分为五个不同的时代,并在下面讨论。

3.1 Late 1980s-1999: Origin of CNN
CNNs have been applied to visual tasks since the late 1980s. In 1989, LeCun et al. proposed the first multilayered CNN named ConvNet, whose origin is rooted in Fukushima's Neocognitron [60], [61]. LeCun proposed supervised training of ConvNet, using the backpropagation algorithm [7], [62], in comparison to the unsupervised reinforcement learning scheme used by its predecessor Neocognitron. LeCun's work thus laid the foundation for the modern 2D CNNs. Supervised training in CNN provides the automatic feature learning ability from raw input, rather than the designing of handcrafted features used by traditional ML methods. This ConvNet showed successful results for handwritten digit and zip code recognition related problems [63]. In 1998, ConvNet was improved by LeCun and used for classifying characters in a document recognition application [64]. This modified architecture was named LeNet-5, which was an improvement over the initial CNN as it can extract feature representations in a hierarchical way from raw pixels [65]. Reliance of LeNet-5 on fewer parameters along with consideration of the spatial topology of images enabled CNN to recognize rotational variants of the image [65]. Due to the good performance of CNN in optical character recognition, its commercial use in ATMs and banks started in 1993 and 1996, respectively. Though many successful milestones were achieved by LeNet-5, the main concern associated with it was that its discrimination power was not scaled to classification tasks other than hand recognition.

自20世纪80年代末以来,CNN已被应用于视觉任务中。1989年,LeCun等人提出了第一个名为ConvNet的多层CNN,其起源可追溯到Fukushima的Neocognitron[60],[61]。与其前身Neocognitron使用的无监督强化学习方案不同,LeCun提出了使用反向传播算法[7],[62]对ConvNet进行有监督训练。LeCun的工作由此为现代二维CNN奠定了基础。CNN中的有监督训练提供了从原始输入中自动学习特征的能力,而不是像传统ML方法那样依赖手工设计的特征。这个ConvNet在手写数字和邮政编码识别相关问题上取得了成功的结果[63]。1998年,LeCun改进了ConvNet,并将其用于文档识别应用中的字符分类[64]。这种改进的结构被命名为LeNet-5,它是对最初CNN的改进,因为它可以从原始像素中以分层的方式提取特征表示[65]。LeNet-5对较少参数的依赖以及对图像空间拓扑的考虑,使CNN能够识别图像的旋转变体[65]。由于CNN在光学字符识别方面的良好性能,其在ATM和银行中的商业应用分别始于1993年和1996年。尽管LeNet-5取得了许多成功的里程碑,但与之相关的主要问题是,它的判别能力并没有扩展到手写识别以外的分类任务。

3.2 Early 2000: Stagnation of CNN
In the late 1990s and early 2000s, interest in NNs reduced and less attention was given to exploring the role of CNNs in different applications such as object detection, video surveillance, etc. Use of CNN in ML related tasks became dormant due to the insignificant improvement in performance at the cost of high computational time. At that time, other statistical methods and, in particular, SVM became more popular than CNN due to their relatively high performance [66]–[68].
It was widely presumed in early 2000 that the backpropagation algorithm used for training of CNN was not effective in converging to optimal points and therefore unable to learn useful features in supervised fashion as compared to handcrafted features [69]. Meanwhile, different researchers kept working on CNN and tried to optimize its performance. In 2003, Simard et al. improved CNN architecture and showed good results as compared to SVM on a handwritten digit benchmark dataset, MNIST [64], [68], [70]–[72]. This performance improvement expedited the research in CNN by extending its application in optical character recognition (OCR) to other scripts' character recognition [72]–[74], deployment in image sensors for face detection in video conferencing, and regulation of street crimes, etc. Likewise, CNN based systems were industrialized in markets for tracking customers [75]–[77]. Moreover, CNN's potential in other applications such as medical image segmentation, anomaly detection, and robot vision was also explored [78]–[80].

在20世纪90年代末和21世纪初,人们对神经网络的兴趣逐渐减少,对CNN在目标检测、视频监控等不同应用中的作用的探索也越来越少。由于以较高计算时间为代价换来的性能提升微不足道,CNN在ML相关任务中的应用逐渐沉寂。当时,其他统计方法,特别是支持向量机(SVM),由于其相对较高的性能而变得比CNN更受欢迎[66]-[68]。

2000年初,人们普遍认为,用于CNN训练的反向传播算法在收敛到最优点方面并不有效,因此与手工设计的特征相比,无法以有监督的方式学习到有用的特征[69]。与此同时,不同的研究人员继续研究CNN,并试图优化其性能。2003年,Simard等人改进了CNN的体系结构,在手写数字基准数据集MNIST上显示出优于SVM的良好结果[64],[68],[70]-[72]。这种性能的提高加速了CNN的研究,将其在光学字符识别(OCR)中的应用扩展到其他文字的字符识别[72]-[74]、部署在图像传感器中用于视频会议中的人脸检测,以及街头犯罪的监管等。同样,基于CNN的系统也在市场上实现了工业化,用于跟踪客户[75]-[77]。此外,CNN在医学图像分割、异常检测和机器人视觉等其他应用领域的潜力也得到了探索[78]-[80]。

3.3 2006-2011: Revival of CNN
Deep NNs generally have a complex architecture and a time-intensive training phase that sometimes spans over weeks and even months. In early 2000, there were only a few techniques for the training of deep networks. Additionally, it was considered that CNN is not able to scale for complex problems. These challenges halted the use of CNN in ML related tasks.

深度神经网络通常具有复杂的结构和耗时的训练阶段,有时持续数周甚至数月。在2000年初,只有少数几种技术可用于训练深层网络。此外,人们认为CNN无法扩展到复杂的问题。这些挑战阻碍了CNN在ML相关任务中的应用。

To address these problems, in 2006 many interesting methods were reported to overcome the difficulties encountered in the training of deep CNNs and learning of invariant features. Hinton proposed the greedy layer-wise pre-training approach in 2006, for deep architectures, which revived and reinstated the importance of deep learning [81], [82]. The revival of deep learning [83], [84] was one of the factors which brought deep CNNs into the limelight. Huang et al. (2006) used max pooling instead of subsampling, which showed good results by learning of invariant features [46], [85].

为了解决这些问题,2006年报道了许多有趣的方法来克服在训练深层CNN和学习不变特征方面遇到的困难。Hinton在2006年针对深层架构提出了贪婪的逐层预训练方法,这重新确立了深度学习的重要性[81],[82]。深度学习的复兴[83],[84]是使深度CNN成为焦点的因素之一。Huang等人(2006)使用最大池化代替子采样,通过学习不变特征取得了良好的结果[46],[85]。

In late 2006, researchers started using graphics processing units (GPUs) [86], [87] to accelerate the training of deep NN and CNN architectures [88], [89]. In 2007, NVIDIA launched the CUDA programming platform [90], [91], which allows exploitation of the parallel processing capabilities of GPUs to a much greater degree [92]. In essence, the use of GPUs for NN training [88], [93] and other hardware improvements were the main factors that revived the research in CNN. In 2010, Fei-Fei Li's group at Stanford established a large database of images known as ImageNet, containing millions of labeled images [94].
This database was coupled with the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions, where the performances of various models have been evaluated and scored [95]. Consequently, ILSVRC and NIPS have been very active in strengthening research and increasing the use of CNN and thus making it popular. This was a turning point in improving the performance and increasing the use of CNN.

2006年末,研究人员开始使用图形处理单元(GPU)[86],[87]来加速深度神经网络和CNN架构的训练[88],[89]。2007年,NVIDIA推出了CUDA编程平台[90],[91],使GPU的并行处理能力可以在更大程度上得到利用[92]。从本质上讲,GPU在神经网络训练中的应用[88]、[93]以及其他硬件的改进是使CNN研究重新活跃起来的主要因素。2010年,李飞飞在斯坦福大学的团队建立了一个名为ImageNet的大型图像数据库,其中包含数百万张带标注的图像[94]。该数据库与一年一度的ImageNet大规模视觉识别挑战赛(ILSVRC)相结合,对各种模型的性能进行了评估和评分[95]。因此,ILSVRC和NIPS在加强研究和推广CNN的使用方面非常活跃,从而使其流行起来。这成为提高CNN性能并推动其广泛应用的一个转折点。

3.4 2012-2014: Rise of CNN
Availability of big training data, hardware advancements, and computational resources contributed to the advancement in CNN algorithms. A renaissance of CNN in object detection, image classification, and segmentation related tasks had been observed in this period [9], [96]. However, the success of CNN in image classification tasks was not only due to the aforementioned factors but was largely contributed by the architectural modifications, parameter optimization, incorporation of regulatory units, and reformulation and readjustment of connections within the network [39], [42], [97].

大规模训练数据的可用性、硬件的进步和计算资源促进了CNN算法的发展。在这一时期,人们观察到CNN在目标检测、图像分类和分割相关任务中的复兴[9],[96]。然而,CNN在图像分类任务中的成功不仅源于上述因素,还在很大程度上归功于架构的修改、参数的优化、调节单元的引入以及网络内连接的重新设计和调整[39]、[42]、[97]。

The main breakthrough in CNN performance was brought by AlexNet [21]. AlexNet won the 2012-ILSVRC competition, which has been one of the most difficult challenges in image detection and classification. AlexNet improved performance by exploiting depth (incorporating multiple levels of transformation) and introduced a regularization term in CNN. The exemplary performance of AlexNet [21] compared to conventional ML techniques in 2012-ILSVRC (AlexNet reduced the error rate from 25.8 to 16.4) suggested that the main reason for the saturation in CNN performance before 2006 was largely the unavailability of enough training data and computational resources. In summary, before 2006, these resource deficiencies made it hard to train a high-capacity CNN without deterioration of performance [98].

CNN性能的主要突破是由AlexNet带来的[21]。AlexNet赢得了2012-ILSVRC比赛,这是图像检测和分类领域最困难的挑战之一。AlexNet利用深度(包含多个层次的变换)提高了性能,并在CNN中引入了正则化项。在2012-ILSVRC中,与传统ML技术相比,AlexNet的出色表现(AlexNet将错误率从25.8降低到16.4)表明,2006年之前CNN性能停滞的主要原因在于缺乏足够的训练数据和计算资源。总之,在2006年之前,这些资源的不足使得很难在不降低性能的情况下训练高容量的CNN[98]。

With CNN becoming more of a commodity in the computer vision (CV) field, a number of attempts have been made to improve the performance of CNN with reduced computational cost. Therefore, each new architecture tries to overcome the shortcomings of previously proposed architectures in combination with new structural reformulations. In the years 2013 and 2014, researchers mainly focused on parameter optimization to accelerate CNN performance in a range of applications with a small increase in computational complexity. In 2013, Zeiler and Fergus [28] defined a mechanism to visualize the learned filters of each CNN layer. The visualization approach was used to improve the feature extraction stage by reducing the size of the filters.
Similarly, the VGG architecture [29] proposed by the Oxford group, which was runner-up at the 2014-ILSVRC competition, made the receptive field much smaller in comparison to that of AlexNet but with increased volume. In VGG, depth was increased from 9 layers to 16, by making the volume of feature maps double at each layer. In the same year, GoogleNet [99], which won the 2014-ILSVRC competition, not only exerted its efforts to reduce computational cost by changing the layer design, but also widened the width in compliance with depth to improve CNN performance. GoogleNet introduced the concept of split, transform, and merge based blocks, within which multiscale and multilevel transformation is incorporated to capture both local and global information [33], [99], [100]. The use of multilevel transformations helps CNN in tackling details of images at various levels. In the years 2012-14, the main improvement in the learning capacity of CNN was achieved by increasing its depth and parameter optimization strategies. This suggested that the depth of a CNN helps in improving the performance of a classifier.

随着CNN在计算机视觉(CV)领域日益普及,人们进行了许多尝试,力求在降低计算成本的同时提高CNN的性能。因此,每一个新的架构都试图结合新的结构重组来克服先前架构的缺点。在2013年和2014年,研究人员主要集中在参数优化上,以在计算复杂度仅少量增加的情况下提升CNN在一系列应用中的性能。2013年,Zeiler和Fergus[28]定义了一种可视化每个CNN层所学滤波器的机制。这种可视化方法被用于通过减小滤波器的尺寸来改善特征提取阶段。

类似地,由Oxford团队提出、在2014-ILSVRC竞赛中获得亚军的VGG架构[29],其感受野比AlexNet的小得多,但特征图的体积有所增加。在VGG中,通过使每层的特征图体积加倍,深度从9层增加到16层。同年,赢得2014-ILSVRC竞赛的GoogleNet[99]不仅努力通过改变层的设计来降低计算成本,还在加深的同时拓宽了宽度以提高CNN的性能。GoogleNet引入了基于拆分、变换和合并的块的概念,其中结合了多尺度和多级变换来捕获局部和全局信息[33]、[99]、[100]。多级变换的使用有助于CNN处理不同层次的图像细节。2012-2014年间,CNN学习能力的主要提升是通过增加其深度和参数优化策略来实现的。这表明CNN的深度有助于提高分类器的性能。

3.5 2015-Present: Rapid increase in Architectural Innovations and Applications of CNN
It is generally observed that the major improvements in CNN performance occurred from 2015-2019. The research in CNN is still ongoing and has a significant potential for improvement. Representational capacity of CNN depends on its depth and, in a sense, can help in learning complex problems by defining diverse levels of features ranging from simple to complex. Multiple levels of transformation make learning easy by chopping complex problems into smaller modules. However, the main challenge faced by deep architectures is the problem of negative learning, which occurs due to the diminishing gradient at lower layers of the network. To handle this problem, different research groups worked on readjustment of layer connections and design of new modules. In early 2015, Srivastava et al. used the concept of cross-channel connectivity and an information gating mechanism to solve the vanishing gradient problem and to improve the network representational capacity [101]–[103]. This idea got famous in late 2015 and a similar concept of residual blocks or skip connections was coined [31]. Residual blocks are a variant of cross-channel connectivity, which smoothen learning by regularizing the flow of information across blocks [104]–[106]. This idea was used in the ResNet architecture for the training of a 150-layer deep network [31]. The idea of cross-channel connectivity is further extended to multilayer connectivity by Deluge, DenseNet, etc. to improve representation [107], [108].
一般认为,CNN性能的重大改进发生在2015-2019年间。CNN的研究仍在进行中,并且有很大的改进潜力。CNN的表征能力取决于它的深度,在某种意义上,它可以通过定义从简单到复杂的不同层次的特征来帮助学习复杂的问题。通过将复杂的问题分解成较小的模块,多层次的变换使学习变得容易。然而,深度架构面临的主要挑战是负学习(negative learning)问题,这是由于网络较低层的梯度逐渐减小而产生的。为了解决这个问题,不同的研究小组致力于重新调整层间连接和设计新的模块。2015年初,Srivastava等人利用跨通道连接和信息选通机制的概念来解决梯度消失问题并提高网络的表征能力[101]-[103]。这一想法在2015年末广为流行,并由此提出了类似的残差块或跳跃连接的概念[31]。残差块是跨通道连接的一种变体,它通过调节跨块的信息流来使学习更加平滑[104]-[106]。该思想被用于ResNet体系结构中,以训练150层的深度网络[31]。跨通道连接的思想又被Deluge、DenseNet等进一步扩展到多层连接,以改进表示[107]、[108]。

In the year 2016, the width of the network was also explored in connection with depth to improve feature learning [34], [35]. Apart from this, no new architectural modification became prominent but instead, different researchers used hybrids of the already proposed architectures to improve deep CNN performance [33], [104]–[106], [109], [110]. This fact gave the intuition that there might be other factors more important as compared to the appropriate assembly of the network units that can effectively regulate CNN performance. In this regard, Hu et al. (2017) identified that the network representation has a role in the learning of deep CNNs [111]. Hu et al. introduced the idea of feature map exploitation and pinpointed that less informative and domain extraneous features may affect the performance of the network to a larger extent. They exploited the aforementioned idea and proposed a new architecture named Squeeze and Excitation Network (SE-Network) [111]. It exploits feature map (commonly known as channel in literature) information by designing a specialized SE-block. This block assigns a weight to each feature map depending upon its contribution in class discrimination. This idea was further investigated by different researchers, who assign attention to important regions by exploiting both spatial and feature map (channel) information [37], [38], [112]. In 2018, a new idea of channel boosting was introduced by Khan et al. [36]. The motivation behind the training of a network with boosted channel representation was to use an enriched representation. This idea effectively boosts the performance of a CNN by learning diverse features as well as exploiting the already learnt features through the concept of TL.

2016年,人们还结合深度探索了网络的宽度,以改进特征学习[34],[35]。除此之外,没有新的架构改动变得突出;相反,不同的研究人员将已提出的架构进行混合,以改进深层CNN的性能[33]、[104]-[106]、[109]、[110]。这一事实带来一种直觉:与网络单元的恰当组装相比,可能还有其他更重要的因素能够有效调节CNN的性能。在这方面,Hu等人(2017)指出网络表示在深层CNN的学习中具有作用[111]。Hu等人提出了特征图开发(feature map exploitation)的思想,并指出信息量少、与领域无关的特征可能在更大程度上影响网络的性能。他们利用上述思想,提出了一种名为挤压-激励网络(Squeeze and Excitation Network, SE-Network)的新结构[111]。它通过设计一个专门的SE块来利用特征图(在文献中通常称为通道)信息。该块根据每个特征图在类别判别中的贡献为其分配权重。不同的研究者对这一思想进行了进一步的研究,他们通过同时利用空间和特征图(通道)信息,将注意力分配到重要区域[37]、[38]、[112]。2018年,Khan等人[36]提出了一种新的通道提升(channel boosting)思想。使用提升后的通道表示来训练网络,其动机在于使用更丰富的表示。这一思想通过学习多样化的特征,并借助迁移学习(TL)的概念利用已学习到的特征,有效地提高了CNN的性能。

From 2012 up till now, a lot of improvements have been reported in CNN architecture. As regards the architectural advancement of CNNs, recently the focus of research has been on designing of new blocks that can boost network representation by exploiting both feature maps and spatial information or by adding artificial channels.

从2012年到现在,CNN的架构已经有了很多改进。就CNN的架构进展而言,近年来的研究重点是设计新的块,通过同时利用特征图和空间信息,或添加人工通道,来增强网络的表示能力。
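注:为说明第3.5节所述SE块"按特征图(通道)的贡献分配权重"的思想,下面给出一个基于PyTorch的简化示例(原文中没有此代码,reduction等参数为假设值,仅作示意):

import torch
import torch.nn as nn

# Sketch of a Squeeze-and-Excitation (SE) style block: it "squeezes" each
# feature map (channel) into a single descriptor via global average pooling,
# then learns a per-channel weight that rescales the feature maps according
# to their contribution, as described for SE-Network in Section 3.5.
class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))      # channel-wise attention weights
        return x * w.view(b, c, 1, 1)                    # reweight each feature map

# Example: reweight 64 feature maps of a 32x32 activation tensor.
out = SEBlock(64)(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])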