sklearn: Standardization, Scaling and Normalization in sklearn.preprocessing, an overview and usage guide
Standardization & Scaling, Normalization overview

Reference: https://scikit-learn.org/stable/modules/preprocessing.html

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on a dataset containing marginal outliers are highlighted in "Compare the effect of different scalers on data with outliers".

1、Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance. In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly as expected.

The function scale provides a quick and easy way to perform this operation on a single array-like dataset:

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)
print(X_scaled)

Scaled data has zero mean and unit variance:

X_scaled.mean(axis=0)
X_scaled.std(axis=0)

The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline.
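As an illustration of that pipeline usage, here is a minimal sketch; the LogisticRegression estimator and the y_train labels are assumed placeholders for this example, not part of the scikit-learn documentation snippet:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy training data as above; y_train is a made-up label vector
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
y_train = [0, 1, 0]

pipe = Pipeline([
    ('scaler', StandardScaler()),   # learns mean_ and scale_ from X_train only
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train)

# At prediction time the mean/std learned from X_train are reapplied automatically
print(pipe.predict([[-1., 1., 0.]]))

Outside a pipeline, the scaler can also be fitted and inspected directly: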
scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler)
print(scaler.mean_)
print(scaler.scale_)
print(scaler.transform(X_train))

The scaler instance can then be used on new data to transform it the same way it did on the training set:

X_test = [[-1., 1., 0.]]
scaler.transform(X_test)

It is possible to disable either centering or scaling by passing with_mean=False or with_std=False to the constructor of StandardScaler.

1.1、Scaling features to a range

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively. The motivation to use this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.

1.2、Scaling sparse data

Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this. However, scale and StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised, as silently centering would break the sparsity and would often crash the execution by unintentionally allocating excessive amounts of memory. RobustScaler cannot be fitted to sparse inputs, but you can use its transform method on sparse inputs.

Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns formats (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input will be converted to the Compressed Sparse Rows representation. To avoid unnecessary memory copies, it is recommended to choose the CSR or CSC representation upstream. Finally, if the centered data is expected to be small enough, explicitly converting the input to an array using the toarray method of sparse matrices is another option.

1.3、Scaling data with outliers

If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.
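As a minimal sketch of such a drop-in replacement (the toy matrix below, including the outlier row, is made up for illustration), RobustScaler is fitted and applied just like StandardScaler; by default it centers on the median and scales by the interquartile range, so a single extreme value has little influence:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[   1., -2.,  2.],
              [  -2.,  1.,  3.],
              [   4.,  1., -2.],
              [1000.,  1.,  2.]])   # the last row contains an extreme outlier

robust = RobustScaler().fit(X)       # median / IQR based
print(robust.transform(X))

standard = StandardScaler().fit(X)   # mean / std based, dragged by the outlier
print(standard.transform(X))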
References: further discussion on the importance of centering and scaling data is available in this FAQ: Should I normalize/standardize/rescale the data?

1.4、Scaling vs Whitening

It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumptions on the linear independence of the features. To address this issue you can use sklearn.decomposition.PCA with whiten=True to further remove the linear correlation across features.

1.5、Centering kernel matrices

If you have a kernel matrix of a kernel K that computes a dot product in the feature space defined by a function ϕ, a KernelCenterer can transform the kernel matrix so that it contains the inner products in the feature space defined by ϕ followed by the removal of the mean in that space.

2、Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the l1 or l2 norms:

X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
print(X_normalized)

The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API (even though the fit method is useless in this case: the class is stateless as this operation treats samples independently). This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:

normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
print(normalizer)

The normalizer instance can then be used on sample vectors as any transformer:

normalizer.transform(X)
normalizer.transform([[-1., 1., 0.]])

Note: L2 normalization is also known as spatial sign preprocessing.

Sparse input

normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input. For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
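As a small illustrative sketch (the sparse matrix below is made up), the same normalize call can be used directly on a scipy.sparse matrix and returns a sparse CSR result with the zero entries preserved:

import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

X_sparse = sparse.csr_matrix(np.array([[ 1.,  0.,  2.],
                                       [ 0.,  3.,  0.],
                                       [ 0.,  1., -1.]]))

# With norm='l1' the absolute values in each row sum to 1 after scaling
X_norm = normalize(X_sparse, norm='l1')
print(type(X_norm))
print(X_norm.toarray())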