ML with sklearn: An Explanation of the ShuffleSplit() and StratifiedShuffleSplit() Functions in the sklearn Library
The ShuffleSplit() and StratifiedShuffleSplit() functions in the sklearn library

from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

Both functions (each is a cross-validator class) shuffle a dataset and then split it: the samples are randomly permuted before every split is drawn. Without shuffling, data stored in a meaningful order (for example, sorted by class) can produce unrepresentative train/test sets, so a model may fit the training part well yet generalize poorly.

StratifiedShuffleSplit is a merge of StratifiedKFold and ShuffleSplit that returns stratified randomized folds. The folds are made by preserving the percentage of samples of each class.

The samples are first shuffled at random, and train/test pairs are then drawn according to the parameters. n_splits sets how many independent train/test pairs are generated; each pair is returned as two arrays of indices. With StratifiedShuffleSplit, every split keeps the class proportions the same, as closely as the sample counts allow. For example, if the classes in the first training set are in a 2:1 ratio, the training sets of all later splits satisfy that ratio too.

The ShuffleSplit() function

cv_split = ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2)

Here each of the 6 splits uses 70% of the samples for training and 20% for testing; the remaining 10% are left out of that split.

    class ShuffleSplit(BaseShuffleSplit):
        """Random permutation cross-validator

        Yields indices to split data into training and test sets.

        Note: contrary to other cross-validation strategies, random splits
        do not guarantee that all folds will be different, although this is
        still very likely for sizeable datasets.

        Read more in the :ref:`User Guide <cross_validation>`.

        Parameters
        ----------
        n_splits : int, default=10
            Number of re-shuffling & splitting iterations.

        test_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the proportion
            of the dataset to include in the test split. If int, represents the
            absolute number of test samples. If None, the value is set to the
            complement of the train size. If ``train_size`` is also None, it will
            be set to 0.1.

        train_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the
            proportion of the dataset to include in the train split. If
            int, represents the absolute number of train samples. If None,
            the value is automatically set to the complement of the test size.

        random_state : int or RandomState instance, default=None
            Controls the randomness of the training and testing indices produced.
            Pass an int for reproducible output across multiple function calls.
            See :term:`Glossary <random_state>`.

        Examples
        --------
        >>> import numpy as np
        >>> from sklearn.model_selection import ShuffleSplit
        >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
        >>> y = np.array([1, 2, 1, 2, 1, 2])
        >>> rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
        >>> rs.get_n_splits(X)
        5
        >>> print(rs)
        ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
        >>> for train_index, test_index in rs.split(X):
        ...     print("TRAIN:", train_index, "TEST:", test_index)
        TRAIN: [1 3 0 4] TEST: [5 2]
        TRAIN: [4 0 2 5] TEST: [1 3]
        TRAIN: [1 2 4 0] TEST: [3 5]
        TRAIN: [3 4 1 0] TEST: [5 2]
        TRAIN: [3 5 1 0] TEST: [2 4]
        >>> rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=.25, random_state=0)
        >>> for train_index, test_index in rs.split(X):
        ...     print("TRAIN:", train_index, "TEST:", test_index)
        TRAIN: [1 3 0] TEST: [5 2]
        TRAIN: [4 0 2] TEST: [1 3]
        TRAIN: [1 2 4] TEST: [3 5]
        TRAIN: [3 4 1] TEST: [5 2]
        TRAIN: [3 5 1] TEST: [2 4]
        """
        @_deprecate_positional_args
        def __init__(self, n_splits=10, *, test_size=None, train_size=None,
                     random_state=None):
            super().__init__(
                n_splits=n_splits,
                test_size=test_size,
                train_size=train_size,
                random_state=random_state)
            self._default_test_size = 0.1

        def _iter_indices(self, X, y=None, groups=None):
            n_samples = _num_samples(X)
            n_train, n_test = _validate_shuffle_split(
                n_samples, self.test_size, self.train_size,
                default_test_size=self._default_test_size)

            rng = check_random_state(self.random_state)
            for i in range(self.n_splits):
                # random partition
                permutation = rng.permutation(n_samples)
                ind_test = permutation[:n_test]
                ind_train = permutation[n_test:n_test + n_train]
                yield ind_train, ind_test

Notes on the parameters:

n_splits: the number of re-shuffling and splitting iterations, that is, how many train/test pairs the data is split into.
test_size: sets the share of each train/test pair that goes to the test set. If None, it is the complement of train_size; if train_size is also None, it is set to 0.1.
train_size: sets the share of each train/test pair that goes to the training set. If None, it is the complement of test_size.
random_state: controls how the samples are shuffled; it is the state of the pseudo-random number generator used for the random sampling. Pass an int to get reproducible output across multiple calls.

The StratifiedShuffleSplit() function

StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None)

    class StratifiedShuffleSplit(BaseShuffleSplit):
        """Stratified Shuffle Split cross-validator

        Provides train/test indices to split data in train/test sets.

        This cross-validation object is a merge of StratifiedKFold and
        ShuffleSplit, which returns stratified randomized folds. The folds
        are made by preserving the percentage of samples for each class.

        Note: like the ShuffleSplit strategy, stratified random splits
        do not guarantee that all folds will be different, although this is
        still very likely for sizeable datasets.

        Read more in the :ref:`User Guide <cross_validation>`.

        Parameters
        ----------
        n_splits : int, default=10
            Number of re-shuffling & splitting iterations.

        test_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the proportion
            of the dataset to include in the test split. If int, represents the
            absolute number of test samples. If None, the value is set to the
            complement of the train size.
            If ``train_size`` is also None, it will be set to 0.1.

        train_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the
            proportion of the dataset to include in the train split. If
            int, represents the absolute number of train samples. If None,
            the value is automatically set to the complement of the test size.

        random_state : int or RandomState instance, default=None
            Controls the randomness of the training and testing indices produced.
            Pass an int for reproducible output across multiple function calls.
            See :term:`Glossary <random_state>`.

        Examples
        --------
        >>> import numpy as np
        >>> from sklearn.model_selection import StratifiedShuffleSplit
        >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
        >>> y = np.array([0, 0, 0, 1, 1, 1])
        >>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
        >>> sss.get_n_splits(X, y)
        5
        >>> print(sss)
        StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
        >>> for train_index, test_index in sss.split(X, y):
        ...     print("TRAIN:", train_index, "TEST:", test_index)
        ...     X_train, X_test = X[train_index], X[test_index]
        ...     y_train, y_test = y[train_index], y[test_index]
        TRAIN: [5 2 3] TEST: [4 1 0]
        TRAIN: [5 1 4] TEST: [0 2 3]
        TRAIN: [5 0 2] TEST: [4 3 1]
        TRAIN: [4 1 0] TEST: [2 3 5]
        TRAIN: [0 5 1] TEST: [3 4 2]
        """
        @_deprecate_positional_args
        def __init__(self, n_splits=10, *, test_size=None, train_size=None,
                     random_state=None):
            super().__init__(
                n_splits=n_splits,
                test_size=test_size,
                train_size=train_size,
                random_state=random_state)
            self._default_test_size = 0.1

        def _iter_indices(self, X, y, groups=None):
            n_samples = _num_samples(X)
            y = check_array(y, ensure_2d=False, dtype=None)
            n_train, n_test = _validate_shuffle_split(
                n_samples, self.test_size, self.train_size,
                default_test_size=self._default_test_size)

            if y.ndim == 2:
                # for multi-label y, map each distinct row to a string repr
                # using join because str(row) uses an ellipsis if len(row) > 1000
                y = np.array([' '.join(row.astype('str')) for row in y])

            classes, y_indices = np.unique(y, return_inverse=True)
            n_classes = classes.shape[0]

            class_counts = np.bincount(y_indices)
            if np.min(class_counts) < 2:
                raise ValueError("The least populated class in y has only 1"
                                 " member, which is too few. The minimum"
                                 " number of groups for any class cannot"
                                 " be less than 2.")

            if n_train < n_classes:
                raise ValueError('The train_size = %d should be greater or '
                                 'equal to the number of classes = %d' %
                                 (n_train, n_classes))
            if n_test < n_classes:
                raise ValueError('The test_size = %d should be greater or '
                                 'equal to the number of classes = %d' %
                                 (n_test, n_classes))

            # Find the sorted list of instances for each class:
            # (np.unique above performs a sort, so code is O(n logn) already)
            class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                                     np.cumsum(class_counts)[:-1])

            rng = check_random_state(self.random_state)

            for _ in range(self.n_splits):
                # if there are ties in the class-counts, we want
                # to make sure to break them anew in each iteration
                n_i = _approximate_mode(class_counts, n_train, rng)
                class_counts_remaining = class_counts - n_i
                t_i = _approximate_mode(class_counts_remaining, n_test, rng)

                train = []
                test = []

                for i in range(n_classes):
                    permutation = rng.permutation(class_counts[i])
                    perm_indices_class_i = class_indices[i].take(permutation,
                                                                 mode='clip')

                    train.extend(perm_indices_class_i[:n_i[i]])
                    test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])

                train = rng.permutation(train)
                test = rng.permutation(test)

                yield train, test

        def split(self, X, y, groups=None):
            """Generate indices to split data into training and test set.

            Parameters
            ----------
            X : array-like of shape (n_samples, n_features)
                Training data, where n_samples is the number of samples
                and n_features is the number of features.

                Note that providing ``y`` is sufficient to generate the splits and
                hence ``np.zeros(n_samples)`` may be used as a placeholder for
                ``X`` instead of actual training data.

            y : array-like of shape (n_samples,) or (n_samples, n_labels)
                The target variable for supervised learning problems.
                Stratification is done based on the y labels.

            groups : object
                Always ignored, exists for compatibility.

            Yields
            ------
            train : ndarray
                The training set indices for that split.

            test : ndarray
                The testing set indices for that split.

            Notes
            -----
            Randomized CV splitters may return different results for each call of
            split. You can make the results identical by setting `random_state`
            to an integer.
            """
            y = check_array(y, ensure_2d=False, dtype=None)
            return super().split(X, y, groups)
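The difference between the two splitters is easiest to see by counting the classes in each split. The following is a minimal sketch and not part of the original article: the toy labels (20 samples of class 0 and 10 of class 1, a 2:1 ratio) and the helper name class_counts_per_split are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical toy data: 30 samples, classes 0 and 1 in a 2:1 ratio.
y = np.array([0] * 20 + [1] * 10)
# Placeholder features; stratification only looks at y.
X = np.zeros((len(y), 1))


def class_counts_per_split(splitter, X, y):
    """Return a list of (train_counts, test_counts), one entry per split.

    Each counts array holds the number of samples of class 0 and class 1.
    """
    results = []
    for train_idx, test_idx in splitter.split(X, y):
        train_counts = np.bincount(y[train_idx], minlength=2)
        test_counts = np.bincount(y[test_idx], minlength=2)
        results.append((train_counts, test_counts))
    return results


# Same settings for both splitters: 5 splits, half of the data for testing.
plain = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
stratified = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

for name, splitter in [("ShuffleSplit", plain),
                       ("StratifiedShuffleSplit", stratified)]:
    print(name)
    for train_counts, test_counts in class_counts_per_split(splitter, X, y):
        print("  train class counts:", train_counts,
              "| test class counts:", test_counts)
```

With these labels, every StratifiedShuffleSplit split should hold 10 samples of class 0 and 5 of class 1 in both the training and the test set, so the 2:1 ratio is kept. ShuffleSplit only fixes the sizes of the two sets (15 and 15 here), so its class counts typically vary from split to split.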