ML — DT: Binary classification on the Titanic dataset with a decision tree (DT), comparing performance with and without feature selection (FS)
Output Results

X_train after initial preprocessing: (984, 474)
  (0, 0)    31.19418104265403
  (0, 78)   1.0
  (0, 82)   1.0
  (0, 366)  1.0
  (0, 391)  1.0
  (0, 435)  1.0
  (0, 437)  1.0
  (0, 473)  1.0
  (1, 0)    31.19418104265403
  (1, 73)   1.0
  (1, 79)   1.0
  (1, 296)  1.0
  (1, 389)  1.0
  (1, 397)  1.0
  (1, 436)  1.0
  (1, 446)  1.0
  (2, 0)    31.19418104265403
  (2, 78)   1.0
  (2, 82)   1.0
  (2, 366)  1.0
  (2, 391)  1.0
  (2, 435)  1.0
  (2, 437)  1.0
  (2, 473)  1.0
  (3, 0)    32.0
  :         :
  (980, 473) 1.0
  (981, 0)  12.0
  (981, 73) 1.0
  (981, 81) 1.0
  (981, 84) 1.0
  (981, 390) 1.0
  (981, 435) 1.0
  (981, 436) 1.0
  (981, 473) 1.0
  (982, 0)  18.0
  (982, 78) 1.0
  (982, 81) 1.0
  (982, 277) 1.0
  (982, 390) 1.0
  (982, 435) 1.0
  (982, 437) 1.0
  (982, 473) 1.0
  (983, 0)  31.19418104265403
  (983, 78) 1.0
  (983, 82) 1.0
  (983, 366) 1.0
  (983, 391) 1.0
  (983, 435) 1.0
  (983, 436) 1.0
  (983, 473) 1.0

X_train_fs after FS processing: (984, 94)
  (0, 93)   1.0
  (0, 85)   1.0
  (0, 83)   1.0
  (0, 76)   1.0
  (0, 71)   1.0
  (0, 27)   1.0
  (0, 24)   1.0
  (0, 0)    31.19418104265403
  (1, 84)   1.0
  (1, 74)   1.0
  (1, 63)   1.0
  (1, 25)   1.0
  (1, 19)   1.0
  (1, 0)    31.19418104265403
  (2, 93)   1.0
  (2, 85)   1.0
  (2, 83)   1.0
  (2, 76)   1.0
  (2, 71)   1.0
  (2, 27)   1.0
  (2, 24)   1.0
  (2, 0)    31.19418104265403
  (3, 93)   1.0
  (3, 85)   1.0
  (3, 83)   1.0
  :         :
  (980, 24) 1.0
  (980, 0)  31.19418104265403
  (981, 93) 1.0
  (981, 84) 1.0
  (981, 83) 1.0
  (981, 75) 1.0
  (981, 28) 1.0
  (981, 26) 1.0
  (981, 19) 1.0
  (981, 0)  12.0
  (982, 93) 1.0
  (982, 85) 1.0
  (982, 83) 1.0
  (982, 75) 1.0
  (982, 26) 1.0
  (982, 24) 1.0
  (982, 0)  18.0
  (983, 93) 1.0
  (983, 84) 1.0
  (983, 83) 1.0
  (983, 76) 1.0
  (983, 71) 1.0
  (983, 27) 1.0
  (983, 24) 1.0
  (983, 0)  31.19418104265403
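For reference, the shapes printed above can be reproduced with a standard Titanic preprocessing pipeline. The sketch below is a minimal, illustrative version, not code taken from this post: the local file name titanic.txt, the dropped columns, and random_state=33 are assumptions, chosen to be consistent with the commonly used 1313-row Vanderbilt Titanic dump, which splits into the 984 training samples seen in the printout.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Hypothetical local copy of the Titanic data (1313 rows assumed).
titanic = pd.read_csv('titanic.txt')

y = titanic['survived']
X = titanic.drop(['row.names', 'name', 'survived'], axis=1)

# Impute missing ages with the column mean -- this is where the
# recurring value 31.19418104265403 in the printout comes from.
X['age'].fillna(X['age'].mean(), inplace=True)
X.fillna('UNKNOWN', inplace=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# One-hot encode the categorical columns; each record becomes a sparse row.
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
print('X_train after initial preprocessing:', X_train.shape)  # (984, 474)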
Design Approach
Core Code

# class SelectPercentile, found at: sklearn.feature_selection.univariate_selection

class SelectPercentile(_BaseFilter):
    """Select features according to a percentile of the highest scores.

    Read more in the :ref:`User Guide <univariate_feature_selection>`.

    Parameters
    ----------
    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues) or a single array with scores.
        Default is f_classif (see below "See also"). The default function
        only works with classification tasks.

    percentile : int, optional, default=10
        Percent of features to keep.

    Attributes
    ----------
    scores_ : array-like, shape=(n_features,)
        Scores of features.

    pvalues_ : array-like, shape=(n_features,)
        p-values of feature scores, None if `score_func` returned only scores.

    Notes
    -----
    Ties between features with equal scores will be broken in an
    unspecified way.

    See also
    --------
    f_classif: ANOVA F-value between label/feature for classification tasks.
    mutual_info_classif: Mutual information for a discrete target.
    chi2: Chi-squared stats of non-negative features for classification tasks.
    f_regression: F-value between label/feature for regression tasks.
    mutual_info_regression: Mutual information for a continuous target.
    SelectKBest: Select features based on the k highest scores.
    SelectFpr: Select features based on a false positive rate test.
    SelectFdr: Select features based on an estimated false discovery rate.
    SelectFwe: Select features based on family-wise error rate.
    GenericUnivariateSelect: Univariate feature selector with configurable
        mode.
    """

    def __init__(self, score_func=f_classif, percentile=10):
        super(SelectPercentile, self).__init__(score_func)
        self.percentile = percentile

    def _check_params(self, X, y):
        if not 0 <= self.percentile <= 100:
            raise ValueError(
                "percentile should be >=0, <=100; got %r" % self.percentile)

    def _get_support_mask(self):
        check_is_fitted(self, 'scores_')

        # Cater for NaNs
        if self.percentile == 100:
            return np.ones(len(self.scores_), dtype=np.bool)
        elif self.percentile == 0:
            return np.zeros(len(self.scores_), dtype=np.bool)

        scores = _clean_nans(self.scores_)
        threshold = stats.scoreatpercentile(scores, 100 - self.percentile)
        # Keep everything strictly above the percentile threshold...
        mask = scores > threshold
        # ...then break ties at the threshold so that exactly
        # percentile% of the features end up selected.
        ties = np.where(scores == threshold)[0]
        if len(ties):
            max_feats = int(len(scores) * self.percentile / 100)
            kept_ties = ties[:max_feats - mask.sum()]
            mask[kept_ties] = True
        return mask
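To connect this excerpt back to the comparison in the title, here is a hedged sketch of how SelectPercentile is typically applied in this pipeline. percentile=20 keeps the top 20% of the 474 features, i.e. the 94 columns shown in the output above; using chi2 as the score function is an assumption (it suits the non-negative one-hot features), as are the entropy criterion and 5-fold cross-validation.

from sklearn import feature_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Baseline: decision tree on all 474 features.
dt = DecisionTreeClassifier(criterion='entropy')
print(cross_val_score(dt, X_train, y_train, cv=5).mean())

# Keep the top 20% of features ranked by the chi2 statistic.
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
print('X_train_fs after FS processing:', X_train_fs.shape)  # (984, 94)

# Same tree on the reduced feature set, for the with/without-FS comparison.
print(cross_val_score(dt, X_train_fs, y_train, cv=5).mean())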