ML — DT: Binary classification on the Titanic dataset with a decision tree (DT), comparing results with and without feature selection (FS)

Output

X_train after initial preprocessing (984 samples, 474 features), printed as a scipy sparse matrix:

  (984, 474)
  (0, 0)	31.19418104265403
  (0, 78)	1.0
  (0, 82)	1.0
  (0, 366)	1.0
  (0, 391)	1.0
  (0, 435)	1.0
  (0, 437)	1.0
  (0, 473)	1.0
  (1, 0)	31.19418104265403
  (1, 73)	1.0
  (1, 79)	1.0
  (1, 296)	1.0
  (1, 389)	1.0
  (1, 397)	1.0
  (1, 436)	1.0
  (1, 446)	1.0
  (2, 0)	31.19418104265403
  (2, 78)	1.0
  (2, 82)	1.0
  (2, 366)	1.0
  (2, 391)	1.0
  (2, 435)	1.0
  (2, 437)	1.0
  (2, 473)	1.0
  (3, 0)	32.0
  :	:
  (980, 473)	1.0
  (981, 0)	12.0
  (981, 73)	1.0
  (981, 81)	1.0
  (981, 84)	1.0
  (981, 390)	1.0
  (981, 435)	1.0
  (981, 436)	1.0
  (981, 473)	1.0
  (982, 0)	18.0
  (982, 78)	1.0
  (982, 81)	1.0
  (982, 277)	1.0
  (982, 390)	1.0
  (982, 435)	1.0
  (982, 437)	1.0
  (982, 473)	1.0
  (983, 0)	31.19418104265403
  (983, 78)	1.0
  (983, 82)	1.0
  (983, 366)	1.0
  (983, 391)	1.0
  (983, 435)	1.0
  (983, 436)	1.0
  (983, 473)	1.0

X_train_fs after feature selection (FS), reduced to 94 features (984, 94):

  (0, 93)	1.0
  (0, 85)	1.0
  (0, 83)	1.0
  (0, 76)	1.0
  (0, 71)	1.0
  (0, 27)	1.0
  (0, 24)	1.0
  (0, 0)	31.19418104265403
  (1, 84)	1.0
  (1, 74)	1.0
  (1, 63)	1.0
  (1, 25)	1.0
  (1, 19)	1.0
  (1, 0)	31.19418104265403
  (2, 93)	1.0
  (2, 85)	1.0
  (2, 83)	1.0
  (2, 76)	1.0
  (2, 71)	1.0
  (2, 27)	1.0
  (2, 24)	1.0
  (2, 0)	31.19418104265403
  (3, 93)	1.0
  (3, 85)	1.0
  (3, 83)	1.0
  :	:
  (980, 24)	1.0
  (980, 0)	31.19418104265403
  (981, 93)	1.0
  (981, 84)	1.0
  (981, 83)	1.0
  (981, 75)	1.0
  (981, 28)	1.0
  (981, 26)	1.0
  (981, 19)	1.0
  (981, 0)	12.0
  (982, 93)	1.0
  (982, 85)	1.0
  (982, 83)	1.0
  (982, 75)	1.0
  (982, 26)	1.0
  (982, 24)	1.0
  (982, 0)	18.0
  (983, 93)	1.0
  (983, 84)	1.0
  (983, 83)	1.0
  (983, 76)	1.0
  (983, 71)	1.0
  (983, 27)	1.0
  (983, 24)	1.0
  (983, 0)	31.19418104265403

Design approach

Core code

The feature selector used here is sklearn's `SelectPercentile` (found at `sklearn.feature_selection.univariate_selection`):

```python
class SelectPercentile(_BaseFilter):
    """Select features according to a percentile of the highest scores.

    Read more in the :ref:`User Guide <univariate_feature_selection>`.

    Parameters
    ----------
    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues) or a single array with scores.
        Default is f_classif (see below "See also"). The default function
        only works with classification tasks.

    percentile : int, optional, default=10
        Percent of features to keep.

    Attributes
    ----------
    scores_ : array-like, shape=(n_features,)
        Scores of features.

    pvalues_ : array-like, shape=(n_features,)
        p-values of feature scores, None if `score_func` returned only scores.

    Notes
    -----
    Ties between features with equal scores will be broken in an
    unspecified way.

    See also
    --------
    f_classif: ANOVA F-value between label/feature for classification tasks.
    mutual_info_classif: Mutual information for a discrete target.
    chi2: Chi-squared stats of non-negative features for classification tasks.
    f_regression: F-value between label/feature for regression tasks.
    mutual_info_regression: Mutual information for a continuous target.
    SelectKBest: Select features based on the k highest scores.
    SelectFpr: Select features based on a false positive rate test.
    SelectFdr: Select features based on an estimated false discovery rate.
    SelectFwe: Select features based on family-wise error rate.
    GenericUnivariateSelect: Univariate feature selector with configurable
        mode.
    """

    def __init__(self, score_func=f_classif, percentile=10):
        super(SelectPercentile, self).__init__(score_func)
        self.percentile = percentile

    def _check_params(self, X, y):
        if not 0 <= self.percentile <= 100:
            raise ValueError(
                "percentile should be >=0, <=100; got %r" % self.percentile)

    def _get_support_mask(self):
        check_is_fitted(self, 'scores_')
        # Cater for NaNs
        if self.percentile == 100:
            return np.ones(len(self.scores_), dtype=np.bool)
        elif self.percentile == 0:
            return np.zeros(len(self.scores_), dtype=np.bool)
        scores = _clean_nans(self.scores_)
        treshold = stats.scoreatpercentile(scores, 100 - self.percentile)
        mask = scores > treshold
        ties = np.where(scores == treshold)[0]
        if len(ties):
            max_feats = int(len(scores) * self.percentile / 100)
            kept_ties = ties[:max_feats - mask.sum()]
            mask[kept_ties] = True
        return mask
```
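To show how `SelectPercentile` fits into the comparison the title describes, here is a hedged end-to-end sketch: it trains one decision tree on all features and one on the selected subset. The dump above shows 94 of 474 features kept, i.e. roughly `percentile=20`; the synthetic data below is a stand-in for the vectorized Titanic matrix, so the exact scores and kept-feature count will differ from the original run:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized Titanic matrix (hypothetical data)
X, y = make_classification(n_samples=984, n_features=474,
                           n_informative=20, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Baseline: decision tree on all 474 features
dt = DecisionTreeClassifier(criterion="entropy", random_state=33)
dt.fit(X_train, y_train)
print("no FS  accuracy:", dt.score(X_test, y_test))

# Keep the top-scoring ~20% of features by ANOVA F-score; fit the
# selector on the training split only, then transform both splits.
fs = SelectPercentile(f_classif, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
X_test_fs = fs.transform(X_test)
print("shape after FS:", X_train_fs.shape)

dt_fs = DecisionTreeClassifier(criterion="entropy", random_state=33)
dt_fs.fit(X_train_fs, y_train)
print("with FS accuracy:", dt_fs.score(X_test_fs, y_test))
```

Note the design choice: the selector is fit on the training split only, so the test split never leaks into the feature scores.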

