ML — DT: Binary classification on the Titanic dataset with a decision tree (DT), comparing results with and without feature selection (FS)

Output

X_train after initial preprocessing (984 samples, 474 features), printed as a scipy sparse matrix:

  (984, 474)
  (0, 0)	31.19418104265403
  (0, 78)	1.0
  (0, 82)	1.0
  (0, 366)	1.0
  (0, 391)	1.0
  (0, 435)	1.0
  (0, 437)	1.0
  (0, 473)	1.0
  (1, 0)	31.19418104265403
  (1, 73)	1.0
  (1, 79)	1.0
  (1, 296)	1.0
  (1, 389)	1.0
  (1, 397)	1.0
  (1, 436)	1.0
  (1, 446)	1.0
  (2, 0)	31.19418104265403
  (2, 78)	1.0
  (2, 82)	1.0
  (2, 366)	1.0
  (2, 391)	1.0
  (2, 435)	1.0
  (2, 437)	1.0
  (2, 473)	1.0
  (3, 0)	32.0
  :	:
  (980, 473)	1.0
  (981, 0)	12.0
  (981, 73)	1.0
  (981, 81)	1.0
  (981, 84)	1.0
  (981, 390)	1.0
  (981, 435)	1.0
  (981, 436)	1.0
  (981, 473)	1.0
  (982, 0)	18.0
  (982, 78)	1.0
  (982, 81)	1.0
  (982, 277)	1.0
  (982, 390)	1.0
  (982, 435)	1.0
  (982, 437)	1.0
  (982, 473)	1.0
  (983, 0)	31.19418104265403
  (983, 78)	1.0
  (983, 82)	1.0
  (983, 366)	1.0
  (983, 391)	1.0
  (983, 435)	1.0
  (983, 436)	1.0
  (983, 473)	1.0

X_train_fs after feature selection (FS), reduced to 94 features (984, 94):

  (0, 93)	1.0
  (0, 85)	1.0
  (0, 83)	1.0
  (0, 76)	1.0
  (0, 71)	1.0
  (0, 27)	1.0
  (0, 24)	1.0
  (0, 0)	31.19418104265403
  (1, 84)	1.0
  (1, 74)	1.0
  (1, 63)	1.0
  (1, 25)	1.0
  (1, 19)	1.0
  (1, 0)	31.19418104265403
  (2, 93)	1.0
  (2, 85)	1.0
  (2, 83)	1.0
  (2, 76)	1.0
  (2, 71)	1.0
  (2, 27)	1.0
  (2, 24)	1.0
  (2, 0)	31.19418104265403
  (3, 93)	1.0
  (3, 85)	1.0
  (3, 83)	1.0
  :	:
  (980, 24)	1.0
  (980, 0)	31.19418104265403
  (981, 93)	1.0
  (981, 84)	1.0
  (981, 83)	1.0
  (981, 75)	1.0
  (981, 28)	1.0
  (981, 26)	1.0
  (981, 19)	1.0
  (981, 0)	12.0
  (982, 93)	1.0
  (982, 85)	1.0
  (982, 83)	1.0
  (982, 75)	1.0
  (982, 26)	1.0
  (982, 24)	1.0
  (982, 0)	18.0
  (983, 93)	1.0
  (983, 84)	1.0
  (983, 83)	1.0
  (983, 76)	1.0
  (983, 71)	1.0
  (983, 27)	1.0
  (983, 24)	1.0
  (983, 0)	31.19418104265403

Design approach

Core code

The feature selector used here is sklearn's `SelectPercentile` (found at `sklearn.feature_selection.univariate_selection`):

```python
class SelectPercentile(_BaseFilter):
    """Select features according to a percentile of the highest scores.

    Read more in the :ref:`User Guide <univariate_feature_selection>`.

    Parameters
    ----------
    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues) or a single array with scores.
        Default is f_classif (see below "See also"). The default function
        only works with classification tasks.

    percentile : int, optional, default=10
        Percent of features to keep.

    Attributes
    ----------
    scores_ : array-like, shape=(n_features,)
        Scores of features.

    pvalues_ : array-like, shape=(n_features,)
        p-values of feature scores, None if `score_func` returned only scores.

    Notes
    -----
    Ties between features with equal scores will be broken in an
    unspecified way.

    See also
    --------
    f_classif: ANOVA F-value between label/feature for classification tasks.
    mutual_info_classif: Mutual information for a discrete target.
    chi2: Chi-squared stats of non-negative features for classification tasks.
    f_regression: F-value between label/feature for regression tasks.
    mutual_info_regression: Mutual information for a continuous target.
    SelectKBest: Select features based on the k highest scores.
    SelectFpr: Select features based on a false positive rate test.
    SelectFdr: Select features based on an estimated false discovery rate.
    SelectFwe: Select features based on family-wise error rate.
    GenericUnivariateSelect: Univariate feature selector with configurable
        mode.
    """

    def __init__(self, score_func=f_classif, percentile=10):
        super(SelectPercentile, self).__init__(score_func)
        self.percentile = percentile

    def _check_params(self, X, y):
        if not 0 <= self.percentile <= 100:
            raise ValueError(
                "percentile should be >=0, <=100; got %r" % self.percentile)

    def _get_support_mask(self):
        check_is_fitted(self, 'scores_')
        # Cater for NaNs
        if self.percentile == 100:
            return np.ones(len(self.scores_), dtype=np.bool)
        elif self.percentile == 0:
            return np.zeros(len(self.scores_), dtype=np.bool)
        scores = _clean_nans(self.scores_)
        treshold = stats.scoreatpercentile(scores, 100 - self.percentile)
        mask = scores > treshold
        ties = np.where(scores == treshold)[0]
        if len(ties):
            max_feats = int(len(scores) * self.percentile / 100)
            kept_ties = ties[:max_feats - mask.sum()]
            mask[kept_ties] = True
        return mask
```
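To show how `SelectPercentile` fits into the comparison the title describes, here is a hedged end-to-end sketch: it trains one decision tree on all features and one on the selected subset. The dump above shows 94 of 474 features kept, i.e. roughly `percentile=20`; the synthetic data below is a stand-in for the vectorized Titanic matrix, so the exact scores and kept-feature count will differ from the original run:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized Titanic matrix (hypothetical data)
X, y = make_classification(n_samples=984, n_features=474,
                           n_informative=20, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Baseline: decision tree on all 474 features
dt = DecisionTreeClassifier(criterion="entropy", random_state=33)
dt.fit(X_train, y_train)
print("no FS  accuracy:", dt.score(X_test, y_test))

# Keep the top-scoring ~20% of features by ANOVA F-score; fit the
# selector on the training split only, then transform both splits.
fs = SelectPercentile(f_classif, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
X_test_fs = fs.transform(X_test)
print("shape after FS:", X_train_fs.shape)

dt_fs = DecisionTreeClassifier(criterion="entropy", random_state=33)
dt_fs.fit(X_train_fs, y_train)
print("with FS accuracy:", dt_fs.score(X_test_fs, y_test))
```

Note the design choice: the selector is fit on the training split only, so the test split never leaks into the feature scores.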

