ML/FE: Preprocessing the RentListingInquries dataset with feature engineering (FE) and exporting it in three file formats (CSV, TXT, and libsvm sparse TXT)
Output results
1.1. RentListingInquries_FE_train.csv
1.2. RentListingInquries_FE_test.csv
2.1. RentListingInquries_FE_train.txt
2.2. RentListingInquries_FE_test.txt
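Files like the CSV and TXT outputs listed above can be written with pandas. The following is only a minimal sketch on a made-up two-row table: the column values, the temporary output directory, and the tab delimiter for the TXT file are all assumptions, since the post does not show how these two formats were written.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the engineered feature table (values invented for illustration).
train = pd.DataFrame({
    "bathrooms": [1.5, 1.0],
    "bedrooms": [3, 2],
    "price": [3000, 5465],
    "interest_level": [1, 2],
})

out_dir = tempfile.mkdtemp()  # hypothetical output directory
csv_path = os.path.join(out_dir, "RentListingInquries_FE_train.csv")
txt_path = os.path.join(out_dir, "RentListingInquries_FE_train.txt")

# CSV: comma-separated, without the pandas row index.
train.to_csv(csv_path, index=False)
# TXT: same table; the tab delimiter is an assumption, not taken from the post.
train.to_csv(txt_path, index=False, sep="\t")

# Read the TXT file back to check that the round trip preserves the table.
restored = pd.read_csv(txt_path, sep="\t")
```

Reading the file back with the same delimiter is a cheap way to confirm that no column was shifted or dropped during export.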
Code output

y_train after initial processing:
10 1
10000 2
100004 0
100007 2
100013 2
100014 1
100016 2
100020 2
100026 1
100027 2
100030 2
10004 2
100044 0
100048 2
10005 2
100051 1
100052 2
100053 2
100055 2
100058 2
100062 2
100063 1
100065 2
100066 2
10007 1
100071 2
100075 1
100076 2
100079 0
100081 2
..
99956 2
99960 1
99961 2
99964 1
99965 2
99966 2
99979 2
99980 2
99982 0
99984 2
99986 2
99987 2
99988 1
9999 1
99991 2
99992 2
99993 2
99994 2
Name: interest_level, Length: 49352, dtype: int64

train_test_sparse after final processing:
(0, 0) 1.5
(0, 1) 3.0
(0, 2) 40.7145
(0, 3) -73.9425
(0, 4) 3000.0
(0, 5) 1200.0
(0, 6) 750.0
(0, 7) -1.5
(0, 8) 4.5
(0, 9) 2016.0
(0, 10) 6.0
(0, 11) 24.0
(0, 12) 4.0
(0, 13) 176.0
(0, 14) 7.0
(0, 15) 95.0
(0, 17) 1.0
(0, 18) 1.0
(0, 19) 1.0
(0, 20) 1.0
(0, 21) 1.0
(0, 22) 1.0
(0, 23) 1.0
(0, 24) 1.0
(0, 32) 1.0
: :
(124010, 29) 1.0
(124010, 30) 1.0
(124010, 31) 1.0
(124010, 32) 1.0
(124010, 33) 1.0
(124010, 34) 2.0
(124010, 35) 12.0
(124010, 36) 0.04446034405901145
(124010, 37) 0.010558720013101165
(124010, 38) 0.030099750139483926
(124010, 39) 0.9593415298474148
(124010, 40) 0.21662478672029833
(124010, 41) 0.0020547768050611895
(124010, 42) 0.7813204364746404
(124010, 43) 0.12333335451008201
(124010, 44) 1.2281905750572492e-07
(124010, 45) 0.8766665226708605
(124010, 46) 0.0004487893042226658
(124010, 47) 0.001303620464837077
(124010, 48) 0.9982475902309401
(124010, 49) 3.0
(124010, 58) 2.0
(124010, 83) 1.0
(124010, 107) 1.0
(124010, 114) 1.0

Design approach
To be updated…
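Core code
To be updated…

import numpy as np
from sklearn.datasets import dump_svmlight_file
from sklearn.feature_extraction.text import CountVectorizer

# Number of amenity tags per listing
train_test['features_count'] = train_test['features'].apply(len)

# Join each tag list into one string so CountVectorizer can tokenize it
train_test['features2'] = train_test['features'].apply(lambda x: ' '.join(x))

# Bag-of-words counts for the 200 most frequent tag words
c_vect = CountVectorizer(stop_words='english', max_features=200, ngram_range=(1, 1))
c_vect_sparse = c_vect.fit_transform(train_test['features2'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
c_vect_sparse_cols = c_vect.get_feature_names_out()

train_test.drop(['features', 'features2'], axis=1, inplace=True)

# dump_svmlight_file takes (X, y, f): features first, then labels, then the output path
dump_svmlight_file(X_train_sparse, y_train, dpath + 'RentListingInquries_FE_train_libsvm.txt')
# The test set has no labels, but y is a required argument; the zeros are placeholders (an assumption)
dump_svmlight_file(X_test_sparse, np.zeros(X_test_sparse.shape[0]), dpath + 'RentListingInquries_FE_test_libsvm.txt')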
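The steps above can be tried end to end on toy data. This is a rough sketch, not the author's full pipeline: the three listings, the labels, and the temporary file name are invented, and the `hstack` step is an assumption about how the dense columns and the word counts are combined into one sparse matrix, because the post does not show how `train_test_sparse` is assembled.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.feature_extraction.text import CountVectorizer

# Toy listings; values are invented for illustration.
train_test = pd.DataFrame({
    "price": [3000.0, 5465.0, 2850.0],
    "features": [["Doorman", "Elevator"], ["Elevator", "Laundry"], []],
})
y = np.array([1, 2, 0])  # hypothetical interest_level labels

# Count of tags per listing, then one space-joined string per listing.
train_test["features_count"] = train_test["features"].apply(len)
text = train_test["features"].apply(lambda tags: " ".join(tags))

# Bag-of-words counts over the tag strings (same settings as in the post).
c_vect = CountVectorizer(stop_words="english", max_features=200, ngram_range=(1, 1))
c_vect_sparse = c_vect.fit_transform(text)

# Assumed assembly step: numeric columns first, word-count columns after them.
dense = sparse.csr_matrix(train_test[["price", "features_count"]].to_numpy(dtype=float))
X = sparse.hstack([dense, c_vect_sparse]).tocsr()

# Write in libsvm format: argument order is (X, y, file).
path = os.path.join(tempfile.mkdtemp(), "toy_libsvm.txt")
dump_svmlight_file(X, y, path)

# Load it back; n_features keeps trailing all-zero columns from being dropped.
X_back, y_back = load_svmlight_file(path, n_features=X.shape[1], zero_based=True)
```

Passing `n_features` when loading matters for sparse files: the libsvm format stores only non-zero entries, so a loader cannot otherwise know about columns that happen to be empty at the end of the matrix.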