Optional Algorithms
Baseline
Classification
Based on scikit-learn's DummyClassifier. It uses the prior strategy, which returns the most frequent class as the label and the class priors for predict_proba().
Regression
Based on scikit-learn's DummyRegressor. It uses the mean strategy, which returns the mean of the training targets.
Baseline is not tuned
Baseline has no hyperparameters; there will only ever be one Baseline model.
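For illustration, a minimal sketch of how these baselines behave in plain scikit-learn (the synthetic data and variable names are assumptions, not part of the package API):

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.random.rand(100, 4)
y_cls = np.random.choice([0, 0, 1], size=100)  # class 0 is the majority
y_reg = np.random.rand(100)

# "prior" always predicts the most frequent class; predict_proba()
# returns the class priors observed in the training data.
clf = DummyClassifier(strategy="prior").fit(X, y_cls)
print(clf.predict(X[:3]))        # the majority class, three times
print(clf.predict_proba(X[:3]))  # the class priors

# "mean" always predicts the mean of the training targets.
reg = DummyRegressor(strategy="mean").fit(X, y_reg)
print(reg.predict(X[:3]))        # y_reg.mean(), three times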
Decision Tree
Classification
Based on scikit-learn's DecisionTreeClassifier.
Decision Tree hyperparameters for classification
Allowed hyperparameter values:
dt_params = {"criterion": ["gini", "entropy"],
"max_depth": [2, 3, 4]}
Default hyperparameters:
classification_default_params = {"criterion": "gini", "max_depth": 3}
Regression
Based on scikit-learn's DecisionTreeRegressor.
Decision Tree hyperparameters for regression
Allowed hyperparameter values:
dt_params = {
    "criterion": ["mse", "friedman_mse"],
    "max_depth": [2, 3, 4],
}
Default hyperparameters:
regression_default_params = {"criterion": "mse", "max_depth": 3}
A visualization of the Decision Tree can be created with the dtreeviz package (when explain_level > 0).
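As a minimal sketch, one configuration could be drawn from the dt_params search space above and trained like this (the random sampling is an illustration, not the tuner the package actually uses):

import random
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

dt_params = {"criterion": ["gini", "entropy"], "max_depth": [2, 3, 4]}

# Draw one random configuration from the allowed values.
config = {name: random.choice(values) for name, values in dt_params.items()}

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(**config).fit(X, y)
print(config, "train accuracy:", model.score(X, y))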
Linear
Classification
Based on scikit-learn's LogisticRegression.
Linear hyperparameters for classification
The Linear model has no hyperparameters to tune; LogisticRegression is initialized with max_iter=500, tol=5e-4, n_jobs=-1.
Regression
Based on scikit-learn's LinearRegression.
Linear hyperparameters for regression
The Linear model has no hyperparameters to tune; LinearRegression is initialized with n_jobs=-1.
If explain_level > 0, the model coefficients are saved in the Markdown report.
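A minimal sketch of the fixed initialization described above, including how the coefficients could be collected for a report (the tabular dump is an assumption; the package writes them into its own Markdown report):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Fixed initialization as described above; no tuning is performed.
model = LogisticRegression(max_iter=500, tol=5e-4, n_jobs=-1).fit(X, y)

# Collect the coefficients, as they would appear in the report.
coefs = pd.DataFrame({"feature": X.columns, "coefficient": model.coef_[0]})
print(coefs)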
Random Forest
Classification
Based on scikit-learn's RandomForestClassifier.
Random Forest hyperparameters for classification
Allowed hyperparameter values:
rf_params = {
    "criterion": ["gini", "entropy"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}
Default hyperparameters:
classification_default_params = {
    "criterion": "gini",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
Regression
Based on scikit-learn's RandomForestRegressor.
Random Forest hyperparameters for regression
Allowed hyperparameter values:
regression_rf_params = {
    "criterion": ["mse"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}
Default hyperparameters:
regression_default_params = {
    "criterion": "mse",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
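A minimal sketch that trains a Random Forest with the default classification hyperparameters above (the dataset is synthetic and purely illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Defaults from classification_default_params above; max_features is the
# fraction of features considered at each split.
model = RandomForestClassifier(
    criterion="gini",
    max_features=0.6,
    min_samples_split=30,
    max_depth=6,
)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())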
Extra Trees
Classification
Based on scikit-learn's ExtraTreesClassifier.
Extra Trees hyperparameters for classification
Allowed hyperparameter values:
et_params = {
    "criterion": ["gini", "entropy"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}
Default hyperparameters:
classification_default_params = {
    "criterion": "gini",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
Regression
Based on scikit-learn's ExtraTreesRegressor.
Extra Trees hyperparameters for regression
Allowed hyperparameter values:
regression_et_params = {
    "criterion": ["mse"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}
Default hyperparameters:
regression_default_params = {
    "criterion": "mse",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
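A minimal regression sketch with the defaults above. Note that scikit-learn 1.0 renamed the "mse" criterion to "squared_error", so the value below is an assumption for newer releases; older versions keep "mse":

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

# "mse" in the grids above corresponds to "squared_error" in
# scikit-learn >= 1.0; adjust to match the installed version.
model = ExtraTreesRegressor(
    criterion="squared_error",
    max_features=0.6,
    min_samples_split=30,
    max_depth=6,
).fit(X, y)
print("train R^2:", model.score(X, y))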
Xgboost
Based on the Xgboost package.
Binary Classification
Xgboost hyperparameters for binary classification
Allowed hyperparameter values:
xgb_bin_class_params = {
    "objective": ["binary:logistic"],
    "eval_metric": ["logloss"],
    "eta": [0.05, 0.075, 0.1, 0.15],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subsample": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
Default hyperparameters:
classification_bin_default_params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
Multi-class Classification
Xgboost hyperparameters for multi-class classification
Allowed hyperparameter values:
xgb_multi_class_params = {
    "objective": ["multi:softprob"],
    "eval_metric": ["mlogloss"],
    "eta": [0.05, 0.075, 0.1, 0.15],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subsample": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
Default hyperparameters:
classification_multi_default_params = {
    "objective": "multi:softprob",
    "eval_metric": "mlogloss",
    "eta": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
Regression
Xgboost hyperparameters for regression
Allowed hyperparameter values:
xgb_regression_params = {
    "objective": ["reg:squarederror"],
    "eval_metric": ["rmse"],
    "eta": [0.05, 0.075, 0.1, 0.15],
    "max_depth": [1, 2, 3, 4],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subsample": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
Default hyperparameters:
regression_default_params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "eta": 0.1,
    "max_depth": 4,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
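A minimal sketch of training with the regression defaults above through the native Xgboost API (the data and the number of boosting rounds are assumptions for this sketch):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "eta": 0.1,
    "max_depth": 4,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
# num_boost_round is an arbitrary choice here.
booster = xgb.train(params, dtrain, num_boost_round=100)
print(booster.predict(dtrain)[:5])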
CatBoost
Based on the CatBoost package.
Classification
CatBoost hyperparameters for classification
Allowed hyperparameter values:
classification_params = {
    "learning_rate": [0.05, 0.1, 0.2],
    "depth": [2, 3, 4, 5, 6],
    "rsm": [0.7, 0.8, 0.9, 1],  # random subspace method
    "subsample": [0.7, 0.8, 0.9, 1],  # sample rate for bagging
    "min_data_in_leaf": [1, 5, 10, 15, 20, 30, 50],
}
Default hyperparameters:
classification_default_params = {
    "learning_rate": 0.1,
    "depth": 6,
    "rsm": 0.9,
    "subsample": 1.0,
    "min_data_in_leaf": 15,
}
- For binary classification, loss_function=Logloss.
- For multi-class classification, loss_function=MultiClass.
Regression
CatBoost hyperparameters for regression
Allowed hyperparameter values:
regression_params = {
    "learning_rate": [0.05, 0.1, 0.2],
    "depth": [2, 3, 4, 5, 6],
    "rsm": [0.7, 0.8, 0.9, 1],  # random subspace method
    "subsample": [0.7, 0.8, 0.9, 1],  # sample rate for bagging
    "min_data_in_leaf": [1, 5, 10, 15, 20, 30, 50],
}
Default hyperparameters:
regression_default_params = {
    "learning_rate": 0.1,
    "depth": 6,
    "rsm": 0.9,
    "subsample": 1.0,
    "min_data_in_leaf": 15,
}
For regression, loss_function=RMSE.
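A minimal binary-classification sketch with the defaults above. bootstrap_type is set explicitly so that subsample applies, and min_data_in_leaf is omitted because some CatBoost versions accept it only with non-default tree-growing policies; both choices are assumptions of this sketch:

import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = CatBoostClassifier(
    iterations=100,              # kept small for the sketch
    learning_rate=0.1,
    depth=6,
    rsm=0.9,                     # random subspace method: feature fraction
    bootstrap_type="Bernoulli",  # required so that subsample applies
    subsample=1.0,
    loss_function="Logloss",     # binary classification, per the note above
    verbose=False,
)
model.fit(X, y)
print(model.predict_proba(X[:3]))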
LightGBM
Based on the LightGBM package.
Binary Classification
LightGBM hyperparameters for binary classification
Allowed hyperparameter values:
lgbm_bin_params = {
    "objective": ["binary"],
    "metric": ["binary_logloss"],
    "num_leaves": [3, 7, 15, 31],
    "learning_rate": [0.05, 0.075, 0.1, 0.15],
    "feature_fraction": [0.8, 0.9, 1.0],
    "bagging_fraction": [0.8, 0.9, 1.0],
    "min_data_in_leaf": [5, 10, 15, 20, 30, 50],
}
Default hyperparameters:
classification_bin_default_params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "min_data_in_leaf": 10,
}
Multi-class Classification
LightGBM hyperparameters for multi-class classification
Allowed hyperparameter values:
lgbm_multi_params = {
    "objective": ["multiclass"],
    "metric": ["multi_logloss"],
    "num_leaves": [3, 7, 15, 31],
    "learning_rate": [0.05, 0.075, 0.1, 0.15],
    "feature_fraction": [0.8, 0.9, 1.0],
    "bagging_fraction": [0.8, 0.9, 1.0],
    "min_data_in_leaf": [5, 10, 15, 20, 30, 50],
}
Default hyperparameters:
classification_multi_default_params = {
    "objective": "multiclass",
    "metric": "multi_logloss",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "min_data_in_leaf": 10,
}
Regression
LightGBM hyperparameters for regression
Allowed hyperparameter values:
lgbm_regression_params = {
    "objective": ["regression"],
    "metric": ["l2"],
    "num_leaves": [3, 7, 15, 31],
    "learning_rate": [0.05, 0.075, 0.1, 0.15],
    "feature_fraction": [0.8, 0.9, 1.0],
    "bagging_fraction": [0.8, 0.9, 1.0],
    "min_data_in_leaf": [5, 10, 15, 20, 30, 50],
}
Default hyperparameters:
regression_default_params = {
    "objective": "regression",
    "metric": "l2",
    "num_leaves": 15,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "min_data_in_leaf": 10,
}
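A minimal sketch with the binary-classification defaults above, using the native LightGBM training API. bagging_freq and the round count are assumptions of this sketch (in LightGBM, bagging_fraction only takes effect when bagging_freq > 0):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "bagging_freq": 1,  # without it, bagging_fraction is ignored
    "min_data_in_leaf": 10,
    "verbosity": -1,
}
train_set = lgb.Dataset(X, label=y)
booster = lgb.train(params, train_set, num_boost_round=100)
print(booster.predict(X[:3]))  # probabilities of class 1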
Neural Network
Based on Keras and TensorFlow. Classification and regression use the same hyperparameters, but the output type and loss function differ.
Neural Network hyperparameters
Allowed hyperparameter values:
nn_params = {
    "dense_layers": [2],
    "dense_1_size": [16, 32, 64],
    "dense_2_size": [4, 8, 16, 32],
    "dropout": [0, 0.1, 0.25],
    "learning_rate": [0.01, 0.05, 0.08, 0.1],
    "momentum": [0.85, 0.9, 0.95],
    "decay": [0.0001, 0.001, 0.01],
}
Default hyperparameters:
default_nn_params = {
    "dense_layers": 2,
    "dense_1_size": 32,
    "dense_2_size": 16,
    "dropout": 0,
    "learning_rate": 0.05,
    "momentum": 0.9,
    "decay": 0.001,
}
Binary classification
- A single output neuron with sigmoid activation.
- Loss function: binary_crossentropy.
Multi-class classification
- The number of output neurons equals the number of classes; the output layer uses softmax activation.
- Loss function: categorical_crossentropy.
Regression
- A single output neuron with linear activation.
- Loss function: mean_squared_error.
Nearest Neighbor
Based on scikit-learn:
- KNeighborsClassifier for classification,
- KNeighborsRegressor for regression.
Nearest Neighbor hyperparameters
Allowed hyperparameter values:
knn_params = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
}
Default hyperparameters:
default_params = {
    "n_neighbors": 5,
    "weights": "uniform",
}
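A minimal sketch with the defaults above. The feature-scaling step is an assumption of this sketch; distance-based models are usually fit on scaled features:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# default_params above: 5 neighbors with uniform (unweighted) voting.
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="uniform"),
)
model.fit(X, y)
print(model.predict(X[:3]))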
Stacked Algorithm
Stacked models are built on the predictions of previously trained (unstacked) models. The best model of each type is selected from the unstacked models, and their hyperparameters are reused.
- During stacking, at most 10 of the best models of each algorithm, except Baseline, are used.
- Out-of-fold predictions are used to build the extended training data, so stacking is available only with validation_strategy="kfold" (k-fold cross-validation); see the sketch after this list.
- Only Xgboost, LightGBM, and CatBoost models are stacked.
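A minimal sketch of the out-of-fold idea: predictions produced under a k-fold split are appended to the original features to form the extended training data. The helper (cross_val_predict) and the model choices are assumptions for illustration, not the package internals:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each row is predicted by a model that never saw it during training,
# so the new column leaks no target information.
base = RandomForestClassifier(random_state=0)
oof = cross_val_predict(base, X, y, cv=5, method="predict_proba")[:, 1]

# Extended training data = original features + out-of-fold predictions.
X_stacked = np.column_stack([X, oof])
stacked = LogisticRegression(max_iter=500).fit(X_stacked, y)
print("stacked train accuracy:", stacked.score(X_stacked, y))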
Ensemble
Implemented following the greedy ensemble selection article by Caruana et al. It uses the average method: a greedy search over all models that repeatedly tries to add a model (with repetition) to the ensemble whenever doing so improves the ensemble's performance. Ensemble performance is computed from the out-of-fold predictions of the models used.
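A minimal sketch of greedy ensemble selection with repetition, in the spirit described above (the log-loss scoring, the synthetic out-of-fold predictions, and all names are assumptions of this sketch):

import numpy as np
from sklearn.metrics import log_loss

def greedy_ensemble(oof_preds, y, max_rounds=20):
    # Repeatedly add (possibly re-add) whichever model's predictions,
    # averaged into the ensemble, most reduce the ensemble's loss.
    selected, best_loss = [], np.inf
    for _ in range(max_rounds):
        round_best = None
        for name, pred in oof_preds.items():
            candidate = np.mean([oof_preds[m] for m in selected] + [pred], axis=0)
            loss = log_loss(y, candidate)
            if loss < best_loss:
                best_loss, round_best = loss, name
        if round_best is None:  # no model improves the ensemble: stop
            break
        selected.append(round_best)
    return selected, best_loss

# Hypothetical out-of-fold probabilities from three models of varying noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
oof_preds = {
    f"model_{i}": np.clip(y + rng.normal(0, 0.3 + 0.1 * i, size=200), 0.01, 0.99)
    for i in range(3)
}
print(greedy_ensemble(oof_preds, y))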