
Core Functions

1. Training

Minimal usage: just pass in the dataset path to start training.

Custom usage: keyword arguments can be forwarded via AutoML(**kwargs); see the API section.

Simple training method: only the dataset path needs to be passed in.

Source code in function\train.py
# NOTE: imports (pandas as pd, the project's AutoML class) live at the top of
# function\train.py and are omitted from this excerpt.
def train(data_path, **kwargs):
    """
    Simple training method: only the dataset path needs to be passed in.
    """
    # Load data
    df_train = pd.read_csv(data_path)

    # Separate id and label
    X_train = df_train.drop(['id', 'label'], axis=1)
    y_train = df_train["label"]

    # Calculate the weight of each class
    label_counts = y_train.value_counts()
    total_samples = len(y_train)
    weights = total_samples / label_counts

    # Set the corresponding weight
    sample_weight = y_train.map(weights).values

    # Create AutoML object
    automl = AutoML(**kwargs)

    # Train
    automl.fit(X_train, y_train, sample_weight=sample_weight)
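
A minimal usage sketch, assuming the module is importable as function.train; the CSV path and the AutoML keyword arguments below are illustrative, not values fixed by the project:

from function.train import train

# Minimal call: dataset path only, AutoML defaults apply
train("data/train.csv")

# Custom call: keyword arguments are forwarded to AutoML(**kwargs)
train("data/train.csv", ml_task="multiclass_classification", eval_metric="f1")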

2. Prediction

Minimal usage: just pass in the dataset path and the model path to run prediction.

Simple prediction method: only the dataset path and the model path need to be passed in.

Source code in function\predict.py
# NOTE: imports (pandas as pd, the project's AutoML class) live at the top of
# function\predict.py and are omitted from this excerpt.
def predict(file_input, file_output, model_path):
    """
    Simple prediction method: only the dataset path and the model path need to be passed in.
    """
    # Load the trained model from its results directory
    automl = AutoML(results_path=model_path)

    df = pd.read_csv(file_input)

    # Get the predicted labels
    predictions = automl.predict(df)

    # Add the predicted labels to the original data
    df['predicted_label'] = predictions

    # Output predictions to file_output
    df.to_csv(file_output, index=False)
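
A usage sketch, assuming the module is importable as function.predict; the file paths and model directory are placeholders:

from function.predict import predict

# Writes the input rows plus a 'predicted_label' column to the output CSV
predict("data/test.csv", "data/test_predictions.csv", "models/automl_results")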

3. Preset Preprocessing

Removes duplicate samples, fills missing values with the mean, and standardizes the features.

Please note!

  • This step is recommended but not required: if you skip manual preprocessing, AutoML automatically applies its own preset simple preprocessing.

Remove duplicate samples, fill missing values with the mean, and standardize. The missing_method parameter selects the missing-value filling strategy.

Source code in function\preprocessing_simply.py
# NOTE: imports (pandas as pd, the project's SimplifiedPreprocessing class) live
# at the top of function\preprocessing_simply.py and are omitted from this excerpt.
def preprocess_simplified(data_input, data_output, missing_method, has_labels=True):
    """
    Remove duplicate samples, fill missing values with the mean, and standardize.
    :param missing_method: missing-value filling strategy
    """
    # Load data
    data = pd.read_csv(data_input)

    # Separate id (and label, if present) from the features
    id_data = data['id']
    if has_labels:
        X_data = data.drop(['id', 'label'], axis=1)
        y_data = data['label']
    else:
        # Unlabeled data has no 'label' column, so only drop 'id'
        X_data = data.drop(['id'], axis=1)
        y_data = None

    # Initialize Preprocessor
    preprocessor = SimplifiedPreprocessing(missing_method=missing_method)

    # Fit and transform
    preprocessor.fit(X_data)
    processed_X = preprocessor.transform(X_data)

    # Concatenate id_data, processed_X and y_data
    if y_data is not None:
        merged_data = pd.concat([id_data, processed_X, y_data], axis=1)
    else:
        merged_data = pd.concat([id_data, processed_X], axis=1)

    # Save to csv
    merged_data.to_csv(data_output, index=False)
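
A usage sketch, assuming the module is importable as function.preprocessing_simply; the paths and the "mean" strategy value are illustrative:

from function.preprocessing_simply import preprocess_simplified

# Labeled data: the 'label' column is carried through to the output
preprocess_simplified("data/train.csv", "data/train_clean.csv", missing_method="mean")

# Unlabeled data: there is no 'label' column, so set has_labels=False
preprocess_simplified("data/test.csv", "data/test_clean.csv", missing_method="mean",
                      has_labels=False)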

4. Dataset Balancing

Optional: balances the dataset with under-sampling and similar techniques so that each class has as even a sample count as possible.

Please note!

  • This step is recommended but not required: if you do not balance manually, AutoML will not balance the data automatically.
  • Severe class imbalance will degrade the performance of the resulting model.

Please note!

  • Use this feature with care: the balancing strategy and its extent strongly affect the performance of the trained model.

Perform under-sampling on the input data using specified techniques.

Parameters:

Name         Type  Description                                                              Default
data_input   str   Path to the input data file.                                             required
data_output  str   Path to the output file to save the balanced data.                       required
mode         int   Mode to determine the under-sampling technique to apply. Default is 0.   0
**kwargs           Additional keyword arguments for RepeatedEditedNearestNeighbours.        {}

Returns:

Type  Description
None

Source code in function\under_sampling.py
# NOTE: imports (pandas as pd, numpy as np, collections.Counter, and the
# imblearn.under_sampling samplers TomekLinks and RepeatedEditedNearestNeighbours)
# live at the top of function\under_sampling.py and are omitted from this excerpt.
def under_sampling(data_input, data_output, mode=0, **kwargs):
    """
    Perform under-sampling on the input data using specified techniques.

    Args:
        data_input (str): Path to the input data file.
        data_output (str): Path to the output file to save the balanced data.
        mode (int): Mode to determine the under-sampling technique to apply. Default is 0.
        **kwargs: Additional keyword arguments for RepeatedEditedNearestNeighbours.

    Returns:
        None
    """
    # Load data
    df = pd.read_csv(data_input)

    X = df.drop(['id', 'label'], axis=1)
    y = df['label']

    # Print original data distribution
    print("Original data distribution: ", sorted(Counter(y).items()))

    # Remove Tomek links first, then clean further with
    # RepeatedEditedNearestNeighbours (RENN)
    tks = TomekLinks()

    if mode == 0:
        # Mode 0: a single RENN pass, configured through **kwargs
        renn_mode_0 = RepeatedEditedNearestNeighbours(**kwargs)

        X, y = tks.fit_resample(X, y)
        X, y = renn_mode_0.fit_resample(X, y)

    elif mode == 1:
        # Mode 1: two fixed RENN passes targeting different class subsets
        renn_mode_1_0 = RepeatedEditedNearestNeighbours(sampling_strategy=[0], kind_sel='all', n_neighbors=2)
        renn_mode_1_1 = RepeatedEditedNearestNeighbours(sampling_strategy=[1, 2], kind_sel='mode', n_neighbors=6)

        X, y = tks.fit_resample(X, y)
        X, y = renn_mode_1_0.fit_resample(X, y)
        X, y = renn_mode_1_1.fit_resample(X, y)

    # Rebuild a fresh 1..n id column and reassemble the balanced DataFrame
    balanced_df = pd.concat([pd.DataFrame(np.arange(1, len(X) + 1), columns=['id']),
                             pd.DataFrame(X),
                             pd.DataFrame(y)], axis=1)

    # Print final data distribution
    print("Final data distribution: ", sorted(Counter(balanced_df['label']).items()))

    balanced_df.to_csv(data_output, index=False)
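
A usage sketch, assuming the module is importable as function.under_sampling; the paths are placeholders, and n_neighbors is a standard RepeatedEditedNearestNeighbours parameter shown only as an example:

from function.under_sampling import under_sampling

# Mode 0: TomekLinks followed by one RENN pass configured via **kwargs
under_sampling("data/train.csv", "data/train_balanced.csv", mode=0, n_neighbors=3)

# Mode 1: TomekLinks followed by the two preconfigured RENN passes
under_sampling("data/train.csv", "data/train_balanced.csv", mode=1)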

5. Self-Training

Pass in a labeled dataset, an unlabeled dataset, and a model storage path to run self-training.

Please note!

  • Use self-training with caution: analyze the data thoroughly and set the relevant parameters carefully before use.

Auto-self-training function. Use this function with caution: conduct a comprehensive analysis of the data and set the relevant parameters carefully.

Source code in function\self_training.py
# NOTE: imports (pandas as pd, the project's AutoST class) live at the top of
# function\self_training.py and are omitted from this excerpt.
def auto_st(labeled_data_qs, unlabeled_data_qs, results_path, **kwargs):
    """
    Auto-self-training function.
    Use this function with caution: conduct a comprehensive analysis of the data
    and set the relevant parameters carefully.
    """

    # Convert queryset to DataFrame
    labeled_data = pd.DataFrame.from_records(labeled_data_qs.values())
    unlabeled_data = pd.DataFrame.from_records(unlabeled_data_qs.values())

    # Separate the 'id' and 'label' columns from the features
    X_labeled = labeled_data.drop(['id', 'label'], axis=1)
    y_labeled = labeled_data['label']

    X_unlabeled = unlabeled_data.drop(['id'], axis=1)

    # Define default parameters
    default_params = {
        "num_iterations": 5,
        "lambda_uncertainty_initial": 0.5,
        "lambda_uncertainty_final": 0.2,
        "prob_threshold_initial": 0.5,
        "prob_threshold_final": 0.3,
        "pseudo_label_ratio_initial": 0.25,
        "pseudo_label_ratio_final": 0.10,
        "algorithms": ["LightGBM", "Xgboost", "CatBoost"],
        "ml_task": "multiclass_classification",
        "start_random_models": 3,
        "hill_climbing_steps": 2,
        "top_models_to_improve": 2,
        "composite_features": True,
        "features_selection": True,
        "train_ensemble": True,
        "explain_level": 1,
        "eval_metric": "f1",
        "validation_strategy": {
            "validation_type": "kfold",
            "k_folds": 5,
            "shuffle": True,
            "stratify": True,
            "random_seed": 42
        }
    }

    # Update default arguments with values from kwargs
    default_params.update(kwargs)

    # Create an AutoST instance with the updated parameters
    auto_st = AutoST(results_path, **default_params)

    auto_st.fit(X_labeled, y_labeled, X_unlabeled)

    # Return state
    return {"success": True, "message": "自训练完成"}

6. Data Analysis

Pass in the path of a dataset (or several) to generate an HTML analysis report.

  • Single-dataset analysis report:
Source code in function\profile_report.py
# NOTE: imports (os, pandas as pd, and ydata_profiling's ProfileReport and compare)
# live at the top of function\profile_report.py and are omitted from this excerpt.
def profile_report_one_files(file, output_file):
    # Extract the file name
    file_name = os.path.basename(file)

    # Load your data
    df_file = pd.read_csv(file)

    # Create the ProfileReport objects without specifying the data source,
    # to allow editing the configuration
    report_file = ProfileReport(df_file, title=file_name, correlations=None,
                                html={"style": {"primary_color": "#2494f4"}})

    # Disable interaction plots between continuous variables
    report_file.config.interactions.continuous = False

    # Assigning DataFrames and exporting to a file, triggering computation
    report_file.df = df_file

    # Output to html
    report_file.to_file(output_file)
  • Two-dataset comparison report:
Source code in function\profile_report.py
def profile_report_two_files(file1, file2, output_file):
    # Extract the file names
    file1_name = os.path.basename(file1)
    file2_name = os.path.basename(file2)

    # Load your data
    df_file1 = pd.read_csv(file1)
    df_file2 = pd.read_csv(file2)

    # Create the ProfileReport objects without specifying the data source,
    # to allow editing the configuration
    report_file1 = ProfileReport(df_file1, title=file1_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})
    report_file2 = ProfileReport(df_file2, title=file2_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})

    # Disable interaction plots between continuous variables
    report_file1.config.interactions.continuous = False
    report_file2.config.interactions.continuous = False

    # Assigning DataFrames and exporting to a file, triggering computation
    report_file1.df = df_file1
    report_file2.df = df_file2

    # Compare the datasets
    comparison_report = report_file1.compare(report_file2)

    # Output to html
    comparison_report.to_file(output_file)
  • Three-dataset comparison report:
Source code in function\profile_report.py
def profile_report_three_files(file1, file2, file3, output_file):
    # Extract the file names
    file1_name = os.path.basename(file1)
    file2_name = os.path.basename(file2)
    file3_name = os.path.basename(file3)

    # Load your data
    df_file1 = pd.read_csv(file1)
    df_file2 = pd.read_csv(file2)
    df_file3 = pd.read_csv(file3)

    # Create the ProfileReport objects without specifying the data source,
    # to allow editing the configuration
    report_file1 = ProfileReport(df_file1, title=file1_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})
    report_file2 = ProfileReport(df_file2, title=file2_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})
    report_file3 = ProfileReport(df_file3, title=file3_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})

    # Disable interaction plots between continuous variables
    report_file1.config.interactions.continuous = False
    report_file2.config.interactions.continuous = False
    report_file3.config.interactions.continuous = False

    # Assigning DataFrames and exporting to a file, triggering computation
    report_file1.df = df_file1
    report_file2.df = df_file2
    report_file3.df = df_file3

    # Compare the datasets
    comparison_report = compare([report_file1, report_file2, report_file3])
    comparison_report.to_file(output_file)
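
A usage sketch, assuming the module is importable as function.profile_report; the CSV and output paths are placeholders:

from function.profile_report import (profile_report_one_files,
                                     profile_report_two_files,
                                     profile_report_three_files)

# Single-dataset report
profile_report_one_files("data/train.csv", "reports/train.html")

# Two-dataset comparison report (e.g. train vs. test)
profile_report_two_files("data/train.csv", "data/test.csv",
                         "reports/train_vs_test.html")

# Three-dataset comparison report
profile_report_three_files("data/train.csv", "data/val.csv", "data/test.csv",
                           "reports/three_way.html")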

For other features, see the Feature Details and API sections.