
Core Functions

1. Training

Minimal usage: just pass in the dataset path to start training.

Custom usage: keyword arguments can be forwarded via AutoML(**kwargs); see the API section.

Simple training method: only the dataset path needs to be passed in.

Source code in function\train.py
# NOTE: imports (pandas as pd, the project's AutoML class) live at the top of
# function\train.py and are omitted from this excerpt.
def train(data_path, **kwargs):
    """
    Simple training method: only the dataset path needs to be passed in.
    """
    # Load data
    df_train = pd.read_csv(data_path)

    # Separate id and label
    X_train = df_train.drop(['id', 'label'], axis=1)
    y_train = df_train["label"]

    # Calculate the weight of each class
    label_counts = y_train.value_counts()
    total_samples = len(y_train)
    weights = total_samples / label_counts

    # Set the corresponding weight
    sample_weight = y_train.map(weights).values

    # Create AutoML object
    automl = AutoML(**kwargs)

    # Train
    automl.fit(X_train, y_train, sample_weight=sample_weight)
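
A minimal usage sketch, assuming the module is importable as function.train; the CSV path and the AutoML keyword arguments below are illustrative, not values fixed by the project:

from function.train import train

# Minimal call: dataset path only, AutoML defaults apply
train("data/train.csv")

# Custom call: keyword arguments are forwarded to AutoML(**kwargs)
train("data/train.csv", ml_task="multiclass_classification", eval_metric="f1")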

2. Prediction

Minimal usage: just pass in the dataset path and the model path to run prediction.

Simple prediction method: only the dataset path and the model path need to be passed in.

Source code in function\predict.py
# NOTE: imports (pandas as pd, the project's AutoML class) live at the top of
# function\predict.py and are omitted from this excerpt.
def predict(file_input, file_output, model_path):
    """
    Simple prediction method: only the dataset path and the model path need to be passed in.
    """
    # Load the trained model from its results directory
    automl = AutoML(results_path=model_path)

    df = pd.read_csv(file_input)

    # Get the predicted labels
    predictions = automl.predict(df)

    # Add the predicted labels to the original data
    df['predicted_label'] = predictions

    # Output predictions to file_output
    df.to_csv(file_output, index=False)
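
A usage sketch, assuming the module is importable as function.predict; the file paths and model directory are placeholders:

from function.predict import predict

# Writes the input rows plus a 'predicted_label' column to the output CSV
predict("data/test.csv", "data/test_predictions.csv", "models/automl_results")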

3. Preset Preprocessing

Removes duplicate samples, fills missing values with the mean, and standardizes the features.

Please note!

  • This step is recommended but not required: if you skip manual preprocessing, AutoML automatically applies its own preset simple preprocessing.

Remove duplicate samples, fill missing values with the mean, and standardize. The missing_method parameter selects the missing-value filling strategy.

Source code in function\preprocessing_simply.py
# NOTE: imports (pandas as pd, the project's SimplifiedPreprocessing class) live
# at the top of function\preprocessing_simply.py and are omitted from this excerpt.
def preprocess_simplified(data_input, data_output, missing_method, has_labels=True):
    """
    Remove duplicate samples, fill missing values with the mean, and standardize.
    :param missing_method: missing-value filling strategy
    """
    # Load data
    data = pd.read_csv(data_input)

    # Separate id (and label, if present) from the features
    id_data = data['id']
    if has_labels:
        X_data = data.drop(['id', 'label'], axis=1)
        y_data = data['label']
    else:
        # Unlabeled data has no 'label' column, so only drop 'id'
        X_data = data.drop(['id'], axis=1)
        y_data = None

    # Initialize Preprocessor
    preprocessor = SimplifiedPreprocessing(missing_method=missing_method)

    # Fit and transform
    preprocessor.fit(X_data)
    processed_X = preprocessor.transform(X_data)

    # Concatenate id_data, processed_X and y_data
    if y_data is not None:
        merged_data = pd.concat([id_data, processed_X, y_data], axis=1)
    else:
        merged_data = pd.concat([id_data, processed_X], axis=1)

    # Save to csv
    merged_data.to_csv(data_output, index=False)
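
A usage sketch, assuming the module is importable as function.preprocessing_simply; the paths and the "mean" strategy value are illustrative:

from function.preprocessing_simply import preprocess_simplified

# Labeled data: the 'label' column is carried through to the output
preprocess_simplified("data/train.csv", "data/train_clean.csv", missing_method="mean")

# Unlabeled data: there is no 'label' column, so set has_labels=False
preprocess_simplified("data/test.csv", "data/test_clean.csv", missing_method="mean",
                      has_labels=False)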

4. Dataset Balancing

Optional: balances the dataset with under-sampling and similar techniques so that each class has as even a sample count as possible.

Please note!

  • This step is recommended but not required: if you do not balance manually, AutoML will not balance the data automatically.
  • Severe class imbalance will degrade the performance of the resulting model.

Please note!

  • Use this feature with care: the balancing strategy and its extent strongly affect the performance of the trained model.

Perform under-sampling on the input data using specified techniques.

Parameters:

Name         Type  Description                                                              Default
data_input   str   Path to the input data file.                                             required
data_output  str   Path to the output file to save the balanced data.                       required
mode         int   Mode to determine the under-sampling technique to apply. Default is 0.   0
**kwargs           Additional keyword arguments for RepeatedEditedNearestNeighbours.        {}

Returns:

Type  Description
None

Source code in function\under_sampling.py
# NOTE: imports (pandas as pd, numpy as np, collections.Counter, and the
# imblearn.under_sampling samplers TomekLinks and RepeatedEditedNearestNeighbours)
# live at the top of function\under_sampling.py and are omitted from this excerpt.
def under_sampling(data_input, data_output, mode=0, **kwargs):
    """
    Perform under-sampling on the input data using specified techniques.

    Args:
        data_input (str): Path to the input data file.
        data_output (str): Path to the output file to save the balanced data.
        mode (int): Mode to determine the under-sampling technique to apply. Default is 0.
        **kwargs: Additional keyword arguments for RepeatedEditedNearestNeighbours.

    Returns:
        None
    """
    # Load data
    df = pd.read_csv(data_input)

    X = df.drop(['id', 'label'], axis=1)
    y = df['label']

    # Print original data distribution
    print("Original data distribution: ", sorted(Counter(y).items()))

    # Remove Tomek links first, then clean further with
    # RepeatedEditedNearestNeighbours (RENN)
    tks = TomekLinks()

    if mode == 0:
        # Mode 0: a single RENN pass, configured through **kwargs
        renn_mode_0 = RepeatedEditedNearestNeighbours(**kwargs)

        X, y = tks.fit_resample(X, y)
        X, y = renn_mode_0.fit_resample(X, y)

    elif mode == 1:
        # Mode 1: two fixed RENN passes targeting different class subsets
        renn_mode_1_0 = RepeatedEditedNearestNeighbours(sampling_strategy=[0], kind_sel='all', n_neighbors=2)
        renn_mode_1_1 = RepeatedEditedNearestNeighbours(sampling_strategy=[1, 2], kind_sel='mode', n_neighbors=6)

        X, y = tks.fit_resample(X, y)
        X, y = renn_mode_1_0.fit_resample(X, y)
        X, y = renn_mode_1_1.fit_resample(X, y)

    # Rebuild a fresh 1..n id column and reassemble the balanced DataFrame
    balanced_df = pd.concat([pd.DataFrame(np.arange(1, len(X) + 1), columns=['id']),
                             pd.DataFrame(X),
                             pd.DataFrame(y)], axis=1)

    # Print final data distribution
    print("Final data distribution: ", sorted(Counter(balanced_df['label']).items()))

    balanced_df.to_csv(data_output, index=False)
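
A usage sketch, assuming the module is importable as function.under_sampling; the paths are placeholders, and n_neighbors is a standard RepeatedEditedNearestNeighbours parameter shown only as an example:

from function.under_sampling import under_sampling

# Mode 0: TomekLinks followed by one RENN pass configured via **kwargs
under_sampling("data/train.csv", "data/train_balanced.csv", mode=0, n_neighbors=3)

# Mode 1: TomekLinks followed by the two preconfigured RENN passes
under_sampling("data/train.csv", "data/train_balanced.csv", mode=1)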

5. Self-Training

Pass in a labeled dataset, an unlabeled dataset, and a model storage path to run self-training.

Please note!

  • Use self-training with caution: analyze the data thoroughly and set the relevant parameters carefully before use.

Auto-self-training function. Use this function with caution: conduct a comprehensive analysis of the data and set the relevant parameters carefully.

Source code in function\self_training.py
# NOTE: imports (pandas as pd, the project's AutoST class) live at the top of
# function\self_training.py and are omitted from this excerpt.
def auto_st(labeled_data_qs, unlabeled_data_qs, results_path, **kwargs):
    """
    Auto-self-training function.
    Use this function with caution: conduct a comprehensive analysis of the data
    and set the relevant parameters carefully.
    """

    # Convert queryset to DataFrame
    labeled_data = pd.DataFrame.from_records(labeled_data_qs.values())
    unlabeled_data = pd.DataFrame.from_records(unlabeled_data_qs.values())

    # Separate the 'id' and 'label' columns from the features
    X_labeled = labeled_data.drop(['id', 'label'], axis=1)
    y_labeled = labeled_data['label']

    X_unlabeled = unlabeled_data.drop(['id'], axis=1)

    # Define default parameters
    default_params = {
        "num_iterations": 5,
        "lambda_uncertainty_initial": 0.5,
        "lambda_uncertainty_final": 0.2,
        "prob_threshold_initial": 0.5,
        "prob_threshold_final": 0.3,
        "pseudo_label_ratio_initial": 0.25,
        "pseudo_label_ratio_final": 0.10,
        "algorithms": ["LightGBM", "Xgboost", "CatBoost"],
        "ml_task": "multiclass_classification",
        "start_random_models": 3,
        "hill_climbing_steps": 2,
        "top_models_to_improve": 2,
        "composite_features": True,
        "features_selection": True,
        "train_ensemble": True,
        "explain_level": 1,
        "eval_metric": "f1",
        "validation_strategy": {
            "validation_type": "kfold",
            "k_folds": 5,
            "shuffle": True,
            "stratify": True,
            "random_seed": 42
        }
    }

    # Update default arguments with values from kwargs
    default_params.update(kwargs)

    # Create an AutoST instance with the updated parameters
    auto_st = AutoST(results_path, **default_params)

    auto_st.fit(X_labeled, y_labeled, X_unlabeled)

    # Return state
    return {"success": True, "message": "自训练完成"}

6. Data Analysis

Pass in the path of a dataset (or several) to generate an HTML analysis report.

  • Single-dataset analysis report:
Source code in function\profile_report.py
# NOTE: imports (os, pandas as pd, and ydata_profiling's ProfileReport and compare)
# live at the top of function\profile_report.py and are omitted from this excerpt.
def profile_report_one_files(file, output_file):
    # Extract the file name
    file_name = os.path.basename(file)

    # Load your data
    df_file = pd.read_csv(file)

    # Create the ProfileReport objects without specifying the data source,
    # to allow editing the configuration
    report_file = ProfileReport(df_file, title=file_name, correlations=None,
                                html={"style": {"primary_color": "#2494f4"}})

    # Disable interaction plots between continuous variables
    report_file.config.interactions.continuous = False

    # Assigning DataFrames and exporting to a file, triggering computation
    report_file.df = df_file

    # Output to html
    report_file.to_file(output_file)
  • Two-dataset comparison report:
Source code in function\profile_report.py
def profile_report_two_files(file1, file2, output_file):
    # Extract the file names
    file1_name = os.path.basename(file1)
    file2_name = os.path.basename(file2)

    # Load your data
    df_file1 = pd.read_csv(file1)
    df_file2 = pd.read_csv(file2)

    # Create the ProfileReport objects without specifying the data source,
    # to allow editing the configuration
    report_file1 = ProfileReport(df_file1, title=file1_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})
    report_file2 = ProfileReport(df_file2, title=file2_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})

    # Disable interaction plots between continuous variables
    report_file1.config.interactions.continuous = False
    report_file2.config.interactions.continuous = False

    # Assigning DataFrames and exporting to a file, triggering computation
    report_file1.df = df_file1
    report_file2.df = df_file2

    # Compare the datasets
    comparison_report = report_file1.compare(report_file2)

    # Output to html
    comparison_report.to_file(output_file)
  • Three-dataset comparison report:
Source code in function\profile_report.py
def profile_report_three_files(file1, file2, file3, output_file):
    # Extract the file names
    file1_name = os.path.basename(file1)
    file2_name = os.path.basename(file2)
    file3_name = os.path.basename(file3)

    # Load your data
    df_file1 = pd.read_csv(file1)
    df_file2 = pd.read_csv(file2)
    df_file3 = pd.read_csv(file3)

    # Create the ProfileReport objects without specifying the data source,
    # to allow editing the configuration
    report_file1 = ProfileReport(df_file1, title=file1_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})
    report_file2 = ProfileReport(df_file2, title=file2_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})
    report_file3 = ProfileReport(df_file3, title=file3_name, correlations=None,
                                 html={"style": {"primary_color": "#2494f4"}})

    # Disable interaction plots between continuous variables
    report_file1.config.interactions.continuous = False
    report_file2.config.interactions.continuous = False
    report_file3.config.interactions.continuous = False

    # Assigning DataFrames and exporting to a file, triggering computation
    report_file1.df = df_file1
    report_file2.df = df_file2
    report_file3.df = df_file3

    # Compare the datasets
    comparison_report = compare([report_file1, report_file2, report_file3])
    comparison_report.to_file(output_file)
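
A usage sketch, assuming the module is importable as function.profile_report; the CSV and output paths are placeholders:

from function.profile_report import (profile_report_one_files,
                                     profile_report_two_files,
                                     profile_report_three_files)

# Single-dataset report
profile_report_one_files("data/train.csv", "reports/train.html")

# Two-dataset comparison report (e.g. train vs. test)
profile_report_two_files("data/train.csv", "data/test.csv",
                         "reports/train_vs_test.html")

# Three-dataset comparison report
profile_report_three_files("data/train.csv", "data/val.csv", "data/test.csv",
                           "reports/three_way.html")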

For other features, see the Feature Details and API sections.