week 2

PROBLEM 1 Regression on Salary

1. Description. As learned in class, you want to analyze what factors would affect the salary of graduates’ first jobs, and then you collect 500 graduates’ data for analysis. See the “salary.csv” file attached. There are 5 features (GPA, # of internships, # of Competitions, # of Penalties, and Nationality) and 1 label (Salary). Use a linear regression model to analyze this dataset.

2. Visualization. The very first step is to familiarize yourself with the data. You will need to make 6 plots. The first is a salary histogram, helping you understand its distribution. Then, draw feature-label scattered plots for each feature vs. label to see their correlations (5 features, so 5 plots). Briefly state your findings from these plots.

3. K-fold cross validation. Use K=5. Evenly partition the dataset into 5 chunks, denoted as C1, C2, C3, C4, C5, and do 5 runs of experiments.
For each run i = 1, 2, 3, 4, 5:
	Use Ci as test data and the rest as training data;
	Data preprocessing. Standardize the data for each feature and label. Note for continuous columns such as GPA and Salary, only use training data to compute the sample mean and variance, and use them to normalize both training data and test data.
	Use Least Square to find the best model weights on training data, and evaluate the model on test data. Report the model weights, training MSE, and test MSE.
After all the five runs, use a bar chart to visualize the five runs’ training MSE and test MSE, and report an averaged MSE ± 1 standard deviation.

4. Draw conclusions. Based on your model results, briefly describe the experiment results, analyze the results, and draw conclusions.
5. Report. You will need to compose a report containing the abovementioned figures, observations, analysis, and conclusions. Try to write it as professionally as you can. Make it concise and elegant.

report

略

source code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

# 为了和自己计算的数据进行对比,只写了作业1的K-fold
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from graphviz import Digraph
from sklearn.tree import export_graphviz, DecisionTreeClassifier
import graphviz

class dataprocess:
    def __init__(self, data_path):
        self.threshold = 130000
        self.data = pd.read_csv(data_path).values
        self.GPA = np.array(self.data[:, 0])

        # 将薪资表的数据读取出来，并且进行分割
        self.Internships = np.array(self.data[:, 1])
        self.Competitions = np.array(self.data[:, 2])
        self.Penalties = np.array(self.data[:, 3])
        self.Nationality = np.array(self.data[:, 4])
        self.Salary = np.array(self.data[:, 5])
        # 将工资大于阈值的标记为1，小于的标记为0
        self.Salary_Classification = np.where(self.Salary > self.threshold, 1, 0)
        self.C = np.array([self.data[100 * i:100 * (i + 1), :] for i in range(5)])

        # 将国家数据转换为one-hot编码
        self.Nationality_onehot = self.one_hot_encoder(self.Nationality)

        self.batchs = [1, 4, 8, 16]

        self.weights = None
        self.feature = None

        self.beta = None  # 存储训练后的权重

    #数据可视化操作
    def Visualization(self):
        # 薪资统计
        fig = plt.figure(figsize=(16, 12))
        ax1 = fig.add_subplot(221)
        ax1.hist(self.Salary, bins=15, color='#f59311', alpha=0.5, edgecolor='k')
        ax1.set_title('Salary')
        ax1.set_xlabel('Salary')
        ax1.set_ylabel('Frequency')

        # 将GPA、实习、竞赛、罚款与薪资的关系可视化
        ax2 = fig.add_subplot(222)
        ax2.scatter(self.GPA, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
        ax2.set_title('GPA vs Salary')
        ax2.set_xlabel('GPA')
        ax2.set_ylabel('Salary')

        ax3 = fig.add_subplot(223)
        ax3.scatter(self.Internships, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
        ax3.set_title('Internships vs Salary')
        ax3.set_xlabel('Internships')
        ax3.set_ylabel('Salary')

        ax4 = fig.add_subplot(224)
        ax4.scatter(self.Competitions, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
        ax4.set_title('Competitions vs Salary')
        ax4.set_xlabel('Competitions')
        ax4.set_ylabel('Salary')

        plt.show()

        fig = plt.figure(figsize=(16, 12))
        ax5 = fig.add_subplot(231)
        ax5.scatter(self.Penalties, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
        ax5.set_title('Penalties vs Salary')
        ax5.set_xlabel('Penalties')
        ax5.set_ylabel('Salary')

        ax6 = fig.add_subplot(232)
        ax6.scatter(self.Nationality, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
        ax6.set_title('Nationality vs Salary')
        ax6.set_xlabel('Nationality')
        ax6.set_ylabel('Salary')

        plt.show()

    # 计算均值和方差
    def average(self, C, K=None):
        # data = np.array([])
        # for i in range(5):
        #     if i is not K:
        #         data = np.hstack((data, C[100*i:100*(i+1)]))
        return np.mean(C, axis=0)

    def variance(self, C, K=None):
        # data = np.array([])
        # for i in range(5):
        #     if i is not K:
        #         data = np.hstack((data, C[100 * i:100 * (i + 1)]))

        return np.var(C, axis=0)

    # One-Hot 编码
    def one_hot_encoder(self, Nationality):

        unique_categories = np.unique(Nationality)

        category_to_index = {category: index for index, category in enumerate(unique_categories)}

        # 初始化 One-Hot 编码矩阵
        Nationality_onehot = np.zeros((len(Nationality), len(unique_categories)))

        # 将类别映射为 One-Hot 向量
        for i, category in enumerate(Nationality):
            Nationality_onehot[i, category_to_index[category]] = 1

        # print("Unique categories:", unique_categories)
        # print("One-Hot Encoded matrix:")
        # print(self.Nationality_onehot)

        return Nationality_onehot

    # 将连续数据标准化
    def standardization(self, C, average, variance, k):
        return (C - average) / variance

    # 最小二乘法
    def Least_Square(self, train_data, train_label, test_data, test_label):
        X_transpose = np.transpose(train_data)

        XTX = np.dot(X_transpose, train_data)

        XTX_inv = np.linalg.inv(XTX)

        XTy = np.dot(X_transpose, train_label)

        self.beta = np.dot(XTX_inv, XTy)

        train_predictions = np.dot(train_data, self.beta)
        test_predictions = np.dot(test_data, self.beta)

        # 计算训练数据和测试数据的均方误差（MSE）
        train_mse = np.mean((train_predictions - train_label) ** 2)
        test_mse = np.mean((test_predictions - test_label) ** 2)

        # print("Model weights (beta):", self.beta)
        # print("Training MSE:", train_mse)
        # print("Test MSE:", test_mse)

        return self.beta, train_mse, test_mse

    # 对任务3的二分类数据进行处理
    def task3_dataprocess(self, i):
        # 数据归一化，以及相关的划分
        # 除薪资数据外，其余部分与 k-fold 中数据处理方法大致相同
        GPA_average = self.average(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
        GPA_variance = self.variance(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
        GPA = self.standardization(self.GPA, GPA_average, GPA_variance, i)

        GPA_test = GPA[100 * i:100 * (i + 1)]
        GPA = np.delete(GPA, slice(100 * i, 100 * (i + 1)), axis=0)

        Interships_test = self.Internships[100 * i:100 * (i + 1)]
        Interships = np.delete(self.Internships, slice(100 * i, 100 * (i + 1)), axis=0)

        Competitions_test = self.Competitions[100 * i:100 * (i + 1)]
        Competitions = np.delete(self.Competitions, slice(100 * i, 100 * (i + 1)), axis=0)

        Penalties_test = self.Penalties[100 * i:100 * (i + 1)]
        Penalties = np.delete(self.Penalties, slice(100 * i, 100 * (i + 1)), axis=0)

        Nationality_onehot = self.Nationality_onehot
        Nationality_test = Nationality_onehot[100 * i:100 * (i + 1)]
        Nationality = np.delete(Nationality_onehot, slice(100 * i, 100 * (i + 1)), axis=0)

        # Salary_average = self.average(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
        # Salary_variance = self.variance(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
        # Salary = self.standardization(self.Salary, Salary_average, Salary_variance, i)
        # high = 1, low = 0

        # 将薪资的数据进行比较替换
        Salary = self.Salary_Classification
        # print(Salary)
        # print(Salary.shape)
        Salary_test = Salary[100 * i:100 * (i + 1)]
        Salary = np.delete(Salary, slice(100 * i, 100 * (i + 1)), axis=0)

        # train_data = np.hstack(GPA, Interships, Competitions, Penalties, Nationality)
        train_data = np.concatenate((GPA[:, np.newaxis], Interships[:, np.newaxis], Competitions[:, np.newaxis],
                                     Penalties[:, np.newaxis], Nationality), axis=1).astype(float)

        train_label = Salary.astype(int)

        test_data = np.concatenate(
            (GPA_test[:, np.newaxis], Interships_test[:, np.newaxis], Competitions_test[:, np.newaxis],
             Penalties_test[:, np.newaxis], Nationality_test), axis=1).astype(float)
        test_label = Salary_test.astype(float)

        # 返回分好的训练数据和测试数据
        return train_data, train_label, test_data, test_label

    # i 表述第几个数据集
    def pre_dataprocess(self, i):
        # 数据归一化，以及相关的划分
        GPA_average = self.average(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
        GPA_variance = self.variance(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
        GPA = self.standardization(self.GPA, GPA_average, GPA_variance, i)

        GPA_test = GPA[100 * i:100 * (i + 1)]
        GPA = np.delete(GPA, slice(100 * i, 100 * (i + 1)), axis=0)

        Interships_test = self.Internships[100 * i:100 * (i + 1)]
        Interships = np.delete(self.Internships, slice(100 * i, 100 * (i + 1)), axis=0)

        Competitions_test = self.Competitions[100 * i:100 * (i + 1)]
        Competitions = np.delete(self.Competitions, slice(100 * i, 100 * (i + 1)), axis=0)

        Penalties_test = self.Penalties[100 * i:100 * (i + 1)]
        Penalties = np.delete(self.Penalties, slice(100 * i, 100 * (i + 1)), axis=0)

        Nationality_onehot = self.Nationality_onehot
        Nationality_test = Nationality_onehot[100 * i:100 * (i + 1)]
        Nationality = np.delete(Nationality_onehot, slice(100 * i, 100 * (i + 1)), axis=0)

        Salary_average = self.average(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
        Salary_variance = self.variance(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
        Salary = self.standardization(self.Salary, Salary_average, Salary_variance, i)

        Salary_test = Salary[100 * i:100 * (i + 1)]
        Salary = np.delete(Salary, slice(100 * i, 100 * (i + 1)), axis=0)

        # train_data = np.hstack(GPA, Interships, Competitions, Penalties, Nationality)
        train_data = np.concatenate((GPA[:, np.newaxis], Interships[:, np.newaxis], Competitions[:, np.newaxis],
                                     Penalties[:, np.newaxis], Nationality), axis=1).astype(float)

        train_label = Salary.astype(float)

        test_data = np.concatenate(
            (GPA_test[:, np.newaxis], Interships_test[:, np.newaxis], Competitions_test[:, np.newaxis],
             Penalties_test[:, np.newaxis], Nationality_test), axis=1).astype(float)
        test_label = Salary_test.astype(float)

        return train_data, train_label, test_data, test_label

    def K_fold(self, k=5):
        # GPA, Internships, Competitions, Penalties, Nationality, Salary
        # 最小二乘法的储存
        beta_array, train_mse_array, test_mse_array = [], [], []
        # SGD的数据结构
        # weight_dict, bias_dict, train_loss_dict, test_loss_dict = {}, {}, {}, {}
        for i in range(k):
            train_data, train_label, test_data, test_label = self.pre_dataprocess(i)
            # 最小二乘法
            beta, train_mse, test_mse = self.Least_Square(train_data, train_label, test_data, test_label)

            beta_array.append(beta)
            train_mse_array.append(train_mse)
            test_mse_array.append(test_mse)


        print("############################################")
        print('my K-fold function:')
        print(f"beta: {np.mean(beta_array)} + {np.var(beta_array)}")
        print(f"train_mse: {np.mean(train_mse_array)} + {np.var(train_mse_array)}")
        print(f"test_mse: {np.mean(test_mse_array)} + {np.var(test_mse_array)}")

        x_labels = [f'Run {i + 1}' for i in range(5)]
        x = np.arange(len(x_labels))  # X 轴的位置
        width = 0.35  # 柱子的宽度

        fig, ax = plt.subplots()
        # 绘制柱状图

        train_mse_array_scaled = [value * 1e15 for value in train_mse_array]
        test_mse_array_scaled = [value * 1e15 for value in test_mse_array]

        # 绘制训练 MSE 和测试 MSE 的柱状图
        rects1 = ax.bar(x - width / 2, train_mse_array_scaled, width, label='Training MSE', color='b')
        rects2 = ax.bar(x + width / 2, test_mse_array_scaled, width, label='Testing MSE', color='r')

        ax.set_ylabel('MSE / e^-15')
        ax.set_title('Training and Testing MSE for Each Run')
        ax.set_xticks(x)
        ax.set_xticklabels(x_labels)
        ax.legend()

        for rect in rects1:
            height = rect.get_height()
            ax.annotate(f'{height:.2f}',
                        xy=(rect.get_x() + rect.get_width() / 2, height),
                        xytext=(0, 3),  # 偏移量
                        textcoords="offset points",
                        ha='center', va='bottom')

        for rect in rects2:
            height = rect.get_height()
            ax.annotate(f'{height:.2f}',
                        xy=(rect.get_x() + rect.get_width() / 2, height),
                        xytext=(0, 3),  # 偏移量
                        textcoords="offset points",
                        ha='center', va='bottom')

        plt.tight_layout()
        plt.show()

        ####################################
        # sklearn库进行验证
        ####################################
        kf = KFold(n_splits=5)
        model = LinearRegression()
        train_mse_list, test_mse_list = [], []
        for train_index, test_index in kf.split(train_data):
            train_data_kf, test_data_kf = train_data[train_index], train_data[test_index]
            train_label_kf, test_label_kf = train_label[train_index], train_label[test_index]

            model.fit(train_data_kf, train_label_kf)

            train_pred = model.predict(train_data_kf)
            test_pred = model.predict(test_data_kf)

            train_mse = mean_squared_error(train_label_kf, train_pred)
            test_mse = mean_squared_error(test_label_kf, test_pred)

            train_mse_list.append(train_mse)
            test_mse_list.append(test_mse)

        print("############################################")
        print('sklearn K-fold function:')
        print("beta: ", model.intercept_)
        print("train_mse: ", np.mean(train_mse_list))
        print("test_mse: ", np.mean(test_mse_list))
        print("############################################")

    # SGD的实现
    def SGD(self, train_data, train_label, test_data, test_label, learning_rate=0.001, max_iter=100, batch=1):

        # 初始化相关参数
        n_samples, n_features = train_data.shape
        weights = np.zeros(n_features)
        bias = 0
        # 保存相关的 mse 差值
        train_loss_list, test_loss_list = [], []
        for epoch in range(max_iter):
            for i in range(0, n_samples, batch):
                X_batch = train_data[i:i + batch]
                y_batch = train_label[i:i + batch]

                y_pred = np.dot(X_batch, weights) + bias

                gradient_w = -2 * np.dot(X_batch.T, (y_batch - y_pred)) / batch
                gradient_b = -2 * np.mean(y_batch - y_pred)

                weights -= learning_rate * gradient_w
                bias -= learning_rate * gradient_b

            train_loss = np.mean((train_label - (np.dot(train_data, weights) + bias)) ** 2) * 1e11
            train_loss_list.append(train_loss)

        # 用训练好的参数预测测试集，计算 mse
        test_loss = np.mean((test_label - (np.dot(test_data, weights) + bias)) ** 2) * 1e11
        test_loss_list.append(test_loss)

            # if epoch % 10 == 0:
            #     print(f"Epoch {epoch + 1}/{max_iter}, Training Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}")

        return weights, bias, train_loss_list, test_loss_list

    def mysgd(self, k=5):
        # GPA, Internships, Competitions, Penalties, Nationality, Salary
        # SGD的数据结构
        batches = [1, 4, 8, 16]
        weight_dict, bias_dict, train_loss_dict, test_loss_dict = {}, {}, {}, {}

            # SGD算法
        for batch in batches:
            weight_b = []
            bias_b = []
            train_loss_b = []
            test_loss_b = []
            for i in range(k):
                train_data, train_label, test_data, test_label = self.pre_dataprocess(i)

                sgd_weight, sgd_bias, train_loss_mse, test_loss_mse = self.SGD(train_data, train_label, test_data, test_label, learning_rate=0.001, batch=batch)

                weight_b.append(sgd_weight)
                bias_b.append(sgd_bias)
                train_loss_b.append(train_loss_mse)
                test_loss_b.append(test_loss_mse)

            weight_dict[batch] = np.mean(weight_b, axis=0)
            bias_dict[batch] = np.mean(bias_b, axis=0)
            train_loss_dict[batch] = np.mean(train_loss_b, axis=0)
            test_loss_dict[batch] = np.mean(test_loss_b, axis=0)

        fig = plt.figure(figsize=(16, 12))
        print("############################################")
        print('my SGD function:')
        for batch in batches:
            print("############################################")
            print("batch: ", batch)
            print("weight: ", weight_dict[batch])
            print("bias: ", bias_dict[batch])
            # print("train_mse: ", train_loss_dict[batch])
            # print("test_mse: ", test_loss_dict[batch])
            ax = fig.add_subplot(2, 2, batches.index(batch) + 1)
            ax.plot(train_loss_dict[batch], label='train_mse')
            # ax.plot(test_loss_dict[batch], label='test_mse')
            ax.set_title(f'Batch Size: {batch}')
            ax.set_xlabel('Epoch')
            ax.set_ylabel('MSE ^ 1e-11')
            ax.legend()

        print("############################################")
        plt.show()

    # 作业一中的可视化，包括数据分布和K折交叉验证
    def task1(self):
        self.Visualization()
        self.K_fold()

    # 作业二中的SGD算法
    def task2(self):
        self.mysgd()


# 感知机算法
class Perceptron:
    def __init__(self, learning_rate=0.001, max_iter=100):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.weights = None
        self.bias = None
        self.dp = dataprocess('salary.csv')

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    def SGD(self, train_data, train_label, test_data, test_label, learning_rate=0.001, max_iter=100, batch=1):

        n_samples, n_features = train_data.shape
        weights = np.zeros(n_features)
        bias = 0
        train_loss_list, test_loss_list = [], []

        for epoch in range(max_iter):

            for i in range(0, n_samples, batch):
                X_batch = train_data[i:i + batch]
                y_batch = train_label[i:i + batch]

                y_pred = self.sigmoid(np.dot(X_batch, weights) + bias)

                gradient_w = -2 * np.dot(X_batch.T, (y_batch - y_pred)) / batch
                gradient_b = -2 * np.mean(y_batch - y_pred)

                weights -= learning_rate * gradient_w
                bias -= learning_rate * gradient_b

                y_train_pred = self.sigmoid(np.dot(train_data, weights) + bias)
                train_loss = -np.mean(
                    train_label * np.log(y_train_pred + 1e-9) + (1 - train_label) * np.log(1 - y_train_pred + 1e-9))
                train_loss_list.append(train_loss)

            y_test_pred = self.sigmoid(np.dot(test_data, weights) + bias)
            test_loss = -np.mean(test_label * np.log(y_test_pred + 1e-9) + (1 - test_label) * np.log(1 - y_test_pred + 1e-9))
            test_loss_list.append(test_loss)

            # if epoch % 10 == 0:
            #     print(f"Epoch {epoch + 1}/{max_iter}, Training Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}")

        return weights, bias, train_loss_list, test_loss_list

    def train(self, k=5):
        weight_list, bias_list,train_mse_list, test_mse_list = [], [], [], []
        print('training with Perceptron')
        for i in range(k):
            train_data, train_label, test_data, test_label = self.dp.task3_dataprocess(i)
            weight, bias, train_mse_error, test_mse_error = self.SGD(train_data, train_label, test_data, test_label, learning_rate=0.001, batch=1)

            weight_list.append(weight)
            bias_list.append(bias)
            train_mse_list.append(train_mse_error)
            test_mse_list.append(test_mse_error)

        print("############################################")
        print('Perceptron function:')
        print("############################################")
        print("weight: ", np.mean(weight_list, axis=0))
        print("bias: ", np.mean(bias_list, axis=0))
        print("train_mse: ", np.mean(train_mse_list, axis=0))
        print("test_mse: ", np.mean(test_mse_list, axis=0))

        plt.plot(np.mean(train_mse_list, axis=1), label='train_mse')
        # plt.plot(np.mean(test_mse_list, axis=0), label='test_mse')
        plt.xlabel('Epoch')
        plt.ylabel('MSE')
        plt.legend()
        plt.show()

class LogisticRegressionSGD:
    def __init__(self, learning_rate=0.001, max_iter=100):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.weights = None
        self.bias = None
        self.dp = dataprocess('salary.csv')

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    def SGD(self, train_data, train_label, test_data, test_label, learning_rate=0.001, max_iter=1000, batch=1):

        n_samples, n_features = train_data.shape
        weights = np.zeros(n_features)
        bias = 0
        train_loss_list, test_loss_list = [], []
        for epoch in range(max_iter):

            for i in range(0, n_samples, batch):
                X_batch = train_data[i:i + batch]
                y_batch = train_label[i:i + batch]

                y_pred = np.dot(X_batch, weights) + bias
                y_pred = self.sigmoid(y_pred)

                gradient_w = -2 * np.dot(X_batch.T, (y_batch - y_pred)) / batch
                gradient_b = -2 * np.mean(y_batch - y_pred)

                weights -= learning_rate * gradient_w
                bias -= learning_rate * gradient_b

            train_loss = -np.mean(y_batch * np.log(y_pred + 1e-9) + (1 - y_batch) * np.log(1 - y_pred + 1e-9))
            train_loss_list.append(train_loss)

        test_pred = self.sigmoid(np.dot(test_data, weights) + bias)
        test_loss = -np.mean(test_label * np.log(test_pred + 1e-9) + (1 - test_label) * np.log(1 - test_pred + 1e-9))
        test_loss_list.append(test_loss)

            # if epoch % 10 == 0:
            #     print(f"Epoch {epoch + 1}/{max_iter}, Training Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}")

        return weights, bias, train_loss_list, test_loss_list

    def train(self, k=5):
        weight_list, bias_list,train_mse_list, test_mse_list = [], [], [], []
        print('training Logistic Regression model...')
        for i in range(k):
            train_data, train_label, test_data, test_label = self.dp.task3_dataprocess(i)
            weight, bias, train_mse_error, test_mse_error = self.SGD(train_data, train_label, test_data, test_label, learning_rate=0.01, batch=1)

            weight_list.append(weight)
            bias_list.append(bias)
            train_mse_list.append(train_mse_error)
            test_mse_list.append(test_mse_error)

        print(test_mse_list)
        print("############################################")
        print('Logistic Regression function:')
        print("############################################")
        print("weight: ", np.mean(weight_list, axis=0))
        print("bias: ", np.mean(bias_list, axis=0))
        print("train_mse: ", np.mean(train_mse_list, axis=0))
        print("test_mse: ", np.mean(test_mse_list, axis=0))

        plt.plot(np.mean(train_mse_list, axis=0), label='train_mse')
        # plt.plot(np.mean(test_mse_list, axis=0), label='test_mse')
        plt.xlabel('Epoch')
        plt.ylabel('MSE')
        plt.legend()
        plt.show()


class DecisionTree:
    def __init__(self, epsilon=0.1, data_path='salary.csv'):
        self.epsilon = epsilon  # 停止生长的阈值
        self.tree = None
        self.dp = dataprocess(data_path)
        self.labels = ["GPA", "Internships", "Competitions", "Penalties", "Australia", "U.K.", "U.S."]

    def _grow_tree(self, X, y):
        num_samples = len(y)

        # 检查样本数量
        if num_samples < 2:  # 至少需要两个样本才能分割
            return {'label': np.argmax(np.bincount(y)), 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

        if len(set(y)) == 1:  # 如果所有样本属于同一类
            return {'label': y[0], 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

        probabilities = np.bincount(y) / num_samples
        if np.max(probabilities) >= (1 - self.epsilon):  # 如果几乎纯净
            return {'label': np.argmax(probabilities), 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

        best_feature = self._best_feature_to_split(X, y)
        threshold = np.median(X[:, best_feature])  # 使用中位数作为阈值
        left_indices = X[:, best_feature] <= threshold
        right_indices = X[:, best_feature] > threshold

        # 检查左右子集是否有样本
        if not np.any(left_indices) or not np.any(right_indices):
            return {'label': np.argmax(np.bincount(y)), 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

        tree = {
            'feature_index': best_feature,
            'threshold': threshold,
            'children': {
                'left': self._grow_tree(X[left_indices], y[left_indices]),
                'right': self._grow_tree(X[right_indices], y[right_indices])
            },
            'num_samples': num_samples,
            'loss': self._cross_entropy(y)
        }

        return tree

        # 为左右子树创建标签
        left_indices = X[:, best_feature] <= threshold
        right_indices = X[:, best_feature] > threshold

        tree['children']['left'] = self._grow_tree(X[left_indices], y[left_indices])
        tree['children']['right'] = self._grow_tree(X[right_indices], y[right_indices])

        return tree

    def _cross_entropy(self, y):
        probabilities = np.bincount(y) / len(y)
        return -np.sum(probabilities[probabilities > 0] * np.log(probabilities[probabilities > 0]))

    def _best_feature_to_split(self, X, y):
        num_features = X.shape[1]
        best_gain = -np.inf
        best_feature = -1

        for feature in range(num_features):
            gain = self._information_gain(X, y, feature)
            if gain > best_gain:
                best_gain = gain
                best_feature = feature

        return best_feature

    def _information_gain(self, X, y, feature_index):
        total_entropy = self._cross_entropy(y)

        # 检查样本数量
        if len(y) <= 1:  # 至少需要两个样本才能分割
            return 0

        # 计算阈值
        threshold = np.median(X[:, feature_index])  # 使用中位数作为阈值
        left_indices = X[:, feature_index] <= threshold
        right_indices = X[:, feature_index] > threshold

        # 检查左右子集是否有样本
        if not np.any(left_indices) or not np.any(right_indices):
            return 0  # 如果任一子集为空，返回0

        left_entropy = self._cross_entropy(y[left_indices])
        right_entropy = self._cross_entropy(y[right_indices])

        weighted_entropy = (
                (np.sum(left_indices) / len(y)) * left_entropy +
                (np.sum(right_indices) / len(y)) * right_entropy
        )

        return total_entropy - weighted_entropy

    def predict(self, X, tree):
        return np.array([self._predict_sample(sample, tree) for sample in X])

    def _predict_sample(self, sample, tree):
        if tree is None or 'label' in tree:
            return tree['label'] if tree is not None else None

        feature_value = sample[tree['feature_index']]
        if feature_value <= tree['threshold']:
            return self._predict_sample(sample, tree['children']['left'])
        else:
            return self._predict_sample(sample, tree['children']['right'])

    def train(self, X, y):
        return self._grow_tree(X, y)

    def K_fold(self, k=5):
        test_acc = 0
        train_acc = 0
        for i in range(k):
            train_data, train_label, test_data, test_label = self.dp.task3_dataprocess(i)
            tree = self.train(train_data, train_label)
            train_pred = self.predict(train_data, tree)
            test_pred = self.predict(test_data, tree)
            train_acc += np.sum(train_pred == train_label)
            test_acc += np.sum(test_pred == test_label)

        train_accuracy = train_acc / (len(train_label) * k)
        print(f'Tree decision K-fold train accuracy is: {train_accuracy}')
        test_accuracy = test_acc / (len(test_label) * k)
        print(f'Tree decision K-fold test accuracy is: {test_accuracy}')
        # print(f'train_data length is :{len(train_label)}')
        # print(f'test_data length is :{len(test_label)}')

        # 可视化，需要下载 graphviz
        # self.visualize(tree, train_data)

    def visualize(self, tree, X):
        dot = Digraph()
        # dot.attr(ranksep='1.5', nodesep='0.5')
        self._add_nodes_edges(dot, tree, X=X)
        dot.render('simple_decision_tree' + str(datetime.datetime.today()), format='png', view=True)

    def _add_nodes_edges(self, dot, tree, parent_name=None, X=None):
        if tree is None:
            return  # 如果树为空，直接返回

        if 'label' in tree:
            # print(f"Label: {tree['label']}")
            # print(f"Entropy: {tree['loss']:.3f}")
            # print(f"Samples: {tree['num_samples']}")
            return  # 不再添加叶节点

        feature_index = tree['feature_index']
        threshold = tree['threshold']
        feature_name = self.labels[feature_index]

        node_name = f"Node_{parent_name}_{feature_index}"
        dot.node(node_name, f"{feature_name} <= {threshold}")
        if parent_name:
            dot.edge(parent_name, node_name)

        # 递归处理子节点
        self._add_nodes_edges(dot, tree['children']['left'], node_name, X=X)
        self._add_nodes_edges(dot, tree['children']['right'], node_name, X=X)

    def sklearn_tree(self, X, y):
        clf = DecisionTreeClassifier(criterion='entropy', max_depth=5)
        clf.fit(X, y)

        dot_data = clf.export_graphviz(clf, out_file=None,feature_names=self.labels)
        dot = graphviz.Source(dot_data)
        dot.view()

class assignment2:
    def __init__(self, data_path):
        self.data_path = data_path
        self.dp = dataprocess(data_path)
        self.LR = LogisticRegressionSGD(data_path)
        self.p = Perceptron(data_path)

    # 作业一
    def task1(self):
        self.dp.task1()

    # 作业二
    def task2(self):
        self.dp.task2()

    # 作业三
    def task3(self):
        print("############################################")
        self.LR.train()
        print("############################################")
        self.p.train()
        print("############################################")
        for epsilon in [0.05, 0.1, 0.2]:
            tree = DecisionTree(epsilon=epsilon)
            tree.K_fold(5)
            print("############################################")



if __name__ == '__main__':
    data_path = 'salary.csv'

    homework = assignment2(data_path)
    print("############################################")
    print("PROBLEM 1		Regression on Salary")
    print("############################################")
    homework.task1()
    print("PROBLEM 2		Regression with SGD")
    print("############################################")
    homework.task2()
    print("############################################")
    print("PROBLEM 3		Classification on Salary")
    homework.task3()
    print("############################################")

PROBLEM 1

Visualization

salary

![[Pasted image 20241007152404.png]]

薪资是呈现一个正太分布的情况

大部分人的薪资水平都处于[100000， 175000]的区间

GPA

![[Pasted image 20241007152612.png]]

GPA与薪水呈现一个线性关系，随着 GPA 的上升，薪资呈现出一个上升的趋势。

GPA的分布主要集中在 [2.00， 4.00]之间

Internships

![[Pasted image 20241007153013.png]]

Internships 是一个离散型的数据，随着 internships 的升高，可以发现薪资的上限在不断加高，下限也在不断加高，同时分布集中点也在提高

Competitions

![[Pasted image 20241007153209.png]]

Competitions 是一个离散型的数据，随着 Competitions 的升高，可以发现薪资的上限在不断加高，下限也在不断加高，同时分布集中点也在提高

Penalties

![[Pasted image 20241007153348.png]]

Penalties 是一个离散型的数据，随着 Penalties 的升高，可以发现薪资的上限和下限在不断降低，但是受到处罚为 2.0 的人数相对较少

Nationality

![[Pasted image 20241007153538.png]]

Nationality 为离散数据，数据中 Australia, U.K.占数据的大部分，U.S. 人数明显少于其他两国

K-fold cross validation

![[Pasted image 20241007153827.png]]

最终实验数据如下：
############################################
my K-fold function:
beta: [ 4.86229593e-06  1.73602268e-05  8.67408337e-06 -3.47258147e-05
 -9.21844391e-06 -4.87899209e-06 -3.13396456e-06] ± [6.81674363e-08 4.79016568e-07 2.40329394e-07 9.61716710e-07
 7.41399918e-07 6.83886356e-07 6.64396429e-07]
 
train_mse: 5.465679839331648e-15 ± 3.2737986198218384e-16
test_mse: 5.641269773977836e-15 ± 4.453573881649367e-16
############################################
sklearn K-fold function:
beta:  [ 4.90709664e-06  1.82544487e-05  9.12638507e-06 -3.65263865e-05
 -3.65840432e-06  9.10871654e-07  2.74753266e-06]
 
train_mse:  6.041639987563109e-15
test_mse:  6.474849714869361e-15
###########################################

可以发现，与标准库相比，数据误差较小

Conclusions

训练MSE和测试MSE均在 $10^{-15}$ 到 $10^{-16}$ 的数量级，显示出模型对数据的拟合非常精确，几乎没有预测误差。这表明模型能够很好地捕捉数据中的趋势和关系。

数据中包括了多种特征（如GPA、实习经历、竞赛次数和违规记录），这些特征可能对薪资有显著影响，并且呈现线性关系。

PROBLEM 2 Regression with SGD

![[Pasted image 20241007160209.png]]

实验数据如下:
(其中train_mse,test_mse由于plt显示精度有限，均放大了1e11倍，上图为每次迭代时候train_mse 的变化图片)

############################################
my SGD function:
############################################
batch:  1
weight:  [ 4.86256567e-06  1.73605319e-05  8.67414653e-06 -3.47259825e-05
 -4.91039945e-06 -5.71725846e-07  1.17409987e-06]
bias:  -4.308025419954941e-06
test_mse:  [0.00057053]
############################################
batch:  4
weight:  [ 4.86068715e-06  1.73742397e-05  8.66585492e-06 -3.47046536e-05
 -4.84013736e-06 -5.06873714e-07  9.74622112e-07]
bias:  -4.372388966869282e-06
test_mse:  [0.00128614]
############################################
batch:  8
weight:  [ 4.86423291e-06  1.71428948e-05  8.36850732e-06 -3.43209527e-05
 -4.63462739e-06 -2.40183117e-07  5.75862310e-07]
bias:  -4.298948202211374e-06
test_mse:  [0.02306401]
############################################
batch:  16
weight:  [ 4.90350060e-06  1.55124551e-05  7.04386406e-06 -3.10383599e-05
 -4.11329838e-06  3.02350357e-07  1.62895767e-07]
bias:  -3.648052259500667e-06
test_mse:  [0.96528455]
############################################

![[Pasted image 20241007160746.png]]

将SGD所得权重与最小二乘法的的权重进行比较，发现每个特征的权重的误差都十分的小

![[Pasted image 20241007161822.png]]

在batch = 1, 4 ,8 的时候与 K- Fold 的 test-mse 相差十分的近，当达到 batch = 16 时可能由于学习率，或者迭代等因素，导致误差偏高

PROBLEM 3 Classification on Salary

Logistic Regression (LR)

![[Pasted image 20241007162255.png]]

随着训练次数的增加，MSE 逐渐减小，最后开始收敛在 0.3 附近

实验数据如下：
（train_mse 见上图）
############################################
Logistic Regression function:
############################################
weight:  [  3.44120469  11.92428137   6.2301448  -23.86514155  -2.55695149
   0.09951324   1.63580443]
bias:  -0.8216338251227515
test_mse:  [0.0390241]

Perceptron

![[Pasted image 20241007163113.png]]

############################################
Perceptron function:
############################################
weight:  [ 0.72449474  2.15297636  0.97787551 -4.20352564 -0.25714974  0.16387187
  0.21670045]
bias:  0.12342257835854971
test_mse:  [0.16665284]
############################################

Decision Tree (ID3)

决策树示意图如下

![[Pasted image 20241007163344.png]]

############################################
epsilon = 0.05
Tree decision K-fold train accuracy is: 0.9815
Tree decision K-fold test accuracy is: 0.908
############################################
epsilon = 0.1
Tree decision K-fold train accuracy is: 0.9765
Tree decision K-fold test accuracy is: 0.91
############################################
epsilon = 0.2
Tree decision K-fold train accuracy is: 0.84
Tree decision K-fold test accuracy is: 0.84
############################################

Conclusion

从实验结果来看，逻辑回归在测试均方误差上表现出色，表明其适合于处理线性关系的分类问题。感知器在复杂数据集上表现不足，未能有效收敛。决策树则在不同的停止标准下表现出色，显示出较高的准确性，特别是在严格的停止标准下。

通过本实验可以看出，不同模型在相同数据集上会有不同的表现，需要我们合理选择模型