week 2

PROBLEM 1 Regression on Salary

1
2
3
4
5
6
7
8
9
10
11
12
13
1. Description. As learned in class, you want to analyze what factors would affect the salary of graduates’ first jobs, and then you collect 500 graduates’ data for analysis. See the “salary.csv” file attached. There are 5 features (GPA, # of internships, # of Competitions, # of Penalties, and Nationality) and 1 label (Salary). Use a linear regression model to analyze this dataset.

2. Visualization. The very first step is to familiarize yourself with the data. You will need to make 6 plots. The first is a salary histogram, helping you understand its distribution. Then, draw feature-label scattered plots for each feature vs. label to see their correlations (5 features, so 5 plots). Briefly state your findings from these plots.

3. K-fold cross validation. Use K=5. Evenly partition the dataset into 5 chunks, denoted as C1, C2, C3, C4, C5, and do 5 runs of experiments.
For each run i = 1, 2, 3, 4, 5:
Use Ci as test data and the rest as training data;
Data preprocessing. Standardize the data for each feature and label. Note for continuous columns such as GPA and Salary, only use training data to compute the sample mean and variance, and use them to normalize both training data and test data.
Use Least Square to find the best model weights on training data, and evaluate the model on test data. Report the model weights, training MSE, and test MSE.
After all the five runs, use a bar chart to visualize the five runs’ training MSE and test MSE, and report an averaged MSE ± 1 standard deviation.

4. Draw conclusions. Based on your model results, briefly describe the experiment results, analyze the results, and draw conclusions.
5. Report. You will need to compose a report containing the abovementioned figures, observations, analysis, and conclusions. Try to write it as professionally as you can. Make it concise and elegant.

report

source code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

# 为了和自己计算的数据进行对比,只写了作业1的K-fold
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from graphviz import Digraph
from sklearn.tree import export_graphviz, DecisionTreeClassifier
import graphviz

class dataprocess:
def __init__(self, data_path):
self.threshold = 130000
self.data = pd.read_csv(data_path).values
self.GPA = np.array(self.data[:, 0])

# 将薪资表的数据读取出来,并且进行分割
self.Internships = np.array(self.data[:, 1])
self.Competitions = np.array(self.data[:, 2])
self.Penalties = np.array(self.data[:, 3])
self.Nationality = np.array(self.data[:, 4])
self.Salary = np.array(self.data[:, 5])
# 将工资大于阈值的标记为1,小于的标记为0
self.Salary_Classification = np.where(self.Salary > self.threshold, 1, 0)
self.C = np.array([self.data[100 * i:100 * (i + 1), :] for i in range(5)])

# 将国家数据转换为one-hot编码
self.Nationality_onehot = self.one_hot_encoder(self.Nationality)

self.batchs = [1, 4, 8, 16]

self.weights = None
self.feature = None

self.beta = None # 存储训练后的权重

#数据可视化操作
def Visualization(self):
# 薪资统计
fig = plt.figure(figsize=(16, 12))
ax1 = fig.add_subplot(221)
ax1.hist(self.Salary, bins=15, color='#f59311', alpha=0.5, edgecolor='k')
ax1.set_title('Salary')
ax1.set_xlabel('Salary')
ax1.set_ylabel('Frequency')

# 将GPA、实习、竞赛、罚款与薪资的关系可视化
ax2 = fig.add_subplot(222)
ax2.scatter(self.GPA, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
ax2.set_title('GPA vs Salary')
ax2.set_xlabel('GPA')
ax2.set_ylabel('Salary')

ax3 = fig.add_subplot(223)
ax3.scatter(self.Internships, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
ax3.set_title('Internships vs Salary')
ax3.set_xlabel('Internships')
ax3.set_ylabel('Salary')

ax4 = fig.add_subplot(224)
ax4.scatter(self.Competitions, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
ax4.set_title('Competitions vs Salary')
ax4.set_xlabel('Competitions')
ax4.set_ylabel('Salary')

plt.show()

fig = plt.figure(figsize=(16, 12))
ax5 = fig.add_subplot(231)
ax5.scatter(self.Penalties, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
ax5.set_title('Penalties vs Salary')
ax5.set_xlabel('Penalties')
ax5.set_ylabel('Salary')

ax6 = fig.add_subplot(232)
ax6.scatter(self.Nationality, self.Salary, color='#f59311', alpha=0.5, edgecolor='k')
ax6.set_title('Nationality vs Salary')
ax6.set_xlabel('Nationality')
ax6.set_ylabel('Salary')

plt.show()

# 计算均值和方差
def average(self, C, K=None):
# data = np.array([])
# for i in range(5):
# if i is not K:
# data = np.hstack((data, C[100*i:100*(i+1)]))
return np.mean(C, axis=0)

def variance(self, C, K=None):
# data = np.array([])
# for i in range(5):
# if i is not K:
# data = np.hstack((data, C[100 * i:100 * (i + 1)]))

return np.var(C, axis=0)

# One-Hot 编码
def one_hot_encoder(self, Nationality):

unique_categories = np.unique(Nationality)

category_to_index = {category: index for index, category in enumerate(unique_categories)}

# 初始化 One-Hot 编码矩阵
Nationality_onehot = np.zeros((len(Nationality), len(unique_categories)))

# 将类别映射为 One-Hot 向量
for i, category in enumerate(Nationality):
Nationality_onehot[i, category_to_index[category]] = 1

# print("Unique categories:", unique_categories)
# print("One-Hot Encoded matrix:")
# print(self.Nationality_onehot)

return Nationality_onehot

# 将连续数据标准化
def standardization(self, C, average, variance, k):
return (C - average) / variance

# 最小二乘法
def Least_Square(self, train_data, train_label, test_data, test_label):
X_transpose = np.transpose(train_data)

XTX = np.dot(X_transpose, train_data)

XTX_inv = np.linalg.inv(XTX)

XTy = np.dot(X_transpose, train_label)

self.beta = np.dot(XTX_inv, XTy)

train_predictions = np.dot(train_data, self.beta)
test_predictions = np.dot(test_data, self.beta)

# 计算训练数据和测试数据的均方误差(MSE)
train_mse = np.mean((train_predictions - train_label) ** 2)
test_mse = np.mean((test_predictions - test_label) ** 2)

# print("Model weights (beta):", self.beta)
# print("Training MSE:", train_mse)
# print("Test MSE:", test_mse)

return self.beta, train_mse, test_mse

# 对任务3的二分类数据进行处理
def task3_dataprocess(self, i):
# 数据归一化,以及相关的划分
# 除薪资数据外,其余部分与 k-fold 中数据处理方法大致相同
GPA_average = self.average(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
GPA_variance = self.variance(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
GPA = self.standardization(self.GPA, GPA_average, GPA_variance, i)

GPA_test = GPA[100 * i:100 * (i + 1)]
GPA = np.delete(GPA, slice(100 * i, 100 * (i + 1)), axis=0)

Interships_test = self.Internships[100 * i:100 * (i + 1)]
Interships = np.delete(self.Internships, slice(100 * i, 100 * (i + 1)), axis=0)

Competitions_test = self.Competitions[100 * i:100 * (i + 1)]
Competitions = np.delete(self.Competitions, slice(100 * i, 100 * (i + 1)), axis=0)

Penalties_test = self.Penalties[100 * i:100 * (i + 1)]
Penalties = np.delete(self.Penalties, slice(100 * i, 100 * (i + 1)), axis=0)

Nationality_onehot = self.Nationality_onehot
Nationality_test = Nationality_onehot[100 * i:100 * (i + 1)]
Nationality = np.delete(Nationality_onehot, slice(100 * i, 100 * (i + 1)), axis=0)

# Salary_average = self.average(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
# Salary_variance = self.variance(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
# Salary = self.standardization(self.Salary, Salary_average, Salary_variance, i)
# high = 1, low = 0

# 将薪资的数据进行比较替换
Salary = self.Salary_Classification
# print(Salary)
# print(Salary.shape)
Salary_test = Salary[100 * i:100 * (i + 1)]
Salary = np.delete(Salary, slice(100 * i, 100 * (i + 1)), axis=0)

# train_data = np.hstack(GPA, Interships, Competitions, Penalties, Nationality)
train_data = np.concatenate((GPA[:, np.newaxis], Interships[:, np.newaxis], Competitions[:, np.newaxis],
Penalties[:, np.newaxis], Nationality), axis=1).astype(float)

train_label = Salary.astype(int)

test_data = np.concatenate(
(GPA_test[:, np.newaxis], Interships_test[:, np.newaxis], Competitions_test[:, np.newaxis],
Penalties_test[:, np.newaxis], Nationality_test), axis=1).astype(float)
test_label = Salary_test.astype(float)

# 返回分好的训练数据和测试数据
return train_data, train_label, test_data, test_label

# i 表述第几个数据集
def pre_dataprocess(self, i):
# 数据归一化,以及相关的划分
GPA_average = self.average(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
GPA_variance = self.variance(np.delete(self.GPA, slice(100 * i, 100 * (i + 1))), i)
GPA = self.standardization(self.GPA, GPA_average, GPA_variance, i)

GPA_test = GPA[100 * i:100 * (i + 1)]
GPA = np.delete(GPA, slice(100 * i, 100 * (i + 1)), axis=0)

Interships_test = self.Internships[100 * i:100 * (i + 1)]
Interships = np.delete(self.Internships, slice(100 * i, 100 * (i + 1)), axis=0)

Competitions_test = self.Competitions[100 * i:100 * (i + 1)]
Competitions = np.delete(self.Competitions, slice(100 * i, 100 * (i + 1)), axis=0)

Penalties_test = self.Penalties[100 * i:100 * (i + 1)]
Penalties = np.delete(self.Penalties, slice(100 * i, 100 * (i + 1)), axis=0)

Nationality_onehot = self.Nationality_onehot
Nationality_test = Nationality_onehot[100 * i:100 * (i + 1)]
Nationality = np.delete(Nationality_onehot, slice(100 * i, 100 * (i + 1)), axis=0)

Salary_average = self.average(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
Salary_variance = self.variance(np.delete(self.Salary, slice(100 * i, 100 * (i + 1))), i)
Salary = self.standardization(self.Salary, Salary_average, Salary_variance, i)

Salary_test = Salary[100 * i:100 * (i + 1)]
Salary = np.delete(Salary, slice(100 * i, 100 * (i + 1)), axis=0)

# train_data = np.hstack(GPA, Interships, Competitions, Penalties, Nationality)
train_data = np.concatenate((GPA[:, np.newaxis], Interships[:, np.newaxis], Competitions[:, np.newaxis],
Penalties[:, np.newaxis], Nationality), axis=1).astype(float)

train_label = Salary.astype(float)

test_data = np.concatenate(
(GPA_test[:, np.newaxis], Interships_test[:, np.newaxis], Competitions_test[:, np.newaxis],
Penalties_test[:, np.newaxis], Nationality_test), axis=1).astype(float)
test_label = Salary_test.astype(float)

return train_data, train_label, test_data, test_label

def K_fold(self, k=5):
# GPA, Internships, Competitions, Penalties, Nationality, Salary
# 最小二乘法的储存
beta_array, train_mse_array, test_mse_array = [], [], []
# SGD的数据结构
# weight_dict, bias_dict, train_loss_dict, test_loss_dict = {}, {}, {}, {}
for i in range(k):
train_data, train_label, test_data, test_label = self.pre_dataprocess(i)
# 最小二乘法
beta, train_mse, test_mse = self.Least_Square(train_data, train_label, test_data, test_label)

beta_array.append(beta)
train_mse_array.append(train_mse)
test_mse_array.append(test_mse)


print("############################################")
print('my K-fold function:')
print(f"beta: {np.mean(beta_array)} + {np.var(beta_array)}")
print(f"train_mse: {np.mean(train_mse_array)} + {np.var(train_mse_array)}")
print(f"test_mse: {np.mean(test_mse_array)} + {np.var(test_mse_array)}")

x_labels = [f'Run {i + 1}' for i in range(5)]
x = np.arange(len(x_labels)) # X 轴的位置
width = 0.35 # 柱子的宽度

fig, ax = plt.subplots()
# 绘制柱状图

train_mse_array_scaled = [value * 1e15 for value in train_mse_array]
test_mse_array_scaled = [value * 1e15 for value in test_mse_array]

# 绘制训练 MSE 和测试 MSE 的柱状图
rects1 = ax.bar(x - width / 2, train_mse_array_scaled, width, label='Training MSE', color='b')
rects2 = ax.bar(x + width / 2, test_mse_array_scaled, width, label='Testing MSE', color='r')

ax.set_ylabel('MSE / e^-15')
ax.set_title('Training and Testing MSE for Each Run')
ax.set_xticks(x)
ax.set_xticklabels(x_labels)
ax.legend()

for rect in rects1:
height = rect.get_height()
ax.annotate(f'{height:.2f}',
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 偏移量
textcoords="offset points",
ha='center', va='bottom')

for rect in rects2:
height = rect.get_height()
ax.annotate(f'{height:.2f}',
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 偏移量
textcoords="offset points",
ha='center', va='bottom')

plt.tight_layout()
plt.show()

####################################
# sklearn库进行验证
####################################
kf = KFold(n_splits=5)
model = LinearRegression()
train_mse_list, test_mse_list = [], []
for train_index, test_index in kf.split(train_data):
train_data_kf, test_data_kf = train_data[train_index], train_data[test_index]
train_label_kf, test_label_kf = train_label[train_index], train_label[test_index]

model.fit(train_data_kf, train_label_kf)

train_pred = model.predict(train_data_kf)
test_pred = model.predict(test_data_kf)

train_mse = mean_squared_error(train_label_kf, train_pred)
test_mse = mean_squared_error(test_label_kf, test_pred)

train_mse_list.append(train_mse)
test_mse_list.append(test_mse)

print("############################################")
print('sklearn K-fold function:')
print("beta: ", model.intercept_)
print("train_mse: ", np.mean(train_mse_list))
print("test_mse: ", np.mean(test_mse_list))
print("############################################")

# SGD的实现
def SGD(self, train_data, train_label, test_data, test_label, learning_rate=0.001, max_iter=100, batch=1):

# 初始化相关参数
n_samples, n_features = train_data.shape
weights = np.zeros(n_features)
bias = 0
# 保存相关的 mse 差值
train_loss_list, test_loss_list = [], []
for epoch in range(max_iter):
for i in range(0, n_samples, batch):
X_batch = train_data[i:i + batch]
y_batch = train_label[i:i + batch]

y_pred = np.dot(X_batch, weights) + bias

gradient_w = -2 * np.dot(X_batch.T, (y_batch - y_pred)) / batch
gradient_b = -2 * np.mean(y_batch - y_pred)

weights -= learning_rate * gradient_w
bias -= learning_rate * gradient_b

train_loss = np.mean((train_label - (np.dot(train_data, weights) + bias)) ** 2) * 1e11
train_loss_list.append(train_loss)

# 用训练好的参数预测测试集,计算 mse
test_loss = np.mean((test_label - (np.dot(test_data, weights) + bias)) ** 2) * 1e11
test_loss_list.append(test_loss)

# if epoch % 10 == 0:
# print(f"Epoch {epoch + 1}/{max_iter}, Training Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}")

return weights, bias, train_loss_list, test_loss_list

def mysgd(self, k=5):
# GPA, Internships, Competitions, Penalties, Nationality, Salary
# SGD的数据结构
batches = [1, 4, 8, 16]
weight_dict, bias_dict, train_loss_dict, test_loss_dict = {}, {}, {}, {}

# SGD算法
for batch in batches:
weight_b = []
bias_b = []
train_loss_b = []
test_loss_b = []
for i in range(k):
train_data, train_label, test_data, test_label = self.pre_dataprocess(i)

sgd_weight, sgd_bias, train_loss_mse, test_loss_mse = self.SGD(train_data, train_label, test_data, test_label, learning_rate=0.001, batch=batch)

weight_b.append(sgd_weight)
bias_b.append(sgd_bias)
train_loss_b.append(train_loss_mse)
test_loss_b.append(test_loss_mse)

weight_dict[batch] = np.mean(weight_b, axis=0)
bias_dict[batch] = np.mean(bias_b, axis=0)
train_loss_dict[batch] = np.mean(train_loss_b, axis=0)
test_loss_dict[batch] = np.mean(test_loss_b, axis=0)

fig = plt.figure(figsize=(16, 12))
print("############################################")
print('my SGD function:')
for batch in batches:
print("############################################")
print("batch: ", batch)
print("weight: ", weight_dict[batch])
print("bias: ", bias_dict[batch])
# print("train_mse: ", train_loss_dict[batch])
# print("test_mse: ", test_loss_dict[batch])
ax = fig.add_subplot(2, 2, batches.index(batch) + 1)
ax.plot(train_loss_dict[batch], label='train_mse')
# ax.plot(test_loss_dict[batch], label='test_mse')
ax.set_title(f'Batch Size: {batch}')
ax.set_xlabel('Epoch')
ax.set_ylabel('MSE ^ 1e-11')
ax.legend()

print("############################################")
plt.show()

# 作业一中的可视化,包括数据分布和K折交叉验证
def task1(self):
self.Visualization()
self.K_fold()

# 作业二中的SGD算法
def task2(self):
self.mysgd()


# 感知机算法
class Perceptron:
def __init__(self, learning_rate=0.001, max_iter=100):
self.learning_rate = learning_rate
self.max_iter = max_iter
self.weights = None
self.bias = None
self.dp = dataprocess('salary.csv')

def sigmoid(self, x):
return 1 / (1 + np.exp(-x))
def SGD(self, train_data, train_label, test_data, test_label, learning_rate=0.001, max_iter=100, batch=1):

n_samples, n_features = train_data.shape
weights = np.zeros(n_features)
bias = 0
train_loss_list, test_loss_list = [], []

for epoch in range(max_iter):

for i in range(0, n_samples, batch):
X_batch = train_data[i:i + batch]
y_batch = train_label[i:i + batch]

y_pred = self.sigmoid(np.dot(X_batch, weights) + bias)

gradient_w = -2 * np.dot(X_batch.T, (y_batch - y_pred)) / batch
gradient_b = -2 * np.mean(y_batch - y_pred)

weights -= learning_rate * gradient_w
bias -= learning_rate * gradient_b

y_train_pred = self.sigmoid(np.dot(train_data, weights) + bias)
train_loss = -np.mean(
train_label * np.log(y_train_pred + 1e-9) + (1 - train_label) * np.log(1 - y_train_pred + 1e-9))
train_loss_list.append(train_loss)

y_test_pred = self.sigmoid(np.dot(test_data, weights) + bias)
test_loss = -np.mean(test_label * np.log(y_test_pred + 1e-9) + (1 - test_label) * np.log(1 - y_test_pred + 1e-9))
test_loss_list.append(test_loss)

# if epoch % 10 == 0:
# print(f"Epoch {epoch + 1}/{max_iter}, Training Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}")

return weights, bias, train_loss_list, test_loss_list

def train(self, k=5):
weight_list, bias_list,train_mse_list, test_mse_list = [], [], [], []
print('training with Perceptron')
for i in range(k):
train_data, train_label, test_data, test_label = self.dp.task3_dataprocess(i)
weight, bias, train_mse_error, test_mse_error = self.SGD(train_data, train_label, test_data, test_label, learning_rate=0.001, batch=1)

weight_list.append(weight)
bias_list.append(bias)
train_mse_list.append(train_mse_error)
test_mse_list.append(test_mse_error)

print("############################################")
print('Perceptron function:')
print("############################################")
print("weight: ", np.mean(weight_list, axis=0))
print("bias: ", np.mean(bias_list, axis=0))
print("train_mse: ", np.mean(train_mse_list, axis=0))
print("test_mse: ", np.mean(test_mse_list, axis=0))

plt.plot(np.mean(train_mse_list, axis=1), label='train_mse')
# plt.plot(np.mean(test_mse_list, axis=0), label='test_mse')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.legend()
plt.show()

class LogisticRegressionSGD:
def __init__(self, learning_rate=0.001, max_iter=100):
self.learning_rate = learning_rate
self.max_iter = max_iter
self.weights = None
self.bias = None
self.dp = dataprocess('salary.csv')

def sigmoid(self, z):
return 1 / (1 + np.exp(-z))
def SGD(self, train_data, train_label, test_data, test_label, learning_rate=0.001, max_iter=1000, batch=1):

n_samples, n_features = train_data.shape
weights = np.zeros(n_features)
bias = 0
train_loss_list, test_loss_list = [], []
for epoch in range(max_iter):

for i in range(0, n_samples, batch):
X_batch = train_data[i:i + batch]
y_batch = train_label[i:i + batch]

y_pred = np.dot(X_batch, weights) + bias
y_pred = self.sigmoid(y_pred)

gradient_w = -2 * np.dot(X_batch.T, (y_batch - y_pred)) / batch
gradient_b = -2 * np.mean(y_batch - y_pred)

weights -= learning_rate * gradient_w
bias -= learning_rate * gradient_b

train_loss = -np.mean(y_batch * np.log(y_pred + 1e-9) + (1 - y_batch) * np.log(1 - y_pred + 1e-9))
train_loss_list.append(train_loss)

test_pred = self.sigmoid(np.dot(test_data, weights) + bias)
test_loss = -np.mean(test_label * np.log(test_pred + 1e-9) + (1 - test_label) * np.log(1 - test_pred + 1e-9))
test_loss_list.append(test_loss)

# if epoch % 10 == 0:
# print(f"Epoch {epoch + 1}/{max_iter}, Training Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}")

return weights, bias, train_loss_list, test_loss_list

def train(self, k=5):
weight_list, bias_list,train_mse_list, test_mse_list = [], [], [], []
print('training Logistic Regression model...')
for i in range(k):
train_data, train_label, test_data, test_label = self.dp.task3_dataprocess(i)
weight, bias, train_mse_error, test_mse_error = self.SGD(train_data, train_label, test_data, test_label, learning_rate=0.01, batch=1)

weight_list.append(weight)
bias_list.append(bias)
train_mse_list.append(train_mse_error)
test_mse_list.append(test_mse_error)

print(test_mse_list)
print("############################################")
print('Logistic Regression function:')
print("############################################")
print("weight: ", np.mean(weight_list, axis=0))
print("bias: ", np.mean(bias_list, axis=0))
print("train_mse: ", np.mean(train_mse_list, axis=0))
print("test_mse: ", np.mean(test_mse_list, axis=0))

plt.plot(np.mean(train_mse_list, axis=0), label='train_mse')
# plt.plot(np.mean(test_mse_list, axis=0), label='test_mse')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.legend()
plt.show()


class DecisionTree:
def __init__(self, epsilon=0.1, data_path='salary.csv'):
self.epsilon = epsilon # 停止生长的阈值
self.tree = None
self.dp = dataprocess(data_path)
self.labels = ["GPA", "Internships", "Competitions", "Penalties", "Australia", "U.K.", "U.S."]

def _grow_tree(self, X, y):
num_samples = len(y)

# 检查样本数量
if num_samples < 2: # 至少需要两个样本才能分割
return {'label': np.argmax(np.bincount(y)), 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

if len(set(y)) == 1: # 如果所有样本属于同一类
return {'label': y[0], 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

probabilities = np.bincount(y) / num_samples
if np.max(probabilities) >= (1 - self.epsilon): # 如果几乎纯净
return {'label': np.argmax(probabilities), 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

best_feature = self._best_feature_to_split(X, y)
threshold = np.median(X[:, best_feature]) # 使用中位数作为阈值
left_indices = X[:, best_feature] <= threshold
right_indices = X[:, best_feature] > threshold

# 检查左右子集是否有样本
if not np.any(left_indices) or not np.any(right_indices):
return {'label': np.argmax(np.bincount(y)), 'num_samples': num_samples, 'loss': self._cross_entropy(y)}

tree = {
'feature_index': best_feature,
'threshold': threshold,
'children': {
'left': self._grow_tree(X[left_indices], y[left_indices]),
'right': self._grow_tree(X[right_indices], y[right_indices])
},
'num_samples': num_samples,
'loss': self._cross_entropy(y)
}

return tree

# 为左右子树创建标签
left_indices = X[:, best_feature] <= threshold
right_indices = X[:, best_feature] > threshold

tree['children']['left'] = self._grow_tree(X[left_indices], y[left_indices])
tree['children']['right'] = self._grow_tree(X[right_indices], y[right_indices])

return tree

def _cross_entropy(self, y):
probabilities = np.bincount(y) / len(y)
return -np.sum(probabilities[probabilities > 0] * np.log(probabilities[probabilities > 0]))

def _best_feature_to_split(self, X, y):
num_features = X.shape[1]
best_gain = -np.inf
best_feature = -1

for feature in range(num_features):
gain = self._information_gain(X, y, feature)
if gain > best_gain:
best_gain = gain
best_feature = feature

return best_feature

def _information_gain(self, X, y, feature_index):
total_entropy = self._cross_entropy(y)

# 检查样本数量
if len(y) <= 1: # 至少需要两个样本才能分割
return 0

# 计算阈值
threshold = np.median(X[:, feature_index]) # 使用中位数作为阈值
left_indices = X[:, feature_index] <= threshold
right_indices = X[:, feature_index] > threshold

# 检查左右子集是否有样本
if not np.any(left_indices) or not np.any(right_indices):
return 0 # 如果任一子集为空,返回0

left_entropy = self._cross_entropy(y[left_indices])
right_entropy = self._cross_entropy(y[right_indices])

weighted_entropy = (
(np.sum(left_indices) / len(y)) * left_entropy +
(np.sum(right_indices) / len(y)) * right_entropy
)

return total_entropy - weighted_entropy

def predict(self, X, tree):
return np.array([self._predict_sample(sample, tree) for sample in X])

def _predict_sample(self, sample, tree):
if tree is None or 'label' in tree:
return tree['label'] if tree is not None else None

feature_value = sample[tree['feature_index']]
if feature_value <= tree['threshold']:
return self._predict_sample(sample, tree['children']['left'])
else:
return self._predict_sample(sample, tree['children']['right'])

def train(self, X, y):
return self._grow_tree(X, y)

def K_fold(self, k=5):
test_acc = 0
train_acc = 0
for i in range(k):
train_data, train_label, test_data, test_label = self.dp.task3_dataprocess(i)
tree = self.train(train_data, train_label)
train_pred = self.predict(train_data, tree)
test_pred = self.predict(test_data, tree)
train_acc += np.sum(train_pred == train_label)
test_acc += np.sum(test_pred == test_label)

train_accuracy = train_acc / (len(train_label) * k)
print(f'Tree decision K-fold train accuracy is: {train_accuracy}')
test_accuracy = test_acc / (len(test_label) * k)
print(f'Tree decision K-fold test accuracy is: {test_accuracy}')
# print(f'train_data length is :{len(train_label)}')
# print(f'test_data length is :{len(test_label)}')

# 可视化,需要下载 graphviz
# self.visualize(tree, train_data)

def visualize(self, tree, X):
dot = Digraph()
# dot.attr(ranksep='1.5', nodesep='0.5')
self._add_nodes_edges(dot, tree, X=X)
dot.render('simple_decision_tree' + str(datetime.datetime.today()), format='png', view=True)

def _add_nodes_edges(self, dot, tree, parent_name=None, X=None):
if tree is None:
return # 如果树为空,直接返回

if 'label' in tree:
# print(f"Label: {tree['label']}")
# print(f"Entropy: {tree['loss']:.3f}")
# print(f"Samples: {tree['num_samples']}")
return # 不再添加叶节点

feature_index = tree['feature_index']
threshold = tree['threshold']
feature_name = self.labels[feature_index]

node_name = f"Node_{parent_name}_{feature_index}"
dot.node(node_name, f"{feature_name} <= {threshold}")
if parent_name:
dot.edge(parent_name, node_name)

# 递归处理子节点
self._add_nodes_edges(dot, tree['children']['left'], node_name, X=X)
self._add_nodes_edges(dot, tree['children']['right'], node_name, X=X)

def sklearn_tree(self, X, y):
clf = DecisionTreeClassifier(criterion='entropy', max_depth=5)
clf.fit(X, y)

dot_data = clf.export_graphviz(clf, out_file=None,feature_names=self.labels)
dot = graphviz.Source(dot_data)
dot.view()

class assignment2:
def __init__(self, data_path):
self.data_path = data_path
self.dp = dataprocess(data_path)
self.LR = LogisticRegressionSGD(data_path)
self.p = Perceptron(data_path)

# 作业一
def task1(self):
self.dp.task1()

# 作业二
def task2(self):
self.dp.task2()

# 作业三
def task3(self):
print("############################################")
self.LR.train()
print("############################################")
self.p.train()
print("############################################")
for epsilon in [0.05, 0.1, 0.2]:
tree = DecisionTree(epsilon=epsilon)
tree.K_fold(5)
print("############################################")



if __name__ == '__main__':
data_path = 'salary.csv'

homework = assignment2(data_path)
print("############################################")
print("PROBLEM 1 Regression on Salary")
print("############################################")
homework.task1()
print("PROBLEM 2 Regression with SGD")
print("############################################")
homework.task2()
print("############################################")
print("PROBLEM 3 Classification on Salary")
homework.task3()
print("############################################")

PROBLEM 1

Visualization

salary

![[Pasted image 20241007152404.png]]

  1. 薪资是呈现一个正太分布的情况
  2. 大部分人的薪资水平都处于[100000, 175000]的区间

GPA

![[Pasted image 20241007152612.png]]

  1. GPA与薪水呈现一个线性关系,随着 GPA 的上升,薪资呈现出一个上升的趋势。
  2. GPA的分布主要集中在 [2.00, 4.00]之间

Internships

![[Pasted image 20241007153013.png]]

  1. Internships 是一个离散型的数据,随着 internships 的升高,可以发现薪资的上限在不断加高,下限也在不断加高,同时分布集中点也在提高

Competitions

![[Pasted image 20241007153209.png]]

Competitions 是一个离散型的数据,随着 Competitions 的升高,可以发现薪资的上限在不断加高,下限也在不断加高,同时分布集中点也在提高

Penalties

![[Pasted image 20241007153348.png]]

Penalties 是一个离散型的数据,随着 Penalties 的升高,可以发现薪资的上限和下限在不断降低,但是受到处罚为 2.0 的人数相对较少

Nationality

![[Pasted image 20241007153538.png]]

Nationality 为离散数据,数据中 Australia, U.K.占数据的大部分,U.S. 人数明显少于其他两国

K-fold cross validation

![[Pasted image 20241007153827.png]]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
最终实验数据如下:
############################################
my K-fold function:
beta: [ 4.86229593e-06 1.73602268e-05 8.67408337e-06 -3.47258147e-05
-9.21844391e-06 -4.87899209e-06 -3.13396456e-06] ± [6.81674363e-08 4.79016568e-07 2.40329394e-07 9.61716710e-07
7.41399918e-07 6.83886356e-07 6.64396429e-07]

train_mse: 5.465679839331648e-15 ± 3.2737986198218384e-16
test_mse: 5.641269773977836e-15 ± 4.453573881649367e-16
############################################
sklearn K-fold function:
beta: [ 4.90709664e-06 1.82544487e-05 9.12638507e-06 -3.65263865e-05
-3.65840432e-06 9.10871654e-07 2.74753266e-06]

train_mse: 6.041639987563109e-15
test_mse: 6.474849714869361e-15
###########################################

可以发现,与标准库相比,数据误差较小

Conclusions

  1. 训练MSE和测试MSE均在 101510^{-15}101610^{-16} 的数量级,显示出模型对数据的拟合非常精确,几乎没有预测误差。这表明模型能够很好地捕捉数据中的趋势和关系。
  2. 数据中包括了多种特征(如GPA、实习经历、竞赛次数和违规记录),这些特征可能对薪资有显著影响,并且呈现线性关系。

PROBLEM 2           Regression with SGD

![[Pasted image 20241007160209.png]]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
实验数据如下:
(其中train_mse,test_mse由于plt显示精度有限,均放大了1e11倍,上图为每次迭代时候train_mse 的变化图片)

############################################
my SGD function:
############################################
batch: 1
weight: [ 4.86256567e-06 1.73605319e-05 8.67414653e-06 -3.47259825e-05
-4.91039945e-06 -5.71725846e-07 1.17409987e-06]
bias: -4.308025419954941e-06
test_mse: [0.00057053]
############################################
batch: 4
weight: [ 4.86068715e-06 1.73742397e-05 8.66585492e-06 -3.47046536e-05
-4.84013736e-06 -5.06873714e-07 9.74622112e-07]
bias: -4.372388966869282e-06
test_mse: [0.00128614]
############################################
batch: 8
weight: [ 4.86423291e-06 1.71428948e-05 8.36850732e-06 -3.43209527e-05
-4.63462739e-06 -2.40183117e-07 5.75862310e-07]
bias: -4.298948202211374e-06
test_mse: [0.02306401]
############################################
batch: 16
weight: [ 4.90350060e-06 1.55124551e-05 7.04386406e-06 -3.10383599e-05
-4.11329838e-06 3.02350357e-07 1.62895767e-07]
bias: -3.648052259500667e-06
test_mse: [0.96528455]
############################################

![[Pasted image 20241007160746.png]]

将SGD所得权重与最小二乘法的的权重进行比较,发现每个特征的权重的误差都十分的小

![[Pasted image 20241007161822.png]]

在batch = 1, 4 ,8 的时候与 K- Fold 的 test-mse 相差十分的近,当达到 batch = 16 时可能由于学习率,或者迭代等因素,导致误差偏高

PROBLEM 3           Classification on Salary

Logistic Regression (LR)

![[Pasted image 20241007162255.png]]

随着训练次数的增加,MSE 逐渐减小,最后开始收敛在 0.3 附近

1
2
3
4
5
6
7
8
9
实验数据如下:
(train_mse 见上图)
############################################
Logistic Regression function:
############################################
weight: [ 3.44120469 11.92428137 6.2301448 -23.86514155 -2.55695149
0.09951324 1.63580443]
bias: -0.8216338251227515
test_mse: [0.0390241]

Perceptron

![[Pasted image 20241007163113.png]]

1
2
3
4
5
6
7
8
############################################
Perceptron function:
############################################
weight: [ 0.72449474 2.15297636 0.97787551 -4.20352564 -0.25714974 0.16387187
0.21670045]
bias: 0.12342257835854971
test_mse: [0.16665284]
############################################

Decision Tree (ID3)

决策树示意图如下

![[Pasted image 20241007163344.png]]

1
2
3
4
5
6
7
8
9
10
11
12
13
############################################
epsilon = 0.05
Tree decision K-fold train accuracy is: 0.9815
Tree decision K-fold test accuracy is: 0.908
############################################
epsilon = 0.1
Tree decision K-fold train accuracy is: 0.9765
Tree decision K-fold test accuracy is: 0.91
############################################
epsilon = 0.2
Tree decision K-fold train accuracy is: 0.84
Tree decision K-fold test accuracy is: 0.84
############################################

Conclusion

从实验结果来看,逻辑回归在测试均方误差上表现出色,表明其适合于处理线性关系的分类问题。感知器在复杂数据集上表现不足,未能有效收敛。决策树则在不同的停止标准下表现出色,显示出较高的准确性,特别是在严格的停止标准下。

通过本实验可以看出,不同模型在相同数据集上会有不同的表现,需要我们合理选择模型