Excerpt: A List of Common Machine Learning Algorithms (with Python implementations)

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM (Support Vector Machine)
  5. Naive Bayes
  6. kNN (k-Nearest Neighbors)
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boosting algorithms
    1. GBM
    2. XGBoost
    3. LightGBM
    4. CatBoost

Broadly, there are 3 types of machine learning algorithms:

  • Supervised learning

How it works: This algorithm consists of a target/outcome variable (the dependent variable) that is to be predicted from a given set of predictors (the independent variables). Using this set of variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data.

Examples of supervised learning: Linear Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.

  • Unsupervised learning

How it works: In this algorithm, there is no target or outcome variable to predict/estimate. It is used for clustering a population into different groups, and it is widely used for segmenting customers into different groups for specific interventions.

Examples of unsupervised learning: the Apriori algorithm, K-Means.

  • Reinforcement learning:

How it works: Using this algorithm, the machine is trained to make specific decisions. It works like this: the machine is exposed to an environment where it continually trains itself through trial and error. The machine learns from past experience and tries to capture the best possible knowledge in order to make accurate business decisions.

Example of reinforcement learning: Markov Decision Processes.

A List of Common Machine Learning Algorithms

1. Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variables. Here, we establish the relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation \(Y = a * X + b\), where \(a\) is the slope of the line and \(b\) is the intercept.
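As a minimal illustration of the \(Y = a * X + b\) form, the slope and intercept can be recovered from data with an ordinary least-squares fit (the toy numbers below are made up for demonstration):

import numpy as np

# toy data: y is roughly 2*x + 1 plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# a degree-1 least-squares fit returns [a, b] for y = a*x + b
a, b = np.polyfit(x, y, 1)
print('slope a =', a, 'intercept b =', b)  # close to 2 and 1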

'''
The following code is for the Linear Regression
Created by- ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# read the train and test dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(train_data.head())

# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Item_Outlet_Sales

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'],axis=1)
train_y = train_data['Item_Outlet_Sales']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Item_Outlet_Sales'],axis=1)
test_y = test_data['Item_Outlet_Sales']

'''
Create the object of the Linear Regression model
You can also add other parameters and test your code here
Some parameters are : fit_intercept and n_jobs
Documentation of sklearn LinearRegression:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

'''
model = LinearRegression()

# fit the model with the training data
model.fit(train_x,train_y)

# coefficients of the trained model
print('\nCoefficient of model :', model.coef_)

# intercept of the model
print('\nIntercept of model',model.intercept_)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nItem_Outlet_Sales on training data',predict_train)

# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)

# predict the target on the testing dataset
predict_test = model.predict(test_x)
print('\nItem_Outlet_Sales on test data',predict_test)

# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)

2. Logistic Regression

Don't get confused by its name! It is a classification algorithm, not a regression one. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. In simple words, it predicts the probability of an event occurring by fitting the data to a logit function. Hence, it is also known as logit regression. Since it predicts probabilities, its output values lie between 0 and 1 (as expected).
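For intuition, the inverse of the logit function (the sigmoid) is what squashes any real-valued input into a probability between 0 and 1; a minimal sketch:

import numpy as np

def sigmoid(z):
    # the logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))
# roughly [0.018 0.269 0.5 0.731 0.982] -- always between 0 and 1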

'''
The following code is for Logistic Regression
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')


print(train_data.head())

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''
Create the object of the Logistic Regression model
You can also add other parameters and test your code here
Some parameters are : fit_intercept and penalty
Documentation of sklearn LogisticRegression:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

'''
model = LogisticRegression()

# fit the model with the training data
model.fit(train_x,train_y)

# coefficients of the trained model
print('Coefficient of model :', model.coef_)

# intercept of the model
print('Intercept of model',model.intercept_)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

3. Decision Tree

This is one of my favorite algorithms, and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/independent variables, so as to make the groups as distinct as possible.

[Figure: Decision Tree]

The best way to understand how a decision tree works is to play Jezzball, a classic game from Microsoft (image below). Essentially, you have a room with moving walls, and you need to create walls such that the largest possible area gets cleared of balls. So every time you split the room with a wall, you are trying to create two different populations within the same room; decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible.

[Figure: Jezzball]
'''
The following code is for Decision Tree
Created by - Analytics Vidhya
'''

# importing required libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''
Create the object of the Decision Tree model
You can also add other parameters and test your code here
Some parameters are : max_depth and max_features
Documentation of sklearn DecisionTreeClassifier:

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

'''
model = DecisionTreeClassifier()

# fit the model with the training data
model.fit(train_x,train_y)

# depth of the decision tree
print('Depth of the Decision Tree :', model.get_depth())

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

4. SVM (Support Vector Machine)

This is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we only had two features, such as the height and hair length of a person, we would first plot these two variables in two-dimensional space, where each point has two coordinates (these coordinates are known as support vectors).

[Figure: SVM1]

Now we will find a line that splits the data between the two differently classified groups of data. This will be the line such that the distance to the closest point in each of the two groups is as large as possible.

[Figure: SVM2]

In the example shown above, the line that splits the data into two differently classified groups is the black line, since the two closest points are the farthest away from it. This line is our classifier. Then, depending on which side of the line a test observation lands, that is the class we assign the new data to.
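A minimal sketch of this idea with a linear SVM on made-up 2-D data, reading back the separating line \(w \cdot x + b = 0\) and the support vectors (the data points here are illustrative, not from the article):

import numpy as np
from sklearn.svm import SVC

# two small, linearly separable groups (made-up points)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel='linear')
model.fit(X, y)

# w.x + b = 0 is the separating line; the closest points are the support vectors
print('w =', model.coef_, 'b =', model.intercept_)
print('support vectors:\n', model.support_vectors_)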

'''
The following code is for Support Vector Machines
Created by - ANALYTICS VIDHYA
'''
# importing required libraries
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''
Create the object of the Support Vector Classifier model
You can also add other parameters and test your code here
Some parameters are : kernel and degree
Documentation of sklearn Support Vector Classifier:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

'''
model = SVC()

# fit the model with the training data
model.fit(train_x,train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

5. Naive Bayes

This is a classification technique based on Bayes' theorem, with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, a Naive Bayes classifier would consider all of these properties to contribute independently to the probability that the fruit is an apple.

Naive Bayes models are easy to build and particularly useful for very large datasets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes' rule: \(P(c \mid x) = P(x \mid c) \, P(c) / P(x)\)
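A toy calculation with Bayes' rule, using the apple example above (all probabilities below are made-up numbers):

# P(apple | red) = P(red | apple) * P(apple) / P(red)
p_red_given_apple = 0.8   # likelihood (made up)
p_apple = 0.3             # prior (made up)
p_red = 0.4               # evidence (made up)

p_apple_given_red = p_red_given_apple * p_apple / p_red
print(p_apple_given_red)  # 0.6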
'''
The following code is for Naive Bayes
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''
Create the object of the Naive Bayes model
You can also add other parameters and test your code here
Some parameters are : var_smoothing
Documentation of sklearn GaussianNB:

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

'''
model = GaussianNB()

# fit the model with the training data
model.fit(train_x,train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

6. kNN (k-Nearest Neighbors)

It can be used for both classification and regression problems. However, it is more widely used for classification problems in industry. k-Nearest Neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors. The case being assigned to a class is the most common one among its k nearest neighbors, as measured by a distance function.

These distance functions can be the Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables, and the fourth (Hamming) for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a real challenge when performing kNN modeling.
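A minimal sketch of the four distance functions named above, using SciPy (the sample vectors are made up):

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(u, v))       # sqrt(sum((u - v)**2))
print(distance.cityblock(u, v))       # Manhattan: sum(|u - v|)
print(distance.minkowski(u, v, p=3))  # Minkowski distance of order 3
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of positions that differ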

[Figure: KNN]

kNN maps easily onto our real lives. If you want to learn about a person you have no information on, you might want to find out about their close friends and the circles they move in, and gain access to their information that way!

Things to consider before selecting kNN:

  1. kNN is computationally expensive
  2. Variables should be normalized, or else higher-range variables can bias it
  3. Work more on the pre-processing stage (e.g., outlier and noise removal) before going for kNN
'''
The following code is for the K-Nearest Neighbors
Created by - ANALYTICS VIDHYA
'''
# importing required libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''
Create the object of the K-Nearest Neighbor model
You can also add other parameters and test your code here
Some parameters are : n_neighbors, leaf_size
Documentation of sklearn K-Neighbors Classifier:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

'''
model = KNeighborsClassifier()

# fit the model with the training data
model.fit(train_x,train_y)

# Number of Neighbors used to predict the target
print('\nThe number of neighbors used to predict the target : ',model.n_neighbors)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

7. K-Means Clustering

This is a type of unsupervised algorithm that solves clustering problems. Its procedure follows a simple and easy way to classify a given dataset through a certain number of clusters (say, k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to peer groups.

Remember figuring out shapes from ink blots? K-Means is somewhat similar to that activity. You look at the shape and spread to decipher how many different clusters/populations are present!

[Figure: Ink]

How K-Means forms clusters (a minimal single-iteration sketch follows the list):

  1. K-Means picks k points for each cluster, known as centroids.
  2. Each data point forms a cluster with the closest centroids, i.e., k clusters.
  3. It finds the centroid of each cluster based on the existing cluster members. Here we have new centroids.
  4. As we have new centroids, repeat steps 2 and 3: find the closest distance of each data point from the new centroids and associate it with the new k clusters. Repeat this process until convergence occurs, i.e., until the centroids no longer change.
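A minimal numpy sketch of steps 1-4 on random toy data (not the article's code; the data is made up):

import numpy as np

def kmeans_step(X, centroids):
    # step 2: assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

X = np.random.rand(100, 2)
centroids = X[np.random.choice(len(X), 3, replace=False)]  # step 1: pick k points
for _ in range(10):  # step 4: repeat assignment/update until the centroids settle
    labels, centroids = kmeans_step(X, centroids)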

How to determine the value of K:

In K-Means, we have clusters, and each cluster has its own centroid. The sum of squared differences between a cluster's centroid and its data points constitutes the sum of squared values for that cluster. And when the sums of squared values of all the clusters are added up, the total becomes the sum of squared values for the cluster solution.

We know that this value keeps decreasing as the number of clusters increases, but if you plot the result, you may see that the sum of squared distances decreases sharply up to some value of k, and much more gradually after that. There, we can find the optimal number of clusters.
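A quick way to see that "elbow" in practice is to record KMeans' inertia_ (the within-cluster sum of squares described above) for several values of k; a sketch on placeholder data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # placeholder data; substitute your own feature matrix

# inertia_ is the total within-cluster sum of squared distances
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 10)]
print(inertias)  # plot against k and look for the point where the drop levels off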

[Figure: K-Means elbow plot]
'''
The following code is for the K-Means
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.cluster import KMeans

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to divide the training data into different clusters
# and predict in which cluster a particular data point belongs.

'''
Create the object of the K-Means model
You can also add other parameters and test your code here
Some parameters are : n_clusters and max_iter
Documentation of sklearn KMeans:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
'''

model = KMeans()

# fit the model with the training data
model.fit(train_data)

# Number of Clusters
print('\nDefault number of Clusters : ',model.n_clusters)

# predict the clusters on the train dataset
predict_train = model.predict(train_data)
print('\nClusters on train data',predict_train)

# predict the clusters on the test dataset
predict_test = model.predict(test_data)
print('Clusters on test data',predict_test)

# Now, we will train a model with n_cluster = 3
model_n3 = KMeans(n_clusters=3)

# fit the model with the training data
model_n3.fit(train_data)

# Number of Clusters
print('\nNumber of Clusters : ',model_n3.n_clusters)

# predict the clusters on the train dataset
predict_train_3 = model_n3.predict(train_data)
print('\nClusters on train data',predict_train_3)

# predict the clusters on the test dataset
predict_test_3 = model_n3.predict(test_data)
print('Clusters on test data',predict_test_3)

8. Random Forest

Random Forest is a trademarked term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (also known as a "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is planted and grown as follows (a parameter-level sketch follows the list):

  1. If the number of cases in the training set is N, a sample of N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m variables is used to split the node. The value of m is held constant while the forest grows.
  3. Each tree is grown to the largest extent possible. There is no pruning.
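These three steps map naturally onto scikit-learn's RandomForestClassifier parameters; a hedged sketch (the parameter values are illustrative, not prescriptions):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    bootstrap=True,       # step 1: each tree trains on a bootstrap sample of N cases
    max_features='sqrt',  # step 2: m << M features considered at each split
    max_depth=None,       # step 3: trees grown to the largest extent, no pruning
)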
'''
The following code is for the Random Forest
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# view the top 3 rows of the dataset
print(train_data.head(3))

# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''

Create the object of the Random Forest model
You can also add other parameters and test your code here
Some parameters are : n_estimators and max_depth
Documentation of sklearn RandomForestClassifier:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

'''
model = RandomForestClassifier()

# fit the model with the training data
model.fit(train_x,train_y)

# number of trees used
print('Number of Trees used : ', model.n_estimators)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

9. Dimensionality Reduction Algorithms

In the last 4-5 years, there has been an exponential increase in data capture at every possible stage. Corporations, government agencies, and research organizations are not only coming up with new data sources, but are also capturing data in great detail.

For example: e-commerce companies are capturing more details about customers, such as their demographics, web crawling history, what they like or dislike, purchase history, feedback, and many other things, giving them more personalized attention than your nearest grocery shopkeeper ever could.

As data scientists, the data we are offered also comes with many features. This sounds good for building a robust model, but there is a challenge: how do you identify the highly significant variables out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us, together with various other approaches such as decision trees, random forests, PCA, factor analysis, identification based on the correlation matrix, the missing value ratio, and others.

'''
The following code is for Principal Component Analysis (PCA)
Created by - ANALYTICS VIDHYA
'''
# importing required libraries
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# read the train and test dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# view the top 3 rows of the dataset
print(train_data.head(3))

# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Item_Outlet_Sales

# separate the independent and target variable on training data
# target variable - Item_Outlet_Sales
train_x = train_data.drop(columns=['Item_Outlet_Sales'],axis=1)
train_y = train_data['Item_Outlet_Sales']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Item_Outlet_Sales'],axis=1)
test_y = test_data['Item_Outlet_Sales']

print('\nTraining model with {} dimensions.'.format(train_x.shape[1]))

# create object of model
model = LinearRegression()

# fit the model with the training data
model.fit(train_x,train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)

# Root Mean Squared Error on train dataset
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)

# Root Mean Squared Error on test dataset
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)

# create the object of the PCA (Principal Component Analysis) model
# reduce the dimensions of the data to 12
'''
You can also add other parameters and test your code here
Some parameters are : svd_solver, iterated_power
Documentation of sklearn PCA:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
'''
model_pca = PCA(n_components=12)

new_train = model_pca.fit_transform(train_x)
# apply the components learned on the training data; do not re-fit on the test set
new_test = model_pca.transform(test_x)

print('\nTraining model with {} dimensions.'.format(new_train.shape[1]))

# create object of model
model_new = LinearRegression()

# fit the model with the training data
model_new.fit(new_train,train_y)

# predict the target on the new train dataset
predict_train_pca = model_new.predict(new_train)

# Root Mean Squared Error on new train dataset
rmse_train_pca = mean_squared_error(train_y,predict_train_pca)**(0.5)
print('\nRMSE on new train dataset : ', rmse_train_pca)

# predict the target on the new test dataset
predict_test_pca = model_new.predict(new_test)

# Root Mean Squared Error on new test dataset
rmse_test_pca = mean_squared_error(test_y,predict_test_pca)**(0.5)
print('\nRMSE on new test dataset : ', rmse_test_pca)

10. Gradient Boosting Algorithms

10.1 GBM

GBM is a boosting algorithm used when we deal with plenty of data and need predictions with high predictive power. Boosting is actually an ensemble of learning algorithms that combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. These boosting algorithms consistently work well in data science competitions like Kaggle, the AV Hackathon, and CrowdAnalytix.

'''
The following code is for Gradient Boosting
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''
Create the object of the GradientBoosting Classifier model
You can also add other parameters and test your code here
Some parameters are : learning_rate, n_estimators
Documentation of sklearn GradientBoosting Classifier:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
'''
model = GradientBoostingClassifier(n_estimators=100,max_depth=5)

# fit the model with the training data
model.fit(train_x,train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

10.2 XGBoost

Another classic gradient boosting algorithm, and one that is known to be the decisive choice between winning and losing in some Kaggle competitions.

XGBoost has immensely high predictive power, which makes it the best choice for accuracy in events, as it possesses both a linear model and a tree learning algorithm, making the algorithm almost 10x faster than existing gradient boosting techniques.

Its support includes various objective functions, including regression, classification, and ranking.

One of the most interesting things about XGBoost is that it is also called a regularized boosting technique. This helps to reduce overfitting, and it has massive support across a range of languages such as Scala, Java, R, Python, Julia, and C++.

It supports distributed and widespread training across many machines, encompassing GCE, AWS, Azure, and Yarn clusters. XGBoost can also be integrated with Spark, Flink, and other cloud dataflow systems, with built-in cross-validation at each iteration of the boosting process.
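As one example of that built-in cross-validation, the xgboost Python package exposes xgb.cv, which evaluates each boosting round across folds; a minimal sketch on random toy data (the data and parameter values are made up):

import numpy as np
import xgboost as xgb

# toy data; substitute your own features and binary labels
X = np.random.rand(200, 5)
y = np.random.randint(2, size=200)
dtrain = xgb.DMatrix(X, label=y)

params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1}
# cross-validate each boosting round over 5 folds, tracking AUC
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics='auc')
print(cv_results.tail())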

'''
The following code is for XGBoost
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''
Create the object of the XGBoost model
You can also add other parameters and test your code here
Some parameters are : max_depth and n_estimators
Documentation of xgboost:

https://xgboost.readthedocs.io/en/latest/
'''
model = XGBClassifier()

# fit the model with the training data
model.fit(train_x,train_y)


# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

10.3 LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Parallel and GPU learning supported
  • Capable of handling large-scale data

The framework is a fast, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. It was developed under the Distributed Machine Learning Toolkit project at Microsoft.

Since LightGBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So, when growing on the same leaf in LightGBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, resulting in much better accuracy that can rarely be achieved by existing boosting algorithms.

Also, it is surprisingly fast, hence the word "Light".

import numpy as np
import lightgbm as lgb

data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target

train_data = lgb.Dataset(data, label=label)
# validation data built the same way (the original snippet loaded a 'test.svm' file)
valid_data = lgb.Dataset(np.random.rand(100, 10),
                         label=np.random.randint(2, size=100),
                         reference=train_data)

param = {'num_leaves': 31, 'objective': 'binary', 'metric': 'auc'}

num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[valid_data])

bst.save_model('model.txt')

# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)

10.4 CatBoost

CatBoost is a recently open-sourced machine learning algorithm from Yandex. It can easily integrate with deep learning frameworks like Google's TensorFlow and Apple's Core ML.

The best part about CatBoost is that it does not require extensive data training like other ML models, and it can work with a variety of data formats without undermining how robust it can be.

Before proceeding with the implementation, make sure you handle missing data well.

CatBoost can automatically deal with categorical variables without throwing type conversion errors, which helps you focus on tuning your model better rather than sorting out trivial errors.

import pandas as pd
import numpy as np

from catboost import CatBoostRegressor

#Read training and testing files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

#Imputing missing values for both train and test
train.fillna(-999, inplace=True)
test.fillna(-999,inplace=True)

#Creating a training set for modeling and validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales

from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
categorical_features_indices = np.where(X.dtypes != float)[0]  # indices of non-float (categorical) columns

#building the model (CatBoostRegressor was already imported above)
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')

model.fit(X_train, y_train,cat_features=categorical_features_indices,eval_set=(X_validation, y_validation),plot=True)

submission = pd.DataFrame()

submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)

Excerpt from: Commonly used Machine Learning Algorithms (with Python and R Codes)
Author: SUNIL RAY