Tuning the KNN Algorithm

Published: 2020-07-22 15:09:32 Author: ckllf

Source: the web

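1. Methods used:

Cross-validation and grid search

Cross-validation (makes the evaluation of the model more accurate and trustworthy):

The training data is split into N equal parts; N parts means N-fold cross-validation. Each part serves once as the validation set while the remaining N-1 parts train the model, and the N validation scores are averaged.

Grid search: tunes parameters. For k-nearest neighbors the hyperparameter is K. Every candidate value of K is evaluated with cross-validation, and the value with the best average score is kept.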
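The two ideas above can be sketched in a few lines. This is a minimal illustration, not the article's own code: the six-sample toy array, the iris data set, and the candidate list [1, 3, 5] are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Part 1: how 3-fold cross-validation splits six toy samples.
toy = np.arange(12).reshape(6, 2)
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in KFold(n_splits=3).split(toy)]
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)

# Part 2: a hand-written grid search over K (iris is a stand-in data set).
X, y = load_iris(return_X_y=True)
mean_scores = {}
for k in [1, 3, 5]:
    model = KNeighborsClassifier(n_neighbors=k)
    # cross_val_score returns one accuracy per fold; average them.
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, "best K:", best_k)
```

GridSearchCV, described in the next section, automates the loop in Part 2 and refits the best model at the end.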

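2. API:

sklearn.model_selection.GridSearchCV: CV stands for cross validation

GridSearchCV(estimator, param_grid, cv=None)

. Performs an exhaustive search over the specified parameter values for an estimator

. estimator: the estimator object

. param_grid: the estimator parameters to try, as a dict, e.g. {"n_neighbors": [1, 3, 5]}

. cv: the number of cross-validation folds

. fit: takes the training data

. Analyzing the results:

. best_score_: the best mean score obtained in cross-validation

. best_estimator_: the model refitted with the best parameters

. cv_results_: the validation-set accuracy for every candidate and every fold; training-set accuracy is included only when return_train_score=True is passed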
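A short sketch of these attributes in use. The iris data set and cv=3 are assumptions made for illustration; best_params_ is an additional GridSearchCV attribute not listed above, shown here because it reports the chosen K directly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in data, not the article's

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    cv=3,
    return_train_score=True,  # needed for training scores in cv_results_
)
search.fit(X, y)

print("best_score_:", search.best_score_)
print("best_params_:", search.best_params_)
print("best_estimator_:", search.best_estimator_)
print("validation means:", search.cv_results_["mean_test_score"])
print("training means:", search.cv_results_["mean_train_score"])
```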

3. Tuning the earlier check-in location prediction example:

# -*- coding: utf-8 -*-
'''
@Author: Jason
'''
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn():
    '''
    Predict user check-in locations with k-nearest neighbors.
    :return: None
    '''
    # 1. Read the data
    data = pd.read_csv(r"./files/FBlocation/train.csv")
    print(data.head())

    # 2. Process the data
    # 2.1 Shrink the data set; query() filters rows much like a SQL WHERE clause
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75").copy()

    # 2.2 Convert the timestamp (in seconds) to datetime
    time_value = pd.to_datetime(data["time"], unit="s")
    print(time_value)

    # 2.3 Wrap it in a DatetimeIndex so that day, hour, weekday, etc. are attributes
    time_value = pd.DatetimeIndex(time_value)

    # 2.4 Build new features; year and month are the same for every row, so skip them
    data["day"] = time_value.day
    data["hour"] = time_value.hour
    data["weekday"] = time_value.weekday

    # 2.5 Drop the raw timestamp (axis=1 means columns in pandas)
    data = data.drop(["time"], axis=1)

    # 2.6 Remove places with too few check-ins
    place_count = data.groupby("place_id").count()  # check-ins per place_id
    # Keep places with more than 3 check-ins; reset_index() turns place_id back into a column
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data["place_id"].isin(tf.place_id)]

    # 2.7 Separate the target from the features
    y = data["place_id"]
    x = data.drop(["place_id"], axis=1)

    # 2.8 Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # 3. Feature engineering (standardization), for comparison with the earlier run
    std = StandardScaler()
    # Standardize the features only; the target place_id is a label and is not scaled
    x_train = std.fit_transform(x_train)
    # The scaler is already fitted on the training set, so only transform() the test set
    x_test = std.transform(x_test)

    # 4. Run the algorithm; n_neighbors is the hyperparameter to tune
    knc = KNeighborsClassifier(n_neighbors=5)

    # Candidate values for the search
    param = {"n_neighbors": [3, 5, 10]}

    # Run the grid search with 2-fold cross-validation
    gc = GridSearchCV(knc, param_grid=param, cv=2)
    gc.fit(x_train, y_train)

    # Report accuracy
    print("Accuracy on the test set:", gc.score(x_test, y_test))
    print("Best cross-validation result:", gc.best_score_)
    print("Best model selected:", gc.best_estimator_)
    print("Cross-validation results for every hyperparameter value:", gc.cv_results_)
    return None


if __name__ == "__main__":
    knn()
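The example above standardizes the features once, before the grid search runs. One possible alternative, sketched here and not part of the original example, is to put the scaler and the classifier into a scikit-learn Pipeline so that the scaler is refitted on the training portion of every fold. The wine data set, random_state=0, and the step names "scale" and "knn" are assumptions made so the sketch runs without train.csv.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; the article uses ./files/FBlocation/train.csv instead.
X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling and classification become one estimator.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param = {"knn__n_neighbors": [3, 5, 10]}

gc = GridSearchCV(pipe, param_grid=param, cv=2)
gc.fit(x_train, y_train)

test_accuracy = gc.score(x_test, y_test)
print("Accuracy on the test set:", test_accuracy)
print("Best parameters:", gc.best_params_)
```

With this layout the validation part of each fold never influences the scaler's mean and variance, and gc.score() applies the same scaling to the test set automatically.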

Result:

The output shows that the final model uses K = 10.
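The article's actual output is not reproduced here. As a rough sketch of how the selected K can be read from a fitted search, the following tabulates cv_results_ with pandas. The iris data set is a stand-in, so the K it selects says nothing about the check-in data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in data, not the check-in data

gc = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 10]},
    cv=2,
)
gc.fit(X, y)

# One row per candidate K, with its mean validation accuracy and rank.
table = pd.DataFrame(gc.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "rank_test_score"]
]
print(table.sort_values("rank_test_score"))

chosen_k = gc.best_params_["n_neighbors"]
print("Selected K:", chosen_k)
```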
