Data - Clustering - Case Study

0. Dataset overview

Title: Auto-Mpg Data
Sources:
(a) Origin: This dataset was taken from the StatLib library which is
maintained at Carnegie Mellon University. The dataset was
used in the 1983 American Statistical Association Exposition.
(c) Date: July 7, 1993
Past Usage:
- See 2b (above)
- Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning.
  In Proceedings on the Tenth International Conference of Machine
  Learning, 236-243, University of Massachusetts, Amherst. Morgan
  Kaufmann.
Relevant Information:

This dataset is a slightly modified version of the dataset provided in
the StatLib library. In line with the use by Ross Quinlan (1993) in
predicting the attribute "mpg", 8 of the original instances were removed
because they had unknown values for the "mpg" attribute. The original
dataset is available in the file "auto-mpg.data-original".

"The data concerns city-cycle fuel consumption in miles per gallon,
to be predicted in terms of 3 multivalued discrete and 5 continuous
attributes." (Quinlan, 1993)
Number of Instances: 398
Number of Attributes: 9 including the class attribute
Attribute Information:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Missing Attribute Values: horsepower has 6 missing values

1. Import the required packages

from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.cluster import MeanShift
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

2. Load the data

data=pd.read_csv("d:/datasets/auto-mpg.csv")

3. Explore the data

data.head()
data.info()
data.describe()

4. Data preprocessing

data_auto=data.drop("car name",axis=1)
#show the rows whose horsepower is recorded as the placeholder "?"
print(data_auto[data_auto["horsepower"]=="?"])
horse=data_auto["horsepower"].value_counts() #count each value, including the "?" placeholder
#drop the incomplete samples
data_auto.drop(data_auto[data_auto["horsepower"]=="?"].index,inplace=True)
data_auto.horsepower=data_auto.horsepower.astype("int64")
#standardize the features
model_sc=StandardScaler()
model_sc.fit(data_auto)
data_auto_sc=model_sc.transform(data_auto)
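
As an alternative to comparing against "?" by hand, the placeholder can be coerced to NaN and dropped. The snippet below is only a sketch on a made-up four-row table (the `toy` frame and its values are illustrative assumptions, not rows from auto-mpg.csv); it relies on `pd.to_numeric` with `errors="coerce"`.

```python
import pandas as pd

# Hypothetical miniature of the auto-mpg table; the values are illustrative only.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95", "?"],
})

# Turn non-numeric placeholders such as "?" into NaN.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Count the missing entries, then drop the incomplete rows.
n_missing = int(toy["horsepower"].isna().sum())
toy_clean = toy.dropna(subset=["horsepower"]).reset_index(drop=True)

print(n_missing)
print(toy_clean)
```

With this approach the column ends up as a float column, so no separate `astype` call is needed before scaling.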

5. Modeling

5.1 KMeans

model_km=KMeans(n_clusters=3,random_state=10)
model_km.fit(data_auto_sc)
auto_label=model_km.labels_
auto_cluster=model_km.cluster_centers_
pd.Series(auto_label).value_counts()
print(auto_cluster)
print(model_sc.inverse_transform(auto_cluster))

Finding the best value of K

for k in [2,3,4,6,300]:
    model_km=KMeans(n_clusters=k,random_state=10).fit(data_auto_sc)
    auto_label=model_km.labels_
    auto_cluster=model_km.cluster_centers_
    print(k," ",round(metrics.silhouette_score(data_auto_sc,auto_label),4))
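
The loop above only prints the scores. One possible way to pick K automatically is to keep the scores in a dictionary and take the K with the highest silhouette score. The sketch below assumes synthetic data from `make_blobs` (three well-separated groups, not the auto-mpg data) so that it runs on its own; `scores` and `best_k` are names introduced here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized data: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=0.5,
    random_state=10,
)

# Fit KMeans for each candidate K and record its silhouette score.
scores = {}
for k in [2, 3, 4, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest silhouette score is taken as the best candidate.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 4))
```

On real data the silhouette curve is usually much flatter than on this toy example, so the highest score is better treated as a hint than as a final answer.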

5.2 MeanShift

model_mn=MeanShift(bandwidth=2).fit(data_auto_sc)
auto_label=model_mn.labels_
auto_cluster=model_mn.cluster_centers_

bandwidth_grid=np.arange(1,2.5,0.2)
cluster_number=[]
slt_score=[]
for i in bandwidth_grid:
    model=MeanShift(bandwidth=i).fit(data_auto_sc)
    cluster_number.append(len(np.unique(model.labels_)))
    slt_score.append(metrics.silhouette_score(data_auto_sc,model.labels_))

from prettytable import PrettyTable
x = PrettyTable(["Bandwidth","Number of clusters","Silhouette score"])
#x.align["Bandwidth"] = "l" #left-align the bandwidth column
#x.padding_width = 1 #padding width
for i,j,k in zip(bandwidth_grid,cluster_number,slt_score):
    x.add_row([i,j,k])
print(x)
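
Instead of scanning a hand-picked grid, scikit-learn also offers `estimate_bandwidth`, which derives a bandwidth from nearest-neighbor distances. The following is a rough sketch on synthetic `make_blobs` data (an assumption made so the snippet is self-contained); the `quantile=0.3` value is an arbitrary illustrative choice, not a recommendation for the auto-mpg data.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardized data: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=0.5,
    random_state=10,
)

# Estimate a bandwidth from the data; quantile controls how local the estimate is.
bandwidth = estimate_bandwidth(X, quantile=0.3)

# Fit MeanShift with the estimated bandwidth and count the resulting clusters.
model = MeanShift(bandwidth=bandwidth).fit(X)
n_clusters = len(np.unique(model.labels_))
print(round(float(bandwidth), 3), n_clusters)
```

The estimated value can serve as the center of a grid such as `bandwidth_grid` rather than replacing the grid search altogether.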

5.3 AgglomerativeClustering

model=AgglomerativeClustering(n_clusters=3,linkage="average").fit(data_auto_sc)
auto_label=model.labels_

lbs=pd.Series(auto_label).value_counts()
#plt.bar(x=lbs.index,height=lbs )
lbs.plot(kind="bar",rot=0)

# Plot the dendrogram
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] #only needed to render Chinese text in the figure
#Use scipy's pdist, linkage and dendrogram functions to plot the dendrogram
#pdist returns the condensed pairwise distance matrix; linkage returns an ndarray describing how the clusters are merged
#dendrogram draws the tree
row_clusters = linkage(pdist(data_auto_sc,metric='euclidean'),method='ward')
fig = plt.figure(figsize=(16,8))
#p and truncate_mode truncate the dendrogram: the subtrees of some nodes are pruned, and the x-axis shows the number of samples in each remaining node
row_dendr = dendrogram(row_clusters, p=50, truncate_mode='lastp',color_threshold=5)
plt.title('Dendrogram', fontsize=15)
plt.tight_layout()
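
The linkage matrix used for the dendrogram can also be cut into flat cluster labels with scipy's `fcluster`. The sketch below assumes synthetic `make_blobs` data so that it runs without the auto-mpg file; `Z`, `labels` and `sizes` are names introduced here for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardized data: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=0.5,
    random_state=10,
)

# Ward linkage on the raw observations (Euclidean distance is implied).
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1.
labels = fcluster(Z, t=3, criterion="maxclust")

# Cluster sizes (index 0 of bincount is unused because labels start at 1).
sizes = np.bincount(labels)[1:]
print(sizes)
```

Passing `criterion="distance"` with a height threshold instead would cut the tree at a fixed level of the dendrogram's vertical axis.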

5.4 DBSCAN

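# Train the model
# data_auto_sc_0 was never defined above; the standardized data_auto_sc is assumed here
model = DBSCAN(eps=1,min_samples=2).fit(data_auto_sc)
# Cluster labels (-1 marks noise points)
auto_label = model.labels_
# Indices of the core samples
model.core_sample_indices_
# The core samples themselves
model.components_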

clu_num=[]
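for min_ in [1,3,5,7,9]:
    # data_auto_sc_0 was never defined above; the standardized data_auto_sc is assumed here
    model = DBSCAN(eps=1,min_samples=min_).fit(data_auto_sc)
    # Cluster labels of this run
    labels=model.labels_
    # Number of clusters, not counting the noise label -1
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    clu_num.append(n_clusters_)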
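
Besides the number of clusters, it is often useful to know how many samples DBSCAN labels as noise. The sketch below illustrates the same counting logic on synthetic data: two tight `make_blobs` groups plus two hand-placed outliers (all assumptions made for illustration, not taken from auto-mpg).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two tight synthetic groups plus two isolated points that should become noise.
X_blobs, _ = make_blobs(
    n_samples=100,
    centers=[[0, 0], [10, 10]],
    cluster_std=0.2,
    random_state=10,
)
outliers = np.array([[5.0, 5.0], [-8.0, 8.0]])
X = np.vstack([X_blobs, outliers])

labels = DBSCAN(eps=1, min_samples=5).fit(X).labels_

# Clusters are the distinct labels other than the noise label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Noise points are the samples labelled -1.
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Tracking `n_noise` next to `clu_num` in the `min_samples` loop makes it easier to see when a stricter setting starts discarding large parts of the data.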

5.5 SpectralClustering

from sklearn.cluster import SpectralClustering
model= SpectralClustering(n_clusters=3)
model.fit(data_auto_sc)
auto_label=model.labels_
