
Data - Clustering - Case Study


Contents

0. Dataset Introduction
1. Import the Required Packages
2. Read the Data
3. Data Exploration
4. Data Preprocessing
5. Modeling
  5.1 KMeans
    Finding the optimal K
  5.2 MeanShift
  5.3 AgglomerativeClustering
  5.4 DBSCAN
  5.5 SpectralClustering


0. Dataset Introduction

The data consists of the technical specifications of cars. The dataset was downloaded from the UCI Machine Learning Repository:

UCI Machine Learning Repository: Auto MPG Data Set

Content

  1. Title: Auto-Mpg Data

  2. Sources:
    (a) Origin: This dataset was taken from the StatLib library which is
    maintained at Carnegie Mellon University. The dataset was
    used in the 1983 American Statistical Association Exposition.
    (c) Date: July 7, 1993

  3. Past Usage:

    • See 2b (above)
    • Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning.
      In Proceedings on the Tenth International Conference of Machine
      Learning, 236-243, University of Massachusetts, Amherst. Morgan
      Kaufmann.
  4. Relevant Information:

    This dataset is a slightly modified version of the dataset provided in
    the StatLib library. In line with the use by Ross Quinlan (1993) in
    predicting the attribute "mpg", 8 of the original instances were removed
    because they had unknown values for the "mpg" attribute. The original
    dataset is available in the file "auto-mpg.data-original".

    "The data concerns city-cycle fuel consumption in miles per gallon,
    to be predicted in terms of 3 multivalued discrete and 5 continuous
    attributes." (Quinlan, 1993)

  5. Number of Instances: 398

  6. Number of Attributes: 9 including the class attribute

  7. Attribute Information:

    1. mpg: continuous
    2. cylinders: multi-valued discrete
    3. displacement: continuous
    4. horsepower: continuous
    5. weight: continuous
    6. acceleration: continuous
    7. model year: multi-valued discrete
    8. origin: multi-valued discrete
    9. car name: string (unique for each instance)
  8. Missing Attribute Values: horsepower has 6 missing values

1. Import the Required Packages

    from sklearn.cluster import KMeans
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.cluster import DBSCAN
    from sklearn.cluster import MeanShift
    from sklearn.cluster import SpectralClustering
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import MaxAbsScaler
    from sklearn.preprocessing import StandardScaler
    from sklearn import metrics
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import warnings
    warnings.filterwarnings("ignore")

2. Read the Data

    data = pd.read_csv("d:/datasets/auto-mpg.csv")   # adjust the path to your local copy of the dataset

3. Data Exploration

    data.head()       # preview the first rows
    data.info()       # column dtypes and non-null counts
    data.describe()   # summary statistics of the numeric columns
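
The three calls above show the first rows, the column dtypes and the summary statistics. A quick visual pass can also help before clustering, since the features live on very different scales; below is a minimal sketch using the pandas and matplotlib imports from section 1 (the figure size and bin count are arbitrary choices).

    # Sketch: histograms of the numeric columns. Note that horsepower is still an
    # object column at this point because of the "?" placeholders, so hist() skips it.
    data.hist(figsize=(12, 8), bins=20)
    plt.tight_layout()
    plt.show()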

4. Data Preprocessing

    data_auto = data.drop("car name", axis=1)             # drop the non-numeric car name column
    print(data_auto[data_auto["horsepower"] == "?"])      # rows whose horsepower is missing
    horse = data_auto["horsepower"].value_counts()        # count the values, including the "?" placeholder
    # drop the incomplete samples
    data_auto.drop(data_auto[data_auto["horsepower"] == "?"].index, inplace=True)
    data_auto.horsepower = data_auto.horsepower.astype("int64")
    # standardize the features
    model_sc = StandardScaler()
    model_sc.fit(data_auto)
    data_auto_sc = model_sc.transform(data_auto)
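
Dropping the six incomplete rows is the simplest option and is what the rest of this article uses. If you would rather keep all 398 samples, a common alternative is to turn the "?" placeholders into NaN and impute the median; the sketch below does that with hypothetical variable names (data_imp, data_imp_sc) and is not used later.

    # Sketch: median imputation instead of dropping the rows with missing horsepower.
    data_imp = data.drop("car name", axis=1)
    data_imp["horsepower"] = pd.to_numeric(data_imp["horsepower"], errors="coerce")   # "?" -> NaN
    data_imp["horsepower"] = data_imp["horsepower"].fillna(data_imp["horsepower"].median())
    data_imp_sc = StandardScaler().fit_transform(data_imp)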

5. Modeling

5.1 KMeans

    model_km = KMeans(n_clusters=3, random_state=10)
    model_km.fit(data_auto_sc)
    auto_label = model_km.labels_                         # cluster label of each sample
    auto_cluster = model_km.cluster_centers_              # cluster centers in standardized space
    pd.Series(auto_label).value_counts()                  # cluster sizes
    print(auto_cluster)
    print(model_sc.inverse_transform(auto_cluster))       # centers back on the original scale
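
To judge how reasonable the three-cluster solution is, the inertia and the silhouette score can be printed, and the inverse-transformed centers are easier to read when labelled with the original column names. A minimal sketch using only the objects defined above:

    # Sketch: basic diagnostics for the fitted KMeans model.
    print("inertia:", round(model_km.inertia_, 2))        # within-cluster sum of squared distances
    print("silhouette:", round(metrics.silhouette_score(data_auto_sc, auto_label), 4))
    # Cluster centers on the original scale, labelled with the feature names
    centers = pd.DataFrame(model_sc.inverse_transform(auto_cluster), columns=data_auto.columns)
    print(centers.round(2))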

Finding the optimal K

    for k in [2, 3, 4, 6, 300]:
        model_km = KMeans(n_clusters=k, random_state=10).fit(data_auto_sc)
        auto_label = model_km.labels_
        auto_cluster = model_km.cluster_centers_
        print(k, " ", round(metrics.silhouette_score(data_auto_sc, auto_label), 4))
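
The loop above prints the silhouette score for a handful of K values. Plotting the score over a small grid often makes the comparison easier to read; the following is a sketch reusing the imports from section 1 (the range 2–10 is an arbitrary choice).

    # Sketch: silhouette score for k = 2..10, shown as a curve.
    ks = range(2, 11)
    scores = [metrics.silhouette_score(
                  data_auto_sc,
                  KMeans(n_clusters=k, random_state=10).fit(data_auto_sc).labels_)
              for k in ks]
    plt.plot(list(ks), scores, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("silhouette score")
    plt.show()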

5.2 MeanShift

    model_mn = MeanShift(bandwidth=2).fit(data_auto_sc)
    auto_label = model_mn.labels_
    auto_cluster = model_mn.cluster_centers_

    bandwidth_grid = np.arange(1, 2.5, 0.2)
    cluster_number = []
    slt_score = []
    for i in bandwidth_grid:
        model = MeanShift(bandwidth=i).fit(data_auto_sc)
        cluster_number.append(len(np.unique(model.labels_)))
        slt_score.append(metrics.silhouette_score(data_auto_sc, model.labels_))

    from prettytable import PrettyTable
    x = PrettyTable(["bandwidth", "number of clusters", "silhouette score"])
    # x.align["bandwidth"] = "l"   # left-align the bandwidth column
    # x.padding_width = 1          # padding width
    for i, j, k in zip(bandwidth_grid, cluster_number, slt_score):
        x.add_row([i, j, k])
    print(x)
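
Instead of scanning a hand-picked grid, scikit-learn also offers estimate_bandwidth, which derives a bandwidth from the pairwise distances in the data; the quantile value below is just an illustrative choice.

    # Sketch: let scikit-learn estimate the bandwidth from the data.
    from sklearn.cluster import estimate_bandwidth
    bw = estimate_bandwidth(data_auto_sc, quantile=0.2, random_state=10)
    model_bw = MeanShift(bandwidth=bw).fit(data_auto_sc)
    print("estimated bandwidth:", round(bw, 3))
    print("clusters found:", len(np.unique(model_bw.labels_)))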

5.3 AgglomerativeClustering

    model = AgglomerativeClustering(n_clusters=3, linkage="average").fit(data_auto_sc)
    auto_label = model.labels_

    lbs = pd.Series(auto_label).value_counts()
    # plt.bar(x=lbs.index, height=lbs)
    lbs.plot(kind="bar", rot=0)        # bar chart of the cluster sizes

    # Plot the dendrogram
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']   # font that can render Chinese characters in the figure
    # Use scipy's pdist, linkage and dendrogram functions to draw the dendrogram:
    # pdist returns the condensed distance matrix, linkage returns an ndarray that
    # describes how the clusters are merged step by step, and dendrogram draws the tree.
    row_clusters = linkage(pdist(data_auto_sc, metric='euclidean'), method='ward')
    fig = plt.figure(figsize=(16, 8))
    # p and truncate_mode truncate the dendrogram: the subtrees of some nodes are pruned
    # and the x axis shows the number of samples contained in each pruned node.
    row_dendr = dendrogram(row_clusters, p=50, truncate_mode='lastp', color_threshold=5)
    plt.tight_layout()
    plt.title('Dendrogram', fontsize=15)
    plt.show()

[Figure: truncated dendrogram produced by the code above]
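
The AgglomerativeClustering call above fixes linkage="average", while the dendrogram uses Ward linkage; the silhouette score is a quick way to compare linkage strategies on the same standardized data, as in the sketch below (output not shown).

    # Sketch: compare linkage strategies by silhouette score.
    for link in ["ward", "complete", "average", "single"]:
        labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(data_auto_sc)
        print(link, round(metrics.silhouette_score(data_auto_sc, labels), 4))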

5.4 DBSCAN

    # Fit the model
    model = DBSCAN(eps=1, min_samples=2).fit(data_auto_sc)
    # Model output: the cluster label of each sample (-1 means noise)
    auto_label = model.labels_
    # Indices of the core samples
    model.core_sample_indices_
    # The core samples themselves
    model.components_

    clu_num = []
    for min_ in [1, 3, 5, 7, 9]:
        model = DBSCAN(eps=1, min_samples=min_).fit(data_auto_sc)
        labels = model.labels_
        # number of clusters, excluding the noise label -1
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
        clu_num.append(n_clusters_)
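
min_samples is only half of the picture; with DBSCAN the eps value and the number of points labelled as noise (-1) usually matter just as much. A sketch scanning a few eps values at a fixed min_samples:

    # Sketch: number of clusters and noise points for a few eps values.
    for eps in [0.5, 1.0, 1.5, 2.0]:
        labels = DBSCAN(eps=eps, min_samples=5).fit(data_auto_sc).labels_
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int((labels == -1).sum())
        print("eps =", eps, "| clusters:", n_clusters, "| noise points:", n_noise)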

5.5 SpectralClustering

    from sklearn.cluster import SpectralClustering
    model = SpectralClustering(n_clusters=3)
    model.fit(data_auto_sc)
    auto_label = model.labels_
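
As with the other models, the silhouette score and the cluster sizes give a quick check of the result; SpectralClustering exposes no cluster_centers_ attribute, so only the labels are inspected here.

    # Sketch: evaluate the spectral clustering result.
    print("silhouette:", round(metrics.silhouette_score(data_auto_sc, auto_label), 4))
    print(pd.Series(auto_label).value_counts())    # cluster sizes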
