SPSS (15): Cluster Analysis in SPSS (Screenshots + Dataset)
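Introduction to cluster analysis
Cluster analysis groups individuals (records) by their characteristics so that individuals within a cluster are as homogeneous as possible while the clusters themselves are as heterogeneous as possible.
To obtain a sensible grouping, we first need a suitable measure that quantifies how closely the objects under study are related.
Assume each object is represented by a "point".
The general rule in cluster analysis is to put points separated by a small "distance" into the same cluster and points separated by a large "distance" into different clusters.
Clustering is most often applied to individuals (cases), but variables can be clustered too; when clustering variables, a similarity coefficient is generally used as the "distance" measure.
Before the analysis, the cluster membership of every individual is unknown, and usually the number of clusters is unknown as well. The analysis rests entirely on the raw data, possibly with no prior information about the classes at all.
Strictly speaking, cluster analysis is not a purely statistical technique: unlike other multivariate methods, it does not infer a population from a sample. It generally involves no sampling distributions and requires no significance tests.
Note: cluster analysis is better seen as a way of generating hypotheses. Testing those hypotheses requires other statistical methods, such as discriminant analysis, t-tests, or analysis of variance, to check whether the resulting clusters really differ.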
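As a rough illustration of the follow-up check mentioned above, the sketch below computes a one-way ANOVA F statistic for one variable across clusters. The function name `one_way_f` and the numbers are made up for illustration. Keep in mind that when the variable was itself used to form the clusters, the F value is only descriptive, because the clusters were built to differ on it.

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)                      # number of clusters
    n = sum(len(g) for g in groups)      # total number of cases
    grand = mean([x for g in groups for x in g])
    # Variation of the cluster means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the cases around their own cluster mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical values of one variable, split by cluster membership
clusters = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
f_stat = one_way_f(clusters)
print(f_stat)
```

A large F suggests the clusters differ on that variable; a proper test would compare it against the F distribution with (k - 1, n - k) degrees of freedom.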
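Uses of clustering
Designing sampling schemes (stratified sampling)
Preliminary analysis (use clustering to simplify the data first, gathering many individuals into a few manageable clusters or subsets, and then carry out further multivariate analysis)
Market segmentation and profiling of individual consumer behavior (cluster first, then use discriminant analysis to study the differences between the groups)
Summary of the basic steps of cluster analysis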
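Clustering methods
K-means clustering (K-means Cluster)
How it works
Select (or manually specify) some records as the initial cluster centers (seeds)
Assign each remaining record to its nearest center
Compute the center (mean) of each initial cluster
Re-cluster the records using the newly computed centers
Repeat until the positions of the centers converge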
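The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not SPSS's implementation: it assumes Euclidean distance, takes the initial centers as an argument, and leaves an empty cluster's center where it was. The function name `kmeans` and the sample points are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign to the nearest center, recompute means, repeat."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assign every record to its nearest center
        labels = [min(range(len(centers)),
                      key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Recompute each center as the mean of its members
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # empty cluster: keep the old center
        # Stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical two-dimensional records with two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

The `max_iter` cap plays the same role as the maximum-iterations setting in the dialog: if it is too small, the loop ends before the centers have stopped moving.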
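Characteristics
The number of clusters must be specified in advance
Initial centers can be specified manually
Computationally economical
Worth considering when the sample is very large
Works only with continuous variables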
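Case study: segmenting mobile-phone customers
The data contain a customer ID plus six variables:
Customer ID (Customer_ID)
Call minutes during weekday working hours (Peak_mins)
Call minutes during weekday off-work hours (OffPeak_mins)
Weekend call minutes (Weekend_mins)
International call minutes (International_mins)
Total call minutes (Total_mins)
Average minutes per call (average_mins)
Based on preliminary research, the researcher believes mobile users fall into five main groups and now wants the corresponding quantitative clustering solution.
(The dataset is too large to list here; it can be downloaded from my resources as "spss之聚类分析--移动通讯客户细分".)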
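With the initial settings the solution does not converge, so raise the maximum number of iterations until it does.
Even then, the final clusters look odd.
The variables differ in scale and magnitude, and when distances are computed the variables with larger values dominate the result.
How do we standardize? One option is the Descriptives procedure with "Save standardized values as variables" checked.
Re-run the clustering on the standardized variables.
Standardized values mostly fall within ±3, with 0 representing the average level.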
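For reference, a z-score is simply (value - mean) / standard deviation. The sketch below assumes the sample standard deviation (n - 1 in the denominator); the minute values are invented for illustration.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize a variable to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)  # stdev uses n - 1
    return [(v - m) / s for v in values]

# Hypothetical Total_mins values for five customers
total_mins = [100.0, 200.0, 300.0, 400.0, 500.0]
z = zscores(total_mins)
print([round(v, 3) for v in z])
```

After this transformation every variable contributes on a comparable scale, so no single variable dominates the distance just because its raw values are large.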
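However, we want to read the clusters in terms of the original variables, not the standardized ones.
Save the cluster membership of each case.
Then request only the mean of each original variable within each cluster.
This gives the unstandardized cluster centers.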
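The same idea in code: once each case carries a saved cluster label, the unstandardized centers are just the per-cluster means of the original variables. The helper `cluster_means`, the rows, and the labels below are hypothetical.

```python
from collections import defaultdict

def cluster_means(rows, labels):
    """Mean of every original variable within each cluster."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    return {lab: [sum(col) / len(members) for col in zip(*members)]
            for lab, members in sorted(groups.items())}

# Hypothetical (Peak_mins, OffPeak_mins) rows and saved cluster labels
rows = [(120.0, 30.0), (100.0, 50.0), (10.0, 5.0), (20.0, 15.0)]
labels = [1, 1, 2, 2]
centers = cluster_means(rows, labels)
print(centers)
```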
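Interpretation
Cluster 1: high-end business customers; long total call time, with a high share of calls during weekday working hours
Cluster 2: low-usage, low-end customers; short total call time, and little calling in every time period
Cluster 3: mid-range business customers; moderate total call time, with a high share of calls during weekday working hours
Cluster 4: mid-range everyday customers; moderate total call time, with a high share of calls during weekday off-work hours
Cluster 5: long-call customers; each call lasts a long time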
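Hierarchical clustering (Hierarchical Cluster)
A form of systematic clustering whose clustering process can be depicted as a tree-like structure
How it works
Start by treating all n variables/observations as n separate clusters
Merge the two clusters that are most similar (closest in distance) into one
Then find the two closest among the remaining n-1 clusters and merge them
Continue in this way until all variables/observations have been merged into a single cluster
The user then decides how many clusters to keep, based on the problem at hand and the clustering results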
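The merging procedure above can be sketched as follows. This is only an illustration under stated assumptions: it uses plain Euclidean distance and average (between-groups) linkage, and the function name `agglomerate` and the points are hypothetical. SPSS offers many other measures and linkage methods.

```python
import math
from itertools import combinations

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member indices

    def avg_dist(a, b):
        # Average pairwise distance between the members of two clusters
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    history = []
    next_id = len(points)
    while len(clusters) > 1:
        # Find the closest pair of current clusters
        ia, ib = min(combinations(clusters, 2),
                     key=lambda pair: avg_dist(clusters[pair[0]], clusters[pair[1]]))
        d = avg_dist(clusters[ia], clusters[ib])
        merged = clusters.pop(ia) + clusters.pop(ib)
        clusters[next_id] = merged
        history.append((sorted(merged), d))  # members after the merge, and its distance
        next_id += 1
    return history

# Hypothetical observations: two tight pairs and one distant point
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (20.0, 0.0)]
history = agglomerate(points)
for members, dist in history:
    print(members, round(dist, 3))
```

A large jump in the merge distance between consecutive steps is a common cue for where to cut the tree, which is what the user reads off the dendrogram.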
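Characteristics
Once a record/variable has been assigned to a cluster, that assignment is never revised
Either variables or records can be clustered
Variables may be continuous or categorical, but the two types cannot be mixed: cluster on all-categorical or on all-continuous variables
A very rich set of distance measures is available
Relatively slow to compute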
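Case study: clustering gymnastics judges by scoring tendency (this example clusters variables)
The SPSS sample dataset judges.sav contains the scores that judges from seven countries (including China, the United States, and France) and an untrained sports enthusiast gave to athletes in a competition. Group the raters into suitable clusters according to the differences in their scoring.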
(Data listing from judges.sav: eight scores per record, one per rater. The first three records are shown; the rest are omitted.)
7.30  8.00  7.10  7.70  7.20  7.20  7.00  7.6
7.80  8.70  7.20  8.40  7.50  8.10  7.30  7.1
7.20  7.40  7.10  7.50  7.20  7.10  7.00  7.0
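Why not use K-means clustering here?
Because K-means can only cluster cases, and here we are clustering variables
K-means also needs the number of clusters in advance, which we do not know yet
Since we are clustering variables, the icicle plot is cumbersome to read; go straight to the dendrogram
In the agglomeration schedule, the coefficients represent distances; what a distance means depends on the distance measure we chose
In the dendrogram, the coefficient 233.297 is rescaled to 25 on the axis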
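The rescaling can be pictured as below. This is a sketch that assumes a purely proportional mapping in which the largest coefficient becomes 25; SPSS's exact formula may differ. Only 233.297 comes from the output discussed here; the other coefficients are invented.

```python
def rescale_to_25(coefficients):
    """Map merge coefficients onto a 0-25 axis, keeping their ratios."""
    top = max(coefficients)
    return [c / top * 25 for c in coefficients]

# Only 233.297 is from the text; the other values are made up
coeffs = [20.0, 58.3, 110.5, 233.297]
scaled = rescale_to_25(coeffs)
print([round(s, 2) for s in scaled])
```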
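The grouping of Italy with the Eastern bloc (China, Russia, Romania) looks questionable
When clustering variables, the distance measure should generally be correlation (the SPSS default is squared Euclidean distance), so switch the measure and re-run
The result is much better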
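This example could also be tackled with factor analysis
Further notes:
Cluster method: between-groups linkage generally works best; Ward's method tends to produce clusters of similar size
Measure: for cases, squared Euclidean distance works best
for variables, Pearson correlation works best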
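One plausible reason correlation suits variable clustering: two raters who rank athletes identically but differ in leniency are far apart in squared Euclidean terms, yet perfectly correlated. The sketch below uses invented scores and a hand-written `pearson` helper to show this.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: rater B is always exactly one point more lenient than rater A
rater_a = [7.0, 8.0, 9.0, 7.5, 8.5]
rater_b = [8.0, 9.0, 10.0, 8.5, 9.5]

r = pearson(rater_a, rater_b)
sq_euclid = sum((a - b) ** 2 for a, b in zip(rater_a, rater_b))
print(round(r, 6), sq_euclid, round(1 - r, 6))
```

With correlation as the similarity (or 1 - r as a distance), these two raters would be merged first; with squared Euclidean distance they might not be.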
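On standardization
K-means clustering: you have to standardize the variables yourself beforehand
Hierarchical clustering: standardization can be requested in the procedure itself (the Transform Values option in the Method dialog)
The two methods above are the classical clustering methods; there are also "intelligent" clustering methods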
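TwoStep clustering (TwoStep Cluster)
Characteristics:
Handles both categorical and continuous variables
Determines the best number of clusters automatically
Processes large datasets quickly
Assumptions:
The variables are independent of one another
Categorical variables follow a multinomial distribution; continuous variables follow a normal distribution
In practice, mild violations of these assumptions do not matter much: the results are quite robust, and the procedure can set outliers aside automatically
The dataset is the same as before (download "spss之聚类分析--移动通讯客户细分" from my resources)
With this procedure SPSS standardizes the continuous variables automatically
Set the maximum number of clusters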
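Points to watch when clustering
Distance measure
The default is usually fine
Variable selection
Irrelevant variables can sometimes cause serious misclassification
Include only variables that differ markedly between clusters
Try to use variables of a single type (cluster on continuous variables and keep categorical variables for interpreting the results; newer methods such as TwoStep clustering can use both types together)
Collinearity
It has a large effect on case clustering, because it effectively gives one variable more weight than the others
Best dealt with by preprocessing beforehand
Standardizing variables
Needed when the variables differ greatly in units or variability
From a purely mathematical-statistics standpoint, standardization is always required
Standardizing can, however, weaken the influence of useful variables
Outliers
They have a large influence
There is no really good remedy yet
Avoid them as far as possible
Number of clusters
From a practical standpoint, 2 to 8 clusters is a suitable range
Subject-matter meaning
Always interpret the results in light of domain knowledge
Other points
Cluster analysis is mainly used in exploratory research. It can yield several plausible solutions, and choosing the final one requires the researcher's judgment and follow-up analysis
The solution depends entirely on the clustering variables the researcher chooses; adding or removing variables can substantially change the final solution
Cluster analysis will always return a solution with some number of clusters, whether or not distinct groups really exist in the data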