Part 05: Grouping

Books / 파이썬 데이터과학 통계학습 (23.01.03-23.01.09) (정보문화사)


정지홍 2023. 1. 9. 12:01

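Grouping: a family of methods for bundling together the observations in a given dataset (cluster analysis, association analysis, link analysis).

Characteristic of grouping: the groups are derived from the data so that observations within a group are similar to one another, while different groups differ from one another.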

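Cluster analysis

-A method that bundles the observations in a dataset into several groups.

-Observations (or variables) are grouped on the basis of their similarity.

-There are two families of methods: partitional and hierarchical.

-Partitional: the number of clusters is decided in advance, and then clustering is performed. Because the clusters do not overlap, the computational cost is low, which makes this approach suitable for large datasets.

-Hierarchical: each observation is first treated as its own cluster, and the closest clusters are merged one step at a time.

-Cluster analysis is sensitive to changes in the data: small changes in the observations can change the resulting clusters.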
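
The two families above can be tried side by side with scikit-learn. The following is only a small sketch on made-up data (the six points in `X` are an assumption for illustration, not data from the book), using `KMeans` as the partitional method and `AgglomerativeClustering` as the hierarchical one.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Two small, well-separated groups of 2-D observations (made-up data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# Partitional: the number of clusters is fixed before fitting.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): every observation starts as its own cluster
# and the closest clusters are merged step by step; here the merging is
# stopped once 2 clusters remain.
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(km_labels)
print(hc_labels)
```

The cluster ids are arbitrary in both cases, so the two label arrays may use different numbers for the same group.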

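k-means cluster analysis

-One of the partitional clustering methods.

-The data are split into a predetermined number k of clusters, and each observation must belong to exactly one cluster.

-The similarity between observations is determined from the distance between them, computed over their variables.

-It works well when the data form compact clumps, but outliers cause problems, because each cluster center is a mean and is pulled toward extreme values.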
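
The assign-then-update idea behind k-means can be sketched in plain NumPy. This is a toy version of Lloyd's algorithm that assumes Euclidean distance; it is not scikit-learn's implementation, and the function name `simple_kmeans`, its parameters, and the sample points are hypothetical.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means (Lloyd's algorithm) with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every observation to every center: shape (n, k).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Each observation joins exactly one cluster: the nearest center.
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

# Two tight, well-separated clumps (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = simple_kmeans(X, k=2)
print(labels)
print(centers)
```

Because the update step takes a mean, a single far-away observation added to `X` would drag one center toward it, which is the outlier problem noted above.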

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import datasets
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
iris=datasets.load_iris()
df_iris=pd.DataFrame(data=np.c_[iris.data , iris.target],
columns=iris.feature_names+['target'])
df_iris['target'] = df_iris['target'].apply(lambda x: 'setosa' if 0 == x else ('versicolor' if 1 == x else 'virginica'))
df_iris.head()

# `size` was renamed to `height` in newer seaborn releases
sns.pairplot(data=df_iris.loc[:, ['target', 'sepal length (cm)','sepal width (cm)']], hue="target", height=5,
plot_kws=dict(s=50, linewidth=1))

# n_jobs was removed from KMeans in scikit-learn 1.0; n_init=10 pins the number of restarts
cluster = KMeans(n_clusters=3, n_init=10, random_state=0)
model = cluster.fit(df_iris[['sepal length (cm)','sepal width (cm)']])
centers = model.cluster_centers_
print(centers)

[[5.77358491 2.69245283]
 [5.006      3.428     ]
 [6.81276596 3.07446809]]

model.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

df_iris['km_cluster'] = model.labels_
# Cluster ids are arbitrary; in the labels printed above, id 1 covers the first 50 rows (setosa)
df_iris['km_cluster'] = df_iris['km_cluster'].map({1: 'setosa', 0: 'versicolor', 2: 'virginica'})

df_iris.head()

pd.DataFrame(df_iris.groupby(['target','km_cluster'])[['sepal length (cm)','sepal width (cm)']].count())

sns.pairplot(data=df_iris.loc[:, ['km_cluster', 'sepal length (cm)','sepal width (cm)']], hue="km_cluster", height=5, plot_kws=dict(s=50, linewidth=1))
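
Since cluster ids are arbitrary, comparing clusters with species by name depends on picking the right id-to-name mapping. As a complementary check that does not need such a mapping, a contingency table and the adjusted Rand index can be used. This is a self-contained sketch that refits k-means on the two sepal features; `adjusted_rand_score` is not used in the original notes and is brought in here as an assumption about what a useful check looks like.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X = iris.data[:, :2]                      # sepal length and sepal width only
species = iris.target_names[iris.target]  # 'setosa', 'versicolor', 'virginica'

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows: true species, columns: cluster id (the ids themselves are arbitrary).
table = pd.crosstab(species, labels, rownames=['species'], colnames=['cluster'])
print(table)

# Adjusted Rand index: 1.0 means identical groupings,
# values near 0 mean chance-level agreement.
ari = adjusted_rand_score(iris.target, labels)
print(round(ari, 3))
```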

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=200, n_features=2, centers=5, random_state=27)

df_k5 = pd.DataFrame(data=X, columns=['x1','x2'])
df_k5['target'] = y
df_k5.head()

         x1        x2  target
0 -5.673888  9.134580       2
1  7.783390 -7.145758       3
2  7.532874 -4.235851       3
3 -1.497504  9.437434       2
4 -2.912560  6.562741       0

plt.scatter(df_k5['x1'], df_k5['x2'], marker='o', c=y, s=50,
edgecolor="k")
plt.show()

# n_jobs was removed from KMeans in scikit-learn 1.0; n_init=10 pins the number of restarts
cluster = KMeans(n_clusters=5, n_init=10, max_iter=200, random_state=27)
pred = cluster.fit_predict(X)
df_k5['pred']= pred

pd.DataFrame(df_k5.groupby(['target','pred'])[['x1','x2']].count())

             x1  x2
target pred
0      0     37  37
       4      3   3
1      2     38  38
       3      2   2
2      4     40  40
3      1     40  40
4      2      1   1
       3     39  39

plt.scatter(
    X[pred == 0, 0], X[pred == 0, 1],
    s=50, c='green',
    marker='o', label='cluster 1'
)

plt.scatter(
    X[pred == 1, 0], X[pred == 1, 1],
    s=50, c='orange',
    marker='s', label='cluster 2'
)

plt.scatter(
    X[pred == 2, 0], X[pred == 2, 1],
    s=50, c='gold',
    marker='v', label='cluster 3'
)

plt.scatter(
    X[pred == 3, 0], X[pred == 3, 1],
    s=50, c='brown',
    marker='h', label='cluster 4'
)

plt.scatter(
    X[pred == 4, 0], X[pred == 4, 1],
    s=50, c='violet',
    marker='d', label='cluster 5'
)
plt.scatter(
    cluster.cluster_centers_[:, 0], cluster.cluster_centers_[:, 1],
    s=100, c='r',
    marker='*', label='centroids'
)

plt.legend()
plt.grid()
plt.show()

inertia = list()

for i in range(1,11):
    model = KMeans(n_clusters=i)
    model.fit(X)
    inertia.append(model.inertia_)

plt.plot(range(1,11), inertia, '-o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Distortion')
plt.show()
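
The quantity plotted on the y-axis above is `inertia_`, the sum of squared distances from each observation to its assigned cluster center. The sketch below recomputes it by hand on the same `make_blobs` data to show what the curve measures; the `n_init=10` and `random_state=27` settings are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=5, random_state=27)
model = KMeans(n_clusters=5, n_init=10, random_state=27).fit(X)

# Look up, for every observation, the center of the cluster it was assigned to.
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared Euclidean distances to those centers.
manual_inertia = np.sum((X - assigned_centers) ** 2)

print(manual_inertia, model.inertia_)
```

Inertia keeps decreasing as k grows, so the usual reading of the plot is to look for the "elbow" where the decrease flattens rather than for the minimum.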

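Hierarchical cluster analysis

-A method that allows sub-clusters within a cluster.

-The number of clusters can be decided by visualizing the result.

-The entire dataset is treated as one cluster, which is split into sub-clusters, and those sub-clusters are split again hierarchically. Because of this property, the number of clusters does not have to be fixed in advance.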
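
The tree structure described above is usually inspected as a dendrogram. The following sketch uses SciPy on made-up data. Note that `scipy.cluster.hierarchy.linkage` builds the tree bottom-up (agglomerative) rather than top-down as in the last bullet, but the resulting hierarchy is read the same way; the choice of Ward linkage here is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two small, well-separated groups of 2-D observations (made-up data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# Merge table: one row per merge, (n - 1) rows in total.
Z = linkage(X, method='ward')

# no_plot=True only computes the layout; drop it to draw the tree with matplotlib.
tree = dendrogram(Z, no_plot=True)
print(tree['ivl'])   # leaf order along the x-axis

# The number of clusters is chosen afterwards by cutting the tree.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

A large jump in merge height in the dendrogram suggests where to cut, which is how the visualization helps choose the number of clusters.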

Types of linkage methods

Single linkage
Complete linkage
Average linkage
Centroid linkage
Ward linkage
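
The five linkage rules listed above differ in how the distance between two clusters is measured. As a rough sketch on made-up data, SciPy's `linkage` accepts each of them through its `method` argument, and the height of the final merge shows how the rules disagree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two small, well-separated groups of 2-D observations (made-up data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# SciPy's `method` names for the five linkage rules listed above.
methods = ['single', 'complete', 'average', 'centroid', 'ward']

last_merge = {}
for method in methods:
    Z = linkage(X, method=method)   # Euclidean distance by default
    # Column 2 of the last row is the height at which the final two clusters merge.
    last_merge[method] = Z[-1, 2]
    print(f"{method:>8}: {last_merge[method]:.3f}")
```

For this data the final merge joins the two groups under every rule: single linkage reports the smallest pairwise distance between them, complete linkage the largest, and average linkage a value in between.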