
[Ensemble] Random Forests in Python (scikit-learn)

by 궁금한 준이 2023. 5. 13.

Setup

Let's import the libraries needed for the bagging and random forest exercises.

# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# dataset
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Ensemble model and Decision-tree
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# metric
from sklearn.metrics import accuracy_score

The dataset is 2D data shaped like two interleaving crescent moons, generated with make_moons.

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

 

The following helper function visualizes the moons data together with a classifier's decision boundary.

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    # evaluate the classifier on a 100x100 grid covering `axes`
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    # filled regions show the predicted class
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    # scatter the training points: yellow circles = class 0, blue squares = class 1
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

 

Bagging

# define BaggingClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42
)

# fitting the training data
bag_clf.fit(X_train, y_train)

# predict
y_pred = bag_clf.predict(X_test)

# accuracy
print(accuracy_score(y_test, y_pred)) # 0.904

# compare with Single DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree)) # 0.856
  • n_estimators=500: ensemble 500 DecisionTreeClassifiers.
  • max_samples=100: the number of samples to draw from X (the X_train passed to fit) for each tree.
  • bootstrap=True: sample with replacement (bootstrap sampling); setting this to False gives pasting, as sketched below.
  • n_jobs=-1: use all available CPU cores.
  • random_state=42: fix the random seed to 42 for reproducibility.
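If you set bootstrap=False instead, each tree is trained on samples drawn without replacement, which is called pasting. Here is a minimal sketch under that change (paste_clf is my own name, and the resulting accuracy is not a number from the original post):

# Pasting: identical to bag_clf above, except samples are drawn WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1, random_state=42
)
paste_clf.fit(X_train, y_train)
print(accuracy_score(y_test, paste_clf.predict(X_test)))
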
# figure size setting
plt.figure(figsize=(11,4))

# plot single decision tree classifier
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)

# plot ensemble model (500 decision tree classifier)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)

# plot all
plt.show()

A Single Decision Tree vs a Bagging Ensemble of 500 trees

Out-of-Bag Evaluation (OOB Score)

To see the OOB score, you must set the corresponding option (oob_score=True) when building the bagging classifier.

# oob score
bag_clf_2 = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42,
    oob_score=True
)
bag_clf_2.fit(X_train, y_train)

print(bag_clf_2.oob_score_) # 0.92

y_pred_2 = bag_clf_2.predict(X_test)
print(accuracy_score(y_test, y_pred_2)) # 0.92

The OOB evaluation comes out to 0.92, which suggests the classifier should reach about 92% accuracy on the test set.

Comparing y_pred_2 against the actual y_test indeed gives an accuracy of 0.92.

(The two numbers match exactly here because the dataset is simple. In practice there may be a small gap, but the OOB score will still be close to the actual test-set result.)
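Where does that OOB score come from? With oob_score=True, BaggingClassifier also exposes the per-instance OOB class probabilities through its oob_decision_function_ attribute; a quick peek:

# OOB class probabilities for the first 5 training instances.
# Each row is [P(class 0), P(class 1)], estimated using only the trees
# whose bootstrap sample did not include that instance.
print(bag_clf_2.oob_decision_function_[:5])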

Random Forests

Let's train a random forest on the same moons dataset.

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

# almost identical predictions
np.sum(y_pred == y_pred_rf) / len(y_pred) # 0.976
  • n_estimators=500: ensemble 500 decision trees.
  • max_leaf_nodes=16: restrict each tree to at most 16 leaf nodes.
  • n_jobs=-1: same as above.
  • random_state=42: same as above.
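As an aside, a BaggingClassifier over trees with randomized splits roughly mimics a random forest, and bag_clf_2 from the OOB section already has that form (splitter="random", max_leaf_nodes=16, bagging over the full training set). A quick sketch comparing its predictions with the random forest's (I would expect high but not perfect agreement):

# count how often the random-split bagging ensemble (bag_clf_2)
# and the random forest (rnd_clf) agree on the test set
y_pred_bag2 = bag_clf_2.predict(X_test)
print(np.sum(y_pred_bag2 == y_pred_rf) / len(y_pred_rf))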

Feature Importance (Variable Importance)

Let's also apply this to the iris dataset.

scikit-learn's random forest computes feature importances automatically as part of training (they sum to $1$).

from sklearn.datasets import load_iris
iris = load_iris()

# train the Random-Forest-Classifier
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])

# print feature importances
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
    
'''
sepal length (cm) 0.11249225099876375
sepal width (cm) 0.02311928828251033
petal length (cm) 0.4410304643639577
petal width (cm) 0.4233579963547682
'''

print(rnd_clf.feature_importances_)
# array([0.11249225, 0.02311929, 0.44103046, 0.423358  ])
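As a quick sanity check on the claim that the importances are normalized, and to rank the features, something like this works:

# the normalized importances sum to 1 (up to floating-point error)
print(rnd_clf.feature_importances_.sum())

# rank the features from most to least important
for i in np.argsort(rnd_clf.feature_importances_)[::-1]:
    print(iris["feature_names"][i], rnd_clf.feature_importances_[i])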

 

2D data (pixel importance)

For 2D data such as MNIST images, every pixel is a feature, so the importances can themselves be displayed as an image.

# load MNIST dataset
try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1)
    mnist.target = mnist.target.astype(np.int64)
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
    
# define and fit the RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])   

# function to display the feature importance of classifier
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.hot,
               interpolation="nearest")
    plt.axis("off")
    
# display the feature importances of digit    
plot_digit(rnd_clf.feature_importances_)

cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])

plt.show()

MNIST pixel importance of Random Forest Classifier

 

A random forest makes it very easy to see which features matter, which also makes it useful for feature selection, as sketched below.
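For example, scikit-learn's SelectFromModel can wrap a random forest and keep only the features whose importance clears a threshold. A minimal sketch on the iris data (threshold="mean" is my arbitrary choice for illustration):

from sklearn.feature_selection import SelectFromModel

# keep only the features whose importance exceeds the mean importance;
# given the importances printed above, only the two petal features should survive
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42), threshold="mean"
)
X_reduced = selector.fit_transform(iris["data"], iris["target"])
print(X_reduced.shape)  # expected (150, 2)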
