DTS: ML_Grid Search (Hyperparameter)

§ Previous posting

PipeLine

Learning curve


Drawing validation curves for an ML pipeline

  • ML grid search

    • Pipeline design and hyperparameter tuning with grid search
    • There are two approaches: grid search and random search.
      • A common strategy is to narrow the range with random search first, then search the promising region reliably with grid search (a sketch follows right after this list)!
  • Honestly, I don't feel like studying either, so I just want to follow along with what others are doing.
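
As a rough sketch of that random-then-grid strategy (a minimal example assuming the pipe_tree pipeline, the X_train/y_train split, and the kfold object built later in this post; the depth ranges are illustrative only):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Step 1: random search samples a wide range cheaply.
rs = RandomizedSearchCV(estimator=pipe_tree,
                        param_distributions={"decisiontreeclassifier__max_depth": randint(1, 20)},
                        n_iter=10, scoring="accuracy", cv=kfold, random_state=1)
rs.fit(X_train, y_train)
best_depth = rs.best_params_["decisiontreeclassifier__max_depth"]

# Step 2: grid search refines a narrow band around the winner.
gs = GridSearchCV(estimator=pipe_tree,
                  param_grid={"decisiontreeclassifier__max_depth":
                              [max(1, best_depth - 1), best_depth, best_depth + 1]},
                  scoring="accuracy", cv=kfold)
gs.fit(X_train, y_train)
print(gs.best_params_)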

What others do: Kaggle competitions





import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    # stratify = y,
                                                    random_state=1)
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

pipe_tree = make_pipeline(StandardScaler(),
                          PCA(n_components=2),
                          DecisionTreeClassifier(random_state=1))
# This line is the key!!

# estimator.get_params().keys()
# pipe_tree.get_params().keys() ---> use it like this.

print(pipe_tree.get_params().keys())
param_grid = [{"decisiontreeclassifier__max_depth": [1, 2, 3, 4, 5, 6, 7, None]}]

gs = GridSearchCV(estimator=pipe_tree,
                  param_grid=param_grid,
                  scoring="accuracy",
                  cv=kfold)

gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

clf = gs.best_estimator_
# Automatically picks out and reports the best estimator.
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'pca', 'decisiontreeclassifier', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'pca__copy', 'pca__iterated_power', 'pca__n_components', 'pca__random_state', 'pca__svd_solver', 'pca__tol', 'pca__whiten', 'decisiontreeclassifier__ccp_alpha', 'decisiontreeclassifier__class_weight', 'decisiontreeclassifier__criterion', 'decisiontreeclassifier__max_depth', 'decisiontreeclassifier__max_features', 'decisiontreeclassifier__max_leaf_nodes', 'decisiontreeclassifier__min_impurity_decrease', 'decisiontreeclassifier__min_samples_leaf', 'decisiontreeclassifier__min_samples_split', 'decisiontreeclassifier__min_weight_fraction_leaf', 'decisiontreeclassifier__random_state', 'decisiontreeclassifier__splitter'])
0.927536231884058
{'decisiontreeclassifier__max_depth': 7}
Test accuracy: 0.9210526315789473
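
A side note: GridSearchCV defaults to refit=True, so best_estimator_ has already been refit on the whole training set and the extra clf.fit above is redundant. A minimal simplification (my addition, not from the original post):

# gs itself is already a fitted model thanks to refit=True,
# so it can be scored directly.
print("Test accuracy:", gs.score(X_test, y_test))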

Hyperparameter tuning with SVC

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    # stratify = y,
                                                    random_state=1)

pipe_svc = make_pipeline(StandardScaler(),
                         PCA(n_components=2),
                         SVC(random_state=1))

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{"svc__C": param_range,
               "svc__gamma": param_range,
               "svc__kernel": ["linear"]}]

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring="accuracy",
                  cv=10)

gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

clf = gs.best_estimator_
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))

Heh heh heh.


DTS: ML_Validation CurveG(01)

§ Previous posting

PipeLine

Learning curve


Plotting a validation curve

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from lightgbm import LGBMClassifier

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    # stratify = y,
                                                    random_state=1)
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(solver="liblinear", penalty="l2", random_state=1))

param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(estimator=pipe_lr,
                                             X=X_train,
                                             y=y_train,
                                             param_name="logisticregression__C",
                                             param_range=param_range,
                                             cv=kfold)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

fig, ax = plt.subplots(figsize=(16, 10))
ax.plot(param_range, train_mean, color="blue", marker="o", markersize=5, label="training accuracy")
ax.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha=0.15, color="blue")  # ±1 std band (estimated variance)
ax.plot(param_range, test_mean, color="green", marker="s", linestyle="--", markersize=5, label="Validation accuracy")
ax.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha=0.15, color="green")
plt.grid()
plt.xscale("log")
plt.xlabel("Parameter C")
plt.ylabel("Accuracy")
plt.legend(loc="lower right")
plt.ylim([0.8, 1.03])
plt.tight_layout()
plt.show()
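
If you want the curve to hand back a number rather than just a picture, the C with the highest mean validation accuracy can be read straight off the arrays computed above (a small sketch):

# Pick the C whose mean cross-validation accuracy is highest.
best_C = param_range[np.argmax(test_mean)]
print("Best C from the validation curve:", best_C)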

[Figure: ML_ValidationCurve]

Loading the data

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from lightgbm import LGBMClassifier

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values





Splitting train/test and building the pipeline


le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    # stratify = y,
                                                    random_state=1)
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(solver="liblinear", penalty="l2", random_state=1))





Validation curve (sweeping C)


param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(estimator=pipe_lr,
                                             X=X_train,
                                             y=y_train,
                                             param_name="logisticregression__C",
                                             param_range=param_range,
                                             cv=kfold)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(param_range, train_mean, color="blue", marker="o", markersize=5, label="training accuracy")
ax.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha=0.15, color="blue")  # ±1 std band (estimated variance)
ax.plot(param_range, test_mean, color="green", marker="s", linestyle="--", markersize=5, label="Validation accuracy")
ax.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha=0.15, color="green")
plt.grid()
plt.xscale("log")
plt.xlabel("Parameter C")
plt.ylabel("Accuracy")
plt.legend(loc="lower right")
plt.ylim([0.8, 1.03])
plt.tight_layout()
plt.show()

[Figure: ML_Gridsearch_g]

DTS: ML_Learning CurveG(01)

§ Previous posting

PipeLine

§ Next posting

PipeLine


Plotting the learning curve

  • Run the ML model through a pipeline
  • Then, to inspect the model, plot learning and validation curves
  • In general, the two curves are plotted together (see the side-by-side sketch right after this list).
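
A rough sketch of that side-by-side layout (a minimal example assuming the pipe_lr pipeline and the X_train/y_train split built in the code below; the combined figure is my addition, not from the original post):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: learning curve (accuracy vs. number of training samples).
sizes, tr, te = learning_curve(pipe_lr, X_train, y_train,
                               train_sizes=np.linspace(0.1, 1.0, 10), cv=10)
ax1.plot(sizes, tr.mean(axis=1), marker="o", label="training")
ax1.plot(sizes, te.mean(axis=1), marker="s", linestyle="--", label="validation")
ax1.set_title("Learning curve")
ax1.set_xlabel("Number of training samples")

# Right panel: validation curve (accuracy vs. regularization parameter C).
C_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
tr, te = validation_curve(pipe_lr, X_train, y_train,
                          param_name="logisticregression__C",
                          param_range=C_range, cv=10)
ax2.plot(C_range, tr.mean(axis=1), marker="o", label="training")
ax2.plot(C_range, te.mean(axis=1), marker="s", linestyle="--", label="validation")
ax2.set_xscale("log")
ax2.set_title("Validation curve")
ax2.set_xlabel("Parameter C")

for ax in (ax1, ax2):
    ax.set_ylabel("Accuracy")
    ax.legend(loc="lower right")
    ax.grid(True)
plt.tight_layout()
plt.show()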





Loading the data, splitting off the training set, and defining cross-validation

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)
print(df.info())

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)
print("Target classes:", le.classes_)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(solver="liblinear", random_state=1))





Computing the learning-curve scores

from sklearn.model_selection import learning_curve
import numpy as np

train_sizes, train_scores, test_scores = learning_curve(
    estimator=pipe_lr,
    X=X_train,
    y=y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=10
)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

print("mean(train)-----------------\n", train_mean, "\n mean(test)-----------------\n", test_mean)

print("STD(train)-----------------\n", train_std, "\n STD(test)-----------------\n", test_std)

mean(train)-----------------
[0.9525 0.96049383 0.93032787 0.92822086 0.93382353 0.93469388
0.94090909 0.94740061 0.94945652 0.95378973]
mean(test)-----------------
[0.92763285 0.92763285 0.93415459 0.93415459 0.93855072 0.94516908
0.94956522 0.947343 0.94516908 0.94956522]
STD(train)-----------------
[0.0075 0.00493827 0.00839914 0.01132895 0.00395209 0.00730145
0.00862865 0.0072109 0.00656687 0.00632397]
STD(test)-----------------
[0.0350718 0.02911549 0.02165313 0.02743013 0.02529372 0.02426857
0.0238436 0.02421442 0.02789264 0.02919026]
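
One way to read these numbers is through the train/validation gap: a large gap suggests overfitting, and here it stays small (a quick check on the arrays above):

# Gap between training and validation accuracy at each training size;
# a small, shrinking gap suggests the model generalizes well.
gap = train_mean - test_mean
print("train - validation gap:\n", gap)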

Learning Curve Graph

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes,
        train_mean,
        color="blue",
        marker="o",
        markersize=10,
        label="training acc.")
ax.fill_between(train_sizes,
                train_mean + train_std,
                train_mean - train_std,
                alpha=0.15, color="darkblue")

ax.plot(train_sizes,
        test_mean, color="green",
        marker="s",
        linestyle="--",  # drawn as a dashed line
        markersize=10,
        label="validation acc.")

ax.fill_between(train_sizes,
                test_mean + test_std,
                test_mean - test_std,
                alpha=0.15, color="salmon")
plt.grid()
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend(loc="lower right")
plt.ylim([0.8, 1.03])
plt.tight_layout()
plt.show()

# As the sample count grows, the two curves move closer together.

[Figure: ML_Learning_Curve]


I can tell this is a good field, and it's fun, but still... haha