DecisionTreeMachineLearning(03)

Machine Learning Model Algorithms

  • Non-linear models: KNN, …
  • Linear models: …




Decision Tree Machine Learning

ML_DecisionTree01

Introduction

  • Overfitting: when a model pushes its split conditions (depth) purely to raise accuracy, it loses the flexibility to cope with unseen, real-world data.
  • Pruning preserves that flexibility:
    • Pick a few rough max_depth values (3, 5, 10, …) and compare the RMSE
    • Random search
    • Hyperparameter tuning (grid search) — see the sketch below
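A minimal sketch of a grid search over max_depth, assuming scikit-learn and the iris data used later in this post:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try a few rough depths and keep the best by cross-validated accuracy
param_grid = {"max_depth": [3, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)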



  • Split criteria (the formulas are shown below and in the book)

    1. Information gain:
      • The lower the impurity of the child nodes, the larger the information gain (efficiency up).
      • The tree automatically splits on the attribute with the highest information gain.
      1. Entropy:
        • Entropy measures impurity: the lower a child node's entropy, the purer that node (0 = perfectly pure).
      2. Gini impurity:
        • Lower impurity (higher purity) is better.
      3. Misclassification error:
        • A simple way to compare which split scenario is better.
        • It is 0 for a perfectly separated node and largest when the classes are evenly mixed.

PythonMacnineLeanting_equ

The formulas are shown here; the standard definitions follow.
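For reference, the standard definitions (with $p(i \mid t)$ the fraction of class $i$ at node $t$ and $c$ the number of classes):

$$I_H(t) = -\sum_{i=1}^{c} p(i \mid t)\,\log_2 p(i \mid t), \qquad I_G(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2, \qquad I_E(t) = 1 - \max_i\, p(i \mid t)$$

and the information gain of splitting a parent set $D_p$ on feature $f$ into children $D_j$:

$$IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p}\, I(D_j)$$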


The computer does all the calculation.

We just need to look at the results and choose a good split criterion.

Criterion 1. Misclassification error

PythonMacnineLeanting_E01

Criterion 2. Gini impurity

PythonMacnineLeanting_E02

Criterion 3. Entropy

PythonMacnineLeanting_E03


  • Find the option that maximizes information gain.





Practice

from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]   # petal length, petal width
y = iris.target

print("Class labels:", np.unique(y))

Class labels: [0 1 2]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 1
)

print("y label counts:", np.bincount(y))

y label counts: [50 50 50]



Visualization

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # Set up the markers and the color map.
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Draw the decision surface.
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')

    # Highlight the test samples.
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]

        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    c='none',  # transparent face ('' breaks on newer matplotlib)
                    edgecolor='black',
                    alpha=1.0,
                    linewidth=1,
                    marker='o',
                    s=100,
                    label='test set')
import matplotlib.pyplot as plt
import numpy as np

# Gini impurity function (binary case)
def gini(p):
    return p * (1 - p) + (1 - p) * (1 - (1 - p))


# Entropy function (binary case)
def entropy(p):
    return - p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Misclassification error
def error(p):
    return 1 - np.max([p, 1 - p])

x = np.arange(0.0, 1.0, 0.01)

ent = [entropy(p) if p != 0 else None for p in x]
sc_ent = [e * 0.5 if e else None for e in ent]
err = [error(i) for i in x]

fig = plt.figure()
ax = plt.subplot(111)
for i, lab, ls, c in zip([ent, sc_ent, gini(x), err],
                         ['Entropy', 'Entropy (scaled)',
                          'Gini Impurity', 'Misclassification Error'],
                         ['-', '-', '--', '-.'],
                         ['black', 'lightgray', 'red', 'green', 'cyan']):
    line = ax.plot(x, i, label=lab, linestyle=ls, lw=2, color=c)

ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
          ncol=5, fancybox=True, shadow=False)

ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
plt.ylim([0, 1.1])
plt.xlabel('p(i=1)')
plt.ylabel('Impurity Index')
plt.show()

Impurity_Index


  • Find the option that maximizes information gain:
from sklearn.tree import DecisionTreeClassifier

tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree_gini.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3)

  • Because max_depth is set to 3, the tree does not overfit.
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))

plot_decision_regions(X_combined, y_combined, classifier=tree_gini, test_idx = range(105, 150))
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.legend(loc = "upper left")
plt.tight_layout()
plt.show()

tree_gini_layout

from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz

dot_data = export_graphviz(tree_gini,
                           filled=True,
                           rounded=True,
                           class_names=['Setosa',
                                        'Versicolor',
                                        'Virginica'],
                           feature_names=['petal length',
                                          'petal width'],
                           out_file=None)
graph = graph_from_dot_data(dot_data)
graph.write_png('gini_tree.png')

True

gini_tree




  • Build one tree with gini and one with entropy, and compare them.
  • gini: the default criterion
  • entropy: try it as well and compare
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree_entropy.fit(X_train, y_train)

X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))

plot_decision_regions(X_combined, y_combined, classifier=tree_entropy, test_idx = range(105, 150))
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.legend(loc = "upper left")
plt.tight_layout()
plt.show()

tree_Entropy_layout

  • Visualize the model as a diagram:

from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz

dot_data = export_graphviz(tree_entropy,
                           filled=True,
                           rounded=True,
                           class_names=['Setosa',
                                        'Versicolor',
                                        'Virginica'],
                           feature_names=['petal length',
                                          'petal width'],
                           out_file=None)
graph = graph_from_dot_data(dot_data)
graph.write_png('Entropy_tree.png')

Entropy_tree

When entropy reaches 0, there is no need to split any further.

  • scikit-learn does not offer misclassification error as a split criterion.
  • Look at both gini and entropy and pick the better one; a quick comparison is sketched below.
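A minimal sketch of that comparison on the held-out test set, assuming the tree_gini and tree_entropy models fitted above:

from sklearn.metrics import accuracy_score

# Compare the test accuracy of the two criteria
for name, model in [("gini", tree_gini), ("entropy", tree_entropy)]:
    acc = accuracy_score(y_test, model.predict(X_test))
    print(name, "test accuracy:", round(acc, 3))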

Entropy_gini

Learning machine learning

<아직 안배운 부분 → Not covered yet>

  • Stacking algorithms (ensembles)

Auto Machine Learning by pycaret(01)

AutoMachineLearning by pycaret

pycaret

gitHub_pycaret



Doing AutoML with pycaret

  • low-code machine learning library
  • PyCaret 2.0 ver.
    • The final destination an analyst should reach
    • Machine learning + operations: deployment ->
      • MLflow, Airflow, Kubeflow…

gitHub and pycaret




pycaret install

!pip install pycaret

# !pip install pycaret==2.0

Collecting pycaret
  Downloading pycaret-2.3.5-py3-none-any.whl (288 kB)
… (long dependency download/build log trimmed) …
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed Boruta-0.3 Mako-1.1.6 PyYAML-6.0 alembic-1.4.1 databricks-cli-0.16.2 docker-5.0.3 funcy-1.16 gitdb-4.0.9 gitpython-3.1.24 gunicorn-20.1.0 htmlmin-0.1.12 imagehash-4.2.1 imbalanced-learn-0.7.0 joblib-1.0.1 kmodes-0.11.1 lightgbm-3.3.1 mlflow-1.22.0 mlxtend-0.19.0 multimethod-1.6 pandas-profiling-3.1.0 phik-0.12.0 prometheus-flask-exporter-0.18.6 pyLDAvis-3.2.2 pycaret-2.3.5 pydantic-1.8.2 pynndescent-0.5.5 pyod-0.9.5 python-editor-1.0.4 querystring-parser-1.2.4 requests-2.26.0 scikit-learn-0.23.2 scikit-plot-0.3.7 scipy-1.5.4 smmap-5.0.0 tangled-up-in-unicode-0.1.0 umap-learn-0.5.2 visions-0.7.4 websocket-client-1.2.3

You can install pycaret as-is, or pin a specific version.

  • First just get an overview here; the AutoML walkthrough from GitHub should be run with pycaret 2.0.




If installed on Google Colab

After installing, be sure to do Runtime > Restart runtime (Ctrl+M) once,

then follow the steps below.

** If an error occurs, reset the runtime, import again, restart the runtime, and continue.

  • Not sure why; just noting that this is what works.




Data Load

from pycaret.datasets import get_data
data = get_data("diamond")

pycaret_diamond_data




pycaret.regression

from pycaret.regression import *
reg_set = setup(data, target = 'Price', transform_target = True,
                log_experiment = True, experiment_name = 'diamond')

  • pycaret.regression: the module for regression tasks (classification, clustering, etc. have their own modules)

  Description Value
0 session_id 2882
1 Target Price
2 Original Data (6000, 8)
3 Missing Values False
4 Numeric Features 1
5 Categorical Features 6
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (4199, 28)
10 Transformed Test Set (1801, 28)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 10
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment True
18 Experiment Name diamond
19 USI 116c
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize False
28 Normalize Method None
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers False
39 Outliers Threshold None
40 Remove Multicollinearity False
41 Multicollinearity Threshold None
42 Remove Perfect Collinearity True
43 Clustering False
44 Clustering Iteration None
45 Polynomial Features False
46 Polynomial Degree None
47 Trignometry Features False
48 Polynomial Threshold None
49 Group Features False
50 Feature Selection False
51 Feature Selection Method classic
52 Features Selection Threshold None
53 Feature Interaction False
54 Feature Ratio False
55 Interaction Threshold None
56 Transform Target True
57 Transform Target Method box-cox
  • If you want to know more, go study all of those options ^0^




Building the model

  • Finding the best model takes just one line of code:

best = compare_models()
Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
lightgbm Light Gradient Boosting Machine 637.8811 1.928277e+06 1367.4159 0.9813 0.0677 0.0491 0.120
et Extra Trees Regressor 748.9529 2.253684e+06 1478.3926 0.9782 0.0802 0.0594 1.199
rf Random Forest Regressor 742.9041 2.417200e+06 1528.6437 0.9765 0.0785 0.0579 1.090
gbr Gradient Boosting Regressor 764.6458 2.449865e+06 1544.3382 0.9762 0.0783 0.0583 0.288
dt Decision Tree Regressor 946.3401 3.350058e+06 1811.0705 0.9672 0.1034 0.0756 0.040
ada AdaBoost Regressor 1997.1826 1.710448e+07 4091.7565 0.8350 0.1895 0.1511 0.251
knn K Neighbors Regressor 3072.0318 3.642699e+07 6017.2046 0.6421 0.3636 0.2323 0.086
omp Orthogonal Matching Pursuit 3317.3424 8.643676e+07 9045.7885 0.1344 0.2823 0.2209 0.026
llar Lasso Least Angle Regression 6540.9142 1.144871e+08 10682.7674 -0.1241 0.7130 0.5636 0.281
lasso Lasso Regression 6540.9147 1.144871e+08 10682.7665 -0.1241 0.7130 0.5636 0.025
en Elastic Net 6540.9147 1.144871e+08 10682.7665 -0.1241 0.7130 0.5636 0.025
dummy Dummy Regressor 6540.9142 1.144871e+08 10682.7674 -0.1241 0.7130 0.5636 0.021
ridge Ridge Regression 3376.7759 4.409370e+08 17429.1601 -3.0382 0.2235 0.1734 0.026
br Bayesian Ridge 3464.5342 6.180348e+08 19047.2745 -4.5803 0.2244 0.1745 0.028
huber Huber Regressor 3490.0167 7.900161e+08 19860.5244 -6.0721 0.2254 0.1729 0.118
lr Linear Regression 3566.8112 8.908481e+08 21034.8582 -6.9766 0.2253 0.1755 0.309
par Passive Aggressive Regressor 8585.4060 5.154119e+10 94736.3961 -439.8984 0.2947 0.2745 0.031




Model evaluation

  • Once the best model is confirmed, evaluation is also a single line of code:

plot_model(best)

plot_model



plot_model(best, plot = "feature")

pycarat_plot_model_feature




Saving and deploying the model

  • MLOps concepts, REST API, Flask

finalize_best = finalize_model(best)

# save model
save_model(finalize_best, "diamond_pipeline")

Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[], target='Price',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strategy='...
                  learning_rate=0.1,
                  max_depth=-1,
                  min_child_samples=20,
                  min_child_weight=0.001,
                  min_split_gain=0.0,
                  n_estimators=100,
                  n_jobs=-1,
                  num_leaves=31,
                  objective=None,
                  random_state=2882,
                  reg_alpha=0.0,
                  reg_lambda=0.0,
                  silent='warn',
                  subsample=1.0,
                  subsample_for_bin=200000,
                  subsample_freq=0),
                  silent='warn', subsample=1.0,
                  subsample_for_bin=200000,
                  subsample_freq=0)]],
 verbose=False), 'diamond_pipeline.pkl')
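To reuse the saved pipeline later, something like the following sketch should work (load_model and predict_model come from pycaret.regression; the appended prediction column is named "Label" by convention in pycaret 2.x):

from pycaret.regression import load_model, predict_model

pipeline = load_model("diamond_pipeline")    # reads diamond_pipeline.pkl
preds = predict_model(pipeline, data=data)   # appends the prediction column
print(preds.head())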




MLOps

  • DevOps (development and operations used to be separate teams.)
  • With automation, they are now done together.
    MLOps dashboard

!pip install mlflow --quiet
!pip install pyngrok --quiet

import mlflow

with mlflow.start_run(run_name="MLflow on Colab"):
    mlflow.log_metric("m1", 2.0)
    mlflow.log_param("p1", "mlflow-colab")

# run tracking UI in the background
get_ipython().system_raw("mlflow ui --port 5000 &")


# create remote tunnel using ngrok.com to allow local port access
# borrowed from https://colab.research.google.com/github/alfozan/MLflow-GBRT-demo/blob/master/MLflow-GBRT-demo.ipynb#scrollTo=4h3bKHMYUIG6

from pyngrok import ngrok

# Terminate open tunnels if exist
ngrok.kill()

# Setting the authtoken (optional)
# Get your authtoken from https://dashboard.ngrok.com/auth
NGROK_AUTH_TOKEN = ""
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPs tunnel on port 5000 for http://localhost:5000
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)
|████████████████████████████████| 745 kB 5.4 MB/s
Building wheel for pyngrok (setup.py) ... done
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input> in <module>()
      4 import mlflow
      5
----> 6 with mlflow.start_run(run_name="MLflow on Colab"):
      7     mlflow.log_metric("m1", 2.0)
      8     mlflow.log_param("p1", "mlflow-colab")

/usr/local/lib/python3.7/dist-packages/mlflow/tracking/fluent.py in start_run(run_id, experiment_id, run_name, nested, tags)
    229             + "current run with mlflow.end_run(). To start a nested "
    230             + "run, call start_run with nested=True"
--> 231         ).format(_active_run_stack[0].info.run_id)
    232     )
    233     client = MlflowClient()

Exception: Run with UUID 3cbca838cdd44eac8620700ac1929a64 is already active.

To start a new run, first end the current run with mlflow.end_run().

To start a nested run, call start_run with nested=True


Deployment is the final step, but the catch is that it apparently doesn't work in Google Colab.

I hope the day comes when I can do this all on my own.

sklearn_mL_04_ModuleSelection(2.4)

Chapter 2 _ Starting Machine Learning with scikit-learn (04)

파이썬 머신러닝 완벽 가이드
ref. & copyright(c) Book



Introducing the Model Selection module

  • scikit-learn: the most widely used Python machine learning library

import sklearn

print(sklearn.__version__)

House price prediction Practice 01

Kaggle house price prediction

Kaggle House Prices: Advanced Regression Techniques




# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Downloading and loading the data

import pandas as pd

train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")

train.shape, test.shape
# We should reduce the number of features. Which ones should go?

EDA

  • Remove outliers and duplicate values
  • OverallQual (rates the house condition on a 1-10 scale)
  • A grade of 1 with a high sale price can be judged an outlier: those rows need to be removed.
train.info()
# 80 columns; apart from SalePrice (the dependent variable), the rest are all independent variables = far too many!

train.drop(train[(train['OverallQual'] < 5) & (train['SalePrice'] > 200000)].index, inplace = True)
train.reset_index(drop = True, inplace = True)
train.shape

Visualizing the dependent variable

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

(mu, sigma) = norm.fit(train['SalePrice'])
print("The value of mu before log transformation is:", mu)
print("The value of sigma before log transformation is:", sigma)

fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(train['SalePrice'], color="b", stat="probability")
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")

plt.axvline(mu, color='r', linestyle='--')
plt.text(mu + 10000, 0.11, 'Mean of SalePrice', rotation=0, color='r')
fig.show()
import numpy as np 

train["SalePrice"] = np.log1p(train["SalePrice"]) # visualize the dependent variable after the log transform

(mu, sigma) = norm.fit(train['SalePrice'])
print("The value of mu after log transformation is:", mu)
print("The value of sigma after log transformation is:", sigma)

fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(train['SalePrice'], color="b", stat="probability")
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")

plt.axvline(mu, color='r', linestyle='--')
plt.text(mu + 0.05, 0.111, 'Mean of SalePrice', rotation=0, color='r')
plt.ylim(0, 0.12)
fig.show()

Removing data features

  1. Reduces model training time
  2. Reduces noise in the computation

So: we drop the train ID column.

train_ID = train['Id']
test_ID = test['Id']
train.drop(['Id'], axis=1, inplace=True)
test.drop(['Id'], axis=1, inplace=True)
train.shape, test.shape
# Extract the y values; used when splitting the dataset
y = train['SalePrice'].reset_index(drop=True)

# Once extracted, remove it from the original df
train = train.drop('SalePrice', axis = 1)
train.shape, test.shape, y.shape


# Concatenate the data
# - so the train and test data can be preprocessed together

all_df = pd.concat([train, test]).reset_index(drop=True)
all_df.shape

Checking missing values

  • Handling missing values
    1. Drop: drop the whole column, or drop only certain rows
    2. Fill: 1) numeric: fill with the mean or the median
          2) string: fill with the mode

    3. Fill using statistical techniques (data interpolation)
      • In practice (KNNImputer, etc.) it depends on whether the data is time-series and on the industry, so learn it on the job; a small sketch follows below.
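A minimal, self-contained sketch of option 3 using scikit-learn's KNNImputer (the tiny frame here is hypothetical, purely for illustration):

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0],
                   "b": [2.0, 2.5, None, 4.0]})

# Each missing cell is filled from the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)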
# Check the missing values

def check_na(data, head_num = 6):
    isnull_na = (data.isnull().sum() / len(data)) * 100
    data_na = isnull_na.drop(isnull_na[isnull_na == 0].index).sort_values(ascending=False)
    missing_data = pd.DataFrame({'Missing Ratio' :data_na,
                                 'Data Type': data.dtypes[data_na.index]})
    print("Columns with missing data and their ratios:\n", missing_data.head(head_num))

check_na(all_df, 20)
all_df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage'], axis=1, inplace=True)
check_na(all_df)

# There are still plenty of missing values.

Filling

1. Filling string columns
  1. Extract the object columns
# a = all_df['BsmtCond'].value_counts().mode() # mode(): find the most frequent value
# a

print(all_df['BsmtCond'].value_counts())
print()
print(all_df['BsmtCond'].mode()[0])

# Check the object columns and their count
import numpy as np
cat_all_vars = train.select_dtypes(exclude=[np.number]) # column names whose type is not numeric
print("The whole number of all_vars (string data):", len(list(cat_all_vars)))

# Pull out the column names
final_cat_vars = []
for v in cat_all_vars:
    if v not in ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage']:
        final_cat_vars.append(v)

print("The whole number of final_cat_vars", len(final_cat_vars))

# Fill each column with its mode
for i in final_cat_vars:
    all_df[i] = all_df[i].fillna(all_df[i].mode()[0])

check_na(all_df, 20)
print("Only the numeric columns' missing values remain.")
import numpy as np
num_all_vars = list(train.select_dtypes(include=[np.number]))
print("The whole number of all_vars", len(num_all_vars))

num_all_vars.remove('LotFrontage')

print("The whole number of final_cat_vars", len(num_all_vars))
for i in num_all_vars:
    all_df[i].fillna(value=all_df[i].median(), inplace=True)

print("No missing values remain.")
check_na(all_df, 20)
all_df.info()

Handling skewness: so the data can approach a normal distribution (for survey/thesis statistics, a common rule is -1 < skewness < 1)

  • We will use a Box-Cox transform

  • Positive vs. negative skewness (how far the distribution leans left or right)

  • Positive vs. negative kurtosis (how peaked the distribution is)

  • Adjust so the RMSE becomes optimal (low); the formula is sketched below.
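For reference, the (biased) sample skewness that scipy.stats.skew computes by default:

$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$

Positive $g_1$ means a long right tail, negative a long left tail, and 0 a symmetric distribution.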

from scipy.stats import skew

# Identify which columns are skewed enough to need treatment
def find_skew(x):
    return skew(x)

# num_all_vars: the numeric columns extracted earlier
# apply(find_skew) applies the user-defined function; sort in descending order

skewness_features = all_df[num_all_vars].apply(find_skew).sort_values(ascending=False)
skewness_features

# high_skew = skewness_features[skewness_features > 1]

# Values between 0 and 1 are the baseline; anything outside needs adjusting (to approximate a normal distribution)
# 1. Box-Cox transform : ML -> RMSE (2.5)
# 2. Log transform     : ML -> RMSE (2.1)
# => Since a lower RMSE is better, the log transform is the better choice here.
skewnewss_index = list(skewness_features.index)
skewnewss_index.remove('LotArea')
# Drop LotArea, whose skewness is far too high.
all_numeric_df = all_df.loc[:, skewnewss_index]


fig, ax = plt.subplots(figsize=(10, 6))
ax.set_xlim(0, all_numeric_df.max().sort_values(ascending=False)[0])
ax = sns.boxplot(data=all_numeric_df[skewnewss_index], orient="h", palette="Set1")
ax.xaxis.grid(False)
ax.set(ylabel="Feature names")
ax.set(xlabel="Numeric values")
ax.set(title="Numeric Distribution of Features Before Box-Cox Transformation")
sns.despine(trim=True, left=True)
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

high_skew = skewness_features[skewness_features > 1]
high_skew_index = high_skew.index

print("The data before Box-Cox Transformation: \n", all_df[high_skew_index].head())

for num_var in high_skew_index:
    all_df[num_var] = boxcox1p(all_df[num_var], boxcox_normmax(all_df[num_var] + 1))

print("The data after Box-Cox Transformation: \n", all_df[high_skew_index].head())
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_xscale('log')
ax = sns.boxplot(data=all_df[high_skew_index], orient="h", palette="Set1")
ax.xaxis.grid(False)
ax.set(ylabel="Feature names")
ax.set(xlabel="Numeric values")
ax.set(title="Numeric Distribution of Features After Box-Cox Transformation")
sns.despine(trim=True, left=True)

Derived variables

A key step of feature engineering

  • Don't keep sales volume, unit price, and revenue all at once:
  • sales volume × unit price = revenue (a new value): a derived variable
    • Since ML is arithmetic, keeping derived variables alongside their sources means more computation.
    • It takes longer.
    • Conclusion: it is better to reduce the number of variables.
# Derive a total-area variable by adding up the floor areas
all_df['TotalSF'] = all_df['TotalBsmtSF'] + all_df['1stFlrSF'] + all_df['2ndFlrSF']
all_df = all_df.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1)
print(all_df.shape)
all_df['Total_Bathrooms'] = (all_df['FullBath'] + (0.5 * all_df['HalfBath']) + all_df['BsmtFullBath'] + (0.5 * all_df['BsmtHalfBath']))
all_df['Total_porch_sf'] = (all_df['OpenPorchSF'] + all_df['3SsnPorch'] + all_df['EnclosedPorch'] + all_df['ScreenPorch'])
all_df = all_df.drop(['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath', 'OpenPorchSF', '3SsnPorch', 'EnclosedPorch', 'ScreenPorch'], axis=1)
print(all_df.shape)
  • So read the data definition first: look at data_description.txt before anything else!!! (in practice it often doesn't exist)
  • Visualization: endless per-feature work, plus studying the domain
# Year-related features
num_all_vars = list(train.select_dtypes(include=[np.number]))
year_feature = []
for var in num_all_vars:
    if 'Yr' in var:
        year_feature.append(var)
    elif 'Year' in var:
        year_feature.append(var)
    else:
        print(var, "is not related with Year")
print(year_feature)
fig, ax = plt.subplots(3, 1, figsize=(10, 6), sharex=True, sharey=True)
for i, var in enumerate(year_feature):
    if var != 'YrSold':
        ax[i].scatter(train[var], y, alpha=0.3)
        ax[i].set_title('{}'.format(var), size=15)
        ax[i].set_ylabel('SalePrice', size=15, labelpad=12.5)
plt.tight_layout()
plt.show()

Kgg_House_years



all_df = all_df.drop(['YearBuilt', 'GarageYrBlt'], axis=1)
print(all_df.shape)
# Years since remodeling at the time of sale
YearsSinceRemodel = train['YrSold'].astype(int) - train['YearRemodAdd'].astype(int)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(YearsSinceRemodel, y, alpha=0.3)
fig.show()

Kgg_House_YearsSinceRemodel



all_df['YearsSinceRemodel'] = all_df['YrSold'].astype(int) - all_df['YearRemodAdd'].astype(int)
all_df = all_df.drop(['YrSold', 'YearRemodAdd'], axis=1)
print(all_df.shape)

Dummy variables

String data (non-numeric)

  • Nominal: male students, female students…
  • Ordinal (ordered): grade 1, grade 2, grade 3 (can be converted to weights, grade numbers, etc.)

Kgg_House_StringF



  • Customizing these in detail is better.
  • Following the nominal series, build 17 individual models, but make them look like a single model with a visualization dashboard.
all_df['PoolArea'].value_counts()
# The values are 0 plus a scatter of other values, so...
# Map to 0 and 1
def count_dummy(x):
    if x > 0:
        return 1
    else:
        return 0
all_df['PoolArea'] = all_df['PoolArea'].apply(count_dummy)
all_df['PoolArea'].value_counts()

# Has almost no effect on the overall trend

all_df['GarageArea'] = all_df['GarageArea'].apply(count_dummy)
all_df['GarageArea'].value_counts()

all_df['Fireplaces'] = all_df['Fireplaces'].apply(count_dummy)
all_df['Fireplaces'].value_counts()



Label Encoding, Ordinal Encoding, One-Hot Encoding


  • Label Encoding: for the dependent variable (labels) only
  • Ordinal Encoding: for the independent variables only
  • That is how they must be used, but the concept is the same.
  • One-Hot Encoding: expands each category into its own 0/1 column.
from sklearn.preprocessing import LabelEncoder
import pandas as pd

temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

encoder = LabelEncoder()
encoder.fit(temp['Food_Name'])
labels = encoder.transform(temp['Food_Name'])
print(list(temp['Food_Name']), "==>", labels)
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

encoder = OrdinalEncoder()
labels = encoder.fit_transform(temp[['Food_Name']])
print(list(temp['Food_Name']), "==>", labels.tolist())


# import pandas as pd
# temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
#                      'Calories': [95, 231, 50]})

# temp[['Food_No']] = temp.Food_Name.replace(['Chicken', 'Broccoli', 'Apple'], [1, 2, 3])

# print(temp[['Food_Name', 'Food_No']])

# ValueError: Columns must be same length as key

import pandas as pd

temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

temp = pd.get_dummies(temp)
print(temp)
print(temp.shape)
all_df = pd.get_dummies(all_df).reset_index(drop=True)
all_df.shape

Training and evaluating machine learning models

Splitting the dataset and cross-validation


X = all_df.iloc[:len(y), :]
X_test = all_df.iloc[len(y):, :]
X.shape, y.shape, X_test.shape
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape



Evaluation metrics

MAE


import numpy as np

def mean_absolute_error(y_true, y_pred):

    error = 0
    for yt, yp in zip(y_true, y_pred):
        error = error + np.abs(yt - yp)

    mae = error / len(y_true)
    return mae
MSE

import numpy as np

def mean_squared_error(y_true, y_pred):

    error = 0
    for yt, yp in zip(y_true, y_pred):
        error = error + (yt - yp) ** 2

    mse = error / len(y_true)
    return mse



RMSE


import numpy as np

def root_rmse_squared_error(y_true, y_pred):
    error = 0

    for yt, yp in zip(y_true, y_pred):
        error = error + (yt - yp) ** 2

    mse = error / len(y_true)
    rmse = np.round(np.sqrt(mse), 3)
    return rmse

Test1

y_true = [400, 300, 800]
y_pred = [380, 320, 777]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", root_rmse_squared_error(y_true, y_pred))

Test2

y_true = [400, 300, 800, 900]
y_pred = [380, 320, 777, 600]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", root_rmse_squared_error(y_true, y_pred))



RMSE with sklearn


from sklearn.metrics import mean_squared_error

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
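Because SalePrice was log1p-transformed earlier, an RMSE computed on it is effectively an RMSLE, which is presumably why the helper is named that way. Depending on your scikit-learn version there is also a direct shortcut (the squared parameter exists from 0.22 through the 1.5 releases; newer versions expose sklearn.metrics.root_mean_squared_error instead):

# Same value as the helper above, on scikit-learn 0.22-1.5:
# rmse = mean_squared_error(y_true, y_pred, squared=False)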



Defining the model and evaluating with validation


from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

def cv_rmse(model, n_folds=5):
    cv = KFold(n_splits=n_folds, random_state=42, shuffle=True)
    rmse_list = np.sqrt(-cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv))
    print('CV RMSE value list:', np.round(rmse_list, 4))
    print('CV RMSE mean value:', np.round(np.mean(rmse_list), 4))
    return (rmse_list)

n_folds = 5
rmse_scores = {}
lr_model = LinearRegression()
score = cv_rmse(lr_model, n_folds)
print("linear regression - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['linear regression'] = (score.mean(), score.std())

Submitting the first final prediction values

from sklearn.model_selection import cross_val_predict

X = all_df.iloc[:len(y), :]
X_test = all_df.iloc[len(y):, :]
X.shape, y.shape, X_test.shape

lr_model_fit = lr_model.fit(X, y)
final_preds = np.floor(np.expm1(lr_model_fit.predict(X_test)))
print(final_preds)
submission = pd.read_csv("sample_submission.csv")
submission.iloc[:,1] = final_preds
print(submission.head())
submission.to_csv("The_first_regression.csv", index=False)

Adding more model algorithms

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Linear Regression
lr_model = LinearRegression()

# Decision Tree
tree_model = DecisionTreeRegressor()

# Random Forest Regressor
rf_model = RandomForestRegressor()

# Gradient Boosting Regressor
gbr_model = GradientBoostingRegressor()
score = cv_rmse(lr_model, n_folds)
print("linear regression - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['linear regression'] = (score.mean(), score.std())
score = cv_rmse(tree_model, n_folds)
print("Decision Tree Regressor - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['Decision Tree Regressor'] = (score.mean(), score.std())
score = cv_rmse(rf_model, n_folds)
print("RandomForest Regressor - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['RandomForest Regressor'] = (score.mean(), score.std())
score = cv_rmse(gbr_model, n_folds)
print("Gradient Boosting Regressor - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['Gradient Boosting Regressor'] = (score.mean(), score.std())
fig, ax = plt.subplots(figsize=(10, 6))

ax = sns.pointplot(x=list(rmse_scores.keys()), y=[score for score, _ in rmse_scores.values()], markers=['o'], linestyles=['-'], ax=ax)
for i, score in enumerate(rmse_scores.values()):
    ax.text(i, score[0] + 0.002, '{:.6f}'.format(score[0]), horizontalalignment='left', size='large', color='black', weight='semibold')

ax.set_ylabel('Score (RMSE)', size=20, labelpad=12.5)
ax.set_xlabel('Model', size=20, labelpad=12.5)
ax.tick_params(axis='x', labelsize=13.5, rotation=10)
ax.tick_params(axis='y', labelsize=12.5)
ax.set_ylim(0, 0.25)
ax.set_title('Rmse Scores of Models without Blended_Predictions', size=20)

fig.show()
  • A lower RMSE is better: it indicates a model that predicts well.
lr_model_fit = lr_model.fit(X, y)
tree_model_fit = tree_model.fit(X, y)
rf_model_fit = rf_model.fit(X, y)
gbr_model_fit = gbr_model.fit(X, y)

def blended_learning_predictions(X):
    blended_score = (0.3 * lr_model_fit.predict(X)) + \
                    (0.1 * tree_model_fit.predict(X)) + \
                    (0.3 * gbr_model_fit.predict(X)) + \
                    (0.3 * rf_model_fit.predict(X))
    return blended_score
blended_score = rmsle(y, blended_learning_predictions(X))
rmse_scores['blended'] = (blended_score, 0)
print('RMSLE score on train data:')
print(blended_score)
fig, ax = plt.subplots(figsize=(10, 6))

ax = sns.pointplot(x=list(rmse_scores.keys()), y=[score for score, _ in rmse_scores.values()], markers=['o'], linestyles=['-'], ax=ax)
for i, score in enumerate(rmse_scores.values()):
    ax.text(i, score[0] + 0.002, '{:.6f}'.format(score[0]), horizontalalignment='left', size='large', color='black', weight='semibold')

ax.set_ylabel('Score (RMSE)', size=20, labelpad=12.5)
ax.set_xlabel('Model', size=20, labelpad=12.5)
ax.tick_params(axis='x', labelsize=13.5, rotation=10)
ax.tick_params(axis='y', labelsize=12.5)
ax.set_ylim(0, 0.25)

ax.set_title('Rmse Scores of Models with Blended_Predictions', size=20)

fig.show()
submission.iloc[:,1] = np.floor(np.expm1(blended_learning_predictions(X_test)))
submission.to_csv("The_second_regression.csv", index=False)

submission_House

Regression in python(01)

Chapter 5 _파이썬 머신러닝 완벽 가이드

ref. & copyright(c) Book



Regression

  • Regression: a technique that models the correlation between several independent variables and one dependent variable
  • Regression coefficients: the coefficients weighting the independent variables' effects, corresponding to the slope of the linear regression equation

  • From the machine learning perspective

    • Independent variables: features
    • Dependent variable: target values

=> The goal is to learn, from the given feature and target data, the optimal **regression coefficients**.
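In symbols (a standard formulation, not specific to the book): with features $x_1, \dots, x_n$ and coefficients $w_0, \dots, w_n$,

$$\hat{y} = w_0 + w_1 x_1 + \cdots + w_n x_n, \qquad \text{RSS}(w) = \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$

and ordinary linear regression picks the $w$ that minimizes the RSS.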



✌ Two types of supervised learning

  1. CLASSIFICATION: when the target is a category (discrete values)
  2. REGRESSION: when the target is a number (continuous values)

⚡ Types of regression

  1. By the number of independent variables: single (simple) vs. multiple regression
  2. By how the regression coefficients combine: linear vs. non-linear regression
    • Linear regression: optimizes a straight regression line that minimizes the difference between actual and predicted values (the squared error)
    • Regularization: adjusting the regression coefficients (applying a penalty) to fix the overfitting problem of plain linear regression
  • Ordinary linear regression: optimizes the coefficients to minimize the RSS between predictions and actual values (no regularization)
  • Ridge: linear regression + L2 regularization
    • L2: makes the coefficient values smaller to reduce the predictive influence of relatively large coefficients
  • Lasso: linear regression + L1 regularization (can shrink uninformative coefficients all the way to 0, i.e. feature selection)
  • ElasticNet: linear regression combining L1 and L2 regularization
  • Logistic Regression: despite the name, a classification algorithm; a short sketch of the regularized variants follows below
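A minimal sketch of the regularized variants on hypothetical random data (the alpha and l1_ratio values here are illustrative, not tuned):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.0]) + 0.1 * rng.randn(100)  # third feature is irrelevant

for model in [Ridge(alpha=1.0), Lasso(alpha=0.01), ElasticNet(alpha=0.01, l1_ratio=0.5)]:
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))  # Lasso tends to zero out the useless coefficient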

Ref. scikit-learn

The regression methods covered in the book

Worth revisiting if you want to move on to deep learning.



Generating simulated data with y = 4x + 6 + error

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(123) # reproducibility

X = 2 * np.random.rand(100, 1) # 100 random values
y = 4 * X + 6 + np.random.rand(100, 1)

plt.scatter(X, y)

Scatter plot

X.shape, y.shape
((100, 1), (100, 1))

Finding the optimal slope with gradient descent

# Returns w1_update and w0_update, used to update w1 and w0

def get_weight_updates(w1, w0, X, y, learning_rate=0.01):
    N = len(y)

    # initialize w1_update and w0_update
    w1_update = np.zeros_like(w1)
    w0_update = np.zeros_like(w0)

    # compute the prediction array and the difference between predictions and actual values
    y_pred = np.dot(X, w1.T) + w0
    diff = y - y_pred  # actual - predicted == error

    # matrix of ones so w0_update can be computed with a dot product
    w0_factors = np.ones((N, 1))

    # compute w1_update and w0_update for updating w1 and w0
    w1_update = -(2/N) * learning_rate * (np.dot(X.T, diff))
    w0_update = -(2/N) * learning_rate * (np.dot(w0_factors.T, diff))

    return w1_update, w0_update
w0 = np.zeros((1, 1))
w1 = np.zeros((1, 1))

y_pred = np.dot(X, w1.T) + w0
diff = y-y_pred
print(diff.shape)

w0_factors = np.ones((100, 1))
w1_update = -(2/100) * 0.01 * (np.dot(X.T, diff))
w0_update = -(2/100) * 0.01 * (np.dot(w0_factors.T, diff))

print(w1_update.shape, w0_update.shape)
print(w1, w0)
(100, 1)
(1, 1) (1, 1)
[[0.]] [[0.]]
# loop that repeatedly applies the updates

def gradient_descent_steps(X, y, iters=100000):

    # initialize w0 and w1 to 0
    w0 = np.zeros((1, 1))
    w1 = np.zeros((1, 1))

    # repeat iters times, calling get_weight_updates each step
    for ind in range(iters):
        w1_update, w0_update = get_weight_updates(w1, w0, X, y, learning_rate=0.01)
        w1 = w1 - w1_update
        w0 = w0 - w0_update

    return w1, w0
  • Define a cost function for the prediction error and run gradient descent.
def get_cost(y, y_pred):
    N = len(y)

    cost = np.sum(np.square(y - y_pred)) / N
    return cost

w1, w0 = gradient_descent_steps(X, y, iters=100000)
print("w1:{0:.4f}, w0:{1:.4f}".format(w1[0, 0], w0[0, 0]))

y_pred = w1[0, 0] * X + w0
print("Total Cost:{0:.4f}".format(get_cost(y, y_pred)))
w1:3.9462, w0:6.5590
Total Cost:0.0803
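As a quick cross-check (not part of the book's example), scikit-learn's closed-form LinearRegression fitted on the same X and y should land very close to the w1 and w0 found by gradient descent above:

from sklearn.linear_model import LinearRegression

lr_check = LinearRegression()
lr_check.fit(X, y)
print("coef:", lr_check.coef_[0][0], "intercept:", lr_check.intercept_[0])  # ~3.95, ~6.56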
plt.scatter(X, y)
plt.plot(X, y_pred, color = "r")

Scatter plot with the fitted regression line

import pandas as pd


bostonDF = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv")
bostonDF.head()

EDA

  • The dependent variable (y) is medv.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(16, 8), ncols=4, nrows=2)
lm_features = ["rm", "zn", "indus", "nox", "age", "ptratio", "lstat", "rad"]

for i, feature in enumerate(lm_features):
    row = i // 4
    col = i % 4
    print("row is {}, col is {}".format(row, col))
    sns.regplot(x=feature, y="medv", data=bostonDF, ax=ax[row][col])

Multi Graphs

  • A scatter plot or regression fit works for two continuous variables.
  • Use a box plot when x is categorical and y is medv.

Regression coefficients per feature, sorted (output from an earlier fit on the full feature set, shown for reference):

rm          3.4
chas        3.0
rad         0.4
zn          0.1
b           0.0
tax        -0.0
age         0.0
indus       0.0
crim       -0.1
lstat      -0.6
ptratio    -0.9
dis        -1.7
nox       -19.8


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # model

y_target = bostonDF["medv"]  # dependent variable, y
X_data = bostonDF.drop(['medv', 'rad', 'zn', 'b', 'tax', 'age', 'indus', 'crim', 'lstat'], axis=1, inplace=False)  # independent variables

y_target.shape, X_data.shape
((506,), (506, 5))

Splitting the dataset

  • Predict and simulate on held-out (virtual) data.
  • Compare the predicted values against the actual values.
# random sampling

X_train, X_test, y_train, y_test = train_test_split(X_data, y_target, test_size = 0.3, random_state=156)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((354, 5), (152, 5), (354,), (152,))

Building the ML model

lr = LinearRegression()
lr.fit(X_train, y_train)


y_preds = lr.predict(X_test)
y_preds
array([26.78074859, 16.40377991, 34.38443472, 19.13328473, 32.89690238,
       19.25298249, 28.32071818, 22.76654888,  9.87108567, 14.66339227,
       21.55844556, 17.27788854, 28.55574467, 38.50512646, 23.60848806,
       24.03347202, 23.82317119, 15.9119451 , 28.65132167, 20.98388455,
       20.29188703, 18.37003455, 18.58675839, 14.89143225, 35.24799305,
        7.70600921, 19.39133905, 15.97963635, 16.90296718, 15.484303  ,
       29.67753869, 17.58268684, 16.91992352, 22.47407959, 16.57706526,
       18.5381101 , 13.34337954, 24.11893098, 15.48185399, 24.3234222 ,
       36.24776797, 19.60882283, 20.95016211,  6.85667164, 20.32077896,
       23.05614583, 24.65371876, 35.25609168, 22.32959594, 25.96437918,
       27.29101785, 43.32992941, 41.76994078, 19.34288261, 24.8690423 ,
       25.99270875, 20.76285715, 33.13792328, 25.00439224, 16.82906893,
       22.80895172, 23.72489982, 24.53360315, 11.82722067, 17.55728132,
       37.43371362, 33.37256916, 25.65966256, 20.90725715, 21.09529467,
       15.22097444, 30.6234335 , 37.42143489, 26.22092177, 16.71532104,
       32.62735407, 23.41004013, 23.86575538, 18.75430877, 15.9914079 ,
       30.87778491, 16.04423898, 19.01496945, 20.04269634, 28.30832805,
       15.1948795 , 30.47430322, 33.93480059, 23.87721263, 29.7167635 ,
       29.85142798, 19.10737457, 28.49523963, 27.69846662, 25.49534489,
       24.59255802, 12.34870184, 26.65951587, 31.26197918, 17.86101862,
       27.3059424 , 18.18058484, 15.67184217, 13.17304165, 17.91281425,
       23.48894551, 24.53921273, 28.14530028, 16.05340908, 24.22120622,
       21.94517346, 26.62930956, 11.39298015, 18.53099857, 22.75407122,
       33.6679728 , 23.35342973, 20.85267956, 19.69347759, 28.12264641,
       28.56541499, 17.91759633, 27.83520695, 33.8011824 , 21.75436813,
       26.6360736 , 14.70682076, 19.99114889, 21.81029849, 31.72247354,
       21.33041025, 23.52438417, 35.55842163, 20.54294729, 38.34696416,
       19.25750865, 17.07595035, 18.31764392, 17.66658651, 23.12171447,
       19.58446231, 19.90774119, 14.84809066, 19.50652744, 38.83812958,
       15.26095952, 28.56874885, 17.62298514, 22.46794555, 23.28435884,
       18.8439135 , 31.16286012])

Model evaluation

from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_preds)

print("MSE: {0:.3f}".format(mse))
MSE: 21.369
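r2_score is imported above but never used; a small addition reports RMSE (in the same units as medv) and R² alongside MSE:

import numpy as np

rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_preds)
print("RMSE: {0:.3f}, R^2: {1:.3f}".format(rmse, r2))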

y = intercept + (rm coefficient × rm value) + (each remaining feature's coefficient × its value) + …

import numpy as np

print("Intercept:", lr.intercept_)
print("Coefficients:", np.round(lr.coef_, 1))

coeff_df = pd.Series(data=np.round(lr.coef_, 1), index=X_data.columns)
coeff_df.sort_values(ascending=False)
Intercept: 26.830373506191982
Coefficients: [  4.3 -33.1   6.5  -1.1  -1.2]





rm          6.5
chas        4.3
dis        -1.1
ptratio    -1.2
nox       -33.1
dtype: float64
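The prediction really is the intercept plus the coefficient-weighted feature values, as the formula above says. A quick check on the first test row (illustrative only):

first_row = X_test.iloc[0].values
manual = lr.intercept_ + np.dot(lr.coef_, first_row)
print(manual, lr.predict(X_test.iloc[[0]])[0])  # the two values should match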

A useful feature we haven't covered yet

!pip install pycaret
Collecting pycaret
  Downloading pycaret-2.3.5-py3-none-any.whl (288 kB)
... (dependency download, build, and uninstall log trimmed) ...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed Boruta-0.3 Mako-1.1.6 PyYAML-6.0 alembic-1.4.1 databricks-cli-0.16.2 docker-5.0.3 funcy-1.16 gitdb-4.0.9 gitpython-3.1.24 gunicorn-20.1.0 htmlmin-0.1.12 imagehash-4.2.1 imbalanced-learn-0.7.0 joblib-1.0.1 kmodes-0.11.1 lightgbm-3.3.1 mlflow-1.22.0 mlxtend-0.19.0 multimethod-1.6 pandas-profiling-3.1.0 phik-0.12.0 prometheus-flask-exporter-0.18.6 pyLDAvis-3.2.2 pycaret-2.3.5 pydantic-1.8.2 pynndescent-0.5.5 pyod-0.9.5 python-editor-1.0.4 querystring-parser-1.2.4 requests-2.26.0 scikit-learn-0.23.2 scikit-plot-0.3.7 scipy-1.5.4 smmap-5.0.0 tangled-up-in-unicode-0.1.0 umap-learn-0.5.2 visions-0.7.4 websocket-client-1.2.3
from pycaret.utils import enable_colab
enable_colab()
Colab mode enabled.
from pycaret.datasets import get_data
dataset = get_data('diamond')
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (5400, 8)
Unseen Data For Predictions: (600, 8)
from pycaret.regression import *
exp_reg101 = setup(data = data, target = 'Price', session_id=123)
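The usual next step in the pycaret workflow (not run in this post) is compare_models(), which cross-validates a battery of regressors on the setup data and returns the best one. A sketch assuming the pycaret 2.x API installed above:

# compare candidate regressors by cross-validated error and keep the best one
best_model = compare_models()
print(best_model)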

Make Timer function in python

Building a decorator function that measures execution time



import time

Writing a function that checks execution time

def timer(func):
    """Check a function's execution time.
    :param func: the function to check
    :return: the wrapped function (prints the elapsed time)
    """
    def wrapper(*args, **kwargs):
        # current time
        time_start = time.time()

        # call the decorated function
        result = func(*args, **kwargs)
        time_total = time.time() - time_start

        print("{}, Total time is {: .2f} sec.".format(func.__name__, time_total))

        return result
    return wrapper
  • Variadic parameters: *args

    • A * before a parameter in a function definition means the function accepts an unspecified number of positional arguments.
    • The received arguments are packed into a tuple.
    • The conventional name is *args (short for "arguments").
    • It can be mixed with other parameters.
  • Keyword parameters: **kwargs

    • Receives keyword arguments not declared in the function signature, delivered as a dict.
    • When combined with normal and variadic parameters, the order must be kept: normal > variadic > keyword.
    • The conventional name is **kwargs (short for "keyword arguments"); a quick demo follows below.
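A quick demo of the packing described above (not from the lecture):

def show_args(*args, **kwargs):
    print(args)    # positional arguments packed into a tuple
    print(kwargs)  # keyword arguments packed into a dict

show_args(1, 2, 3, name="timer", unit="sec")
# (1, 2, 3)
# {'name': 'timer', 'unit': 'sec'}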



Apply the timing decorator to a function and run it


@timer
def check_time(num):
    time.sleep(num)

if __name__ == "__main__":
    check_time(1.5)

out

check_time, Total time is 1.50 sec.







The related background is written up below.

  • timestamp: in Python, time is expressed as the number of seconds elapsed since 1970-01-01 00:00:00 (the Unix epoch).
  • The time_struct class
    • provides an API for extracting the date and time from a given timestamp.
| name | value | ex. |
|:---|:---|:---|
| tm_year | year | 1993, 2021 |
| tm_mon | month | 1~12 |
| tm_mday | day | 1~31 |
| tm_hour | hour | 0~23 |
| tm_min | minute | 0~59 |
| tm_sec | second | 0~61 |
| tm_wday | weekday | 0~6 (0: MON) |
| tm_yday | day of the year | 1~366 |
| tm_isdst | summer time | 0: unapplied, 1: applied |



The time() function

  • Get the current timestamp.

in

secs = time.time()
print(secs)

out

1638870356.8049076

  • A Unix timestamp is returned as a float; the integer part is the number of seconds.

Additional time functions

  1. gmtime(): converts a timestamp to a time_struct in GMT.

    in

    tm = time.gmtime(secs)
    print(tm)

    out

    time.struct_time(tm_year=2021, tm_mon=12, tm_mday=7, tm_hour=9, tm_min=53, tm_sec=5, tm_wday=1, tm_yday=341, tm_isdst=0)

  2. localtime(): converts a timestamp to a time_struct in the local time zone.

    in

    tm = time.localtime(secs)
    print("year:", tm.tm_year)
    print("month:", tm.tm_mon)
    print("day:", tm.tm_mday)
    print("hour:", tm.tm_hour)
    print("minute:", tm.tm_min)
    print("second:", tm.tm_sec)

    out

    year: 2021
    month: 12
    day: 7
    hour: 18
    minute: 53
    second: 5

  3. ctime(): formats as 'weekday month day hh:mm:ss year'.

    in

    string = time.ctime(secs)
    print(string)

    out

    Tue Dec 7 18:56:03 2021

  4. strftime(): converts a time_struct into any format string you specify.
    • Because its parameter is time_struct data, use one of the functions above to produce the data you pass to strftime().

      in

      tmt = time.localtime(secs)
      string = time.strftime('%Y-%m-%d %I:%M:%S %p', tmt)
      print(string)

      out

      2021-12-07 07:00:54 PM

While writing these examples I kept reusing secs, so the timestamps keep creeping up, hahaha ^0^

  5. strptime(): the inverse of strftime(); parses a string in the given format back into a time_struct.

    in

    string = '2021-12-07 07:00:54 PM'
    tmm = time.strptime(string, '%Y-%m-%d %I:%M:%S %p')
    print(tmm)

    out

    time.struct_time(tm_year=2021, tm_mon=12, tm_mday=7, tm_hour=19, tm_min=0,
    tm_sec=54, tm_wday=1, tm_yday=341, tm_isdst=-1)

  6. sleep(): delays execution for a given amount of time.

    in

    print("Start-->")
    time.sleep(1.5)
    print("<--End")

    out

Start-->
<--End

  • time.sleep(sec) delays for the given number of seconds.

Running Python Functions

day 1 Lecture (02)

  • Create and run a user-defined function




Creating a personal function

  • How to write a function in Python

User-defined functions

<Format>

def func_name(parameter):
    # ...
    # do something
    # ...
    return parameter

A user-defined function is a function you write and use yourself; it involves parameters
(parameter: a variable that receives a value passed into the function) and arguments (argument: an input value passed when calling the function).

#/c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : utf-8 -*-

def cnt_letter():
    """A function that counts the letters inside."""  # docstring describing the function
    print("hi")
    return None
  • At the top of a Python file, record the interpreter path (the directory you worked in).
  • Also declare the encoding the file is written in (for Korean: UTF-8).
  • def (short for definition) defines a function; the function name cnt_letter follows it.
  • def cnt_letter(): any parameters (args) go inside the ().
  • Doc: between """ here """ write an explanation of what the function does.
    • You don't work alone, and functions are used by everyone, so document them.
  • print: prints "hi".
  • return: returns None.



Running the function

if __name__ == "__main__":
    print(cnt_letter())
    print(help(cnt_letter()))  # note: this passes the return value (None) to help
    """Help: lets you check what a function does."""
  • Runs the function.

  • __name__ is a global variable, normally set to "__main__" when the file is run directly.

  • It is used to tell direct execution apart from an import; hard to appreciate at first, so just memorize it.

  • Prints the result of cnt_letter().

  • help is a Python built-in function, and here it receives None:

class NoneType(object)
 |  Methods defined here:
 |
 |  __bool__(self, /)
 |      self != 0
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object. See help(type) for accurate signature.

None

help prints the guidance above in the console.

If the run succeeds, it finishes with the line below.

Process finished with exit code 0
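Incidentally, the NoneType help above appears because help(cnt_letter()) first calls the function and then passes its return value (None) to help. Passing the function object itself shows the function's own docstring:

help(cnt_letter)
# Help on function cnt_letter in module __main__:
#
# cnt_letter()
#     A function that counts the letters inside.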

There is a tistory post 100 times better than today's lecture, so I'm sharing it. .. ah... my life

User-defined functions

MarkDown Table, hr, `code`

MarkDown Table

Drawing tables in Markdown syntax

  • Pipes make it easy to build a table.

|  type     |      symbol      |      ex       |  exp  |
|:-----------:|:-------------:|:-------------:|:------:|
| assignment | = | a = 10 | bind the name a to 10 |
| augmented assignment | **=, +=, -=, *=, //=, %=, <<=, >>=, &=, &#124;=, ^=, @= | a += 10 | bind a to the object that results from adding 10 to a |


As long as the cells roughly line up and the order is right,
you get a neat table like the one below.

| type | symbol | ex | exp |
|:---:|:---:|:---:|:---:|
| assignment | = | a = 10 | bind the name a to 10 |
| augmented assignment | **=, +=, -=, *=, //=, %=, <<=, >>=, &=, \|=, ^=, @= | a += 10 | bind a to the object that results from adding 10 to a |



Alignment inside a table

<!-- adjust the column width with the number of hyphens -->
|Left-aligned | Center-aligned|Right-aligned|
|:---------|:---------------:|---------:|

| Left-aligned | Center-aligned | Right-aligned |
|:---|:---:|---:|
| Left-aligned | Center-aligned | Right-aligned |




Showing a pipe character

The pipe key sits above the Enter key.

&#124;

But written plainly it doesn't show up in Markdown, so escape the character with \|.

If you want a literal pipe (inside a table cell, for example), use &#124;. ^0^

If building a table by hand is a chore, follow the link below.

An easy table generator




Highlighting code

How to highlight inline code

<!-- use the grave (backtick) key, above Tab and left of 1 -->

`code`

`code` turns out to be possible; I only just learned this !!!




Horizontal rule

---
(hyphen X 3)

***
(asterisk X 3)

___
(Underscore X 3)


Ref.

name & namespace

Name

  • name: identifier, variable, reference
  • Used to tell objects apart

object

  • An object carries three elements: a value, a type, and an address.

Ex)

a = 10
b = 0.12

print(a + b)

How this is stored in memory

object(10, int, ref_cnt=1) @ 0x100

= the object stored at address 0x100 has the value 10 and the type int.

** One object can have several names (references).

** ref_cnt means reference count.

name binding

A name is bound by:

- assignment
- import
- class definitions
- function definitions ...

(a small illustration follows below)
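A small illustration that each of these statements creates a name binding in the global namespace (illustrative only):

import math          # import binds the name "math"

def greet():         # a function definition binds the name "greet"
    pass

class Point:         # a class definition binds the name "Point"
    pass

x = 1                # assignment binds the name "x"

print([n for n in dir() if not n.startswith("_")])
# ['Point', 'greet', 'math', 'x']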

assignment

  • a dictionary maps key : value
  • a name binding maps name : address

Assignment binds a name (an address) to an object.

Naming rules

  • One word: letters (including Korean), digits, and _ are allowed (no spaces).
  • Avoid keywords (and the names of existing functions).
  • A name cannot start with a digit; names are case-sensitive.
  • Names starting with _ are conventionally reserved for class internals, so avoid them.

name space

  • A namespace is a dict container for managing names.
  • Every class, instance, and function has its own namespace.

built-in namespace

Where the functions, classes, and instances provided by Python live.

Built-in Functions

global name space

Where the functions, variables, and classes I create live.

a = 100
b = a
c = a
a = 101
print(a, b, c)

◎ namespace (global), before a = 101

a : 0x100
b : 0x100
c : 0x100

◎ in memory

object(100, int, ref_cnt=3) @ 0x100   # before a = 101
object(101, int, ref_cnt=1) @ 0x200   # created by a = 101; 100's ref_cnt then drops to 2

  • So why track ref_cnt at all?
    • To delete objects that are no longer used.
    • One object can have several names,
    • but one name can point to only one object at a time.

In

import sys

a = "Python"
b = a
c = a
a = "python"
print(f'a={a}, b={b}, c={c}')

print(sys.getrefcount(a))
print(sys.getrefcount(b))
print(sys.getrefcount(c))

Out

a=python, b=Python, c=Python

4

5

5

Keywords

Print the keyword list using the pprint function from the pprint module.

in

import keyword
import pprint as pp

pp.pprint(keyword.kwlist)

out

['False',
 'None',
 'True',
 '__peg_parser__',
 'and',
 'as',
 'assert',
 'async',
 'await',
 'break',
 'class',
 'continue',
 'def',
 'del',
 'elif',
 'else',
 'except',
 'finally',
 'for',
 'from',
 'global',
 'if',
 'import',
 'in',
 'is',
 'lambda',
 'nonlocal',
 'not',
 'or',
 'pass',
 'raise',
 'return',
 'try',
 'while',
 'with',
 'yield']

assignment

  • Assignment is a statement, not an expression.

    • An expression is anything that can be evaluated down to a single object.
    • An expression can be bound to a name (the two differ in where the syntax allows them to appear).
    • Since Python 3.8 the assignment expression (:=) is also available.
  • Kinds of assignment:

    | type | symbol | ex | exp |
    |:---:|:---:|:---:|:---:|
    | assignment | = | a = 10 | bind the name a to 10 |
    | augmented assignment | **=, +=, -=, *=, //=, %=, <<=, >>=, &=, \|=, ^=, @= | a += 10 | bind a to the object that results from adding 10 to a |

ref_cnt and the id under addition (+=)

print("**********")
i = 10
print(id(i))
i += 1
print(id(i))
print("**********")

When i is 10 and you run i += 1, an 11 is created.

This 11 is a new object.

assignment_i

Because a new object is created, the id (the memory address) changes.

pack & unPack

  • pack: a comma (,) creates a single Tuple object.
  • unpack: the items in one bundle are separated and each bound to its own name.

Pack and unpack make it easy to assign many values to many names (a starred-unpacking sketch follows at the end of this section).

in

data_int = 1
data_Tuple = 1,

data = 10, 20, 30
first, second, third = data
print(first, second, third)

def function():
    a = 10
    b = 20
    print(locals())
    del b
    print(locals())

function()

Out

10 20 30


{'a': 10, 'b': 20}

{'a': 10}

  • del (variable): one of Python's handy features;
  • it deletes a variable.
  • locals(): lets you inspect the variables in the local scope.
  • type(): lets you check a value's data type.
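As promised above, a starred name in an unpacking assignment packs the leftover items into a list (a small extra sketch, not from the lecture):

data = 10, 20, 30, 40
first, *rest = data        # the starred name packs the remainder into a list
print(first, rest)         # 10 [20, 30, 40]

head, *middle, tail = data
print(head, middle, tail)  # 10 [20, 30] 40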

Ref. youtube, 1hr