Text Mining in Python

Overview

  • Big data analysis and visualization & text mining





Assessment

  • The following is an example of sentiment classification on Naver Shopping reviews.
  • Fill in appropriate code wherever a # code-entry blank appears.
  • Each blank is worth 10 points.

Installing Mecab on Colab

# Install Mecab on Colab
!git clone https://github.com/SOMJANG/Mecab-ko-for-Google-Colab.git
%cd Mecab-ko-for-Google-Colab
!bash install_mecab-ko_on_colab190912.sh
Cloning into 'Mecab-ko-for-Google-Colab'...
/content/Mecab-ko-for-Google-Colab
Installing konlpy.....
Successfully installed JPype1-1.3.0 beautifulsoup4-4.6.0 colorama-0.4.4 konlpy-0.5.2
Installing mecab-0.996-ko-0.9.2.tar.gz.....
Downloading mecab-ko-dic-2.1.1-20180720.tar.gz.......
(lengthy pip/wget/build output trimmed)
Successfully Installed
Now you can use Mecab
from konlpy.tag import Mecab
mecab = Mecab()
How to add a user dictionary: https://bit.ly/3k0ZH53
If you get a "NameError: name 'Tagger' is not defined" error, restart the runtime.
Thanks to tana for sharing the fix on their blog.

Understanding and Preprocessing the Naver Shopping Review Data

import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
from collections import Counter
from konlpy.tag import Mecab
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Loading the data

urllib.request.urlretrieve("https://raw.githubusercontent.com/bab2min/corpus/master/sentiment/naver_shopping.txt", filename="ratings_total.txt")
('ratings_total.txt', <http.client.HTTPMessage at 0x7f7d3557f750>)
  • The data has no column headers, so we add two of our own: "ratings" and "reviews".
# (1) Load the data and print the total number of reviews # 200,000
totalDt = pd.read_table('ratings_total.txt', names=['ratings', 'reviews'])
print('전체 리뷰 개수 :',len(totalDt)) # 전체 리뷰 개수 출력
전체 리뷰 개수 : 200000
totalDt[:5]

ratings reviews
0 5 배공빠르고 굿
1 2 택배가 엉망이네용 저희집 밑에층에 말도없이 놔두고가고
2 5 아주좋아요 바지 정말 좋아서2개 더 구매했어요 이가격에 대박입니다. 바느질이 조금 ...
3 2 선물용으로 빨리 받아서 전달했어야 하는 상품이었는데 머그컵만 와서 당황했습니다. 전...
4 5 민트색상 예뻐요. 옆 손잡이는 거는 용도로도 사용되네요 ㅎㅎ
  • Add a binary label column: ratings above 3 become 1 (positive), the rest 0 (negative).
totalDt['label'] = np.select([totalDt.ratings > 3], [1], default=0)
totalDt[:5]

ratings reviews label
0 5 배공빠르고 굿 1
1 2 택배가 엉망이네용 저희집 밑에층에 말도없이 놔두고가고 0
2 5 아주좋아요 바지 정말 좋아서2개 더 구매했어요 이가격에 대박입니다. 바느질이 조금 ... 1
3 2 선물용으로 빨리 받아서 전달했어야 하는 상품이었는데 머그컵만 와서 당황했습니다. 전... 0
4 5 민트색상 예뻐요. 옆 손잡이는 거는 용도로도 사용되네요 ㅎㅎ 1
  • Count the number of unique samples in each column
totalDt['ratings'].nunique(), totalDt['reviews'].nunique(), totalDt['label'].nunique()
(4, 199908, 2)
  • The ratings column holds four values: 1, 2, 4, and 5. The reviews column has 199,908 unique values, yet there are 200,000 reviews, so the data contains duplicate samples. We remove them.
# (2) Remove duplicate rows from the reviews column using drop_duplicates()
totalDt.drop_duplicates(subset=['reviews'], inplace=True)
print('총 샘플의 수 :',len(totalDt))
총 샘플의 수 : 199908
  • Check for NULL values
print(totalDt.isnull().values.any())
False
  • Split the data into training and test sets at a 3:1 ratio
train_data, test_data = train_test_split(totalDt, test_size = 0.25, random_state = 42)
print('훈련용 리뷰의 개수 :', len(train_data))
print('테스트용 리뷰의 개수 :', len(test_data))
훈련용 리뷰의 개수 : 149931
테스트용 리뷰의 개수 : 49977

Checking the label distribution

# (3) Draw a bar chart of label counts (1 vs 0)
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1, figsize=(7, 5))

plot_Dt = train_data['label'].value_counts().plot(kind='bar', color='orange', edgecolor='black').legend()

plt.title('train_data', fontsize=20)  # title
plt.ylabel('Count', fontsize=10)      # y-axis label
plt.show()

train_data

print(train_data.groupby('label').size().reset_index(name = 'count'))
   label  count
0      0  74918
1      1  75013
  • Both labels have about 75,000 samples each, a roughly 50:50 split

Cleaning the Data

  • Use a regular expression to remove everything except Hangul characters and spaces.
# Remove everything except Hangul and spaces
# (4) Strip every character that is not Hangul or a space
train_data['reviews'] = train_data['reviews'].str.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]","")
train_data['reviews'].replace('', np.nan, inplace=True)
print(train_data.isnull().sum())
ratings    0
reviews    0
label      0
dtype: int64
  • Apply the same process to the test data.
# (5) Apply to the test data
# code 1: remove duplicates
# code 2: apply the regular expression
# code 3: convert empty strings to Null
# code 4: drop Null values
test_data.drop_duplicates(subset = ['reviews'], inplace=True) # remove duplicates
test_data['reviews'] = test_data['reviews'].str.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]","") # apply the regular expression
test_data['reviews'].replace('', np.nan, inplace=True) # convert empty strings to Null
test_data = test_data.dropna(how='any') # drop Null values
print('전처리 후 테스트용 샘플의 개수 :',len(test_data))
전처리 후 테스트용 샘플의 개수 : 49977

Tokenization

  • Tokenize the text with the Mecab morphological analyzer.
# (6) Instantiate the Mecab class
mecab = Mecab()
print(mecab.morphs('와 이런 것도 상품이라고 차라리 내가 만드는 게 나을 뻔'))
['와', '이런', '것', '도', '상품', '이', '라고', '차라리', '내', '가', '만드', '는', '게', '나을', '뻔']
  • Define stopwords so unneeded tokens can be removed.
# (7) Build the stopword list
stopwords = ['도', '는', '다', '의', '가', '이', '은', '한', '에', '하', '고', '을', '를', '인', '듯', '과', '와', '네', '들', '듯', '지', '임', '게']
  • Apply the same steps to both the training and test data.
train_data['tokenized'] = train_data['reviews'].apply(mecab.morphs)
train_data['tokenized'] = train_data['tokenized'].apply(lambda x: [item for item in x if item not in stopwords])
test_data['tokenized'] = test_data['reviews'].apply(mecab.morphs)
test_data['tokenized'] = test_data['tokenized'].apply(lambda x: [item for item in x if item not in stopwords])

Checking Word and Length Distributions

Let's compute word frequencies separately for positive and negative reviews to see which words dominate each. We store the word lists separately by label.

negative_W = np.hstack(train_data[train_data.label == 0]['tokenized'].values)
positive_W = np.hstack(train_data[train_data.label == 1]['tokenized'].values)
negative_W
positive_W
array(['적당', '만족', '합니다', ..., '잘', '삿', '어요'], dtype='<U25')
  • Use Counter() to tally word frequencies. First, print the 20 most frequent words in negative reviews.
negative_word_count = Counter(negative_W)
print(negative_word_count.most_common(20))
[('네요', 31799), ('는데', 20295), ('안', 19718), ('어요', 14849), ('있', 13200), ('너무', 13058), ('했', 11783), ('좋', 9812), ('배송', 9677), ('같', 8997), ('구매', 8876), ('어', 8869), ('거', 8854), ('없', 8670), ('아요', 8642), ('습니다', 8436), ('그냥', 8355), ('되', 8345), ('잘', 8029), ('않', 7984)]

Words such as '네요', '는데', '안', '않', '너무', and '없' appear mainly in negative reviews. Let's print the same for positive reviews.

positive_word_count = Counter(positive_W)
print(positive_word_count.most_common(20))
[('좋', 39488), ('아요', 21184), ('네요', 19895), ('어요', 18686), ('잘', 18602), ('구매', 16171), ('습니다', 13320), ('있', 12391), ('배송', 12275), ('는데', 11670), ('했', 9818), ('합니다', 9801), ('먹', 9635), ('재', 9273), ('너무', 8397), ('같', 7868), ('만족', 7261), ('거', 6482), ('어', 6294), ('쓰', 6292)]

Words such as '좋', '아요', '네요', '잘', '너무', and '만족' appear most often. Now let's check the length distribution for each label.

# (8) Plot histograms of positive and negative review lengths

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(9,5))
text_len = train_data[train_data['label']==1]['tokenized'].map(lambda x: len(x))
ax1.hist(text_len, color='pink', edgecolor='black')
ax1.set_title('Positive Reviews')
ax1.set_xlabel('length of samples')
ax1.set_ylabel('number of samples')
print('긍정 리뷰의 평균 길이 :', np.mean(text_len))

text_len = train_data[train_data['label']==0]['tokenized'].map(lambda x: len(x))
ax2.hist(text_len, color='skyblue', edgecolor='black')
ax2.set_title('Negative Reviews')
fig.suptitle('Words in texts')
ax2.set_xlabel('length of samples')
ax2.set_ylabel('number of samples')
print('부정 리뷰의 평균 길이 :', np.mean(text_len))
plt.show()

긍정 리뷰의 평균 길이 : 13.5877381253916
부정 리뷰의 평균 길이 : 17.02948557089084

Review_Histogram

  • Negative reviews tend to run somewhat longer than positive ones.
X_train = train_data['tokenized'].values
y_train = train_data['label'].values
X_test= test_data['tokenized'].values
y_test = test_data['label'].values

Integer Encoding

  • Now we integer-encode the training and test data so the model can work with numbers instead of text. First, let's build a vocabulary from the training data.
# (9) Instantiate the integer-encoding tokenizer and fit it on X_train
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

Building the vocabulary assigns each word a unique integer, which you can inspect via tokenizer.word_index. We want to exclude words that appear only once from the NLP pipeline; let's first check how much of the data they account for.

threshold = 2
total_cnt = len(tokenizer.word_index) # number of words
rare_cnt = 0 # count of words appearing fewer than threshold times
total_freq = 0 # total frequency of all words in the training data
rare_freq = 0 # total frequency of words appearing fewer than threshold times

# iterate over (word, count) pairs
for key, value in tokenizer.word_counts.items():
    total_freq = total_freq + value

    # if the word appears fewer than threshold times
    if(value < threshold):
        rare_cnt = rare_cnt + 1
        rare_freq = rare_freq + value

print('단어 집합(vocabulary)의 크기 :',total_cnt)
print('등장 빈도가 %s번 이하인 희귀 단어의 수: %s'%(threshold - 1, rare_cnt))
print("단어 집합에서 희귀 단어의 비율:", (rare_cnt / total_cnt)*100)
print("전체 등장 빈도에서 희귀 단어 등장 빈도 비율:", (rare_freq / total_freq)*100)
단어 집합(vocabulary)의 크기 : 39998
등장 빈도가 1번 이하인 희귀 단어의 수: 18213
단어 집합에서 희귀 단어의 비율: 45.53477673883694
전체 등장 빈도에서 희귀 단어 등장 빈도 비율: 0.7935698749320282

There are about 40,000 words. Words appearing fewer than the threshold of 2 times, i.e., exactly once, make up about 45% of the vocabulary, yet they account for only about 0.8% of all word occurrences in the training data. Since single-occurrence words are unlikely to matter much for NLP, we exclude them from integer encoding.

We cap the vocabulary at its size minus the number of single-occurrence words.

# Remove single-occurrence words (frequency below 2) from the vocabulary.
# +2 accounts for the padding token (0) and the OOV token (1).
vocab_size = total_cnt - rare_cnt + 2
print('단어 집합의 크기 :',vocab_size)
단어 집합의 크기 : 21787

The vocabulary size is now 21,787. Passing it to the tokenizer lets it convert text sequences into integer sequences; words whose indices fall beyond this limit are mapped to OOV during integer encoding.

# (10) Instantiate the tokenizer and convert out-of-vocabulary words to OOV
# code 1
# code 2

tokenizer = Tokenizer(vocab_size, oov_token = 'OOV')
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

To confirm that integer encoding worked, print the top three samples of X_train and X_test.

print(X_train[:3])
[[67, 2060, 299, 14259, 263, 73, 6, 236, 168, 137, 805, 2951, 625, 2, 77, 62, 207, 40, 1343, 155, 3, 6], [482, 409, 52, 8530, 2561, 2517, 339, 2918, 250, 2357, 38, 473, 2], [46, 24, 825, 105, 35, 2372, 160, 7, 10, 8061, 4, 1319, 29, 140, 322, 41, 59, 160, 140, 7, 1916, 2, 113, 162, 1379, 323, 119, 136]]
print(X_test[:3])
[[14, 704, 767, 116, 186, 252, 12], [339, 3904, 62, 3816, 1651], [11, 69, 2, 49, 164, 3, 27, 15, 6, 1, 513, 289, 17, 92, 110, 564, 59, 7, 2]]

Padding

Now we pad the samples so they all have the same length. First, let's look at the longest review and the overall length distribution.

print('리뷰의 최대 길이 :',max(len(l) for l in X_train))
print('리뷰의 평균 길이 :',sum(map(len, X_train))/len(X_train))
plt.hist([len(s) for s in X_train], bins=35, label='bins=35', color="skyblue")
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
리뷰의 최대 길이 : 85
리뷰의 평균 길이 : 15.307521459871541

LengthOfReview

The maximum review length is 85 and the average is about 15.

The histogram suggests that reviews are mostly 60 tokens or shorter.

def below_threshold_len(max_len, nested_list):
    count = 0
    for sentence in nested_list:
        if(len(sentence) <= max_len):
            count = count + 1
    print('전체 샘플 중 길이가 %s 이하인 샘플의 비율: %s'%(max_len, (count / len(nested_list))*100))


  • The maximum length is 85, so let's check how many samples survive intact if we pad to 80.
max_len = 80
below_threshold_len(max_len, X_train)
전체 샘플 중 길이가 80 이하인 샘플의 비율: 99.99933302652553

99.99% of the training reviews are 80 tokens or shorter, so we pad the reviews to length 80.

X_train = pad_sequences(X_train, maxlen = max_len)
X_test = pad_sequences(X_test, maxlen = max_len)

Classifying Naver Shopping Review Sentiment with a GRU

from tensorflow.keras.layers import Embedding, Dense, GRU
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

embedding_dim = 100
hidden_units = 128

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.add(GRU(hidden_units))
model.add(Dense(1, activation='sigmoid'))

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train, epochs=15, callbacks=[es, mc], batch_size=64, validation_split=0.2)

def sentiment_predict(new_sentence):
    new_sentence = re.sub(r'[^ㄱ-ㅎㅏ-ㅣ가-힣 ]','', new_sentence)
    new_sentence = mecab.morphs(new_sentence) # tokenize
    new_sentence = [word for word in new_sentence if not word in stopwords] # remove stopwords
    encoded = tokenizer.texts_to_sequences([new_sentence]) # integer encoding
    pad_new = pad_sequences(encoded, maxlen = max_len) # padding

    score = float(model.predict(pad_new)) # predict
    if(score > 0.5):
        print("{:.2f}% 확률로 긍정 리뷰입니다.".format(score * 100))
    else:
        print("{:.2f}% 확률로 부정 리뷰입니다.".format((1 - score) * 100))
Epoch 1/15
1875/1875 [==============================] - ETA: 0s - loss: 0.2725 - acc: 0.8967
Epoch 00001: val_acc improved from -inf to 0.91916, saving model to best_model.h5
1875/1875 [==============================] - 54s 25ms/step - loss: 0.2725 - acc: 0.8967 - val_loss: 0.2301 - val_acc: 0.9192
Epoch 2/15
1875/1875 [==============================] - ETA: 0s - loss: 0.2158 - acc: 0.9213
Epoch 00002: val_acc improved from 0.91916 to 0.92240, saving model to best_model.h5
1875/1875 [==============================] - 43s 23ms/step - loss: 0.2158 - acc: 0.9213 - val_loss: 0.2137 - val_acc: 0.9224
Epoch 3/15
1875/1875 [==============================] - ETA: 0s - loss: 0.1985 - acc: 0.9289
Epoch 00003: val_acc improved from 0.92240 to 0.92637, saving model to best_model.h5
1875/1875 [==============================] - 44s 24ms/step - loss: 0.1985 - acc: 0.9289 - val_loss: 0.2060 - val_acc: 0.9264
Epoch 4/15
1873/1875 [============================>.] - ETA: 0s - loss: 0.1878 - acc: 0.9332
Epoch 00004: val_acc did not improve from 0.92637
1875/1875 [==============================] - 43s 23ms/step - loss: 0.1878 - acc: 0.9332 - val_loss: 0.2031 - val_acc: 0.9260
Epoch 5/15
1874/1875 [============================>.] - ETA: 0s - loss: 0.1783 - acc: 0.9369
Epoch 00005: val_acc improved from 0.92637 to 0.92670, saving model to best_model.h5
1875/1875 [==============================] - 46s 24ms/step - loss: 0.1783 - acc: 0.9369 - val_loss: 0.2030 - val_acc: 0.9267
Epoch 6/15
1873/1875 [============================>.] - ETA: 0s - loss: 0.1698 - acc: 0.9405
Epoch 00006: val_acc improved from 0.92670 to 0.92764, saving model to best_model.h5
1875/1875 [==============================] - 44s 24ms/step - loss: 0.1697 - acc: 0.9405 - val_loss: 0.2055 - val_acc: 0.9276
Epoch 7/15
1873/1875 [============================>.] - ETA: 0s - loss: 0.1611 - acc: 0.9436
Epoch 00007: val_acc did not improve from 0.92764
1875/1875 [==============================] - 44s 24ms/step - loss: 0.1610 - acc: 0.9437 - val_loss: 0.2098 - val_acc: 0.9244
Epoch 8/15
1875/1875 [==============================] - ETA: 0s - loss: 0.1526 - acc: 0.9473
Epoch 00008: val_acc did not improve from 0.92764
1875/1875 [==============================] - 44s 23ms/step - loss: 0.1526 - acc: 0.9473 - val_loss: 0.2269 - val_acc: 0.9189
Epoch 9/15
1875/1875 [==============================] - ETA: 0s - loss: 0.1435 - acc: 0.9507
Epoch 00009: val_acc did not improve from 0.92764
1875/1875 [==============================] - 44s 24ms/step - loss: 0.1435 - acc: 0.9507 - val_loss: 0.2258 - val_acc: 0.9204
Epoch 00009: early stopping
sentiment_predict('이 상품 진짜 싫어요... 교환해주세요')
99.03% 확률로 부정 리뷰입니다.
sentiment_predict('이 상품 진짜 좋아여... 강추합니다. ')
99.51% 확률로 긍정 리뷰입니다.
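
As a quick sanity check (my own addition, not part of the original run), the best checkpoint saved by ModelCheckpoint can be reloaded and scored on the held-out test set:

# A sketch: reload the checkpoint written during training and evaluate it
# on the padded test arrays prepared above.
loaded_model = load_model('best_model.h5')
loss, acc = loaded_model.evaluate(X_test, y_test, verbose=0)
print('test accuracy: {:.4f}'.format(acc))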

python_basic_Exception

python

Exception

# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8

def error01():
    a = 10
    a/0
    #ZeroDivisionError: division by zero

def error02():
    a = [1, 2, 3, 4, 5]
    a[10]
    #IndexError: list index out of range

def error03():
    a = 1000
    a + "Hello"
    #TypeError: unsupported operand type(s) for +: 'int' and 'str'

def error04():
    a = 10
    a + b
    #NameError: name 'b' is not defined

if __name__ == "__main__":
    error01()
    error02()
    error03()
    error04()
    print("program is done")

Types of Exceptions

Like Java's try/catch construct.

# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8

def try_func(x, idx):
    try:
        return 100/x[idx]
    except ZeroDivisionError:
        print("can't divide by zero")
    except IndexError:
        print("not in range of index")
    except TypeError:
        print("there is a type error")
    except NameError:
        print("parameter is not defined")
    finally:
        print("always executed")


def main():
    a = [50, 60, 0, 70]
    print(try_func(a, 1))

    # Zero Division Error
    print(try_func(a, 0))

    # Index Error
    print(try_func(a, 5))

    # Type Error
    print(try_func(a, "hi"))


if __name__ == "__main__":
    main()

What matters is making sure the program can keep running no matter what happens.
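
A minimal sketch of that idea (my own example, not from the notes above): catch exceptions at the top level so one failing step doesn't kill the whole run.

def safe_run(func, *args):
    # Run func; report any exception instead of letting it propagate.
    try:
        return func(*args)
    except Exception as e:
        print("handled:", type(e).__name__, "-", e)

if __name__ == "__main__":
    safe_run(lambda: 10 / 0)       # handled: ZeroDivisionError
    safe_run(lambda: [1, 2][10])   # handled: IndexError
    print("program is done")       # still reached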




  • Class recap (see the sketch after this list)
  1. __init__ : initializes attributes in one place instead of separate set_name / set_id calls
  2. __eq__, __ne__ : comparison operators
  3. Inheritance, polymorphism (functions shared across different classes)
  4. Exception
  5. Differences among class attribute / instance attribute / instance method
  6. Abstract class (not covered yet)
  7. Data encapsulation
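
A small sketch of items 5 and 7 (my own recap example, not from the lecture): a class attribute is shared, an instance attribute belongs to one object, an instance method acts on one object, and a double-underscore name gives rough data encapsulation via name mangling.

class Counter:
    total = 0                  # class attribute: shared by every instance

    def __init__(self, name):
        self.name = name       # instance attribute: one per object
        self.__count = 0       # name-mangled to _Counter__count (encapsulation)

    def increment(self):       # instance method: acts on a single instance
        self.__count += 1
        Counter.total += 1
        return self.__count

a, b = Counter("a"), Counter("b")
a.increment(); a.increment(); b.increment()
print(a.name, b.name, Counter.total)  # a b 3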



# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8

class SalaryExcept(ValueError): pass # inheritance
class TipExept(SalaryExcept): pass # inheritance

class Employee:

    MIN_SALARY = 30000
    MAX_Bonus = 20000

    def __init__(self, name, salary = 30000):
        self.name = name
        if salary < Employee.MIN_SALARY:
            raise SalaryExcept("The salary is too low!")
        self.salary = salary

    def give_bonus(self, amount):
        if amount > Employee.MAX_Bonus:
            print("The bonus is too large")
        elif self.salary + amount < Employee.MIN_SALARY:
            print("Even after the bonus, the salary is still very low")
        else:
            self.salary += amount

if __name__ == "__main__":
    emp = Employee("YH", salary= 10000)

    try:
        emp.give_bonus(70000)
    except SalaryExcept:
        print("Error Salary")

    try:
        emp.give_bonus(-10000)
    except TipExept:  # fixed: was the undefined name tipExcept
        print("Error Tip")

This code still raises an error: Employee("YH", salary=10000) raises SalaryExcept inside __init__, outside any try block, so my except clauses never get a chance to run. Wrapping the constructor call in its own try/except would handle it.

python_basic_Bank

python

Bank _ Creating an Account

# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8

class Human:

    def __init__(self, name):
        self.name = name


if __name__ == "__main__":
    human01 = Human(name="A")
    human02 = Human(name="A")

    print(human01 == human02)
    print("human 01 : ", human01)
    print("human 02 : ", human02)

False
human 01 : <__main__.Human object at 0x000001686E41CC10>
human 02 : <__main__.Human object at 0x000001686E41CE50>

They compare as different because they are stored at different memory locations.
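
You can see this directly (a small check of my own) by printing each object's identity — the default == falls back to identity comparison:

print(id(human01), id(human02))  # two different ids -> two different objects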

Bank _ Checking the Customer ID

# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8

class Bank:

    # instance attribute
    def __init__(self, cust_id, balance=0):
        self.balance = balance
        self.cust_id = cust_id

    # instance method
    def withdraw(self, amount):
        self.balance -= amount

    def __eq__(self, other):
        print("__eq()__ is called")
        return self.cust_id == other.cust_id

if __name__ == "__main__":
    account01 = Bank(123, 1000)
    account02 = Bank(123, 1000)
    account03 = Bank(456, 1000)
    print(account01 == account02)
    print(account02 == account03)
    print(account01 == account03)

__eq()__ is called
True
__eq()__ is called
False
__eq()__ is called
False



  • Comparison operators
    • != : __ne__()
    • >= : __ge__()
    • <= : __le__()
    • > : __gt__()
    • < : __lt__()
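
Rather than writing all of these by hand, functools.total_ordering can derive the rest from __eq__ plus one ordering method (a standard-library fact, shown here with a made-up Account class):

from functools import total_ordering

@total_ordering
class Account:
    def __init__(self, balance):
        self.balance = balance

    def __eq__(self, other):
        return self.balance == other.balance

    def __lt__(self, other):  # total_ordering fills in <=, >, >=
        return self.balance < other.balance

print(Account(100) > Account(50))  # True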

Using the __eq__() function

# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8

class Bank:

    # instance attribute
    def __init__(self, cust_id, balance=0):
        self.balance, self.cust_id = balance, cust_id


    # instance method
    def withdraw(self, amount):
        self.balance -= amount

    def __eq__(self, other):
        print("__eq()__ is called")
        return (self.cust_id == other.cust_id) and (type(self) == type(other))

class Phone:

    def __init__(self, cust_id):
        self.cust_id = cust_id

    def __eq__(self, other):
        return self.cust_id == other.cust_id


if __name__ == "__main__":
    account01 = Bank(1234)
    phone01 = Phone(1234)

    print(account01 == phone01)

__eq()__ is called
False

Calling __eq__ lets us confirm equality; checking type(self) == type(other) keeps a Bank from comparing equal to a Phone with the same cust_id.

Checking access and log records

# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8


class Bank:
    def __init__(self, cust_id, name, balance = 0):
        self.cust_id, self.name, self.balance = cust_id, name, balance

    def __str__(self):
        cust_str = """
        customer:
            cust_id : {cust_id}
            name : {name}
            balance : {balance}
        """.format(cust_id = self.cust_id, name = self.name, balance = self.balance)

        return cust_str

if __name__ == "__main__":
    bank_cust = Bank(123, "YH")
    print(bank_cust)

  • Nothing is stored in a DB, but this lets you inspect the object in log output.

Comparing str() and repr()

# /c/Users/brill/Desktop/PyThon_Function/venv/Scripts/python
# -*- coding : UTF-8


class Bank:
    def __init__(self, cust_id, name, balance = 0):
        self.cust_id, self.name, self.balance = cust_id, name, balance

    def __str__(self):
        cust_str = """
        customer:
            cust_id : {cust_id}
            name : {name}
            balance : {balance}
        """.format(cust_id = self.cust_id, name = self.name, balance = self.balance)

        return cust_str

    def __repr__(self):
        cust_str = "Bank({cust_id}, '{name}', {balance})".format(cust_id = self.cust_id, name = self.name, balance = self.balance)
        return cust_str

if __name__ == "__main__":
    bank_cust = Bank(123, "YH")
    print(str(bank_cust))
    print(repr(bank_cust))

difference of str() and repr()
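
The usual rule of thumb: __str__ is for end users and __repr__ is for developers, and containers always display their elements via __repr__ — for example (reusing the bank_cust object above):

print(bank_cust)    # uses __str__
print([bank_cust])  # elements of a list are rendered with __repr__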

DecisionTreeMachineLearning(03)

Machine Learning Model Algorithms

  • Non-linear models: KNN,
  • Linear models:




Decision Tree Machine Learning

ML_DecisionTree01

Introduction

  • Overfitting: stressing only the split conditions (depth) to push up model accuracy, which weakens the model's ability to respond flexibly to real-world data.
  • Keep the model flexible through pruning:
    • Set Max_depth roughly (3, 5, 10…) and compare RMS values (see the grid-search sketch below)
    • Random search
    • Hyperparameters (grid search)
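
For example, a grid search over max_depth could look like this (a minimal sketch on the iris data used later in this post, not code from the lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [3, 5, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # best depth and its CV accuracy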



  • Split criteria (the formulas are summarized below and in the book)

    1. Information gain:
      • The lower the impurity of the child nodes, the larger the information gain (higher efficiency).
      • The tree automatically splits on the attribute with the highest information gain.
      1. Entropy:
        • 0 for a pure node and maximal for an even mix, so lower child entropy is better.
      2. Gini impurity:
        • Higher purity is better.
      3. Misclassification error:
        • A way to compute which split scenario is better; a perfectly separated node scores 0.
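
For reference, the standard formulas behind these criteria, written for a node t with class proportions p(i|t):

% entropy, Gini impurity, and misclassification error at node t
I_H(t) = -\sum_{i=1}^{c} p(i \mid t)\, \log_2 p(i \mid t)
I_G(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2
I_E(t) = 1 - \max_i \, p(i \mid t)
% information gain of splitting parent D_p into children D_j
IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)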

PythonMacnineLeanting_equ

The formulas can be found here.


The computer does all the calculation.

We just look at the results and choose a good split criterion.

Criterion 1. Misclassification Error

PythonMacnineLeanting_E01

Criterion 2. Gini Impurity

PythonMacnineLeanting_E02

Criterion 3. Entropy

PythonMacnineLeanting_E03


  • Find the option that maximizes information gain.





Hands-on Practice

from sklearn import datasets 
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

print("클래스 레이블:", np.unique(y))

클래스 레이블: [0 1 2]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 1
)

print("y 레이블 갯수:", np.bincount(y))

y 레이블 갯수: [50 50 50]



Visualization

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # set up markers and the colormap
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # draw the decision boundary
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')

    # highlight the test samples
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]

        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    c='none',  # 'none' instead of the deprecated empty string
                    edgecolor='black',
                    alpha=1.0,
                    linewidth=1,
                    marker='o',
                    s=100,
                    label='test set')
import matplotlib.pyplot as plt
import numpy as np

# Gini impurity function
def gini(p):
    return p * (1 - p) + (1 - p) * (1 - (1 - p))


# entropy function
def entropy(p):
    return - p * np.log2(p) - (1 - p) * np.log2((1 - p))

# misclassification error
def error(p):
    return 1 - np.max([p, 1 - p])

x = np.arange(0.0, 1.0, 0.01)

ent = [entropy(p) if p != 0 else None for p in x]
sc_ent = [e * 0.5 if e else None for e in ent]
err = [error(i) for i in x]

fig = plt.figure()
ax = plt.subplot(111)
for i, lab, ls, c, in zip([ent, sc_ent, gini(x), err],
                          ['Entropy', 'Entropy (scaled)',
                           'Gini Impurity', 'Misclassification Error'],
                          ['-', '-', '--', '-.'],
                          ['black', 'lightgray', 'red', 'green', 'cyan']):
    line = ax.plot(x, i, label=lab, linestyle=ls, lw=2, color=c)

ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
          ncol=5, fancybox=True, shadow=False)

ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
plt.ylim([0, 1.1])
plt.xlabel('p(i=1)')
plt.ylabel('Impurity Index')
plt.show()

Impurity_Index


  • Find the option that maximizes information gain:
from sklearn.tree import DecisionTreeClassifier

tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree_gini.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3)

  • Because max_depth is capped at 3, the tree avoids overfitting.
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))

plot_decision_regions(X_combined, y_combined, classifier=tree_gini, test_idx = range(105, 150))
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.legend(loc = "upper left")
plt.tight_layout()
plt.show()

tree_gini_layout

from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz

dot_data = export_graphviz(tree_gini,
                           filled=True,
                           rounded=True,
                           class_names=['Setosa',
                                        'Versicolor',
                                        'Virginica'],
                           feature_names=['petal length',
                                          'petal width'],
                           out_file=None)
graph = graph_from_dot_data(dot_data)
graph.write_png('gini_tree.png')

True

gini_tree




  • Build one tree with gini and one with entropy, and compare them
  • gini: the default
  • entropy: try it as well and compare
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree_entropy.fit(X_train, y_train)

X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))

plot_decision_regions(X_combined, y_combined, classifier=tree_entropy, test_idx = range(105, 150))
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.legend(loc = "upper left")
plt.tight_layout()
plt.show()

tree_Entropy_layout

  • Diagramming the model
from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz

dot_data = export_graphviz(tree_entropy,
                           filled=True,
                           rounded=True,
                           class_names=['Setosa',
                                        'Versicolor',
                                        'Virginica'],
                           feature_names=['petal length',
                                          'petal width'],
                           out_file=None)
graph = graph_from_dot_data(dot_data)
graph.write_png('Entropy_tree.png')

Entropy_tree

Once entropy reaches 0, no further splits are needed.

  • sklearn does not offer misclassification error as a split criterion.
  • Look at both gini and entropy and choose the better one (see the sketch below).

Entropy_gini
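
One simple way to compare the two trees numerically (my own addition, using the trees and the train/test split from above):

print('gini    test accuracy:', tree_gini.score(X_test, y_test))
print('entropy test accuracy:', tree_entropy.score(X_test, y_test))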

Learning Machine Learning

<Not covered yet>

  • Stacking algorithm (ensemble)

Auto Machine Learning by pycaret(01)

AutoMachineLearning by pycaret

pycaret

gitHub_pycaret



Doing autoML with pycaret

  • A low-code machine learning library
  • PyCaret 2.0 ver.
    • The final destination an analyst needs to reach
    • Machine learning + operations: deployment ->
      • MLflow, Airflow, Kubeflow…

gitHub and pycaret




pycaret install

!pip install pycaret

# !pip install pycaret==2.0

Collecting pycaret
Downloading pycaret-2.3.5-py3-none-any.whl (288 kB)
|████████████████████████████████| 288 kB 5.4 MB/s
Collecting lightgbm>=2.3.1
Downloading lightgbm-3.3.1-py3-none-manylinux1_x86_64.whl (2.0 MB)
|████████████████████████████████| 2.0 MB 54.5 MB/s
Collecting pyod
Downloading pyod-0.9.5.tar.gz (113 kB)
|████████████████████████████████| 113 kB 67.4 MB/s
Requirement already satisfied: textblob in /usr/local/lib/python3.7/dist-packages (from pycaret) (0.15.3)
Requirement already satisfied: yellowbrick>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.3.post1)
Collecting Boruta
Downloading Boruta-0.3-py3-none-any.whl (56 kB)
|████████████████████████████████| 56 kB 4.6 MB/s
Collecting pyLDAvis
Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
|████████████████████████████████| 1.7 MB 63.6 MB/s
Installing build dependencies … done
Getting requirements to build wheel … done
Installing backend dependencies … done
Preparing wheel metadata … done
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (from pycaret) (0.11.2)
Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (from pycaret) (3.2.5)
Collecting imbalanced-learn==0.7.0
Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
|████████████████████████████████| 167 kB 65.5 MB/s
Requirement already satisfied: numpy==1.19.5 in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.19.5)
Collecting kmodes>=0.10.1
Downloading kmodes-0.11.1-py2.py3-none-any.whl (19 kB)
Requirement already satisfied: spacy<2.4.0 in /usr/local/lib/python3.7/dist-packages (from pycaret) (2.2.4)
Collecting umap-learn
Downloading umap-learn-0.5.2.tar.gz (86 kB)
|████████████████████████████████| 86 kB 4.8 MB/s
Requirement already satisfied: IPython in /usr/local/lib/python3.7/dist-packages (from pycaret) (5.5.0)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from pycaret) (3.2.2)
Requirement already satisfied: wordcloud in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.5.0)
Requirement already satisfied: cufflinks>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from pycaret) (0.17.3)
Collecting mlflow
Downloading mlflow-1.22.0-py3-none-any.whl (15.5 MB)
|████████████████████████████████| 15.5 MB 68.5 MB/s
Collecting mlxtend>=0.17.0
Downloading mlxtend-0.19.0-py2.py3-none-any.whl (1.3 MB)
|████████████████████████████████| 1.3 MB 66.9 MB/s
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.1.5)
Collecting scikit-plot
Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.1.0)
Requirement already satisfied: plotly>=4.4.1 in /usr/local/lib/python3.7/dist-packages (from pycaret) (4.4.1)
Collecting scikit-learn==0.23.2
Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
|████████████████████████████████| 6.8 MB 37.1 MB/s
Requirement already satisfied: gensim<4.0.0 in /usr/local/lib/python3.7/dist-packages (from pycaret) (3.6.0)
Collecting pandas-profiling>=2.8.0
Downloading pandas_profiling-3.1.0-py2.py3-none-any.whl (261 kB)
|████████████████████████████████| 261 kB 60.5 MB/s
Requirement already satisfied: ipywidgets in /usr/local/lib/python3.7/dist-packages (from pycaret) (7.6.5)
Requirement already satisfied: scipy<=1.5.4 in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.23.2->pycaret) (3.0.0)
Requirement already satisfied: setuptools>=34.4.1 in /usr/local/lib/python3.7/dist-packages (from cufflinks>=0.17.0->pycaret) (57.4.0)
Requirement already satisfied: colorlover>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from cufflinks>=0.17.0->pycaret) (0.3.0)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.7/dist-packages (from cufflinks>=0.17.0->pycaret) (1.15.0)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim<4.0.0->pycaret) (5.2.1)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (0.7.5)
Requirement already satisfied: pexpect in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (4.8.0)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (5.1.1)
Requirement already satisfied: pygments in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (2.6.1)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (0.8.1)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (1.0.18)
Requirement already satisfied: decorator in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (4.4.2)
Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (5.1.3)
Requirement already satisfied: widgetsnbextension=3.5.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (3.5.2)
Requirement already satisfied: ipython-genutils
=0.2.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (0.2.0)
Requirement already satisfied: ipykernel>=4.5.1 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (4.10.1)
Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (1.0.2)
Requirement already satisfied: jupyter-client in /usr/local/lib/python3.7/dist-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (5.3.5)
Requirement already satisfied: tornado>=4.0 in /usr/local/lib/python3.7/dist-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (5.1.1)
Requirement already satisfied: wheel in /usr/local/lib/python3.7/dist-packages (from lightgbm>=2.3.1->pycaret) (0.37.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (3.0.6)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (2.8.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (0.11.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.7/dist-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (2.6.0)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.7/dist-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (4.9.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->pycaret) (2018.9)
Collecting visions[type_image_path]==0.7.4
Downloading visions-0.7.4-py3-none-any.whl (102 kB)
|████████████████████████████████| 102 kB 12.4 MB/s
Collecting pydantic>=1.8.1
Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
|████████████████████████████████| 10.1 MB 24.8 MB/s
Collecting tangled-up-in-unicode==0.1.0
Downloading tangled_up_in_unicode-0.1.0-py3-none-any.whl (3.1 MB)
|████████████████████████████████| 3.1 MB 22.1 MB/s
Collecting joblib
Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
|████████████████████████████████| 303 kB 60.4 MB/s
Collecting requests>=2.24.0
Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
|████████████████████████████████| 62 kB 805 kB/s
Requirement already satisfied: tqdm>=4.48.2 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (4.62.3)
Collecting PyYAML>=5.0.0
Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
|████████████████████████████████| 596 kB 42.5 MB/s
Requirement already satisfied: missingno>=0.4.2 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (0.5.0)
Requirement already satisfied: markupsafe=2.0.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (2.0.1)
Collecting htmlmin>=0.1.12
Downloading htmlmin-0.1.12.tar.gz (19 kB)
Collecting multimethod>=1.4
Downloading multimethod-1.6-py3-none-any.whl (9.4 kB)
Collecting phik>=0.11.1
Downloading phik-0.12.0-cp37-cp37m-manylinux2010_x86_64.whl (675 kB)
|████████████████████████████████| 675 kB 41.5 MB/s
Requirement already satisfied: jinja2>=2.11.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (2.11.3)
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (2.6.3)
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (21.2.0)
Requirement already satisfied: Pillow in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (7.1.2)
Collecting imagehash
Downloading ImageHash-4.2.1.tar.gz (812 kB)
|████████████████████████████████| 812 kB 37.7 MB/s
Collecting scipy<=1.5.4
Downloading scipy-1.5.4-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
|████████████████████████████████| 25.9 MB 74.1 MB/s
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.7/dist-packages (from plotly>=4.4.1->pycaret) (1.3.3)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->IPython->pycaret) (0.2.5)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from pydantic>=1.8.1->pandas-profiling>=2.8.0->pycaret) (3.10.0.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (2021.10.8)
Requirement already satisfied: charset-normalizer
=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (2.0.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (2.10)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (3.0.6)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.0.5)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (0.8.2)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (7.4.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.0.6)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.1.3)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.0.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (2.0.6)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (0.4.1)
Requirement already satisfied: importlib-metadata>=0.20 in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy<2.4.0->pycaret) (4.8.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<2.4.0->pycaret) (3.6.0)
Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.7/dist-packages (from widgetsnbextension=3.5.0->ipywidgets->pycaret) (5.3.1)
Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension
=3.5.0->ipywidgets->pycaret) (0.12.1)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension=3.5.0->ipywidgets->pycaret) (5.6.1)
Requirement already satisfied: Send2Trash in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension
=3.5.0->ipywidgets->pycaret) (1.8.0)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.7/dist-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets->pycaret) (22.3.0)
Requirement already satisfied: ptyprocess in /usr/local/lib/python3.7/dist-packages (from terminado>=0.8.1->notebook>=4.4.1->widgetsnbextension=3.5.0->ipywidgets->pycaret) (0.7.0)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.7/dist-packages (from imagehash->visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (1.2.0)
Collecting querystring-parser
Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Requirement already satisfied: entrypoints in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (0.3)
Collecting alembic<=1.4.1
Downloading alembic-1.4.1.tar.gz (1.1 MB)
|████████████████████████████████| 1.1 MB 66.5 MB/s
Collecting gitpython>=2.1.0
Downloading GitPython-3.1.24-py3-none-any.whl (180 kB)
|████████████████████████████████| 180 kB 40.6 MB/s
Requirement already satisfied: protobuf>=3.7.0 in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (3.17.3)
Requirement already satisfied: sqlalchemy in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (1.4.27)
Requirement already satisfied: Flask in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (1.1.4)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (1.3.0)
Collecting databricks-cli>=0.8.7
Downloading databricks-cli-0.16.2.tar.gz (58 kB)
|████████████████████████████████| 58 kB 5.6 MB/s
Collecting prometheus-flask-exporter
Downloading prometheus_flask_exporter-0.18.6-py3-none-any.whl (17 kB)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (21.3)
Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (7.1.2)
Requirement already satisfied: sqlparse>=0.3.1 in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (0.4.2)
Collecting gunicorn
Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)
|████████████████████████████████| 79 kB 7.6 MB/s
Collecting docker>=4.0.0
Downloading docker-5.0.3-py2.py3-none-any.whl (146 kB)
|████████████████████████████████| 146 kB 58.9 MB/s
Collecting Mako
Downloading Mako-1.1.6-py2.py3-none-any.whl (75 kB)
|████████████████████████████████| 75 kB 4.2 MB/s
Collecting python-editor>=0.3
Downloading python_editor-1.0.4-py3-none-any.whl (4.9 kB)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.7/dist-packages (from databricks-cli>=0.8.7->mlflow->pycaret) (0.8.9)
Collecting websocket-client>=0.32.0
Downloading websocket_client-1.2.3-py3-none-any.whl (53 kB)
|████████████████████████████████| 53 kB 1.2 MB/s
Collecting gitdb<5,>=4.0.1
Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
|████████████████████████████████| 63 kB 1.6 MB/s
Collecting smmap<6,>=3.0.1
Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy->mlflow->pycaret) (1.1.2)
Requirement already satisfied: itsdangerous<2.0,>=0.24 in /usr/local/lib/python3.7/dist-packages (from Flask->mlflow->pycaret) (1.1.0)
Requirement already satisfied: Werkzeug<2.0,>=0.15 in /usr/local/lib/python3.7/dist-packages (from Flask->mlflow->pycaret) (1.0.1)
Requirement already satisfied: bleach in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension
=3.5.0->ipywidgets->pycaret) (4.1.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension=3.5.0->ipywidgets->pycaret) (1.5.0)
Requirement already satisfied: testpath in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension
=3.5.0->ipywidgets->pycaret) (0.5.0)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension=3.5.0->ipywidgets->pycaret) (0.8.4)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension
=3.5.0->ipywidgets->pycaret) (0.7.1)
Requirement already satisfied: webencodings in /usr/local/lib/python3.7/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension=3.5.0->ipywidgets->pycaret) (0.5.1)
Requirement already satisfied: prometheus-client in /usr/local/lib/python3.7/dist-packages (from prometheus-flask-exporter->mlflow->pycaret) (0.12.0)
Collecting pyLDAvis
Downloading pyLDAvis-3.3.0.tar.gz (1.7 MB)
|████████████████████████████████| 1.7 MB 44.1 MB/s
Installing build dependencies … done
Getting requirements to build wheel … done
Installing backend dependencies … done
Preparing wheel metadata … done
Downloading pyLDAvis-3.2.2.tar.gz (1.7 MB)
|████████████████████████████████| 1.7 MB 30.5 MB/s
Requirement already satisfied: numexpr in /usr/local/lib/python3.7/dist-packages (from pyLDAvis->pycaret) (2.7.3)
Requirement already satisfied: future in /usr/local/lib/python3.7/dist-packages (from pyLDAvis->pycaret) (0.16.0)
Collecting funcy
Downloading funcy-1.16-py2.py3-none-any.whl (32 kB)
Requirement already satisfied: numba>=0.35 in /usr/local/lib/python3.7/dist-packages (from pyod->pycaret) (0.51.2)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.7/dist-packages (from pyod->pycaret) (0.10.2)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.7/dist-packages (from numba>=0.35->pyod->pycaret) (0.34.0)
Requirement already satisfied: patsy>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from statsmodels->pyod->pycaret) (0.5.2)
Collecting pynndescent>=0.5
Downloading pynndescent-0.5.5.tar.gz (1.1 MB)
|████████████████████████████████| 1.1 MB 55.1 MB/s
Building wheels for collected packages: htmlmin, imagehash, alembic, databricks-cli, pyLDAvis, pyod, umap-learn, pynndescent
Building wheel for htmlmin (setup.py) … done
Created wheel for htmlmin: filename=htmlmin-0.1.12-py3-none-any.whl size=27098 sha256=6dff1694390dae41ea8bd3ca00f5564142023ea037fa606be0a8ffba9c16d1da
Stored in directory: /root/.cache/pip/wheels/70/e1/52/5b14d250ba868768823940c3229e9950d201a26d0bd3ee8655
Building wheel for imagehash (setup.py) … done
Created wheel for imagehash: filename=ImageHash-4.2.1-py2.py3-none-any.whl size=295207 sha256=9e38b104e77871b6f6a6a9267c3debd3ac85d39441acb3cda64d4dc07a11dd27
Stored in directory: /root/.cache/pip/wheels/4c/d5/59/5e3e297533ddb09407769762985d134135064c6831e29a914e
Building wheel for alembic (setup.py) … done
Created wheel for alembic: filename=alembic-1.4.1-py2.py3-none-any.whl size=158172 sha256=652d8b88b2468cf1d1c9f1c3242dda689e0f670bc6e5b88dc4dbf087fecbaccc
Stored in directory: /root/.cache/pip/wheels/be/5d/0a/9e13f53f4f5dfb67cd8d245bb7cdffe12f135846f491a283e3
Building wheel for databricks-cli (setup.py) … done
Created wheel for databricks-cli: filename=databricks_cli-0.16.2-py3-none-any.whl size=106811 sha256=9dbaaca3ece5f6a1522d676d8b1ec35c065a0bd0c564bd862ae3012984b70c9a
Stored in directory: /root/.cache/pip/wheels/f4/5c/ed/e1ce20a53095f63b27b4964abbad03e59cf3472822addf7d29
Building wheel for pyLDAvis (setup.py) … done
Created wheel for pyLDAvis: filename=pyLDAvis-3.2.2-py2.py3-none-any.whl size=135618 sha256=471d50a9a2e725465ffc7d32b21edb15c9b84dc0573891185a43e97f567aa0a7
Stored in directory: /root/.cache/pip/wheels/f8/b1/9b/560ac1931796b7303f7b517b949d2d31a4fbc512aad3b9f284
Building wheel for pyod (setup.py) … done
Created wheel for pyod: filename=pyod-0.9.5-py3-none-any.whl size=132699 sha256=d116f5b46155bf0fa31aa88cc21da0e3be461b448e9c9b2d599c763a5ef0a6a1
Stored in directory: /root/.cache/pip/wheels/3d/bb/b7/62b60fb451b33b0df1ab8006697fba7a6a49709a629055cf77
Building wheel for umap-learn (setup.py) … done
Created wheel for umap-learn: filename=umap_learn-0.5.2-py3-none-any.whl size=82709 sha256=7c48e34d2c19d333a623ed12491d3c7d07bafd52f2d35e474df56908f5cc7525
Stored in directory: /root/.cache/pip/wheels/84/1b/c6/aaf68a748122632967cef4dffef68224eb16798b6793257d82
Building wheel for pynndescent (setup.py) … done
Created wheel for pynndescent: filename=pynndescent-0.5.5-py3-none-any.whl size=52603 sha256=7abff97eebc36deea7220f1b5e9907020826a07404003a9c7d794fef4d396e87
Stored in directory: /root/.cache/pip/wheels/af/e9/33/04db1436df0757c42fda8ea6796d7a8586e23c85fac355f476
Successfully built htmlmin imagehash alembic databricks-cli pyLDAvis pyod umap-learn pynndescent
Installing collected packages: tangled-up-in-unicode, smmap, scipy, multimethod, joblib, websocket-client, visions, scikit-learn, requests, python-editor, Mako, imagehash, gitdb, querystring-parser, PyYAML, pynndescent, pydantic, prometheus-flask-exporter, phik, htmlmin, gunicorn, gitpython, funcy, docker, databricks-cli, alembic, umap-learn, scikit-plot, pyod, pyLDAvis, pandas-profiling, mlxtend, mlflow, lightgbm, kmodes, imbalanced-learn, Boruta, pycaret
Attempting uninstall: scipy
Found existing installation: scipy 1.4.1
Uninstalling scipy-1.4.1:
Successfully uninstalled scipy-1.4.1
Attempting uninstall: joblib
Found existing installation: joblib 1.1.0
Uninstalling joblib-1.1.0:
Successfully uninstalled joblib-1.1.0
Attempting uninstall: scikit-learn
Found existing installation: scikit-learn 1.0.1
Uninstalling scikit-learn-1.0.1:
Successfully uninstalled scikit-learn-1.0.1
Attempting uninstall: requests
Found existing installation: requests 2.23.0
Uninstalling requests-2.23.0:
Successfully uninstalled requests-2.23.0
Attempting uninstall: PyYAML
Found existing installation: PyYAML 3.13
Uninstalling PyYAML-3.13:
Successfully uninstalled PyYAML-3.13
Attempting uninstall: pandas-profiling
Found existing installation: pandas-profiling 1.4.1
Uninstalling pandas-profiling-1.4.1:
Successfully uninstalled pandas-profiling-1.4.1
Attempting uninstall: mlxtend
Found existing installation: mlxtend 0.14.0
Uninstalling mlxtend-0.14.0:
Successfully uninstalled mlxtend-0.14.0
Attempting uninstall: lightgbm
Found existing installation: lightgbm 2.2.3
Uninstalling lightgbm-2.2.3:
Successfully uninstalled lightgbm-2.2.3
Attempting uninstall: imbalanced-learn
Found existing installation: imbalanced-learn 0.8.1
Uninstalling imbalanced-learn-0.8.1:
Successfully uninstalled imbalanced-learn-0.8.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed Boruta-0.3 Mako-1.1.6 PyYAML-6.0 alembic-1.4.1 databricks-cli-0.16.2 docker-5.0.3 funcy-1.16 gitdb-4.0.9 gitpython-3.1.24 gunicorn-20.1.0 htmlmin-0.1.12 imagehash-4.2.1 imbalanced-learn-0.7.0 joblib-1.0.1 kmodes-0.11.1 lightgbm-3.3.1 mlflow-1.22.0 mlxtend-0.19.0 multimethod-1.6 pandas-profiling-3.1.0 phik-0.12.0 prometheus-flask-exporter-0.18.6 pyLDAvis-3.2.2 pycaret-2.3.5 pydantic-1.8.2 pynndescent-0.5.5 pyod-0.9.5 python-editor-1.0.4 querystring-parser-1.2.4 requests-2.26.0 scikit-learn-0.23.2 scikit-plot-0.3.7 scipy-1.5.4 smmap-5.0.0 tangled-up-in-unicode-0.1.0 umap-learn-0.5.2 visions-0.7.4 websocket-client-1.2.3

You can install pycaret as-is,
or pin a specific version, as in the sketch below.

  • First just skim to get the big picture; the AutoML walkthrough on GitHub should be done with pycaret 2.0.
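For example, installing the latest release versus pinning a version (the exact version number here is illustrative):

!pip install pycaret
!pip install pycaret==2.0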




When installing on Google Colab

After the install, be sure to do Runtime > Restart runtime (CTRL+M) once,

then follow the instructions below.

** If you hit an error, reset the runtime, import again, restart the runtime, and then proceed.

  • I don't know why; I can only tell you that this is what works.




Data Load

from pycaret.datasets import get_data
data = get_data("diamond")

pycaret_diamond_data




pycaret.regression

from pycaret.regression import *
reg_set = setup(data, target = 'Price', transform_target = True,
                log_experiment = True, experiment_name = 'diamond')
  • pycaret.regression : the setup configuration summary
  Description Value
0 session_id 2882
1 Target Price
2 Original Data (6000, 8)
3 Missing Values False
4 Numeric Features 1
5 Categorical Features 6
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (4199, 28)
10 Transformed Test Set (1801, 28)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 10
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment True
18 Experiment Name diamond
19 USI 116c
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize False
28 Normalize Method None
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers False
39 Outliers Threshold None
40 Remove Multicollinearity False
41 Multicollinearity Threshold None
42 Remove Perfect Collinearity True
43 Clustering False
44 Clustering Iteration None
45 Polynomial Features False
46 Polynomial Degree None
47 Trignometry Features False
48 Polynomial Threshold None
49 Group Features False
50 Feature Selection False
51 Feature Selection Method classic
52 Features Selection Threshold None
53 Feature Interaction False
54 Feature Ratio False
55 Interaction Threshold None
56 Transform Target True
57 Transform Target Method box-cox
  • If you want to know more, go study all of those options ^0^
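If you want to inspect what setup() actually produced, pycaret 2.x exposes a get_config helper; a minimal sketch, assuming the pycaret 2.x API:

from pycaret.regression import get_config

# Look at the transformed training split that setup() created
X_train = get_config('X_train')
print(X_train.shape)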




Building the Model

  • One line of code is all it takes to find the best model ㅠㅠ
best = compare_models()
Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
lightgbm Light Gradient Boosting Machine 637.8811 1.928277e+06 1367.4159 0.9813 0.0677 0.0491 0.120
et Extra Trees Regressor 748.9529 2.253684e+06 1478.3926 0.9782 0.0802 0.0594 1.199
rf Random Forest Regressor 742.9041 2.417200e+06 1528.6437 0.9765 0.0785 0.0579 1.090
gbr Gradient Boosting Regressor 764.6458 2.449865e+06 1544.3382 0.9762 0.0783 0.0583 0.288
dt Decision Tree Regressor 946.3401 3.350058e+06 1811.0705 0.9672 0.1034 0.0756 0.040
ada AdaBoost Regressor 1997.1826 1.710448e+07 4091.7565 0.8350 0.1895 0.1511 0.251
knn K Neighbors Regressor 3072.0318 3.642699e+07 6017.2046 0.6421 0.3636 0.2323 0.086
omp Orthogonal Matching Pursuit 3317.3424 8.643676e+07 9045.7885 0.1344 0.2823 0.2209 0.026
llar Lasso Least Angle Regression 6540.9142 1.144871e+08 10682.7674 -0.1241 0.7130 0.5636 0.281
lasso Lasso Regression 6540.9147 1.144871e+08 10682.7665 -0.1241 0.7130 0.5636 0.025
en Elastic Net 6540.9147 1.144871e+08 10682.7665 -0.1241 0.7130 0.5636 0.025
dummy Dummy Regressor 6540.9142 1.144871e+08 10682.7674 -0.1241 0.7130 0.5636 0.021
ridge Ridge Regression 3376.7759 4.409370e+08 17429.1601 -3.0382 0.2235 0.1734 0.026
br Bayesian Ridge 3464.5342 6.180348e+08 19047.2745 -4.5803 0.2244 0.1745 0.028
huber Huber Regressor 3490.0167 7.900161e+08 19860.5244 -6.0721 0.2254 0.1729 0.118
lr Linear Regression 3566.8112 8.908481e+08 21034.8582 -6.9766 0.2253 0.1755 0.309
par Passive Aggressive Regressor 8585.4060 5.154119e+10 94736.3961 -439.8984 0.2947 0.2745 0.031




Model Evaluation

  • Once the best model is identified, evaluation is also a single line of code ㅠㅠ love it
plot_model(best)

plot_model



plot_model(best, plot = "feature")

pycarat_plot_model_feature




Saving and Deploying the Model

  • MLOps concepts, REST API, Flask
finalize_best = finalize_model(best)

# save model
save_model(finalize_best, "diamond_pipeline")

Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None,
steps=[('dtypes',
DataTypes_Auto_infer(categorical_features=[],
display_types=True, features_todrop=[],
id_columns=[], ml_usecase='regression',
numerical_features=[], target='Price',
time_features=[])),
('imputer',
Simple_Imputer(categorical_strategy='not_available',
fill_value_categorical=None,
fill_value_numerical=None,
numeric_strategy='...
learning_rate=0.1,
max_depth=-1,
min_child_samples=20,
min_child_weight=0.001,
min_split_gain=0.0,
n_estimators=100,
n_jobs=-1,
num_leaves=31,
random_state=2882,
reg_alpha=0.0,
reg_lambda=0.0,
silent='warn',
subsample=1.0,
subsample_for_bin=200000,
subsample_freq=0),
silent='warn', subsample=1.0,
subsample_for_bin=200000,
subsample_freq=0)]],
verbose=False), 'diamond_pipeline.pkl')




MLOps

  • DevOps (development and operations used to be separate teams.)
  • With automation the two came to be handled together.
    MLOps dashboard

!pip install mlflow --quiet
!pip install pyngrok --quiet

import mlflow

with mlflow.start_run(run_name="MLflow on Colab"):
    mlflow.log_metric("m1", 2.0)
    mlflow.log_param("p1", "mlflow-colab")

# run tracking UI in the background
get_ipython().system_raw("mlflow ui --port 5000 &")

# create remote tunnel using ngrok.com to allow local port access
# borrowed from https://colab.research.google.com/github/alfozan/MLflow-GBRT-demo/blob/master/MLflow-GBRT-demo.ipynb#scrollTo=4h3bKHMYUIG6

from pyngrok import ngrok

# Terminate open tunnels if they exist
ngrok.kill()

# Setting the authtoken (optional)
# Get your authtoken from https://dashboard.ngrok.com/auth
NGROK_AUTH_TOKEN = ""
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPS tunnel on port 5000 for http://localhost:5000
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)
|████████████████████████████████| 745 kB 5.4 MB/s
Building wheel for pyngrok (setup.py) ... done
---------------------------------------
Exception Traceback (most recent call last)
in ()
4 import mlflow
5
----> 6 with mlflow.start_run(run_name="MLflow on Colab"):
7 mlflow.log_metric("m1", 2.0)
8 mlflow.log_param("p1", "mlflow-colab")

/usr/local/lib/python3.7/dist-packages/mlflow/tracking/fluent.py in start_run(run_id, experiment_id, run_name, nested, tags)

229 + "current run with mlflow.end_run(). To start a nested "

230 + "run, call start_run with nested=True"

--> 231 ).format(_active_run_stack[0].info.run_id)

232 )

233 client = MlflowClient()



Exception: Run with UUID 3cbca838cdd44eac8620700ac1929a64 is already active.

To start a new run, first end the current run with mlflow.end_run().

To start a nested run, call start_run with nested=True
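A minimal way out of this error, following the message's own advice: close the stale run before opening a new one.

import mlflow

# End any run left over from a previous cell, then start fresh
mlflow.end_run()
with mlflow.start_run(run_name="MLflow on Colab"):
    mlflow.log_metric("m1", 2.0)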


Deployment is the final step, but the catch is that it reportedly doesn't work on Google Colab.

I hope the day comes when I can pull it off on my own.

sklearn_mL_04_ModuleSelection(2.4)

Chapter 2 _ Machine Learning with scikit-learn (04)

파이썬 머신러닝 완벽 가이드
ref. & copyright(c) Book



Introduction to the Model Selection Module

  • scikit-learn : the most widely used machine learning library in Python

import sklearn

print(sklearn.__version__)

House_price prediction Practice 01

Kaggle housing price prediction

Kaggle house prices advanced regression




# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Downloading and Loading the Data

import pandas as pd

train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")

train.shape, test.shape
# We need to cut down the variables. Which ones should go?

EDA

  • Remove outliers and duplicates
  • OverallQual (rates the house's condition on a 1-10 scale)
  • A quality rating of 1 paired with a high sale price can be judged an outlier, so those rows should be removed.
train.info()
# 80 columns; apart from SalePrice (the target), everything else is a feature = far too many!
train.drop(train[(train['OverallQual'] < 5) & (train['SalePrice'] > 200000)].index, inplace = True)
train.reset_index(drop = True, inplace = True)
train.shape

Visualizing the Dependent Variable

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

(mu, sigma) = norm.fit(train['SalePrice'])
print("The value of mu before log transformation is:", mu)
print("The value of sigma before log transformation is:", sigma)

fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(train['SalePrice'], color="b", stat="probability")
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")

plt.axvline(mu, color='r', linestyle='--')
plt.text(mu + 10000, 0.11, 'Mean of SalePrice', rotation=0, color='r')
fig.show()
import numpy as np

train["SalePrice"] = np.log1p(train["SalePrice"]) # visualize the target again after the log transform

(mu, sigma) = norm.fit(train['SalePrice'])
print("The value of mu after log transformation is:", mu)
print("The value of sigma after log transformation is:", sigma)

fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(train['SalePrice'], color="b", stat="probability")
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")

plt.axvline(mu, color='r', linestyle='--')
plt.text(mu + 0.05, 0.111, 'Mean of SalePrice', rotation=0, color='r')
plt.ylim(0, 0.12)
fig.show()

Dropping Data Features

  1. Reduces model training time
  2. Reduces noise in the computation

So: we decided to drop the train ID.

train_ID = train['Id']
test_ID = test['Id']
train.drop(['Id'], axis=1, inplace=True)
test.drop(['Id'], axis=1, inplace=True)
train.shape, test.shape
# extract y; used when splitting the dataset
y = train['SalePrice'].reset_index(drop=True)

# once extracted, drop it from the original df
train = train.drop('SalePrice', axis = 1)
train.shape, test.shape, y.shape


# concatenate the data
# - so that train and test are preprocessed together

all_df = pd.concat([train, test]).reset_index(drop=True)
all_df.shape

Checking Missing Values

  • Handling missing values
    1. Drop them: remove a column, or remove specific rows
    2. Fill them: 1) numeric: fill with the mean or median
          2) string: fill with the mode

    3. Fill using statistical techniques (data interpolation)
      • In practice the right method (e.g. KNNImputer) depends on whether the data is a time series and on the industry, so learn it on the job; a small sketch follows below.
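A minimal KNNImputer sketch (toy array, not the house data), assuming scikit-learn 0.22+:

from sklearn.impute import KNNImputer
import numpy as np

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

# Each missing value is replaced by the mean of its 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))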
# check missing values

def check_na(data, head_num = 6):
    isnull_na = (data.isnull().sum() / len(data)) * 100
    data_na = isnull_na.drop(isnull_na[isnull_na == 0].index).sort_values(ascending=False)
    missing_data = pd.DataFrame({'Missing Ratio' : data_na,
                                 'Data Type': data.dtypes[data_na.index]})
    print("Columns with missing values and their ratios:\n", missing_data.head(head_num))

check_na(all_df, 20)
all_df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage'], axis=1, inplace=True)
check_na(all_df)

# There are still plenty of missing values.

Filling

1. Filling string columns
  1. Extract the object columns
#a = all_df['BsmtCond'].value_counts().mode() # mode(): the most frequent value
#a

print(all_df['BsmtCond'].value_counts())
print()
print(all_df['BsmtCond'].mode()[0])

# object columns and how many there are
import numpy as np
cat_all_vars = train.select_dtypes(exclude=[np.number]) # names of the non-numeric columns
print("The whole number of all_vars (string data):", len(list(cat_all_vars)))

# collect the column names
final_cat_vars = []
for v in cat_all_vars:
    if v not in ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage']:
        final_cat_vars.append(v)

print("The whole number of final_cat_vars", len(final_cat_vars))

# fill each column with its mode
for i in final_cat_vars:
    all_df[i] = all_df[i].fillna(all_df[i].mode()[0])

check_na(all_df, 20)
print("Only the numeric columns still have missing values.")
import numpy as np
num_all_vars = list(train.select_dtypes(include=[np.number]))
print("The whole number of all_vars", len(num_all_vars))

num_all_vars.remove('LotFrontage')

print("The whole number of final_cat_vars", len(num_all_vars))
for i in num_all_vars:
    all_df[i].fillna(value=all_df[i].median(), inplace=True)

print("No missing values remain.")
check_na(all_df, 20)
all_df.info()

Handling Skewness: adjust so the data approximates a normal distribution (survey-paper statistics usually require -1 < skewness < 1)

  • We'll use the Box-Cox transformation

  • Skewness, positive or negative (how far the distribution leans to one side)

  • Kurtosis, positive or negative (how peaked the distribution is)

  • The adjustment is done to push RMSE down; a small before/after sketch follows below.
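A quick self-contained demonstration of how a log transform pulls skewness toward 0 (synthetic data, not the house prices):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)              # heavily right-skewed sample

print("skew before:", skew(x))            # well above 1
print("skew after :", skew(np.log1p(x)))  # much closer to 0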

from scipy.stats import skew

# find the columns that would be judged skewed
def find_skew(x):
    return skew(x)

# num_all_vars: the numeric columns extracted earlier
# apply(find_skew) applies the user-defined function; sort in descending order
skewness_features = all_df[num_all_vars].apply(find_skew).sort_values(ascending=False)
skewness_features

#high_skew = skewness_features[skewness_features > 1]

# Values within about -1..1 are acceptable; anything outside needs adjusting (to approximate a normal distribution)
# 1. Box-Cox transform : ML -> RMSE (2.5)
# 2. Log transform     : ML -> RMSE (2.1)
# => Since lower RMSE is better, the log transform is the better choice here.
skewness_index = list(skewness_features.index)
skewness_index.remove('LotArea')
# drop LotArea, whose skewness is far too extreme for the plot
all_numeric_df = all_df.loc[:, skewness_index]


fig, ax = plt.subplots(figsize=(10, 6))
ax.set_xlim(0, all_numeric_df.max().sort_values(ascending=False)[0])
ax = sns.boxplot(data=all_numeric_df[skewness_index], orient="h", palette="Set1")
ax.xaxis.grid(False)
ax.set(ylabel="Feature names")
ax.set(xlabel="Numeric values")
ax.set(title="Numeric Distribution of Features Before Box-Cox Transformation")
sns.despine(trim=True, left=True)
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

high_skew = skewness_features[skewness_features > 1]
high_skew_index = high_skew.index

print("The data before Box-Cox Transformation: \n", all_df[high_skew_index].head())

for num_var in high_skew_index:
    all_df[num_var] = boxcox1p(all_df[num_var], boxcox_normmax(all_df[num_var] + 1))

print("The data after Box-Cox Transformation: \n", all_df[high_skew_index].head())
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_xscale('log')
ax = sns.boxplot(data=all_df[high_skew_index], orient="h", palette="Set1")
ax.xaxis.grid(False)
ax.set(ylabel="Feature names")
ax.set(xlabel="Numeric values")
ax.set(title="Numeric Distribution of Features After Box-Cox Transformation")
sns.despine(trim=True, left=True)

Derived Variables

A key step of feature engineering

  • Don't keep sales volume, unit price, and revenue all at once
  • sales volume × unit price = revenue (a new value): a derived variable (see the sketch after this list)
    • ML is ultimately arithmetic, so extra derived variables mean extra computation.
    • Training takes longer.
    • Conclusion: it's better to reduce the number of variables.
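A tiny sketch of the volume × price example above (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({'volume': [10, 20], 'unit_price': [5.0, 4.0]})

# Derive revenue, then drop the inputs so the model sees one variable, not three
df['revenue'] = df['volume'] * df['unit_price']
df = df.drop(['volume', 'unit_price'], axis=1)
print(df)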
# derive a total-area variable by summing the floor areas
all_df['TotalSF'] = all_df['TotalBsmtSF'] + all_df['1stFlrSF'] + all_df['2ndFlrSF']
all_df = all_df.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1)
print(all_df.shape)
all_df['Total_Bathrooms'] = (all_df['FullBath'] + (0.5 * all_df['HalfBath']) + all_df['BsmtFullBath'] + (0.5 * all_df['BsmtHalfBath']))
all_df['Total_porch_sf'] = (all_df['OpenPorchSF'] + all_df['3SsnPorch'] + all_df['EnclosedPorch'] + all_df['ScreenPorch'])
all_df = all_df.drop(['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath', 'OpenPorchSF', '3SsnPorch', 'EnclosedPorch', 'ScreenPorch'], axis=1)
print(all_df.shape)
  • So read the data dictionary first: check data_description.txt before anything else!!! (In practice one often doesn't exist.)
  • Visualization: endless work per feature, plus studying the domain.
# year-related columns
num_all_vars = list(train.select_dtypes(include=[np.number]))
year_feature = []
for var in num_all_vars:
    if 'Yr' in var:
        year_feature.append(var)
    elif 'Year' in var:
        year_feature.append(var)
    else:
        print(var, "is not related with Year")
print(year_feature)
fig, ax = plt.subplots(3, 1, figsize=(10, 6), sharex=True, sharey=True)
for i, var in enumerate(year_feature):
    if var != 'YrSold':
        ax[i].scatter(train[var], y, alpha=0.3)
        ax[i].set_title('{}'.format(var), size=15)
        ax[i].set_ylabel('SalePrice', size=15, labelpad=12.5)
plt.tight_layout()
plt.show()

Kgg_House_years



all_df = all_df.drop(['YearBuilt', 'GarageYrBlt'], axis=1)
print(all_df.shape)
# how long since the remodel, relative to when the house sold
YearsSinceRemodel = train['YrSold'].astype(int) - train['YearRemodAdd'].astype(int)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(YearsSinceRemodel, y, alpha=0.3)
fig.show()

Kgg_House_YearsSinceRemodel



all_df['YearsSinceRemodel'] = all_df['YrSold'].astype(int) - all_df['YearRemodAdd'].astype(int)
all_df = all_df.drop(['YrSold', 'YearRemodAdd'], axis=1)
print(all_df.shape)

Dummy Variables

String data (non-numeric)

  • Nominal: male student, female student, ...
  • Ordinal: grade 1, grade 2, grade 3 (can be converted to weights, grade numbers, and so on)

Kgg_House_StringF



  • Customizing this in detail works better than a blanket rule.
  • Here the nominal series would call for 17 separate models, presented through a single visualization dashboard so they look like one model.
all_df['PoolArea'].value_counts()
# values are 0 or assorted nonzero areas, so...
# map to 0 and 1
def count_dummy(x):
    if x > 0:
        return 1
    else:
        return 0
all_df['PoolArea'] = all_df['PoolArea'].apply(count_dummy)
all_df['PoolArea'].value_counts()

# has almost no effect on the overall trend
all_df['GarageArea'] = all_df['GarageArea'].apply(count_dummy)
all_df['GarageArea'].value_counts()
all_df['Fireplaces'] = all_df['Fireplaces'].apply(count_dummy)
all_df['Fireplaces'].value_counts()



Label Encoding, Ordinal Encoding, One-Hot Encoding


  • Label Encoding : should be used only on the dependent variable
  • Ordinal Encoding : should be used only on the independent variables
  • The restriction differs, but the underlying idea is the same.
  • One-Hot Encoding : expands each category into its own 0/1 column (see pd.get_dummies below)
from sklearn.preprocessing import LabelEncoder
import pandas as pd

temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

encoder = LabelEncoder()
encoder.fit(temp['Food_Name'])
labels = encoder.transform(temp['Food_Name'])
print(list(temp['Food_Name']), "==>", labels)
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

encoder = OrdinalEncoder()
labels = encoder.fit_transform(temp[['Food_Name']])
print(list(temp['Food_Name']), "==>", labels.tolist())


# import pandas as pd
# temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
#                      'Calories': [95, 231, 50]})

# temp[['Food_No']] = temp.Food_Name.replace(['Chicken', 'Broccoli', 'Apple'], [1, 2, 3])

# print(temp[['Food_Name', 'Food_No']])

# ValueError: Columns must be same length as key
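The ValueError above comes from assigning a Series to a double-bracket (list-of-columns) indexer; single brackets fix it. A working sketch:

import pandas as pd

temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

# Single brackets create one new column from the replaced Series
temp['Food_No'] = temp['Food_Name'].replace(['Chicken', 'Broccoli', 'Apple'], [1, 2, 3])
print(temp[['Food_Name', 'Food_No']])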

import pandas as pd

temp = pd.DataFrame({'Food_Name': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

temp = pd.get_dummies(temp)
print(temp)
print(temp.shape)
all_df = pd.get_dummies(all_df).reset_index(drop=True)
all_df.shape

Training and Evaluating ML Models

Dataset Split and Cross-Validation


X = all_df.iloc[:len(y), :]
X_test = all_df.iloc[len(y):, :]
X.shape, y.shape, X_test.shape
from sklearn.model_selection import train_test_split
# note: this validation X_test shadows the Kaggle test set created above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape



Evaluation Metrics

MAE


import numpy as np

def mean_absolute_error(y_true, y_pred):

    error = 0
    for yt, yp in zip(y_true, y_pred):
        error = error + np.abs(yt - yp)

    mae = error / len(y_true)
    return mae
MSE

import numpy as np

def mean_squared_error(y_true, y_pred):

    error = 0
    for yt, yp in zip(y_true, y_pred):
        error = error + (yt - yp) ** 2

    mse = error / len(y_true)
    return mse



RMSE


import numpy as np

def root_rmse_squared_error(y_true, y_pred):
    error = 0

    for yt, yp in zip(y_true, y_pred):
        error = error + (yt - yp) ** 2

    mse = error / len(y_true)
    rmse = np.round(np.sqrt(mse), 3)
    return rmse

Test1

y_true = [400, 300, 800]
y_pred = [380, 320, 777]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", root_rmse_squared_error(y_true, y_pred))

Test2

y_true = [400, 300, 800, 900]
y_pred = [380, 320, 777, 600]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", root_rmse_squared_error(y_true, y_pred))



RMSE with Sklearn


from sklearn.metrics import mean_squared_error

# RMSE on the log1p-transformed target, hence the name rmsle
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
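As an aside, recent scikit-learn versions (0.22+) can return the RMSE directly via the squared flag, so the sqrt wrapper is optional:

from sklearn.metrics import mean_squared_error
import numpy as np

y_true = [400, 300, 800]
y_pred = [380, 320, 777]

# squared=False makes sklearn return the root of the MSE
print(mean_squared_error(y_true, y_pred, squared=False))
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # same value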



Defining the Model and Evaluating with Cross-Validation


from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

def cv_rmse(model, n_folds=5):
    cv = KFold(n_splits=n_folds, random_state=42, shuffle=True)
    rmse_list = np.sqrt(-cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv))
    print('CV RMSE value list:', np.round(rmse_list, 4))
    print('CV RMSE mean value:', np.round(np.mean(rmse_list), 4))
    return rmse_list

n_folds = 5
rmse_scores = {}
lr_model = LinearRegression()
score = cv_rmse(lr_model, n_folds)
print("linear regression - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['linear regression'] = (score.mean(), score.std())

Submitting the First Final Predictions

from sklearn.model_selection import cross_val_predict

X = all_df.iloc[:len(y), :]
X_test = all_df.iloc[len(y):, :]
X.shape, y.shape, X_test.shape

lr_model_fit = lr_model.fit(X, y)
final_preds = np.floor(np.expm1(lr_model_fit.predict(X_test)))
print(final_preds)
submission = pd.read_csv("sample_submission.csv")
submission.iloc[:,1] = final_preds
print(submission.head())
submission.to_csv("The_first_regression.csv", index=False)

Adding More Model Algorithms

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Linear Regression
lr_model = LinearRegression()

# Decision Tree
tree_model = DecisionTreeRegressor()

# Random Forest Regressor
rf_model = RandomForestRegressor()

# Gradient Boosting Regressor
gbr_model = GradientBoostingRegressor()
score = cv_rmse(lr_model, n_folds)
print("linear regression - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['linear regression'] = (score.mean(), score.std())
score = cv_rmse(tree_model, n_folds)
print("Decision Tree Regressor - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['Decision Tree Regressor'] = (score.mean(), score.std())
score = cv_rmse(rf_model, n_folds)
print("RandomForest Regressor - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['RandomForest Regressor'] = (score.mean(), score.std())
score = cv_rmse(gbr_model, n_folds)
print("Gradient Boosting Regressor - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['Gradient Boosting Regressor'] = (score.mean(), score.std())
fig, ax = plt.subplots(figsize=(10, 6))

ax = sns.pointplot(x=list(rmse_scores.keys()), y=[score for score, _ in rmse_scores.values()], markers=['o'], linestyles=['-'], ax=ax)
for i, score in enumerate(rmse_scores.values()):
    ax.text(i, score[0] + 0.002, '{:.6f}'.format(score[0]), horizontalalignment='left', size='large', color='black', weight='semibold')

ax.set_ylabel('Score (RMSE)', size=20, labelpad=12.5)
ax.set_xlabel('Model', size=20, labelpad=12.5)
ax.tick_params(axis='x', labelsize=13.5, rotation=10)
ax.tick_params(axis='y', labelsize=12.5)
ax.set_ylim(0, 0.25)
ax.set_title('Rmse Scores of Models without Blended_Predictions', size=20)

fig.show()
  • Lower RMSE is better: it marks the model whose predictions fit best.
lr_model_fit = lr_model.fit(X, y)
tree_model_fit = tree_model.fit(X, y)
rf_model_fit = rf_model.fit(X, y)
gbr_model_fit = gbr_model.fit(X, y)

def blended_learning_predictions(X):
    blended_score = (0.3 * lr_model_fit.predict(X)) + \
                    (0.1 * tree_model_fit.predict(X)) + \
                    (0.3 * gbr_model_fit.predict(X)) + \
                    (0.3 * rf_model_fit.predict(X))
    return blended_score
blended_score = rmsle(y, blended_learning_predictions(X))
rmse_scores['blended'] = (blended_score, 0)
print('RMSLE score on train data:')
print(blended_score)
fig, ax = plt.subplots(figsize=(10, 6))

ax = sns.pointplot(x=list(rmse_scores.keys()), y=[score for score, _ in rmse_scores.values()], markers=['o'], linestyles=['-'], ax=ax)
for i, score in enumerate(rmse_scores.values()):
    ax.text(i, score[0] + 0.002, '{:.6f}'.format(score[0]), horizontalalignment='left', size='large', color='black', weight='semibold')

ax.set_ylabel('Score (RMSE)', size=20, labelpad=12.5)
ax.set_xlabel('Model', size=20, labelpad=12.5)
ax.tick_params(axis='x', labelsize=13.5, rotation=10)
ax.tick_params(axis='y', labelsize=12.5)
ax.set_ylim(0, 0.25)

ax.set_title('Rmse Scores of Models with Blended_Predictions', size=20)

fig.show()
submission.iloc[:,1] = np.floor(np.expm1(blended_learning_predictions(X_test)))
submission.to_csv("The_second_regression.csv", index=False)

submission_House

Regression in Python (01)

Chapter 5 _ 파이썬 머신러닝 완벽 가이드

ref. & copyright(c) Book



Regression

  • Regression: a technique that models the relationship between several independent variables and one dependent variable
  • Regression coefficients: the coefficients that determine the influence of each independent variable; they correspond to the slopes of the linear regression equation

  • From the machine learning point of view

    • independent variables: features
    • dependent variable: target values

=> The goal is to learn the optimal **regression coefficients** from the given feature and target data.



✌ Two types of supervised learning

  1. CLASSIFICATION - when the target is a category (discrete values)
  2. REGRESSION - when the target is a number (continuous values)

⚡ Types of regression

  1. By the number of independent variables - simple regression - multiple regression
  2. By how the regression coefficients combine - linear regression: fits the straight regression line that minimizes the squared error between actual and predicted values - nonlinear regression
    • Regularization: adjusting the regression coefficients (applying a penalty) to counter ordinary linear regression's overfitting
  • Ordinary linear regression: optimizes the coefficients to minimize the RSS between predictions and actual values (no regularization)
  • Ridge: linear regression + L2 regularization
    • L2: shrinks coefficient values so relatively large coefficients have less influence on the prediction
  • Lasso: linear regression + L1 regularization (can shrink coefficients all the way to 0)
  • ElasticNet: linear regression + a mix of L1 and L2 regularization
  • Logistic Regression: despite the name, a classification algorithm (a sketch contrasting the regularized models follows this list)
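A small sketch contrasting the three on synthetic data (the alpha values are illustrative): notice how Lasso zeroes out the irrelevant coefficients.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(scale=0.1, size=100)

# Regularization shrinks coefficients; L1 (Lasso) can push them to exactly 0
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))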

Ref. scikit-learn

The regression models covered in the book

Worth a look if you want to move on to deep learning.



Generating simulation data: y = 4x + 6 + error

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(123) # reproducibility

X = 2 * np.random.rand(100, 1) # 100 random values
y = 4 * X + 6 + np.random.rand(100, 1)

plt.scatter(X, y)

Scatter plot

X.shape, y.shape
((100, 1), (100, 1))

Finding the Optimal Slope with Gradient Descent

# return w1_update and w0_update, the updates for w1 and w0

def get_weight_updates(w1, w0, X, y, learning_rate=0.01):
    N = len(y)

    # initialize w1_update and w0_update
    w1_update = np.zeros_like(w1)
    w0_update = np.zeros_like(w0)

    # compute the prediction array and the difference between predicted and actual values
    y_pred = np.dot(X, w1.T) + w0
    diff = y - y_pred # actual - predicted == error

    # matrix of ones so w0_update can be computed with a dot product
    w0_factors = np.ones((N, 1))

    # compute w1_update and w0_update for updating w1 and w0
    w1_update = -(2/N) * learning_rate * (np.dot(X.T, diff))
    w0_update = -(2/N) * learning_rate * (np.dot(w0_factors.T, diff))

    return w1_update, w0_update
w0 = np.zeros((1, 1))
w1 = np.zeros((1, 1))

y_pred = np.dot(X, w1.T) + w0
diff = y-y_pred
print(diff.shape)

w0_factors = np.ones((100, 1))
w1_update = -(2/100) * 0.01 * (np.dot(X.T, diff))
w0_update = -(2/100) * 0.01 * (np.dot(w0_factors.T, diff))

print(w1_update.shape, w0_update.shape)
print(w1, w0)
(100, 1)
(1, 1) (1, 1)
[[0.]] [[0.]]
# loop over the weight updates

def gradient_descent_steps(X, y, iters = 100000):

    # initialize w0 and w1 to 0
    w0 = np.zeros((1, 1))
    w1 = np.zeros((1, 1))

    # repeat iters times # get_weight_updates
    for ind in range(iters):
        w1_update, w0_update = get_weight_updates(w1, w0, X, y, learning_rate=0.01)
        w1 = w1 - w1_update
        w0 = w0 - w0_update

    return w1, w0
  • Create a cost function for the prediction error and run gradient descent
def get_cost(y, y_pred):
    N = len(y)

    cost = np.sum(np.square(y - y_pred)) / N
    return cost

w1, w0 = gradient_descent_steps(X, y, iters = 100000)
print("w1:{0:.4f}, w0:{1:.4f}".format(w1[0, 0], w0[0, 0]))

y_pred = w1[0,0] * X + w0
print("Total Cost:{0:.4f}".format(get_cost(y, y_pred)))
w1:3.9462, w0:6.5590
Total Cost:0.0803
plt.scatter(X, y)
plt.plot(X, y_pred, color = "r")

Scatter plot with fitted line

import pandas as pd


bostonDF = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv")
bostonDF.head()

EDA

  • The dependent variable is the anchor: the y value, medv
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize = (16, 8), ncols = 4, nrows = 2)
lm_features = ["rm", "zn", "indus", "nox", "age", "ptratio", "lstat", "rad"]

for i, feature in enumerate(lm_features):
    row = int(i/4)
    col = i%4
    print("row is {}, col is {}".format(row, col))
    sns.regplot(x = feature, y = "medv", data = bostonDF, ax = ax[row][col])

Multi Graphs

  • A scatter plot or regression fit works for two continuous variables.
  • For a nominal x, use a box plot (x: nominal, y: medv); a quick sketch follows below.
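A quick box-plot sketch for the one nominal feature in this dataset (chas is a 0/1 variable):

import seaborn as sns
import matplotlib.pyplot as plt

# One box of medv per chas category
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x='chas', y='medv', data=bostonDF, ax=ax)
plt.show()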

Coefficients of a linear fit on all features, sorted by value (the smallest ones are dropped in the next cell):

rm 3.4
chas 3.0
rad 0.4
zn 0.1
b 0.0
tax -0.0
age 0.0
indus 0.0
crim -0.1
lstat -0.6
ptratio -0.9
dis -1.7
nox -19.8


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression # model

y_target = bostonDF["medv"] # dependent variable, Y
X_data = bostonDF.drop(['medv', 'rad', 'zn', 'b', 'tax', 'age', 'indus', 'crim', 'lstat'], axis = 1, inplace = False) # independent variables

y_target.shape, X_data.shape
((506,), (506, 5))

Splitting the Dataset

  • Prediction and simulation: predict & simulate on held-out (hypothetical) data
  • Compare the predicted values against the actual values
# random sampling

X_train, X_test, y_train, y_test = train_test_split(X_data, y_target, test_size = 0.3, random_state=156)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((354, 5), (152, 5), (354,), (152,))

Building the ML Model

lr = LinearRegression()
lr.fit(X_train, y_train)


y_preds = lr.predict(X_test)
y_preds
array([26.78074859, 16.40377991, 34.38443472, 19.13328473, 32.89690238,
       19.25298249, 28.32071818, 22.76654888,  9.87108567, 14.66339227,
       21.55844556, 17.27788854, 28.55574467, 38.50512646, 23.60848806,
       24.03347202, 23.82317119, 15.9119451 , 28.65132167, 20.98388455,
       20.29188703, 18.37003455, 18.58675839, 14.89143225, 35.24799305,
        7.70600921, 19.39133905, 15.97963635, 16.90296718, 15.484303  ,
       29.67753869, 17.58268684, 16.91992352, 22.47407959, 16.57706526,
       18.5381101 , 13.34337954, 24.11893098, 15.48185399, 24.3234222 ,
       36.24776797, 19.60882283, 20.95016211,  6.85667164, 20.32077896,
       23.05614583, 24.65371876, 35.25609168, 22.32959594, 25.96437918,
       27.29101785, 43.32992941, 41.76994078, 19.34288261, 24.8690423 ,
       25.99270875, 20.76285715, 33.13792328, 25.00439224, 16.82906893,
       22.80895172, 23.72489982, 24.53360315, 11.82722067, 17.55728132,
       37.43371362, 33.37256916, 25.65966256, 20.90725715, 21.09529467,
       15.22097444, 30.6234335 , 37.42143489, 26.22092177, 16.71532104,
       32.62735407, 23.41004013, 23.86575538, 18.75430877, 15.9914079 ,
       30.87778491, 16.04423898, 19.01496945, 20.04269634, 28.30832805,
       15.1948795 , 30.47430322, 33.93480059, 23.87721263, 29.7167635 ,
       29.85142798, 19.10737457, 28.49523963, 27.69846662, 25.49534489,
       24.59255802, 12.34870184, 26.65951587, 31.26197918, 17.86101862,
       27.3059424 , 18.18058484, 15.67184217, 13.17304165, 17.91281425,
       23.48894551, 24.53921273, 28.14530028, 16.05340908, 24.22120622,
       21.94517346, 26.62930956, 11.39298015, 18.53099857, 22.75407122,
       33.6679728 , 23.35342973, 20.85267956, 19.69347759, 28.12264641,
       28.56541499, 17.91759633, 27.83520695, 33.8011824 , 21.75436813,
       26.6360736 , 14.70682076, 19.99114889, 21.81029849, 31.72247354,
       21.33041025, 23.52438417, 35.55842163, 20.54294729, 38.34696416,
       19.25750865, 17.07595035, 18.31764392, 17.66658651, 23.12171447,
       19.58446231, 19.90774119, 14.84809066, 19.50652744, 38.83812958,
       15.26095952, 28.56874885, 17.62298514, 22.46794555, 23.28435884,
       18.8439135 , 31.16286012])

Model Evaluation

from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_preds)

print("MSE: {0:.3f}".format(mse))
MSE: 21.369

y = intercept + (rm coefficient × rm value) + ... (one term per feature)

import numpy as np

print("Intercept:", lr.intercept_)
print("Coefficients:", np.round(lr.coef_, 1))

coeff_df = pd.Series(data=np.round(lr.coef_, 1), index = X_data.columns)
coeff_df.sort_values(ascending=False)
Intercept: 26.830373506191982
Coefficients [  4.3 -33.1   6.5  -1.1  -1.2]





rm          6.5
chas        4.3
dis        -1.1
ptratio    -1.2
nox       -33.1
dtype: float64

A Useful Feature We Haven't Covered Yet

!pip install pycaret
Collecting pycaret
  Downloading pycaret-2.3.5-py3-none-any.whl (288 kB)
     |████████████████████████████████| 288 kB 32.5 MB/s 
[?25hRequirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.1.0)
Requirement already satisfied: textblob in /usr/local/lib/python3.7/dist-packages (from pycaret) (0.15.3)
Collecting pandas-profiling>=2.8.0
  Downloading pandas_profiling-3.1.0-py2.py3-none-any.whl (261 kB)
     |████████████████████████████████| 261 kB 53.7 MB/s 
[?25hCollecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
     |████████████████████████████████| 1.7 MB 42.3 MB/s 
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
Requirement already satisfied: spacy<2.4.0 in /usr/local/lib/python3.7/dist-packages (from pycaret) (2.2.4)
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
     |████████████████████████████████| 6.8 MB 58.0 MB/s 
[?25hRequirement already satisfied: ipywidgets in /usr/local/lib/python3.7/dist-packages (from pycaret) (7.6.5)
Requirement already satisfied: cufflinks>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from pycaret) (0.17.3)
Collecting scikit-plot
  Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Requirement already satisfied: yellowbrick>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.3.post1)
Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (from pycaret) (3.2.5)
Collecting umap-learn
  Downloading umap-learn-0.5.2.tar.gz (86 kB)
     |████████████████████████████████| 86 kB 6.0 MB/s 
[?25hCollecting Boruta
  Downloading Boruta-0.3-py3-none-any.whl (56 kB)
     |████████████████████████████████| 56 kB 4.7 MB/s 
[?25hRequirement already satisfied: wordcloud in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.5.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (from pycaret) (0.11.2)
Requirement already satisfied: IPython in /usr/local/lib/python3.7/dist-packages (from pycaret) (5.5.0)
Requirement already satisfied: gensim<4.0.0 in /usr/local/lib/python3.7/dist-packages (from pycaret) (3.6.0)
Collecting lightgbm>=2.3.1
  Downloading lightgbm-3.3.1-py3-none-manylinux1_x86_64.whl (2.0 MB)
     |████████████████████████████████| 2.0 MB 47.4 MB/s 
[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.1.5)
Collecting mlxtend>=0.17.0
  Downloading mlxtend-0.19.0-py2.py3-none-any.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 60.5 MB/s 
[?25hRequirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from pycaret) (3.2.2)
Requirement already satisfied: numpy==1.19.5 in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.19.5)
Collecting pyod
  Downloading pyod-0.9.5.tar.gz (113 kB)
     |████████████████████████████████| 113 kB 58.7 MB/s 
[?25hRequirement already satisfied: plotly>=4.4.1 in /usr/local/lib/python3.7/dist-packages (from pycaret) (4.4.1)
Collecting mlflow
  Downloading mlflow-1.22.0-py3-none-any.whl (15.5 MB)
     |████████████████████████████████| 15.5 MB 50.3 MB/s 
[?25hRequirement already satisfied: scipy<=1.5.4 in /usr/local/lib/python3.7/dist-packages (from pycaret) (1.4.1)
Collecting imbalanced-learn==0.7.0
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
     |████████████████████████████████| 167 kB 62.1 MB/s 
[?25hCollecting kmodes>=0.10.1
  Downloading kmodes-0.11.1-py2.py3-none-any.whl (19 kB)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.23.2->pycaret) (3.0.0)
Requirement already satisfied: colorlover>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from cufflinks>=0.17.0->pycaret) (0.3.0)
Requirement already satisfied: setuptools>=34.4.1 in /usr/local/lib/python3.7/dist-packages (from cufflinks>=0.17.0->pycaret) (57.4.0)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.7/dist-packages (from cufflinks>=0.17.0->pycaret) (1.15.0)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim<4.0.0->pycaret) (5.2.1)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (5.1.1)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (0.8.1)
Requirement already satisfied: pygments in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (2.6.1)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (1.0.18)
Requirement already satisfied: decorator in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (4.4.2)
Requirement already satisfied: pexpect in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (4.8.0)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.7/dist-packages (from IPython->pycaret) (0.7.5)
Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (1.0.2)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (3.5.2)
Requirement already satisfied: ipykernel>=4.5.1 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (4.10.1)
Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (5.1.3)
Requirement already satisfied: ipython-genutils~=0.2.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->pycaret) (0.2.0)
Requirement already satisfied: tornado>=4.0 in /usr/local/lib/python3.7/dist-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (5.1.1)
Requirement already satisfied: jupyter-client in /usr/local/lib/python3.7/dist-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (5.3.5)
Requirement already satisfied: wheel in /usr/local/lib/python3.7/dist-packages (from lightgbm>=2.3.1->pycaret) (0.37.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (1.3.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->pycaret) (3.0.6)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.7/dist-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (2.6.0)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.7/dist-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (4.9.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->pycaret) (2018.9)
Requirement already satisfied: tqdm>=4.48.2 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (4.62.3)
Collecting tangled-up-in-unicode==0.1.0
  Downloading tangled_up_in_unicode-0.1.0-py3-none-any.whl (3.1 MB)
     |████████████████████████████████| 3.1 MB 47.3 MB/s 
[?25hRequirement already satisfied: markupsafe~=2.0.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (2.0.1)
Collecting pydantic>=1.8.1
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
     |████████████████████████████████| 10.1 MB 37.6 MB/s 
[?25hCollecting htmlmin>=0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
Collecting multimethod>=1.4
  Downloading multimethod-1.6-py3-none-any.whl (9.4 kB)
Collecting PyYAML>=5.0.0
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 40.0 MB/s 
[?25hCollecting phik>=0.11.1
  Downloading phik-0.12.0-cp37-cp37m-manylinux2010_x86_64.whl (675 kB)
     |████████████████████████████████| 675 kB 62.8 MB/s 
[?25hRequirement already satisfied: jinja2>=2.11.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (2.11.3)
Collecting visions[type_image_path]==0.7.4
  Downloading visions-0.7.4-py3-none-any.whl (102 kB)
     |████████████████████████████████| 102 kB 12.8 MB/s 
[?25hCollecting joblib
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
     |████████████████████████████████| 303 kB 71.5 MB/s 
[?25hCollecting requests>=2.24.0
  Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 995 kB/s 
[?25hRequirement already satisfied: missingno>=0.4.2 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling>=2.8.0->pycaret) (0.5.0)
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (21.2.0)
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (2.6.3)
Collecting imagehash
  Downloading ImageHash-4.2.1.tar.gz (812 kB)
     |████████████████████████████████| 812 kB 49.7 MB/s 
[?25hRequirement already satisfied: Pillow in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (7.1.2)
Collecting scipy<=1.5.4
  Downloading scipy-1.5.4-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
     |████████████████████████████████| 25.9 MB 1.6 MB/s 
[?25hRequirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.7/dist-packages (from plotly>=4.4.1->pycaret) (1.3.3)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->IPython->pycaret) (0.2.5)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from pydantic>=1.8.1->pandas-profiling>=2.8.0->pycaret) (3.10.0.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (2.0.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret) (1.24.3)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (0.4.1)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.0.6)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (3.0.6)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.1.3)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (7.4.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (2.0.6)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (0.8.2)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.0.0)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<2.4.0->pycaret) (1.0.5)
Requirement already satisfied: importlib-metadata>=0.20 in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy<2.4.0->pycaret) (4.8.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<2.4.0->pycaret) (3.6.0)
Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.7/dist-packages (from widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.3.1)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.6.1)
Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.12.1)
Requirement already satisfied: Send2Trash in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.8.0)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.7/dist-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets->pycaret) (22.3.0)
Requirement already satisfied: ptyprocess in /usr/local/lib/python3.7/dist-packages (from terminado>=0.8.1->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.7.0)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.7/dist-packages (from imagehash->visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret) (1.2.0)
Collecting docker>=4.0.0
  Downloading docker-5.0.3-py2.py3-none-any.whl (146 kB)
     |████████████████████████████████| 146 kB 70.7 MB/s 
[?25hRequirement already satisfied: entrypoints in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (0.3)
Requirement already satisfied: protobuf>=3.7.0 in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (3.17.3)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (1.3.0)
Collecting databricks-cli>=0.8.7
  Downloading databricks-cli-0.16.2.tar.gz (58 kB)
     |████████████████████████████████| 58 kB 5.9 MB/s 
[?25hRequirement already satisfied: click>=7.0 in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (7.1.2)
Collecting querystring-parser
  Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Collecting alembic<=1.4.1
  Downloading alembic-1.4.1.tar.gz (1.1 MB)
     |████████████████████████████████| 1.1 MB 59.1 MB/s 
[?25hCollecting prometheus-flask-exporter
  Downloading prometheus_flask_exporter-0.18.6-py3-none-any.whl (17 kB)
Requirement already satisfied: sqlalchemy in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (1.4.27)
Requirement already satisfied: Flask in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (1.1.4)
Collecting gitpython>=2.1.0
  Downloading GitPython-3.1.24-py3-none-any.whl (180 kB)
     |████████████████████████████████| 180 kB 58.3 MB/s 
[?25hCollecting gunicorn
  Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)
     |████████████████████████████████| 79 kB 8.7 MB/s 
[?25hRequirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (21.3)
Requirement already satisfied: sqlparse>=0.3.1 in /usr/local/lib/python3.7/dist-packages (from mlflow->pycaret) (0.4.2)
Collecting Mako
  Downloading Mako-1.1.6-py2.py3-none-any.whl (75 kB)
     |████████████████████████████████| 75 kB 4.4 MB/s 
[?25hCollecting python-editor>=0.3
  Downloading python_editor-1.0.4-py3-none-any.whl (4.9 kB)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.7/dist-packages (from databricks-cli>=0.8.7->mlflow->pycaret) (0.8.9)
Collecting websocket-client>=0.32.0
  Downloading websocket_client-1.2.3-py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 2.1 MB/s 
[?25hCollecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.9 MB/s 
[?25hCollecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy->mlflow->pycaret) (1.1.2)
Requirement already satisfied: Werkzeug<2.0,>=0.15 in /usr/local/lib/python3.7/dist-packages (from Flask->mlflow->pycaret) (1.0.1)
Requirement already satisfied: itsdangerous<2.0,>=0.24 in /usr/local/lib/python3.7/dist-packages (from Flask->mlflow->pycaret) (1.1.0)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.8.4)
Requirement already satisfied: bleach in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (4.1.0)
Requirement already satisfied: testpath in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.5.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.7.1)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.5.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.7/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.5.1)
Requirement already satisfied: prometheus-client in /usr/local/lib/python3.7/dist-packages (from prometheus-flask-exporter->mlflow->pycaret) (0.12.0)
Requirement already satisfied: future in /usr/local/lib/python3.7/dist-packages (from pyLDAvis->pycaret) (0.16.0)
Collecting pyLDAvis
  Downloading pyLDAvis-3.3.0.tar.gz (1.7 MB)
     |████████████████████████████████| 1.7 MB 37.5 MB/s 
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
  Downloading pyLDAvis-3.2.2.tar.gz (1.7 MB)
     |████████████████████████████████| 1.7 MB 45.7 MB/s 
[?25hRequirement already satisfied: numexpr in /usr/local/lib/python3.7/dist-packages (from pyLDAvis->pycaret) (2.7.3)
Collecting funcy
  Downloading funcy-1.16-py2.py3-none-any.whl (32 kB)
Requirement already satisfied: numba>=0.35 in /usr/local/lib/python3.7/dist-packages (from pyod->pycaret) (0.51.2)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.7/dist-packages (from pyod->pycaret) (0.10.2)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.7/dist-packages (from numba>=0.35->pyod->pycaret) (0.34.0)
Requirement already satisfied: patsy>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from statsmodels->pyod->pycaret) (0.5.2)
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.5.tar.gz (1.1 MB)
     |████████████████████████████████| 1.1 MB 49.9 MB/s 
[?25hBuilding wheels for collected packages: htmlmin, imagehash, alembic, databricks-cli, pyLDAvis, pyod, umap-learn, pynndescent
  Building wheel for htmlmin (setup.py) ... [?25l[?25hdone
  Created wheel for htmlmin: filename=htmlmin-0.1.12-py3-none-any.whl size=27098 sha256=d7dfcc5cb8473dd5eae3fcf51c538f92f876faa04e78c8b36d9c790b9fac7e10
  Stored in directory: /root/.cache/pip/wheels/70/e1/52/5b14d250ba868768823940c3229e9950d201a26d0bd3ee8655
  Building wheel for imagehash (setup.py) ... [?25l[?25hdone
  Created wheel for imagehash: filename=ImageHash-4.2.1-py2.py3-none-any.whl size=295207 sha256=8b1e1a54f9880fb8de0530e8e168811d3264000c0375d179b04677d7db738f6f
  Stored in directory: /root/.cache/pip/wheels/4c/d5/59/5e3e297533ddb09407769762985d134135064c6831e29a914e
  Building wheel for alembic (setup.py) ... [?25l[?25hdone
  Created wheel for alembic: filename=alembic-1.4.1-py2.py3-none-any.whl size=158172 sha256=3a382d7a8aa3f735be58614dc83527e0801ccb0bc893eb96cc388ee8f0a5dd91
  Stored in directory: /root/.cache/pip/wheels/be/5d/0a/9e13f53f4f5dfb67cd8d245bb7cdffe12f135846f491a283e3
  Building wheel for databricks-cli (setup.py) ... [?25l[?25hdone
  Created wheel for databricks-cli: filename=databricks_cli-0.16.2-py3-none-any.whl size=106811 sha256=ada21177391b9688188e6f778b0ec6b6001615c2b2f13bef53090805b2f183bf
  Stored in directory: /root/.cache/pip/wheels/f4/5c/ed/e1ce20a53095f63b27b4964abbad03e59cf3472822addf7d29
  Building wheel for pyLDAvis (setup.py) ... [?25l[?25hdone
  Created wheel for pyLDAvis: filename=pyLDAvis-3.2.2-py2.py3-none-any.whl size=135618 sha256=29ef50e1603fe00d18a256b833c7feddebc16ef3ac82f37f109f991b0f95b4b0
  Stored in directory: /root/.cache/pip/wheels/f8/b1/9b/560ac1931796b7303f7b517b949d2d31a4fbc512aad3b9f284
  Building wheel for pyod (setup.py) ... [?25l[?25hdone
  Created wheel for pyod: filename=pyod-0.9.5-py3-none-any.whl size=132699 sha256=851491ca675bc8eb4d9ecfb52396f362de25c1443531f442a9528c0b9b3f7b21
  Stored in directory: /root/.cache/pip/wheels/3d/bb/b7/62b60fb451b33b0df1ab8006697fba7a6a49709a629055cf77
  Building wheel for umap-learn (setup.py) ... [?25l[?25hdone
  Created wheel for umap-learn: filename=umap_learn-0.5.2-py3-none-any.whl size=82709 sha256=f4bae757148b4cf4930e495a816ecb3f6fcc3a16d1014c85ce052bb2acccb378
  Stored in directory: /root/.cache/pip/wheels/84/1b/c6/aaf68a748122632967cef4dffef68224eb16798b6793257d82
  Building wheel for pynndescent (setup.py) ... [?25l[?25hdone
  Created wheel for pynndescent: filename=pynndescent-0.5.5-py3-none-any.whl size=52603 sha256=e19d78d031a739792e30a3bf2d93865296b6eb66226835f66a8287b1330882f1
  Stored in directory: /root/.cache/pip/wheels/af/e9/33/04db1436df0757c42fda8ea6796d7a8586e23c85fac355f476
Successfully built htmlmin imagehash alembic databricks-cli pyLDAvis pyod umap-learn pynndescent
Installing collected packages: tangled-up-in-unicode, smmap, scipy, multimethod, joblib, websocket-client, visions, scikit-learn, requests, python-editor, Mako, imagehash, gitdb, querystring-parser, PyYAML, pynndescent, pydantic, prometheus-flask-exporter, phik, htmlmin, gunicorn, gitpython, funcy, docker, databricks-cli, alembic, umap-learn, scikit-plot, pyod, pyLDAvis, pandas-profiling, mlxtend, mlflow, lightgbm, kmodes, imbalanced-learn, Boruta, pycaret
  Attempting uninstall: scipy
    Found existing installation: scipy 1.4.1
    Uninstalling scipy-1.4.1:
      Successfully uninstalled scipy-1.4.1
  Attempting uninstall: joblib
    Found existing installation: joblib 1.1.0
    Uninstalling joblib-1.1.0:
      Successfully uninstalled joblib-1.1.0
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.1
    Uninstalling scikit-learn-1.0.1:
      Successfully uninstalled scikit-learn-1.0.1
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: PyYAML
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
  Attempting uninstall: pandas-profiling
    Found existing installation: pandas-profiling 1.4.1
    Uninstalling pandas-profiling-1.4.1:
      Successfully uninstalled pandas-profiling-1.4.1
  Attempting uninstall: mlxtend
    Found existing installation: mlxtend 0.14.0
    Uninstalling mlxtend-0.14.0:
      Successfully uninstalled mlxtend-0.14.0
  Attempting uninstall: lightgbm
    Found existing installation: lightgbm 2.2.3
    Uninstalling lightgbm-2.2.3:
      Successfully uninstalled lightgbm-2.2.3
  Attempting uninstall: imbalanced-learn
    Found existing installation: imbalanced-learn 0.8.1
    Uninstalling imbalanced-learn-0.8.1:
      Successfully uninstalled imbalanced-learn-0.8.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed Boruta-0.3 Mako-1.1.6 PyYAML-6.0 alembic-1.4.1 databricks-cli-0.16.2 docker-5.0.3 funcy-1.16 gitdb-4.0.9 gitpython-3.1.24 gunicorn-20.1.0 htmlmin-0.1.12 imagehash-4.2.1 imbalanced-learn-0.7.0 joblib-1.0.1 kmodes-0.11.1 lightgbm-3.3.1 mlflow-1.22.0 mlxtend-0.19.0 multimethod-1.6 pandas-profiling-3.1.0 phik-0.12.0 prometheus-flask-exporter-0.18.6 pyLDAvis-3.2.2 pycaret-2.3.5 pydantic-1.8.2 pynndescent-0.5.5 pyod-0.9.5 python-editor-1.0.4 querystring-parser-1.2.4 requests-2.26.0 scikit-learn-0.23.2 scikit-plot-0.3.7 scipy-1.5.4 smmap-5.0.0 tangled-up-in-unicode-0.1.0 umap-learn-0.5.2 visions-0.7.4 websocket-client-1.2.3
1
2
from pycaret.utils import enable_colab
enable_colab()
Colab mode enabled.
1
2
from pycaret.datasets import get_data
dataset = get_data('diamond')
1
2
3
4
5
6
7
8
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (5400, 8)
Unseen Data For Predictions: (600, 8)
1
2
from pycaret.regression import *
exp_reg101 = setup(data = data, target = 'Price', session_id=123)
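
After setup(), a typical next step in the PyCaret workflow (a sketch; this step is not shown here) is to train, cross-validate, and rank the candidate regressors:

1
2
best = compare_models()  # cross-validates the available regressors and returns the best one
print(best)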

Make a Timer Function in Python

Making a decorator that measures execution time



First, import the time module.

1
import time

Defining a function that checks execution time

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def timer(func):
    """Check a function's execution time.
    :param func: the function to be timed
    :return: the decorated function's result (the elapsed time is printed)
    """
    def wrapper(*args, **kwargs):
        # record the current time at the start
        time_start = time.time()

        # call the decorated function
        result = func(*args, **kwargs)
        time_total = time.time() - time_start

        print("{}, Total time is {:.2f} sec.".format(func.__name__, time_total))

        return result
    return wrapper
  • Variable-length positional parameters: *args

    • A single * before a parameter in a function definition means the function accepts an arbitrary number of positional arguments.
    • The extra arguments are packed into a tuple.
    • By convention the name *args (short for "arguments") is used.
    • It can be combined with other parameters.
  • Keyword parameters: **kwargs

    • Used to receive keyword arguments that are not declared in the function signature; they are passed as a dictionary.
    • When used together with regular and variable-length parameters, the order must be kept: regular > variable-length > keyword.
    • By convention the name **kwargs (short for "keyword arguments") is used (see the sketch right after this list).
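
A minimal demonstration of the packing behavior (the function name demo is hypothetical, just for illustration):

1
2
3
4
5
6
def demo(a, *args, **kwargs):
    # a is a regular parameter; extra positional arguments pack into args (a tuple)
    # and extra keyword arguments pack into kwargs (a dict)
    print(a, args, kwargs)

demo(1, 2, 3, x=4)  # prints: 1 (2, 3) {'x': 4}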



Using the decorator, define and run a function whose execution time is checked

1
2
3
4
5
6

@timer
def check_time(num):
    time.sleep(num)

if __name__ == "__main__":
    check_time(1.5)

out

check_time, Total time is 1.50 sec.
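
One common refinement (a sketch, not part of the original code): without functools.wraps, the decorated check_time.__name__ becomes 'wrapper'. The same timer with this fix:

1
2
3
4
5
6
7
8
9
10
11
import functools
import time

def timer(func):
    @functools.wraps(func)  # keep func.__name__ and __doc__ on the wrapper
    def wrapper(*args, **kwargs):
        time_start = time.time()
        result = func(*args, **kwargs)
        print("{}, Total time is {:.2f} sec.".format(func.__name__, time.time() - time_start))
        return result
    return wrapper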







The related theory is written out below.

  • timestamp : in Python, time is expressed as the number of seconds elapsed since 00:00:00 on January 1, 1970 (the Unix epoch)
  • time_struct class (time.struct_time)
    • provides an API for extracting the date and time from a given timestamp

name      value                 Ex.
tm_year   year                  1993, 2021
tm_mon    month                 1~12
tm_mday   day of the month      1~31
tm_hour   hour                  0~23
tm_min    minute                0~59
tm_sec    second                0~61
tm_wday   day of the week       0~6 (0 : MON)
tm_yday   day of the year       1~366
tm_isdst  daylight saving time  0: not applied, 1: applied
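
For example, the fields of a time_struct can be read by attribute name or by index (a quick sketch):

in

1
2
3
tm = time.localtime()
print(tm.tm_year, tm.tm_mon, tm.tm_mday)  # access by attribute name
print(tm[0], tm[1], tm[2])                # index access also works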



The time() function


  • Get the current timestamp

in

1
2
3
secs = time.time()
print(secs)

out

1638870356.8049076

  • The Unix timestamp is returned as a float; the integer part is the number of whole seconds.
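
For example, splitting the float into whole seconds and the sub-second remainder (a small sketch):

in

1
2
3
secs = time.time()
print(int(secs))         # whole seconds since the epoch
print(secs - int(secs))  # sub-second fraction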

Additional time functions

  1. gmtime() : converts a timestamp into a time_struct type based on GMT (UTC)

    in

    1
    2
    tm = time.gmtime(secs)
    print(tm)

    out

    time.struct_time(tm_year=2021, tm_mon=12, tm_mday=7, tm_hour=9, tm_min=53, tm_sec=5, tm_wday=1, tm_yday=341, tm_isdst=0)

  2. localtime() : converts a timestamp into a time_struct type based on the local timezone

    in

    1
    2
    3
    4
    5
    6
    7
    tm = time.localtime(secs)
    print("year:", tm.tm_year)
    print("month:", tm.tm_mon)
    print("day:", tm.tm_mday)
    print("hour:", tm.tm_hour)
    print("minute:", tm.tm_min)
    print("second:", tm.tm_sec)

    out

    year: 2021
    month: 12
    day: 7
    hour: 18
    minute: 53
    second: 5

  3. ctime() : prints the time in 'weekday month day hh:mm:ss year' format

    in

    1
    2
    string = time.ctime(secs)
    print(string)

    out

    Tue Dec 7 18:56:03 2021

  4. strftime() : converts a time into a string in a given format
    • Because it takes time_struct-type data as its parameter, first convert with the functions above and then pass the result to strftime().

      in

      1
      2
      3
      tmt = time.localtime(secs)
      string = time.strftime('%Y-%m-%d %I:%M:%S %p', tmt)
      print(string)

      out

      2021-12-07 07:00:54 PM

Since I kept reusing secs for the arguments while writing this, the time keeps climbing, haha ^0^

  5. strptime() : parses a time string in a specific format (the inverse of strftime) into a time_struct type.

    in

    1
    2
    3
    string = '2021-12-07 07:00:54 PM'
    tmm = time.strptime(string, '%Y-%m-%d %I:%M:%S %p')
    print(tmm)

    out

    time.struct_time(tm_year=2021, tm_mon=12, tm_mday=7, tm_hour=19, tm_min=0, tm_sec=54, tm_wday=1, tm_yday=341, tm_isdst=-1)
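
    As a sanity check (an addition, not in the original): time.mktime() converts a local-time time_struct back to a float timestamp, so strptime() and strftime() round-trip cleanly:

    in

    1
    2
    3
    string = '2021-12-07 07:00:54 PM'
    secs2 = time.mktime(time.strptime(string, '%Y-%m-%d %I:%M:%S %p'))
    print(time.strftime('%Y-%m-%d %I:%M:%S %p', time.localtime(secs2)))  # prints the original string back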

  6. sleep() : delays execution for a given amount of time

    in

    1
    2
    3
    print("Start-->")
    time.sleep(1.5)
    print("<--End")

    out

    Start-->
    <--End

  • time.sleep(sec) : delays execution by sec seconds.
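
A closing side note (an addition, not from the original): for measuring elapsed time, time.perf_counter() is preferable to time.time() because it is monotonic and high-resolution, so the timer decorator above could use it as a drop-in replacement:

1
2
3
4
start = time.perf_counter()  # monotonic, high-resolution clock
time.sleep(1.5)
elapsed = time.perf_counter() - start
print("elapsed: {:.2f} sec".format(elapsed))  # roughly 'elapsed: 1.50 sec'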