Posted 2022-01-20 · Updated 2022-01-20 · 4 min read (about 539 words)

Use Notion

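So Notion is supposed to be that convenient?

They say you can write a GitHub blog easily with Notion.
Naver Blog is super easy to write on, but writing a GitHub blog has been really hard for me.

Especially inserting pictures ㅠㅠ
It was really hard, but now I feel confident!

Let's try Notion

This post will probably be the last one I write in PyCharm.

Let's wrap up that last post in style.

Go to Notion and sign up with a Google account or any email address you like.

I already had a Gmail (Google) account, so signing up took two clicks.

Most of what you need is covered under Getting Started, so have a look there.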

Notion01

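Clicking Add a page adds a page, and you can also add subpages under a page.

You can easily change the cover image, and typing / brings up a list of the available commands.

Pictures can be inserted conveniently by copy and paste (Ctrl + C, Ctrl + V), just like on Naver.

In my case, I brought the image in with Ctrl + C, Ctrl + V, then moved it into the image directory in PyCharm and renamed it along the way.

Honestly, if I use one directory per post, writing in PyCharm might actually be easier …

They say the easiest place to write Markdown documents is Visual Studio Code.

Notion! The verdict?

Creating lots of directories inside source seems to be the much easier way to write.

If you want it to look pretty, it is probably best to use Notion from the start!

If you don't like my write-up, you can also refer to the Notion guidebook.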


Posted 2022-01-03 · Updated 2022-01-20 · python / Crawling · 12 min read (about 1870 words)

Crawling_basic(01)

Crawling

Let's crawl with a happy heart!

01. Preparing the files

The files are uploaded to the repository.

Open PyCharm, set up a Python virtual environment (venv), and work from there.

02. Crawling: installing BeautifulSoup

Start the BeautifulSoup installation here.

Crawling_beautifulsoup

Enter the command above in the terminal.

Check once more that BeautifulSoup is installed.

from bs4 import BeautifulSoup
print("library imported")

library imported is printed in the terminal.

03. Crawling: code

03.1 Initializing the object

def main():
    # Initialize the object; the parser argument is the key part: it makes the HTML accessible from Python
    soup = BeautifulSoup(open("data/index.html"), "html.parser")
    print(soup)

if __name__ == "__main__":
    main()

We define a function that opens the HTML file and parses it into a form Python can work with.

out:


library imported
<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Crawl This Page</title>
</head>
<body>
<div class="cheshire">
<p>Don't crawl this.</p>
</div>
<div class="elice">
<p>Hello, Python Crawling!</p>
</div>
</body>
</html>

Process finished with exit code 0

That is all it takes to read the file in.

03.2 Extracting the object you want: by HTML tag

# /c/../Crawring/venv/Scripts/python
# -*- Encoding : UTF-8 -*-

from bs4 import BeautifulSoup

print("library imported")


def main():
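    # Initialize the object; the parser argument is the key part: it makes the HTML accessible from Python
    soup = BeautifulSoup(open("data/index.html"), "html.parser")
    # Print the data we want
    print(soup.find("div").get_text())  # find_all returns a list of tags, so get_text() cannot be called on it directly
    # Save (Excel: pandas, DB)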

if __name__ == "__main__":
    main()

out:

library imported

Don't crawl this.

Process finished with exit code 0

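.find("div") : returns the first div tag
.get_text() : strips the tags and returns only the text
.find_all("div") : returns a list-like ResultSet of every match, so get_text() has to be called on each element rather than on the result itself.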
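The difference between find and find_all can be sketched as follows. This is a minimal example that parses an inline copy of the sample page instead of data/index.html (an assumption, so that no file is needed); the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample page shown earlier, so no file is needed.
HTML = """
<html><body>
<div class="cheshire"><p>Don't crawl this.</p></div>
<div class="elice"><p>Hello, Python Crawling!</p></div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns a single Tag, so get_text() works directly.
first_text = soup.find("div").get_text(strip=True)

# find_all() returns a list-like ResultSet; call get_text() on each element.
all_texts = [div.get_text(strip=True) for div in soup.find_all("div")]

print(first_text)
print(all_texts)
```

Looping over the result is usually what you want anyway, since the next step is saving one row per match.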

03.3 Extracting the object you want: by class name


def main():
    # Initialize the object
    soup = BeautifulSoup(open("data/index2.html"), "html.parser")
    # Print the data we want
    print(soup.find("div", class_ = "elice").find("p").get_text())
    # Save (Excel: pandas, DB)

if __name__ == "__main__":
    main()

03.4 Extracting the object you want: by id


def main():
    # Initialize the object
    soup = BeautifulSoup(open("data/index3.html"), "html.parser")
    # Print the data we want
    print(soup.find("div", id = "main").find("p").get_text())
    # Save (Excel: pandas, DB)

if __name__ == "__main__":
    main()
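The class and id lookups above can also be written as CSS selectors with select_one. The sketch below uses a small inline HTML string that I made up (the real contents of index2.html and index3.html are not shown here, so the markup is an assumption), and it also shows that a lookup with no match returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div carries both the class and the id used above.
HTML = """
<div class="cheshire"><p>Don't crawl this.</p></div>
<div class="elice" id="main"><p>Hello, Python Crawling!</p></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The find() style used in this post.
by_class = soup.find("div", class_="elice").find("p").get_text()
by_id = soup.find("div", id="main").find("p").get_text()

# The same lookups as CSS selectors.
by_css_class = soup.select_one("div.elice p").get_text()
by_css_id = soup.select_one("div#main p").get_text()

# A lookup that matches nothing returns None, so chaining .find() on it would fail.
missing = soup.find("div", id="nope")

print(by_class, by_id, by_css_class, by_css_id, missing)
```

Checking for None before chaining another call is a cheap way to avoid an AttributeError when the page layout changes.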

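04. Extracting data

The magical part, extracting the data, is done in Google Colab.

04.1 Fetching the data file

The data file can be obtained here, either for free or for a fee.
You need an authentication key to use it; you can get one here by entering your personal information.
Once you have the key, open Google Colab and continue.
The key is sent to the email address you enter, so make sure you type it correctly.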

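import requests
key = "*put your own authentication key here*"
# Insert the key into the query string instead of leaving a literal placeholder
url = f"http://data.ex.co.kr/openapi/trtm/realUnitTrtm?key={key}&type=json&iStartUnitCode=101&iEndUnitCode=103"

responses = requests.get(url)

<Response [200]>

If the authentication key is entered successfully, you get the output above.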
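Another way to assemble the request URL is to build the query string from a dict, which keeps the key and the unit codes out of the hard-coded string. This is only a sketch using the standard library; build_travel_time_url is a hypothetical helper name, and "TEST_KEY" is a placeholder rather than a real key. The parameter names are the ones that appear in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://data.ex.co.kr/openapi/trtm/realUnitTrtm"


def build_travel_time_url(key: str, start_code: int, end_code: int) -> str:
    """Hypothetical helper: build the request URL from its parts."""
    params = {
        "key": key,
        "type": "json",
        "iStartUnitCode": start_code,
        "iEndUnitCode": end_code,
    }
    # urlencode also escapes any special characters in the values.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_travel_time_url("TEST_KEY", 101, 103)
print(url)
```

The resulting string can be passed to requests.get(url) exactly as before; requests.get also accepts the same dict through its params argument.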

04.2 Turning responses into JSON

json = responses.json()

out:

{'code': 'SUCCESS',
 'count': 602,
 'message': '인증키가 유효합니다.',
 'numOfRows': 10,
 'pageNo': 1,
 'pageSize': 61,
 'realUnitTrtmVO': [{'efcvTrfl': '18',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '00:25',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.466666666666667',
   'timeMax': '9.216',
   'timeMin': '5.800'},
  {'efcvTrfl': '53',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '00:30',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.466666666666667',
   'timeMax': '9.283',
   'timeMin': '5.783'},
  {'efcvTrfl': '14',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '00:35',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.416666666666667',
   'timeMax': '8.400',
   'timeMin': '6.050'},
  {'efcvTrfl': '15',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '00:40',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.516666666666667',
   'timeMax': '9.283',
   'timeMin': '5.966'},
  {'efcvTrfl': '41',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '00:45',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.5',
   'timeMax': '8.800',
   'timeMin': '5.750'},
  {'efcvTrfl': '11',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '00:50',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.416666666666667',
   'timeMax': '8.800',
   'timeMin': '5.766'},
  {'efcvTrfl': '8',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '00:55',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.433333333333334',
   'timeMax': '8.350',
   'timeMin': '5.750'},
  {'efcvTrfl': '87',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '01  ',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.55',
   'timeMax': '10.183',
   'timeMin': '5.750'},
  {'efcvTrfl': '30',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '01:00',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.625',
   'timeMax': '10.183',
   'timeMin': '6.600'},
  {'efcvTrfl': '3',
   'endUnitCode': '103 ',
   'endUnitNm': '수원신갈',
   'iEndUnitCode': None,
   'iStartEndStdTypeCode': None,
   'iStartUnitCode': None,
   'numOfRows': None,
   'pageNo': None,
   'startEndStdTypeCode': '2',
   'startEndStdTypeNm': '도착기준통행시간',
   'startUnitCode': '101 ',
   'startUnitNm': '서울',
   'stdDate': '20220103',
   'stdTime': '01:05',
   'sumTmUnitTypeCode': None,
   'tcsCarTypeCode': '1',
   'tcsCarTypeDivCode': '1',
   'tcsCarTypeDivName': '소형차',
   'tcsCarTypeName': '1종',
   'timeAvg': '7.683333333333334',
   'timeMax': '8.250',
   'timeMin': '7.500'}]}

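The json variable holds the JSON-formatted data returned by responses.json(),
and cars holds the part I am looking for, picked out of that structure.
You need to know the structure of the data and which part you are after to pull out the information you want.

04.3 Extracting the information you want from the JSON

I'm not sure what the name means, but let's access the realUnitTrtmVO key of the dictionary
and pull the information out.

cars = json["realUnitTrtmVO"]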

out:

[{'efcvTrfl': '18',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '00:25',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.466666666666667',
  'timeMax': '9.216',
  'timeMin': '5.800'},
 {'efcvTrfl': '53',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '00:30',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.466666666666667',
  'timeMax': '9.283',
  'timeMin': '5.783'},
 {'efcvTrfl': '14',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '00:35',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.416666666666667',
  'timeMax': '8.400',
  'timeMin': '6.050'},
 {'efcvTrfl': '15',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '00:40',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.516666666666667',
  'timeMax': '9.283',
  'timeMin': '5.966'},
 {'efcvTrfl': '41',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '00:45',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.5',
  'timeMax': '8.800',
  'timeMin': '5.750'},
 {'efcvTrfl': '11',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '00:50',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.416666666666667',
  'timeMax': '8.800',
  'timeMin': '5.766'},
 {'efcvTrfl': '8',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '00:55',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.433333333333334',
  'timeMax': '8.350',
  'timeMin': '5.750'},
 {'efcvTrfl': '87',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '01  ',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.55',
  'timeMax': '10.183',
  'timeMin': '5.750'},
 {'efcvTrfl': '30',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '01:00',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.625',
  'timeMax': '10.183',
  'timeMin': '6.600'},
 {'efcvTrfl': '3',
  'endUnitCode': '103 ',
  'endUnitNm': '수원신갈',
  'iEndUnitCode': None,
  'iStartEndStdTypeCode': None,
  'iStartUnitCode': None,
  'numOfRows': None,
  'pageNo': None,
  'startEndStdTypeCode': '2',
  'startEndStdTypeNm': '도착기준통행시간',
  'startUnitCode': '101 ',
  'startUnitNm': '서울',
  'stdDate': '20220103',
  'stdTime': '01:05',
  'sumTmUnitTypeCode': None,
  'tcsCarTypeCode': '1',
  'tcsCarTypeDivCode': '1',
  'tcsCarTypeDivName': '소형차',
  'tcsCarTypeName': '1종',
  'timeAvg': '7.683333333333334',
  'timeMax': '8.250',
  'timeMin': '7.500'}]
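The lookup above can be tried out on a tiny hand-made payload with the same shape as the response. The two records below are trimmed copies of the first two entries in the output; the names payload, times and avg_values are mine and only for illustration.

```python
# Hand-made payload with the same shape as the response above (two records, trimmed fields).
payload = {
    "code": "SUCCESS",
    "realUnitTrtmVO": [
        {"stdDate": "20220103", "stdTime": "00:25", "endUnitNm": "수원신갈", "timeAvg": "7.466666666666667"},
        {"stdDate": "20220103", "stdTime": "00:30", "endUnitNm": "수원신갈", "timeAvg": "7.466666666666667"},
    ],
}

# .get() with a default avoids a KeyError if the key is missing from the response.
cars = payload.get("realUnitTrtmVO", [])

# Every value arrives as a string, so convert before doing arithmetic.
times = [car["stdTime"] for car in cars]
avg_values = [float(car["timeAvg"]) for car in cars]

print(times)
print(avg_values)
```

Working on a small copy like this makes it easier to see which keys are worth keeping before touching the full response.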

04.4 Exporting to a CSV file

Use pandas to turn the list of dictionaries from the JSON into a DataFrame and then a CSV file.

import pandas as pd
dt = []
for car in cars:
  dic_df = {}
  dic_df["date"] = car["stdDate"]
  dic_df["time"] = car["stdTime"]
  dic_df["destination"] = car["endUnitNm"]
  dt.append(dic_df)

pd.DataFrame(dt).to_csv("temp.csv",index = False, encoding="euc-kr")

For the encoding, check the documentation for the JSON data you are using.
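Why the encoding matters can be seen with a few lines of plain Python. This sketch encodes one of the destination names from the output with both EUC-KR and UTF-8, and shows that bytes written in one encoding cannot simply be read back with the other. The variable names are made up for illustration.

```python
text = "수원신갈"

euc_kr_bytes = text.encode("euc-kr")
utf8_bytes = text.encode("utf-8")

# The same four syllables take a different number of bytes in each encoding.
print(len(euc_kr_bytes), len(utf8_bytes))

# Decoding with the matching codec round-trips the text.
assert euc_kr_bytes.decode("euc-kr") == text

# Decoding EUC-KR bytes as UTF-8 fails instead of returning the original text.
try:
    euc_kr_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(mismatch_detected)
```

So whoever opens temp.csv has to use the same encoding that to_csv wrote. If the file is mainly going to be opened in Excel, encoding="utf-8-sig" is another option that is often used.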

Since you can also pull in and use APIs from companies like Google,
this technique seems to have endless potential ^^

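Posted 2021-12-25 · Updated 2021-12-24 · python / machineLeaning · 6 min read (about 851 words)

DTS: ML_Grid search(Hyper Parameter)

§ Previous posts

☞ PipeLine

☞ Learning curve

Drawing validation curves for an ML pipeLine

ML grid search
- Designing a pipeline (pipeLine) and tuning hyperparameters
  with grid search
- There are two approaches: grid search and random search.
  - Narrow the range down with random search first, then use grid search for a stable, exhaustive search!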
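The "random search first, then grid search" idea can be sketched like this. To stay self-contained it uses the copy of the breast cancer data bundled with scikit-learn (load_breast_cancer) instead of downloading wdbc.data, an RBF SVC, and search ranges that I picked arbitrarily; all of those are assumptions for illustration, not the settings used later in this post.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=1))

# Stage 1: coarse random search over wide log-uniform ranges.
coarse = RandomizedSearchCV(
    pipe,
    {"svc__C": loguniform(1e-3, 1e3), "svc__gamma": loguniform(1e-4, 1e1)},
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=1,
)
coarse.fit(X_train, y_train)
best_c = coarse.best_params_["svc__C"]
best_gamma = coarse.best_params_["svc__gamma"]

# Stage 2: small exhaustive grid centred on the coarse winner.
fine_grid = {
    "svc__C": [best_c / 3, best_c, best_c * 3],
    "svc__gamma": [best_gamma / 3, best_gamma, best_gamma * 3],
}
fine = GridSearchCV(pipe, fine_grid, cv=5, scoring="accuracy")
fine.fit(X_train, y_train)

print("coarse:", coarse.best_params_, coarse.best_score_)
print("fine:  ", fine.best_params_, fine.best_score_)
print("test accuracy:", fine.score(X_test, y_test))
```

Because the fine grid contains the coarse winner and both searches use the same folds, the second stage cannot score worse in cross-validation than the first; whether it scores better depends on the data.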
I don't feel like studying either, so
I just want to copy what other people are doing.

Other people: Kaggle competition

import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt 
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.svm import SVC 

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
                                                    # stratify = y, 
                                                    random_state=1)
kfold = StratifiedKFold(n_splits = 10, random_state=1, shuffle=True)

pipe_tree = make_pipeline(StandardScaler(), 
                          PCA(n_components=2), 
                          DecisionTreeClassifier(random_state=1))

# This line is the key!!

# estimator.get_params().keys()
# pipe_tree.get_params().keys() ---> this is how it is written; the keys are the valid names for param_grid.

print(pipe_tree.get_params().keys())
param_grid = [{"decisiontreeclassifier__max_depth": [1, 2, 3, 4, 5, 6, 7, None]}]

gs = GridSearchCV(estimator = pipe_tree, 
                  param_grid = param_grid, 
                  scoring="accuracy", 
                  cv = kfold)

gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

clf = gs.best_estimator_
# GridSearchCV automatically picks out the best estimator for us.
clf.fit(X_train, y_train) 
print("Test accuracy:", clf.score(X_test, y_test))

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'pca', 'decisiontreeclassifier', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'pca__copy', 'pca__iterated_power', 'pca__n_components', 'pca__random_state', 'pca__svd_solver', 'pca__tol', 'pca__whiten', 'decisiontreeclassifier__ccp_alpha', 'decisiontreeclassifier__class_weight', 'decisiontreeclassifier__criterion', 'decisiontreeclassifier__max_depth', 'decisiontreeclassifier__max_features', 'decisiontreeclassifier__max_leaf_nodes', 'decisiontreeclassifier__min_impurity_decrease', 'decisiontreeclassifier__min_samples_leaf', 'decisiontreeclassifier__min_samples_split', 'decisiontreeclassifier__min_weight_fraction_leaf', 'decisiontreeclassifier__random_state', 'decisiontreeclassifier__splitter'])
0.927536231884058
{'decisiontreeclassifier__max_depth': 7}
Test accuracy: 0.9210526315789473
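Where the long key names come from can be shown with a much smaller pipeline. As far as I understand it, make_pipeline names each step after its lowercased class name, while Pipeline lets you choose the names yourself; the grid key is then step name, two underscores, parameter name. The step names "scale" and "tree" below are made up for illustration.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# make_pipeline: step names are generated from the class names.
auto_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
auto_names = [name for name, _ in auto_pipe.steps]

# Pipeline: step names are chosen by hand ("scale" and "tree" are arbitrary).
named_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=1)),
])

# The param_grid key is "<step name>__<parameter name>" in both cases.
auto_key = "decisiontreeclassifier__max_depth"
named_key = "tree__max_depth"

auto_key_ok = auto_key in auto_pipe.get_params()
named_key_ok = named_key in named_pipe.get_params()

print(auto_names, auto_key_ok, named_key_ok)
```

Checking a key against get_params() before starting the search catches typos without waiting for the search to fail.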

Hyperparameter tuning with SVC

import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt 
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.svm import SVC 

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
                                                    # stratify = y, 
                                                    random_state=1)

pipe_svc = make_pipeline(StandardScaler(), 
                        PCA(n_components=2), 
                        SVC(random_state=1))

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{"svc__C": param_range, 
               "svc__gamma": param_range, 
               "svc__kernel": ["linear"]}]

gs = GridSearchCV(estimator = pipe_svc, 
                  param_grid = param_grid, 
                  scoring="accuracy", 
                  cv = 10)

gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

clf = gs.best_estimator_
clf.fit(X_train, y_train) 
print("Test accuracy:", clf.score(X_test, y_test))
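One thing worth noting about the grid above: SVC only uses gamma for the rbf, poly and sigmoid kernels, so with kernel fixed to "linear" the eight gamma values repeat the same fit. A possible alternative, sketched below, is to give GridSearchCV a list of two dicts, one per kernel. ParameterGrid is used here only to count the candidates; the split layout is my suggestion, not something from the original run.

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

# The grid from this post: gamma is varied even though the linear kernel ignores it.
single_grid = [{"svc__C": param_range,
                "svc__gamma": param_range,
                "svc__kernel": ["linear"]}]

# Suggested layout: gamma only appears next to a kernel that uses it.
split_grid = [
    {"svc__C": param_range, "svc__kernel": ["linear"]},
    {"svc__C": param_range, "svc__gamma": param_range, "svc__kernel": ["rbf"]},
]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))

print(n_single, n_split)
```

The split grid can be passed as param_grid to GridSearchCV in the same way as the original one.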

Hehehe

Posted 2021-12-24 · Updated 2021-12-24 · python / machineLeaning · 5 min read (about 823 words)

DTS: ML_Validation CurveG(01)

§ Previous posts

☞ PipeLine

☞ Learning curve

Drawing a validation curve

import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt 
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from lightgbm import LGBMClassifier

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
                                                    # stratify = y, 
                                                    random_state=1)
kfold = StratifiedKFold(n_splits = 10, random_state=1, shuffle=True)

pipe_lr = make_pipeline(StandardScaler(), 
                        PCA(n_components=2), 
                        LogisticRegression(solver = "liblinear", penalty = "l2", random_state=1))

param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(estimator=pipe_lr, 
                                                        X = X_train, 
                                                        y = y_train, 
                                                        param_name = "logisticregression__C", 
                                                        param_range = param_range, 
                                                        cv = kfold)

train_mean = np.mean(train_scores, axis = 1)
train_std = np.std(train_scores, axis = 1)
test_mean = np.mean(test_scores, axis = 1)
test_std = np.std(test_scores, axis = 1)

fig, ax = plt.subplots(figsize = (16, 10))
ax.plot(param_range, train_mean, color = "blue", marker = "o", markersize=5, label = "training accuracy")
ax.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha = 0.15, color = "blue") # shaded band: spread of the scores (mean ± 1 std)
ax.plot(param_range, test_mean, color = "green", marker = "s", linestyle = "--", markersize=5, label = "Validation accuracy")
ax.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha = 0.15, color = "green")
plt.grid()
plt.xscale("log")
plt.xlabel("Parameter C")
plt.ylabel("Accuracy")
plt.legend(loc = "lower right")
plt.ylim([0.8, 1.03])
plt.tight_layout()
plt.show()

ML_ValidationCurve
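Besides reading the plot by eye, the best value of C can be picked from the scores directly. The sketch below is a trimmed-down variant: it uses scikit-learn's bundled copy of the breast cancer data (load_breast_cancer) so nothing is downloaded, skips PCA, and uses cv=5. Those are simplifications of mine, so the numbers will not match the figure above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(solver="liblinear", random_state=1))

param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, valid_scores = validation_curve(
    estimator=pipe_lr,
    X=X,
    y=y,
    param_name="logisticregression__C",
    param_range=param_range,
    cv=5,
)

# Scores have one row per C value and one column per fold.
valid_mean = valid_scores.mean(axis=1)
best_index = int(np.argmax(valid_mean))
best_c = param_range[best_index]

print("best C:", best_c, "mean validation accuracy:", valid_mean[best_index])
```

If several values of C score almost the same, the smaller one (stronger regularization) is often the safer choice.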

Loading the data

import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt 
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from lightgbm import LGBMClassifier

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

Split into train and test sets and design the pipe line


le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
                                                    # stratify = y, 
                                                    random_state=1)
kfold = StratifiedKFold(n_splits = 10, random_state=1, shuffle=True)

pipe_lr = make_pipeline(StandardScaler(), 
                        PCA(n_components=2), 
                        LogisticRegression(solver = "liblinear", penalty = "l2", random_state=1))

Grid search (validation curve over C)


param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(estimator=pipe_lr, 
                                                        X = X_train, 
                                                        y = y_train, 
                                                        param_name = "logisticregression__C", 
                                                        param_range = param_range, 
                                                        cv = kfold)

train_mean = np.mean(train_scores, axis = 1)
train_std = np.std(train_scores, axis = 1)
test_mean = np.mean(test_scores, axis = 1)
test_std = np.std(test_scores, axis = 1)

fig, ax = plt.subplots(figsize = (8, 5))
ax.plot(param_range, train_mean, color = "blue", marker = "o", markersize=5, label = "training accuracy")
ax.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha = 0.15, color = "blue") # shaded band: spread of the scores (mean ± 1 std)
ax.plot(param_range, test_mean, color = "green", marker = "s", linestyle = "--", markersize=5, label = "Validation accuracy")
ax.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha = 0.15, color = "green")
plt.grid()
plt.xscale("log")
plt.xlabel("Parameter C")
plt.ylabel("Accuracy")
plt.legend(loc = "lower right")
plt.ylim([0.8, 1.03])
plt.tight_layout()
plt.show()

ML_Gridsearch_g

Posted 2021-12-24 · Updated 2021-12-24 · python / machineLeaning · 4 min read (about 532 words)

DTS: ML_Learning CurveG(01)

§ Previous posts

☞ PipeLine

§ Next posts

☞ PipeLine

Drawing a learning curve

Run the ML model using a pipeLine,
then draw the learning and validation curves to check the model.
The two curves are generally drawn together.

Load the data, split off the training set, and define cross-validation

import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)
print(df.info())

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)
print("Dependent variable classes:", le.classes_)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, stratify = y, random_state=1)

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe_lr = make_pipeline(StandardScaler(), 
                        PCA(n_components=2), 
                        LogisticRegression(solver="liblinear", random_state=1))

Computing the learning curve results

from sklearn.model_selection import learning_curve
import numpy as np

train_sizes, train_scores, test_scores = learning_curve(
    estimator = pipe_lr, 
    X = X_train, 
    y = y_train, 
    train_sizes = np.linspace(0.1, 1.0, 10), 
    cv = 10
)

train_mean = np.mean(train_scores, axis = 1)
train_std = np.std(train_scores, axis = 1)
test_mean = np.mean(test_scores, axis = 1)
test_std = np.std(test_scores, axis = 1)

print("mean(train)-----------------\n", train_mean,"\n mean(test)-----------------\n",test_mean )

print("STD(train)-----------------\n", train_std,"\n STD(test)-----------------\n",test_std )

mean(train)-----------------
[0.9525 0.96049383 0.93032787 0.92822086 0.93382353 0.93469388
0.94090909 0.94740061 0.94945652 0.95378973]
mean(test)-----------------
[0.92763285 0.92763285 0.93415459 0.93415459 0.93855072 0.94516908
0.94956522 0.947343 0.94516908 0.94956522]
STD(train)-----------------
[0.0075 0.00493827 0.00839914 0.01132895 0.00395209 0.00730145
0.00862865 0.0072109 0.00656687 0.00632397]
STD(test)-----------------
[0.0350718 0.02911549 0.02165313 0.02743013 0.02529372 0.02426857
0.0238436 0.02421442 0.02789264 0.02919026]

Learning Curve Graph

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize = (8,5))
ax.plot(train_sizes, 
        train_mean, 
        color = "blue", 
        marker = "o", 
        markersize = 10, 
        label = "training acc.")
ax.fill_between(train_sizes, 
                train_mean + train_std, 
                train_mean - train_std, 
                alpha = 0.15, color = "darkblue")

ax.plot(train_sizes,
        test_mean, color = "green",
        marker = "s",
        linestyle = "--", # dashed line
        markersize = 10,
        label = "testing acc.")

ax.fill_between(train_sizes, 
                test_mean + test_std, 
                test_mean - test_std, 
                alpha = 0.15, color = "salmon")
plt.grid()
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend(loc = "lower right")
plt.ylim([0.8, 1.03])
plt.tight_layout()
plt.show()

# As the number of samples grows, the training and validation accuracies move closer together.
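To put a number on that convergence, the gap between the two curves can be computed at each training size. This is only a small sketch: the arrays are copied by hand from the printed output earlier in this post (the first array printed is `train_mean`, the second is `test_mean`), so the values are rounded.

```python
import numpy as np

# Mean accuracies copied from the printed output of learning_curve above,
# one value per training-set size (10% ... 100% of X_train).
train_mean = np.array([0.9525, 0.96049383, 0.93032787, 0.92822086, 0.93382353,
                       0.93469388, 0.94090909, 0.94740061, 0.94945652, 0.95378973])
test_mean = np.array([0.92763285, 0.92763285, 0.93415459, 0.93415459, 0.93855072,
                      0.94516908, 0.94956522, 0.947343, 0.94516908, 0.94956522])

# Positive gap: the pipeline scores better on the folds it was trained on.
gap = train_mean - test_mean
print("gap at smallest training size:", round(float(gap[0]), 4))
print("gap at largest training size: ", round(float(gap[-1]), 4))
```

A gap that stays large as samples are added would hint at overfitting; here it shrinks to well under one percentage point.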

ML_Learning_Curve

I can tell this is a good field to be in, and it is fun, but still... haha

Posted 2021-12-22 · Updated 2021-12-24 · python / machineLeaning · 5 min read (about 768 words)

DTS: Building and Using a Pipeline

§ Next posting

☞ PipeLine

☞ Learning curve

sklearn.pipeline.Pipeline

class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)
data : ref

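It is hard to inspect a model directly.

A pipeline makes it easy to check whether the model is overfitting.

This is because of MLOps(?).

sklearn.pipeline

Pipeline: a pipeline of transforms with a final estimator.
It bundles several steps that can be cross-validated together while varying their parameters, so that they can be used like a single function.
The parameters of the chained estimators can be set by name,
and an individual step can be replaced
or removed.
- convenience and encapsulation
- joint parameter selection
- safety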
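The "set by name, or remove" part can be sketched as follows. This is a minimal example on synthetic data from `make_classification` (an assumption made to keep it self-contained, not the dataset used in this post); it relies on the `<step>__<parameter>` naming that `set_params` accepts and on the `"passthrough"` placeholder for skipping a step.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 features (illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(solver="liblinear", random_state=1))

# make_pipeline names every step after its lowercased class name.
step_names = list(pipe.named_steps)
print(step_names)

# Set a parameter of one step by name: <step>__<parameter>.
pipe.set_params(pca__n_components=3)
pipe.fit(X, y)
n_components_used = pipe.named_steps["pca"].n_components_

# Skip a step entirely by replacing it with "passthrough".
pipe.set_params(pca="passthrough")
pipe.fit(X, y)
n_features_seen = pipe.named_steps["logisticregression"].coef_.shape[1]
print(n_components_used, n_features_seen)
```

The same `<step>__<parameter>` names are what a grid search over the pipeline would use, which is what "joint parameter selection" refers to.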

I am not sure what exactly we did, but let's summarize today's work.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
import numpy as np

First, import the libraries needed to do ML with sklearn.

Loading the data

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)
print(df.info())

Splitting into train and test

X = df.loc[:, "radius_mean":].values
y = df.loc[:, "diagnosis"].values

le = LabelEncoder()
y = le.fit_transform(y)
print("Target classes:", le.classes_)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, stratify = y, random_state=1)

This single piece of code is the pipeline

LogisticRegression

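from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(solver="liblinear", random_state=1))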

PipeLine_LR

DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        DecisionTreeClassifier(random_state=0))

PipeLine_DTC

LGBM

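from lightgbm import LGBMClassifier
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LGBMClassifier(objective='binary', random_state=5))

LGBMC: my first attempt used objective='multiclass' and it did not work. The diagnosis target has only two classes, so 'binary' is probably the right objective (I have not checked this carefully).
Either way, estimators can be swapped in and out like this to compare them.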

Building the pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(solver="liblinear", random_state=1))

kfold = StratifiedKFold(n_splits = 10, random_state=1, shuffle=True).split(X_train, y_train)
scores = []
for k, (train, test) in enumerate(kfold):
  pipe_lr.fit(X_train[train], y_train[train])
  score = pipe_lr.score(X_train[test], y_train[test])
  scores.append(score)
  print("Fold: %2d, class distribution: %s, accuracy: %.3f" % (k+1, np.bincount(y_train[train]), score))

print("\nCV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))

from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_lr,
                         X = X_train,
                         y = y_train,
                         cv = 10,
                         n_jobs = 1)

print("CV accuracy scores: %s" % scores)
print("CV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
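The manual StratifiedKFold loop and `cross_val_score` do the same job. A sketch of that equivalence, under two assumptions: it uses `load_breast_cancer`, scikit-learn's bundled copy of the WDBC data, instead of the downloaded CSV, and it uses an unshuffled `StratifiedKFold`, which is what `cv = 10` resolves to for a classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the breast cancer data (no download needed).
X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(solver="liblinear", random_state=1))

# Manual loop: fit a fresh copy of the pipeline on each training fold
# and score it on the held-out fold.
cv = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(pipe)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# One-liner: cross_val_score builds the same stratified folds internally.
auto_scores = cross_val_score(pipe, X, y, cv=10)

print("same fold scores:", np.allclose(manual_scores, auto_scores))
```

The loop in this post shuffles the folds (`shuffle=True, random_state=1`), so its per-fold numbers can differ slightly from the `cross_val_score` call even though the procedure is the same.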

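We were told to check how this differs from the Kaggle data,
but I cannot find where it is on Kaggle.
I got a reasonable feel for Java, but with Python I have no feel for it at all.
I am wondering whether I should just take a Python course …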

Posted 2021-12-22 · Updated 2021-12-23 · python / machineLeaning · 2 min read (about 299 words)

DTS: Outlier detection02

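§ Data source

Finding outliers

When values overlap across variables, or when a variable is categorical or continuous

Correlation matrix for numeric data

# check the correlations between numeric columns
covidtotals.corr(method = "pearson", numeric_only = True)

|corr| < 0.2: weak correlation
0.3 ≤ |corr| ≤ 0.6: moderate correlation
This shows how strongly the variables are correlated.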

crosstab

Crosstab of the total-case quantiles against the total-death quantiles
- cases: number of confirmed cases
- deaths: number of deaths

pd.crosstab(covidtotalsonly["total_cases_q"], covidtotalsonly["total_deaths_q"])
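The post does not show how the `total_cases_q` and `total_deaths_q` columns were built. One plausible way, sketched here as an assumption, is to cut each total into five quantile bins with `pd.qcut`. The numbers below are made up, and of the five labels only "very high" and "medium" actually appear in this post.

```python
import pandas as pd

# Made-up totals for ten hypothetical countries (not the covidtotals data).
toy = pd.DataFrame({
    "total_cases":  [10, 50, 120, 400, 900, 1500, 3000, 8000, 20000, 90000],
    "total_deaths": [0, 1, 3, 10, 20, 30, 60, 90, 15, 2500],
})

labels = ["very low", "low", "medium", "high", "very high"]

# qcut puts roughly the same number of rows into each of the five bins.
toy["total_cases_q"] = pd.qcut(toy["total_cases"], q=5, labels=labels)
toy["total_deaths_q"] = pd.qcut(toy["total_deaths"], q=5, labels=labels)

table = pd.crosstab(toy["total_cases_q"], toy["total_deaths_q"])
print(table)

# Rows whose two quantile labels disagree are the candidates worth a look.
mismatch = toy[toy["total_cases_q"].astype(str) != toy["total_deaths_q"].astype(str)]
print(mismatch)
```

Counts on the diagonal of the crosstab are countries whose cases and deaths fall in the same bin; anything far off the diagonal is worth inspecting.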

Outlier_crosstab

A country whose death quantile does not match its case quantile is an outlier candidate, e.g. very high cases but only medium deaths

covidtotals.loc[(covidtotalsonly["total_cases_q"]== "very high") & (covidtotalsonly["total_deaths_q"]== "medium")].T


import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.regplot(x = "total_cases_pm", y = "total_deaths_pm", data = covidtotals, ax = ax)
ax.set(xlabel = "Cases Per Million", ylabel = "Deaths Per Million", title = "Total Covid Cases and Deaths per Million by Country")
ax.ticklabel_format(axis = "x", useOffset=False, style = "plain")
plt.xticks(rotation=90)
plt.show()

Outlier_regplot

Posted 2021-12-22 · Updated 2021-12-21 · python / machineLeaning · 3 min read (about 497 words)

DTS: Missing Value detection(02)

♠ Ref.01

I made the notebook public, but I am not sure it will show up in search.

Missing Value: checking for missing values

Data loading

import pandas as pd
covidtotals = pd.read_csv("../input/covid-data/covidtotals.csv")
covidtotals.head()

MissingValue_covidtotals

data info

covidtotals.info()

MissingValue_covid_info

Data division

Columns related to demographics
Columns related to Covid

case_vars = ["location", "total_cases", "total_deaths", "total_cases_pm", "total_deaths_pm"]
demo_vars = ["population", "pop_density", "median_age", "gdp_per_capita", "hosp_beds"]

Counting missing values per column in demo_vars

covidtotals[demo_vars].isnull().sum(axis = 0) # count missing values per column

MissingValue_covid_isnullsum

Counting missing values per column in case_vars

covidtotals[case_vars].isnull().sum(axis = 0) # count missing values per column

MissingValue_covid_nullSum

case_vars has no missing values, but demo_vars does.

pop_density      12
median_age       24
gdp_per_capita   28
hosp_beds        46

Each of the columns above has the number of missing values shown.

Checking missing values row-wise

demovars_misscnt = covidtotals[demo_vars].isnull().sum(axis = 1)
demovars_misscnt.value_counts()

0 156
1 24
2 12
3 10
4 8
dtype: int64

covidtotals[case_vars].isnull().sum(axis = 1).value_counts()

0 210
dtype: int64

Listing the countries with three or more missing demographic values

# columns to display: location plus the demographic columns
["location"] + demo_vars
covidtotals.loc[demovars_misscnt >= 3, ["location"] + demo_vars].T

MissingValue_covid_Location

The case columns have no missing countries, but let's check anyway

casevars_misscnt = covidtotals[case_vars].isnull().sum(axis = 1)
casevars_misscnt.value_counts()

0 210
dtype: int64

covidtotals[covidtotals['location'] == "Hong Kong"]

temp = covidtotals.copy()
temp[case_vars].isnull().sum(axis = 0)
# assign the result back instead of calling fillna(inplace=True) on a column
temp["total_cases_pm"] = temp["total_cases_pm"].fillna(0)
temp["total_deaths_pm"] = temp["total_deaths_pm"].fillna(0)
temp[case_vars].isnull().sum(axis = 0)

MissingValue_covid_Del

I am not sure about this part. The rows could also simply be deleted.

Posted 2021-12-21 · Updated 2021-12-21 · python / machineLeaning · 4 min read (about 623 words)

DTS: Missing Value detection(01)

♠ Ref.01

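Missing Value

Definition:
1. Missing features (missing data) have to be handled for ML to work well.
2. Values such as NA or NaN
Types:
1. Random: missing values with no pattern
2. Not random: missing values that follow a pattern

Deletion

If deleting the data does not change its characteristics, this is the best method
- dropna()
- axis = (0: drop rows, default), (1: drop columns)
- subset = (only look for missing data in the given features)
Listwise deletion
- delete every row that contains a missing value
Pairwise deletion (only the individual missing values are left out)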

df = df.dropna() # drop every row that contains a missing value
df = df.dropna(axis = 1) # drop every column that contains a missing value

df = df.dropna(how = 'all') # drop rows in which every value is missing
df = df.dropna(thresh = 2) # keep only rows with at least 2 non-missing values

df = df.dropna(subset=['col1', 'col2', 'col3'])

# drop a row only when all of the listed columns are missing
df = df.dropna(subset=['col1', 'col2', 'col3'], how = 'all')

# keep rows with at least 1 non-missing value among the listed columns
df = df.dropna(subset=['col1', 'col2', 'col3'], thresh = 1 )

# apply in place
df.dropna(inplace = True)
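
Imputation

1. Replace missing values with a specific value
  - mode: most frequent value
    + categorical data; replace with the value that occurs most often
  - median
    + continuous data; replace with the median of the non-missing values
  - mean
    + continuous data; replace with the mean of the non-missing values
  - similar case imputation: conditional replacement
  - generalized imputation: replacement based on regression analysis
2. Functions used
   - fillna(), replace(), interpolate()

fillna(): filling with 0

df.fillna(0)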

df[].fillna(): imputing only a specific column

# replace with 0
df['col'] = df['col'].fillna(0)

# replace with the column mean
df['col'] = df['col'].fillna(df['col'].mean())

# fill with the value directly above (forward fill)
df.ffill()

# fill with the value directly below (backward fill)
df.bfill()

replace()

# replace missing values with -50
df.replace(to_replace = np.nan, value = -50)

interpolate()

Assume the values change linearly and fill each gap from its neighbors.

df.interpolate(method = 'linear' , limit_direction = 'forward')
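To see how these options differ, here they are applied to one small made-up Series (the values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy series with two gaps: positions 1-2 and position 4.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

filled_zero = s.fillna(0)            # constant value
filled_mean = s.fillna(s.mean())     # mean of the non-missing values 1, 4, 6
filled_ffill = s.ffill()             # copy the value above
filled_bfill = s.bfill()             # copy the value below
filled_linear = s.interpolate(method="linear", limit_direction="forward")

comparison = pd.DataFrame({
    "original": s,
    "zero": filled_zero,
    "mean": filled_mean,
    "ffill": filled_ffill,
    "bfill": filled_bfill,
    "linear": filled_linear,
})
print(comparison)
```

Only the linear interpolation recovers the evenly spaced 1 to 6 sequence; which option is appropriate depends on what the column represents.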

Prediction model
- assumes that the missing values follow a pattern.
- predicts them from a dataset made up of the columns that have no missing values
- regression analysis or ML/statistical techniques such as SVM can be used.
Guideline (Missing Value: MV)
- MV < 10%: delete or impute
- 10% < MV < 50%: regression or model-based imputation
- 50% < MV: drop the column
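The guideline can be turned into a small helper. This is only a sketch: the function name `suggest_strategy` is hypothetical, the data is made up, and because the guideline does not say what to do at exactly 10% or 50%, the boundary handling below is an assumption.

```python
import numpy as np
import pandas as pd

def suggest_strategy(df):
    """Return the missing ratio per column and the guideline's suggestion."""
    ratio = df.isnull().mean()  # fraction of missing values per column

    def rule(r):
        if r == 0:
            return "nothing to do"
        if r < 0.10:
            return "delete or impute"
        if r < 0.50:  # assumption: exactly 10% falls in this middle band
            return "regression or model-based imputation"
        return "drop the column"  # assumption: exactly 50% is dropped

    return pd.DataFrame({"missing_ratio": ratio, "suggestion": ratio.map(rule)})

# Made-up frame with 20 rows and 0%, 5%, 30% and 60% missing values.
n = 20
toy = pd.DataFrame({
    "complete": np.arange(n, dtype=float),
    "few_missing": [np.nan] * 1 + [1.0] * (n - 1),
    "some_missing": [np.nan] * 6 + [1.0] * (n - 6),
    "mostly_missing": [np.nan] * 12 + [1.0] * (n - 12),
})

report = suggest_strategy(toy)
print(report)
```

On the data in these posts it could be called as `suggest_strategy(covidtotals[demo_vars])`.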

Posted 2021-12-21 · Updated 2021-12-22 · python / machineLeaning · 8 min read (about 1215 words)

DTS: Outlier detection01

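Finding outliers

What counts as an outlier is subjective: it differs between researchers and between industries.
Outliers in statistics
- the data is not normally distributed: outliers are present
- skewness and kurtosis appear.
Uniform distribution

1. Finding outliers using a single variable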

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm # plots for checking normality (Q-Q plot)
import scipy.stats as scistat # library for the Shapiro-Wilk test

covidtotals = pd.read_csv("../input/covid-data/covidtotals.csv")
covidtotals.set_index("iso_code", inplace = True)

case_vars = ["location", "total_cases", "total_deaths", "total_cases_pm", "total_deaths_pm"]
demo_vars = ["population", "pop_density", "median_age", "gdp_per_capita", "hosp_beds"]

covidtotals.head()

covidtotals_Kg

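As in the missing-value post, the covidtotals data is loaded and run in a Kaggle notebook.

Displaying the data by quantile

This uses functions built into pandas.

covid_case_df = covidtotals.loc[:, case_vars]
covid_case_df.describe()

covid_case_df.quantile(np.arange(0.0, 1.1, 0.1), numeric_only = True)
# np.arange excludes the stop value, so 1.1 is used to include 1.0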

outlier_quantile

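Computing skewness (degree of symmetry) and kurtosis (degree of peakedness)

Again, pandas functions are used.

Before we start

Futrue_warring

pandas.DataFrame.skew

When a warning like the one above appears, you should be able to resolve it by searching; here, passing numeric_only explicitly takes care of it.

Computing skewness

covid_case_df.skew(axis=0, numeric_only = True)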

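total_cases        10.804275
total_deaths        8.929816
total_cases_pm      4.396091
total_deaths_pm     4.674417
dtype: float64

Skewness has to lie between -1 and 1 for the distribution to be symmetric.
|skewness| < 3: generally acceptable
These values show that the distributions are not symmetric (= not normally distributed).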

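Computing kurtosis

The kurtosis of a normal distribution is 0 (pandas reports excess kurtosis).
- greater than 0: more sharply peaked
- less than 0: flatter

# compute kurtosis
covid_case_df.kurtosis(axis=0, numeric_only = True)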

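total_cases        134.979577
total_deaths        95.737841
total_cases_pm      25.242790
total_deaths_pm     27.238232
dtype: float64

The kurtosis should stay within roughly 5 to 10 at most, but these values are far larger, so the data is not normally distributed.
- |kurtosis| < 7: generally acceptable (= otherwise not normally distributed)
- This tells us that outliers are very likely present.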
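Both rules of thumb can be checked in one pass. A sketch on synthetic data (a roughly normal column and a heavy-tailed one); the helper name `shape_report` is hypothetical and the limits 3 and 7 are simply the ones quoted in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one close to normal, one strongly right-skewed.
toy = pd.DataFrame({
    "roughly_normal": rng.normal(size=2000),
    "heavy_tailed": rng.lognormal(sigma=2.0, size=2000),
})

def shape_report(df, skew_limit=3.0, kurt_limit=7.0):
    """Skewness and excess kurtosis per numeric column, checked against the limits."""
    report = pd.DataFrame({
        "skew": df.skew(numeric_only=True),
        "kurtosis": df.kurtosis(numeric_only=True),
    })
    report["within_limits"] = (
        (report["skew"].abs() < skew_limit) & (report["kurtosis"].abs() < kurt_limit)
    )
    return report

report = shape_report(toy)
print(report)
```

On the post's data the call would be `shape_report(covid_case_df)`; with the values printed above, every column would fail the check.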

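Normality tests

Ways to check the normality assumption
1. Q-Q plot
  - checks normality graphically
  - it is judged by eye, so the interpretation is subjective.
2. Shapiro-Wilk test
  - null hypothesis: the population the sample comes from is normally distributed (H0 is kept when p-value > 0.05)
  - alternative hypothesis: the population the sample comes from is not normally distributed
  - p-value < 0.05: reject the null hypothesis in favor of the alternative
3. Kolmogorov-Smirnov test
  - a goodness-of-fit test based on the EDF (empirical distribution function)
  - compares the data, via its mean/standard deviation and histogram, with the standard normal distribution
  - p-value > 0.05: normality is assumed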
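The post only runs the Shapiro-Wilk test, so here is a minimal sketch of the Kolmogorov-Smirnov variant with `scipy.stats.kstest`. It uses a made-up right-skewed sample rather than the covid data. The sample is standardized with its own mean and standard deviation before being compared with the standard normal, which makes the p-value approximate rather than exact.

```python
import numpy as np
import scipy.stats as scistat

rng = np.random.default_rng(0)

# Made-up, strongly right-skewed sample (illustration only).
skewed = rng.lognormal(sigma=1.5, size=300)

# Standardize, then compare the empirical distribution with the
# standard normal CDF.
standardized = (skewed - skewed.mean()) / skewed.std(ddof=1)
result = scistat.kstest(standardized, "norm")

print("statistic:", result.statistic)
print("p-value:  ", result.pvalue)
print("normality assumed:", result.pvalue > 0.05)
```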

Shapiro-Wilk Test

# Shapiro-Wilk test
scistat.shapiro(covid_case_df['total_cases'])

ShapiroResult(statistic=0.19379639625549316, pvalue=3.753789128593843e-29)

Significance is judged from the p-value.
The p-value is 3.75e-29, far below 0.05, so total_cases is not normally distributed.

Each of the columns below has to be put into covid_case_df['total_cases'] in turn.

case_vars = ["location", "total_cases", "total_deaths", "total_cases_pm", "total_deaths_pm"]
demo_vars = ["population", "pop_density", "median_age", "gdp_per_capita", "hosp_beds"]

We were told that once you wrap this in a function, it becomes your own code.
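One way such a function might look, as a sketch: `shapiro_by_column` is a hypothetical name, and the demo data is synthetic so that the block runs on its own.

```python
import numpy as np
import pandas as pd
import scipy.stats as scistat

def shapiro_by_column(df, columns, alpha=0.05):
    """Run the Shapiro-Wilk test on each listed numeric column."""
    rows = []
    for col in columns:
        values = df[col].dropna()  # skip missing values before testing
        statistic, pvalue = scistat.shapiro(values)
        rows.append({
            "column": col,
            "statistic": statistic,
            "pvalue": pvalue,
            "normal": bool(pvalue > alpha),  # True: normality is not rejected
        })
    return pd.DataFrame(rows).set_index("column")

# Synthetic stand-in for the covid columns (illustration only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "total_cases": rng.lognormal(mean=8.0, sigma=2.0, size=200),
    "total_deaths": rng.lognormal(mean=4.0, sigma=2.0, size=200),
})

result = shapiro_by_column(toy, ["total_cases", "total_deaths"])
print(result)
```

With the post's data the call would be `shapiro_by_column(covid_case_df, case_vars[1:])`, which skips the text column `location`.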

qqplot

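Statistical outlier range: based on the distance between the first quartile (25%) and the third quartile (75%)
- a value that lies more than 1.5 times that distance below the first quartile or above the third quartile is treated as an outlier

sm.qqplot(covid_case_df[["total_cases"]].sort_values(
    ["total_cases"]), line = 's')
plt.title("Total Cases")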

outlier_qqplot_1

thirdq = covid_case_df["total_cases"].quantile(0.75)
firstq = covid_case_df["total_cases"].quantile(0.25)

# 1.5 times the interquartile range (Q3 - Q1)
interquantile_range = 1.5 * (thirdq- firstq)
outlier_high = interquantile_range + thirdq
outliner_low = firstq - interquantile_range

print(outliner_low, outlier_high, sep = " <-------> ")

-14736.125 <-------> 25028.875

Getting the data with outliers removed

Condition: drop the rows above outlier_high or below outliner_low

remove_outlier_df = covid_case_df.loc[~((covid_case_df["total_cases"]>outlier_high)|(covid_case_df["total_cases"]<outliner_low))]
remove_outlier_df.info()

Outlier_removedDT

Outlier data

outlier_df = covid_case_df.loc[(covid_case_df["total_cases"]>outlier_high)|(covid_case_df["total_cases"]<outliner_low)]
outlier_df.info()
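The same 1.5 × IQR rule can be wrapped in a reusable helper. A minimal sketch on a made-up Series; the function name `iqr_bounds` is hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    margin = k * (q3 - q1)
    return q1 - margin, q3 + margin

# Made-up values with one obvious outlier.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 200])

low, high = iqr_bounds(toy)
is_outlier = (toy < low) | (toy > high)

kept = toy[~is_outlier]      # data with outliers removed
outliers = toy[is_outlier]   # the outliers themselves

print(low, high, sep=" <-------> ")
print(outliers.tolist())
```

Negating the whole mask with `~is_outlier` avoids the precedence trap of writing `~(a) | (b)`, where the `~` applies only to the first condition.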

outlier_qqplot_2

fig, ax = plt.subplots(figsize = (16, 6), ncols = 2)
ax[0].hist(covid_case_df["total_cases"]/1000, bins = 7)
ax[0].set_title("Total Covid Cases (thousands) for all")
ax[0].set_xlabel("Cases")
ax[0].set_ylabel("Number of Countries")
ax[1].hist(remove_outlier_df["total_cases"]/1000, bins = 7)
ax[1].set_title("Total Covid Cases (thousands) for removed outlier")
ax[1].set_xlabel("Cases")
ax[1].set_ylabel("Number of Countries")
plt.show()

It is not perfect, but with the far-out values removed the histogram looks closer to a normal distribution.
Applying this to the train data during EDA and then running ML may or may not give a better score.
just Test
