[Data Analysis] Preparing Shopping-Mall Sales Data
In [124]:
from IPython.display import display, HTML  # IPython.core.display is deprecated
display(HTML("<style>.container {width:90% !important;}</style>"))
In [125]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
Data preparation:
- customer_master : customer information
- item_master : item information
- transaction_1 & 2 : transaction records
- transaction_detail_1 & 2 : transaction detail records
In [126]:
import pandas as pd
customer_master = pd.read_csv('dataset/customer_master.csv')
customer_master
Out[126]:
customer_id | customer_name | registration_date | email | gender | age | birth | pref | |
---|---|---|---|---|---|---|---|---|
0 | IK152942 | 김서준 | 2019-01-01 0:25 | hirata_yuujirou@example.com | M | 29 | 1990-06-10 | 대전광역시 |
1 | TS808488 | 김예준 | 2019-01-01 1:13 | tamura_shiori@example.com | F | 33 | 1986-05-20 | 인천광역시 |
2 | AS834628 | 김도윤 | 2019-01-01 2:00 | hisano_yuki@example.com | F | 63 | 1956-01-02 | 광주광역시 |
3 | AS345469 | 김시우 | 2019-01-01 4:48 | tsuruoka_kaoru@example.com | M | 74 | 1945-03-25 | 인천광역시 |
4 | GD892565 | 김주원 | 2019-01-01 4:54 | oouchi_takashi@example.com | M | 54 | 1965-08-05 | 울산광역시 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | AS677229 | 정우석 | 2019-07-31 16:52 | hirayama_risa@example.com | F | 77 | 1941-10-17 | 대전광역시 |
4996 | HD758694 | 정영훈 | 2019-07-31 19:09 | nakahara_mahiru@example.com | F | 27 | 1991-11-13 | 광주광역시 |
4997 | PL538517 | 정준기 | 2019-07-31 19:30 | tabata_yuu1@example.com | F | 73 | 1945-12-28 | 대전광역시 |
4998 | OA955088 | 정도형 | 2019-07-31 22:32 | setouchi_hikaru@example.com | F | 75 | 1944-04-09 | 부산광역시 |
4999 | HI349563 | 정지석 | 2019-07-31 22:49 | horii_kanji@example.com | M | 21 | 1998-02-06 | 서울특별시 |
5000 rows × 8 columns
In [127]:
item_master = pd.read_csv('dataset/item_master.csv')
item_master
Out[127]:
item_id | item_name | item_price | |
---|---|---|---|
0 | S001 | PC-A | 50000 |
1 | S002 | PC-B | 85000 |
2 | S003 | PC-C | 120000 |
3 | S004 | PC-D | 180000 |
4 | S005 | PC-E | 210000 |
In [128]:
transaction_detail_1 = pd.read_csv('dataset/transaction_detail_1.csv')
transaction_detail_1
Out[128]:
detail_id | transaction_id | item_id | quantity | |
---|---|---|---|---|
0 | 0 | T0000000113 | S005 | 1 |
1 | 1 | T0000000114 | S001 | 1 |
2 | 2 | T0000000115 | S003 | 1 |
3 | 3 | T0000000116 | S005 | 1 |
4 | 4 | T0000000117 | S002 | 2 |
... | ... | ... | ... | ... |
4995 | 4995 | T0000004865 | S003 | 1 |
4996 | 4996 | T0000004866 | S001 | 3 |
4997 | 4997 | T0000004867 | S001 | 3 |
4998 | 4998 | T0000004868 | S005 | 1 |
4999 | 4999 | T0000004869 | S003 | 1 |
5000 rows × 4 columns
In [129]:
transaction_detail_2 = pd.read_csv('dataset/transaction_detail_2.csv')
transaction_detail_2
Out[129]:
detail_id | transaction_id | item_id | quantity | |
---|---|---|---|---|
0 | 5000 | T0000004870 | S002 | 3 |
1 | 5001 | T0000004871 | S003 | 1 |
2 | 5002 | T0000004872 | S001 | 2 |
3 | 5003 | T0000004873 | S004 | 1 |
4 | 5004 | T0000004874 | S003 | 2 |
... | ... | ... | ... | ... |
2139 | 7139 | T0000006894 | S004 | 1 |
2140 | 7140 | T0000006895 | S002 | 1 |
2141 | 7141 | T0000006896 | S001 | 2 |
2142 | 7142 | T0000006897 | S002 | 1 |
2143 | 7143 | T0000006898 | S002 | 1 |
2144 rows × 4 columns
step2 : Combine the data (union)
Data preparation:
- transaction_1 & 2 : concatenate transaction records 1 and 2
- transaction_detail_1 & 2 : concatenate transaction detail records 1 and 2
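The union step above can be sketched with small hypothetical frames (toy data, not the actual transaction CSVs), to show what `ignore_index=True` does:

```python
import pandas as pd

# Two frames with the same columns, like transaction_1 and transaction_2
a = pd.DataFrame({"transaction_id": ["T1", "T2"], "price": [100, 200]})
b = pd.DataFrame({"transaction_id": ["T3"], "price": [300]})

# ignore_index=True renumbers rows 0..n-1 instead of keeping each
# frame's own index (which would otherwise produce duplicates 0, 1, 0)
combined = pd.concat([a, b], ignore_index=True)
print(len(combined))  # 3
```

Without `ignore_index=True` the result keeps each source frame's index, so positional lookups like `.loc[0]` would match two rows.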
In [130]:
transaction_1 = pd.read_csv('dataset/transaction_1.csv')
transaction_2 = pd.read_csv('dataset/transaction_2.csv')
transaction_1
Out[130]:
transaction_id | price | payment_date | customer_id | |
---|---|---|---|---|
0 | T0000000113 | 210000 | 2019-02-01 01:36:57 | PL563502 |
1 | T0000000114 | 50000 | 2019-02-01 01:37:23 | HD678019 |
2 | T0000000115 | 120000 | 2019-02-01 02:34:19 | HD298120 |
3 | T0000000116 | 210000 | 2019-02-01 02:47:23 | IK452215 |
4 | T0000000117 | 170000 | 2019-02-01 04:33:46 | PL542865 |
... | ... | ... | ... | ... |
4995 | T0000005108 | 210000 | 2019-06-15 02:42:41 | HD315748 |
4996 | T0000005109 | 150000 | 2019-06-15 03:36:16 | HI215420 |
4997 | T0000005110 | 50000 | 2019-06-15 03:44:06 | IK880102 |
4998 | T0000005111 | 210000 | 2019-06-15 04:14:06 | IK074758 |
4999 | T0000005112 | 50000 | 2019-06-15 04:42:38 | HD444151 |
5000 rows × 4 columns
In [131]:
transaction_2
Out[131]:
transaction_id | price | payment_date | customer_id | |
---|---|---|---|---|
0 | T0000005113 | 295000 | 2019-06-15 07:20:27 | TS169261 |
1 | T0000005114 | 50000 | 2019-06-15 07:35:47 | HI599892 |
2 | T0000005115 | 85000 | 2019-06-15 07:56:36 | HI421757 |
3 | T0000005116 | 50000 | 2019-06-15 08:40:55 | OA386378 |
4 | T0000005117 | 120000 | 2019-06-15 08:44:23 | TS506913 |
... | ... | ... | ... | ... |
1781 | T0000006894 | 180000 | 2019-07-31 21:20:44 | HI400734 |
1782 | T0000006895 | 85000 | 2019-07-31 21:52:48 | AS339451 |
1783 | T0000006896 | 100000 | 2019-07-31 23:35:25 | OA027325 |
1784 | T0000006897 | 85000 | 2019-07-31 23:39:35 | TS624738 |
1785 | T0000006898 | 85000 | 2019-07-31 23:41:38 | AS834214 |
1786 rows × 4 columns
Union operation: .concat
In [132]:
transaction = pd.concat([transaction_1, transaction_2], ignore_index=True)
transaction.head()
Out[132]:
transaction_id | price | payment_date | customer_id | |
---|---|---|---|---|
0 | T0000000113 | 210000 | 2019-02-01 01:36:57 | PL563502 |
1 | T0000000114 | 50000 | 2019-02-01 01:37:23 | HD678019 |
2 | T0000000115 | 120000 | 2019-02-01 02:34:19 | HD298120 |
3 | T0000000116 | 210000 | 2019-02-01 02:47:23 | IK452215 |
4 | T0000000117 | 170000 | 2019-02-01 04:33:46 | PL542865 |
In [133]:
print(len(transaction_1))
print(len(transaction_2))
print(len(transaction))
5000
1786
6786
In [134]:
transaction_detail_2 = pd.read_csv('dataset/transaction_detail_2.csv')
transaction_detail = pd.concat([transaction_detail_1, transaction_detail_2], ignore_index=True)
transaction_detail.head()
Out[134]:
detail_id | transaction_id | item_id | quantity | |
---|---|---|---|---|
0 | 0 | T0000000113 | S005 | 1 |
1 | 1 | T0000000114 | S001 | 1 |
2 | 2 | T0000000115 | S003 | 1 |
3 | 3 | T0000000116 | S005 | 1 |
4 | 4 | T0000000117 | S002 | 2 |
In [135]:
print(len(transaction_detail_1))
print(len(transaction_detail_2))
print(len(transaction_detail))
5000
2144
7144
step3 : Join the sales data
Recap of steps through step2:
- Data preparation
- transaction_1 & transaction_2 : concatenate transaction records
- union : transaction
- transaction_detail_1 & transaction_detail_2 : concatenate transaction detail records
- union : transaction_detail
Join the transaction records with the transaction details
- left join : join using the left table as the base
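The left-join semantics can be illustrated with toy frames (hypothetical data): every row of the left frame survives, and keys with no match on the right come back as NaN.

```python
import pandas as pd

detail = pd.DataFrame({"transaction_id": ["T1", "T2", "T3"], "quantity": [1, 2, 1]})
trans = pd.DataFrame({"transaction_id": ["T1", "T2"], "price": [100, 200]})

# how="left": all rows of `detail` are kept; "T3" has no match in
# `trans`, so its right-hand columns are filled with NaN
joined = pd.merge(detail, trans, on="transaction_id", how="left")
print(len(joined))                      # 3 (same as the left frame)
print(joined["price"].isnull().sum())   # 1
```

This is why the row count of the joined data below matches `transaction_detail`, the left table, rather than `transaction`.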
In [136]:
transaction_detail
Out[136]:
detail_id | transaction_id | item_id | quantity | |
---|---|---|---|---|
0 | 0 | T0000000113 | S005 | 1 |
1 | 1 | T0000000114 | S001 | 1 |
2 | 2 | T0000000115 | S003 | 1 |
3 | 3 | T0000000116 | S005 | 1 |
4 | 4 | T0000000117 | S002 | 2 |
... | ... | ... | ... | ... |
7139 | 7139 | T0000006894 | S004 | 1 |
7140 | 7140 | T0000006895 | S002 | 1 |
7141 | 7141 | T0000006896 | S001 | 2 |
7142 | 7142 | T0000006897 | S002 | 1 |
7143 | 7143 | T0000006898 | S002 | 1 |
7144 rows × 4 columns
In [137]:
transaction
Out[137]:
transaction_id | price | payment_date | customer_id | |
---|---|---|---|---|
0 | T0000000113 | 210000 | 2019-02-01 01:36:57 | PL563502 |
1 | T0000000114 | 50000 | 2019-02-01 01:37:23 | HD678019 |
2 | T0000000115 | 120000 | 2019-02-01 02:34:19 | HD298120 |
3 | T0000000116 | 210000 | 2019-02-01 02:47:23 | IK452215 |
4 | T0000000117 | 170000 | 2019-02-01 04:33:46 | PL542865 |
... | ... | ... | ... | ... |
6781 | T0000006894 | 180000 | 2019-07-31 21:20:44 | HI400734 |
6782 | T0000006895 | 85000 | 2019-07-31 21:52:48 | AS339451 |
6783 | T0000006896 | 100000 | 2019-07-31 23:35:25 | OA027325 |
6784 | T0000006897 | 85000 | 2019-07-31 23:39:35 | TS624738 |
6785 | T0000006898 | 85000 | 2019-07-31 23:41:38 | AS834214 |
6786 rows × 4 columns
Join operation: .merge
In [138]:
join_data = pd.merge(transaction_detail,
transaction[["transaction_id", "payment_date", "customer_id"]],
on="transaction_id",
how="left")
join_data
Out[138]:
detail_id | transaction_id | item_id | quantity | payment_date | customer_id | |
---|---|---|---|---|---|---|
0 | 0 | T0000000113 | S005 | 1 | 2019-02-01 01:36:57 | PL563502 |
1 | 1 | T0000000114 | S001 | 1 | 2019-02-01 01:37:23 | HD678019 |
2 | 2 | T0000000115 | S003 | 1 | 2019-02-01 02:34:19 | HD298120 |
3 | 3 | T0000000116 | S005 | 1 | 2019-02-01 02:47:23 | IK452215 |
4 | 4 | T0000000117 | S002 | 2 | 2019-02-01 04:33:46 | PL542865 |
... | ... | ... | ... | ... | ... | ... |
7139 | 7139 | T0000006894 | S004 | 1 | 2019-07-31 21:20:44 | HI400734 |
7140 | 7140 | T0000006895 | S002 | 1 | 2019-07-31 21:52:48 | AS339451 |
7141 | 7141 | T0000006896 | S001 | 2 | 2019-07-31 23:35:25 | OA027325 |
7142 | 7142 | T0000006897 | S002 | 1 | 2019-07-31 23:39:35 | TS624738 |
7143 | 7143 | T0000006898 | S002 | 1 | 2019-07-31 23:41:38 | AS834214 |
7144 rows × 6 columns
In [139]:
# Compare row counts
print(len(transaction_detail))
print(len(transaction))
print(len(join_data))
7144
6786
7144
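Comparing lengths catches accidental row duplication, but not keys that failed to match. pandas' `indicator=True` and `validate=` arguments to `merge` make that check explicit; a sketch with hypothetical toy frames:

```python
import pandas as pd

detail = pd.DataFrame({"transaction_id": ["T1", "T2"], "quantity": [1, 2]})
trans = pd.DataFrame({"transaction_id": ["T1"], "price": [100]})

# indicator=True adds a "_merge" column marking each row as 'both' or
# 'left_only'; validate="many_to_one" raises if keys on the right repeat
checked = pd.merge(detail, trans, on="transaction_id", how="left",
                   indicator=True, validate="many_to_one")
unmatched = (checked["_merge"] == "left_only").sum()
print(unmatched)  # 1 row whose key is missing from the right frame
```

On the real data, zero `left_only` rows would confirm that every detail row found its transaction.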
step4 : Join the master data
Transaction records (join_data) + customer information --> join
In [140]:
customer_master
Out[140]:
customer_id | customer_name | registration_date | email | gender | age | birth | pref | |
---|---|---|---|---|---|---|---|---|
0 | IK152942 | 김서준 | 2019-01-01 0:25 | hirata_yuujirou@example.com | M | 29 | 1990-06-10 | 대전광역시 |
1 | TS808488 | 김예준 | 2019-01-01 1:13 | tamura_shiori@example.com | F | 33 | 1986-05-20 | 인천광역시 |
2 | AS834628 | 김도윤 | 2019-01-01 2:00 | hisano_yuki@example.com | F | 63 | 1956-01-02 | 광주광역시 |
3 | AS345469 | 김시우 | 2019-01-01 4:48 | tsuruoka_kaoru@example.com | M | 74 | 1945-03-25 | 인천광역시 |
4 | GD892565 | 김주원 | 2019-01-01 4:54 | oouchi_takashi@example.com | M | 54 | 1965-08-05 | 울산광역시 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | AS677229 | 정우석 | 2019-07-31 16:52 | hirayama_risa@example.com | F | 77 | 1941-10-17 | 대전광역시 |
4996 | HD758694 | 정영훈 | 2019-07-31 19:09 | nakahara_mahiru@example.com | F | 27 | 1991-11-13 | 광주광역시 |
4997 | PL538517 | 정준기 | 2019-07-31 19:30 | tabata_yuu1@example.com | F | 73 | 1945-12-28 | 대전광역시 |
4998 | OA955088 | 정도형 | 2019-07-31 22:32 | setouchi_hikaru@example.com | F | 75 | 1944-04-09 | 부산광역시 |
4999 | HI349563 | 정지석 | 2019-07-31 22:49 | horii_kanji@example.com | M | 21 | 1998-02-06 | 서울특별시 |
5000 rows × 8 columns
In [141]:
join_data = pd.merge(join_data, customer_master,
on="customer_id", how="left")
join_data
Out[141]:
detail_id | transaction_id | item_id | quantity | payment_date | customer_id | customer_name | registration_date | email | gender | age | birth | pref | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | T0000000113 | S005 | 1 | 2019-02-01 01:36:57 | PL563502 | 김태경 | 2019-01-07 14:34 | imoto_yoshimasa@example.com | M | 30 | 1989-07-15 | 대전광역시 |
1 | 1 | T0000000114 | S001 | 1 | 2019-02-01 01:37:23 | HD678019 | 김영웅 | 2019-01-27 18:00 | mifune_rokurou@example.com | M | 73 | 1945-11-29 | 서울특별시 |
2 | 2 | T0000000115 | S003 | 1 | 2019-02-01 02:34:19 | HD298120 | 김강현 | 2019-01-11 8:16 | yamane_kogan@example.com | M | 42 | 1977-05-17 | 광주광역시 |
3 | 3 | T0000000116 | S005 | 1 | 2019-02-01 02:47:23 | IK452215 | 김주한 | 2019-01-10 5:07 | ikeda_natsumi@example.com | F | 47 | 1972-03-17 | 인천광역시 |
4 | 4 | T0000000117 | S002 | 2 | 2019-02-01 04:33:46 | PL542865 | 김영빈 | 2019-01-25 6:46 | kurita_kenichi@example.com | M | 74 | 1944-12-17 | 광주광역시 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7139 | 7139 | T0000006894 | S004 | 1 | 2019-07-31 21:20:44 | HI400734 | 김윤성 | 2019-01-04 13:24 | shishido_akira@example.com | M | 64 | 1955-01-13 | 대구광역시 |
7140 | 7140 | T0000006895 | S002 | 1 | 2019-07-31 21:52:48 | AS339451 | 김무경 | 2019-02-11 19:34 | aihara_miki@example.com | F | 74 | 1945-02-03 | 대구광역시 |
7141 | 7141 | T0000006896 | S001 | 2 | 2019-07-31 23:35:25 | OA027325 | 박준석 | 2019-04-17 9:23 | matsuda_saki@example.com | F | 40 | 1979-05-25 | 서울특별시 |
7142 | 7142 | T0000006897 | S002 | 1 | 2019-07-31 23:39:35 | TS624738 | 이가빈 | 2019-02-20 18:15 | shinndou_masatoshi@example.com | M | 56 | 1963-02-21 | 인천광역시 |
7143 | 7143 | T0000006898 | S002 | 1 | 2019-07-31 23:41:38 | AS834214 | 이승채 | 2019-04-07 3:20 | tahara_yuuko@example.com | F | 74 | 1944-12-18 | 대전광역시 |
7144 rows × 13 columns
(Transaction records + customer information) + item information --> join
In [142]:
item_master
Out[142]:
item_id | item_name | item_price | |
---|---|---|---|
0 | S001 | PC-A | 50000 |
1 | S002 | PC-B | 85000 |
2 | S003 | PC-C | 120000 |
3 | S004 | PC-D | 180000 |
4 | S005 | PC-E | 210000 |
In [143]:
join_data = pd.merge(join_data, item_master, on="item_id", how="left")
In [144]:
join_data.head()
Out[144]:
detail_id | transaction_id | item_id | quantity | payment_date | customer_id | customer_name | registration_date | email | gender | age | birth | pref | item_name | item_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | T0000000113 | S005 | 1 | 2019-02-01 01:36:57 | PL563502 | 김태경 | 2019-01-07 14:34 | imoto_yoshimasa@example.com | M | 30 | 1989-07-15 | 대전광역시 | PC-E | 210000 |
1 | 1 | T0000000114 | S001 | 1 | 2019-02-01 01:37:23 | HD678019 | 김영웅 | 2019-01-27 18:00 | mifune_rokurou@example.com | M | 73 | 1945-11-29 | 서울특별시 | PC-A | 50000 |
2 | 2 | T0000000115 | S003 | 1 | 2019-02-01 02:34:19 | HD298120 | 김강현 | 2019-01-11 8:16 | yamane_kogan@example.com | M | 42 | 1977-05-17 | 광주광역시 | PC-C | 120000 |
3 | 3 | T0000000116 | S005 | 1 | 2019-02-01 02:47:23 | IK452215 | 김주한 | 2019-01-10 5:07 | ikeda_natsumi@example.com | F | 47 | 1972-03-17 | 인천광역시 | PC-E | 210000 |
4 | 4 | T0000000117 | S002 | 2 | 2019-02-01 04:33:46 | PL542865 | 김영빈 | 2019-01-25 6:46 | kurita_kenichi@example.com | M | 74 | 1944-12-17 | 광주광역시 | PC-B | 85000 |
In [167]:
# Relational databases store data split into normalized tables;
# a join recombines them into a single table
In [146]:
# Create a derived column: sales amount per line item
join_data["price"] = join_data["quantity"] * join_data["item_price"]
In [147]:
join_data[["quantity", "item_price","price"]].head()
Out[147]:
quantity | item_price | price | |
---|---|---|---|
0 | 1 | 210000 | 210000 |
1 | 1 | 50000 | 50000 |
2 | 1 | 120000 | 120000 |
3 | 1 | 210000 | 210000 |
4 | 2 | 85000 | 170000 |
step6 : Validate the data
In [148]:
print(join_data["price"].sum())
print(transaction["price"].sum())
971135000
971135000
In [149]:
join_data["price"].sum() == transaction["price"].sum()
Out[149]:
True
step7 : Examine summary statistics
In [168]:
### Missing values = NA
In [150]:
join_data.isnull()
Out[150]:
detail_id | transaction_id | item_id | quantity | payment_date | customer_id | customer_name | registration_date | email | gender | age | birth | pref | item_name | item_price | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7139 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
7140 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
7141 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
7142 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
7143 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
7144 rows × 16 columns
In [151]:
join_data.isnull().sum()
Out[151]:
detail_id 0
transaction_id 0
item_id 0
quantity 0
payment_date 0
customer_id 0
customer_name 0
registration_date 0
email 0
gender 0
age 0
birth 0
pref 0
item_name 0
item_price 0
price 0
dtype: int64
.describe()
In [152]:
# Numeric columns
join_data.describe()
Out[152]:
detail_id | quantity | age | item_price | price | |
---|---|---|---|---|---|
count | 7144.000000 | 7144.000000 | 7144.000000 | 7144.000000 | 7144.000000 |
mean | 3571.500000 | 1.199888 | 50.265677 | 121698.628219 | 135937.150056 |
std | 2062.439494 | 0.513647 | 17.190314 | 64571.311830 | 68511.453297 |
min | 0.000000 | 1.000000 | 20.000000 | 50000.000000 | 50000.000000 |
25% | 1785.750000 | 1.000000 | 36.000000 | 50000.000000 | 85000.000000 |
50% | 3571.500000 | 1.000000 | 50.000000 | 102500.000000 | 120000.000000 |
75% | 5357.250000 | 1.000000 | 65.000000 | 187500.000000 | 210000.000000 |
max | 7143.000000 | 4.000000 | 80.000000 | 210000.000000 | 420000.000000 |
In [153]:
# Non-numeric (object-dtype) columns: check the date range
join_data["payment_date"].min()
Out[153]:
'2019-02-01 01:36:57'
In [154]:
join_data["payment_date"].max()
Out[154]:
'2019-07-31 23:41:38'
In [155]:
join_data.dtypes
Out[155]:
detail_id int64
transaction_id object
item_id object
quantity int64
payment_date object
customer_id object
customer_name object
registration_date object
email object
gender object
age int64
birth object
pref object
item_name object
item_price int64
price int64
dtype: object
object --> datetime
In [156]:
join_data["payment_date"] = pd.to_datetime(join_data["payment_date"])
join_data["payment_date"]
Out[156]:
0 2019-02-01 01:36:57
1 2019-02-01 01:37:23
2 2019-02-01 02:34:19
3 2019-02-01 02:47:23
4 2019-02-01 04:33:46
...
7139 2019-07-31 21:20:44
7140 2019-07-31 21:52:48
7141 2019-07-31 23:35:25
7142 2019-07-31 23:39:35
7143 2019-07-31 23:41:38
Name: payment_date, Length: 7144, dtype: datetime64[ns]
In [157]:
join_data["payment_date"].dtypes
Out[157]:
dtype('<M8[ns]')
In [158]:
join_data["payment_month"] = join_data["payment_date"].dt.strftime("%Y%m")
join_data["payment_month"].head()
Out[158]:
0 201902
1 201902
2 201902
3 201902
4 201902
Name: payment_month, dtype: object
In [159]:
join_data[["payment_date", "payment_month"]].head()
Out[159]:
payment_date | payment_month | |
---|---|---|
0 | 2019-02-01 01:36:57 | 201902 |
1 | 2019-02-01 01:37:23 | 201902 |
2 | 2019-02-01 02:34:19 | 201902 |
3 | 2019-02-01 02:47:23 | 201902 |
4 | 2019-02-01 04:33:46 | 201902 |
group by
In [160]:
join_data.groupby("payment_month").sum()
Out[160]:
detail_id | quantity | age | item_price | price | |
---|---|---|---|---|---|
payment_month | |||||
201902 | 676866 | 1403 | 59279 | 142805000 | 160185000 |
201903 | 2071474 | 1427 | 58996 | 142980000 | 160370000 |
201904 | 3476816 | 1421 | 59246 | 143670000 | 160510000 |
201905 | 4812795 | 1390 | 58195 | 139655000 | 155420000 |
201906 | 6369999 | 1446 | 61070 | 147090000 | 164030000 |
201907 | 8106846 | 1485 | 62312 | 153215000 | 170620000 |
In [161]:
join_data.groupby("payment_month").sum()["price"]
Out[161]:
payment_month
201902 160185000
201903 160370000
201904 160510000
201905 155420000
201906 164030000
201907 170620000
Name: price, dtype: int64
In [162]:
join_data.groupby(["payment_month","item_name"]).sum()[["price", "quantity"]]
Out[162]:
price | quantity | ||
---|---|---|---|
payment_month | item_name | ||
201902 | PC-A | 24150000 | 483 |
PC-B | 25245000 | 297 | |
PC-C | 19800000 | 165 | |
PC-D | 31140000 | 173 | |
PC-E | 59850000 | 285 | |
201903 | PC-A | 26000000 | 520 |
PC-B | 25500000 | 300 | |
PC-C | 19080000 | 159 | |
PC-D | 25740000 | 143 | |
PC-E | 64050000 | 305 | |
201904 | PC-A | 25900000 | 518 |
PC-B | 23460000 | 276 | |
PC-C | 21960000 | 183 | |
PC-D | 24300000 | 135 | |
PC-E | 64890000 | 309 | |
201905 | PC-A | 24850000 | 497 |
PC-B | 25330000 | 298 | |
PC-C | 20520000 | 171 | |
PC-D | 25920000 | 144 | |
PC-E | 58800000 | 280 | |
201906 | PC-A | 26000000 | 520 |
PC-B | 23970000 | 282 | |
PC-C | 21840000 | 182 | |
PC-D | 28800000 | 160 | |
PC-E | 63420000 | 302 | |
201907 | PC-A | 25250000 | 505 |
PC-B | 28220000 | 332 | |
PC-C | 19440000 | 162 | |
PC-D | 26100000 | 145 | |
PC-E | 71610000 | 341 |
In [163]:
pd.pivot_table(join_data,
index='item_name',
columns='payment_month',
values=['price','quantity'], aggfunc='sum' )
Out[163]:
price | quantity | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
payment_month | 201902 | 201903 | 201904 | 201905 | 201906 | 201907 | 201902 | 201903 | 201904 | 201905 | 201906 | 201907 |
item_name | ||||||||||||
PC-A | 24150000 | 26000000 | 25900000 | 24850000 | 26000000 | 25250000 | 483 | 520 | 518 | 497 | 520 | 505 |
PC-B | 25245000 | 25500000 | 23460000 | 25330000 | 23970000 | 28220000 | 297 | 300 | 276 | 298 | 282 | 332 |
PC-C | 19800000 | 19080000 | 21960000 | 20520000 | 21840000 | 19440000 | 165 | 159 | 183 | 171 | 182 | 162 |
PC-D | 31140000 | 25740000 | 24300000 | 25920000 | 28800000 | 26100000 | 173 | 143 | 135 | 144 | 160 | 145 |
PC-E | 59850000 | 64050000 | 64890000 | 58800000 | 63420000 | 71610000 | 285 | 305 | 309 | 280 | 302 | 341 |
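The `groupby` and `pivot_table` results above contain the same aggregates; `pivot_table` just spreads one key out as columns. A minimal sketch with hypothetical toy data:

```python
import pandas as pd

df = pd.DataFrame({"month": ["01", "01", "02"],
                   "item":  ["A", "B", "A"],
                   "price": [10, 20, 30]})

# Long form: one row per (month, item) pair
by_group = df.groupby(["month", "item"]).sum()["price"]

# Wide form: same sums, with months laid out as columns
by_pivot = pd.pivot_table(df, index="item", columns="month",
                          values="price", aggfunc="sum")
print(int(by_pivot.loc["A", "02"]))  # 30, same value as by_group[("02", "A")]
```

Pairs that never occur (here item B in month "02") show up as NaN in the pivoted form, whereas `groupby` simply omits them.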
step10 : Visualize monthly sales per item
In [188]:
graph_data=pd.pivot_table(join_data,
index='payment_month',
columns='item_name',
values='price', aggfunc='sum')
In [165]:
graph_data.head()
Out[165]:
item_name | PC-A | PC-B | PC-C | PC-D | PC-E |
---|---|---|---|---|---|
payment_month | |||||
201902 | 24150000 | 25245000 | 19800000 | 31140000 | 59850000 |
201903 | 26000000 | 25500000 | 19080000 | 25740000 | 64050000 |
201904 | 25900000 | 23460000 | 21960000 | 24300000 | 64890000 |
201905 | 24850000 | 25330000 | 20520000 | 25920000 | 58800000 |
201906 | 26000000 | 23970000 | 21840000 | 28800000 | 63420000 |
The matplotlib library
In [180]:
list(graph_data.index)
Out[180]:
['201902', '201903', '201904', '201905', '201906', '201907']
In [173]:
graph_data["PC-A"]
Out[173]:
payment_month
201902 24150000
201903 26000000
201904 25900000
201905 24850000
201906 26000000
201907 25250000
Name: PC-A, dtype: int64
In [198]:
import matplotlib.pyplot as plt
# Render plots inline in the notebook
%matplotlib inline
# plt.plot(x,y)
plt.plot(graph_data.index, graph_data["PC-A"], label='PC-A')
plt.plot(graph_data.index, graph_data["PC-B"], label='PC-B')
plt.plot(graph_data.index, graph_data["PC-C"], label='PC-C')
plt.plot(graph_data.index, graph_data["PC-D"], label='PC-D')
plt.plot(graph_data.index, graph_data["PC-E"], label='PC-E')
plt.legend()
Out[198]:
<matplotlib.legend.Legend at 0x228b56e08b0>