728x90
EMP 파일 실습
In [1]:
import pandas as pd
import numpy as np
1. Data Load
- df = read_csv()
- df.info()
- df.head()
- df.shape
In [51]:
df=pd.read_csv("./emp2.csv") #csv파일 불러서 df변수에 저장
df
Out[51]:
EMPNO | ENAME | JOB | MGR | HIRDDATE | SAL | COMM | DEPTNO | |
---|---|---|---|---|---|---|---|---|
0 | 7369 | SMITH | CLERK | 7902.0 | 1980-12-17 | 1800 | NaN | 30 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981-04-02 | 2975 | NaN | 20 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981-05-01 | 2850 | NaN | 30 |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981-06-09 | 2450 | NaN | 10 |
7 | 7788 | SCOTT | ANALYST | 7566.0 | 1987-07-13 | 3000 | NaN | 20 |
8 | 7839 | KING | PRESIDENT | NaN | 1981-11-17 | 5000 | NaN | 10 |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981-09-08 | 1500 | 0.0 | 30 |
10 | 7876 | ADAMS | CLERK | 7788.0 | 1987-07-13 | 1100 | NaN | 20 |
11 | 7900 | JAMES | CLERK | 7698.0 | 1981-12-03 | 950 | NaN | 30 |
12 | 7902 | FORD | ANALYST | 7566.0 | 1981-12-03 | 6000 | NaN | 20 |
13 | 7934 | MILLER | CLERK | 7782.0 | 1982-01-23 | 1300 | NaN | 10 |
- object(글자)타입 X
- 결측(NaN) X
In [52]:
df.info() #데이터 구조
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EMPNO 14 non-null int64
1 ENAME 14 non-null object
2 JOB 14 non-null object
3 MGR 13 non-null float64
4 HIRDDATE 14 non-null object
5 SAL 14 non-null int64
6 COMM 4 non-null float64
7 DEPTNO 14 non-null int64
dtypes: float64(2), int64(3), object(3)
memory usage: 1.0+ KB
In [53]:
df.isna() #결측 확인
Out[53]:
EMPNO | ENAME | JOB | MGR | HIRDDATE | SAL | COMM | DEPTNO | |
---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | True | False |
1 | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | True | False |
4 | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | True | False |
6 | False | False | False | False | False | False | True | False |
7 | False | False | False | False | False | False | True | False |
8 | False | False | False | True | False | False | True | False |
9 | False | False | False | False | False | False | False | False |
10 | False | False | False | False | False | False | True | False |
11 | False | False | False | False | False | False | True | False |
12 | False | False | False | False | False | False | True | False |
13 | False | False | False | False | False | False | True | False |
In [54]:
df.isna().sum() #결측 수 세기
Out[54]:
EMPNO 0
ENAME 0
JOB 0
MGR 1
HIRDDATE 0
SAL 0
COMM 10
DEPTNO 0
dtype: int64
In [55]:
df.head()
Out[55]:
EMPNO | ENAME | JOB | MGR | HIRDDATE | SAL | COMM | DEPTNO | |
---|---|---|---|---|---|---|---|---|
0 | 7369 | SMITH | CLERK | 7902.0 | 1980-12-17 | 1800 | NaN | 30 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981-04-02 | 2975 | NaN | 20 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 |
In [56]:
df.shape #(행,열) 수
Out[56]:
(14, 8)
컬럼명 대문자를 소문자로 바꾸기
In [58]:
lists = df.columns.tolist()
for i, col in enumerate(lists):
list[i] = col.lower()
print(list)
['empno', 'ename', 'job', 'mgr', 'hirddate', 'sal', 'comm', 'deptno']
In [59]:
list=[col.lower() for col in df.columns.tolist()]
print(list)
['empno', 'ename', 'job', 'mgr', 'hirddate', 'sal', 'comm', 'deptno']
In [60]:
df.columns=list
2. SELECT
- 칼럼별로 호출 df[칼럼명]
- array처럼 [행,열]로 호출 iloc()
- 로우명, 칼럼명으로 호출 loc()
df[] - 칼럼별로 호출 df[칼럼명]
In [61]:
df[['empno','ename','sal']] #select empno, enmae, sal from emp
Out[61]:
empno | ename | sal | |
---|---|---|---|
0 | 7369 | SMITH | 1800 |
1 | 7499 | ALLEN | 1600 |
2 | 7521 | WARD | 1250 |
3 | 7566 | JONES | 2975 |
4 | 7654 | MARTIN | 1250 |
5 | 7698 | BLAKE | 2850 |
6 | 7782 | CLARK | 2450 |
7 | 7788 | SCOTT | 3000 |
8 | 7839 | KING | 5000 |
9 | 7844 | TURNER | 1500 |
10 | 7876 | ADAMS | 1100 |
11 | 7900 | JAMES | 950 |
12 | 7902 | FORD | 6000 |
13 | 7934 | MILLER | 1300 |
df.iloc() -array처럼 [행,열]로 호출
In [62]:
df.iloc[[1,2,3],[0,1,2]] #[[1,2,3행],[0,1,2열]]
Out[62]:
empno | ename | job | |
---|---|---|---|
1 | 7499 | ALLEN | SALESMAN |
2 | 7521 | WARD | SALESMAN |
3 | 7566 | JONES | MANAGER |
In [49]:
df.iloc[[0,1,2,3],0:3] #[0,1,2,3행],0~3열] **마지막 번호까지 출력함 ==> 통상 사용하는 슬라이싱이랑 다름
Out[49]:
empno | ename | job | |
---|---|---|---|
0 | 7369 | smith999 | CLERK |
1 | 7499 | ALLEN | SALESMAN |
2 | 7521 | WARD | SALESMAN |
3 | 7566 | JONES | MANAGER |
df.loc() -로우명, 칼럼명으로 호출 loc()
In [63]:
df.loc[[0,1,2,3],['empno','ename','job']] #[[행이름][열이름]]
Out[63]:
empno | ename | job | |
---|---|---|---|
0 | 7369 | SMITH | CLERK |
1 | 7499 | ALLEN | SALESMAN |
2 | 7521 | WARD | SALESMAN |
3 | 7566 | JONES | MANAGER |
In [64]:
df.loc[[0,1,2,3],'empno':'job']
Out[64]:
empno | ename | job | |
---|---|---|---|
0 | 7369 | SMITH | CLERK |
1 | 7499 | ALLEN | SALESMAN |
2 | 7521 | WARD | SALESMAN |
3 | 7566 | JONES | MANAGER |
In [69]:
df #전체 데이터 출력
Out[69]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | |
---|---|---|---|---|---|---|---|---|
0 | 7369 | SMITH | CLERK | 7902.0 | 1980-12-17 | 1800 | NaN | 30 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981-04-02 | 2975 | NaN | 20 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981-05-01 | 2850 | NaN | 30 |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981-06-09 | 2450 | NaN | 10 |
7 | 7788 | SCOTT | ANALYST | 7566.0 | 1987-07-13 | 3000 | NaN | 20 |
8 | 7839 | KING | PRESIDENT | NaN | 1981-11-17 | 5000 | NaN | 10 |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981-09-08 | 1500 | 0.0 | 30 |
10 | 7876 | ADAMS | CLERK | 7788.0 | 1987-07-13 | 1100 | NaN | 20 |
11 | 7900 | JAMES | CLERK | 7698.0 | 1981-12-03 | 950 | NaN | 30 |
12 | 7902 | FORD | ANALYST | 7566.0 | 1981-12-03 | 6000 | NaN | 20 |
13 | 7934 | MILLER | CLERK | 7782.0 | 1982-01-23 | 1300 | NaN | 10 |
3. 조건절 where
df[조건][칼럼]
df[칼럼][조건]
In [72]:
#select empno, ename, sal from emp where deptno=10;
df[['empno','ename','sal']][df['deptno']==10]
Out[72]:
empno | ename | sal | |
---|---|---|---|
6 | 7782 | CLARK | 2450 |
8 | 7839 | KING | 5000 |
13 | 7934 | MILLER | 1300 |
In [73]:
df[df['deptno']==10][['empno','ename','sal']]
Out[73]:
empno | ename | sal | |
---|---|---|---|
6 | 7782 | CLARK | 2450 |
8 | 7839 | KING | 5000 |
13 | 7934 | MILLER | 1300 |
df에서는 and=&, or=|
df[(df[]조건) &/| (df[]조건)] ==> ()처리 반드시
In [81]:
df[(df['deptno']==30) & (df['job']=='SALESMAN')]
Out[81]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | |
---|---|---|---|---|---|---|---|---|
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981-09-08 | 1500 | 0.0 | 30 |
In [83]:
df[(df['job']=='MANAGER') | (df['job']=='SALESMAN')]
Out[83]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | |
---|---|---|---|---|---|---|---|---|
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981-04-02 | 2975 | NaN | 20 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981-05-01 | 2850 | NaN | 30 |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981-06-09 | 2450 | NaN | 10 |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981-09-08 | 1500 | 0.0 | 30 |
In [85]:
df[((df['deptno']==30) & (df['job']=='SALESMAN'))|(df['deptno']==10)][['job','deptno']]
Out[85]:
job | deptno | |
---|---|---|
1 | SALESMAN | 30 |
2 | SALESMAN | 30 |
4 | SALESMAN | 30 |
6 | MANAGER | 10 |
8 | PRESIDENT | 10 |
9 | SALESMAN | 30 |
13 | CLERK | 10 |
4. 서브쿼리 사용하는 법 (비추) - 가능하면 db에서 해결하는게 빠름
단일행 서브쿼리
Value 값을 변수에 넣어서 비교
In [113]:
x=df[df['empno']==7782]['job'].values[0]
x
df[df['job']==x][['ename','job']]
Out[113]:
ename | job | |
---|---|---|
3 | JONES | MANAGER |
5 | BLAKE | MANAGER |
6 | CLARK | MANAGER |
다중행 서브쿼리
in -> df[].isin() 함수 사용
In [125]:
x_list=df[df['deptno']==10]['job'].values
x_list
Out[125]:
array(['MANAGER', 'PRESIDENT', 'CLERK'], dtype=object)
In [127]:
df[df['job'].isin(x_list)][['ename','job']]
Out[127]:
ename | job | |
---|---|---|
0 | SMITH | CLERK |
3 | JONES | MANAGER |
5 | BLAKE | MANAGER |
6 | CLARK | MANAGER |
8 | KING | PRESIDENT |
10 | ADAMS | CLERK |
11 | JAMES | CLERK |
13 | MILLER | CLERK |
5. SQL 연산 ||, +-*/ 사용하는 법
|| 사용법
In [133]:
new_col=df['ename']+df['job']
df['new_col']=new_col
df
Out[133]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | new_col | |
---|---|---|---|---|---|---|---|---|---|
0 | 7369 | SMITH | CLERK | 7902.0 | 1980-12-17 | 1800 | NaN | 30 | SMITHCLERK |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 | ALLENSALESMAN |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 | WARDSALESMAN |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981-04-02 | 2975 | NaN | 20 | JONESMANAGER |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 | MARTINSALESMAN |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981-05-01 | 2850 | NaN | 30 | BLAKEMANAGER |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981-06-09 | 2450 | NaN | 10 | CLARKMANAGER |
7 | 7788 | SCOTT | ANALYST | 7566.0 | 1987-07-13 | 3000 | NaN | 20 | SCOTTANALYST |
8 | 7839 | KING | PRESIDENT | NaN | 1981-11-17 | 5000 | NaN | 10 | KINGPRESIDENT |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981-09-08 | 1500 | 0.0 | 30 | TURNERSALESMAN |
10 | 7876 | ADAMS | CLERK | 7788.0 | 1987-07-13 | 1100 | NaN | 20 | ADAMSCLERK |
11 | 7900 | JAMES | CLERK | 7698.0 | 1981-12-03 | 950 | NaN | 30 | JAMESCLERK |
12 | 7902 | FORD | ANALYST | 7566.0 | 1981-12-03 | 6000 | NaN | 20 | FORDANALYST |
13 | 7934 | MILLER | CLERK | 7782.0 | 1982-01-23 | 1300 | NaN | 10 | MILLERCLERK |
수치(db: number, python: int,float, df: int64, float64) 컬럼, 컬럼간의 연산
- NaN에는 어떤 연산을 해도 NaN이다
- db: nvl(col,0)
- python : df.fillna(0) df[컬럼].fillna(0)
- db: 컬럼 is not null, 컬럼 is null
- df.isna() df.notna()
- df.isnull() df.notnull()
- df.astype() 캐스팅
In [152]:
new_col2=df['sal']+df['comm']
df['sal+comm']=new_col2
df
Out[152]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | new_col | sal+comm | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 7369 | SMITH | CLERK | 7902.0 | 1980-12-17 | 1800 | NaN | 30 | SMITHCLERK | NaN |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 | ALLENSALESMAN | 1900.0 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 | WARDSALESMAN | 1750.0 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981-04-02 | 2975 | NaN | 20 | JONESMANAGER | NaN |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 | MARTINSALESMAN | 2650.0 |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981-05-01 | 2850 | NaN | 30 | BLAKEMANAGER | NaN |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981-06-09 | 2450 | NaN | 10 | CLARKMANAGER | NaN |
7 | 7788 | SCOTT | ANALYST | 7566.0 | 1987-07-13 | 3000 | NaN | 20 | SCOTTANALYST | NaN |
8 | 7839 | KING | PRESIDENT | NaN | 1981-11-17 | 5000 | NaN | 10 | KINGPRESIDENT | NaN |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981-09-08 | 1500 | 0.0 | 30 | TURNERSALESMAN | 1500.0 |
10 | 7876 | ADAMS | CLERK | 7788.0 | 1987-07-13 | 1100 | NaN | 20 | ADAMSCLERK | NaN |
11 | 7900 | JAMES | CLERK | 7698.0 | 1981-12-03 | 950 | NaN | 30 | JAMESCLERK | NaN |
12 | 7902 | FORD | ANALYST | 7566.0 | 1981-12-03 | 6000 | NaN | 20 | FORDANALYST | NaN |
13 | 7934 | MILLER | CLERK | 7782.0 | 1982-01-23 | 1300 | NaN | 10 | MILLERCLERK | NaN |
In [154]:
new_col2=df['sal']+df['comm'].fillna(0)
df['sal+comm']=new_col2.astype(int) #int로 캐스팅
df
Out[154]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | new_col | sal+comm | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 7369 | SMITH | CLERK | 7902.0 | 1980-12-17 | 1800 | NaN | 30 | SMITHCLERK | 1800 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981-02-20 | 1600 | 300.0 | 30 | ALLENSALESMAN | 1900 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981-02-22 | 1250 | 500.0 | 30 | WARDSALESMAN | 1750 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981-04-02 | 2975 | NaN | 20 | JONESMANAGER | 2975 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981-09-28 | 1250 | 1400.0 | 30 | MARTINSALESMAN | 2650 |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981-05-01 | 2850 | NaN | 30 | BLAKEMANAGER | 2850 |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981-06-09 | 2450 | NaN | 10 | CLARKMANAGER | 2450 |
7 | 7788 | SCOTT | ANALYST | 7566.0 | 1987-07-13 | 3000 | NaN | 20 | SCOTTANALYST | 3000 |
8 | 7839 | KING | PRESIDENT | NaN | 1981-11-17 | 5000 | NaN | 10 | KINGPRESIDENT | 5000 |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981-09-08 | 1500 | 0.0 | 30 | TURNERSALESMAN | 1500 |
10 | 7876 | ADAMS | CLERK | 7788.0 | 1987-07-13 | 1100 | NaN | 20 | ADAMSCLERK | 1100 |
11 | 7900 | JAMES | CLERK | 7698.0 | 1981-12-03 | 950 | NaN | 30 | JAMESCLERK | 950 |
12 | 7902 | FORD | ANALYST | 7566.0 | 1981-12-03 | 6000 | NaN | 20 | FORDANALYST | 6000 |
13 | 7934 | MILLER | CLERK | 7782.0 | 1982-01-23 | 1300 | NaN | 10 | MILLERCLERK | 1300 |
In [157]:
df.isnull() #null인 값 True 논리값으로 반환
Out[157]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | new_col | sal+comm | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | True | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | True | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | True | False | False | False |
6 | False | False | False | False | False | False | True | False | False | False |
7 | False | False | False | False | False | False | True | False | False | False |
8 | False | False | False | True | False | False | True | False | False | False |
9 | False | False | False | False | False | False | False | False | False | False |
10 | False | False | False | False | False | False | True | False | False | False |
11 | False | False | False | False | False | False | True | False | False | False |
12 | False | False | False | False | False | False | True | False | False | False |
13 | False | False | False | False | False | False | True | False | False | False |
In [159]:
df.notnull() #null이 아닌 값 True 논리값으로 반환
Out[159]:
empno | ename | job | mgr | hirddate | sal | comm | deptno | new_col | sal+comm | |
---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | False | True | True | True |
1 | True | True | True | True | True | True | True | True | True | True |
2 | True | True | True | True | True | True | True | True | True | True |
3 | True | True | True | True | True | True | False | True | True | True |
4 | True | True | True | True | True | True | True | True | True | True |
5 | True | True | True | True | True | True | False | True | True | True |
6 | True | True | True | True | True | True | False | True | True | True |
7 | True | True | True | True | True | True | False | True | True | True |
8 | True | True | True | False | True | True | False | True | True | True |
9 | True | True | True | True | True | True | True | True | True | True |
10 | True | True | True | True | True | True | False | True | True | True |
11 | True | True | True | True | True | True | False | True | True | True |
12 | True | True | True | True | True | True | False | True | True | True |
13 | True | True | True | True | True | True | False | True | True | True |
In [158]:
df.isnull().sum() #null 개수
Out[158]:
empno 0
ename 0
job 0
mgr 1
hirddate 0
sal 0
comm 10
deptno 0
new_col 0
sal+comm 0
dtype: int64
6. 그룹함수
Group 함수
- db : count() min() max() sum() avg()
- df : value_counts() min() max() sum() mean() ... std() var()
In [184]:
#부서별 사원수 출력
#select deptno, count(1) from emp group by deptno;
df.groupby("deptno")["empno"].count()
Out[184]:
deptno
10 3
20 4
30 7
Name: empno, dtype: int64
In [182]:
df.groupby("deptno")["sal"].mean()
Out[182]:
deptno
10 2916.666667
20 3268.750000
30 1600.000000
Name: sal, dtype: float64
In [174]:
#select deptno, min(sal), max(sal), sum(sal), avg(sal) from emp group by deptno
df.groupby("deptno")["sal"].agg(["min","max","sum","mean"])
Out[174]:
min | max | sum | mean | |
---|---|---|---|---|
deptno | ||||
10 | 1300 | 5000 | 8750 | 2916.666667 |
20 | 1100 | 6000 | 13075 | 3268.750000 |
30 | 950 | 2850 | 11200 | 1600.000000 |
In [180]:
#select deptno, max(sal), min(comm) from emp group by deptno
mydict = {"sal":"max", "comm":"min"} #컬럼이름(키 값)이 같으면 안됨
df.groupby("deptno").agg(mydict)
Out[180]:
sal | comm | |
---|---|---|
deptno | ||
10 | 5000 | NaN |
20 | 6000 | NaN |
30 | 2850 | 0.0 |
- 조건문 + 그룹바이
- df[df[]조건].groupby().....
In [195]:
#select deptno, min(sal), max(sal), sum(sal), avg(sal) from emp
#where deptno =10
#group by deptno
df[df['deptno']==10].groupby("deptno")[["sal"]].agg(["min","max","sum","mean"])
Out[195]:
sal | ||||
---|---|---|---|---|
min | max | sum | mean | |
deptno | ||||
10 | 1300 | 5000 | 8750 | 2916.666667 |
In [202]:
df.describe()
Out[202]:
empno | mgr | sal | comm | deptno | sal+comm | |
---|---|---|---|---|---|---|
count | 14.000000 | 13.000000 | 14.000000 | 4.000000 | 14.000000 | 14.000000 |
mean | 7726.571429 | 7739.307692 | 2358.928571 | 550.000000 | 22.857143 | 2516.071429 |
std | 178.294361 | 103.714660 | 1520.362025 | 602.771377 | 8.254203 | 1450.053286 |
min | 7369.000000 | 7566.000000 | 950.000000 | 0.000000 | 10.000000 | 950.000000 |
25% | 7588.000000 | 7698.000000 | 1262.500000 | 225.000000 | 20.000000 | 1562.500000 |
50% | 7785.000000 | 7698.000000 | 1700.000000 | 400.000000 | 25.000000 | 2175.000000 |
75% | 7868.000000 | 7839.000000 | 2943.750000 | 725.000000 | 30.000000 | 2943.750000 |
max | 7934.000000 | 7902.000000 | 6000.000000 | 1400.000000 | 30.000000 | 6000.000000 |
In [199]:
df['sal'].describe()
Out[199]:
count 14.000000
mean 2358.928571
std 1520.362025
min 950.000000
25% 1262.500000
50% 1700.000000
75% 2943.750000
max 6000.000000
Name: sal, dtype: float64
In [205]:
gdf=df.describe()
gdf.loc[['count','mean','max'],['sal','mgr','comm']]
Out[205]:
sal | mgr | comm | |
---|---|---|---|
count | 14.000000 | 13.000000 | 4.0 |
mean | 2358.928571 | 7739.307692 | 550.0 |
max | 6000.000000 | 7902.000000 | 1400.0 |
7. UPDATE
In [211]:
# create table empcp as (select empno, ename, sal, job from emp where deptno=30)
empcp = df[df['deptno']==30][['empno','ename','sal','job']].copy() ##카피
empcp
Out[211]:
empno | ename | sal | job | |
---|---|---|---|---|
0 | 7369 | SMITH | 1800 | CLERK |
1 | 7499 | ALLEN | 1600 | SALESMAN |
2 | 7521 | WARD | 1250 | SALESMAN |
4 | 7654 | MARTIN | 1250 | SALESMAN |
5 | 7698 | BLAKE | 2850 | MANAGER |
9 | 7844 | TURNER | 1500 | SALESMAN |
11 | 7900 | JAMES | 950 | CLERK |
In [213]:
empcp.iloc[1] = [7499,'ALLEN22',1800,'MANAGER']
empcp
Out[213]:
empno | ename | sal | job | |
---|---|---|---|---|
0 | 7369 | SMITH | 1800 | CLERK |
1 | 7499 | ALLEN22 | 1800 | MANAGER |
2 | 7521 | WARD | 1250 | SALESMAN |
4 | 7654 | MARTIN | 1250 | SALESMAN |
5 | 7698 | BLAKE | 2850 | MANAGER |
9 | 7844 | TURNER | 1500 | SALESMAN |
11 | 7900 | JAMES | 950 | CLERK |
8. DELETE
- 행 한줄 지우기
- empcp.drop(index = '행', axis=0)
In [211]:
# create table empcp as (select empno, ename, sal, job from emp where deptno=30)
empcp = df[df['deptno']==30][['empno','ename','sal','job']].copy() ##카피
empcp
Out[211]:
empno | ename | sal | job | |
---|---|---|---|---|
0 | 7369 | SMITH | 1800 | CLERK |
1 | 7499 | ALLEN | 1600 | SALESMAN |
2 | 7521 | WARD | 1250 | SALESMAN |
4 | 7654 | MARTIN | 1250 | SALESMAN |
5 | 7698 | BLAKE | 2850 | MANAGER |
9 | 7844 | TURNER | 1500 | SALESMAN |
11 | 7900 | JAMES | 950 | CLERK |
In [216]:
empcp.drop(index=11, axis=0) #-- inplace=True를 사용해야 실제로 지워짐 -> 사용안하면 삭제한 견본만 보임
#empcp.drop(index=11, axis=0, inplace=True)
#empcp = empcp.drop(index=11, axis=0) 이렇게도 지우기 가능
Out[216]:
empno | ename | sal | job | |
---|---|---|---|---|
0 | 7369 | SMITH | 1800 | CLERK |
1 | 7499 | ALLEN22 | 1800 | MANAGER |
2 | 7521 | WARD | 1250 | SALESMAN |
4 | 7654 | MARTIN | 1250 | SALESMAN |
5 | 7698 | BLAKE | 2850 | MANAGER |
9 | 7844 | TURNER | 1500 | SALESMAN |
- 열(컬럼) 지우기
- empcp.drop(['컬럼','컬럼'], axis=1)
In [218]:
empcp
Out[218]:
empno | ename | sal | job | |
---|---|---|---|---|
0 | 7369 | SMITH | 1800 | CLERK |
1 | 7499 | ALLEN22 | 1800 | MANAGER |
2 | 7521 | WARD | 1250 | SALESMAN |
4 | 7654 | MARTIN | 1250 | SALESMAN |
5 | 7698 | BLAKE | 2850 | MANAGER |
9 | 7844 | TURNER | 1500 | SALESMAN |
11 | 7900 | JAMES | 950 | CLERK |
In [221]:
empcp.drop('job', axis=1) ##칼럼 지우기 (axis = 1)
#-- inplace=True를 사용해야 실제로 지워짐 -> 사용안하면 삭제한 견본만 보임
#empcp.drop('job', axis=1, inplace=True)
#empcp=empcp.drop(['job','sal','ename'], axis=1)
Out[221]:
empno | ename | sal | |
---|---|---|---|
0 | 7369 | SMITH | 1800 |
1 | 7499 | ALLEN22 | 1800 |
2 | 7521 | WARD | 1250 |
4 | 7654 | MARTIN | 1250 |
5 | 7698 | BLAKE | 2850 |
9 | 7844 | TURNER | 1500 |
11 | 7900 | JAMES | 950 |
9. INSERT
'Python > 데이터 분석' 카테고리의 다른 글
(데이터 분석) Pandas SQL처럼 활용하기 -정리 중- (0) | 2022.01.10 |
---|---|
(데이터 분석) 파이썬 - Pandas DataFrame SQL처럼 활용하기(Join,Union) (0) | 2022.01.10 |
(데이터 분석)파이썬 - Pandas_dataframe (0) | 2022.01.06 |
(데이터 분석)파이썬 - Pandas DataFrame, Numpy array (0) | 2022.01.05 |
(데이터 분석) Pandas 가이드북 링크 (0) | 2022.01.05 |
댓글