1. Creating a pandas Series¶

import numpy as np
import pandas as pd

A key pandas object: the DataFrame¶

A table similar to an Excel sheet, for example:

	Name	Height	Weight
	Hong Gil-dong	162	50
	Kim Cheol-su	180	78

Series¶

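One of the basic pandas objects
A Series often appears as a result derived from a DataFrame
It represents a 1-D array: it is built on NumPy's ndarray and adds indexing by label
If no index is specified, a 0-based index is generated just as for an ndarray; if one is specified, the explicit index is used
It holds zero or more values of the same type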
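
The points above can be illustrated with a minimal sketch. The names `values`, `s_default`, `s_labeled`, and `s_empty` are made up for this example and do not appear elsewhere in the post.

```python
import numpy as np
import pandas as pd

# A Series pairs a 1-D array of values with an index of labels.
values = np.array([10, 20, 30])
s_default = pd.Series(values)                          # no index given: labels 0, 1, 2
s_labeled = pd.Series(values, index=["x", "y", "z"])   # explicit labels

print(list(s_default.index))   # [0, 1, 2]
print(list(s_labeled.index))   # ['x', 'y', 'z']
print(s_labeled.to_numpy())    # the values as a NumPy ndarray

# A Series may also hold zero elements.
s_empty = pd.Series([], dtype="float64")
print(len(s_empty))            # 0
```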

Creating a Series from data only
- By default, the index is generated automatically, starting from 0

## Create a Series
a1 = pd.Series([1, 2, 3])
a1  ## the index is generated automatically by default

0    1
1    2
2    3
dtype: int64

a2 = pd.Series(['a','b','c']) ## a Series of strings
a2 ## the dtype is inferred automatically

0    a
1    b
2    c
dtype: object

a3 = pd.Series(np.arange(200))
a3

0        0
1        1
2        2
3        3
4        4
      ... 
195    195
196    196
197    197
198    198
199    199
Length: 200, dtype: int32

Specifying data and index together

pd.Series( 
    data = None,   ## the data
    index = None,  ## the index
    dtype = None,  ## the data type
    ...
    )
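
The same parameters can also be passed by keyword, which may read more clearly than positional arguments. The sketch below borrows the heights from the example table; the variable name `heights` and the use of the `name` parameter are additions for illustration.

```python
import pandas as pd

# Keyword form of the constructor: data, index, dtype, and an optional name.
heights = pd.Series(
    data=[162, 180],
    index=["Hong Gil-dong", "Kim Cheol-su"],
    dtype="float64",
    name="height",
)

print(heights)
print(heights["Kim Cheol-su"])   # 180.0
```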

pd.Series([1, 2, 3], [100, 200, 300]) # use 100, 200, 300 as the index

100    1
200    2
300    3
dtype: int64

pd.Series([1, 2, 3], ['a', 'm', 'k']) # use strings as the index

a    1
m    2
k    3
dtype: int64

Specifying data, index, and data type together

a6 = pd.Series(np.arange(5), np.arange(100, 105), dtype = np.int16) ## specify the dtype of the Series
a6

100    0
101    1
102    2
103    3
104    4
dtype: int16

Using the index¶

a6.index # get the index

Int64Index([100, 101, 102, 103, 104], dtype='int64')

a6.values # get only the values

array([0, 1, 2, 3, 4], dtype=int16)

Accessing data through the index

a6[100] # look up a value by its index label

0

a6[105] # looking up a label that is not in the index raises a KeyError

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-d9fb44784a5e> in <module>
----> 1 a6[105]

~\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
   1066         key = com.apply_if_callable(key, self)
   1067         try:
-> 1068             result = self.index.get_value(self, key)
   1069 
   1070             if not is_scalar(result):

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4728         k = self._convert_scalar_indexer(k, kind="getitem")
   4729         try:
-> 4730             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4731         except KeyError as e1:
   4732             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 105
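
When a label may be absent, the KeyError can be avoided by checking membership first or by using `Series.get`, which returns a default instead of raising. This is a minimal sketch; `demo`, `missing`, `fallback`, and `caught` are illustrative names, and `demo` simply mirrors the shape of `a6`.

```python
import numpy as np
import pandas as pd

demo = pd.Series(np.arange(5), index=np.arange(100, 105))

# Membership test against the index labels.
print(105 in demo.index)        # False

# get() returns None (or a chosen default) when the label is missing.
missing = demo.get(105)
fallback = demo.get(105, -1)
print(missing, fallback)        # None -1

# Catching the error explicitly is another option.
try:
    demo.loc[105]
    caught = None
except KeyError as err:
    caught = err
```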

Updating data through the index

a6[100] = 70 # change a value by its index label
a6

100    70
101     1
102     2
103     3
104     4
dtype: int16

a6[105] = 67 # assigning to a label that is not in the index appends a new entry
a6

100    70
101     1
102     2
103     3
104     4
105    67
dtype: int64

Reusing an index

a7 = pd.Series(np.arange(6), a6.index) # the index of another Series can be reused as-is

a7

100    0
101    1
102    2
103    3
104    4
105    5
dtype: int32

Accessing multiple values through the index

a7[[100, 104, 105]] # pass a list of labels to access several entries at once

100    0
104    4
105    5
dtype: int32

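2. Simple analysis of Series data (counts, frequencies, and so on)¶

Series size, shape, unique, count, and value_counts¶

size: returns the number of elements
shape: returns the shape as a tuple
unique: returns the distinct values as an ndarray
count: returns the number of elements, excluding NaN
mean: returns the mean, excluding NaN
value_counts: returns the frequency of each value, excluding NaN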
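
As a compact illustration of the list above, the sketch below applies each attribute or method to a small Series containing one NaN. The Series `scores` is invented for this example.

```python
import numpy as np
import pandas as pd

scores = pd.Series([3, 3, 5, np.nan])

print(scores.size)            # 4: NaN is counted
print(scores.shape)           # (4,)
print(scores.count())         # 3: NaN is excluded
print(scores.mean())          # mean of 3, 3, 5 (NaN is skipped)
print(scores.unique())        # distinct values, NaN included
print(scores.value_counts())  # frequency per value, NaN excluded
```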

s = pd.Series([1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 3, 4, 5, 5, 7, np.nan])  # np.nan: the np.NaN alias was removed in NumPy 2.0
s

0     1.0
1     1.0
2     2.0
3     1.0
4     2.0
5     2.0
6     2.0
7     1.0
8     1.0
9     3.0
10    3.0
11    4.0
12    5.0
13    5.0
14    7.0
15    NaN
dtype: float64

len(s) # length of the Series (number of elements, including NaN)

16

s.size # number of elements in the Series

16

s.shape # shape of the Series: a 1-tuple, (n,)

(16,)

s.unique() # distinct values in the Series

array([ 1.,  2.,  3.,  4.,  5.,  7., nan])

s.count() # number of elements, excluding NaN

15

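Difference from NumPy

NumPy's mean() does not skip NaN: if the array contains NaN, the result is nan
A Series ignores NaN by default and computes over the remaining values

a = np.array([2, 2, 2, 2, np.nan])
print(a.mean()) # with a NumPy array

b = pd.Series(a)
print(b.mean()) # with a Series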

nan
2.0
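
Either behavior can be obtained in both libraries. As a sketch, NumPy offers `np.nanmean` to skip NaN, and `Series.mean` accepts `skipna=False` to propagate it; `arr` and `ser` are illustrative names.

```python
import numpy as np
import pandas as pd

arr = np.array([2, 2, 2, 2, np.nan])
ser = pd.Series(arr)

print(arr.mean())               # nan: NaN propagates
print(np.nanmean(arr))          # 2.0: NaN is skipped
print(ser.mean())               # 2.0: skipna=True is the default
print(ser.mean(skipna=False))   # nan: NaN propagates
```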

s.value_counts() # frequency (distribution) of each value

1.0    5
2.0    4
5.0    2
3.0    2
7.0    1
4.0    1
dtype: int64
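
`value_counts` also takes a few options that are useful here: `dropna=False` keeps NaN as its own category, and `normalize=True` returns proportions instead of counts. The Series `freq` below is a made-up example.

```python
import numpy as np
import pandas as pd

freq = pd.Series([1, 1, 2, np.nan])

# Count NaN as a category of its own.
with_nan = freq.value_counts(dropna=False)
print(with_nan)

# Relative frequencies over the non-NaN values.
shares = freq.value_counts(normalize=True)
print(shares)
```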

head and tail¶

head: returns the first n elements (default 5)
tail: returns the last n elements (default 5)
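
A short sketch of both methods on a throwaway Series (`seq` is an illustrative name). Asking for more elements than exist simply returns the whole Series.

```python
import pandas as pd

seq = pd.Series(range(10))

first_three = seq.head(3)    # values 0, 1, 2
last_two = seq.tail(2)       # values 8, 9
everything = seq.head(100)   # n larger than the length: all 10 elements

print(first_three.tolist(), last_two.tolist(), len(everything))
```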

s.head() # first n elements (default: 5)

0    1.0
1    1.0
2    2.0
3    1.0
4    2.0
dtype: float64

s.head(10) # first 10 elements

0    1.0
1    1.0
2    2.0
3    1.0
4    2.0
5    2.0
6    2.0
7    1.0
8    1.0
9    3.0
dtype: float64

s.tail() ## last n elements (default: 5)

11    4.0
12    5.0
13    5.0
14    7.0
15    NaN
dtype: float64

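3. Operating on Series data¶

Operations are aligned on the index¶

Operations between two Series are aligned on matching index labels, regardless of the order of the entries.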

s1 = pd.Series([1, 2, 3, 4], ['a', 'b', 'c', 'd'])
s2 = pd.Series([6, 3, 2, 1], ['d', 'c', 'b', 'a'])

print(s1)
print(s2)

a    1
b    2
c    3
d    4
dtype: int64
d    6
c    3
b    2
a    1
dtype: int64

s1 + s2 # values with the same index label are added together

a     2
b     4
c     6
d    10
dtype: int64

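Arithmetic operations¶

As with an ndarray, an operation between a Series and a scalar is applied to each element
An operation between two Series is applied to the values whose index labels match
- If a label has no matching pair in the other Series, the result for that label is NaN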
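
The three rules above can be seen together in a small sketch. The Series `left` and `right` are invented for this example and share only the labels `b` and `c`.

```python
import numpy as np
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "z"])

# Scalar operation: applied to every element.
scaled = left * 10
print(scaled.tolist())   # [10, 20, 30]

# Series operation: values are matched by label, not by position.
total = left + right
print(total)             # b -> 22.0, c -> 13.0; a and z -> NaN (no partner)
```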

s1 ** 2

a     1
b     4
c     9
d    16
dtype: int64

s1 ** s2

a       1
b       4
c      27
d    4096
dtype: int64

When index pairs do not match¶

NaN is produced for any index label that has no match in the other Series

s1['k'] = 7
s2['e'] = 9

print(s1)
print(s2)

a    1
b    2
c    3
d    4
k    7
dtype: int64
d    6
c    3
b    2
a    1
e    9
dtype: int64

s1 + s2 # labels without a match produce NaN

a     2.0
b     4.0
c     6.0
d    10.0
e     NaN
k     NaN
dtype: float64
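
If NaN is not the desired result for unmatched labels, one option is the method form `Series.add` with `fill_value`, which substitutes a value for the missing side before adding. The sketch below rebuilds the two Series under the illustrative names `first` and `second`.

```python
import pandas as pd

first = pd.Series([1, 2, 3, 4, 7], index=["a", "b", "c", "d", "k"])
second = pd.Series([6, 3, 2, 1, 9], index=["d", "c", "b", "a", "e"])

# Treat a missing partner as 0 instead of producing NaN.
filled = first.add(second, fill_value=0)
print(filled)    # e -> 9.0 and k -> 7.0 instead of NaN

# Alternatively, keep only the labels present in both Series.
matched = (first + second).dropna()
print(matched)
```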

4. Selecting data from a Series with boolean selection¶

Boolean selection¶

When a boolean Series is used inside [], only the entries where it is True are included in the returned Series
Multiple conditions can be combined with & (and) and | (or)
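
A minimal sketch of these operators, including `~` for negation, which the list above does not mention. The Series `nums` is an illustrative stand-in with the same shape as the one used in this section.

```python
import numpy as np
import pandas as pd

nums = pd.Series(np.arange(10), index=np.arange(10) + 1)

mask = nums > 5                            # boolean Series, same index as nums
selected = nums[mask]                      # values 6, 7, 8, 9

either = nums[(nums < 2) | (nums > 7)]     # or: values 0, 1, 8, 9
negated = nums[~(nums > 5)]                # not: values 0 through 5

print(selected.tolist(), either.tolist(), negated.tolist())
```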

s = pd.Series(np.arange(10), np.arange(10) + 1)
s

1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int32

s > 5 ## produces a Series of True/False values

1     False
2     False
3     False
4     False
5     False
6     False
7      True
8      True
9      True
10     True
dtype: bool

s[s > 5] # keep only the entries where the condition is True

7     6
8     7
9     8
10    9
dtype: int32

s[s % 2 == 0] # keep only the even values

1    0
3    2
5    4
7    6
9    8
dtype: int32

Filtering on the index

s.index > 5

array([False, False, False, False, False,  True,  True,  True,  True,
        True])

s[s.index > 5] # keep only the entries whose index label is greater than 5

6     5
7     6
8     7
9     8
10    9
dtype: int32

Multiple conditions

When combining conditions, each condition must be wrapped in ().

s[(s > 5) & (s < 8)]

7    6
8    7
dtype: int32
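
The parentheses matter because `&` binds more tightly than comparison operators in Python, so without them the expression is parsed as a chained comparison around `5 & vals` and pandas raises a ValueError. The sketch below demonstrates this on an illustrative Series `vals`, and shows `Series.between` (inclusive on both ends by default) as an alternative for range checks.

```python
import numpy as np
import pandas as pd

vals = pd.Series(np.arange(10))

# Correct: each comparison is parenthesized.
correct = vals[(vals > 5) & (vals < 8)]

# Without parentheses this parses as vals > (5 & vals) < 8 and fails.
try:
    vals[vals > 5 & vals < 8]
    raised = False
except ValueError:
    raised = True

# between() expresses the same range check without operators.
ranged = vals[vals.between(6, 7)]

print(correct.tolist(), raised, ranged.tolist())
```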

(s >= 7).sum() ## sum of a boolean Series: counts the matching entries

3

s[s >= 7].sum() ## sum of the values of the matching entries

24

5. Modifying Series data and understanding slicing¶

Changing Series values¶

Adding and updating: assign through the index
Deleting: use the drop function

s = pd.Series(np.arange(100, 105), ['a', 'b', 'c', 'd', 'e'])
s

a    100
b    101
c    102
d    103
e    104
dtype: int32

## Change a value through the index
s['a'] = 200
s

a    200
b    101
c    102
d    103
e    104
dtype: int32

## Add a value through the index
s['k'] = 700
s

a    200
b    101
c    102
d    103
e    104
k    700
dtype: int64

## Delete a value through the index
s.drop('k') # only returns a new Series; s itself is not modified
s

a    200
b    101
c    102
d    103
e    104
k    700
dtype: int64

s = s.drop('k')
s

a    200
b    101
c    102
d    103
e    104
dtype: int64

s.drop('e', inplace = True) # applies the operation directly to s
s

a    200
b    101
c    102
d    103
dtype: int64
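
`drop` also accepts a list of labels, and `errors="ignore"` suppresses the KeyError for labels that do not exist. A minimal sketch, using an invented Series `inv`:

```python
import numpy as np
import pandas as pd

inv = pd.Series(np.arange(100, 103), index=["a", "b", "c"])

# Drop several labels at once; a new Series is returned.
trimmed = inv.drop(["a", "c"])

# A missing label normally raises KeyError; errors="ignore" skips it.
unchanged = inv.drop("zzz", errors="ignore")

print(trimmed.index.tolist())   # ['b']
print(len(inv))                 # 3: the original is untouched
```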

s[['a','b']] = [300, 500] # update several values at once
s

a    300
b    500
c    102
d    103
dtype: int64

Slicing¶

Slicing by position works the same way as for a list or an ndarray: the end position is excluded

s1 = pd.Series(np.arange(100, 105))
s1

0    100
1    101
2    102
3    103
4    104
dtype: int32

s1[1:3] # entries at positions 1 and 2 (the end position 3 is excluded)

1    101
2    102
dtype: int32

s2 = pd.Series(np.arange(100, 105), ['a', 'b', 'c', 'd', 'e'])
s2

a    100
b    101
c    102
d    103
e    104
dtype: int32

s2[1:3] # with a non-numeric index, an integer slice still goes by position

b    101
c    102
dtype: int32

s2['c':'d'] # slicing by label includes the end label as well

c    102
d    103
dtype: int32
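
To make the intent explicit, the same two slices can be written with `.iloc` (always positional, end excluded) and `.loc` (always label-based, end included). This is a sketch using an illustrative Series `letters` with the same contents as `s2`.

```python
import numpy as np
import pandas as pd

letters = pd.Series(np.arange(100, 105), index=["a", "b", "c", "d", "e"])

by_position = letters.iloc[1:3]    # positions 1 and 2 -> labels b, c
by_label = letters.loc["c":"d"]    # labels c through d, end included

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['c', 'd']
```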


Machine Learning and Data Analysis A-Z All-in-One Package - Python for Data Analysis (Pandas) – (1)
