[NumPy] NumPy Basics - array, math, statistics

What is NumPy?

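A Python list can hold elements of different data types (more precisely, references to arbitrary objects), and it can grow or shrink freely. A NumPy array, by contrast, behaves much like a C array: every element must have the same data type (and therefore occupies the same amount of memory), and the array has a fixed size once created. This ndarray object is the core of NumPy.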

Wait, can't NumPy change the size of an array too? It can, but what actually happens is that a new array is created.
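
A minimal sketch of this point, using np.append as one example of a call that seems to resize an array (the variable names are only illustrative). The result is a separate array, and the original keeps its size:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)   # looks like "growing" a, but returns a new array

print(a.shape, b.shape)         # (3,) (4,)
print(np.shares_memory(a, b))   # False: b has its own buffer
```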

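The reason to use NumPy is that it offers a well-organized collection of math and science routines, and that operations on its arrays typically run more efficiently, and with less code, than the same work done with Python's built-in sequences. It supports shape manipulation, sorting, and mathematical functions, as well as basic linear algebra, statistics, random simulation, and discrete Fourier transforms.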

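Intuitively, NumPy is the package to reach for when you need mathematical or scientific operations on arrays (in practice, usually matrices) and want them to run more efficiently than a hand-written implementation.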

https://numpy.org/doc/stable/user/whatisnumpy.html

Data types

It is easiest to see these in a table, side by side with the corresponding C types.

C type	NumPy type
bool	numpy.bool_
signed char, unsigned char	numpy.byte, numpy.ubyte
short, unsigned short	numpy.short, numpy.ushort
int, unsigned int	numpy.intc, numpy.uintc
long, unsigned long	numpy.int_, numpy.uint
long long, unsigned long long	numpy.longlong, numpy.ulonglong
float, double, long double	numpy.single, numpy.double, numpy.longdouble
float complex, double complex, long double complex	numpy.csingle, numpy.cdouble, numpy.clongdouble

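The types in the table above are platform-dependent (just as not every int in C is 4 bytes). To avoid this platform dependency, NumPy provides fixed-size aliases; think of them the same way as int32_t in C. Note also that the mapping of numpy.int_ depends on the NumPy version: from NumPy 2.0 on it is an alias of numpy.intp rather than C long.

For integers, the signed and unsigned variants are written as numpy.int8 and numpy.uint8, and so on (8, 16, 32, and 64 bits are available).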
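
As a small illustration of the fixed-size aliases, the sketch below prints the byte width and value range of a few of them with np.iinfo. The last line prints the widths of two C-style types, which may differ between systems, so no output is shown for it:

```python
import numpy as np

# Fixed-width aliases have the same size on every platform.
for dt in (np.int8, np.uint8, np.int16, np.int32, np.int64):
    info = np.iinfo(dt)
    print(np.dtype(dt).name, np.dtype(dt).itemsize, info.min, info.max)

# Platform-dependent C-style types: the width may differ between systems.
print(np.dtype(np.intc).itemsize, np.dtype(np.longlong).itemsize)
```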

import numpy as np

x = np.array([1, 2, 3, 4], dtype='float32')
print(x)
print(x.dtype)

----- result -----
[1. 2. 3. 4.]
float32

Create arrays

An array of shape $(d_1, d_2, \cdots, d_n)$ is an $n$-dimensional array, and its size (the total number of elements) is $d_1 \times d_2 \times \cdots \times d_n$.

a = np.zeros(10)
b = np.ones((2, 5))
c = np.full((2, 3, 4), 1.5)

print(a.shape, b.shape, c.shape)
print(a)
print(b)
print(c)

----- result -----
(10,) (2, 5) (2, 3, 4)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
[[[1.5 1.5 1.5 1.5]
  [1.5 1.5 1.5 1.5]
  [1.5 1.5 1.5 1.5]]

 [[1.5 1.5 1.5 1.5]
  [1.5 1.5 1.5 1.5]
  [1.5 1.5 1.5 1.5]]]
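
The relation between shape and size can be checked directly. This is a small sketch that reuses the three shapes from the example above and compares size with the product of the shape entries (math.prod needs Python 3.8 or later):

```python
import math
import numpy as np

for shape in [(10,), (2, 5), (2, 3, 4)]:
    arr = np.zeros(shape)
    # size is the product of the shape entries; ndim is how many entries there are
    print(shape, arr.ndim, arr.size, math.prod(shape))
```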

In particular, the identity matrix is written $I$, which is pronounced like "eye", so it is created with numpy.eye.

print(np.eye(3))

----- result -----
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
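
As a quick check of what makes this matrix the identity, the sketch below multiplies an arbitrary 3x3 matrix (A is just an illustrative example) by np.eye(3) from both sides:

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
I = np.eye(3)

# Multiplying by the identity matrix leaves A unchanged.
print(np.array_equal(I @ A, A))   # True
print(np.array_equal(A @ I, A))   # True
```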

NumPy array's attributes

dtype - data type of the elements

ndim - number of axes

shape - size of each axis

size - total number of elements in the array

itemsize - number of bytes per element

nbytes - total number of bytes in the array

x = np.random.randint(0, 10, size=(2, 3, 4))
print('dtype:', x.dtype)
print('ndim:', x.ndim)
print('shape:', x.shape)
print('size:', x.size)
print('itemsize:', x.itemsize)
print('nbytes:', x.nbytes)

----- result -----
dtype: int32
ndim: 3
shape: (2, 3, 4)
size: 24
itemsize: 4
nbytes: 96
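
The dtype in the result above depends on the platform and NumPy version. Where the default integer is 64-bit, the same code reports int64, itemsize 8, and nbytes 192. The sketch below fixes the dtype explicitly so the numbers are reproducible, and checks that nbytes equals size times itemsize. It uses np.random.default_rng instead of np.random.randint, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
x = rng.integers(0, 10, size=(2, 3, 4), dtype=np.int32)

print(x.dtype, x.itemsize)               # int32 4
print(x.size, x.nbytes)                  # 24 96
print(x.nbytes == x.size * x.itemsize)   # True
```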

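reshape - changes the shape of an ndarray. If one dimension is given as -1, its length is inferred from the array's size and the other dimensions, and reshape(-1) flattens the array to one dimension.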

x = np.arange(12)
print('x:\n', x)

a = x.reshape(3, 4)
print('a:\n', a)

b = x.reshape(-1, 3)
print('b:\n', b)

c = x.reshape(2, 2, -1)
print('c:\n', c)

d = x.reshape(-1)
print('d:\n', d)

e = x.reshape(3, 3, -1) # error: 12 is not divisible by 3 * 3
----- result -----
x:
 [ 0  1  2  3  4  5  6  7  8  9 10 11]
a:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
b:
 [[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
c:
 [[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]]
d:
 [ 0  1  2  3  4  5  6  7  8  9 10 11]
 
 ValueError                                Traceback (most recent call last)
Cell In[27], line 16
     13 d = x.reshape(-1)
     14 print('d:\n', d)
---> 16 e = x.reshape(3, 3, -1)

ValueError: cannot reshape array of size 12 into shape (3,3,newaxis)
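
The inference rule and the failure case can be sketched together. The example below assumes nothing beyond the reshape calls shown above. It catches the ValueError rather than letting it stop the script:

```python
import numpy as np

x = np.arange(12)

# -1 is replaced by size // (product of the other dimensions).
b = x.reshape(-1, 3)
print(b.shape)   # (4, 3)

# 12 is not divisible by 3 * 3, so no valid length exists for the last axis.
try:
    x.reshape(3, 3, -1)
    failed = False
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```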

Vectorized function

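With vectorized functions, NumPy applies an operation to an entire ndarray in a single call. The element-wise loop runs in precompiled C code instead of the Python interpreter, so the same computation finishes much faster.

First, let's see how fast a vectorized function is.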

x = np.arange(0, 100000)

def add1(x, num):
    y = []
    for i in range(len(x)):
        y.append(x[i] + num)
    return y

def add2(x, num):
    y = x + num
    return y
    
----- result -----
%timeit add1(x, 5)
47.3 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit add2(x, 5)
106 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
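
%timeit is only available in IPython/Jupyter. As a rough equivalent for a plain Python script, the sketch below uses the standard timeit module. The function names add_loop and add_vec are illustrative, and the timings vary by machine, so no expected numbers are given:

```python
import timeit
import numpy as np

x = np.arange(0, 100000)

def add_loop(x, num):
    # element-by-element loop executed by the Python interpreter
    return [x[i] + num for i in range(len(x))]

def add_vec(x, num):
    # one vectorized call; the loop runs inside NumPy
    return x + num

# Both versions must agree before their timings are worth comparing.
same = np.array_equal(np.array(add_loop(x, 5)), add_vec(x, 5))
print("same result:", same)

t_loop = timeit.timeit(lambda: add_loop(x, 5), number=5)
t_vec = timeit.timeit(lambda: add_vec(x, 5), number=5)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```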

The basic arithmetic operators all work as vectorized operations.

x = np.array([1, 2, 3, -1, -2, -3])

y1 = x + 10      # [11 12 13  9  8  7]
y2 = x - 10      # [ -9  -8  -7 -11 -12 -13]
y3 = x * 10      # [ 10  20  30 -10 -20 -30]
y4 = x / 2       # [ 0.5  1.   1.5 -0.5 -1.  -1.5]
y5 = x // 2      # [ 0  1  1 -1 -1 -2]
y6 = x % 3       # [1 2 0 2 1 0]
y7 = x ** 2      # [1 4 9 1 4 9]
y8 = -x          # [-1 -2 -3  1  2  3]
y9 = 10 * x + 5  # [ 15  25  35  -5 -15 -25]

Math functions

NumPy has many built-in mathematical and scientific functions and constants (e.g. $\pi$). These functions are vectorized as well.

# absolute value
x = np.array([1, 2, 3, -1, -2, -3])

y = np.abs(x)
print(y)  # [1 2 3 1 2 3]

Exponential and logarithmic functions

x = np.array([1, 2, 3])

y1 = np.exp(x)  
y2 = np.exp2(x) 
y3 = np.power(3, x)

----- result -----
[ 2.71828183  7.3890561  20.08553692]
[2. 4. 8.]
[ 3  9 27]

x = np.array([1, 2, 4, 8, 10])

y1 = np.log(x)
y2 = np.log2(x)
y3 = np.log10(x)

----- result -----
[0.         0.69314718 1.38629436 2.07944154 2.30258509]
[0.         1.         2.         3.         3.32192809]
[0.         0.30103    0.60205999 0.90308999 1.        ]

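NumPy also provides special functions for numerical accuracy with exponentials and logarithms. np.expm1(x) computes exp(x) - 1 and np.log1p(x) computes log(1 + x); both stay accurate when x is close to 0, where the naive expressions lose digits to cancellation.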

x1 = np.array([1e-3, 1e-4, 1e-5], dtype=np.float32)
x2 = np.array(x1, dtype=np.float64)

y1 = np.exp(x1) - 1
y2 = np.exp(x2) - 1
y0 = np.expm1(x1)

----- result -----
[1.00052357e-03 1.00016594e-04 1.00135803e-05]
[1.00050021e-03 1.00004998e-04 1.00000497e-05]
[1.0005003e-03 1.0000499e-04 1.0000050e-05]

y1 = np.log(x1 + 1)
y2 = np.log(x2 + 1)
y0 = np.log1p(x1)

----- result -----
[9.99547192e-04 1.00011595e-04 1.00135303e-05]
[9.99500381e-04 9.99949978e-05 9.99994975e-06]
[9.995003e-04 9.999500e-05 9.999950e-06]
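
The same effect shows up in float64 when x is small enough. The sketch below compares the naive and special forms at x = 1e-10, using the first two Taylor terms as the reference value. That reference is an assumption of this example and is accurate at this scale because the next term is negligible. The helper rel_err is illustrative:

```python
import numpy as np

x = 1e-10  # float64 input, very close to 0

# Reference values from the first Taylor terms:
# exp(x) - 1 ≈ x + x**2 / 2,  log(1 + x) ≈ x - x**2 / 2
ref_expm1 = x + x**2 / 2
ref_log1p = x - x**2 / 2

def rel_err(value, ref):
    # relative error of value with respect to the reference
    return abs(value - ref) / abs(ref)

print("exp(x) - 1 :", rel_err(np.exp(x) - 1, ref_expm1))
print("expm1(x)   :", rel_err(np.expm1(x), ref_expm1))
print("log(1 + x) :", rel_err(np.log(1 + x), ref_log1p))
print("log1p(x)   :", rel_err(np.log1p(x), ref_log1p))
```

The naive forms should show a relative error many orders of magnitude larger than expm1 and log1p.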

Trigonometric functions

x = np.linspace(0, np.pi, 4)
y_sin = np.sin(x)
y_cos = np.cos(x)
y_tan = np.tan(x)

----- result -----
[0.         1.04719755 2.0943951  3.14159265]
[0.00000000e+00 8.66025404e-01 8.66025404e-01 1.22464680e-16]
[ 1.   0.5 -0.5 -1. ]
[ 0.00000000e+00  1.73205081e+00 -1.73205081e+00 -1.22464680e-16]
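
The values around 1.22e-16 in the result above are floating-point rounding residue, not exact zeros, because np.pi is only an approximation of $\pi$. A minimal sketch of how to compare such values, using np.isclose with its default tolerances:

```python
import numpy as np

s = np.sin(np.pi)

print(s == 0.0)              # False: s is a tiny nonzero rounding residue
print(np.isclose(s, 0.0))    # True: equal within the default tolerance
```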


x = np.array([-1, 0, 1])

y1 = np.arcsin(x)
y2 = np.arccos(x)
y3 = np.arctan(x)

----- result -----
[-1.57079633  0.          1.57079633]
[3.14159265 1.57079633 0.        ]
[-0.78539816  0.          0.78539816]

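Statistics functions

The statistics functions are also far more efficient than Python's built-ins when applied to NumPy arrays. sum, max, and min are Python built-in functions, but the NumPy equivalents (np.sum, or array methods such as data.max()) run much faster.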

data = np.random.random(100000)

%timeit sum(data)
%timeit np.sum(data)
%timeit max(data)
%timeit data.max()

----- result -----
19.4 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
87.3 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
9.24 ms ± 595 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
41.5 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

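The following example applies these operations to a matrix. With axis=0 the operation runs down the rows and gives one result per column; with axis=1 it runs across the columns and gives one result per row.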

X = np.random.randint(1, 4, size=(3, 4))
print('X')
print(X)

print('\nsum')
print(X.sum())

print('\nsum-axis')
print(X.sum(axis=0))
print(X.sum(axis=1))

print('\ncumsum-axis')
print(X.cumsum(axis=0))
print(X.cumsum(axis=1))

print('\nprod-axis')
print(X.prod(axis=0))
print(X.prod(axis=1))

print('\nX.cumprod-axis')
print(X.cumprod(axis=0))
print(X.cumprod(axis=1))

----- result -----
X
[[1 3 1 2]
 [3 2 2 1]
 [1 3 3 2]]

sum
24

sum-axis
[5 8 6 5]
[7 8 9]

cumsum-axis
[[1 3 1 2]
 [4 5 3 3]
 [5 8 6 5]]
[[1 4 5 7]
 [3 5 7 8]
 [1 4 7 9]]

prod-axis
[ 3 18  6  4]
[ 6 12 18]

X.cumprod-axis
[[ 1  3  1  2]
 [ 3  6  2  2]
 [ 3 18  6  4]]
[[ 1  3  3  6]
 [ 3  6 12 12]
 [ 1  3  9 18]]
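
Because the X above is random, the sketch below hard-codes the same values so that the axis behaviour can be reproduced. The names col_sums and row_sums are illustrative:

```python
import numpy as np

# Same values as the X printed in the result above.
X = np.array([[1, 3, 1, 2],
              [3, 2, 2, 1],
              [1, 3, 3, 2]])

col_sums = X.sum(axis=0)   # collapse axis 0 (rows): one value per column
row_sums = X.sum(axis=1)   # collapse axis 1 (columns): one value per row

print(col_sums.shape, col_sums)   # (4,) [5 8 6 5]
print(row_sums.shape, row_sums)   # (3,) [7 8 9]
```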

Statistical operations on a matrix

X = np.random.randint(1, 4, size=(3, 4))
print('X')
print(X)

print('min-axis=0:', X.min(axis=0))
print('max-axis=0:', X.max(axis=0))
print('mean-axis=0:', X.mean(axis=0))
print('var-axis=0:', X.var(axis=0))
print('std-axis=0:', X.std(axis=0))
print('med-axis=0:', np.median(X, axis=0))
print('p75%-axis=0:', np.percentile(X, 75, axis=0))

----- result -----
X
[[3 2 2 1]
 [2 3 3 2]
 [3 1 3 1]]
min-axis=0: [2 1 2 1]
max-axis=0: [3 3 3 2]
mean-axis=0: [2.66666667 2.         2.66666667 1.33333333]
var-axis=0: [0.22222222 0.66666667 0.22222222 0.22222222]
std-axis=0: [0.47140452 0.81649658 0.47140452 0.47140452]
med-axis=0: [3. 2. 3. 1.]
p75%-axis=0: [3.  2.5 3.  1.5]
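
One detail worth knowing about the var and std values above: by default NumPy divides by N (population variance, ddof=0), which is why the first column [3, 2, 3] gives 0.2222 rather than 0.3333. A small sketch of the difference, using the ddof parameter of var:

```python
import numpy as np

col = np.array([3, 2, 3])  # first column of the X shown above

pop_var = col.var()           # ddof=0: divide by N
sample_var = col.var(ddof=1)  # ddof=1: divide by N - 1

print(pop_var, sample_var)
```

Here pop_var is 2/9 and sample_var is 1/3, up to floating-point rounding.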