Numpy

Hirsun · about 30 minutes

What is Numpy

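NumPy is the core library for scientific computing in Python. Foundational Python libraries such as pandas, SciPy, and Matplotlib are built on top of NumPy's API. So are machine learning libraries such as TensorFlow and scikit-learn, which take NumPy arrays as input. Anyone who works with numbers in Python will come across NumPy arrays.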

1655207474267.png

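NumPy (short for Numerical Python):

  • An open-source Python library for scientific computing
  • Makes it easy to compute with arrays and matrices
  • Includes a large set of functions for linear algebra, Fourier transforms, random number generation, and more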

Why Numpy

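For the same numerical task, NumPy has these advantages over plain Python code:

  • More concise code: NumPy computes on whole arrays and matrices at once and supports a large number of mathematical functions, whereas plain Python has to implement the same thing from scratch with for loops
  • Better performance: NumPy arrays are far more efficient than Python lists or nested lists in storage, I/O, and computation
    • Note: NumPy stores data differently from a native Python list
    • Note: most of NumPy is implemented in C, which is why it is faster than pure Python code

NumPy is the foundation of Python's data science libraries

  • For example SciPy, Scikit-Learn, TensorFlow, and PaddlePaddle
  • Without knowing NumPy, it is hard to understand these libraries in depth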
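
As a minimal sketch of the two advantages listed above, the snippet below computes the same element-wise product once with a plain Python loop and once with a single NumPy expression. The data and variable names are illustrative and do not come from the original notes.

```python
import numpy as np

# Illustrative data (an assumption, not from the original notes)
prices = [2.5, 4.0, 1.25, 8.0]
quantities = [4, 1, 8, 2]

# Pure Python: visit the elements one by one
totals_loop = []
for price, quantity in zip(prices, quantities):
    totals_loop.append(price * quantity)

# NumPy: one expression operates on the whole arrays
totals_np = np.array(prices) * np.array(quantities)

print(totals_loop)  # [10.0, 4.0, 10.0, 16.0]
print(totals_np)    # [10.  4. 10. 16.]
```

Both versions give the same numbers; the NumPy version simply moves the loop into compiled code.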

Create np array

Create empty array

import numpy as np
data = np.array([])  # np.array() requires an argument; an empty list gives an empty array

Creating 1D arrays from lists

python_list = [3, 2, 5, 8, 4, 9, 7, 6, 1]
array = np.array(python_list)
print(array)
print(type(array))
[3 2 5 8 4 9 7 6 1]
<class 'numpy.ndarray'>
1655208610322.png

Creating 2D arrays from lists

python_list_of_lists = [[3, 2, 5],[9, 7, 1],[4, 3, 6]]
np.array(python_list_of_lists)
array([[3, 2, 5],[9, 7, 1],[4, 3, 6]])

np.zeros()

# Rows first, then columns, i.e. y before x
np.zeros((5, 3))
array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

np.random.random()

np.random.random((2, 4))
array([[0.88524516, 0.85641352, 0.33463107, 0.53337117],
[0.69933362, 0.09295327, 0.93616428, 0.03601592]])

np.arange()

np.arange(-3, 4)
np.arange(4)
array([-3, -2, -1, 0, 1, 2, 3])
array([0, 1, 2, 3])

This is handy for building the x-axis of a plot.

import matplotlib.pyplot as plt

# Create an array of integers from one to ten
one_to_ten = np.arange(1,11)
doubling_array = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

# Create your scatterplot
plt.scatter(one_to_ten, doubling_array)
plt.show()

3D & 4D arrays

array_1_2D = np.array([[1, 2], [5, 7]])
array_2_2D = np.array([[8, 9], [5, 7]])
array_3_2D = np.array([[1, 2], [5, 7]])
array_3D = np.array([array_1_2D, array_2_2D, array_3_2D])

# Assumes array_A_3D to array_I_3D are 3D arrays of equal shape defined elsewhere
array_4D = np.array([array_A_3D, array_B_3D, array_C_3D, array_D_3D, array_E_3D, array_F_3D, array_G_3D, array_H_3D, array_I_3D])

Create from pd

import pandas as pd

# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file).head()

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array to the shell
print(type(data_array))

Matrix and tensor arrays

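  • A 1D array has a shape such as (5,); it has no separate row or column orientation. A 1D array is also called a vector
  • A 2D array is called a matrix
  • A 3D array is called a tensor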
1655230321345.png
1655308549835.png

.shape: get the shape

array = np.zeros((3, 5))
print(array.shape)
# 3 rows, 5 columns
(3, 5)

.flatten(): flatten to 1D

array = np.array([[1, 2], [5, 7], [6, 6]])
array.flatten()
array([1, 2, 5, 7, 6, 6])

.reshape(): change the shape

1655830276336.png
array = np.array([[1, 2], [5, 7], [6, 6]])
array.reshape((2, 3))
array([[1, 2, 5],
[7, 6, 6]])

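Converting a vector to a 2D array

Machine learning models usually expect X to be a 2D array. If X is a vector, it has to be converted into a 2D array with one column and many rows.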

X_bmi = X[:, 3]
print(y.shape, X_bmi.shape)
(752,) (752,)
X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape)
(752, 1)

.ndim: number of dimensions

Returns the number of dimensions of the array.
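
A short sketch of .ndim on a vector, a matrix, and a 3D array; the array names and shapes are made up for illustration.

```python
import numpy as np

vector = np.arange(5)          # shape (5,)
matrix = np.zeros((3, 5))      # shape (3, 5)
tensor = np.zeros((2, 3, 5))   # shape (2, 3, 5)

print(vector.ndim, matrix.ndim, tensor.ndim)  # 1 2 3

# .ndim always equals the length of .shape
print(tensor.ndim == len(tensor.shape))  # True
```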

np.array type

data.dtype  # get the data type

Sample NumPy data types:

  • np.int64, np.int32, np.int16, np.int8
  • np.float64
  • np.float32
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

# data.astype() converts the data type explicitly
float_arr1 = arr1.astype(np.float64)
float_arr2 = arr2.astype(arr1.dtype)

Note that an np.array holds a single data type; if the input values have different types, they are coerced to a common one.
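
The coercion can be seen in a small sketch (the values are illustrative): mixing integers with a float yields a float array, and mixing numbers with a string yields an array of strings.

```python
import numpy as np

mixed_numbers = np.array([1, 2.5, 3])   # ints and a float become float64
print(mixed_numbers.dtype)              # float64

mixed_text = np.array([1, "two", 3])    # numbers and a string become strings
print(mixed_text.dtype.kind)            # U
print(mixed_text)                       # ['1' 'two' '3']
```

Because of this, a column of numbers that picks up a single stray string silently stops being numeric, which is worth checking with .dtype after loading data.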

Indexing and slicing

  • array[ indexArray ]
  • array[ booleanArray ]
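
The two forms listed above can be sketched as follows; the values and variable names are illustrative.

```python
import numpy as np

values = np.array([2, 4, 6, 8, 10])

index_array = np.array([0, 2, 4])   # pick elements by position
boolean_array = values > 5          # pick elements where the mask is True

by_index = values[index_array]
by_mask = values[boolean_array]

print(by_index)  # [ 2  6 10]
print(by_mask)   # [ 6  8 10]
```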

Indexing 1D arrays

array = np.array([2, 4, 6, 8, 10])
array[3]
8

Indexing elements in 2D

# Row first, then column
sudoku_game[2, 4]

# Take row 0 (indices start at 0)
sudoku_game[0]

# Take the 4th column, index 3 (indices start at 0)
sudoku_game[:, 3]

Slicing 1D arrays

1655307380820.png
array = np.array([2, 4, 6, 8, 10])
array[2:4]
array([6, 8])

Slicing 2D arrays

1655307299479.png
sudoku_game[3:6, 3:6]
array([[0, 0, 2],[0, 0, 7],[0, 8, 3]])

Note:

  • Take column 2 with [:,2]
  • Take row 2 with [2,] or [2,:]
1655307430443.png
# With a step size
sudoku_game[3:6:2, 3:6:2]
array([[0, 2],[0, 3]])

Axis

1655307503116.png
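
The figure for this section is not reproduced as text. As a minimal sketch, assuming it illustrates how the axis argument picks the direction of an operation on a 2D array, the example below sums a small made-up grid along each axis.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = grid.sum(axis=0)  # collapse the rows: one value per column
row_sums = grid.sum(axis=1)  # collapse the columns: one value per row

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
```

A useful way to remember it: the axis you name is the one that disappears from the result's shape.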

Sort

1655307732370.png
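
The figure for this section is not reproduced as text. As a minimal sketch, assuming it shows np.sort() on a 2D array, the example below sorts a made-up array along each axis.

```python
import numpy as np

scores = np.array([[9, 1, 8],
                   [3, 7, 2]])

sorted_rows = np.sort(scores)          # default axis=-1: sort within each row
sorted_cols = np.sort(scores, axis=0)  # sort down each column

print(sorted_rows)
# [[1 8 9]
#  [2 3 7]]
print(sorted_cols)
# [[3 1 2]
#  [9 7 8]]
```

np.sort() returns a sorted copy and leaves the original array unchanged.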

Boolean calculate

>> bmi = np.array([21.852, 20.975, 21.75 , 24.747, 21.441])
>> bmi > 21

output: array([True, False, True, True, True], dtype=bool)

Filter

Boolean masks

one_to_five = np.arange(1, 6)
mask = one_to_five % 2 == 0
mask
one_to_five[mask]
array([False, True, False, True, False])
array([2, 4])
# with 2d
# Create an array which contains row data on the largest tree in tree_census
largest_tree_data = tree_census[tree_census[:, 2] == 51]
print(largest_tree_data)

# Slice largest_tree_data to get only the block ID
largest_tree_block_id = largest_tree_data[:, 1]
print(largest_tree_block_id)

# Create an array which contains row data on all trees with largest_tree_block_id
trees_on_largest_tree_block = tree_census[tree_census[:, 1] == largest_tree_block_id]
print(trees_on_largest_tree_block)
<script.py> output:
    [[    61 501882     51      0]]
    [501882]
    [[    60 501882      8      0]
     [    61 501882     51      0]
     [    62 501882      7      0]
     [    63 501882      4      0]
     [    64 501882     15      0]
     [    65 501882      3      0]
     [    66 501882      8      0]
     [    67 501882      6      0]
     [    68 501882      6      0]
     [    69 501882      3      0]]

np.where()

Returns the indices where the condition holds

# Create an array of row_indices for trees on block 313879
row_indices = np.where(tree_census[:,1] == 313879)
print(row_indices)

# Create an array which only contains data for trees on block 313879
block_313879 = tree_census[row_indices]
print(block_313879)
<script.py> output:
    (array([921, 922]),)
    [[  1115 313879      3      0]
     [  1116 313879     17      0]]

As a replacement (if/else) operation

# Create and print a 1D array of tree and stump diameters
# np.where(condition, x, y): yields x where the condition holds and y elsewhere, like the ternary operator in Java
trunk_stump_diameters = np.where(tree_census[:,2] == 0, tree_census[:,3], tree_census[:,2])
print(trunk_stump_diameters)
<script.py> output:
    [24 20  ...... 6]

Adding and removing

Concatenating

1655833719669.png
classroom_ids_and_sizes = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])
new_classrooms = np.array([[5, 30], [5, 17]])
np.concatenate((classroom_ids_and_sizes, new_classrooms), axis = 0)
array([[ 1, 22],
[ 2, 21],
[ 3, 27],
[ 4, 26],
[ 5, 30],
[ 5, 17]])
classroom_ids_and_sizes = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])
grade_levels_and_teachers = np.array([[1, "James"], [1, "George"], [3,"Amy"], [3, "Meehir"]])
np.concatenate((classroom_ids_and_sizes, grade_levels_and_teachers), axis=1)
array([['1', '22', '1', 'James'], ['2', '21', '1', 'George'], ['3', '27', '3', 'Amy'], ['4', '26', '3', 'Meehir']])
1655833871407.png

Deleting with np.delete()

1655833927779.png

np.delete() should be given an axis; without one, the array is flattened before deletion and the result is 1D.
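
A minimal sketch of the note above, reusing the classroom data from the concatenation example; the variable names for the results are made up for illustration.

```python
import numpy as np

classroom_data = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])

without_row = np.delete(classroom_data, 1, axis=0)     # remove row 1
without_column = np.delete(classroom_data, 1, axis=1)  # remove column 1
no_axis = np.delete(classroom_data, 1)                 # flattened first, then element 1 removed

print(without_row)
# [[ 1 22]
#  [ 3 27]
#  [ 4 26]]
print(without_column.shape)  # (4, 1)
print(no_axis)
# [ 1  2 21  3 27  4 26]
```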

Summarizing data

  • np.median(Series or np.array)
  • np.mean(Series or np.array)
  • np.max(Series or np.array)
  • np.quantile

Compute a 95% confidence interval

# Print the 95% confidence interval
print(np.quantile(cv_results, [0.025, 0.975]))
[0.74141863 0.77191915]

Also available:

  • .sum(axis = 0)
  • .min() / .max()
  • .cumsum()

keepdims = True

# Create a 2D array of total monthly sales across industries
monthly_industry_sales = monthly_sales.sum(axis=1, keepdims=True)
print(monthly_industry_sales)

# Add this column as the last column in monthly_sales
monthly_sales_with_total = np.concatenate((monthly_sales,monthly_industry_sales), axis = 1 )
print(monthly_sales_with_total)
<script.py> output:
    [[36716]
     [37133]
     ......
     [52830]]
    [[ 4134 23925  8657 36716]
     ......
     [ 6630 27797 18403 52830]]

.cumsum()

# Find cumulative monthly sales for each industry
cumulative_monthly_industry_sales = monthly_sales.cumsum(axis=0)
print(cumulative_monthly_industry_sales)

# Plot each industry's cumulative sales by month as separate lines
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:,0], label="Liquor Stores")
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:,1], label="Restaurants")
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:,2], label="Department stores")
plt.legend()
plt.show()
<script.py> output:
    [[  4134  23925   8657]
     ......
     [ 59673 315105 135026]]
1656354114253.png

Calculate

Vectorized operations

The operation is applied to every element.

import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
for val in bmi:
    print(f"{val:.3f}")  # three decimals, matching the output below

output:
  21.852
  20.975
  21.750
  24.747
  21.441

Applying functions element-wise

Turn a Python function into one that applies to each element of an np.array.

array = np.array(["NumPy", "is", "awesome"])
len(array) > 2
True

After vectorizing:

vectorized_len = np.vectorize(len)
vectorized_len(array) > 2
array([ True, False, True])

Broadcastable

1656434903814.png
1656434933377.png
CleanShot 2022-06-29 at 00.34.34@2x.png

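Even though a 1D array has only one dimension, broadcasting treats it as one row, not one column.

Arrays that are not broadcastable should be reshaped until they are.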
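
Both points can be seen in a small sketch with made-up arrays: a 1D array of length 3 is added to every row of a (2, 3) array, while a 1D array of length 2 has to be reshaped into a column first.

```python
import numpy as np

table = np.arange(6).reshape((2, 3))   # [[0 1 2], [3 4 5]], shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,): treated as one row
col = np.array([100, 200])             # shape (2,): not broadcastable with (2, 3)

added_row = table + row
print(added_row)
# [[10 21 32]
#  [13 24 35]]

# table + col would raise a ValueError, so reshape col into a (2, 1) column first
added_col = table + col.reshape((2, 1))
print(added_col)
# [[100 101 102]
#  [203 204 205]]
```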

Number generators

np.linspace()

# Evenly spaced, increasing samples from 0 to 20 inclusive; 50 samples by default
print(np.linspace(0,20))
[ 0.          0.40816327  0.81632653  1.2244898   1.63265306  2.04081633
  2.44897959  2.85714286  3.26530612  3.67346939  4.08163265  4.48979592
  4.89795918  5.30612245  5.71428571  6.12244898  6.53061224  6.93877551
  7.34693878  7.75510204  8.16326531  8.57142857  8.97959184  9.3877551
  9.79591837 10.20408163 10.6122449  11.02040816 11.42857143 11.83673469
 12.24489796 12.65306122 13.06122449 13.46938776 13.87755102 14.28571429
 14.69387755 15.10204082 15.51020408 15.91836735 16.32653061 16.73469388
 17.14285714 17.55102041 17.95918367 18.36734694 18.7755102  19.18367347
 19.59183673 20.        ]
# Evenly spaced, increasing samples from 1 to 20 inclusive; 20 samples
print(np.linspace(1,20,20))
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]

np.arange()

explanatory_data = pd.DataFrame({"length_cm": np.arange(20, 41)})
	length_cm
0 20
1 21
2 22
...
20 40
...

Random generators

import numpy as np

np.random.seed(123) # Starting from a seed
np.random.rand() # Pseudo-random numbers

coin = np.random.randint(0,2) # Randomly generate 0 or 1
jitter = np.random.normal(0, 2, size=len(brfss)) # Draw random numbers from a normal distribution (mean = 0, sd = 2), one per row of brfss

Read file

Much of the time you will need to import datasets that have different data types in different columns; one column may contain strings and another floats, for example. With its default float dtype, np.loadtxt() raises an error on such data.

There is also np.recfromcsv(), which behaves similarly to np.genfromtxt() except that its default dtype is None. Note that np.recfromcsv() was removed from the main namespace in NumPy 2.0; np.genfromtxt() with a comma delimiter is the replacement there.

data = np.loadtxt(file, delimiter='\t', dtype=str)
data_float = np.loadtxt(file, delimiter="\t", dtype=float, skiprows=1)

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
d = np.recfromcsv(file)

Alternatively, open files using with open:

with open("logo.npy", "rb") as f:
	logo_rgb_array = np.load(f)

Save arrays in many formats:

  • .csv
  • .txt
  • .pkl
  • .npy
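
As a minimal sketch of two of these formats, the example below writes a small made-up array as text with np.savetxt() and as binary with np.save(), then reads both back. The file names are hypothetical and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.0], [3.25, 4.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")  # hypothetical file name
    npy_path = os.path.join(tmp_dir, "data.npy")  # hypothetical file name

    np.savetxt(csv_path, data, delimiter=",")  # human-readable text
    np.save(npy_path, data)                    # binary, keeps dtype and shape

    from_csv = np.loadtxt(csv_path, delimiter=",")
    from_npy = np.load(npy_path)

print(np.array_equal(data, from_csv), np.array_equal(data, from_npy))  # True True
```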

RGB arrays

Loading .npy files

1656438801552.png
with open("logo.npy", "rb") as f:
  logo_rgb_array = np.load(f)
  plt.imshow(logo_rgb_array)
  plt.show()

Examining RGB data

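  • An RGB image is a 3D array
    • x = image height (rows)
    • y = image width (columns)
    • z = 3, holding the R, G, and B values
  • The larger the RGB values, the brighter the color
    • [255,255,255] is pure white
    • [0,0,0] is pure black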
1656438826776.png1656438889305.png
red_array[1], green_array[1], blue_array[1]
(array([255, 255, 255, ..., 255, 255, 255]),
array([255, 255, 255, ..., 255, 255, 255]),
array([255, 255, 255, ..., 255, 255, 255]))

Updating RGB data

Example 1

dark_logo_array = np.where(logo_rgb_array == 255, 50, logo_rgb_array)
plt.imshow(dark_logo_array)
plt.show()
1656439001104.png

Example 2

# Reduce every value in rgb_array by 50 percent
darker_rgb_array = rgb_array * 0.5

# Convert darker_rgb_array into an array of integers
darker_rgb_int_array = darker_rgb_array.astype(int)
plt.imshow(darker_rgb_int_array)
plt.show()
1656440802833.png

Saving arrays as .npy files

with open("dark_logo.npy", "wb") as f:
	np.save(f, dark_logo_array)

Array acrobatics

Flipping an array

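  • An axis can be given to choose the dimension along which the elements are mirrored
  • Without an axis, every axis is flipped, which amounts to a point reflection through the center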
flipped_logo = np.flip(logo_rgb_array)
plt.imshow(flipped_logo)
plt.show()

1656447576765.png1656447584981.png

flipped_rows_logo = np.flip(logo_rgb_array, axis=0)
plt.imshow(flipped_rows_logo)
plt.show()
1656447720395.png
flipped_colors_logo = np.flip(logo_rgb_array, axis=2)
plt.imshow(flipped_colors_logo)
plt.show()
1656448006755.png
flipped_except_colors_logo = np.flip(logo_rgb_array, axis=(0, 1))
plt.imshow(flipped_except_colors_logo)
plt.show()
1656448043029.png

Transposing an array

1656448151676.png
transposed_logo = np.transpose(logo_rgb_array, axes=(1, 0, 2))
plt.imshow(transposed_logo)
plt.show()
1656448184853.png

help()

1656439060398.png
# Display the documentation for .astype()
help(np.ndarray.astype)

Stacking and splitting

To understand multi-dimensional arrays, it helps to picture the data as beams in a building.

Slicing dimensions

See [Examining RGB data](#Examining RGB data)

Splitting arrays

# Unpack into three arrays
red_array, green_array, blue_array = np.split(rgb, 3, axis=2)
red_array
red_array.shape
array([ [[255], [255], [255]],  [[255], [ 0], [ 0]],  [[255], [ 0], [ 0]]])
(3, 3, 1)

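When we split an array, the resulting arrays have the same number of dimensions as the original: each piece still carries a trailing dimension of length 1 (the former z axis), which has to be removed with a reshape.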

red_array_2D = red_array.reshape((3, 3))
red_array_2D
red_array_2D.shape
array([[255, 255, 255], [255, 0, 0], [255, 0, 0]])
(3, 3)

Stacking arrays

1656448948070.png
red_array = np.zeros((1001, 1001)).astype(np.int32)
green_array = green_array.reshape((1001, 1001))
blue_array = blue_array.reshape((1001, 1001))

stacked_rgb = np.stack([red_array, green_array, blue_array], axis=2)
plt.imshow(stacked_rgb)
plt.show()
1656449200999.png

::: detail Another example

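  • In the example above, each group of data lies along the z axis, so axis = 2
  • In the example below, each quarter is stacked along a new first axis, so axis = 0
  • Remember: stack the arrays the same way you split them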

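The first dimension of monthly_sales holds the rows, one per month, with that month's sales for three industries; the second dimension holds the columns, one per industry, with its monthly sales. The task is to split this data into quarterly sales and stack the quarters so that the new third dimension represents the four 2D arrays of quarterly sales.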

# Split monthly_sales into quarterly data
q1_sales, q2_sales, q3_sales, q4_sales = np.split(monthly_sales, 4)

# Print q1_sales
print(q1_sales)

# Stack the four quarterly sales arrays
quarterly_sales = np.stack([q1_sales, q2_sales, q3_sales, q4_sales], axis = 0)
print(quarterly_sales)

:::

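::: detail Example: deepening the blue in a Monet painting

Suppose you want to better understand Monet's use of blue. The task is to create a version of the Monet rgb_array that emphasizes the parts of the painting that use a lot of blue by making them even bluer. The code below does the splitting part first and then the stacking part.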

# Split rgb_array into red, green, and blue arrays
red_array, green_array, blue_array = np.split(rgb_array, 3, axis=2)

# Create emphasized_blue_array
emphasized_blue_array = np.where(blue_array > blue_array.mean(), 255, blue_array)

# Print the shape of emphasized_blue_array
print(emphasized_blue_array.shape)

# Remove the trailing dimension from emphasized_blue_array
emphasized_blue_array_2D = emphasized_blue_array.reshape(675,844)

# Print the shapes of blue_array and emphasized_blue_array_2D
print(blue_array.shape, emphasized_blue_array_2D.shape)

# Reshape red_array and green_array
red_array_2D = red_array.reshape((675, 844))
green_array_2D = green_array.reshape((675, 844))

# Stack red_array_2D, green_array_2D, and emphasized_blue_array_2D
emphasized_blue_monet = np.stack([red_array_2D, green_array_2D, emphasized_blue_array_2D], axis = 2)
plt.imshow(emphasized_blue_monet)
plt.show()
1656451550115.png

:::

1656451664803.png