Commit 1affc70

Merge pull request #1 from ACGpp/main
Update tutorial (更新教程)
2 parents 9238fe2 + 2e34b32

34 files changed: +4931 −372 lines

.github/workflows/deploy.yml

Lines changed: 63 additions & 0 deletions
New file (63 additions, 0 deletions):

```yaml
name: Deploy to GitHub Pages

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '18'
          cache: npm
          cache-dependency-path: './website/package-lock.json'

      - name: Setup Pages
        uses: actions/configure-pages@v4

      - name: Install dependencies
        run: |
          cd website
          npm ci
          npm list vitepress

      - name: Build
        run: |
          cd website
          npm run docs:build
          ls -la docs/.vitepress/dist

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: website/docs/.vitepress/dist

  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    needs: build
    runs-on: ubuntu-latest
    name: Deploy
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
```

README.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -145,4 +145,6 @@ simple-ml-code/

 ## Acknowledgements (致谢)

-Thanks to all the developers who have contributed to this tutorial!
+Thanks to the creators of the tutorial at https://github.com/datawhalechina/machine-learning-toy-code
+Thanks to Sm1les, author of the pumpkin-book (南瓜书): https://github.com/datawhalechina/pumpkin-book
```
One binary file changed (not shown).

datasets/MNIST/raw/load_data.py

Lines changed: 120 additions & 0 deletions
New file (120 additions, 0 deletions):

```python
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Date: 2023-01-30 09:31:34
LastEditor: JiangJi
LastEditTime: 2023-01-30 09:31:35
Description:
'''
'''
This script provides two loaders: load_local_mnist, which decodes the local
.gz files into arrays, and load_online_data, which downloads MNIST via keras.
'''
import gzip
import os
from struct import unpack

import numpy as np


def __read_image(path):
    with gzip.open(path, 'rb') as f:
        magic, num, rows, cols = unpack('>4I', f.read(16))
        img = np.frombuffer(f.read(), dtype=np.uint8).reshape(num, 28 * 28)
    return img


def __read_label(path):
    with gzip.open(path, 'rb') as f:
        magic, num = unpack('>2I', f.read(8))
        lab = np.frombuffer(f.read(), dtype=np.uint8)
    return lab


def __normalize_image(image):
    '''Normalize pixel values from 0-255 to 0.0-1.0.'''
    img = image.astype(np.float32) / 255.0
    return img


def __one_hot_label(label):
    '''One-hot encode the labels.

    Args:
        label: digit labels in 0-9
    Returns:
        binary encoding, e.g. [0,0,1,0,0,0,0,0,0,0] represents the digit 2
    '''
    lab = np.zeros((label.size, 10))
    for i, row in enumerate(lab):
        row[label[i]] = 1
    return lab


def load_local_mnist(x_train_path=os.path.dirname(__file__) + '/train-images-idx3-ubyte.gz',
                     y_train_path=os.path.dirname(__file__) + '/train-labels-idx1-ubyte.gz',
                     x_test_path=os.path.dirname(__file__) + '/t10k-images-idx3-ubyte.gz',
                     y_test_path=os.path.dirname(__file__) + '/t10k-labels-idx1-ubyte.gz',
                     normalize=True, one_hot=True):
    '''Read the MNIST dataset from .gz files.

    Args:
        x_train_path / y_train_path / x_test_path / y_test_path: file paths
        normalize (bool, optional): scale pixels to 0-1. Defaults to True.
        one_hot (bool, optional): if True, labels are returned as one-hot
            arrays such as [0,0,1,0,0,0,0,0,0,0]
    Returns:
        (train images, train labels), (test images, test labels)
        The training set has 60000 rows, each a vector of 784 = 28*28 values.
    '''
    image = {
        'train': __read_image(x_train_path),
        'test': __read_image(x_test_path)
    }
    label = {
        'train': __read_label(y_train_path),
        'test': __read_label(y_test_path)
    }
    if normalize:
        for key in ('train', 'test'):
            image[key] = __normalize_image(image[key])
    if one_hot:
        for key in ('train', 'test'):
            label[key] = __one_hot_label(label[key])
    return (image['train'], label['train']), (image['test'], label['test'])


def load_online_data():  # categorical_crossentropy
    from keras.datasets import mnist
    from keras.utils import np_utils
    import numpy as np
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    number = 10000
    x_train, y_train = x_train[0:number], y_train[0:number]
    x_train = x_train.reshape(number, 28 * 28)
    x_test = x_test.reshape(x_test.shape[0], 28 * 28)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    # convert class vectors to binary class matrices
    y_train = np_utils.to_categorical(y_train, 10)
    y_test = np_utils.to_categorical(y_test, 10)
    x_test = np.random.normal(x_test)  # add noise
    x_train, x_test = x_train / 255, x_test / 255
    return (x_train, y_train), (x_test, y_test)


if __name__ == "__main__":
    (x_train, y_train), (x_test, y_test) = load_local_mnist()
```
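The binary IDX layout that `__read_image` parses can be illustrated on a synthetic in-memory buffer. This is a minimal sketch (the helpers `make_idx_images` and `read_idx_images` are illustrative, not part of the repo), skipping the gzip layer that the real loader adds:

```python
from struct import pack, unpack

import numpy as np

def make_idx_images(num=2, rows=28, cols=28):
    # Big-endian header (magic, count, rows, cols) followed by raw uint8
    # pixels -- the same layout load_data.py unpacks with '>4I'.
    header = pack('>4I', 2051, num, rows, cols)  # 2051 is the IDX magic for image files
    body = bytes(num * rows * cols)              # all-zero pixels
    return header + body

def read_idx_images(raw):
    # Mirror of __read_image, minus the gzip.open wrapper.
    magic, num, rows, cols = unpack('>4I', raw[:16])
    img = np.frombuffer(raw[16:], dtype=np.uint8).reshape(num, rows * cols)
    return magic, img

magic, img = read_idx_images(make_idx_images())
print(magic, img.shape)  # 2051 (2, 784)
```

The fixed 16-byte header is why the loader reads exactly `f.read(16)` before handing the rest of the stream to `np.frombuffer`.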
Eight binary files changed (7.48 MB, 1.57 MB, 9.77 KB, 4.44 KB, 44.9 MB, 9.45 MB, 58.6 KB, 28.2 KB); binary files not shown.

datasets/README.ipynb

Lines changed: 163 additions & 0 deletions
Large diffs are not rendered by default.

docs/chapter2/逻辑回归.md

Lines changed: 43 additions & 22 deletions
```diff
@@ -1,8 +1,8 @@
 # Logistic Regression (逻辑回归)
 ## Code block
 import numpy as np
-from sklearn.datasets import fetch_openml
 from sklearn.linear_model import LogisticRegression
+import os
 ## Line-by-line explanation
 import numpy as np
 This line imports the numpy library under the alias np. numpy is a powerful numerical-computing library; think of it as our calculator, helping us carry out all kinds of numerical operations.
@@ -20,32 +20,53 @@
 ...functions, methods, and so on. Once these things and their usage have been declared, we put them to work, just as an essay first establishes the time, place, and characters and then uses them to tell you what happened. So do not fear the language itself; try to tie it to your own experience, and a whole wide world will open up.

 ## Loading the dataset
-mnist=fetch_openml('mnist_784')
-X,y=mnist['data'],mnist['target']
-X_train=np.array(X[:60000],dtype=float)
-y_train=np.array(y[:60000],dtype=float)
-X_test=np.array(X[60000:],dtype=float)
-y_test=np.array(y[60000:],dtype=float)
+# Load the MNIST dataset from local files
+def load_mnist_data():
+    from datasets.MNIST.raw.load_data import load_local_mnist
+    base_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'datasets', 'MNIST', 'raw')
+    (X_train, y_train), (X_test, y_test) = load_local_mnist(
+        x_train_path=os.path.join(base_path, 'train-images-idx3-ubyte.gz'),
+        y_train_path=os.path.join(base_path, 'train-labels-idx1-ubyte.gz'),
+        x_test_path=os.path.join(base_path, 't10k-images-idx3-ubyte.gz'),
+        y_test_path=os.path.join(base_path, 't10k-labels-idx1-ubyte.gz'),
+        normalize=True,
+        one_hot=False
+    )
+    return X_train, y_train, X_test, y_test
+
+# Load the data
+X_train, y_train, X_test, y_test = load_mnist_data()
 ## Line-by-line explanation
-mnist=fetch_openml('mnist_784')
-This line loads the dataset named mnist_784 through the fetch_openml function. It contains images of handwritten digits (0-9); 784 means each image has 784 pixel values (28*28 pixels).
-mnist is the loaded dataset object, i.e. the dataset is assigned to mnist; from then on mnist stands for this dataset, just as when we say Xiao Ming is good at math, mentioning Xiao Ming brings to mind that he is good at math.
+def load_mnist_data():
+This line defines a function named load_mnist_data for loading the local MNIST dataset. The function acts like a dedicated helper: we tell it where the data lives, and it fetches the data for us.

-X,y=mnist['data'],mnist['target']
-This line splits the mnist dataset into features X and labels y. mnist['data'] holds the image data (784 pixels per image), and mnist['target'] holds the labels these images correspond to (the digits 0-9 they represent).
+from datasets.MNIST.raw.load_data import load_local_mnist
+base_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'datasets', 'MNIST', 'raw')
+These two lines first import our custom load_local_mnist function and then set the dataset path. base_path is like a map that tells the program where to find the data files, and os.path.join stitches the path segments together so the files are found correctly on any operating system.

-Let's use an everyday example to understand features and labels:
-Features are an object's characteristics, like a person's height, weight, and age. In image recognition, the features are the image's pixel values.
-Labels are the targets we want to predict, e.g. judging from a person's features whether they are a student or a teacher; here "student" and "teacher" are the labels.
+(X_train, y_train), (X_test, y_test) = load_local_mnist(
+    x_train_path=os.path.join(base_path, 'train-images-idx3-ubyte.gz'),
+    y_train_path=os.path.join(base_path, 'train-labels-idx1-ubyte.gz'),
+    x_test_path=os.path.join(base_path, 't10k-images-idx3-ubyte.gz'),
+    y_test_path=os.path.join(base_path, 't10k-labels-idx1-ubyte.gz'),
+    normalize=True,
+    one_hot=False
+)
+This code calls load_local_mnist to load the data. It takes four file-path arguments:
+- train-images-idx3-ubyte.gz: training image data
+- train-labels-idx1-ubyte.gz: training label data
+- t10k-images-idx3-ubyte.gz: test image data
+- t10k-labels-idx1-ubyte.gz: test label data

-X_train=np.array(X[:60000],dtype=float)
-y_train=np.array(y[:60000],dtype=float)
-Here we use the first 60000 samples of mnist_784 for training. The first line assigns X (the features) to X_train; np.array(..., dtype=float) converts the data to a NumPy array with float elements. We convert to a NumPy array for better training, and use float because machine-learning algorithms expect floating-point input. You can surely work out the second line yourselves: it assigns the first 60000 samples of y to y_train, likewise as a float NumPy array.
+normalize=True means the image data is normalized: pixel values are mapped from 0-255 to decimals between 0 and 1, which makes model training more stable.
+one_hot=False means we do not one-hot encode the labels but use the plain digit labels 0-9 directly.

-X_test=np.array(X[60000:],dtype=float)
-y_test=np.array(y[60000:],dtype=float)
-Having studied the previous passage, can you work out what this code means?
-The samples from the 60001st onward are used as test data and assigned to X_test and y_test. This is the test set we use to evaluate the model's performance, again as float NumPy arrays.
+X_train, y_train, X_test, y_test = load_mnist_data()
+This line calls the function we defined to obtain the data. The dataset is split into a training set (X_train, y_train) and a test set (X_test, y_test):
+- X_train: training images, 60000 in total
+- y_train: labels for the training images
+- X_test: test images, 10000 in total
+- y_test: labels for the test images

 print(X_train.shape)
 print(y_train.shape)
```