Using PyTorch as an example, this post walks through the deep-learning workflow: reading the data, building the network, NMS, and how the various evaluation metrics are computed.

The main function: program entry and configuration

The main function typically uses the argparse package to accept arguments passed in from outside the program and to perform the basic configuration of the data, the network, and so on. Usage of argparse: https://docs.python.org/zh-cn/3/library/argparse.html

Common configuration items in the main function (a minimal sketch follows the list):

  • Dataset format: coco, csv, pascal voc, etc.
  • Data paths, including the paths of the training and test sets
  • Details of the network, such as its depth and the backbone type
  • Switches for optional features, such as data augmentation
  • Training variables, such as the number of epochs and the batch_size
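
A minimal sketch of such a configuration (the argument names here are illustrative, not taken from any particular repository):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='training script (illustrative arguments)')
    parser.add_argument('--dataset', default='csv', help='dataset format: coco, csv or pascal_voc')
    parser.add_argument('--csv_train', help='path to the training annotation file')
    parser.add_argument('--csv_val', help='path to the validation annotation file')
    parser.add_argument('--depth', type=int, default=50, help='ResNet backbone depth: 18/34/50/101/152')
    parser.add_argument('--augment', action='store_true', help='enable data augmentation')
    parser.add_argument('--epochs', type=int, default=100)
    parser.add_argument('--batch_size', type=int, default=2)
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    print(args)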

Data loading

Data loading covers reading the dataset files, applying data augmentation to the images, and reading the data in batches through a DataLoader.

Reading the dataset files

This step mainly reads the annotation file and the class_id file; a CSV-format dataset is used as the example here.

First implement a CSVDataset class that inherits from torch.utils.data.Dataset. The class must implement the __len__ and __getitem__ methods.

In the __init__ of CSVDataset the dataset files are read, which ultimately yields:

  • self.classes
  • self.image_names : a list containing the paths of all images in the dataset
  • self.image_data: dict[image_name] = [ {x1, y1, x2, y2, class_name}, …]

__getitem__ retrieves an image and its corresponding annotations by index. The returned format is:

sample = {'img': img, 'annot': annot}. If data augmentation is configured, it is applied before the sample is returned.
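
A minimal sketch of such a dataset class (the CSV-parsing details are omitted; the field names follow the description above):

import numpy as np
import skimage.io
from torch.utils.data import Dataset

class CSVDataset(Dataset):
    def __init__(self, train_file, class_list, transform=None):
        self.transform = transform
        # parsing of class_list / train_file omitted; assume it fills:
        self.classes = {}        # class_name -> class_id
        self.image_names = []    # list of image paths
        self.image_data = {}     # image_name -> [{'x1':.., 'y1':.., 'x2':.., 'y2':.., 'class':..}, ...]

    def __len__(self):
        return len(self.image_names)

    def __getitem__(self, idx):
        img = skimage.io.imread(self.image_names[idx]) / 255.0
        annot = self.load_annotations(idx)   # ndarray of shape [N, 5]: x1, y1, x2, y2, class_id
        sample = {'img': img, 'annot': annot}
        if self.transform:
            sample = self.transform(sample)
        return sample

    def load_annotations(self, idx):
        boxes = self.image_data[self.image_names[idx]]
        annot = np.zeros((len(boxes), 5))
        for i, b in enumerate(boxes):
            annot[i, :4] = [b['x1'], b['y1'], b['x2'], b['y2']]
            annot[i, 4] = self.classes[b['class']]
        return annot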

Data augmentation

There are many augmentation methods; common ones are flipping, cropping, resizing and normalization. Augmentation turns one image into many variants and thus effectively enlarges the dataset. A key reason it helps is that we want the model to be invariant to translation, viewpoint, scale, illumination and similar factors, and training on transformed copies teaches exactly that invariance. Augmentation can be done offline or online; in the online case, the data is augmented only when the dataloader fetches it.

Each augmentation is usually written as a class, and all of them are chained together with transforms.Compose([Augmenter(), Resizer()]) from PyTorch.
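
For example, the transform pipeline might be assembled like this (a sketch; Normalizer, Augmenter and Resizer are the classes described below, and the ordering shown is an assumption):

from torchvision import transforms

transform = transforms.Compose([Normalizer(), Augmenter(), Resizer()])
dataset_train = CSVDataset(train_file='train.csv', class_list='classes.csv', transform=transform)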

Normalizer

Implement a Normalizer class and override its __call__ method to normalize every image with the ImageNet channel mean and standard deviation.

import numpy as np

class Normalizer(object):

    def __init__(self):
        self.mean = np.array([[[0.485, 0.456, 0.406]]])
        self.std = np.array([[[0.229, 0.224, 0.225]]])

    def __call__(self, sample):
        image, annots = sample['img'], sample['annot']
        return {'img': ((image.astype(np.float32) - self.mean) / self.std), 'annot': annots}

Augmenter

This transform flips the image horizontally; note that the annotations have to be adjusted accordingly.
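
A sketch of such a flip augmenter (assuming annotations are [x1, y1, x2, y2, class] boxes in pixel coordinates):

import numpy as np

class Augmenter(object):
    """Randomly flip the image horizontally and mirror the box coordinates."""

    def __call__(self, sample, flip_x=0.5):
        if np.random.rand() < flip_x:
            image, annots = sample['img'], sample['annot']
            image = image[:, ::-1, :]      # flip along the width axis

            cols = image.shape[1]
            x1 = annots[:, 0].copy()
            x2 = annots[:, 2].copy()
            annots[:, 0] = cols - x2       # mirrored x1
            annots[:, 2] = cols - x1       # mirrored x2

            sample = {'img': image, 'annot': annots}
        return sample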

Resizer

This transform limits the image size to a given range: it finds the largest scale that brings the smallest side up to min_side without letting the largest side exceed max_side, and then pads so that both sides are divisible by 32.

import numpy as np
import skimage.transform
import torch

class Resizer(object):
    """Resize the image and annotations, then convert the ndarrays in sample to Tensors."""

    def __call__(self, sample, min_side=608, max_side=1024):
        # resize so that the smallest side becomes min_side, unless that would push
        # the largest side beyond max_side (one side ends up equal to its limit)
        image, annots = sample['img'], sample['annot']

        rows, cols, cns = image.shape

        smallest_side = min(rows, cols)

        # rescale the image so the smallest side is min_side
        scale = min_side / smallest_side

        # check if the largest side is now greater than max_side, which can happen
        # when images have a large aspect ratio
        largest_side = max(rows, cols)

        if largest_side * scale > max_side:
            scale = max_side / largest_side

        # resize the image with the computed scale
        image = skimage.transform.resize(image, (int(round(rows * scale)), int(round(cols * scale))))
        rows, cols, cns = image.shape

        pad_w = 32 - rows % 32
        pad_h = 32 - cols % 32

        # both sides must be divisible by 32; the missing part is filled with zeros
        new_image = np.zeros((rows + pad_w, cols + pad_h, cns)).astype(np.float32)
        new_image[:rows, :cols, :] = image.astype(np.float32)

        annots[:, :4] *= scale

        return {'img': torch.from_numpy(new_image), 'annot': torch.from_numpy(annots), 'scale': scale}

Batching the data with DataLoader

PyTorch uses the DataLoader to produce the data of each iteration during training. Its logic is: it repeatedly calls __getitem__() on the dataset to fetch single samples, combines them into a batch, and then applies collate_fn to that batch.

Parameters of torch.utils.data.DataLoader

dataset(Dataset) – dataset from which to load the data.

batch_size(int, optional) – how many samples per batch to load (default: 1).

shuffle(bool, optional) – set to True to have the data reshuffled at every epoch (default: False).

sampler(Sampler, optional) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.

batch_sampler(Sampler, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

num_workers(int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

collate_fn(callable, optional) – merges a list of samples to form a mini-batch.

pin_memory(bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory before returning them.

drop_last(bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

timeout(numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

worker_init_fn(callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

The algorithm uses the following parameters:

dataloader_train = DataLoader(dataset_train, num_workers=3, collate_fn=collater, batch_sampler=sampler)

Here dataset_train is an instance of the Dataset class that implements the file reading described above. num_workers sets the number of worker subprocesses used for loading. batch_sampler defines the strategy for returning one batch of sample indices at a time. collate_fn merges the resulting list of samples into a mini-batch.

First, the batch_sampler:

It inherits from the Sampler class and must implement the __len__ and __iter__ methods. Its job is to turn the dataset indices into a list made up of groups, one group per batch.

import random
from torch.utils.data.sampler import Sampler

class AspectRatioBasedSampler(Sampler):
    def __init__(self, data_source, batch_size, drop_last):
        self.data_source = data_source
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.groups = self.group_images()

    def __iter__(self):
        random.shuffle(self.groups)
        for group in self.groups:
            yield group

    def __len__(self):
        if self.drop_last:
            return len(self.data_source) // self.batch_size
        else:
            return (len(self.data_source) + self.batch_size - 1) // self.batch_size

    def group_images(self):
        # determine the order of the images, sorted by aspect ratio
        order = list(range(len(self.data_source)))
        order.sort(key=lambda x: self.data_source.image_aspect_ratio(x))

        # divide into groups, one group = one batch
        return [[order[x % len(order)] for x in range(i, i + self.batch_size)] for i in range(0, len(order), self.batch_size)]

As shown above, this sampler distributes the indices into groups and collects the groups into a list; the __iter__() method then yields them iteratively, one batch-sized group at a time.

The collate_fn parameter:

collate_fn receives the list of samples fetched for the indices produced by the batch_sampler and processes them further into a batch.

import numpy as np
import torch

def collater(data):
    imgs = [s['img'] for s in data]
    annots = [s['annot'] for s in data]
    scales = [s['scale'] for s in data]

    widths = [int(s.shape[0]) for s in imgs]
    heights = [int(s.shape[1]) for s in imgs]
    batch_size = len(imgs)

    max_width = np.array(widths).max()
    max_height = np.array(heights).max()

    # pad every image up to the largest height/width in the batch
    padded_imgs = torch.zeros(batch_size, max_width, max_height, 3)
    for i in range(batch_size):
        img = imgs[i]
        padded_imgs[i, :int(img.shape[0]), :int(img.shape[1]), :] = img

    # pad the annotations to the same number of boxes, filling with -1
    max_num_annots = max(annot.shape[0] for annot in annots)
    if max_num_annots > 0:
        annot_padded = torch.ones((len(annots), max_num_annots, 5)) * -1
        for idx, annot in enumerate(annots):
            if annot.shape[0] > 0:
                annot_padded[idx, :annot.shape[0], :] = annot
    else:
        annot_padded = torch.ones((len(annots), 1, 5)) * -1

    # NHWC -> NCHW
    padded_imgs = padded_imgs.permute(0, 3, 1, 2)
    return {'img': padded_imgs, 'annot': annot_padded, 'scale': scales}

The code above pads all images of a batch to the same size and pads the annotations to the same number of boxes; the images are then permuted to channel-first layout and a dict is returned.

These steps complete the data loader; its elements are then consumed with a for loop, as in the sketch below.
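
For example (a sketch using the objects defined above):

from torch.utils.data import DataLoader

sampler = AspectRatioBasedSampler(dataset_train, batch_size=2, drop_last=False)
dataloader_train = DataLoader(dataset_train, num_workers=3, collate_fn=collater, batch_sampler=sampler)

for data in dataloader_train:
    imgs, annots = data['img'], data['annot']   # imgs: [B, 3, H, W], annots: [B, max_boxes, 5]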

The RetinaNet network structure

Below, the components of RetinaNet are analyzed from the point of view of how the data flows through them.

RetinaNet uses a ResNet as its feature extractor. ResNet comes in five depths: 18, 34, 50, 101 and 152; 50 and 101 are the most commonly used:

import torch.utils.model_zoo as model_zoo

def resnet50(num_classes, pretrained=False, **kwargs):
    """Constructs a ResNet-50 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(num_classes, Bottleneck, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet50'], model_dir='.'), strict=False)
    return model

Going through it line by line: the first line instantiates the ResNet class. ResNet inherits from nn.Module and therefore implements the two methods __init__ and forward(); the learnable parameters are usually created in the constructor __init__(), and forward() describes how the data flows through the network, after which autograd handles differentiation automatically.

ResNet

ResNet introduced the idea of residual learning. In traditional convolutional or fully connected networks, information is inevitably lost or degraded as it is passed through the layers, and gradients can vanish or explode, which makes very deep networks impossible to train. By learning residuals, ResNet alleviates network degradation and vanishing gradients to a large extent: stacking many residual blocks deepens the network while keeping the gradients flowing. ResNet has two kinds of residual units, the BasicBlock and the Bottleneck. Deeper networks use the Bottleneck, which reduces the channel dimension before the costly convolution and thus greatly cuts the number of parameters.

In a Bottleneck, the feature first passes through a 1x1 convolution that compresses the channel dimension, then a 3x3 convolution is applied to the compressed feature, and finally another 1x1 convolution expands the channels back (to 4x the intermediate width).

The Bottleneck code:

import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)
        return out

Commonly used PyTorch building blocks:

Conv2d (convolution):

import torch.nn as nn
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
Parameters:
    in_channels (int) – number of channels of the input signal
    out_channels (int) – number of channels produced by the convolution
    kernel_size (int or tuple) – size of the convolution kernel
    stride (int or tuple, optional) – stride of the convolution
    padding (int or tuple, optional) – number of zeros added to each side of the input
    dilation (int or tuple, optional) – spacing between kernel elements
    groups (int, optional) – number of blocked connections from input channels to output channels
    bias (bool, optional) – if bias=True, adds a learnable bias
Input:
    input: (N, C_in, H_in, W_in)
Output:
    output: (N, C_out, H_out, W_out)
Output size: F_out = (F_in + 2*padding - kernel)/stride + 1
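
As a quick check of the output-size formula (an illustrative example): with a 3x3 kernel, padding 1 and stride 2 on a 64x64 input, F_out = (64 + 2*1 - 3)/2 + 1 = 32, so the spatial size is halved.

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 256, 64, 64)
print(conv(x).shape)   # torch.Size([1, 256, 32, 32])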

BatchNorm2d:

During training, this layer computes the mean and variance of each input batch and keeps a running (moving) average of them; the default momentum of the moving average is 0.1.

At evaluation time, the mean/variance estimated during training are used to normalize the validation data.

BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True)
Parameters:
    num_features: number of features of the expected input, whose size is 'batch_size x num_features x height x width'
    eps: value added to the denominator for numerical stability (so it cannot approach or reach 0). Default: 1e-5
    momentum: momentum used for the running mean and running variance. Default: 0.1
    affine: a boolean; when True, the layer has learnable affine parameters.
Input: (N, C, H, W) - Output: (N, C, H, W)
Note that num_features is simply the number of channels.

ReLU: the rectified linear unit

nn.ReLU(inplace=False)
Parameters:
    inplace: whether to compute in place (overwriting the input) to save memory
ReLU does not change the dimensions of the data.

The MaxPool2d layer

nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)
Parameters:
    kernel_size (int or tuple) – size of the max pooling window
    stride (int or tuple, optional) – stride of the pooling window. Default: kernel_size
    padding (int or tuple, optional) – number of zeros added to each side of the input
    dilation (int or tuple, optional) – a parameter that controls the stride of elements in the window
    return_indices – if True, also return the indices of the max values, which is useful for unpooling
    ceil_mode – if True, use ceil instead of the default floor when computing the output size
Input: (N, C, H_in, W_in)
Output: (N, C, H_out, W_out)
Output size: F_out = (F_in + 2*padding - kernel)/stride + 1

nn.Upsample upsamples the spatial dimensions of the feature map (the number of channels stays the same):

nn.Upsample(size=None, scale_factor=None, mode='nearest', align_corners=None)
The upsampling strategy is given by mode and the amount of upsampling by scale_factor (or by a target size).
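
A quick shape check (illustrative):

import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode='nearest')
x = torch.randn(1, 256, 20, 26)
print(up(x).shape)   # torch.Size([1, 256, 40, 52]) -- channels unchanged, H and W doubled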

nn.Sequential is an ordered container: the modules passed to its constructor are added to the computation graph and executed in that order; an ordered dict whose values are modules can also be passed in.

downsample = nn.Sequential(
    nn.Conv2d(self.inplanes, planes * block.expansion,
              kernel_size=1, stride=stride, bias=False),
    nn.BatchNorm2d(planes * block.expansion),
)

A network class inherits from nn.Module and implements the two methods __init__ and forward(): the layers are created and initialized in __init__, and forward() describes how the data flows through them.

The code of RetinaNet's feature pyramid:

import torch.nn as nn

class PyramidFeatures(nn.Module):
    def __init__(self, C3_size, C4_size, C5_size, feature_size=256):
        super(PyramidFeatures, self).__init__()

        # upsample C5 to get P5 from the FPN paper
        self.P5_1 = nn.Conv2d(C5_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P5_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P5_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P5 elementwise to C4
        self.P4_1 = nn.Conv2d(C4_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P4_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P4_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P4 elementwise to C3
        self.P3_1 = nn.Conv2d(C3_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P3_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # "P6 is obtained via a 3x3 stride-2 conv on C5"
        self.P6 = nn.Conv2d(C5_size, feature_size, kernel_size=3, stride=2, padding=1)

        # "P7 is computed by applying ReLU followed by a 3x3 stride-2 conv on P6"
        self.P7_1 = nn.ReLU()
        self.P7_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        C3, C4, C5 = inputs

        P5_x = self.P5_1(C5)
        P5_upsampled_x = self.P5_upsampled(P5_x)
        P5_x = self.P5_2(P5_x)

        P4_x = self.P4_1(C4)
        P4_x = P5_upsampled_x + P4_x
        P4_upsampled_x = self.P4_upsampled(P4_x)
        P4_x = self.P4_2(P4_x)

        P3_x = self.P3_1(C3)
        P3_x = P3_x + P4_upsampled_x
        P3_x = self.P3_2(P3_x)

        P6_x = self.P6(C5)

        P7_x = self.P7_1(P6_x)
        P7_x = self.P7_2(P7_x)

        return [P3_x, P4_x, P5_x, P6_x, P7_x]

After the pyramid, RetinaNet attaches a regression subnet and a classification subnet, which predict the box locations and the class scores respectively.

The regression subnet simply stacks five convolution layers that keep the spatial size of the feature map; the last layer reduces the number of channels to num_anchors x 4, i.e. at every spatial position it regresses 4 coordinates for each of the num_anchors anchors.

import torch.nn as nn

class RegressionModel(nn.Module):
    def __init__(self, num_features_in, num_anchors=9, feature_size=256):
        super(RegressionModel, self).__init__()

        self.conv1 = nn.Conv2d(num_features_in, feature_size, kernel_size=3, padding=1)
        self.act1 = nn.ReLU()

        self.conv2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act2 = nn.ReLU()

        self.conv3 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act3 = nn.ReLU()

        self.conv4 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act4 = nn.ReLU()

        self.output = nn.Conv2d(feature_size, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv1(x)
        out = self.act1(out)

        out = self.conv2(out)
        out = self.act2(out)

        out = self.conv3(out)
        out = self.act3(out)

        out = self.conv4(out)
        out = self.act4(out)

        out = self.output(out)

        # out is B x C x W x H, with C = 4*num_anchors
        out = out.permute(0, 2, 3, 1)

        return out.contiguous().view(out.shape[0], -1, 4)

The last line deserves attention: view() is the equivalent of numpy's reshape, but it requires the data to be stored contiguously in memory. Because permute changes the layout (it is a shallow view over the same storage), contiguous() must be called before view(). The final out has shape [batch_size, W x H x num_anchors, 4]. This out is eventually fed into the criterion to compute the loss.
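
A tiny illustration of why contiguous() is needed after permute():

import torch

out = torch.randn(2, 36, 20, 26)         # B x (4*num_anchors) x H x W
out = out.permute(0, 2, 3, 1)             # B x H x W x (4*num_anchors); strides change, memory does not move
# out.view(2, -1, 4)                      # would raise a RuntimeError: view needs contiguous memory
out = out.contiguous().view(2, -1, 4)     # copy into contiguous memory, then reshape
print(out.shape)                          # torch.Size([2, 4680, 4]) = [B, H*W*num_anchors, 4]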

The classification subnet has the same structure as the regression subnet; the only difference is the number of output channels, which is the number of anchors times the number of classes (num_anchors x num_classes), i.e. every anchor predicts a score for every class.

import torch.nn as nn

class ClassificationModel(nn.Module):
    def __init__(self, num_features_in, num_anchors=9, num_classes=80, prior=0.01, feature_size=256):
        super(ClassificationModel, self).__init__()

        self.num_classes = num_classes
        self.num_anchors = num_anchors

        self.conv1 = nn.Conv2d(num_features_in, feature_size, kernel_size=3, padding=1)
        self.act1 = nn.ReLU()

        self.conv2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act2 = nn.ReLU()

        self.conv3 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act3 = nn.ReLU()

        self.conv4 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act4 = nn.ReLU()

        self.output = nn.Conv2d(feature_size, num_anchors * num_classes, kernel_size=3, padding=1)
        self.output_act = nn.Sigmoid()

    def forward(self, x):
        out = self.conv1(x)
        out = self.act1(out)

        out = self.conv2(out)
        out = self.act2(out)

        out = self.conv3(out)
        out = self.act3(out)

        out = self.conv4(out)
        out = self.act4(out)

        out = self.output(out)
        out = self.output_act(out)

        # out is B x C x W x H, with C = num_classes * num_anchors
        out1 = out.permute(0, 2, 3, 1)

        batch_size, width, height, channels = out1.shape

        out2 = out1.view(batch_size, width, height, self.num_anchors, self.num_classes)

        return out2.contiguous().view(x.shape[0], -1, self.num_classes)

In the last lines, out is first reshaped so that its trailing dimensions become num_anchors x num_classes, and a final view turns it into [x.shape[0], W x H x num_anchors, num_classes]; each row is the class prediction of one anchor and is later used in the criterion.

torch.cat usage: https://blog.csdn.net/qq_39709535/article/details/80803003

Next, the anchors have to be generated.

Anchor generation

Anchors are configured per level for RetinaNet's outputs P3, P4, P5, P6 and P7. There are three aspect ratios and three scales, giving 9 combinations per position, and the base anchor size is tied to the feature-map level.

self.ratios = np.array([0.5,1,2])
self.scales = np.array([2**0,2**(1.0/3.0),2**(2.0/3.0)])
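
With these settings, the 9 (ratio, scale) combinations per position can be enumerated directly (illustrative):

import numpy as np

ratios = np.array([0.5, 1, 2])
scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])

# every ratio is paired with every scale -> 9 anchors per location
combos = [(r, s) for r in ratios for s in scales]
print(len(combos))   # 9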

A few frequently used functions:

a = [1, 2, 3]
a = np.tile(a, (2, 3))
# a = [[1, 2, 3, 1, 2, 3, 1, 2, 3],
#      [1, 2, 3, 1, 2, 3, 1, 2, 3]]

np.repeat

a = [1, 2, 3]
a = np.repeat(a, 2)
# a = [1, 1, 2, 2, 3, 3]
# Unlike np.tile, repeat duplicates element by element (keeping copies adjacent); tile repeats the whole array.

The anchor-generation code:

import numpy as np
import torch
import torch.nn as nn

class Anchors(nn.Module):
    def __init__(self, pyramid_levels=None, strides=None, sizes=None, ratios=None, scales=None):
        super(Anchors, self).__init__()

        if pyramid_levels is None:
            self.pyramid_levels = [3, 4, 5, 6, 7]
        if strides is None:
            self.strides = [2 ** x for x in self.pyramid_levels]
        if sizes is None:
            self.sizes = [2 ** (x + 2) for x in self.pyramid_levels]
        if ratios is None:
            self.ratios = np.array([0.5, 1, 2])
        if scales is None:
            self.scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])

    def forward(self, image):

        # image = [2, 3, 640, 832]
        image_shape = image.shape[2:]
        image_shape = np.array(image_shape)
        image_shapes = [(image_shape + 2 ** x - 1) // (2 ** x) for x in self.pyramid_levels]

        # compute anchors over all pyramid levels
        all_anchors = np.zeros((0, 4)).astype(np.float32)

        for idx, p in enumerate(self.pyramid_levels):
            anchors = generate_anchors(base_size=self.sizes[idx], ratios=self.ratios, scales=self.scales)
            shifted_anchors = shift(image_shapes[idx], self.strides[idx], anchors)
            all_anchors = np.append(all_anchors, shifted_anchors, axis=0)

        all_anchors = np.expand_dims(all_anchors, axis=0)

        return torch.from_numpy(all_anchors.astype(np.float32)).cuda()

def generate_anchors(base_size=16, ratios=None, scales=None):
    """
    Generate anchor (reference) windows by enumerating aspect ratios X
    scales w.r.t. a reference window.
    """

    if ratios is None:
        ratios = np.array([0.5, 1, 2])

    if scales is None:
        scales = np.array([2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)])

    num_anchors = len(ratios) * len(scales)  # 9 anchors per location

    # initialize output anchors: 9 anchors per location, each with four coordinates
    anchors = np.zeros((num_anchors, 4))

    # scale base_size: the per-level base size times each scale gives the anchor size at that level
    anchors[:, 2:] = base_size * np.tile(scales, (2, len(ratios))).T

    # compute areas of anchors
    areas = anchors[:, 2] * anchors[:, 3]

    # correct for ratios: construct the aspect ratios while keeping each area
    anchors[:, 2] = np.sqrt(areas / np.repeat(ratios, len(scales)))
    anchors[:, 3] = anchors[:, 2] * np.repeat(ratios, len(scales))

    # transform from (x_ctr, y_ctr, w, h) -> (x1, y1, x2, y2)
    anchors[:, 0::2] -= np.tile(anchors[:, 2] * 0.5, (2, 1)).T
    anchors[:, 1::2] -= np.tile(anchors[:, 3] * 0.5, (2, 1)).T

    return anchors

def shift(shape, stride, anchors):
    shift_x = (np.arange(0, shape[1]) + 0.5) * stride
    shift_y = (np.arange(0, shape[0]) + 0.5) * stride

    shift_x, shift_y = np.meshgrid(shift_x, shift_y)

    # shifts has shape [shape[0]*shape[1], 4]
    shifts = np.vstack((
        shift_x.ravel(), shift_y.ravel(),
        shift_x.ravel(), shift_y.ravel()
    )).transpose()

    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    A = anchors.shape[0]
    K = shifts.shape[0]
    # the addition below is broadcast:
    # [1, A, 4] + [K, 1, 4] = [K, A, 4], where K = shape[0] * shape[1],
    # i.e. every pixel location gets 9 anchors, each with four coordinates;
    # shape is the per-level feature-map size computed in forward(),
    # and the per-level anchor sizes are initialized in __init__
    all_anchors = (anchors.reshape((1, A, 4)) +
                   shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    all_anchors = all_anchors.reshape((K * A, 4))
    return all_anchors

Going through it: first the per-level values of level, stride, size, ratio and scale are set. Then, in forward(), generate_anchors() produces for each level the anchors of the required size, 9 per position from the ratio/scale combinations (see the code for the exact settings).

Next comes the shift() function, whose job is to spread the anchors over every location. Roughly: an image comes in, its feature-map size at each level is computed, and at every pixel position 9 anchors of the per-level size are placed, returning a matrix of shape $[shape[0] \times shape[1] \times 9, 4]$.

A few functions:

np.meshgrid(x, y)        # pair every element of x with every element of y to form coordinate grids
np.vstack((x, y))        # stack x and y vertically
# ravel()
a = np.array([[2, 2], [1, 1]])
a.ravel()                # flatten a multi-dimensional array without making a new copy: [2, 2, 1, 1]
a.flatten()              # same result, but returns a copy of the data
np.squeeze([[1], [2], [3]])  # remove dimensions of size 1, giving [1, 2, 3]
a = a.reshape(-1)        # also yields a 1-D array
a.transpose()            # with no arguments, transposes the matrix

After the above, the for loop gathers the anchors of all 5 levels together, and anchor generation is complete.

The torch.cat function

a = torch.tensor([[1, 2, 3]])
b = torch.tensor([[3, 4, 5]])
torch.cat((a, b), 0)   # stack along dim 0 (vertically): [[1, 2, 3], [3, 4, 5]]
torch.cat((a, b), 1)   # concatenate along dim 1 (horizontally): [[1, 2, 3, 3, 4, 5]]

The focal loss

The focal loss follows directly after the part above. Let us recap where the data has flowed so far in the network:

The image is fed into the ResNet backbone and then through the multi-level feature pyramid, which outputs 5 feature maps of different strides (P3, P4, P5, P6, P7). Each of them is fed into the regression subnet and the classification subnet, yielding per-level outputs of shape $[batch, W \times H \times anchors, 4]$ and $[batch, W \times H \times anchors, num\_classes]$; the per-level results are then concatenated along the anchor dimension, so the predictions of all anchors over all levels end up in regression_anchor and classification_anchor. The next step is to judge how good these anchors are: the anchors pre-generated from our prior settings and the network outputs are fed into the focal loss, where the anchors are matched and filtered and the loss is computed.

The focal loss in detail:

The focal loss works batch by batch: one batch of data is processed at a time and its loss is computed. First, the IoU between the preset anchors and the ground truth of the current image is computed (intersection area divided by union area).

import torch

def calc_iou(a, b):
    area = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    iw = torch.min(torch.unsqueeze(a[:, 2], dim=1), b[:, 2]) - \
         torch.max(torch.unsqueeze(a[:, 0], 1), b[:, 0])
    ih = torch.min(torch.unsqueeze(a[:, 3], dim=1), b[:, 3]) - \
         torch.max(torch.unsqueeze(a[:, 1], 1), b[:, 1])
    iw = torch.clamp(iw, min=0)
    ih = torch.clamp(ih, min=0)
    ua = torch.unsqueeze((a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1]), dim=1) + area - iw * ih
    ua = torch.clamp(ua, min=1e-8)
    intersection = iw * ih
    IoU = intersection / ua
    return IoU

The focal loss acts mainly on the classification output produced for every anchor. Its principle is the following:
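
The standard focal-loss formula from the RetinaNet paper, where $p_t$ is the predicted probability of the ground-truth class, is

$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$

With $\gamma > 0$, well-classified examples ($p_t$ close to 1) are down-weighted so that training focuses on hard examples, while $\alpha_t$ balances positive and negative anchors. The code below uses $\alpha = 0.25$ and $\gamma = 2$.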

The loss of the whole network actually consists of two parts: a classification loss and a regression loss. The classification loss is the focal loss, and the regression part is the bounding-box regression loss (a smooth L1 in the code below). The implementation:

import torch
import torch.nn as nn

class FocalLoss(nn.Module):

    def forward(self, classifications, regressions, anchors, annotations):
        alpha = 0.25
        gamma = 2.0
        batch_size = classifications.shape[0]
        classification_losses = []
        regression_losses = []

        anchor = anchors[0, :, :]

        anchor_widths = anchor[:, 2] - anchor[:, 0]
        anchor_heights = anchor[:, 3] - anchor[:, 1]
        anchor_ctr_x = anchor[:, 0] + 0.5 * anchor_widths
        anchor_ctr_y = anchor[:, 1] + 0.5 * anchor_heights

        for j in range(batch_size):

            classification = classifications[j, :, :]
            regression = regressions[j, :, :]

            bbox_annotation = annotations[j, :, :]
            bbox_annotation = bbox_annotation[bbox_annotation[:, 4] != -1]

            if bbox_annotation.shape[0] == 0:
                regression_losses.append(torch.tensor(0).float().cuda())
                classification_losses.append(torch.tensor(0).float().cuda())
                continue

            classification = torch.clamp(classification, 1e-4, 1.0 - 1e-4)

            IoU = calc_iou(anchors[0, :, :], bbox_annotation[:, :4])  # num_anchors x num_annotations

            IoU_max, IoU_argmax = torch.max(IoU, dim=1)  # num_anchors x 1

            # compute the loss for classification
            # targets has one column per class
            targets = torch.ones(classification.shape) * -1
            targets = targets.cuda()

            # lt: less than. If IoU_max is below 0.4 the anchor is treated as unmatched (background)
            targets[torch.lt(IoU_max, 0.4), :] = 0

            positive_indices = torch.ge(IoU_max, 0.5)

            num_positive_anchors = positive_indices.sum()

            # IoU_argmax records which GT box best matches each anchor;
            # the indexing below assigns that GT (all of its coordinates) to each anchor
            assigned_annotations = bbox_annotation[IoU_argmax, :]

            targets[positive_indices, :] = 0
            # for every anchor that passes the IoU threshold, set its assigned class to 1,
            # producing a one-hot row (targets has as many columns as there are classes)
            targets[positive_indices, assigned_annotations[positive_indices, 4].long()] = 1

            alpha_factor = torch.ones(targets.shape).cuda() * alpha

            alpha_factor = torch.where(torch.eq(targets, 1.), alpha_factor, 1. - alpha_factor)

            # compute the focal weight for all entries at once, then apply it
            focal_weight = torch.where(torch.eq(targets, 1.), 1. - classification, classification)
            focal_weight = alpha_factor * torch.pow(focal_weight, gamma)

            # when y=1 only the targets==1 term contributes; when y=0 only the targets==0 term does
            bce = -(targets * torch.log(classification) + (1.0 - targets) * torch.log(1.0 - classification))

            cls_loss = focal_weight * bce

            # note the treatment of targets: anchors with IoU in [0.4, 0.5) keep target = -1
            # and contribute no loss; every other entry gets a cls_loss
            cls_loss = torch.where(torch.ne(targets, -1.0), cls_loss, torch.zeros(cls_loss.shape).cuda())

            # average the summed loss over the number of positive anchors
            classification_losses.append(cls_loss.sum() / torch.clamp(num_positive_anchors.float(), min=1.0))

            # compute the loss for regression:
            # only anchors assigned as positives take part in box regression
            if positive_indices.sum() > 0:
                assigned_annotations = assigned_annotations[positive_indices, :]

                anchor_widths_pi = anchor_widths[positive_indices]
                anchor_heights_pi = anchor_heights[positive_indices]
                anchor_ctr_x_pi = anchor_ctr_x[positive_indices]
                anchor_ctr_y_pi = anchor_ctr_y[positive_indices]

                gt_widths = assigned_annotations[:, 2] - assigned_annotations[:, 0]
                gt_heights = assigned_annotations[:, 3] - assigned_annotations[:, 1]
                gt_ctr_x = assigned_annotations[:, 0] + 0.5 * gt_widths
                gt_ctr_y = assigned_annotations[:, 1] + 0.5 * gt_heights

                # clip widths to 1
                gt_widths = torch.clamp(gt_widths, min=1)
                gt_heights = torch.clamp(gt_heights, min=1)

                targets_dx = (gt_ctr_x - anchor_ctr_x_pi) / anchor_widths_pi
                targets_dy = (gt_ctr_y - anchor_ctr_y_pi) / anchor_heights_pi
                targets_dw = torch.log(gt_widths / anchor_widths_pi)
                targets_dh = torch.log(gt_heights / anchor_heights_pi)

                targets = torch.stack((targets_dx, targets_dy, targets_dw, targets_dh))
                targets = targets.t()

                targets = targets / torch.Tensor([[0.1, 0.1, 0.2, 0.2]]).cuda()

                # smooth L1 loss between the regression targets and the predictions
                regression_diff = torch.abs(targets - regression[positive_indices, :])

                regression_loss = torch.where(
                    torch.le(regression_diff, 1.0 / 9.0),
                    0.5 * 9.0 * torch.pow(regression_diff, 2),
                    regression_diff - 0.5 / 9.0
                )
                regression_losses.append(regression_loss.mean())
            else:
                regression_losses.append(torch.tensor(0).float().cuda())

        return torch.stack(classification_losses).mean(dim=0, keepdim=True), torch.stack(regression_losses).mean(dim=0, keepdim=True)

The box-regression branch learns a translation and a scaling of the anchor relative to its assigned ground-truth box:
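
Concretely, with anchor center and size $(a_x, a_y, a_w, a_h)$ and assigned ground-truth box $(g_x, g_y, g_w, g_h)$, the regression targets computed in the code above are

$t_x = (g_x - a_x) / a_w$, $t_y = (g_y - a_y) / a_h$, $t_w = \log(g_w / a_w)$, $t_h = \log(g_h / a_h)$,

which are then divided elementwise by [0.1, 0.1, 0.2, 0.2] before the smooth L1 loss is applied.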

Finally, the mean classification loss and the mean regression loss are stacked and returned to the next step.

A few functions:

torch.cat((a, b))                 # concatenate a and b (along dim 0 by default; pass a dim for other axes)
torch.clamp(a, min_val, max_val)  # clip the values of a into [min_val, max_val]
max_val, max_index = torch.max(a, dim=1)  # per-row maximum and the index of that maximum
torch.lt(a, 0.4)                  # boolean mask of the elements of a that are less than 0.4; ge etc. work the same way
torch.where(condition, true_val, false_val)  # elementwise: take true_val where the condition holds, otherwise false_val;
                                             # all arguments are broadcast to the same shape

The training stage

The training part has several tasks to complete:

  1. Initialize the network, set up the optimizer, and so on
  2. Fetch the data from the dataloader
  3. Feed the data into the network and obtain the loss values
  4. Backpropagate the loss; steps such as lowering the learning rate and clipping gradients can be done here as well
  5. Print the result of each training batch
  6. When a set number of epochs is reached, evaluate the network
  7. Save the checkpoints with the higher mAP

Walking through the code:

# move training to the GPU
use_gpu = True
if use_gpu:
    retinanet = retinanet.cuda()

retinanet = torch.nn.DataParallel(retinanet).cuda()
retinanet.training = True

# use Adam as the optimizer
optimizer = optim.Adam(retinanet.parameters(), lr=1e-5)
# learning-rate scheduler that reduces the lr when the loss plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, verbose=True)

loss_hist = collections.deque(maxlen=500)  # supports fast append/pop at both ends

retinanet.train()
retinanet.module.freeze_bn()

print('Num training images: {}'.format(len(dataset_train)))

# fetch the data from the dataloader
for epoch_num in range(parser.epochs):
    retinanet.train()
    retinanet.module.freeze_bn()
    epoch_loss = []
    for iter_num, data in enumerate(dataloader_train):
        try:
            # clear the gradients: PyTorch accumulates gradients on every backward(),
            # which is convenient when training RNNs, but for ordinary models the
            # accumulated gradients must be cleared so that each step follows the
            # gradient of the current batch only
            optimizer.zero_grad()

            # feed the data into the network and get the losses
            classification_loss, regression_loss = retinanet([data['img'].cuda().float(), data['annot']])
            classification_loss = classification_loss.mean()
            regression_loss = regression_loss.mean()
            loss = classification_loss + regression_loss
            if bool(loss == 0):
                continue

            # backpropagate the error
            loss.backward()
            # gradient clipping: the second argument caps the gradient norm at 0.1
            torch.nn.utils.clip_grad_norm_(retinanet.parameters(), 0.1)
            optimizer.step()

            loss_hist.append(float(loss))
            epoch_loss.append(float(loss))

            print('Epoch: {} | Iteration: {} | Classification loss: {:1.5f} | Regression loss: {:1.5f} | Running loss: {:1.5f}'.format(epoch_num, iter_num, float(classification_loss), float(regression_loss), np.mean(loss_hist)))

            del classification_loss
            del regression_loss
        except Exception as e:
            print(e)
            continue

    if parser.dataset == 'coco':
        print('Evaluating dataset')
        # validate the model on the validation set
        coco_eval.evaluate_coco(dataset_val, retinanet)
    elif parser.dataset == 'csv' and parser.csv_val is not None:
        print('Evaluating dataset')
        mAP = csv_eval.evaluate(dataset_val, retinanet)

    scheduler.step(np.mean(epoch_loss))
    # save the trained model
    torch.save(retinanet.module, '{}_retinanet_{}.pt'.format(parser.dataset, epoch_num))

retinanet.eval()
torch.save(retinanet, 'model_final.pt'.format(epoch_num))

Points to note:

Before training or validating, the network is usually first switched into the corresponding mode:

model.train()
# or, for evaluation
model.eval()

This is necessary because the model behaves differently in train and eval mode: for example, BatchNorm switches to its running statistics and Dropout is disabled during eval, so the mode has to be set beforehand.

Evaluation

Evaluation measures the network's performance. It is placed at the end of the pipeline: after each epoch, or once a set number of epochs is reached, the network is tested, and the result decides whether the network is saved.

The usual evaluation metric is mAP, which is derived from recall and precision. First the network's detection results are collected, then the ground-truth annotations are read, and the mAP is computed from the two.

Collecting the network's detections:

import numpy as np
import torch

def _get_detections(dataset, retinanet, score_threshold=0.05, max_detections=100, save_path=None):
    """ Get the detections from the retinanet using the generator.
    The result is a list of lists such that the size is:
        all_detections[num_images][num_classes] = detections[num_detections, 4 + num_classes]
    # Arguments
        dataset         : The generator used to run images through the retinanet.
        retinanet       : The retinanet to run on the images.
        score_threshold : The score confidence threshold to use.
        max_detections  : The maximum number of detections to use per image.
        save_path       : The path to save the images with visualized detections to.
    # Returns
        A list of lists containing the detections for each image in the generator.
    """
    all_detections = [[None for i in range(dataset.num_classes())] for j in range(len(dataset))]

    retinanet.eval()

    with torch.no_grad():
        for index in range(len(dataset)):
            data = dataset[index]
            scale = data['scale']

            # run network
            scores, labels, boxes = retinanet(data['img'].permute(2, 0, 1).cuda().float().unsqueeze(dim=0))
            scores = scores.cpu().numpy()
            labels = labels.cpu().numpy()
            boxes = boxes.cpu().numpy()

            # correct boxes for image scale
            boxes /= scale

            # select indices which have a score above the threshold
            indices = np.where(scores > score_threshold)[0]
            if indices.shape[0] > 0:
                # select those scores
                scores = scores[indices]

                # sort the scores from high to low and keep at most max_detections of them
                scores_sort = np.argsort(-scores)[:max_detections]

                # select detections, ordered by decreasing score
                image_boxes = boxes[indices[scores_sort], :]
                image_scores = scores[scores_sort]
                image_labels = labels[indices[scores_sort]]
                image_detections = np.concatenate([image_boxes, np.expand_dims(image_scores, axis=1), np.expand_dims(image_labels, axis=1)], axis=1)

                # copy detections to all_detections:
                # for each image index, iterate over all labels and store the detections
                # belonging to that label (possibly an empty array)
                for label in range(dataset.num_classes()):
                    all_detections[index][label] = image_detections[image_detections[:, -1] == label, :-1]
            else:
                # copy detections to all_detections
                for label in range(dataset.num_classes()):
                    all_detections[index][label] = np.zeros((0, 5))

            print('{}/{}'.format(index + 1, len(dataset)), end='\r')

    return all_detections

Reading the ground-truth annotations of the images from the annotation file:

def _get_annotations(generator):
    """ Get the ground truth annotations from the generator.
    The result is a list of lists such that the size is:
        all_annotations[num_images][num_classes] = annotations[num_detections, 5]
    # Arguments
        generator : The generator used to retrieve ground truth annotations.
    # Returns
        A list of lists containing the annotations for each image in the generator.
    """
    all_annotations = [[None for i in range(generator.num_classes())] for j in range(len(generator))]

    for i in range(len(generator)):
        # load the annotations
        annotations = generator.load_annotations(i)

        # copy detections to all_annotations
        for label in range(generator.num_classes()):
            all_annotations[i][label] = annotations[annotations[:, 4] == label, :4].copy()

        print('{}/{}'.format(i + 1, len(generator)), end='\r')

    return all_annotations

With the annotations and the detections in hand, mAP is computed from recall (the fraction of all ground-truth positives that are correctly detected) and precision (the fraction of the positive predictions that are correct); the final mAP is obtained by integrating the precision-recall curve.

recall = TP / (TP + FN), i.e. the true positives as a fraction of all actual positives

precision = TP / (TP + FP), i.e. the true positives as a fraction of everything predicted as positive

In TP, FP, TN and FN, the first letter says whether the prediction is correct and the second letter says what was predicted (positive or negative class). More details on how mAP is computed can be found here.

The following code computes mAP:

import numpy as np

def compute_overlap(a, b):
    """
    Parameters
    ----------
    a: (N, 4) ndarray of float
    b: (K, 4) ndarray of float
    Returns
    -------
    overlaps: (N, K) ndarray of overlap between boxes and query_boxes
    """
    area = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

    iw = np.minimum(np.expand_dims(a[:, 2], axis=1), b[:, 2]) - np.maximum(np.expand_dims(a[:, 0], 1), b[:, 0])
    ih = np.minimum(np.expand_dims(a[:, 3], axis=1), b[:, 3]) - np.maximum(np.expand_dims(a[:, 1], 1), b[:, 1])

    iw = np.maximum(iw, 0)
    ih = np.maximum(ih, 0)

    ua = np.expand_dims((a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1]), axis=1) + area - iw * ih
    ua = np.maximum(ua, np.finfo(float).eps)

    intersection = iw * ih

    return intersection / ua


def _compute_ap(recall, precision):
    """ Compute the average precision, given the recall and precision curves.
    Code originally from https://github.com/rbgirshick/py-faster-rcnn.
    # Arguments
        recall:    The recall curve (list).
        precision: The precision curve (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """
    # correct AP calculation
    # first append sentinel values at the end
    mrec = np.concatenate(([0.], recall, [1.]))
    mpre = np.concatenate(([0.], precision, [0.]))

    # compute the precision envelope
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # to calculate area under PR curve, look for points
    # where X axis (recall) changes value
    i = np.where(mrec[1:] != mrec[:-1])[0]

    # and sum (\Delta recall) * prec
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap

def evaluate(
    generator,
    retinanet,
    iou_threshold=0.5,
    score_threshold=0.05,
    max_detections=100,
    save_path=None
):
    """ Evaluate a given dataset using a given retinanet.
    # Arguments
        generator       : The generator that represents the dataset to evaluate.
        retinanet       : The retinanet to evaluate.
        iou_threshold   : The threshold used to consider when a detection is positive or negative.
        score_threshold : The score confidence threshold to use for detections.
        max_detections  : The maximum number of detections to use per image.
        save_path       : The path to save images with visualized detections to.
    # Returns
        A dict mapping class names to mAP scores.
    """
    # gather all detections and annotations
    all_detections = _get_detections(generator, retinanet, score_threshold=score_threshold, max_detections=max_detections, save_path=save_path)
    all_annotations = _get_annotations(generator)

    average_precisions = {}

    for label in range(generator.num_classes()):
        false_positives = np.zeros((0,))
        true_positives = np.zeros((0,))
        scores = np.zeros((0,))
        num_annotations = 0.0

        for i in range(len(generator)):
            detections = all_detections[i][label]
            annotations = all_annotations[i][label]
            num_annotations += annotations.shape[0]
            detected_annotations = []

            for d in detections:
                scores = np.append(scores, d[4])

                if annotations.shape[0] == 0:
                    # the current image has no ground truth for this class,
                    # so every detection here counts as a false positive
                    false_positives = np.append(false_positives, 1)
                    true_positives = np.append(true_positives, 0)
                    continue

                overlaps = compute_overlap(np.expand_dims(d, axis=0), annotations)
                # for each detection, find the ground-truth box it overlaps most; argmax returns its index
                assigned_annotation = np.argmax(overlaps, axis=1)
                max_overlap = overlaps[0, assigned_annotation]

                if max_overlap >= iou_threshold and assigned_annotation not in detected_annotations:
                    false_positives = np.append(false_positives, 0)
                    true_positives = np.append(true_positives, 1)
                    detected_annotations.append(assigned_annotation)
                else:
                    false_positives = np.append(false_positives, 1)
                    true_positives = np.append(true_positives, 0)

        # no annotations -> AP for this class is 0 (is this correct?)
        if num_annotations == 0:
            average_precisions[label] = 0, 0
            continue

        # sort by score
        indices = np.argsort(-scores)
        false_positives = false_positives[indices]
        true_positives = true_positives[indices]

        # compute false positives and true positives:
        # cumulative sums over the score-sorted detections
        false_positives = np.cumsum(false_positives)
        true_positives = np.cumsum(true_positives)

        # compute recall and precision
        recall = true_positives / num_annotations
        precision = true_positives / np.maximum(true_positives + false_positives, np.finfo(np.float64).eps)

        # compute average precision
        average_precision = _compute_ap(recall, precision)
        average_precisions[label] = average_precision, num_annotations

    print('\nmAP:')
    for label in range(generator.num_classes()):
        label_name = generator.label_to_name(label)
        print('{}: {}'.format(label_name, average_precisions[label][0]))

    return average_precisions

A few functions:

np.argsort(scores)            # indices that would sort the array in ascending order (smallest first)
np.argmax(overlaps, axis=1)   # index of the maximum along each row
np.cumsum(nums)               # array of running sums, accumulated from the start up to each position

Summary

The steps above roughly cover building a network, feeding data through it, computing the loss, and running the final evaluation.

In short:

  1. Build the dataloader, which handles reading and augmenting the data
  2. Build the network
  3. Train the network
  4. Run the tests on the validation set