Fixing the "detect layer output is not supported yet" error when running YOLOv5 under DJL

Symptom

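When a YOLOv5 model built from unofficial code is loaded in the DJL framework, it fails with the error "detect layer output is not supported yet, check correct YoloV5 export format". Exporting to a different format, whether TorchScript or ONNX, does not help. The official YOLOv5 model runs without problems. Inspecting the DJL source code and the model output shows that the error is raised because the output tensor of the detection model has the wrong shape.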

Cause

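In the YOLO architecture, the output stage is expected to transform the raw predictions into boxes, followed by non-maximum suppression. This version of the unofficial code does not build that transformation into the exported model.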
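As described in the original YOLOv3 paper, the model predicts at three scales, at 32x, 16x, and 8x downsampling of the input. Each scale predicts three boxes per grid cell, and each box carries its position and size (center coordinates plus width and height), an objectness score, and the class scores. Taking the official YOLOv5 as an example, with a 640*640 input image and 80 classes, the outputs of the three scales are (640/32)*(640/32)*(3*(4+1+80)), (640/16)*(640/16)*(3*(4+1+80)), and (640/8)*(640/8)*(3*(4+1+80)), that is, tensors of size 20*20*255, 40*40*255, and 80*80*255. The values in these tensors are not yet the coordinates and sizes of the final boxes in the image. The official YOLOv5 model processes the three outputs to obtain the real boxes, and the concatenated final output (ignoring the batch dim) has size

(3*(640/32)*(640/32))*(4+1+80) + (3*(640/16)*(640/16))*(4+1+80) + (3*(640/8)*(640/8))*(4+1+80)
= 1200*85 + 4800*85 + 19200*85
= 25200*85.

DJL reports the error because the three scales are concatenated directly without this transformation (255*20*20+255*40*40+255*80*80), which gives a tensor of the wrong shape; DJL detects that the detect layer is missing and raises the error.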
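The arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch: the constant and variable names are made up for the example, and it assumes the 640*640 input, 80 classes, and three anchors per scale used in the text.

```python
# Shape arithmetic for the three YOLO output scales.
# Assumed values (taken from the text): 640x640 input, 80 classes,
# 3 anchors per scale, strides 32/16/8.
INPUT_SIZE = 640
NUM_CLASSES = 80
NUM_ANCHORS = 3
STRIDES = (32, 16, 8)

attrs = 4 + 1 + NUM_CLASSES  # box (4) + objectness (1) + class scores

raw_shapes = []    # undecoded head output per scale: (channels, H, W)
decoded_rows = []  # number of boxes per scale after decoding
for stride in STRIDES:
    grid = INPUT_SIZE // stride
    raw_shapes.append((NUM_ANCHORS * attrs, grid, grid))
    decoded_rows.append(NUM_ANCHORS * grid * grid)

total_boxes = sum(decoded_rows)
raw_elements = sum(c * h * w for c, h, w in raw_shapes)

print(raw_shapes)            # [(255, 20, 20), (255, 40, 40), (255, 80, 80)]
print(decoded_rows)          # [1200, 4800, 19200]
print((total_boxes, attrs))  # (25200, 85)
```

Both layouts hold the same number of values (255*8400 equals 25200*85), so the raw and decoded outputs differ in how the values are arranged and what they mean, not in how many there are.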

Solution

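Adding the missing transformation to the existing model is enough. Based on utils_bbox.py from the reference code, write a PyTorch Module that serves as the detect layer, as shown below: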

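import torch
import torch.nn as nn


class DecodeBox(nn.Module):
    # Detect layer for one scale: decodes the raw head output into normalized boxes.
    # `anchors` must support indexing with a list (e.g. a NumPy array of shape (9, 2)).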
    def __init__(self, anchors, num_classes, input_shape, anchors_mask=None,index=0):
        super(DecodeBox, self).__init__()
        if anchors_mask is None:
            anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
        self.anchors = anchors
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.input_shape = input_shape

        if index > 2 or index < 0:
            index = 0
        self.index = index
        self.batch_size = 1

        self.input_height = int(input_shape[0]/(8*(2**(2-index))))
        self.input_width = int(input_shape[1]/(8*(2**(2-index))))
        stride_h = self.input_shape[0] / self.input_height
        stride_w = self.input_shape[1] / self.input_width
        
        self._scale = torch.Tensor([self.input_width, self.input_height
            , self.input_width, self.input_height])
        # -----------------------------------------------------------#
        #   Anchors for the 20x20 feature map: [116,90],[156,198],[373,326]
        #   Anchors for the 40x40 feature map: [30,61],[62,45],[59,119]
        #   Anchors for the 80x80 feature map: [10,13],[16,30],[33,23]
        # -----------------------------------------------------------#
        self.anchors_mask = anchors_mask
        self.scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) 
            for anchor_width, anchor_height in self.anchors[anchors_mask[self.index]]]

    def forward(self, x):
        # -----------------------------------------------#
        #   There are three inputs in total (one per scale); their shapes are
        #   batch_size = 1
        #   batch_size, 3 * (4 + 1 + 80), 20, 20
        #   batch_size, 255, 40, 40
        #   batch_size, 255, 80, 80
        # -----------------------------------------------#

        prediction = x.view(self.batch_size, len(self.anchors_mask[self.index]),
                                self.bbox_attrs, self.input_height, self.input_width).permute(0, 1, 3, 4, 2).contiguous()

        # -----------------------------------------------#
        #   Adjustment parameters for the anchor box center
        # -----------------------------------------------#
        box_x = torch.sigmoid(prediction[..., 0])  
        box_y = torch.sigmoid(prediction[..., 1])
        # -----------------------------------------------#
        #   Adjustment parameters for the anchor box width and height
        # -----------------------------------------------#
        w = torch.sigmoid(prediction[..., 2]) 
        h = torch.sigmoid(prediction[..., 3]) 
        # -----------------------------------------------#
        #   Objectness confidence (whether an object is present)
        # -----------------------------------------------#
        conf = torch.sigmoid(prediction[..., 4])
        # -----------------------------------------------#
        #   Class confidences
        # -----------------------------------------------#
        pred_cls = torch.sigmoid(prediction[..., 5:])

        FloatTensor = torch.cuda.FloatTensor if box_x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if box_x.is_cuda else torch.LongTensor

        # ----------------------------------------------------------#
        #   Build the grid: anchor box centers start at the top-left corner of each grid cell
        #   batch_size,3,20,20
        # ----------------------------------------------------------#
        grid_x = torch.linspace(0, self.input_width - 1, self.input_width).repeat(self.input_height, 1).repeat(
            self.batch_size * len(self.anchors_mask[self.index]), 1, 1).view(box_x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, self.input_height - 1, self.input_height).repeat(self.input_width, 1).t().repeat(
            self.batch_size * len(self.anchors_mask[self.index]), 1, 1).view(box_y.shape).type(FloatTensor)

        # ----------------------------------------------------------#
        #   Expand the anchor box widths and heights to the grid layout
        #   batch_size,3,20,20
        # ----------------------------------------------------------#
        anchor_w = FloatTensor(self.scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(self.scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(self.batch_size, 1).repeat(1, 1, self.input_height * self.input_width).view(w.shape)
        anchor_h = anchor_h.repeat(self.batch_size, 1).repeat(1, 1, self.input_height * self.input_width).view(h.shape)

        # ----------------------------------------------------------#
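        #   Adjust the anchor boxes using the predictions.
        #   First adjust the anchor box center, offsetting it toward the bottom-right,
        #   then adjust the anchor box width and height.
        #   x 0 ~ 1 => 0 ~ 2 => -0.5 ~ 1.5 => each cell predicts targets within a limited range
        #   y 0 ~ 1 => 0 ~ 2 => -0.5 ~ 1.5 => each cell predicts targets within a limited range
        #   w 0 ~ 1 => 0 ~ 2 => 0 ~ 4 => the anchor width can be scaled by a factor of 0 to 4
        #   h 0 ~ 1 => 0 ~ 2 => 0 ~ 4 => the anchor height can be scaled by a factor of 0 to 4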
        # ----------------------------------------------------------#
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = box_x * 2. - 0.5 + grid_x
        pred_boxes[..., 1] = box_y * 2. - 0.5 + grid_y
        pred_boxes[..., 2] = (w * 2) ** 2 * anchor_w
        pred_boxes[..., 3] = (h * 2) ** 2 * anchor_h

        # ----------------------------------------------------------#
        #   Normalize the boxes to fractions of the feature-map size
        # ----------------------------------------------------------#
        
        output = torch.cat((pred_boxes.view(self.batch_size, -1, 4) / self._scale.type(FloatTensor),
                            conf.view(self.batch_size, -1, 1), pred_cls.view(self.batch_size, -1, self.num_classes)), -1)
        return output

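The index argument of the constructor selects the scale. In the constructor of the existing YoloBody, initialize one DecodeBox for each of the three scales, and at the end of YoloBody's forward function change the original return value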

return self.head(P2, P3, P4, P5)

to the concatenation of the DecodeBox outputs:

output = self.head(P2, P3, P4, P5)
return torch.cat([self.decode_box0(output[0]), self.decode_box1(output[1]), self.decode_box2(output[2])], 1)

With this change the model's output shape matches that of the official YOLOv5, and in testing the model runs correctly when called from DJL.
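As a quick sanity check of the decoding arithmetic in DecodeBox.forward, the same formulas can be traced by hand for a single grid cell in plain Python. This is a sketch under assumptions: decode_cell and its parameter names are hypothetical helpers written for this example and are not part of the reference code or of DJL, and the anchor [116, 90] on the 20x20 feature map is taken from the comments in the code above.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_cell(tx, ty, tw, th, grid_x, grid_y,
                anchor_w, anchor_h, stride, input_size):
    """Decode one raw prediction into a normalized (cx, cy, w, h) box.

    Hypothetical scalar version of the tensor math in DecodeBox.forward.
    """
    grid = input_size / stride  # feature-map width/height, e.g. 20
    # Anchors are given in pixels; express them in grid cells.
    scaled_w = anchor_w / stride
    scaled_h = anchor_h / stride
    # Center: sigmoid output stretched to (-0.5, 1.5), then offset by the cell.
    cx = sigmoid(tx) * 2.0 - 0.5 + grid_x
    cy = sigmoid(ty) * 2.0 - 0.5 + grid_y
    # Size: the anchor is scaled by a factor in (0, 4).
    w = (sigmoid(tw) * 2.0) ** 2 * scaled_w
    h = (sigmoid(th) * 2.0) ** 2 * scaled_h
    # Normalize by the feature-map size so values are fractions of the image.
    return cx / grid, cy / grid, w / grid, h / grid


# All-zero raw outputs at cell (10, 10) of the 20x20 map, anchor [116, 90].
box = decode_cell(0.0, 0.0, 0.0, 0.0, grid_x=10, grid_y=10,
                  anchor_w=116, anchor_h=90, stride=32, input_size=640)
print(box)
```

With all raw values at zero every sigmoid is 0.5, so the center lands in the middle of the cell (10.5 / 20 = 0.525) and the box keeps the anchor's size (116/640 by 90/640 of the image).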