3.1 The M-SSD model
In this paper, an improved SSD model is designed. The SSD approach produces a fixed-size
collection of bounding boxes and scores for the presence of object class instances,
using a feed-forward convolutional network followed by a non-maximum suppression
step to perform the object detection. The model uses the Visual Geometry Group network
(VGG-16) as its base structure. However, it discards the last fully connected layers,
adds a set of auxiliary convolutional layers to extract features at multiple scales,
and decreases the input size to each subsequent layer. This improves the detection
accuracy for small objects compared with other existing algorithms. For this kind
of model structure, however, the number of network weights is large, considerable
disk space is required, and the detection speed is slow. It is therefore not suitable
for platforms with limited computing power or for small-storage real-time detection
systems.
Wei Liu (17) analyzed the SSD model structure and pointed out that the forward pass time is
spent mainly in the base network (nearly 80%). Therefore, for real-time applications,
using a faster base network can reduce the amount of computation and greatly improve
the speed. ResNet (18) was first proposed by Kaiming He and has proven to be an efficient
network. Lili Chen (19) replaced the base feature extraction network with ResNet-34 and
obtained fast detection speeds for vehicle counting. Note that, in our single former
USV object detection system, it is unnecessary to use very many network layers for
feature extraction. We therefore choose ResNet-18 as the base feature extraction
network, in order to obtain real-time detection performance.
The whole ResNet-18 model structure comprises a convolutional layer, four basic
block layers and a final fully connected layer, as shown in detail in Fig. 1. This
structure avoids the vanishing-gradient problem caused by deepening the neural
network, and its efficiency is simultaneously improved by the introduced basic blocks.
Fig. 1. The flowchart of ResNet-18
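To make the role of the basic blocks concrete, the following is a minimal sketch of a ResNet basic block in Python with tf.keras. It illustrates the standard two-3×3-convolution residual design rather than the authors' exact implementation; the function name and channel arguments are assumptions made for illustration.

```python
import tensorflow as tf

def basic_block(x, filters, stride=1):
    """Minimal ResNet basic block: two 3x3 convolutions with batch
    normalization, plus a shortcut connection around them."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, strides=stride,
                               padding="same", use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Activation("relu")(y)
    y = tf.keras.layers.Conv2D(filters, 3, strides=1,
                               padding="same", use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    # Project the shortcut when the spatial size or channel count changes,
    # so that the element-wise addition is well defined.
    if stride != 1 or x.shape[-1] != filters:
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=stride,
                                          use_bias=False)(x)
        shortcut = tf.keras.layers.BatchNormalization()(shortcut)
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.Activation("relu")(y)
```

Because the shortcut carries the input signal directly to the block output, gradients can flow through the addition unattenuated, which is what mitigates the vanishing-gradient problem mentioned above.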
In real-time object detection tasks, large and excessively numerous convolution kernels
increase the computational cost, dilute the effective features and reduce the real-time
control accuracy. The authors of (20), (21) show that 1×1 and 3×3 kernels have fewer
parameters but stronger feature generalization ability than 5×5 and 7×7 kernels. In
addition, a block of two convolutional layers with 3×3 kernels covers the same receptive
field as a single 5×5 convolutional layer as the convolutional window scans the input,
so the original throughput is kept. However, it requires fewer parameters, and the
stacked convolutional layers yield a better result.
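A quick parameter count illustrates this point; the channel width of 128 below is only an example value, not one taken from the model:

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

c = 128  # example channel width, chosen only for illustration
one_5x5 = conv_params(5, c, c)      # 409,600 weights
two_3x3 = 2 * conv_params(3, c, c)  # 294,912 weights
print(two_3x3 / one_5x5)  # 0.72: same 5x5 receptive field, ~28% fewer weights
```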
Table 1. Parameters of M-SSD from the FC6 to conv9_2 layers

Layer   | Input size | Output size | Kernel size | Input channels | Output channels
FC6     | 38×38      | 19×19       | 3×3         | 256            | 512
FC7     | 19×19      | 19×19       | 1×1         | 512            | 512
conv6_1 | 19×19      | 10×10       | 1×1         | 512            | 256
conv6_2 | 19×19      | 10×10       | 3×3         | 128            | 256
conv7_1 | 10×10      | 5×5         | 1×1         | 256            | 128
conv7_2 | 10×10      | 5×5         | 3×3         | 64             | 128
conv8_1 | 5×5        | 3×3         | 1×1         | 128            | 128
conv8_2 | 5×5        | 3×3         | 3×3         | 64             | 64
conv9_1 | 3×3        | 1×1         | 3×3         | 128            | 128
conv9_2 | 3×3        | 1×1         | 1×1         | 64             | 64
Inspired by these methods from the literature, two modifications are made to the
original SSD model: (a) we retain the SSD structure but replace VGG-16 with ResNet-18
as the base feature extraction network, followed by several convolutional layers to
detect the object; (b) we replace the convolutional kernels from the FC6 to conv9_2
layers and use 1×1 convolutional kernels to classify the object. The layer
specifications from FC6 to conv9_2 are detailed in Table 1, while the M-SSD model
structure is presented in Fig. 2.
Fig. 2. The overall structure of the M-SSD
In contrast to the SSD model, we choose the res3d, fc6, fc7, conv6_1, conv7_1,
conv8_1 and conv9_1 layers as the regression feature map layers to classify the
object. In each feature map layer, 1×1 denotes the size of the convolutional kernel,
3 or 6 denotes the number of prior boxes, and 4 denotes the number of bounding-box
offset values.
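As a concrete illustration, the number of output channels of each 1×1 detection head follows directly from these quantities; the two-class setting below (USV vs. background) is an assumption made for illustration:

```python
# Each 1x1 detection head outputs, per feature-map location and per prior
# box, the class scores plus the 4 bounding-box offset values.
num_classes = 2  # assumed: USV vs. background
for num_priors in (3, 6):
    out_channels = num_priors * (num_classes + 4)
    print(f"{num_priors} priors -> {out_channels} output channels")
```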
Afterwards, the M-SSD model parameters are set for the proposed real-time detection
system as follows:
Ⅰ. Select the default box parameters: in a CNN, the feature maps located at different
layers have receptive fields of different sizes. To correctly detect moving targets
at different scales, some algorithms convert the input image to different scales,
process each converted image and fuse the detection results (22), (23). The strategy
proposed in (24) is instead based on the fact that the default frame does not need
to correspond one-to-one with the receptive field of the feature map.
The default frames at different positions correspond to different regions and target
sizes. Assuming that $m$ feature maps should be predicted, the default frame size
in the $k$-th feature map is calculated as

$$S_{k}= S_{\min}+\frac{S_{\max}-S_{\min}}{m-1}(k-1),\qquad k\in[1,\: m],$$

where $S_{\min}$ is the default frame size of the lowest layer, with a value of 0.1,
and $S_{\max}$ is the default frame size of the highest layer, with a value of 0.96,
in the network structure.
The scales of the different layers are thus spaced at regular intervals. The
width-to-height ratio of the default frame is $a_{r}\in\{1,\: 2,\: 3,\: 1/2,\: 1/3\}$.
The width and height of each default frame are respectively given by

$$w_{k}^{a}= S_{k}\sqrt{a_{r}},\qquad h_{k}^{a}= S_{k}/\sqrt{a_{r}}.$$
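The sketch below evaluates these two formulas in Python; the choice of $m=7$ reflects the seven regression feature map layers listed above and is an assumption, as the paper does not state $m$ explicitly:

```python
import math

S_MIN, S_MAX, M = 0.1, 0.96, 7  # assumed m = 7 regression feature map layers

def scale(k):
    """Default frame scale S_k for the k-th feature map, k = 1..m."""
    return S_MIN + (S_MAX - S_MIN) * (k - 1) / (M - 1)

def default_box(k, a_r):
    """Width and height of a default frame with aspect ratio a_r."""
    s_k = scale(k)
    return s_k * math.sqrt(a_r), s_k / math.sqrt(a_r)

for k in range(1, M + 1):
    w, h = default_box(k, a_r=2)
    print(f"layer {k}: S_k = {scale(k):.3f}, 2:1 box = {w:.3f} x {h:.3f}")
```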
Ⅱ. Choose the matching strategy: when generating the M-SSD detection model, this
strategy selects matching default boxes for each true label box. For each true label,
it finds the default box with the highest Jaccard overlap among all candidate default
boxes, and it further matches default boxes by thresholding the Jaccard overlap
coefficient.
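A minimal sketch of this matching step follows; boxes are assumed to be in (x1, y1, x2, y2) form, and the 0.5 overlap threshold is the conventional SSD value, assumed here rather than stated in the paper:

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match(gt_boxes, default_boxes, threshold=0.5):
    """For each ground-truth box, keep the best-overlapping default box
    plus every default box whose overlap exceeds the threshold."""
    matches = []
    for gt in gt_boxes:
        overlaps = [jaccard(gt, d) for d in default_boxes]
        best = max(range(len(default_boxes)), key=overlaps.__getitem__)
        matched = {best} | {i for i, o in enumerate(overlaps) if o > threshold}
        matches.append(sorted(matched))
    return matches
```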
Ⅲ. Select the loss function: the softmax loss $l_{i}= -\log(e^{S_{y_i}}/\sum_{j}
e^{S_{j}})$ is selected as the loss function, where $S_{j}$ is the score of class $j$
and $y_{i}$ is the true label of the real object. The total loss function $L$ is then
given by

$$L=\frac{1}{N}\sum_{i=1}^{N} l_{i},$$

where $N$ is the total number of images.
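In NumPy, the per-sample softmax loss and the total loss $L$ can be written as the following sketch; the max-shift is a standard numerical-stability device, not part of the formula itself:

```python
import numpy as np

def softmax_loss(scores, y):
    """l_i = -log(exp(S_y) / sum_j exp(S_j)), with a max-shift so that
    the exponentials cannot overflow."""
    shifted = scores - np.max(scores)
    return float(-shifted[y] + np.log(np.sum(np.exp(shifted))))

def total_loss(score_list, labels):
    """L = (1/N) * sum_i l_i over the N images."""
    return np.mean([softmax_loss(s, y) for s, y in zip(score_list, labels)])

# Illustrative scores for two images with true labels 0 and 1.
print(total_loss([np.array([2.0, 0.5]), np.array([0.1, 1.2])], [0, 1]))
```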
An objective function always exists during model training; the loss function is
optimized until its value reaches a minimum. The M-SSD training model is developed
on the TensorFlow deep learning framework.
Based on this design, the algorithm complexity is reduced. The advantage of the
proposed design is shown in the following comparative analysis.
We use SSD for object detection because the SSD framework is designed to be
independent of the base network and is used to accurately classify and locate
targets. It can run on any base network (such as VGG, ResNet or MobileNet).
Therefore, we can use different base networks for neural network learning and
different numbers of regression layers (from 6 to 8) to estimate their accuracy. It
is a very useful neural network framework for improving detection accuracy and
speed. YOLO and its improved editions, YOLO v3 and YOLO v5, have been proposed for
multiple-object detection. However, for real-time detection tasks on mobile
terminals, the SSD framework is still the better choice, since its combined accuracy
and speed are particularly outstanding when it is used with a lightweight structure
to detect objects.
3.2 M-SSD model training/testing
The next step consists of training and testing the proposed M-SSD model for object
detection. The hardware specifications of the experimental environment are shown in
Table 2. The CPU, with 16 GB of RAM, is used to train the M-SSD model, and the GPU
greatly improves the training speed. The CUDA 10.0/CUDNN 8.0.0 libraries and the
Python 3.6/TensorFlow 1.8 platform are used to train the model quickly and
effectively. The trained model runs on the Ubuntu 18.04 operating system, using a
camera to capture real-time objects at a resolution of 1024×768.
Table 2. Hardware specification

Hardware device  | Parameter
CPU              | Intel(R) Core(TM) i7-8750H
RAM              | 16 GB
GPU              | NVIDIA GeForce GTX 1060
Operating system | Ubuntu 18.04
CUDA/CUDNN       | CUDA 10.0/CUDNN 8.0.0
Platform         | Python, TensorFlow
Camera           | USB HD, resolution 1024×768
Table 3. The parameter initialization

Parameter    | Value
base_lr      | 0.0001
max_iter     | 50000
lr_policy    | step
gamma        | 0.1
momentum     | 0.9
weight_decay | 0.0005
image_size   | 300×300
type         | SGD
BN           | 32
An image database containing 2000 images was built. The images were collected under
different external environments and illumination intensities, with a 3:1 ratio of
positive images (containing the USV) to negative images (without the object). Some
of the images were flipped, stretched or compressed to improve the universality of
the data set. Accordingly, 80% of the images were used for training and the remaining
20% for testing. In the base network, the images captured by the camera were resized
to 300×300 before being input to the network model. The model is trained using
stochastic gradient descent (SGD) with an initial learning rate (base_lr) of 0.0001,
a momentum of 0.9, a weight decay of 0.0005 and a batch normalization (BN) size of
32. The network was trained for 50,000 iterations and successfully converged. The
other parameters are detailed in Table 3.
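A sketch of this configuration in the TensorFlow 1.x API is shown below. The decay step size of the "step" learning-rate policy is not listed in Table 3, so the value of 10,000 iterations is a hypothetical placeholder, and the variable `w` merely stands in for the model weights:

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the paper's setup

global_step = tf.train.get_or_create_global_step()
# "step" policy: lr = base_lr * gamma^floor(step / decay_steps).
learning_rate = tf.train.exponential_decay(
    learning_rate=0.0001,   # base_lr from Table 3
    global_step=global_step,
    decay_steps=10000,      # hypothetical: not given in Table 3
    decay_rate=0.1,         # gamma from Table 3
    staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)

w = tf.Variable(tf.zeros([3, 3, 3, 16]))  # stand-in for the model weights
l2_penalty = 0.0005 * tf.nn.l2_loss(w)    # weight_decay as an L2 term
# In the real model, the detection loss is added to the penalty:
# train_op = optimizer.minimize(detection_loss + l2_penalty,
#                               global_step=global_step)
```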
Some of the labeled images used for training/validation are illustrated in Fig. 3.
The experiment was conducted in a pool area at Kyungnam University in South Korea.
The training/validation accuracy of the proposed model is presented in Fig. 4; the
classification accuracy reaches 96.75%. Some classification and accuracy results for
successful detections are shown in Fig. 5.
To evaluate the performance of the proposed detection system, the following four
evaluation criteria are used:

$$\text{Precision}=\frac{TP}{TP+FP},\qquad \text{Recall}=\frac{TP}{TP+FN},$$

$$\text{Accuracy}=\frac{n-a}{n}=\frac{TP+TN}{TP+FP+FN+TN},\qquad F1=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}},$$

where $a$ and $n$ respectively denote the number of misclassified samples and the
total number of samples, TP (true positive) refers to a positive sample predicted
as positive (a correct result), FP (false positive) refers to a negative sample
predicted as positive (a false alarm), FN (false negative) refers to a positive
sample predicted as negative (a missed detection), and TN (true negative) refers to
a negative sample predicted as negative.
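Computed from the four counts, the criteria look as follows in Python; the counts passed in at the end are illustrative only, not results from the paper:

```python
def evaluate(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # equals (n - a) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

print(evaluate(tp=190, fp=5, fn=8, tn=197))  # illustrative counts only
```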
Fig. 3. Part of the images used for training
The proposed M-SSD model is compared with SSD (10), R-SSD (24) and F-SSD (18), using
the four aforementioned criteria: precision, recall, accuracy and F1. The results are
shown in Fig. 6. It can be observed that the proposed M-SSD model achieves a higher
detection performance than SSD, reaching an accuracy of 96.75%. This is because
ResNet-18, whose residual structure provides stronger feature extraction, is used to
extract the basic feature information. However, M-SSD has a lower detection
performance than R-SSD and F-SSD, because the proposed model has fewer layers than
the ResNet-50 of R-SSD and the ResNet-34 of F-SSD. This conversely confirms that
higher accuracy requires deeper networks. However, higher accuracy alone does not
mean better detection performance: the computation time, given in Table 4, is another
parameter for performance estimation. It can be seen from Table 4 that the
computation time of the proposed M-SSD model is 424.36 s, which is 26.35% less than
that of the SSD model and much less than those of R-SSD and F-SSD. The proposed
design thus improves both the detection performance and the detection speed. It can
also be implemented on mobile terminals such as the Raspberry Pi and the Jetson Nano.
Fig. 4. Accuracy results of the proposed model
Fig. 5. Output of the M-SSD testing
For our collected USV data set, the FPS of SSD is about 67 at an input resolution of
300×300, while the FPS of the proposed M-SSD model is about 86 at the same
resolution. When the trained model file is downloaded to the Jetson Nano mobile
terminal, the FPS of the proposed model is about 32, which achieves real-time former
USV detection.
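For reference, FPS figures of this kind can be obtained by timing the forward pass over a batch of frames, as in the hypothetical helper below; `detect` stands in for one inference call of the trained model:

```python
import time

def measure_fps(detect, frames):
    """Average frames per second of `detect` over an iterable of frames."""
    start = time.time()
    count = 0
    for frame in frames:
        detect(frame)  # one forward pass of the trained model
        count += 1
    return count / (time.time() - start)
```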
Fig. 6. Performance comparison of different models
Table 4. Computation time of the methods (s)

Method     | Basic network | Time
SSD (10)   | VGG-16        | 576.25
R-SSD (24) | ResNet-50     | 824.36
F-SSD (18) | ResNet-34     | 720.64
M-SSD      | ResNet-18     | 424.36
Some of the failure detection images are shown in Fig. 7. It can be observed that
indistinct object characteristics and sharp changes in the ambient light around the
detected object may cause detection failures. Another labeled image data set is used
to verify this conjecture and to train a higher-accuracy model for further studies.
This collected data set mainly comprises images that we previously failed to detect,
as well as images collected in similar environments. Part of the new data set is
shown in Fig. 8, and the re-training loss for the new data set is presented in
Fig. 9. It can be seen from Fig. 9 that the training loss remains slightly high
during the re-training process; a better training loss could not be obtained after
50,000 iterations. This is because the basic network structure cannot extract more
features of the USV object from a data set with unclear features, which leads to
poor object classification. In summary, for blurred or unclear images, the network
cannot learn enough features and the loss function cannot converge to zero. It is
concluded that images with clear features are required to train the model, after
which the network models can achieve good accuracy. For former USV object detection,
this paper achieves higher detection accuracy and faster speed than the original SSD
model by replacing the base network VGG-16 with ResNet-18 and using 1×1
convolutional kernels to regress six feature maps. Although there is no significant
improvement in accuracy, the computation time is reduced by 26.35% compared with the
original SSD structure. In addition, it has the advantage that it can run on
platforms with limited computing resources.
Fig. 7. Part of the failure detection images
Fig. 8. New data set for training images
Fig. 9. The re-train loss for the new data set