# 3D Object Detection

### Table of contents

- CNKI paper reading notes
- Keywords
- Current situation: fragmented knowledge points
- 3D detection algorithm classification
  - Three categories according to the data type used
  - Classification according to different point cloud feature representations [1]
  - Other classification methods [1]
  - Classification according to sensor [2]
  - Classification according to usage scenario [2]
  - Bird’s Eye View (BEV) target detection method based on voxels
  - Target detection method based on point-wise features
  - Target detection method based on camera view

- Evaluation criteria (evaluation indicators)
- 3D detection algorithm theory learning
- 3D public dataset
- Introduction to 3D target recognition technology (current situation)
- Introduction to 3D target detection technology (current situation)
- Three-dimensional target detection algorithm based on point cloud data
- PointNet network
- PointNet++ Network

# CNKI paper reading notes

The following notes are taken from CNKI documents. Reading order: the most recent papers first, then earlier ones; papers from top journals and universities before those from ordinary journals and universities; and doctoral theses before master's theses. Where the summary falls short, I hope readers will correct me.

# Keywords

Lidar; point cloud; three-dimensional target detection; attention mechanism; data augmentation

3D Object Detection 3D bounding box Augmented Reality (AR)

# Current situation: fragmented knowledge points

1. Object detection is one of the basic tasks in the field of computer vision;

2. It can be divided into **two-dimensional** target detection and **three-dimensional** target detection;

3. Two-dimensional target detection based on deep learning is now **mature**; two-dimensional detection algorithms represented by **Faster R-CNN** and **YOLO** are widely used in actual production and daily life;

4. The field of three-dimensional target detection is in the ascendant, with new algorithms emerging one after another. **Point-wise** detection methods based on raw point clouds offer higher detection accuracy but lower speed; bird's-eye-view methods based on **voxels** detect faster but with lower accuracy [1].

5. The autonomous driving system has three steps: environment perception, behavioral decision-making, and vehicle control;

6. The perception system detects surrounding scene information: relevant attributes (position, direction, speed, etc.) of the surrounding environment (roads, vehicles, pedestrians, etc.);

There are currently two mainstream perception technology routes: the weak-perception route based on cameras, and the strong-perception route based on LiDAR (Light Detection And Ranging);

7. **2D detection algorithms** usually use vehicle-mounted cameras as sensing devices. Camera-based weak-perception systems are low-cost and easy to deploy, but cannot provide accurate three-dimensional spatial information about the surrounding environment;

8. **3D target detection algorithms** generally use lidar as the sensing device. Three-dimensional point cloud data contains information such as the position, distance, depth, and angle of target objects, and its structure is more consistent with the real-world situation;

9. At present, the main lidar manufacturers in China and abroad include Velodyne, IBEO, Quanergy, Silan Technology, and other companies, among which Velodyne is the best known in the industry;

10. Advantages and disadvantages: cameras are low-cost, while lidar is expensive;

11. Lidar introduction: lidar (laser detection and ranging), also known as optical radar, is an integrated light detection and measurement device. In operation, the radar actively emits a laser beam; the receiver then compares the reflected echo with the transmitted signal and extracts the target's relevant information (position, angle, reflection intensity, etc.), thereby achieving detection, tracking, and identification of the target. The mainstream lidars currently on the market have **32, 64, or 128 beams**; the more beams a lidar has, the higher its measurement accuracy. Unlike traditional cameras with dense pixel imaging, lidar imaging produces sparse point clouds. In addition, point clouds are continuous while images are discrete: point clouds reflect the shape and pose of real-world targets but lack texture and color information, whereas images are discretized representations of real-world targets and lack the targets' true size information [1];

12. Three-dimensional target detection should produce four kinds of output: 2D Bounding Box, 3D Pose, 3D Location, and 3D Bounding Box. Traditional 3D target detection algorithms usually output only the 2D Bounding Box and 3D Pose of the object. Until recent years, only a few 3D target detection algorithms could output the complete 3D Bounding Box, and most of them rely on a CAD model of the object or on training multi-view templates [2].

13. Three-dimensional target detection technology is based, to a certain extent, on three-dimensional target recognition technology [3].

# 3D detection algorithm classification

## Three categories according to the data type used:

Image-based methods, point cloud-based methods, point cloud and image-based methods.

Image-based three-dimensional target detection methods mainly obtain image data through cameras; camera types can be subdivided into monocular cameras, binocular cameras, multi-camera rigs, infrared cameras, etc. Classified by how the point cloud data is processed, there are three main approaches: voxelization, projection, and keeping the original point cloud.

## Classification according to different point cloud feature representations [1]:

(1) Target detection method based on point-wise feature;

(2) Bird’s Eye View (BEV) target detection method based on voxel;

(3) Target detection method based on camera view;

(4) Joint detection solutions based on a monocular camera and low-beam lidar;

## Other classification methods[1]:

(1) Single-stage detection (one-stage): single-stage detection outputs the category and location of the target directly from the backbone network, without a Region Proposal Network (RPN). The algorithm is therefore faster, but its detection accuracy is slightly lower than that of two-stage detection networks.

Such as: YOLO

(2) Two-stage detection: two-stage detection usually involves a multi-task learning problem: 1) distinguish foreground object boxes from the background and generate a set of candidate boxes with category labels; 2) regress, via a convolutional neural network, a set of coefficients that maximize the Intersection over Union (IoU) between the detection box and the ground-truth box. Finally, redundant bounding boxes (duplicate detections of the same object) are removed through Non-Maximum Suppression (NMS).

Such as: Fast R-CNN

In recent years, with the rapid development of deep neural networks, target detection has evolved from traditional hand-crafted-feature methods to deep-learning-based methods, giving rise to two-stage network frameworks represented by R-CNN, Fast R-CNN, and Mask R-CNN, and one-stage network frameworks represented by YOLO and SSD [2].

Advantages and disadvantages: compared with single-stage detection algorithms, two-stage detection algorithms are usually slower, but their detection accuracy is usually higher.
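The IoU and NMS steps described above can be sketched in a few lines. This is a minimal 2D illustration, not code from any of the cited papers; the function names and the 0.5 threshold are illustrative choices:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping duplicates."""
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = np.array([i for i in rest if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: the second box duplicates the first
```

The same greedy scheme carries over to 3D detection, with IoU computed over volumes or BEV rectangles instead of 2D areas.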

## According to sensor classification [2]:

According to the image format obtained by the sensor, 3D target detection can be divided into four categories: based on monocular cameras, based on binocular/depth cameras, based on lidar and based on multi-sensor fusion.

(1) Three-dimensional target detection based on monocular cameras refers to the use of images captured by monocular cameras for detection. Since monocular images lack depth information of the scene, depth information needs to be obtained through the geometric model of the object or multi-view constraints.

(2) Three-dimensional target detection based on binocular/depth cameras. Since binocular/depth cameras can not only obtain original images, but also directly obtain corresponding depth images, detection is easier than monocular cameras.

(3) Three-dimensional target detection based on lidar refers to using point cloud data obtained by lidar to detect three-dimensional objects. It is especially common in the field of unmanned driving.

(4) Three-dimensional target detection based on multiple sensors refers to fusing information from several sensors, for example using monocular images and lidar point clouds at the same time to detect three-dimensional objects; this is where the future trend of unmanned driving lies.

Although multi-sensor fusion can be used in the industrial field, three-dimensional target detection based on camera sensors still has the advantages of low cost and wide application range, and is still an important research field.

## Classification according to usage scenarios[2]:

### Indoor 3D target detection algorithm

Indoor scenes have characteristics significantly different from outdoor scenes. They contain a wide variety of targets, such as chairs, desks, and wardrobes, and similar targets may differ greatly in appearance; the scale of indoor scenes is relatively small, so mainly close-range targets are handled. In terms of sensors, indoor scenes favor binocular/depth cameras: compared with monocular cameras, they capture not only rich texture information of the target but also accurate depth information of the scene. The measurement range of typical binocular/depth cameras is within tens of meters, which suits indoor scenes. A depth map is an image containing distance information: each pixel value stores the actual distance between the sensor and the photographed object, and it looks similar to a grayscale image. For 3D target detection algorithms based on binocular/depth vision, the distance from the object to the camera is known, which effectively reduces the difficulty of algorithm design. Table 2.1 analyzes the types, mechanisms, advantages, and disadvantages of different three-dimensional target detection algorithms in indoor scenes.

In indoor scenes, three-dimensional target detection algorithms based on binocular/depth vision can be divided into two categories according to the convolution type used for region extraction: 2.5D region proposal networks and 3D region proposal networks. A 2.5D region proposal network uses a traditional 2D detection network to process binocular images and depth maps, combining the two-dimensional features to regress the object's three-dimensional information. Its advantage is a relatively simple network structure that reuses a complete 2D detection network to quickly extract targets and regress three-dimensional detection boxes. A 3D region proposal network is an end-to-end model that directly uses three-dimensional convolution to extract the target's three-dimensional spatial features. Its advantage is that it makes full use of the three-dimensional spatial information provided by the binocular/depth camera; however, three-dimensional convolution requires more computation than traditional two-dimensional convolution and falls short in real-time performance.

Three-dimensional target detection methods based on monocular vision in indoor environments usually require depth estimation, and their detection accuracy is often lower than that of binocular/depth vision algorithms. However, monocular vision algorithms have the advantages of low sensor cost and wide applicability, and completing three-dimensional target detection with a single sensor makes the system more stable.

### Outdoor 3D target detection algorithm

An important application field of 3D target detection in outdoor scenes is automatic driving, which mainly targets the regression problem of 3D bounding boxes for multiple targets such as vehicles, pedestrians, and signs in road scenes. Compared with indoor scenes, 3D target detection in outdoor scenes is more challenging, mainly in the following two aspects:

① Outdoor scenes are affected by weather and lighting, and the background environment is more complex than indoors.

② The field of view changes more in outdoor scenes, targets are relatively far away, and occlusion and truncation occur.

In autonomous driving, the accuracy of target three-dimensional spatial positioning and size estimation is very important for safety, which puts forward higher requirements for three-dimensional target detection. Table 2.2 (quoted from literature [13]) analyzes the mechanisms, advantages and disadvantages of different three-dimensional target detection algorithms for monocular vision and binocular/depth vision in outdoor environments.

### Three-dimensional target detection based on monocular vision

Two-dimensional target detection on monocular images is relatively mature and can quickly classify and locate targets on the image plane. However, limitations remain in three-dimensional target detection, especially outdoors, where estimating the target's depth is harder: texture features alone can hardly determine the target's pose in three-dimensional space. Therefore, some 3D detection algorithms combine geometric features, 3D model matching, prior-information fusion, depth estimation networks, and other methods to regress the target's 3D geometry. Driven by technologies such as autonomous driving and augmented reality, monocular three-dimensional target detection has increasingly become a research focus. Its methods fall roughly into two categories: one generates a series of 3D candidate boxes from two-dimensional image region proposals and combines them with predefined prior information to regress an accurate 3D detection box; the other directly extracts the target's 3D features with a neural network and then combines 3D template matching, reprojection-error optimization, depth estimation, and other methods to compute the target's three-dimensional pose and obtain an accurate 3D detection box. Lacking the target's depth information, monocular methods remain deficient in 3D detection accuracy, and occlusion, truncation, and distant targets are still difficult to handle. Nevertheless, monocular vision has great advantages in data processing and sensor cost, and remains the mainstream choice in many fields.

### Three-dimensional target detection based on binocular/depth vision

The main reason for the low accuracy of monocular three-dimensional target detection in outdoor scenes is the large deviation in depth estimation, especially for occluded, truncated, and distant targets. One solution is to use binocular/depth cameras: with accurate depth information, target detection and localization accuracy improve significantly. Binocular images can use the matching relationship between the left and right images to establish image depth and obtain a depth map, so binocular vision and depth vision are processed in the same way. Three-dimensional target detection algorithms based on binocular/depth vision fall into two categories: methods that fuse monocular images and depth maps with a dual-channel convolutional neural network, and methods based on three-dimensional spatial convolution. The outdoor algorithms are similar to the indoor solutions; however, since outdoor scenes cover a wider range and target scales vary more, directly using three-dimensional spatial convolution greatly increases computation, so the first class of detection methods is preferred.

## Bird’s Eye View (BEV) target detection method based on voxel

VoxelNet

SECOND

PointPillars

Patch Refinement

PIXOR

HDNET

Voxel-FPN

MV3D

JPGVA

Frustum ConvNet

Object as Hotspots

PV-RCNN

## Target detection method based on point-wise features

PointRCNN

STD

3DSSD

CLOCs

## Target detection method based on camera view

SqueezeSeg

PointSeg

LaserNet

RangeRCNN

# Evaluation criteria (evaluation indicators)

In the field of target detection, common evaluation indicators are used to analyze the strengths and weaknesses of detection algorithms. For example, in traditional two-dimensional target detection tasks, precision, recall, and Average Precision (AP) are used to quantitatively analyze the effectiveness of a detection algorithm and the accuracy of its results.

# 3D detection algorithm theory learning

## Some concepts and definitions

### Target positioning / target regression / bounding-box regression

What is regression?

Finding a function y = a1*x1 + a2*x2 + … + an*xn + b

that approximates the true relationship y = f(x).

Bounding-box regression in target detection works the same way:

find (train) a bounding box

that fits the target's real box. This so-called real box is the box we annotate manually.

Reference: https://www.cnblogs.com/boligongzhu/p/15066380.html
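The idea above, fitting a parametric function to manually labeled ground truth, can be illustrated with a toy least-squares regression; the "true" function and its coefficients below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noise-free "ground truth" the regressor should recover: y = 2*x1 - 3*x2 + 5.
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 5

# Fit coefficients a1..an and intercept b by least squares.
A = np.hstack([X, np.ones((len(X), 1))])   # append a column of ones for b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # ≈ [2, -3, 5]
```

Box regression in a detector is the same fit in spirit, except the inputs are learned features and the targets are the offsets between anchor boxes and annotated ground-truth boxes.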

## Types of 3D data

- Point cloud (based on data obtained from lidar)
- Depth image
- 3D mesh
- Voxel

## 3D data representation

Unlike two-dimensional images that use a unified representation of pixel matrices, three-dimensional data exists in a variety of different data representation forms, including:

(1) Polygon mesh representation

(2) Point cloud representation

(3) Depth image representation

(4) Voxel representation

(5) Multi-view representation

(6) and so on…

Various data representations can be converted into each other. Figure 1.4 shows examples of different three-dimensional data representations.

### polygon mesh representation

A polygonal mesh represents a three-dimensional surface using a set of vertices, edges, and polygonal face elements. Face elements may be triangles, quadrilaterals, or other simple convex polygons; among these, the triangular mesh representation is the most widely used [23]. A triangular mesh can be written mathematically as M = <V, F>, where V and F are the vertex set and the face set respectively. The vertex set can be represented by a two-dimensional matrix V of size Nv × 3, each row storing a vertex's three-dimensional coordinates, with Nv the number of vertices; the triangular face information can be represented by a two-dimensional matrix F of size Nf × 3, each row storing the indices of the triangle's three vertices in the vertex set V, with Nf the number of faces. The polygonal mesh representation is well suited to applications such as three-dimensional modeling and rendering, and is widely used in computer graphics.
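The M = <V, F> representation described above can be sketched directly with two matrices. Here is a toy tetrahedron, with per-face normals and areas derived from edge cross products; this is an illustrative sketch, not code from any mesh library:

```python
import numpy as np

# A tetrahedron as a triangular mesh M = <V, F>.
V = np.array([[0, 0, 0],   # vertex set, shape (Nv, 3): one 3D coordinate per row
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
F = np.array([[0, 1, 2],   # face set, shape (Nf, 3): vertex indices per triangle
              [0, 1, 3],
              [0, 2, 3],
              [1, 2, 3]])

# Per-face normals from the cross product of two triangle edges;
# half the norm of the cross product is the triangle's area.
e1 = V[F[:, 1]] - V[F[:, 0]]
e2 = V[F[:, 2]] - V[F[:, 0]]
normals = np.cross(e1, e2)
areas = 0.5 * np.linalg.norm(normals, axis=1)
print(areas.sum())  # total surface area of the tetrahedron
```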

### point cloud representation

A point cloud is a collection of three-dimensional points sampled from an object's surface. Each point records its three-dimensional coordinates (x, y, z) and possibly other attributes (such as reflectance). A point cloud can be represented mathematically by a two-dimensional matrix P of size N × M, where N is the number of points and M (M ≥ 3) covers each point's three-dimensional coordinates, reflectance, and other attributes. Point clouds are usually acquired by sensors such as LiDAR and depth cameras. The point cloud representation preserves the original sensor data to the greatest extent, with no quantization or projection loss. For a point cloud represented as a matrix P of size N × M, changing the arrangement order (row order) of the points does not change the point cloud itself.
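The row-order invariance noted above can be checked directly: shuffling the rows of P changes the matrix but not the point set it represents. A toy sketch with made-up values:

```python
import numpy as np

# A toy point cloud P (N x M): columns are x, y, z, reflectance.
P = np.array([[0.0, 0.0, 1.0, 0.9],
              [2.0, 1.0, 0.5, 0.3],
              [1.0, 3.0, 0.2, 0.7]])

P_shuffled = P[[2, 0, 1]]   # reorder the rows (points)

# The matrices differ, but sorting the rows into a canonical order
# shows the underlying point *set* is identical:
same_matrix = np.array_equal(P, P_shuffled)
same_set = np.array_equal(P[np.lexsort(P.T)], P_shuffled[np.lexsort(P_shuffled.T)])
print(same_matrix, same_set)  # False True
```

This permutation invariance is exactly why point-cloud networks such as PointNet use symmetric (order-independent) aggregation functions.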

### Depth image representation

A depth image is a single-channel image in which each pixel value is the depth/distance from the corresponding scene point to the sensor's imaging plane. Mathematically, a depth image can be represented by a two-dimensional matrix I of size N × M, where N and M are the numbers of rows and columns of the image respectively.
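Given camera intrinsics, a depth image as defined above can be back-projected into a point cloud. This sketch assumes a simple pinhole camera model; the intrinsic values fx, fy, cx, cy are illustrative, not from the text:

```python
import numpy as np

# Assumed pinhole intrinsics (illustrative values).
fx = fy = 500.0
cx, cy = 320.0, 240.0

depth = np.full((480, 640), 2.0)     # N x M depth image: 2 m everywhere
v, u = np.indices(depth.shape)       # pixel row (v) and column (u) grids

# Pinhole back-projection: pixel (u, v) with depth z -> 3D point (x, y, z).
z = depth
x = (u - cx) * z / fx
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
print(points.shape)  # (307200, 3): one 3D point per pixel
```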

### voxel representation

Volumetric (voxel-grid) representation is the generalization of the two-dimensional image's pixel representation to three-dimensional space, and is a structured representation of three-dimensional data. It divides the whole three-dimensional space into grids along three mutually perpendicular directions; each grid cell is called a voxel, the smallest unit of the representation. A voxel grid can be represented mathematically by a three-dimensional matrix of size N × M × H, where N, M, and H are the length, width, and height of the three-dimensional "image" respectively. Voxel representation is regular and uniform, but it amounts to a quantization of the point cloud, so detail information about the object is inevitably lost. Moreover, voxel representation is highly redundant: a large number of voxels represent "empty" positions off the object's surface, consuming considerable storage. For this reason, voxel resolution is generally kept low.
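The quantization described above can be sketched as a binary occupancy grid; the grid size, bounds, and function name below are illustrative choices:

```python
import numpy as np

def voxelize(points, grid_size=(32, 32, 32), bounds=(-1.0, 1.0)):
    """Quantize a point cloud into a binary occupancy grid of size N x M x H."""
    lo, hi = bounds
    # Map coordinates in [lo, hi) to integer voxel indices.
    idx = ((points - lo) / (hi - lo) * np.array(grid_size)).astype(int)
    idx = np.clip(idx, 0, np.array(grid_size) - 1)
    grid = np.zeros(grid_size, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # occupied voxels get value 1
    return grid

rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(5000, 3))     # toy point cloud
grid = voxelize(points)
print(grid.shape, grid.sum())  # most of the 32^3 = 32768 voxels stay empty
```

Note the two losses discussed in the text: points in the same cell collapse to one voxel (quantization loss), and empty space still occupies storage (redundancy).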

### multi-view representation

Multi-view representation is a collection of a series of two-dimensional images obtained by imaging a three-dimensional target from different perspectives. The three-dimensional target can be reconstructed using these perspective images.

# 3D public dataset

## Indoor and outdoor data set

For two-dimensional target detection, commonly used data sets include ImageNet, PASCAL VOC, MS COCO, etc. Three-dimensional target detection also has corresponding data sets used to evaluate the performance of the algorithm. This chapter introduces commonly used data sets from both indoor and outdoor aspects, and introduces some evaluation criteria for three-dimensional target detection algorithms.

The SUN RGB-D dataset is proposed to achieve high-level scene understanding tasks and contains 47 different indoor scenes and 19 object categories.

The LineMod data set is an indoor industrial supplies data set. It contains 15 types of textureless objects and corresponding 3D models. There are 15 videos containing 18,273 images. Each image is marked with the true 6D posture of the object, and the camera is provided. internal parameters.

The NYU Depth data set is a data set released by New York University for indoor target detection tasks. This data set divides indoor scenes into various categories such as kitchen, living room, bathroom, bedroom, etc., and uses Kinect sensors to collect image data.

The KITTI data set is one of the most commonly used public data sets for research on autonomous driving environment perception algorithms. It can be used for research on computer vision algorithms such as 2D/3D target detection, visual ranging, and target tracking.

The NuScenes dataset is a large-scale autonomous driving dataset. Different from the KITTI data set, this data set not only contains image data and laser point cloud data, but also radar data.

The PASCAL 3D+ data set is a public data set for 3D target annotation based on PASCAL VOC 2012. It is mainly aimed at three-dimensional target detection and attitude estimation tasks.

## 3D target recognition data set

Three-dimensional data sets are the basis for the research and evaluation of three-dimensional vision algorithms. There are currently many public three-dimensional data sets [24]. Since this article focuses on the tasks of 3D target detection and 3D target recognition, this section only introduces commonly used data sets for these two tasks, as shown in Figure 1.5.

ModelNet dataset ModelNet dataset [15] is a large-scale 3D CAD model dataset published by Princeton University Vision & Robotics Labs in 2015, containing approximately 662 categories with a total of 120,000 3D CAD models. At the same time, ModelNet also contains two subsets with different numbers of categories, namely ModelNet10 and ModelNet40. ModelNet10 contains a total of 4,899 CAD models of 10 types of targets, and all models are manually aligned along the gravity axis. ModelNet40 contains a total of 12,311 CAD models of 40 types of targets, without pose alignment. Currently, ModelNet10 and ModelNet40 are the most commonly used benchmark data sets for 3D object recognition.

ShapeNet dataset The ShapeNet dataset [25] is a large-scale three-dimensional shape dataset jointly released in 2015 by Stanford University, Princeton University, and the Toyota Technological Institute at Chicago. It contains approximately 3,135 categories and a total of 220,000 synthetic three-dimensional models. ShapeNet also contains two subsets, ShapeNetCore and ShapeNetSem. ShapeNetCore contains approximately 51,300 3D models in 55 categories and is mainly used for evaluating 3D shape retrieval. ShapeNetSem contains approximately 270 categories with a total of 12,000 three-dimensional models, which include the targets' physical size information in addition to category labels.

Sydney Urban Objects dataset The Sydney Urban Objects dataset [26] is a three-dimensional point cloud dataset released by the University of Sydney in 2013. The data were obtained by scanning city streets in Sydney's central business district with a Velodyne HDL-64E lidar, and cover common urban street objects such as vehicles, pedestrians, and trees. The dataset contains point cloud data for 631 targets in 14 categories. Since the target point clouds are cropped from full scenes, they contain background interference; moreover, because the lidar scans from only one viewpoint, the target shapes in the point clouds are incomplete. For recognition tasks, this dataset is therefore more challenging than datasets such as ModelNet and ShapeNet.

ScanNet dataset The ScanNet dataset [27] is a three-dimensional dataset of real indoor scenes jointly released by Stanford University, Princeton University, and the Technical University of Munich in 2017. 707 different indoor scenes were recorded with RGB-D depth cameras, ultimately yielding 1,500 scans with up to 2.5 million views of data, annotated with camera 3D poses, surface reconstructions, and instance-level semantic segmentation; it can be used to evaluate tasks such as scene classification, semantic segmentation, and target detection. In addition, ScanNet also provides a three-dimensional target recognition dataset containing 49 categories in a 32×32×32 voxel representation; each sample is rotated 12 times, giving a training set of 111,660 samples and a test set of 31,272 samples, and the target data also contain background interference.

## 3D target detection data set

KITTI dataset The KITTI dataset [16] is a large dataset jointly released by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute in 2012, and is currently the world's largest computer vision algorithm evaluation dataset for autonomous driving applications. All data are collected by LiDAR, high-resolution video cameras, GPS/IMU, and other sensors mounted on a vehicle, covering urban, rural, and highway scenes, and can be used to evaluate algorithms for tasks such as stereo vision, optical flow estimation, target detection, target tracking, and semantic segmentation. The 3D target detection subset provides synchronized RGB images, point cloud data, camera calibration parameters, and 2D and 3D box annotations, comprising 7,481 training frames and 7,518 test frames, focused mainly on cars (Cars), pedestrians (Pedestrians), and cyclists (Cyclists). Targets in the dataset exhibit some occlusion and truncation; based on occlusion level and target scale, three evaluation benchmarks are defined: easy, moderate, and hard. In addition, the 3D target detection dataset also provides detection performance evaluation under the Bird's Eye View.

SUN RGB-D Dataset The SUN RGB-D data set [28] is an RGB-D image data set released by the Vision and Robotics Laboratory of Princeton University in 2015. By using multiple depth sensors such as RealSense, Xtion, and Kinect to image multiple indoor scenes, a total of 10,355 RGB-D images were acquired. The data set contains 47 scene categories and about 800 categories of targets, and can be used for multiple evaluation tasks such as scene classification, semantic segmentation, and three-dimensional detection.

# Introduction to 3D target recognition technology (current situation)

Considering that 3D target detection technology is based on 3D target recognition technology to a certain extent, this section will first introduce the current research status of 3D target recognition technology.

The essence of three-dimensional target recognition is to determine the target's category information by analyzing three-dimensional target data. It can be summarized as **two stages**:

(1) Feature extraction.

(2) Classification and identification.

The core is to obtain a discriminative feature representation of the target's shape. According to the representation form of the three-dimensional data, **three-dimensional target recognition methods based on deep learning** are mainly divided into:

(1) Method based on voxel convolutional neural network.

(2) Method based on projection graph convolutional neural network.

(3) Methods based on point cloud neural networks.

(4) Learning method based on mesh representation.

## Learning method based on voxel convolutional neural network

Methods based on three-dimensional voxel convolutional neural networks take the voxel-grid representation of three-dimensional data as input and use a voxel convolutional neural network for feature learning and classification. They can be seen as the generalization of target recognition networks from the two-dimensional image domain to the three-dimensional domain: the input data is extended from pixels to voxels, and the convolution operation from two-dimensional to three-dimensional convolution.
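The extension from 2D to 3D convolution mentioned above can be illustrated with a naive direct implementation; this is a teaching sketch (no padding, single channel), far slower than real voxel CNN kernels:

```python
import numpy as np

def conv3d(volume, kernel):
    """Naive 'valid' 3D convolution: the direct extension of 2D convolution,
    sliding a d x h x w kernel over a D x H x W voxel grid."""
    D, H, W = volume.shape
    d, h, w = kernel.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+d, j:j+h, k:k+w] * kernel)
    return out

volume = np.zeros((8, 8, 8))
volume[3:5, 3:5, 3:5] = 1          # a small occupied cube in the voxel grid
kernel = np.ones((3, 3, 3))        # box-sum kernel
out = conv3d(volume, kernel)
print(out.shape)  # (6, 6, 6)
```

The cubic growth of the sliding window is exactly why the text later discusses sparse and octree-based variants.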

The pioneering work of this type is 3DShapeNets, proposed by Wu et al. [15] in 2015. This method treats a three-dimensional shape as a binary probability distribution on a three-dimensional voxel grid: each voxel takes the value 0 or 1, where 1 means the voxel contains target data and 0 means it does not, yielding a binary voxel-grid representation of the three-dimensional data. This voxel representation is fed into a three-dimensional Convolutional Deep Belief Network (CDBN) for target recognition and best-view prediction (as shown in Figure 1.6). The whole network adopts a two-stage pre-training/fine-tuning scheme similar to deep belief networks. The authors also released the now widely used three-dimensional dataset ModelNet. Maturana and Scherer [36] further studied three different voxel-grid representations and proposed VoxNet (Figure 1.7), a relatively simple voxel convolutional neural network that is easier to train and can be trained directly with the standard BP algorithm; its performance on the ModelNet dataset greatly exceeds 3DShapeNets. Brock et al. [37] introduced Inception, residual connections, and other techniques from image recognition to build a deep voxel convolutional residual network, VRN (Voxception-ResNet), and combined multiple VRNs into a 45-layer recognition network that achieved the current best performance on the ModelNet dataset.

Some work uses multi-task learning to improve the recognition performance of voxel convolutional networks. Sedaghat et al. [38] added a target-orientation prediction branch to the classification network, jointly learning the main classification task and the auxiliary orientation task; their network ORION performs well on ModelNet10, Sydney Urban Objects and other datasets. Qi et al. [39], after analyzing the shortcomings of existing voxel convolutional networks, proposed an improved architecture whose generalization is raised through auxiliary training, NIN (Network In Network) and related techniques. Zhi et al. [40, 41] combined the advantages of both approaches and designed LightNet, a relatively shallow network that couples NIN with multi-task learning, improving computational efficiency while maintaining recognition performance.

In recent years, with the development of unsupervised learning represented by generative models, some work has used unsupervised learning to learn 3D features without labels. Wu et al. [42] were the first to introduce the generative adversarial mechanism into voxel convolutional networks, proposing the unsupervised 3D generative adversarial network 3D-GAN and demonstrating the feasibility of adversarial learning for 3D target recognition. Sharma et al. [43], inspired by the denoising autoencoder (DAE), designed the voxel fully convolutional autoencoder VConv-DAE, which learns 3D shape feature representations from noisy data by estimating voxel-grid occupancy.

Because three-dimensional voxel representations are sparse and redundant, some work uses sparse voxel representations to improve the efficiency of voxel convolutional networks. Riegler et al. [44] proposed dividing 3D space hierarchically with an octree, using voxel representations of different resolutions for local point clouds of different densities to obtain a mixed-resolution voxel representation; the team also adapted operations such as convolution and pooling and built the octree network structure OctNet, which significantly improves efficiency and supports inputs with up to 256³ voxel resolution. However, its efficiency still degrades severely as the voxel resolution grows. Graham et al. [45] proposed a 3D sparse convolutional network that performs convolutions only on active voxels and their neighbors, skipping inactive (generally "empty") voxels and greatly improving the efficiency of 3D convolution; however, as the convolutional layers deepen, the active voxels keep spreading, the sparsity of the representation keeps decreasing, and efficiency drops sharply. Wang et al. [46] drew on the strengths of both works and proposed O-CNN, an octree-based convolutional network that represents the 3D shape as an octree and performs 3D convolutions only on the sparse leaf nodes lying on the shape's surface; this greatly improves network efficiency, though the method is somewhat inflexible in network design.

## Learning method based on projection-image convolutional neural network

Learning methods based on projection-image convolutional networks convert the 3D data into one or more 2D images via projection, turning the 3D vision task into a 2D image task; features are learned from and fused across the 2D projection images to obtain a feature representation of the 3D object. Projection schemes include multi-view projection, panoramic projection and slice projection.

The multi-view projection approach projects the 3D shape from different viewing angles around the object to obtain a series of 2D projection images; research mainly focuses on fusing the feature information across views more effectively. The classic work is the Multi-View Convolutional Neural Network (MVCNN) proposed by Su et al. [47] in 2015, shown in Figure 1.8. It places virtual cameras at different viewpoints around the 3D model to obtain multiple projection images, extracts features from each image with a convolutional network whose parameters are shared across views, and finally fuses the per-view feature maps with a view-pooling layer to obtain the shape's feature representation. Because the image features are extracted with VGGNet pre-trained on ImageNet [14], they are highly expressive, and MVCNN's 3D recognition and retrieval performance on ModelNet40 far exceeds 3DShapeNets [15].
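The view-pooling step can be sketched very compactly: given one descriptor per rendered view, MVCNN-style pooling takes an element-wise maximum over the view axis. The view count and feature dimension below are illustrative, not MVCNN's actual configuration:

```python
import numpy as np

def view_pool(view_features):
    """MVCNN-style view pooling: element-wise max over the view axis.

    view_features: (V, C) array of per-view descriptors.
    Returns a single (C,) descriptor for the whole 3D shape.
    """
    return view_features.max(axis=0)

views = np.random.rand(12, 256)          # e.g. 12 rendered views, 256-D each
shape_descriptor = view_pool(views)      # (256,)
```

Because max is symmetric, the fused descriptor does not depend on the order in which the views are presented.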

Qi et al. [39] added multi-resolution 3D convolution kernels on top of MVCNN to capture multi-scale information, further improving performance on ModelNet40. Johns et al. [48] decomposed the multi-view image sequence into groups of image pairs, classified each pair independently with a convolutional network, and combined all results by learning a contribution weight for each pair; this method is more accurate and does not depend on the positions of the views. Inspired by human visual cognition, Ma et al. [49] proposed view saliency to measure each view's contribution and, based on it, built the multi-view network VS-MVCNN. Bai et al. [50] designed GIFT, a real-time 3D shape search engine based on multi-view projection learning with high retrieval efficiency. Cao et al. [51] proposed a spherical projection that maps the 3D shape onto multiple "longitude strips" and "latitude strips". Wang et al. [52] proposed a recurrent clustering-and-pooling operation that extends MVCNN's single-level pooling to multiple levels, enriching the fused features at a significantly higher computational cost.

The panoramic projection approach obtains a single panoramic image by cylindrical projection around the 3D object; only one image needs processing, so the computation is small, but the method depends heavily on the choice of the cylinder's main axis. Shi et al. [53] first projected the 3D shape onto a surrounding cylinder and then unrolled the cylinder's side surface into a single 2D panorama. To remove the influence of object rotation, the authors designed a row-wise max-pooling (RWMP) layer that yields a rotation-invariant feature representation of the panorama. However, the cylindrical projection requires a specified main axis, making the method sensitive to objects whose poses are not normalized, and its rotation invariance only covers rotation about that axis. Sfikas et al. [54] first normalized the pose of the 3D model with the reflective-symmetry-based SYMPAN algorithm [55] and then projected the model in the spatial and orientation domains respectively; this method has better rotation invariance than that of Shi et al. [53]. The team later obtained a more expressive model by ensembling multiple such networks [56], achieving better 3D target recognition performance.

The slice projection approach "slices" the 3D shape in parallel along a given direction to obtain multiple slice images. Although efficient, the slice representation loses much information and generally has low accuracy. Xu et al. [57] "sliced" the 3D shape with 16 planes parallel to a coordinate plane and processed the slice images with a two-dimensional convolutional neural network. Gomez-Donoso et al. [58] used three planes:

In addition, some methods [59, 60] represent 3D shapes with geometry images; they likewise train 2D convolutional networks in the image domain and achieve good results on non-rigid 3D target recognition.

## Learning method based on point cloud network

Learning methods based on point cloud networks take the point cloud representation directly as input and learn the 3D target's feature representation with a deep network that can process point clouds. These methods require no voxelization, multi-view projection or other transformation of the point cloud data, better preserve the original information of the 3D data, and have become a hot research direction.

Qi et al. [61] proposed PointNet in 2017, the first deep network that processes point cloud data directly (Figure 1.9). It performed well on both 3D target recognition and point cloud semantic segmentation, setting off a research boom in this direction. PointNet organizes the point cloud as an N × 3 array (N is the number of points), first extracts per-point features with a shared-parameter multi-layer perceptron (MLP), and then fuses all per-point features with a max-pooling layer to obtain a global feature for the whole cloud. To handle the disorder of scattered point clouds, the authors introduced an order-insensitive symmetric function into the network and designed T-Net, a small network that rotates the point cloud. Although the network is built around MLPs and has a simple structure, its design ideas remain a very valuable reference for point cloud networks. Yang et al. [62] built a deep autoencoder on PointNet for unsupervised point cloud feature learning, exceeding the earlier unsupervised networks 3D-GAN [42] and VConv-DAE [43] on ModelNet. Achlioptas et al. [63] introduced generative adversarial learning into PointNet and designed a deep generative network for point clouds, also surpassing 3D-GAN [42] and VConv-DAE [43] on ModelNet.
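The core trick, a shared per-point transform followed by a symmetric (order-insensitive) max-pool, can be demonstrated with a toy stand-in for the shared MLP; the single random weight matrix below is purely illustrative, not PointNet's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "shared MLP": the same weight matrix applied to every point.
W = rng.standard_normal((3, 64))

def pointnet_global_feature(points):
    per_point = np.maximum(points @ W, 0.0)   # shared transform + ReLU, (N, 64)
    return per_point.max(axis=0)              # max-pool: symmetric over points

pts = rng.standard_normal((128, 3))
f1 = pointnet_global_feature(pts)
f2 = pointnet_global_feature(pts[rng.permutation(128)])
assert np.allclose(f1, f2)  # shuffling the points leaves the feature unchanged
```

The final assertion holds because max-pooling is a symmetric function: it is exactly this property that lets PointNet consume unordered point sets.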

To address PointNet's lack of local structure information, Qi et al. soon released an upgraded version, PointNet++ [64]. PointNet++ defines spherical neighborhoods of points in Euclidean space and designs point cloud sampling and grouping operations; it applies PointNet within each spherical neighborhood to extract features, obtaining local neighborhood information around the sampled points. The authors also designed two grouping strategies, multi-scale grouping and multi-resolution grouping, which improve the network's generalization to non-uniform point clouds. However, PointNet++ has a complex network structure and heavy memory consumption, which limits its use on large-scale point clouds. Wang et al. [65] designed the network module EdgeConv, which builds a local neighborhood of each point via KNN and performs a convolution-like operation between points to capture local geometry. Li et al. [66] proposed SO-Net on top of PointNet: it models the spatial distribution of the point cloud with a self-organizing map [67] and adjusts the network's receptive field via k-nearest-neighbor search from points to SOM nodes, improving the network's local perception. To address the inability of PointNet's T-Net module to cope with large rotations, Jiang et al. [68], inspired by the SIFT algorithm [69], proposed a point cloud framework based on the PointSIFT operator, which encodes information in eight key directions around each point and obtains multi-scale representations by stacking multiple orientation-encoding units; PointSIFT has stronger feature expression and is more robust to factors such as rotation and scale.
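The sampling and grouping operations mentioned above can be sketched in a few lines: greedy farthest point sampling picks well-spread centroids, and a ball query collects each centroid's spherical neighborhood. Point counts and the radius are illustrative; this is a plain NumPy sketch, not PointNet++'s optimized implementation:

```python
import numpy as np

def farthest_point_sample(points, k):
    """Greedy farthest point sampling: return k indices into (N, 3) points."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())              # point farthest from all chosen
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def ball_query(points, center, radius):
    """Indices of all points within `radius` of `center` (spherical neighborhood)."""
    return np.where(np.linalg.norm(points - center, axis=1) < radius)[0]

pts = np.random.rand(1024, 3)
centroids = farthest_point_sample(pts, 32)
group = ball_query(pts, pts[centroids[0]], 0.2)  # one local neighborhood
```

In PointNet++ a shared PointNet is then run inside each such neighborhood, so the per-centroid feature summarizes local rather than global geometry.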

Other works design point cloud networks from different ideas. Klokov et al. [70] proposed KD-network, which organizes the point cloud with a KD-tree and builds a computation graph that simulates convolutional network operations on the tree; limited by the KD-tree's scale, KD-network lacks flexibility and is currently applied only to small point clouds. Savchenkov [71] defined neighborhoods of the points to be processed, established relations between points within each neighborhood, and obtained a point cloud feature representation by iteratively merging features over multiple random point selections; the model has extremely few parameters but weaknesses in candidate-point selection and spatial feature extraction, and its performance is average. Hua et al. [72] proposed point-wise convolution, a convolution operation that applies directly to point clouds, and built point cloud networks for 3D semantic segmentation and target recognition on top of it. Li et al. [73] proposed a learnable transformation (X-transformation) that converts the input point order into a canonical order, after which standard convolutions process the reordered points; the resulting network, PointCNN, outperforms PointNet on datasets such as ModelNet [15] and ScanNet [27].

## Learning method based on mesh representation

Learning methods based on mesh representation take a polygon mesh as input and use graph-based methods or spectral analysis to learn the geometry of the object's surface; they are currently applied mainly to non-rigid objects (such as the human body). Masci et al. [74] established a geodesic polar coordinate system on non-rigid surfaces and constructed a geodesic convolutional neural network (Geodesic CNN), but the method depends on the performance of fast marching algorithms and lacks an effective pooling layer. Bruna et al. [75] extended the convolutional neural network to the more general spectral domain, proposing a spectral CNN that applies directly to manifold surfaces represented as graphs. Boscaini et al. [76] extended convolutional networks to non-Euclidean space with an anisotropic convolutional neural network that effectively learns dense correspondences between non-rigid shapes. To address the spectral CNN's lack of effective pooling, Yi et al. [77] proposed SyncSpecCNN, which introduces spectral dilated convolution kernels and a spectral transformation network.

Beyond the above methods, some works combine multiple approaches for higher target recognition performance. Hegde and Zadeh [78] combined a multi-view network with a voxel convolutional network and outperformed either network alone. Bu et al. [79] used a convolutional neural network and a convolutional deep belief network to extract features from the multi-view and voxel representations respectively, capturing both the geometric and the visual information of the 3D shape. These methods achieve higher recognition accuracy, but network complexity increases greatly.

# Introduction to 3D target detection technology (current situation)

The task of 3D target detection is to find as many objects of interest as possible in a 3D scene and determine their 3D position, size and pose. Traditional methods rely mainly on hand-crafted point cloud features, achieving detection through point-cloud-to-model registration or sliding windows combined with classifiers [80]. In recent years, deep learning methods have developed rapidly and become the research mainstream. Since detection on RGB/RGB-D images leans toward 2D image processing, this section focuses on work on point cloud data (and its fusion with RGB images). The main methods can be divided into learning methods based on voxel convolutional networks, on forward/bird's-eye views, on point cloud networks, and on multi-sensor data [21].

## Learning method based on voxel convolutional network

Methods based on voxel convolutional neural networks represent the 3D scene with a 3D grid and then apply a 3D convolutional network for detection. They can borrow design ideas from 2D image detection networks, but their computation and memory consumption are huge, with much redundant calculation.

Song et al. [81] converted RGB-D images into a 3D voxel representation and built a 3D region proposal network and a recognition network for objectness detection and target recognition respectively, achieving good results on indoor 3D detection. Li et al. [82] used a binary voxel representation of lidar point cloud scenes and a single-stage detector based on a 3D voxel fully convolutional network to estimate target position and pose; however, the method is very slow, with a maximum processing speed of 1 fps. To improve efficiency, Engelcke et al. [83] used prior information to fix the detection-box size for each target class, simplifying the network structure, and also applied a sparse convolution algorithm to reduce model complexity.

## Learning method based on forward/bird’s-eye view

Methods based on the forward/bird's-eye view project the point cloud onto a forward view or bird's-eye view to obtain a 2D projection image, apply a method similar to 2D detection networks, and estimate the target's 3D bounding box by parameter regression. These methods are computationally efficient, and occlusion is rare in the bird's-eye view; however, the projection loses much of the target's geometric information, especially for small targets such as pedestrians and bicycles.

Li et al. [84] used cylindrical projection to obtain a 2D projection map whose two channels encode the height and distance of each point, then estimated the vehicle's 3D bounding box with a fully convolutional network. Building on this, Minemura et al. [85] proposed a fully convolutional architecture with dilated convolutions [86] that cuts computation time by about 30% at the cost of some detection accuracy. Beltrán et al. [87] projected the point cloud into a bird's-eye view whose channels encode density, height, intensity and other information, detected targets with the classic Faster R-CNN [6] network, and estimated the 3D bounding box in a post-processing stage. To achieve real-time processing, Simon et al. [88] used the efficient YOLO network [5] as the base detector on the bird's-eye view and proposed an angle-estimation method for 3D bounding boxes; the resulting Complex-YOLO runs at 50 fps. Zeng et al. [89] proposed a pre-RoI-pooling convolution technique that moves most of the repeated convolution operations in a bird's-eye-view detector to before the RoI pooling operation, improving runtime efficiency. Feng et al. [90] exploited epistemic uncertainty in the bird's-eye-view detector and aleatoric uncertainty in the observation noise to improve detection accuracy. Yang et al. [91] divided the point cloud along three mutually perpendicular directions to obtain a 36-channel bird's-eye view and used the single-stage 2D detector RetinaNet [92] to improve performance on small targets; the team later introduced high-definition map information so the network learns scene semantics, further improving performance [93].
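A minimal sketch of the multi-channel bird's-eye-view encoding used by these methods: each ground-plane cell accumulates point density and keeps the maximum height and intensity. Cell size and extent are illustrative; real detectors tune them per dataset:

```python
import numpy as np

def bev_encode(points, intensity, cell=0.1, extent=10.0):
    """Project an (N, 3) point cloud to a 3-channel bird's-eye-view image.

    Channel 0 = point density, 1 = maximum height, 2 = maximum intensity,
    over a square [-extent, extent) x [-extent, extent) ground region.
    """
    n = int(2 * extent / cell)
    ix = np.clip(((points[:, 0] + extent) / cell).astype(int), 0, n - 1)
    iy = np.clip(((points[:, 1] + extent) / cell).astype(int), 0, n - 1)
    bev = np.zeros((3, n, n), dtype=np.float32)
    for x, y, z, r in zip(ix, iy, points[:, 2], intensity):
        bev[0, y, x] += 1.0                       # density
        bev[1, y, x] = max(bev[1, y, x], z)       # max height
        bev[2, y, x] = max(bev[2, y, x], r)       # max intensity
    bev[0] = np.log1p(bev[0])                     # compress the density channel
    return bev

pts = np.random.rand(500, 3) * np.array([10.0, 10.0, 2.0])
bev = bev_encode(pts, np.random.rand(500))        # (3, 200, 200) pseudo-image
```

The resulting pseudo-image can then be fed to any 2D detector (Faster R-CNN, YOLO, RetinaNet), which is exactly the appeal of this family of methods.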

## Learning method based on point cloud network

Methods based on point cloud networks take the point cloud directly as input and extract its features with a point cloud network. They lose no geometric information of the input, but large-scale point clouds make the computation heavy, memory consumption large, and efficiency low. VoxelNet, proposed by Zhou et al. [94] (Figure 1.10), first partitions the point cloud on a regular 3D grid into D × H × W voxels and uses a PointNet-based network to extract features from the points inside each non-empty voxel; all voxels together form a multi-channel 3D voxel representation, which a voxel convolutional network convolves vertically into a multi-channel 2D bird's-eye-view feature map, on which a 2D detection network finally detects the targets. The voxel convolutional network in VoxelNet increases the network's computation and memory consumption; Yan et al. [95] replaced it with a sparse convolutional neural network [45], improving runtime efficiency and reducing memory consumption.
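The "per-voxel feature extraction on non-empty voxels only" step can be sketched as follows; the per-voxel mean here is a deliberately simple stand-in for VoxelNet's per-voxel PointNet, and the voxel size is an illustrative choice:

```python
import numpy as np
from collections import defaultdict

def group_by_voxel(points, voxel_size=0.2):
    """Bucket points by integer voxel index; only non-empty voxels are stored,
    mirroring how VoxelNet runs its per-voxel feature extractor only where
    data exists. Returns {voxel index tuple: feature vector}."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple((p // voxel_size).astype(int))
        buckets[key].append(p)
    # Stand-in for the per-voxel PointNet: the mean of the points in the voxel.
    return {k: np.mean(v, axis=0) for k, v in buckets.items()}

pts = np.random.rand(2000, 3)
voxel_feats = group_by_voxel(pts)   # sparse dict, one entry per occupied voxel
```

In the full pipeline these sparse per-voxel features are scattered back into a dense D × H × W grid before the 3D (or sparse 3D) convolutions.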

## Methods based on multi-sensor data

Learning methods based on multi-sensor data jointly exploit the color information of RGB images and the geometric information of point clouds for 3D target detection, and can be divided into multi-view fusion methods and multimodal network methods.

### Multi-view fusion method

These methods project the point cloud onto the forward/bird's-eye view and fuse the projection with the RGB image for detection. Chen et al. [96] first estimated 3D candidate regions on the bird's-eye view, then fused each candidate's features from the forward view, the bird's-eye view and the RGB image, and finally used the fused features to refine the target's position, size, pose and other parameters. Ku et al. [97] proposed the AVOD network based on MV3D: it first fuses the features corresponding to each preset 3D anchor on the RGB image and the bird's-eye view, then uses the fused features to generate 3D candidate regions, further improving detection performance; the method also applies a pyramid structure in the image feature-extraction network, which helps detect small targets. Liang et al. [98] extracted features from the RGB image and the bird's-eye view with ResNet [4], projected the RGB feature map onto the bird's-eye-view feature map through a parametric continuous convolutional network (PCCN), and finally detected objects on the fused bird's-eye-view feature map.

### Multimodal network approach

These methods do not project the point cloud; instead, a point cloud network and a 2D convolutional network learn features from the point cloud and the RGB image respectively. F-PointNet, the 3D detection network proposed by Qi et al. [99], first detects targets in the RGB image with a 2D detector, uses the geometric relationship between the RGB image and the point cloud to obtain the frustum in 3D space corresponding to each 2D box, and then applies a PointNet++-based network [64] to the points inside the frustum for 3D instance segmentation and 3D bounding-box estimation. Pre-detecting on the RGB image reduces the scale and difficulty of the subsequent point cloud processing, but the serial structure makes the method depend heavily on the 2D detector's results and vulnerable to factors such as lighting and occlusion. Xu et al. [100] combined characteristics of F-PointNet [99] and AVOD [97] in their PointFusion network, first fusing the point cloud's global features, the RGB image's global features and per-point features, then estimating the 3D bounding box from the fused features with a point cloud network. Du et al. [101] proposed a general vehicle 3D detection pipeline: perform 2D detection on the RGB image, collect the point cloud corresponding to each 2D box, obtain 3D candidate boxes with the RANSAC algorithm [102] and vehicle 3D CAD models, and finally refine the parameters with a 2D convolutional network; however, the method depends heavily on the vehicle model, is sensitive to noise, and its template matching is time-consuming.

## Advantages and Disadvantages

Currently, various three-dimensional target detection algorithms have their own advantages and disadvantages:

• Methods based on forward/bird's-eye views and multi-view fusion convert the problem into the 2D image domain, which brings efficiency advantages, but the projection inevitably loses information. Designing better bird's-eye-view representations, or even learning them, will be a future research direction.

• Learning methods based on point cloud networks lose no input information and have great potential for pure point cloud detection, but their application to large-scale scenes is limited; using pre-detection to reduce the scale processed by the point cloud network will be a future research direction. Chapter 2 of this thesis addresses the efficiency issues of this class of methods.

• Multimodal network methods exploit more kinds of information through multi-source data and have great potential in terms of performance, but current algorithms suffer from simplistic feature fusion and heavy computation; better feature-fusion methods are an important research direction. Chapter 3 of this thesis addresses the feature-fusion problem of multimodal networks.

# Three-dimensional target detection algorithm based on point cloud data

## Problem description and analysis

The task of 3D target detection is to find as many targets of interest as possible from the 3D scene and determine the target’s position, size and posture in the scene, as shown in Figure 2.1.

### Input data

The input data is a point cloud: a set of unordered 3D points, each recording its spatial coordinates (x, y, z). It is usually represented as an N × 3 matrix I, where N is the number of points and each row of I holds one point's coordinates (x, y, z). Note that the point cloud is scattered and unordered: changing the order of the points (i.e., permuting the rows of I) does not change the point cloud itself.

### Output results

The output of 3D target detection is one or more targets' 3D bounding boxes with category and pose. A 3D bounding box can be written in many forms, such as the coordinates of its eight vertices. Since it is usually a rectangular cuboid, the common representation uses the box center (cx, cy, cz), the box dimensions (h, w, l), and the rotation angles (θ, ϕ, ψ) about the coordinate axes. The task of 3D target detection is therefore to classify each target into one of K classes (K is the number of classes to detect) and estimate the bounding-box parameters (cx, cy, cz, h, w, l, θ, ϕ, ψ). Since lidar and other point cloud acquisition devices are generally mounted parallel to the ground plane, for simplicity only the rotation angle θ about the vertical axis is usually considered.
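The two box representations mentioned above are interchangeable; a sketch of converting the (center, size, yaw) form to the eight-vertex form, under the single-angle simplification. The axis convention (l along x, w along y, h along z) is a common but not universal assumption:

```python
import numpy as np

def box_corners(cx, cy, cz, h, w, l, theta):
    """Return the (8, 3) corners of a 3D box given by its center (cx, cy, cz),
    dimensions (h, w, l) and a single yaw angle `theta` about the vertical
    (z) axis, as in the simplified parameterization above."""
    x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2
    y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2
    z = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) * h / 2
    c, s = np.cos(theta), np.sin(theta)
    rx = c * x - s * y                 # rotate the footprint in the x-y plane
    ry = s * x + c * y
    return np.stack([rx + cx, ry + cy, z + cz], axis=1)

corners = box_corners(0.0, 0.0, 0.0, 2.0, 1.8, 4.5, np.pi / 2)
```

Rotating by θ = π/2 swaps the roles of l and w in the ground plane, which is a quick sanity check for the yaw convention.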

### Evaluation criteria

Target detection performance is usually evaluated with mean Average Precision (mAP), the mean of the AP values over the target categories:

mAP = (1/K) Σᵢ APᵢ, i = 1, …, K

where APᵢ is the average precision (AP) of the i-th category, corresponding to the area under its precision-recall (PR) curve.
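A minimal sketch of the mAP computation just described, with AP taken as the plain area under the PR curve; note that benchmarks such as PASCAL VOC or KITTI apply their own interpolation rules on top of this definition:

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve
    (trapezoidal integration over recall)."""
    r = np.asarray(recall, dtype=float)
    p = np.asarray(precision, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(aps):
    """mAP: the mean of the per-class AP values."""
    return float(np.mean(aps))

# Toy PR curve: precision falls from 1.0 to 0.6 as recall grows.
ap_car = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])  # 0.8
```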

For a prediction result, both the classification and the localization must be judged. Localization accuracy is usually measured by the overlap IoU (Intersection over Union) [6, 106], which describes the degree of overlap between the predicted bounding box and the ground-truth bounding box:

IoU = Area of Overlap / Area of Union

Here Area of Overlap is the intersection of the predicted and ground-truth bounding boxes, and Area of Union is their union. For 2D image detection, the intersection and union in the formula are areas; for 3D detection, they are volumes. Figure 2.2 illustrates the IoU computation: the intersection of two 2D bounding boxes is a rectangle, while the intersection of two 3D bounding boxes is a polyhedron because of the 3D pose angle. The larger the IoU, the higher the overlap between the predicted and ground-truth boxes and the more accurate the predicted position information. When evaluating detectors, an IoU threshold is preset; only detections whose IoU exceeds the threshold and whose predicted category is correct count as true positives.
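A sketch of the IoU formula for axis-aligned boxes, which covers the 2D case and the yaw-free 3D case; rotated 3D boxes need polyhedron intersection, which this deliberately ignores:

```python
import numpy as np

def iou_axis_aligned(a, b):
    """IoU of two axis-aligned boxes given as (min_corner, max_corner) pairs.
    With 2-D corners the products are areas; with 3-D corners, volumes."""
    lo = np.maximum(a[0], b[0])                    # intersection lower corner
    hi = np.minimum(a[1], b[1])                    # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0, None))     # 0 if boxes are disjoint
    vol_a = np.prod(a[1] - a[0])
    vol_b = np.prod(b[1] - b[0])
    return inter / (vol_a + vol_b - inter)

box_a = (np.array([0.0, 0.0]), np.array([2.0, 2.0]))
box_b = (np.array([1.0, 1.0]), np.array([3.0, 3.0]))
print(iou_axis_aligned(box_a, box_b))  # 1 / 7 ≈ 0.1429
```

Two 2 × 2 squares overlapping in a unit square give intersection 1 and union 4 + 4 − 1 = 7, hence IoU = 1/7.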

# PointNet network

## Original paper:

Qi C R, Su H, Mo K, et al. PointNet: Deep learning on point sets for 3D classification and segmentation [C]. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Jul 2017: 4.

## some concepts

Multilayer Perceptron (MLP):

Candidate area:

Candidate 3D detection box:

Voxelization: divides the point cloud into a three-dimensional grid of voxels. By voxelizing three-dimensional space, spatial dependencies are introduced into the otherwise unordered point cloud.
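A minimal pure-Python sketch of this idea, assuming points are plain `(x, y, z)` tuples: each point is hashed to an integer grid cell of edge length `voxel_size`, so points in the same cell end up grouped together.

```python
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group (x, y, z) points into cubic voxels of edge length voxel_size.

    Returns a dict mapping integer voxel index (i, j, k) to the list of
    points that fall inside that voxel.
    """
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return voxels
```

The voxel indices give the point cloud a regular spatial structure, on which grid-based operations (e.g. 3D convolution) become applicable.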

Anchor, prior box, default box:

# PointNet++ Network

## Original paper:

[64] Qi C R, Yi L, Su H, et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space [C]. In Advances in Neural Information Processing Systems. 2017: 5105–5114.

### Algorithm process

(1) The algorithm first voxelizes the point cloud space into cells of equal scale.

(2) It then extracts point cloud features within each vertical column (pillar) of the sampled, partitioned space, and stacks them to generate a bird's-eye view (BEV).

(3) Finally, a multi-scale deep convolutional neural network extracts scene target features from the bird's-eye view to perform target classification and regression.
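The pillar-stacking step can be illustrated with a toy pure-Python version (an assumption for illustration, not the actual pipeline): points are scattered into an x-y grid and each cell keeps only the maximum point height. Real pipelines learn richer per-pillar features with a small network before scattering them into the BEV map.

```python
def points_to_bev(points, grid_size, cell):
    """Scatter (x, y, z) points into a grid_size x grid_size x-y grid.

    Each BEV cell stores the maximum z (height) of the points inside it,
    a toy stand-in for a learned per-pillar feature.
    """
    bev = [[0.0] * grid_size for _ in range(grid_size)]
    for x, y, z in points:
        i, j = int(x // cell), int(y // cell)
        if 0 <= i < grid_size and 0 <= j < grid_size:
            bev[i][j] = max(bev[i][j], z)
    return bev
```

The resulting 2D map is a regular image-like tensor, which is why ordinary 2D convolutional backbones can process it in step (3).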

### convolutional neural network

Purpose: feature extraction

Composition: A convolutional neural network usually consists of convolution layers, activation layers and pooling layers. It takes RGB image data as input and outputs a specific feature space of the image.

The role of each layer:

(1) The **convolution** operation is the most basic operation in a convolutional neural network. Its main parameters are the kernel size, stride, and padding; the kernel slides over the spatial coordinates (x, y) of the input image and computes the weighted sum of the values in the corresponding region.
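The weighted-sum operation described above can be sketched in pure Python as a "valid" (no padding, stride 1) 2D convolution; like most deep learning libraries, this sketch actually computes cross-correlation, i.e. the kernel is not flipped.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (cross-correlation) with stride 1, no padding.

    image and kernel are 2D lists of numbers; the output shrinks by
    (kernel height - 1) rows and (kernel width - 1) columns.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the kh x kw region anchored at (i, j)
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out
```

With padding and stride added as parameters, this is exactly the operation the parameters in (1) control.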

(2) The **activation layer** mainly introduces nonlinearity. The current mainstream activation functions are ReLU, Sigmoid, tanh, etc.

(3) Downsampling is performed by the **pooling layer**, whose main function is to reduce the spatial resolution of the feature map, since excessive image detail is not conducive to extracting high-level features. The main pooling operations are max pooling and average pooling; pooling reduces both the number of parameters and the feature-map resolution.
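Max pooling as described can be sketched as a 2×2, stride-2 window sliding over a 2D feature map, halving its resolution while keeping the strongest response in each window:

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over a 2D feature map (list of lists)."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            # Keep the maximum of each non-overlapping 2x2 window
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled
```

Average pooling replaces `max` with the mean of the four values; both variants carry no learnable parameters.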

## Introduction to several 3D networks

### PointNet

Scope of application:

PointNet is the first deep learning network that takes raw 3D point clouds directly as input, and many subsequent studies in 3D target detection use PointNet as their baseline. The algorithm takes an N×3 point cloud as input, first aligns the points in space through T-Net, maps them to a 64-dimensional space through a multi-layer perceptron (MLP), aligns them again, and finally maps each point to a 1024-dimensional space. Each point thus ends up with a 1024-dimensional vector, which is clearly redundant for a 3-dimensional point, so the algorithm applies max pooling (Max-Pooling): in each of the 1024 channels only the maximum over the N points is retained, yielding a single 1×1024 vector, the global feature of the N points. For classification, this global feature is fed directly into an MLP that outputs class probabilities. For segmentation, which requires per-point categories, the global feature is concatenated with each point's 64-dimensional feature, and an MLP then outputs the per-point classification probabilities. PointNet thus extracts one global feature from all the point cloud data. Figure 2-3 shows the PointNet structure.
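The channel-wise max pooling that produces PointNet's global feature can be sketched in one line of pure Python: given N per-point feature vectors of length C, keep the maximum over the N points in each channel. Because `max` is symmetric, the result is invariant to the ordering of the points, which is the key property for unordered point clouds.

```python
def global_max_pool(point_features):
    """Channel-wise max over N per-point feature vectors.

    point_features: list of N lists, each of length C.
    Returns one list of length C: the global feature of the point set.
    """
    return [max(channel) for channel in zip(*point_features)]
```

Shuffling the rows of `point_features` leaves the output unchanged, so the network's output does not depend on the (arbitrary) storage order of the points.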

### SSD

Scope of application: suited to multi-scale target detection tasks, which matches the large scale variation characteristic of point cloud data.

## references

[1] Zhan Weiqin. Three-dimensional target detection based on deep learning [D]. Changzhou University, 2021.

[2] Zhang Jie. Research on three-dimensional target detection methods in augmented reality [D]. Chongqing University, 2020.

[3] Ma Chao. Research on three-dimensional target detection and recognition technology based on deep neural network [D]. National University of Defense Technology, 2019.

[4] Yao Yue. Research and implementation of deep learning three-dimensional target detection method based on point cloud and image features [D]. Nanjing University of Science and Technology, 2020.