Flying Vehicle Perception under Complex Conditions: A Large-scale Open-source Suite and Benchmark Approach
Submitted to IEEE Transactions on Intelligent Transportation Systems
Xunkuai Zhou, Yijun Huang, Li Li, Jie Chen, and Ben M. Chen
Abstract
Flying vehicle perception in complex scenes presents a significant challenge. Existing works primarily focus on flying vehicle perception using RGB imagery, which limits their effectiveness in real-life applications, particularly under challenging conditions like low light and cluttered backgrounds. A promising approach is the integration of RGB and thermal infrared (RGBT) images, which provide complementary information and have shown potential in various computer vision tasks.
Progress in RGBT flying vehicle perception is impeded by the lack of a large-scale dataset and a comprehensive benchmark for evaluation. To address this research gap, we introduce an open-source benchmark suite called FT55k, which encompasses diverse scenarios and consists of over 55,000 spatially aligned RGBT image pairs with meticulously annotated ground truth, enabling comprehensive evaluation and exploration of algorithmic robustness.
We propose a series of baseline approaches that can be deployed on devices with varying computing capabilities, providing a solid foundation for further research. Extensive experiments were conducted on FT55k and seven other challenging public datasets, demonstrating the superiority of our proposed approaches over state-of-the-art methods.
This work presents the first comprehensive benchmark for multiple types of flying vehicle perception methods across multiple scenarios. We also make corrections to existing datasets and establish a new benchmark. Our method achieves a computational cost of only 0.49 BFLOPs. To the best of our knowledge, this is the first flying vehicle perception method with a computational cost below 0.5 BFLOPs. Moreover, our approach achieves an inference speed of 62.3 fps on edge computing devices, confirming its reliability and feasibility.
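Figures like 0.49 BFLOPs are typically obtained by summing the multiply-accumulate operations of every convolution in the network. The sketch below illustrates this common accounting convention for a single conv layer; the example layer dimensions are assumptions for illustration, not the actual layers or the exact counting rule behind the 0.49 BFLOPs figure.

```python
def conv2d_flops(h, w, c_in, c_out, k, stride=1):
    """Multiply-accumulate count for one conv layer (a common FLOPs
    convention; the exact accounting behind the 0.49 BFLOPs figure
    is not specified here). Assumes 'same' padding."""
    out_h, out_w = h // stride, w // stride
    return out_h * out_w * c_out * (k * k * c_in)

# Hypothetical example: a 3x3 stride-2 conv from 3 to 32 channels
# on a 256x256 input.
flops = conv2d_flops(256, 256, 3, 32, 3, stride=2)
print(flops / 1e9)  # ~0.014 billion multiply-accumulates
```

Summing this quantity over all layers (plus comparatively minor terms for activations and pooling) yields the network's total BFLOPs.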
Our datasets and demos are publicly accessible at http://www.fvp-ftnet.com/project/ftnet.

Our methods provide a more accurate prediction of flying vehicles in various challenging situations, i.e., low lighting conditions (second column), tangled jungle (fourth column), and tiny vehicles (all columns). The red numbers in the figure represent confidence scores. Please zoom in for the best view.
The Dataset of
FT55k
FT55k encompasses diverse scenarios and consists of over 55k spatially aligned RGBT image pairs with meticulously annotated ground truth, enabling comprehensive evaluation and exploration of algorithmic robustness.

Det-Fly
Visualization
The figure below illustrates the qualitative visualization of actual aerial UAV perception, where each column represents the same scene and each row represents the visualization of a specific method. In the third column, all methods except FTNet-h fail to detect the target. While all methods successfully detect the UAVs in the other three situations, our approach exhibits higher confidence scores. Based on the above analysis on Det-Fly, our method achieves superior accuracy and inference speed with minimal computational resource consumption, making it competitive for devices with weaker computational capabilities.

The Architecture of
FTNet-h
Our FTNet-h framework for accurate flying vehicle detection comprises five downsampling stages, followed by spatial attention to filter out redundant information. The SPP (Spatial Pyramid Pooling) module is employed to expand the receptive field. Additionally, PANet-like operations are applied to reduce the semantic gap between high-level and low-level semantic information. Multi-scale detection heads are utilized to generate the final output. The dx module is used during the downsampling process, where the d1, d2, d3, and d4 stages each employ one dx module. The d5 and C stages consist of a convolutional layer with two ResNet modules, while the r1 stage contains two ResNet modules. Here, ResNet×m in the dx module indicates the presence of m ResNet modules. The d1, d2, d3, and d4 stages contain 2, 8, 8, and 8 ResNet modules, respectively.
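The dx stages described above can be sketched as follows: each stage is a stride-2 convolution followed by m residual blocks, with m = 2, 8, 8, 8 for d1–d4 as stated in the text. This is a minimal PyTorch sketch under assumed channel widths and block layouts (the `ResBlock` internals and the 32/64/128/256 channel progression are hypothetical); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convs plus a skip connection (assumed layout)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

class DxStage(nn.Module):
    """One dx stage: a stride-2 conv halves the resolution, then m ResNet blocks."""
    def __init__(self, in_ch, out_ch, m):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(*[ResBlock(out_ch) for _ in range(m)])

    def forward(self, x):
        return self.blocks(self.down(x))

# d1..d4 with 2, 8, 8, 8 ResNet blocks, per the text; channel widths are assumptions.
stages = nn.Sequential(
    DxStage(3, 32, 2),
    DxStage(32, 64, 8),
    DxStage(64, 128, 8),
    DxStage(128, 256, 8),
)
x = torch.randn(1, 3, 256, 256)
y = stages(x)
print(tuple(y.shape))  # each stage halves the spatial resolution
```

After d1–d4, a 256×256 input is reduced to 16×16, after which the d5/C stages, spatial attention, SPP, and the PANet-like fusion described above would follow.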

Data
Distributions
(a) Target positional distribution across the three sub-datasets. Box positions are mostly concentrated in the central area of the picture. (b) Statistics of target sizes in our dataset. The dark blue points correspond to vehicle images whose width and height are less than 5% of the image size. The blue points represent samples whose sizes are less than 10% of the image size. The remaining red points are samples greater than 10%. Since the attitude of the camera may change frequently during flight, the bounding boxes exhibit a wide range of height-width ratios.


Paper
Citation
@article{zhou2024flying,
  title={Flying Vehicle Perception under Complex Conditions: A Large-scale Dataset and Benchmark Approach},
  author={Zhou, Xunkuai and Huang, Yijun and Li, Li and Chen, Jie and Chen, Ben M.},
  journal={IEEE Transactions on Intelligent Transportation Systems},
  year={2024},
  publisher={IEEE}
}
Related
Projects
- ADMNet: Anti-Drone Real-Time Detection and Monitoring
  IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
[IEEE]
