The field of computer vision has experienced substantial progress recently, owing largely to advances in deep learning, specifically convolutional neural nets (CNNs). Image classification, where a computer classifies or assigns labels to an image based on its content, can often see great results simply by leveraging pre-trained neural nets and fine-tuning the last few throughput layers.
Classifying and finding an unknown number of individual objects within an image, however, was considered an extremely difficult problem only a few years ago. This task, called object detection, is now feasible and has even been productized by companies like Google and IBM. But all of this progress wasn’t easy! Object detection presents many sizable challenges beyond what is required for image classification.
The overall architecture of the Convolutional Neural Network (CNN) source
The ultimate purpose of object detection is to locate important items, draw rectangular bounding boxes around them, and determine the class of each item discovered. Applications of object detection arise in many different fields including detecting pedestrians for self-driving cars, monitoring crops, and even real-time ball tracking for sports.
Dual priorities: object classification and localization
The first major complication of object detection is its added goal: not only do we want to classify image objects but also to determine the objects’ positions, generally referred to as the object localization task.
Speed for real-time detection
Object detection algorithms need to not only accurately classify and localize important objects, but they also need to be incredibly fast at prediction time to meet the real-time demands of video processing.
Multiple spatial scales and aspect ratios
For many applications of object detection, items of interest may appear in a wide range of sizes and aspect ratios. Practitioners leverage several techniques to ensure detection algorithms can capture objects at multiple scales and views.
Instead of selective search, Faster R-CNN’s updated region proposal network uses a small sliding window across the image’s convolutional feature map to generate candidate ROIs. Multiple ROIs may be predicted at each position and are described relative to reference anchor boxes.
Multiple feature maps
Single-shot detectors must place special emphasis on the issue of multiple scales because they detect objects with a single pass through the CNN framework. If objects are detected from the final CNN layers alone, only large items will be found as smaller items may lose too much signal during downsampling in the pooling layers.
Feature pyramid network
The feature pyramid network (FPN) takes the concept of multiple feature maps one step further. Images first pass through the typical CNN pathway, yielding semantically rich final layers. Then to regain better resolution, FPN creates a top-down pathway by upsampling this feature map. While the top-down pathway helps detect objects of varying sizes, spatial positions may be skewed. Lateral connections are added between the original feature maps and the corresponding reconstructed layers to improve object localization. FPN currently provides one of the leading ways to detect objects at multiple scales, and YOLO was augmented with this technique in its 3rd version.
Object detection is customarily considered to be much harder than image classification, particularly because of these five challenges: dual priorities, speed, multiple scales, limited data, and class imbalance. Researchers have dedicated much effort to overcoming these difficulties, yielding oftentimes amazing results; however, significant challenges persist.
For detailed information, click here!