MSRP-TODNet: a multi-scale reinforced region wise analyser for tiny object detection

Bikku, Thulasi; Sree, K. P. N. V. Satya; Thota, Srinivasarao; Kumar, Malligunta Kiran; Shanmugasundaram, P.

doi:10.1186/s13104-025-07263-7

Research Note
Open access
Published: 30 April 2025

MSRP-TODNet: a multi-scale reinforced region wise analyser for tiny object detection

Thulasi Bikku¹,
K. P. N. V. Satya Sree²,
Srinivasarao Thota³,
Malligunta Kiran Kumar⁴ &
…
P. Shanmugasundaram⁵

BMC Research Notes volume 18, Article number: 200 (2025) Cite this article

244 Accesses
Metrics details

Abstract

Objective

Detecting small, faraway objects in real-time surveillance is challenging due to limited pixel representation, affecting classifier performance. Deep Learning (DL) techniques generate feature maps to enhance detection, but conventional methods suffer from high computational costs. To address this, we propose Multi-Scale Region-wise Pixel Analysis with GAN for Tiny Object Detection (MSRP-TODNet). The model is trained and tested on VisDrone VID 2019 and MS-COCO datasets. First, images undergo two-fold pre-processing using Improved Wiener Filter (IWF) for artifact removal and Adjusted Contrast Enhancement Method (ACEM) for blurring correction. The Multi-Agent Reinforcement Learning (MARL) algorithm splits the pre-processed image into four regions, analyzing each pixel to generate feature maps. These are processed by the Enhanced Feature Pyramid Network (EFPN), which merges them into a single feature map. Finally, a Generative Adversarial Network (GAN) detects objects with bounding boxes.

Results

Experimental results on the DOTA dataset demonstrate that MSRP-TODNet outperforms existing state-of-the-art methods. Specifically, it achieves an mAP @0.5 of 84.2%, mAP @0.5:0.95 of 54.1%, and an F1-Score of 84.0%, surpassing improved TPH-YOLOv5, YOLOv7-Tiny, and DRDet by margins of 1.7%–6.1% in detection performance. These results demonstrate the framework’s effectiveness for accurate, real-time small object detection in UAV surveillance and aerial imagery.

Peer Review reports

Introduction

Object detection has advanced significantly, with applications in autonomous driving, real-world object recognition, and video surveillance. Deep Learning (DL) models, particularly Convolutional Neural Networks (CNNs), use regional proposals and masks for segmentation [1]. Faster R-CNN generates feature maps using a Deep Convolutional Neural Network (DCNN) backbone, followed by a detection-specific network [2]. Despite improvements in accuracy, these models struggle with detecting small objects due to pooling layers that cause pixel-level detail loss. Small object detection has gained research attention due to accuracy and reliability challenges [3]. Objects smaller than 32 × 32 pixels are difficult to detect, and limited data availability further hinders performance. Datasets like MS COCO, Pascal VOC, and VisDrone 2019 mainly contain larger objects, covering about 70% of images. Conventional DL models perform well on large objects but struggle with smaller ones due to high-resolution images from drones and surveillance cameras [4]. Feature Pyramid Network (FPN) enhances detection by merging multi-scale features, but its heuristic mapping limits precise small object recognition [5]. Small object detection remains challenging due to spatial detail loss from pooling layers in CNNs, limited representation of small targets in benchmark datasets, and the inability of traditional models like Faster R-CNN and YOLO to effectively detect objects smaller than 32 × 32 pixels in high-resolution surveillance or aerial imagery. This study addresses these issues by proposing MSRP-TODNet, a real-time framework designed to enhance multi-scale features, improve detection accuracy and recall for tiny and distant objects, and reduce false positives without increasing computational cost. Increasing feature resolution helps recover small object details. Some approaches use super-resolution techniques, but improving input image resolution increases computational costs [6, 7]. Generative Adversarial Networks (GANs) enhance small object feature resolution but may introduce artificial textures and artifacts, leading to false positives [8,9,10]. To address these issues, this paper presents a novel end-to-end framework combining two Artificial Intelligence (AI) methods for small object detection in Computer Vision (CV). The key contributions of this work are outlined below.

Related works

Several studies have explored small object detection across various domains. In [11], spatio-temporal information improves tracking and detection for video surveillance, but limited evaluation affects generalizability [12]. In [13], cascaded sparse queries enhance accuracy and efficiency in high-resolution small object detection for remote sensing and medical imaging, though data availability remains a challenge. In [14], YOLOv5 is optimized for infrared images, aiding security and search-and-rescue but lacks evaluation in visible spectrum scenarios. In [15], an object detection framework for autonomous driving improves safety but is not optimized for small object detection in other contexts. In [16], a model for aerial imagery enhances detection in agriculture and disaster response but struggles in other domains. In [17], a UAV-based small object detection model benefits surveillance but faces performance issues in complex environments. In [18], a scale-improvement pyramid network aids UAV image detection but struggles with computational constraints. In [19], a multi-scale feature fusion approach enhances defect detection and quality control but requires extensive fine-tuning. In [20], an anchor-free proposal generator simplifies detection but faces robustness challenges in cluttered environments. In [21], a Transformer-enhanced YOLOv5 improves drone-based detection but may suffer under real-time constraints. In [22], the EDN model employs extremely down sampled networks for fine-grained object detection but has high computational demands. In [23], a semantic segmentation model for remote sensing imagery uses foreground activation but may struggle with lighting and terrain variations.

This paper builds on these approaches by addressing small object detection challenges in UAV images, providing a scalable solution for aerial surveillance [25, 26].

Research methodology

In this section, the proposed research model named MSRP-TODNet explained in three distinct process such as (i) pre-processing, (ii) Region wise reinforced pixel analysis, and (iii) Feature maps processing, and object detection.

Two-fold pre-processing

Initially, we perform two-fold pre-processing for the acquired images such as noise removal, and contrast enhancement respectively.

Noise removal, artifact removal and blur correction

The Improved Wiener Filter (IWF) adaptively handles pixel noise and compression artifacts by dynamically estimating local variance, reducing visual distortion. Our approach for denoising involves the utilization of the Improved Wiener Filtering (IWF) method. The resulting filtered image can be customized based on the local variance within the image. The formulation IWF is

$$\text{IWF}\left(p\right)=\frac{\text{DfL}(p)}{\text{N}}+\text{w}*\left(p-\frac{\text{DfL}({p}^{2})}{\text{N}}\right)$$

(1)

From the above equation, $\text{DfL}(.)$ denotes the differencing filter of X and Y axis of the pixel matrix of the input $p$ image, ${p}^{2}$ denotes the input matrix square, and $\text{N}$ denotes the DfL filter one’s matrix, and the co-efficient of weight can be denoted as $w$ that can be formulated as,

$$w=\frac{\text{n}}{\text{n}+\updelta }$$

(2)

$$\text{n}=\frac{\text{DfL}({p}^{2})}{\text{N}}-\frac{\text{DfL}(p)}{\text{N}}*\frac{\text{DfL}(p)}{\text{N}}$$

(3)

From the above equation, the $\updelta$ is the hyper parameter which is customized to 0.05. At last, the smoothed and denoised image can be obtained as,

$$rf\left[p\right]=\text{IWF}\left(p, r,\updelta ,\text{N}\right),$$

(4)

where $rf\left[.\right]$ is the refined and noise free image, $N$ is the BF(.) size patch, $\updelta$ is the calibration parameter that is initialized to 0.01, and $r$ is the template radius of the given input image. Figure 1 represents the two-fold pre-processing.

Contrast enhancement

The ACEM is entrenched on parabolic function which are obtained from the given image input. The formulation of parabolic function is provided below,

$${\mathbb{Y}}=L{({\mathbb{X}}-U)}^{2}+M$$

(5)

From the above equation, the ${\mathbb{Y}}$ and ${\mathbb{X}}$ are the two coordinates of the parabolic function, $L$, $U$, and $M$ denotes the minimum (lower), maximum (upper), and mean values of the given image and the formulations are provided as follows,

$$L=\frac{{m}_{o}-{l}_{o}}{({u}_{i}-{l}_{i})({u}_{i}+{l}_{i}-2U)}$$

(6)

$$U=\frac{{{u}_{i}}^{2}(\Finv -{l}_{o})-\text{t}({m}_{o}-{l}_{o})+{{l}_{i}}^{2}({m}_{o}-\Finv )}{2[{u}_{i}\left({m}_{o}-{l}_{o}\right)-{m}_{i}\left({m}_{o}-{l}_{o}\right)+{l}_{i}({m}_{o}-\Finv )]}$$

(7)

$$M=l-L{({l}_{i}-U)}^{2}$$

(8)

From the above Eq. (6)–(8), ${m}_{o}$ and ${l}_{o}$ denotes the mean and minimum value of the output image, ${u}_{i}$, ${l}_{i}$, and ${m}_{i}$ denotes the maximum, minimum, and mean value of the input image respectively. The fig shows the histogram patterns of the contrast enhanced image by the ACEM.

Artificial texture suppression: The Adjusted Contrast Enhancement Method (ACEM) balances pixel intensity distribution, mitigating over-enhancement and suppressing artificial patterns introduced during preprocessing.

Region wise reinforced pixel analysis

The pre-processed images obtained are provided to the Multi Agent Reinforcement Learning (MARL) module for performing pixel analysis. Each step of the MARL framework is designed with specific operational goals, including adaptive region splitting based on visual complexity, state-action mappings for effective local decision-making, and reward modeling to enhance detection accuracy.

Computational cost reduction: The Multi-Agent Reinforcement Learning (MARL) divides the image into four regions based on predicted saliency, so only relevant areas undergo intensive processing. This limits redundant computation and accelerates inference. In our research, let the pre-processed image is denoted as ${\text{P}}_{\text{img}}$ in which the ${\text{P}}_{\text{img}}=({\text{P}}_{\text{img}1},\dots ,{\text{P}}_{\text{imgM}})$ is the one arbitrary image from the pre-processed image. From the arbitrary image, we split the whole image in four parts and designated every part as an agent which is denoted as ${\text{ag}}_{\text{i}}$. The policy of the agent is denoted as ${\daleth }_{{\text{i}}} \left( {{\text{ac}}_{{\text{i}}}^{{\left( {{\text{ti}}} \right)}} |{\text{st}}_{{\text{i}}}^{{\left( {{\text{ti}}} \right)}} } \right)$ in which the ${\text{ac}}_{\text{i}}^{\left(\text{ti}\right)}$ denotes the action of the ${\text{ag}}_{\text{i}}$ which can be adjusted based on the past experience, and the ${\text{st}}_{\text{i}}^{(\text{ti})}$ denotes the state of the ${\text{ag}}_{\text{i}}$ at the step $\text{ti}$ form the pixel wise interaction. With exploiting the convolution kernels, every agent can freely interact with the other agents and their region as well. To be more clear, the four agents performs global action ${\text{ac}}^{(\text{ti})}=({\text{ac}}_{1}^{\left(\text{ti}\right)},\dots ,{\text{ac}}_{\text{M}}^{\left(\text{ti}\right)})$, and the ${\text{ag}}_{\text{i}}$ acquires global state ${\text{st}}^{\text{ti}+1}=({\text{st}}_{1}^{\left(\text{ti}+1\right)},\dots ,{\text{st}}_{\text{M}}^{\text{ti}+1})$ for getting the global reward and can be denoted as ${\text{Re}}^{(\text{ti})}=({\text{Re}}_{1}^{\left(\text{ti}\right)},\dots ,{\text{Re}}_{\text{M}}^{\left(\text{ti}\right)})$. The action, state, and reward of an agent can be defined as below,

(a)State: To design the better problem statement for image analysis, the agent ${\text{ag}}_{\text{i}}$ state at the time step $\text{ti}$ for its corresponding pixel region ${\text{pix}}_{\text{i}}^{(\text{ti})}$. The preceded analysis probability can be denoted as ${\text{pro}}_{\text{i}}^{(\text{ti})}$ with region wise feature maps ${\text{fea}}_{\text{i}}^{(\text{ti})}$; ${\text{st}}_{\text{i}}^{(\text{ti})}=[{\text{pix}}_{\text{i}}^{\left(\text{ti}\right)},{\text{pro}}_{\text{i}}^{\left(\text{ti}\right)},{\text{fea}}_{\text{i}}^{(\text{ti})}]$ of ${\text{pro}}_{\text{i}}^{\left(\text{ti}\right)}\in [\text{0,1}]$. For the preliminary state ${\text{st}}_{\text{i}}^{(0)}$ the analysis probability is denoted as ${\text{pro}}_{\text{i}}^{\left(0\right)}$. The region wise feature map is generated based on background and foreground region pixels on the image respectively. Therefore, the feature map can be generated by amalgamating the pixel information and can be defined as ${\text{fea}}_{\text{g}}^{(\text{ti})}=({\text{fea}}_{\text{g},1}^{\left(\text{ti}\right)},\dots ,{\text{fea}}_{\text{g},\text{M}}^{(\text{ti})})$. The ${\text{fea}}_{\text{g},1}^{\left(\text{ti}\right)}$ on the feature map can be computed by distance among the pixels and can be formulated as,

$${\text{fea}}_{\text{g}}^{(\text{ti})}={\text{mini}}_{\propto }\upbeta ({\text{pix}}_{\text{i}},{\text{pix}}_{2})$$

(9)

From the above equation, $\upbeta$ defines the distance among the two pixels in the region. Likewise, we have computed the feature maps for every pixel in the image. Figure 2 represents the pixel analysis and feature maps processing.

(b)Action: For action ${\text{ac}}_{\text{i}}^{\left(\text{ti}\right)}$ by the ${\text{ag}}_{\text{i}}$ with time step $\text{ti}$ for regulating the analysis probability ${\text{pro}}_{\text{i}}^{\left(\text{ti}\right)}$ in a limited manner. Therefore, at the time step ($\text{ti}+1$), the analysis probability for the ${\text{ac}}_{\text{i}}^{\left(\text{ti}\right)}$ can be formulated as,

$${\text{pro}}_{\text{i}}^{\left(\text{ti}+1\right)}={\text{Cli}}_{0}^{1}({\text{pro}}_{\text{i}}^{\left(\text{ti}\right)}+{\text{ac}}_{\text{i}}^{\left(\text{ti}\right)})$$

(10)

The pixel clips for the agent can be denoted as ${\text{Cli}}_{\text{b}}^{\text{a}}\left(\text{pix}\right)=\text{mini}(\text{maxi}\left(\text{pix},\text{a}\right),\text{a})$ from the ${\text{pro}}_{\text{i}}^{\left(\text{ti}+1\right)}$ which is limited to [0,1]. The defined action $\text{Ac}=\left\{{\text{Ac}}_{\text{z}}\right\}$ $\left\{\text{z}=\text{1,2},\dots ,\text{Z}\right\}$ composed of Z actions. The action set scales the agent to be continuously regulated the action at varied situations in the environment.

(c)Reward: The use of reinforcement learning is justified by its capacity to dynamically adapt to diverse scene structures during region splitting. The reward function in MARL is computed based on:

Gradient magnitude: High gradients typically indicate object boundaries.
Local entropy: Higher entropy suggests information-rich areas more likely to contain small objects.
Historical detection confidence: Used to prioritize previously successful regions.

Pixel values are indirectly utilized through these features, which are normalized and used to train the agents for region selection.

Furthermore, the exploration rate of the agent can be enhanced by adopting cross entropy gain reward function. Overall, the reward can be derived as,

$${\text{Re}}_{\text{i}}^{\left(\text{ti}\right)}={\text{CE}}_{\text{i}}^{(\text{ti}-1)}-{\text{CE}}_{\text{i}}^{(\text{ti})}$$

(11)

$${\text{CE}}_{\text{i}}^{(\text{ti})}=-{\text{y}}_{\text{i}}\text{log}\left({\text{pro}}_{\text{i}}^{\left(\text{ti}\right)}\right)-\left(1-{\text{y}}_{\text{i}}\right)\text{log}(1-{\text{pro}}_{\text{i}}^{\left(\text{ti}\right)})$$

(12)

From the above Eq. (11), the agent can get the positive reward when the probability of agent gets closer to the pixel value and gets negative reward for moves farer. Overall, the gathered reward can be formulated as,

$${\text{Re}}_{\text{i}}={\sum }_{\text{ti}=1}^{\text{TI}}{\eth }^{\text{ti}-1}{\text{re}}_{\text{i}}^{\left(\text{ti}\right)}$$

(13)

From the above Eq. (13), the TI denotes the overall time step with the $\eth$ as discount factor with the (0,1].

Feature maps processing, and object detection

The proposed work adopts Enhanced Feature Pyramid Network (EFPN) for processing the extracted feature maps. At the fusion stage, the spatially enriched features can be captured easily by removing the problem of poor representation. The formulation of designed EFPN can be provided as

$${f}_{i}=\left\{\begin{array}{c}{c}_{3\times 3}(IM\left({\text{bc}}_{\text{i}}\right)+UP({\text{fea}}_{\text{i}+1}))\\ {c}_{3\times 3}(IM\left({\text{bc}}_{\text{i}}\right)+SM({\text{fea}}_{\text{i}+1}))\\ IM({\text{bc}}_{\text{i}})\end{array}\right.$$

(14)

From the above equation, the index of feature map can be denoted as ${f}_{i}$, the operations by the convolution can be denoted as ${c}_{3\times 3}$ with kernel size of $3\times 3$. The Fig. 2 represents the IM and SM in the proposed EFPN respectively. Finally, the combined feature map is then provided to the conventional GAN [24] for object detection in reliable and accurate manner. The input feature map is then provided to the generator network $({\text{G}}_{\text{N}})$ which generates the Moderate Resolution Image (MSR) and provided them to edge preservation network (EPN) and discriminator network $({\text{D}}_{\text{N}})$. Upon receiving the combined feature map, the EPN generates the super resolution image and provided to the ${\text{Dis}}_{\text{Net}}$ which detects the smaller object and points them to the bounding boxes. Since the ${\text{G}}_{\text{N}}$ completely based on the ${\text{D}}_{\text{N}}$, the detection probability of the detected and ground truth images can be formulated as,

$${\text{D}}_{\text{N}}\left(d,g\right)=\Upsilon \left(E\left(d\right)\right)-{M}_{g}[E\left(g\right)]\to 1$$

(15)

$${\text{D}}_{\text{N}}\left(g,d\right)=\Upsilon \left(E\left(g\right)\right)-{M}_{d}[E\left(d\right)]\to 0$$

(16)

The above Eqs. (15) and (16) represents the explanations of higher realistic than fake and lesser realistic than real data respectively. The ${M}_{d}, {M}_{g}$ represents the mean operation of the detected (d) and ground (g) truth images, $E(.)$ output of the discriminator, and $\Upsilon$ denotes the function of sigmoid. The two inputs for the discriminator (i.e. MSR and Super Resolution Image (SRI)) detects the small objects from the combine feature maps with two loss function named adversarial generator loss (${l}_{\text{g}}$) and discriminator loss (${l}_{d}$) and cam be formulated as,

$${l}_{\text{g}}=-{M}_{d}\left[\text{log}\left(1-{\text{D}}_{\text{N}}\left(d,g\right)\right)\right]-{M}_{g}[\text{log}\left({\text{D}}_{\text{N}}\left(g,d\right)\right)]$$

(17)

$${l}_{d}=-{M}_{d}\left[\text{log}\left(1-{\text{D}}_{\text{N}}\left(d,g\right)\right)\right]-{M}_{g}[\text{log}({\text{D}}_{\text{N}}\left(g,d\right))]$$

(18)

In addition to that of the above loss functions for generator and discriminator, we have also computed the loss function for combined feature map and layers in the GAN respectively named perceptual (${l}_{p}$) and content loss (${l}_{\text{c}})$ respectively. Both loss function can be formulated as,

$${l}_{p}={M}_{d}{||{f}_{i}({\text{G}}_{\text{N}}({\text{img}}_{\text{MSR}}-{f}_{i}({\text{img}}_{\text{SRI}}))||}_{1},$$

(19)

$${l}_{c}={M}_{d}{\Vert {\text{G}}_{\text{N}}\left({\text{img}}_{\text{MSR}}\right)-{\text{img}}_{\text{SRI}}\Vert }_{1}.$$

(20)

From the above Eq. (19) and (20), ${l}_{p}$ and ${l}_{c}$ denotes the perceptual and content loss respectively, ${f}_{i}$ denotes the feature maps, and the MSR and SRI is utilized for computing the content loss by calculating norm distance.

Modified GAN framework

Our model integrates a modified GAN to refine feature maps produced by the Enhanced Feature Pyramid Network (EFPN), focusing on enhancing tiny object regions through moderate super-resolution. The generator, aided by an Edge Preservation Network (EPN), improves spatial detail and texture fidelity, which is critical for detecting small or distant objects. The discriminator evaluates both the realism and spatial alignment of the refined feature maps, ensuring accurate object boundaries and reducing false positives. This adversarial refinement process helps overcome limitations of traditional detection networks by boosting resolution and spatial consistency, ultimately improving precision and recall for tiny object detection.

Experimental results

The datasets utilized for validating the performance of the proposed model includes VisDrone-2021 VID dataset, and MS-COCO dataset with the existing works. A higher AUC indicates better classifier performance. The Table 1 below shows the performance of the proposed model in terms of conventional FPN and designed EFPN respectively as below.

Table 1 Comparison of Proposed Model

Full size table

Dataset description

In this research, we utilize two benchmark datasets named VisDrone-2019 VID dataset, and MS-COCO dataset for training and testing the proposed MSRP-TODNet model.

The VisDrone-2019 dataset was initially acquired from the university of Tianjin in China for performing data mining, and machine learning operations respectively. The dataset composed of 10,310 images, 282,119 video frames, and 377 video clips. Figure 3 depicts the samples of VisDrone 2019 dataset.

MS-COCO Dataset that includes 400 million photos for segmentation, object detection, key-point detection, and captioning (Fig. 4).

Evaluation on VisDrone-2019 VID dataset

The performance of the proposed model on VisDrone-2019 VID dataset this research performs well with the existing works such as STDNet [11], Ex-FPN [12], QueryDet [13], and YOLO-FIRI [14]. The Table 2 and Fig. 5a and b shows the performance of the proposed model and the existing model from the images using VisDrone-2019 dataset.

Table 2 Performance of the proposed vs existing models on VisDrone 2019 VID dataset

Full size table

From the Fig. 5a, b, the validation of the both the conventional and existing models are examined using performance metrics such as accuracy, precision, recall, F1-score, and AUC-ROC curve respectively. It is clearly shown that, the performance of the proposed model achieves better performance than the conventional models. Finally, the designing of EFPN and GAN for feature extraction, processing, and classification improves the recall, F1-score, and AUC respectively for the proposed model.

Evaluation of MS-COCO dataset

To assess the proposed model performance on the MS-COCO dataset, this study conducts a evaluation, comparing it both qualitatively and quantitatively to established models like YoLov4-5D [15], FADNet [16], OB-SOD [17], and SEPNet [18]. The research presents performance results in Tables 3 and Fig. 6a and b, demonstrating a comparative analysis between the proposed model and these existing models, all using images from the MS-COCO dataset.

Table 3 Performance of the Proposed Vs Existing Models on MS-COCO Dataset

Full size table

The Fig. 6a–d provide a thorough evaluation of both conventional and established models through various performance metrics, including accuracy, precision, recall, F1-score, AUC-ROC curve on MS-COCO Dataset, Training accuracy and validation loss. It's evident that the proposed model outperforms the conventional models. This improved performance is attributed to several enhancements.

This Table 4 highlights the trade-offs between model complexity and efficiency. Despite having more parameters and FLOPs, MSRP-TODNet achieves the lowest inference time, making it optimal for real-time applications.

Table 4 Computational complexity comparison

Full size table

In addition to classification-based metrics such as Accuracy, Precision, Recall, F1-Score, and AUC-ROC, we incorporate Intersection over Union (IoU) and mean Average Precision (mAP) to provide a more spatially grounded and comprehensive evaluation of object detection performance (Table 5).

Table 5 Ablation Study on Key Hyperparameters

Full size table

Ablation study

Here we evaluate the effect of different hyperparameter choices:

Delta (δ) in IWF: Tested in the range of 0.3 to 0.9. Best performance observed around 0.5.
Multi-resolution levels in FPN: Varied from 2 to 5. We found that 4 scales offered a good trade-off between accuracy and computational efficiency.
Region area split settings in MARL: Compared fixed 25%, 33%, 50%, and adaptive splitting strategies. Adaptive splitting yielded higher recall and precision, justifying its choice (Table 6).

Table 6 Ablation study on key hyperparameters

Full size table

Discussion

The experimental results indicate that MSRP-TODNet offers notable improvements in detecting tiny objects, especially in high-resolution surveillance and aerial imagery scenarios. The two-fold pre-processing strategy, involving enhanced noise reduction and contrast adjustment, significantly contributes to clearer object boundaries and better feature extraction. This pre-processing ensures that even low-resolution regions are enhanced for further analysis.

The region-wise analysis approach, which divides images based on visual complexity, allows the model to focus processing power on more information-rich areas. This results in a balance between improved detection accuracy and reduced computational redundancy. The effectiveness of this approach is reflected in the consistent gains in precision, recall, and F1-score across both VisDrone-2019 and MS-COCO datasets.

The Enhanced Feature Pyramid Network (EFPN) improves multi-scale feature fusion, supporting better localization of small objects. Compared to conventional feature extraction approaches, EFPN provides stronger spatial consistency and clarity in feature representation.

The ablation study confirms the importance of key parameters such as the contrast enhancement settings and the optimal number of resolution levels. Adaptive region processing also showed a positive impact on the model's overall performance.

Overall, MSRP-TODNet proves to be a robust and efficient framework for tiny object detection, offering competitive performance and scalability for real-time applications.

Limitations

While MSRP-TODNet demonstrates strong performance in tiny object detection, a few areas offer potential for further enhancement:

Training efficiency: The current model architecture, with its multi-stage processing, requires substantial computational power during training. Future work could explore model compression or lightweight variants for faster training on limited hardware.
Generalization across domains: Although MSRP-TODNet performs effectively on the VisDrone-2019 and MS-COCO datasets, further testing across diverse environments (e.g., maritime, industrial, or medical imagery) can help assess its broader applicability.
Robustness to image quality variations: The model’s performance is influenced by the quality of input images. Developing adaptive pre-processing techniques may enhance robustness against extreme lighting, noise, or motion blur.
Annotation precision dependence: Accurate ground truth annotations are essential for optimal detection performance, particularly for densely packed or overlapping tiny objects. Improved annotation tools or semi-supervised learning could mitigate this.
Edge deployment opportunities: While the model achieves low inference time on conventional systems, optimizing it for real-time performance on embedded or edge devices is a promising direction for future work.

Conclusion

In conclusion, detecting small objects remains challenging due to their limited pixel representation, which affects classifier performance. To address this, we propose Multi-Scale Region-wise Pixel Analysis with GAN for Tiny Object Detection (MSRP-TODNet). The framework includes two-fold pre-processing using Improved Wiener Filter (IWF) for noise removal and Adjusted Contrast Enhancement Method (ACEM) for contrast enhancement. The model is implemented in MATLAB and evaluated using performance metrics such as accuracy, precision, recall, F1-score, and AUC-ROC. Experimental results show that MSRP-TODNet outperforms state-of-the-art methods, making it a valuable solution for real-time surveillance and small object detection in computer vision.

Availability of data and materials

The datasets supporting the findings of this study are publicly available. PASCAL VOC 2012 DATASET can be accessed at https://www.kaggle.com/datasets/gopalbhattrai/pascal-voc-2012-dataset, the VisDrone Dataset is available at https://www.kaggle.com/datasets/kushagrapandya/visdrone-dataset, and Ms-Coco Dataset can be accessed at https://www.kaggle.com/datasets/riffaap/mscoco-dataset.

References

Rocky A, Wu QJ, Zhang W. Review of accident detection methods using dashcam videos for autonomous driving vehicles. IEEE Trans Intell Transport Syst. 2024;25:8356.
Article Google Scholar
Wei W, et al. A review of small object detection based on deep learning. Neural Comput Appl. 2024;36(12):6283–303.
Article Google Scholar
Zhang Y, et al. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans Geosci Remote Sens. 2024;62:1–15.
Article Google Scholar
Iqra, Giri KJ, Javed M. Small object detection in diverse application landscapes: a survey. Multimed Tools Appl. 2024;83:88645–80.
Article Google Scholar
Vitale F, et al. Process mining for digital twin development of industrial cyber-physical systems. IEEE Trans Ind Inf. 2024;21:866.
Article Google Scholar
Liu J, et al. HRD-Net: high resolution segmentation network with adaptive learning ability of retinal vessel features. Comput Biol Med. 2024;173:108295.
Article PubMed Google Scholar
Ouyang, Haodong. "DEYO: DETR with YOLO for End-to-End Object Detection." arXiv preprint arXiv:2402.16370 (2024).
Benjumea A, Teeti I, Cuzzolin F, Bradley A. YOLO-Z: Improving small object detection in YOLOv5 for autonomous vehicles. 2021. arXiv preprint arXiv:2112.11798.
Gong Y, Yu X, Ding Y, Peng X, Zhao J, Han Z. Effective fusion factor in FPN for tiny object detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1160–1168). 2021.
Akyon FC, Altinuc SO, Temizel A. Slicing aided hyper inference and fine-tuning for small object detection. In 2022 IEEE International Conference on Image Processing (ICIP) (pp. 966–970). IEEE. 2022.
Bosquet B, Mucientes M, Brea VM. STDnet-ST: spatio-temporal ConvNet for small object detection. Pattern Recognit. 2021;116: 107929.
Article Google Scholar
Deng C, Wang M, Liu L, Liu Y. Extended feature pyramid network for small object detection. IEEE Trans Multimedia. 2021;24:1968–79.
Article Google Scholar
Yang C, Huang Z, Wang N. QueryDet: cascaded sparse query for accelerating high-resolution small object detection. IEEE/CVF Conf Comput Vision Pattern Recogn (CVPR). 2021;2022:13658–67.
Google Scholar
Li S, Li Y, Li Y, Li M, Xu X. YOLO-FIRI: improved YOLOv5 for infrared image object detection. IEEE Access. 2021. 1–1.
Cai Y, Luan T, Gao H, Wang H, Chen L, Li Y, Sotelo MÁ, Li Z. YOLOv4-5D: an effective and efficient object detector for autonomous driving. IEEE Trans Instrum Meas. 2021;70:1–13.
Google Scholar
Koyun OC, Keser RK, Akkaya IB, Töreyin BU. Focus-and-detect: a small object detection framework for aerial images. Signal Process Image Commun. 2022;104: 116675.
Article Google Scholar
Saeed Z, Yousaf MH, Ahmed R, Velastin SA, Viriri S. On-board small-scale object detection for unmanned aerial vehicles (UAVs). Drones. 2023;7:310. https://doiorg.publicaciones.saludcastillayleon.es/10.3390/drones7050310.
Article Google Scholar
Sun J, Gao H, Wang X, Yu J. Scale enhancement pyramid network for small object detection from UAV images. Entropy. 2022;24:1699. https://doiorg.publicaciones.saludcastillayleon.es/10.3390/e24111699.
Article PubMed PubMed Central Google Scholar
Zeng N, Wu P, Wang Z, Li H, Liu W, Liu X. A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans Instrum Meas. 2022;71:1–1. https://doiorg.publicaciones.saludcastillayleon.es/10.1109/TIM.2022.3153997.
Article Google Scholar
Cheng G, Wang J, Li K, Xie X, Lang C, Yao Y, Han J. Anchor-free oriented proposal generator for object detection. IEEE Trans Geosci Remote Sens. 2021;60:1–11.
Google Scholar
Zhu X, Lyu S, Wang X, Zhao Q. TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. IEEE/CVF Int Conf Comput Vision Workshops (ICCVW). 2021;2021:2778–88.
Google Scholar
Sundaralingam B, Lambert A, Handa A, Boots B, Hermans T, Birchfield S, Ratliff ND, Fox D. Surface normal regularization training phase testing on object lifting and placement tasks 3D force prediction. 2018.
Ma A, Wang J, Zhong Y, Zheng Z. FactSeg: foreground activation-driven small object semantic segmentation in large-scale remote sensing imagery. IEEE Trans Geosci Remote Sens. 2021;60:1–16.
Google Scholar
Demiray BZ, Sit M, Demir I. D-SRGAN: DEM super-resolution with generative adversarial networks. SN Comput Sci. 2021;2:1–11.
Article Google Scholar
Yue M, et al. Lightweight and efficient tiny-object detection based on improved YOLOv8n for UAV aerial images. Drones. 2024;8(7):276.
Article Google Scholar
Ma Y, Chai L, Jin L. Scale decoupled pyramid for object detection in aerial images. IEEE Trans Geosci Remote Sens. 2023;61:1.
CAS Google Scholar
Wang X, et al. Improved TPH for object detection in aerial images. J Spatial Sci. 2024;69(2):493–505.
Article Google Scholar
Hu S, et al. Improving YOLOv7-tiny for infrared and visible light image object detection on drones. Remote Sens. 2023;15(13):3214.
Article Google Scholar
Song Z, Fan Z, Zhu Y. DRDet: light-weight object detector for UAV imagery. Available at SSRN 4542998. 2024.
Zhang W, et al. Sinextnet: a new small object detection model for aerial images based on pp-yoloe. J Artif Intell Soft Comput Res. 2024;14(3):251–65.
Article Google Scholar

Download references

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Amrita School of Computing, Amrita Vishwa Vidyapeetham, Amaravati, Andhra Pradesh, 522503, India
Thulasi Bikku
Department of Computer Science and Engineering, Usha Rama College of Engineering and Technology, Telaprolu, India
K. P. N. V. Satya Sree
Department of Mathematics, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Amaravati, Andhra Pradesh, 522503, India
Srinivasarao Thota
Department of Electrical and Electronics Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India
Malligunta Kiran Kumar
Department of Mathematics, College of Natural & Computational Sciences, MizanTepi University, 21, Tepi Bushira, Ethiopia
P. Shanmugasundaram

Authors

Thulasi Bikku
View author publications
You can also search for this author inPubMed Google Scholar
K. P. N. V. Satya Sree
View author publications
You can also search for this author inPubMed Google Scholar
Srinivasarao Thota
View author publications
You can also search for this author inPubMed Google Scholar
Malligunta Kiran Kumar
View author publications
You can also search for this author inPubMed Google Scholar
P. Shanmugasundaram
View author publications
You can also search for this author inPubMed Google Scholar

Contributions

TB, KPNVSS and ST are involved in the formation of the proposed technique and writing the original manuscript. MKK and PS are involved in suggestions, revisions and verification of the technique. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to P. Shanmugasundaram.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent to publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Bikku, T., Sree, K.P.N.V.S., Thota, S. et al. MSRP-TODNet: a multi-scale reinforced region wise analyser for tiny object detection. BMC Res Notes 18, 200 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13104-025-07263-7

Download citation

Received: 17 February 2025
Accepted: 22 April 2025
Published: 30 April 2025
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13104-025-07263-7

MSRP-TODNet: a multi-scale reinforced region wise analyser for tiny object detection

Abstract

Objective

Results

Introduction

Related works

Research methodology

Two-fold pre-processing

Noise removal, artifact removal and blur correction

Contrast enhancement

Region wise reinforced pixel analysis

Feature maps processing, and object detection

Modified GAN framework

Experimental results

Dataset description

Evaluation on VisDrone-2019 VID dataset

Evaluation of MS-COCO dataset

Ablation study

Discussion

Limitations

Conclusion

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent to publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Research Notes

Contact us