Energy-Efficient Deep Learning Inference on Resource-Constrained Edge Devices: A Hybrid Approach Combining Quantization, Pruning, and Adaptive Offloading

Ravi Patel; Yuki Tanaka; Maria Gonzalez

Journal: International Journal of Embedded Intelligence and Networks (IJEIN), ISSN 3087-4912

Citation: IJEIN 1(1), 2024-01-31.

Type: Original Research

Abstract

The proliferation of edge devices in Internet of Things (IoT) applications demands efficient deployment of deep learning models under strict resource constraints such as limited energy, memory, and computational capacity. This paper presents a hybrid framework for energy-efficient deep learning inference on resource-constrained edge devices, integrating post-training quantization, structured pruning, and adaptive offloading. The framework dynamically adjusts the inference strategy based on device battery level and network conditions. We evaluate the approach using MobileNetV2 on a Raspberry Pi 4 testbed with CIFAR-10 and a 100-class subset of ImageNet. Compared to the baseline, the proposed method reduces energy consumption by up to 3.7× and inference latency by up to 3.2×, with less than 2 percentage points of accuracy degradation. The results indicate that combining compression with adaptive offloading offers a practical solution for real-time edge inference and outperforms quantization or pruning applied alone. Our work provides insights into balancing accuracy, latency, and energy efficiency in edge AI.

Keywords

Energy-efficient inference, edge devices, model compression, quantization, pruning, adaptive offloading, TinyML, resource constraints

Full Text

<article class="scholarly-article"> <h2>Introduction</h2> <p>Deep learning has achieved remarkable success in computer vision, natural language processing, and other domains. However, deploying deep neural networks (DNNs) on resource-constrained edge devices, such as IoT sensors, drones, and mobile phones, remains challenging due to high computational and energy demands (Zhou et al., 2019; Shuvo et al., 2023). Edge devices typically have limited battery life, memory bandwidth, and processing power, which often makes it infeasible to run large models locally. Energy-efficient inference techniques are therefore critical for real-time, privacy-preserving AI applications at the edge.</p><p>Existing approaches to reducing inference cost include model compression methods such as quantization (Shuvo et al., 2023; Alajlan & Ibrahim, 2022), pruning (Malik, 2022; Habib & Qureshi, 2023), and knowledge distillation. Hardware acceleration using FPGAs or custom ASICs has also been explored (Burhanuddin, 2023; Kim et al., 2020; Kim et al., 2023). Another strategy is split inference or edge-cloud collaboration, where part of the model runs on the device and part in the cloud (Lee et al., 2023; Shao & Zhang, 2020; Li et al., 2021). Federated learning additionally enables distributed training without sharing raw data (Lan et al., 2023; Rosemaro & Pandit, 2023). However, most solutions focus on a single technique and do not adapt to dynamic device and network conditions.</p><p>In this paper, we propose a hybrid framework that combines three complementary techniques: (1) post-training quantization to INT8, (2) structured pruning to remove redundant filters, and (3) adaptive offloading that decides whether to run inference locally or offload to an edge server based on current battery level and network bandwidth. The framework is intended for the low-power devices typical of IoT deployments (Alajlan & Ibrahim, 2022; Mohaimenuzzaman et al., 2023).
We implement our approach on a Raspberry Pi 4 and evaluate it using MobileNetV2 on CIFAR-10 and an ImageNet subset. Experimental results demonstrate that the hybrid method achieves superior energy efficiency with minimal accuracy loss compared to using any single technique alone.</p><p>The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed methodology. Section 4 presents experimental results. Section 5 discusses implications and limitations. Section 6 concludes the paper.</p>

<h2>Literature Review</h2> <p>Energy-efficient deep learning inference on edge devices has been extensively studied. We group existing work into model compression, hardware acceleration, and collaborative inference.</p><h4>Model Compression</h4><p>Quantization reduces the bit-width of weights and activations, enabling faster arithmetic and a lower memory footprint. Shuvo et al. (2023) provided a comprehensive review of acceleration techniques, highlighting INT8 quantization as a de facto standard. Alajlan & Ibrahim (2022) demonstrated TinyML deployment with quantized models on microcontrollers. Pruning removes unimportant weights or filters. Malik (2022) and Habib & Qureshi (2023) developed compressed lightweight models for healthcare IoT devices, achieving significant size reduction. Structured pruning, which removes whole filters, is particularly hardware-friendly because the remaining layers stay dense (Kim et al., 2023). Knowledge distillation is another compression method but requires a pretrained teacher.</p><h4>Hardware Acceleration</h4><p>Custom hardware can drastically improve energy efficiency. Burhanuddin (2023) analyzed various hardware accelerators for edge devices, showing that FPGA-based implementations can achieve high throughput per watt. Kim et al. (2020) proposed an energy-efficient accelerator for real-time-constrained embedded systems. Albanese et al. (2022) developed a low-power platform for lightweight UAVs. However, hardware solutions are less flexible and may not be available on all devices.</p><h4>Collaborative and Adaptive Inference</h4><p>Split inference partitions the DNN between device and cloud. Lee et al. (2023) introduced a wireless channel adaptive scheme for DNN split inference, reducing latency under varying channel conditions. Shao & Zhang (2020) explored the communication-computation trade-off in edge inference. Li et al. (2021) proposed AppealNet for edge/cloud collaboration. Zhao et al. (2018) developed DeepThings for distributed inference on IoT clusters.
Adaptive offloading that considers energy and network state is less explored. Sakr et al. (2021) proposed a self-learning pipeline for low-energy devices, but without dynamic offloading. Chen et al. (2020) demonstrated on-device inference for pedestrian navigation, emphasizing energy constraints.</p><p>Despite these advances, few works integrate compression with adaptive offloading in a unified framework. Our work fills this gap by jointly optimizing model size and execution location based on real-time context.</p>

<h2>Methodology</h2> <p>We propose a hybrid framework comprising three stages: offline compression, online adaptation, and inference execution. The framework is designed for a typical IoT edge device (e.g., Raspberry Pi 4) with an ARM CPU and limited RAM, connected to a local edge server via Wi-Fi.</p><h4>Offline Compression</h4><p>Given a pretrained model (MobileNetV2), we apply post-training quantization to convert weights and activations from FP32 to INT8 using symmetric quantization. We also perform structured pruning by removing filters whose L1-norm is below a threshold, then retraining for a few epochs to recover accuracy. The compressed model (quantized + pruned) is stored on the device.</p><h4>Online Adaptation</h4><p>During inference, the device monitors two parameters: battery level (B) and network bandwidth (N). An offloading decision module uses a simple rule-based policy: if B > 30% and N > 1 Mbps, offload inference to the edge server; otherwise, run locally. Both thresholds can be tuned. For offloading, the device sends the input to the edge server via Wi-Fi and receives the output. For local execution, the compressed model is used.</p><h4>Experimental Setup</h4><p>We use a Raspberry Pi 4 Model B (4 GB RAM) as the edge device and a desktop PC (Intel i7, 16 GB RAM, GTX 1060) as the edge server. The model is MobileNetV2 pretrained on ImageNet and fine-tuned on CIFAR-10 and on a 100-class subset of ImageNet. We measure energy consumption using a USB power meter (resolution 1 mJ), latency using Python’s time module, and accuracy as top-1 classification accuracy. We compare four configurations: (1) baseline (FP32, no compression, local), (2) quantization only (INT8, local), (3) pruning only (local), and (4) hybrid (quantized + pruned + adaptive offloading). Each configuration is run for 1000 inferences, and averages are reported.</p>
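<p>The arithmetic behind the offline compression stage can be illustrated with a minimal sketch. It assumes a single per-tensor scale with codes clamped to [-127, 127] and filters given as flattened weight lists; the function names are hypothetical and this is not the authors' implementation. A real deployment would more likely rely on a framework's quantization and pruning tooling.</p>

```python
from typing import List, Sequence, Tuple


def quantize_symmetric_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Quantize FP32 values to INT8 with one symmetric scale (zero-point 0)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Assumption: the largest magnitude maps to 127; an all-zero tensor uses scale 1.0.
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate FP32 values."""
    return [q * scale for q in quantized]


def l1_norm(filter_weights: Sequence[float]) -> float:
    """L1-norm of one flattened convolution filter."""
    return sum(abs(w) for w in filter_weights)


def filters_to_keep(filters: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Indices of filters that survive pruning (L1-norm not below the threshold)."""
    return [i for i, f in enumerate(filters) if l1_norm(f) >= threshold]


if __name__ == "__main__":
    codes, scale = quantize_symmetric_int8([2.54, -1.0, 0.5, 0.0])
    print(codes)  # [127, -50, 25, 0]
    # Hypothetical filters and threshold, chosen only to show the selection rule.
    print(filters_to_keep([[0.5, -0.5], [0.01, 0.02], [1.0, 0.0]], threshold=0.1))  # [0, 2]
```

<p>In this sketch the rounding error of each weight is bounded by half the scale, and pruning only selects which filters to keep; the retraining step that recovers accuracy is outside its scope.</p>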

<h2>Results</h2> <p>We present quantitative results for energy consumption, latency, and accuracy. Table 1 summarizes the performance on CIFAR-10. The hybrid approach achieves the lowest energy (45 mJ per inference) and latency (22 ms), while maintaining 92.3% accuracy (baseline 93.8%). Table 2 shows trade-offs for the ImageNet subset. The hybrid method yields 145 mJ and 65 ms, with 69.2% accuracy (baseline 70.5%).</p><figure class="table-figure"><table><thead><tr><th>Configuration</th><th>Energy (mJ)</th><th>Latency (ms)</th><th>Accuracy (%)</th></tr></thead><tbody><tr><td>Baseline (FP32, local)</td><td>165</td><td>68</td><td>93.8</td></tr><tr><td>Quantization only</td><td>98</td><td>41</td><td>93.2</td></tr><tr><td>Pruning only</td><td>112</td><td>49</td><td>92.8</td></tr><tr><td>Hybrid (proposed)</td><td>45</td><td>22</td><td>92.3</td></tr></tbody></table><figcaption>Table 1. Performance comparison on CIFAR-10.</figcaption></figure><figure class="table-figure"><table><thead><tr><th>Configuration</th><th>Energy (mJ)</th><th>Latency (ms)</th><th>Accuracy (%)</th></tr></thead><tbody><tr><td>Baseline (FP32, local)</td><td>520</td><td>210</td><td>70.5</td></tr><tr><td>Quantization only</td><td>310</td><td>128</td><td>69.8</td></tr><tr><td>Pruning only</td><td>350</td><td>150</td><td>69.4</td></tr><tr><td>Hybrid (proposed)</td><td>145</td><td>65</td><td>69.2</td></tr></tbody></table><figcaption>Table 2. Performance comparison on ImageNet subset.</figcaption></figure><p><figure class="article-figure"><img src="https://smnxsewcdnayrztrrghn.supabase.co/storage/v1/object/public/journal-assets/scholarly/energy-efficient-deep-learning-inference-on-resource-constrained-edge-devices-a-hybrid-approach-comb-39kkp/figure-1-1778980894895.octet-stream" alt="bar chart of energy consumption (mJ) for four configurations on CIFAR-10 and ImageNet" loading="lazy" style="max-width:100%;height:auto;" /><figcaption>Figure 1. 
Energy consumption (mJ) for the four configurations on CIFAR-10 and the ImageNet subset.</figcaption></figure></p><p>Figure 1 visualizes the energy savings. On CIFAR-10, the hybrid method reduces energy by 73% compared to the baseline; on the ImageNet subset, the reduction is 72%. Latency improves by a similar margin, falling by 68% and 69%, respectively. The accuracy loss is 1.5 and 1.3 percentage points, respectively, which is acceptable for many edge applications.</p>
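<p>The relative savings can be recomputed directly from the baseline and hybrid rows of Tables 1 and 2. The short sketch below does only that arithmetic; the variable and function names are illustrative.</p>

```python
# (energy in mJ, latency in ms) per inference, copied from Tables 1 and 2.
MEASUREMENTS = {
    "CIFAR-10": {"baseline": (165.0, 68.0), "hybrid": (45.0, 22.0)},
    "ImageNet subset": {"baseline": (520.0, 210.0), "hybrid": (145.0, 65.0)},
}


def reduction_percent(baseline: float, value: float) -> float:
    """Relative reduction with respect to the baseline, in percent."""
    return 100.0 * (1.0 - value / baseline)


def improvement_factor(baseline: float, value: float) -> float:
    """How many times smaller the value is than the baseline."""
    return baseline / value


def summarize(measurements):
    """Compute energy and latency savings of the hybrid configuration per dataset."""
    summary = {}
    for dataset, configs in measurements.items():
        e_base, t_base = configs["baseline"]
        e_hyb, t_hyb = configs["hybrid"]
        summary[dataset] = {
            "energy_reduction_pct": round(reduction_percent(e_base, e_hyb), 1),
            "latency_reduction_pct": round(reduction_percent(t_base, t_hyb), 1),
            "energy_factor": round(improvement_factor(e_base, e_hyb), 2),
            "latency_factor": round(improvement_factor(t_base, t_hyb), 2),
        }
    return summary


if __name__ == "__main__":
    for dataset, row in summarize(MEASUREMENTS).items():
        print(dataset, row)
```

<p>For CIFAR-10 this gives a 72.7% energy reduction (3.67×) and a 67.6% latency reduction (3.09×); for the ImageNet subset it gives 72.1% (3.59×) and 69.0% (3.23×).</p>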

<h2>Discussion</h2> <p>Our results demonstrate that combining quantization, pruning, and adaptive offloading yields substantial energy and latency savings at a small accuracy cost (at most 1.5 percentage points). The hybrid approach outperforms each individual technique. For example, on CIFAR-10, quantization alone reduces energy by about 40% and pruning alone by about 32%, whereas the combination with offloading reaches 73%. This synergy arises because compression reduces the computational load when local execution is necessary, while offloading exploits the server’s greater compute capacity when network conditions allow.</p><p>The adaptive offloading decision is critical. Our rule-based policy treats offloading as beneficial only when the device is not energy-starved and network bandwidth is sufficient. This aligns with findings by Shao & Zhang (2020) on communication-computation trade-offs. Our framework could be extended with a learning-based policy, in the spirit of Sakr et al. (2021), to further optimize decisions.</p><p>Limitations of this study include the use of a single model (MobileNetV2) and a single hardware platform (Raspberry Pi 4). Future work should evaluate other architectures (e.g., ResNet, EfficientNet) and devices (e.g., microcontrollers). Additionally, we assumed a fixed Wi-Fi network; real-world variability may affect offloading performance. The integration of federated learning (Lan et al., 2023; Rosemaro & Pandit, 2023) could enable collaborative model updates across devices without compromising privacy.</p><p>Our work also contributes to the growing field of TinyML (Alajlan & Ibrahim, 2022). At 45 mJ per inference on CIFAR-10, the inference energy itself is low enough for battery-powered sensors to operate for months at modest inference rates, provided idle power is kept comparably low.</p>
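<p>The rule-based offloading policy discussed above can be sketched in a few lines. The thresholds (30% battery, 1 Mbps) are those stated in the Methodology; the class and method names are hypothetical, and how the battery level and bandwidth are actually measured on the device is left out.</p>

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    LOCAL = "local"
    EDGE_SERVER = "edge_server"


@dataclass(frozen=True)
class OffloadPolicy:
    """Rule-based policy: offload only if battery and bandwidth both exceed their thresholds."""

    battery_threshold_pct: float = 30.0    # B threshold from the Methodology
    bandwidth_threshold_mbps: float = 1.0  # N threshold from the Methodology

    def decide(self, battery_pct: float, bandwidth_mbps: float) -> Target:
        # Strict comparisons mirror the rule "B > 30% and N > 1 Mbps".
        if (battery_pct > self.battery_threshold_pct
                and bandwidth_mbps > self.bandwidth_threshold_mbps):
            return Target.EDGE_SERVER
        # Otherwise run the compressed (quantized + pruned) model on the device.
        return Target.LOCAL


if __name__ == "__main__":
    policy = OffloadPolicy()
    print(policy.decide(battery_pct=80.0, bandwidth_mbps=5.0))  # Target.EDGE_SERVER
    print(policy.decide(battery_pct=20.0, bandwidth_mbps=5.0))  # Target.LOCAL
    print(policy.decide(battery_pct=80.0, bandwidth_mbps=0.5))  # Target.LOCAL
```

<p>Keeping the thresholds as fields makes them easy to tune per deployment, and a learning-based policy could later replace the <code>decide</code> method without changing its callers.</p>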

<h2>Conclusion</h2> <p>This paper presented a hybrid framework for energy-efficient deep learning inference on resource-constrained edge devices, combining quantization, pruning, and adaptive offloading. Experimental results on a Raspberry Pi 4 show that the proposed approach reduces energy consumption by up to 73% and latency by up to 69%, with less than 2 percentage points of accuracy loss compared to the baseline. The framework is simple yet effective, making it suitable for practical IoT deployments. Future research directions include dynamic model selection (Lu et al., 2019), integration with event-based vision (Gallego et al., 2020), and reinforcement learning-based resource allocation (Unknown, 2024). As edge intelligence continues to evolve, holistic energy-efficient solutions will be essential for sustainable AI at the edge.</p>

<h2>References</h2> <ol class="references"> <li>Shuvo, M. M. H., Islam, S. K., Cheng, J., Morshed, B. I. (2023). Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review. <em>Proceedings of the IEEE</em>, <em>111</em>(1), 42-91. https://doi.org/10.1109/jproc.2022.3226481</li> <li>Lan, G., Liu, X., Zhang, Y., Wang, X. (2023). Communication-Efficient Federated Learning for Resource-Constrained Edge Devices. <em>IEEE Transactions on Machine Learning in Communications and Networking</em>, <em>1</em>, 210-224. https://doi.org/10.1109/tmlcn.2023.3309773</li> <li>Malik, I. (2022). Compressed Lightweight Deep Learning Models for Resource-Constrained IoT Devices in Healthcare Sector. <em>SSRN Electronic Journal</em>. https://doi.org/10.2139/ssrn.4185661</li> <li>Burhanuddin, M. (2023). Efficient Hardware Acceleration Techniques for Deep Learning on Edge Devices: A Comprehensive Performance Analysis. <em>KHWARIZMIA</em>, <em>2023</em>, 103-112. https://doi.org/10.70470/khwarizmia/2023/010</li> <li>Mohaimenuzzaman, M., Bergmeir, C., West, I., Meyer, B. (2023). Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices. <em>Pattern Recognition</em>, <em>133</em>, 109025. https://doi.org/10.1016/j.patcog.2022.109025</li> <li>Zhao, Z., Barijough, K. M., Gerstlauer, A. (2018). DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters. <em>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems</em>, <em>37</em>(11), 2348-2359. https://doi.org/10.1109/tcad.2018.2858384</li> <li>Gallardo García, R., Jarquín Rodríguez, S., Beltrán Martínez, B., Hernández Gracidas, C., Martínez Torres, R. (2022). Efficient deep learning architectures for fast identification of bacterial strains in resource-constrained devices. <em>Multimedia Tools and Applications</em>, <em>81</em>(28), 39915-39944.
https://doi.org/10.1007/s11042-022-13022-8</li> <li>Alajlan, N. N., Ibrahim, D. M. (2022). TinyML: Enabling of Inference Deep Learning Models on Ultra-Low-Power IoT Edge Devices for AI Applications. <em>Micromachines</em>, <em>13</em>(6), 851. https://doi.org/10.3390/mi13060851</li> <li>Shao, J., Zhang, J. (2020). Communication-Computation Trade-off in Resource-Constrained Edge Inference. <em>IEEE Communications Magazine</em>, <em>58</em>(12), 20-26. https://doi.org/10.1109/mcom.001.2000373</li> <li>Sakr, F., Berta, R., Doyle, J., De Gloria, A., Bellotti, F. (2021). Self-Learning Pipeline for Low-Energy Resource-Constrained Devices. <em>Energies</em>, <em>14</em>(20), 6636. https://doi.org/10.3390/en14206636</li> <li>Rizk, M., Chehade, A. (2024). Efficient Oil Tank Detection Using Deep Learning: A Novel Dataset and Deployment on Edge Devices. <em>IEEE Access</em>, <em>12</em>, 170346-170378. https://doi.org/10.1109/access.2024.3495523</li> <li>Albanese, A., Nardello, M., Brunelli, D. (2022). Low-power deep learning edge computing platform for resource constrained lightweight compact UAVs. <em>Sustainable Computing: Informatics and Systems</em>, <em>34</em>, 100725. https://doi.org/10.1016/j.suscom.2022.100725</li> <li>Kim, K., Jang, S., Park, J., Lee, E., Lee, S. (2023). Lightweight and Energy-Efficient Deep Learning Accelerator for Real-Time Object Detection on Edge Devices. <em>Sensors</em>, <em>23</em>(3), 1185. https://doi.org/10.3390/s23031185</li> <li>Unknown (2024). Efficient Hierarchical Federated Learning for Unlabeled Edge Devices. <em>Automation and Machine Learning</em>, <em>5</em>(1). https://doi.org/10.23977/autml.2024.050103</li> <li>Kamath, V., Renuka, A. (2023). Deep learning based object detection for resource constrained devices: Systematic review, future trends and challenges ahead. <em>Neurocomputing</em>, <em>531</em>, 34-60. https://doi.org/10.1016/j.neucom.2023.02.006</li> <li>Habib, G., Qureshi, S. (2023). 
Compressed lightweight deep learning models for resource-constrained Internet of things devices in the healthcare sector. <em>Expert Systems</em>, <em>42</em>(1). https://doi.org/10.1111/exsy.13269</li> <li>Unknown (2024). Deep Reinforcement Learning Based Resource Allocation for Fault Detection with Cloud Edge Collaboration in Smart Grid. <em>CSEE Journal of Power and Energy Systems</em>. https://doi.org/10.17775/cseejpes.2021.02390</li> <li>Rosemaro, E., Pandit, P. V. (2023). Energy-Efficient Machine Learning for IoT Edge Devices: A Federated Learning Approach. <em>Research Journal of Computer Systems and Engineering</em>, <em>4</em>(1), 8-14. https://doi.org/10.52710/rjcse.57</li> <li>Lee, J., Lee, H., Choi, W. (2023). Wireless Channel Adaptive DNN Split Inference for Resource-Constrained Edge Devices. <em>IEEE Communications Letters</em>, <em>27</em>(6), 1520-1524. https://doi.org/10.1109/lcomm.2023.3269769</li> <li>Kim, B., Lee, S., Trivedi, A. R., Song, W. J. (2020). Energy-Efficient Acceleration of Deep Neural Networks on Realtime-Constrained Embedded Edge Devices. <em>IEEE Access</em>, <em>8</em>, 216259-216270. https://doi.org/10.1109/access.2020.3038908</li> <li>Rana, O., Savitz, S., Perera, C. (2023). Edge analytics on resource constrained devices. <em>International Journal of Computational Science and Engineering</em>, <em>26</em>(5), 513-527. https://doi.org/10.1504/ijcse.2023.10059382</li> <li>Chen, C., Zhao, P., Lu, C. X., Wang, W., Markham, A., Trigoni, N. (2020). Deep-Learning-Based Pedestrian Inertial Navigation: Methods, Data Set, and On-Device Inference. <em>IEEE Internet of Things Journal</em>, <em>7</em>(5), 4431-4441. https://doi.org/10.1109/jiot.2020.2966773</li> <li>Zhou, Z., Chen, X., Li, E., Zeng, L., Luo, K., Zhang, J. (2019). Edge Intelligence: Paving the Last Mile of Artificial Intelligence With Edge Computing. <em>Proceedings of the IEEE</em>, <em>107</em>(8), 1738-1762.
https://doi.org/10.1109/jproc.2019.2918951</li> <li>Kairouz, P., McMahan, H. B. (2020). Advances and Open Problems in Federated Learning. <em>Foundations and Trends® in Machine Learning</em>, <em>14</em>(1-2), 1-210. https://doi.org/10.1561/2200000083</li> <li>Li, M., Li, Y., Tian, Y., Jiang, L., Xu, Q. (2021). AppealNet: An Efficient and Highly-Accurate Edge/Cloud Collaborative Architecture for DNN Inference. <em>2021 58th ACM/IEEE Design Automation Conference (DAC)</em>, 409-414. https://doi.org/10.1109/dac18074.2021.9586176</li> <li>Reggiani, E., Pappalardo, A., Doblas, M., Moretó, M., Olivieri, M., Ünsal, O. (2023). Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices. <em>2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)</em>, 1085-1098. https://doi.org/10.1109/hpca56546.2023.10071076</li> <li>Wu, Q., Zhang, S., Zheng, B., You, C., Zhang, R. (2021). Intelligent Reflecting Surface-Aided Wireless Communications: A Tutorial. <em>IEEE Transactions on Communications</em>, <em>69</em>(5), 3313-3351. https://doi.org/10.1109/tcomm.2021.3051897</li> <li>Dwivedi, Y. K., Hughes, L., Ismagilova, E., Aarts, G., Coombs, C., Crick, T. (2019). Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy. <em>International Journal of Information Management</em>, <em>57</em>, 101994. https://doi.org/10.1016/j.ijinfomgt.2019.08.002</li> <li>Lu, B., Yang, J., Chen, L. Y., Ren, S. (2019). Automating Deep Neural Network Model Selection for Edge Inference. <em>2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI)</em>, 184-193. https://doi.org/10.1109/cogmi48466.2019.00035</li> <li>Gallego, G., Delbruck, T., Orchard, G., Bartolozzi, C., Taba, B., Censi, A. (2020). Event-Based Vision: A Survey. <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, <em>44</em>(1), 154-180. https://doi.org/10.1109/tpami.2020.3008413</li> </ol> </article>

Published by Academic Ink Review Journal. Open Access under CC BY 4.0.