Abstract
The proliferation of edge devices in Internet of Things (IoT) applications demands deep learning models that can be deployed under strict constraints on energy, memory, and computational capacity. This paper presents a hybrid framework for energy-efficient deep learning inference on resource-constrained edge devices that integrates post-training quantization, structured pruning, and adaptive offloading. The framework dynamically adjusts the inference strategy based on device battery level and network conditions. We evaluate the approach using MobileNetV2 on a Raspberry Pi 4 testbed with CIFAR-10 and a subset of ImageNet. Compared to the baseline, the proposed method reduces energy consumption by a factor of up to 3.2 and inference latency by a factor of 2.8, with less than 2% accuracy degradation. The results demonstrate that combining compression with adaptive offloading outperforms each technique applied individually and offers a practical solution for real-time edge inference. Our work provides insights into balancing accuracy, latency, and energy efficiency in edge AI.
Keywords
Energy-efficient inference, edge devices, model compression, quantization, pruning, adaptive offloading, TinyML, resource constraints
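As an illustration of the adaptive behavior summarized in the abstract, the following minimal sketch shows one way a policy could select an inference strategy from battery level and network conditions. It is not the paper's actual decision logic: the names `DeviceState`, `Strategy`, and `choose_strategy`, the three strategies, and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    # Hypothetical strategy set; the paper's actual options may differ.
    LOCAL_QUANTIZED = "local_quantized"                # quantized model on device
    LOCAL_QUANTIZED_PRUNED = "local_quantized_pruned"  # quantized and pruned model on device
    OFFLOAD = "offload"                                # send the input to a remote server


@dataclass(frozen=True)
class DeviceState:
    battery_fraction: float  # remaining battery, 0.0 to 1.0
    bandwidth_mbps: float    # measured uplink bandwidth
    rtt_ms: float            # measured round-trip time to the server


def choose_strategy(
    state: DeviceState,
    low_battery: float = 0.3,         # assumed threshold, not from the paper
    min_bandwidth_mbps: float = 5.0,  # assumed threshold, not from the paper
    max_rtt_ms: float = 50.0,         # assumed threshold, not from the paper
) -> Strategy:
    """Pick an inference strategy from battery level and network conditions."""
    network_ok = (
        state.bandwidth_mbps >= min_bandwidth_mbps
        and state.rtt_ms <= max_rtt_ms
    )
    if state.battery_fraction >= low_battery:
        # Enough energy: run the quantized model locally.
        return Strategy.LOCAL_QUANTIZED
    if network_ok:
        # Low battery and a usable link: offload the inference.
        return Strategy.OFFLOAD
    # Low battery and a poor link: fall back to the most compressed local model.
    return Strategy.LOCAL_QUANTIZED_PRUNED


if __name__ == "__main__":
    state = DeviceState(battery_fraction=0.2, bandwidth_mbps=20.0, rtt_ms=10.0)
    print(choose_strategy(state).value)
```

In this sketch the battery check comes first, so the network is consulted only when energy is scarce. A deployed policy would more likely compare estimated energy and latency costs for each option than rely on fixed thresholds.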