Abstract
The proliferation of artificial intelligence at the network edge has driven a shift from purely performance-driven models toward models that prioritize interpretability and transparency. As Explainable AI (XAI) becomes a regulatory and ethical requirement in critical sectors such as healthcare, legal services, and industrial automation, the computational overhead of generating post-hoc explanations poses a significant challenge for resource-constrained edge devices. This paper proposes a hardware-software co-design framework for accelerating XAI algorithms, such as SHAP and LIME, on embedded platforms. Using a heterogeneous architecture that combines an ARM-based System-on-Chip (SoC) with Field-Programmable Gate Array (FPGA) fabric, we offload the intensive kernel-based computations and matrix operations to dedicated hardware accelerators while keeping flexible control logic in software. The framework consists of a custom Processing Element (PE) array that parallelizes local interpretability tasks and a streamlined software stack for efficient data orchestration. Experiments on a Zynq-7000 series platform show that our co-design approach generates explanations 24.5x faster than optimized software-only implementations on mobile CPUs, while drawing less than 3.5 W. These findings suggest that dedicated hardware acceleration is essential for the real-time deployment of trustworthy AI in latency-sensitive edge environments, and that hardware-software co-design is the primary path toward reconciling the transparency requirements of modern AI with the physical limitations of embedded intelligence.