Hardware-Software Co-design for Accelerating Explainable AI (XAI) on Edge Devices

Elara Vane-Kovács; Thalric Munslow; Suriya Phetchabun

Journal: International Journal of Embedded Intelligence and Networks (IJEIN), ISSN 3087-4912

Citation: IJEIN 1(1), 2024-01-31.

Type: Original Research

Abstract

The proliferation of artificial intelligence at the network edge has necessitated a shift from purely performance-driven models to those that prioritize interpretability and transparency. As Explainable AI (XAI) becomes a regulatory and ethical requirement in critical sectors such as healthcare, legal services, and industrial automation, the computational overhead of generating post-hoc explanations poses a significant challenge for resource-constrained edge devices. This paper proposes a hardware-software co-design framework architected to accelerate XAI algorithms, such as SHAP and LIME, on embedded platforms. Using a heterogeneous architecture that combines an ARM-based System-on-Chip (SoC) with Field-Programmable Gate Array (FPGA) fabric, we offload the intensive kernel-based computations and matrix operations to dedicated hardware accelerators while keeping flexible control logic in software. Our methodology comprises a custom Processing Element (PE) array that parallelizes local interpretability tasks and a streamlined software stack for data orchestration. Experimental results on a Zynq-7000 series platform show that the co-design approach reduces explanation generation latency by 21.8x to 25.0x relative to optimized software-only implementations on a mobile CPU, while keeping total power below 3.5 W. These findings suggest that dedicated hardware acceleration is essential for the real-time deployment of trustworthy AI in latency-sensitive edge environments. We conclude that hardware-software co-design is a primary path for reconciling the transparency requirements of modern AI with the physical limitations of embedded intelligence.

Keywords

Explainable AI (XAI), Edge Computing, Hardware-Software Co-design, FPGA Acceleration, Embedded Systems, Deep Learning Transparency

Full Text

<article class="scholarly-article"> <h2>Introduction</h2> <p>The rapid integration of artificial intelligence (AI) into daily life has transitioned from centralized cloud-based systems to decentralized edge devices. This shift is driven by the need for reduced latency, enhanced privacy, and lower bandwidth consumption in applications ranging from smart grids (Omitaomu & Niu, 2021) to wearable healthcare monitors (Mankodiya et al., 2022). However, as AI models—particularly deep neural networks—grow in complexity, they increasingly function as "black boxes," where the reasoning behind a specific decision or classification remains opaque to the end-user (Açar, 2022). This lack of transparency is particularly problematic in high-stakes domains such as legal adjudication (Eliot, 2021) and medical diagnostics (Hulsen, 2023), where the ability to audit and trust an AI's output is as critical as its accuracy.</p><p>Explainable AI (XAI) has emerged as a vital field of research to address these concerns, aiming to provide human-understandable justifications for AI decisions (Gunning et al., 2021). While XAI techniques like SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) have gained popularity, they are computationally expensive, often requiring thousands of model inferences to generate a single explanation. On edge devices, which are constrained by limited memory, processing power, and energy reserves, the overhead of XAI can render real-time interpretability impossible (Li et al., 2023). Consequently, there is an urgent need for architectural innovations that can bridge the gap between the computational demands of XAI and the physical constraints of edge hardware.</p><p>Hardware-software co-design offers a promising solution to this bottleneck. 
By concurrently optimizing the algorithmic structure and the underlying hardware architecture, researchers can achieve significant performance gains that are unattainable through software optimization alone (Unknown, 2021). Previous efforts in co-design have focused on accelerating standard neural network inference (Ran et al., 2023; Wang et al., 2024), but the specific requirements of XAI—such as the need for repeated perturbations and feature-wise importance calculations—remain under-explored in the context of embedded acceleration (Chennamsetty, 2023). This paper presents a specialized co-design framework for XAI on edge devices, aiming to provide a scalable and efficient platform for transparent AI at the network edge.</p>

<h2>Literature Review</h2> <h4>The Evolution of XAI and Edge Constraints</h4><p>The conceptual foundations of XAI were significantly advanced by programs like DARPA’s XAI initiative, which sought to create a suite of machine learning techniques that produce more explainable models (Gunning et al., 2021). Early research focused on intrinsic interpretability, such as decision trees, but the modern era is dominated by post-hoc explanations for complex models (Sharma et al., 2020). In the context of computer vision, transparency is essential for detecting deepfakes and ensuring model robustness (Paula, 2023). However, the mathematical complexity of these explanations often leads to a "transparency-efficiency" trade-off, where more detailed explanations require prohibitive amounts of compute (Pillai, 2024).</p><h4>Hardware-Software Co-Design Paradigms</h4><p>Traditional computing architectures often struggle with the irregular data access patterns and high throughput requirements of modern AI. Co-design strategies, particularly those utilizing Field-Programmable Gate Arrays (FPGAs), have shown success in accelerating specific model types. For instance, Ran et al. (2023) demonstrated a co-design approach for Graph Convolutional Networks (GCNs) that significantly reduced inference latency. Similarly, the ANNA framework proposed by Li et al. (2023) utilized software-hardware synergy to optimize vertical applications in edge systems, highlighting the importance of tailoring hardware to specific software kernels. Such approaches are now being taught even at the undergraduate level to prepare future engineers for the complexities of edge intelligence (Farcas & Marculescu, 2023).</p><h4>XAI in Specialized Domains</h4><p>The demand for XAI is not uniform across all sectors. In healthcare, XAI is used for fall detection on wearable devices (Mankodiya et al., 2022) and predicting acute critical illness (Unknown, 2020). 
In industrial settings, it assists in modeling activated sludge processes in wastewater treatment (Nahm, 2023). Each of these domains presents unique challenges: healthcare requires low-power, always-on interpretability, while industrial systems may require high-throughput explanations for complex physical processes. Recent surveys also indicate that the Metaverse and 6G networks will rely heavily on explainable, automated management systems to maintain user trust and network stability (Chengoden et al., 2023; Coronado et al., 2022). These diverse needs underscore the importance of a generalized acceleration framework that can adapt to various XAI algorithms and edge constraints (Vasa, 2021).</p>

<h2>Methodology</h2> <h4>System Architecture</h4><p>Our proposed framework utilizes a heterogeneous hardware-software architecture implemented on a Xilinx Zynq-7000 SoC. The system is partitioned into two main components: the Processing System (PS), which houses a dual-core ARM Cortex-A9 processor, and the Programmable Logic (PL), which contains the FPGA fabric. This partitioning allows us to execute the high-level XAI orchestration and data preprocessing in software while offloading the computationally intensive kernels to the hardware fabric (Chen et al., 2018). The interface between the PS and PL is managed through high-performance AXI4 interconnects, ensuring low-latency data transfer (Unknown, 2021).</p><h4>Hardware Accelerator Design</h4><p>The core of our hardware contribution is the XAI-Acceleration Engine (XAE), a custom RTL module designed to parallelize the feature perturbation and kernel weight calculation steps common in LIME and SHAP. The XAE consists of a systolic array of Processing Elements (PEs) that perform simultaneous multiply-accumulate (MAC) operations. To address the sparsity often found in XAI perturbations, we implemented a sparse-aware data controller that skips redundant zero-value operations, following strategies suggested by Chennamsetty (2023). This design significantly reduces the number of clock cycles required for the explanation's linear regression phase.</p><h4>Software Optimization and Orchestration</h4><p>On the software side, we developed a lightweight driver and an API that allows data engineers to easily integrate XAI into their existing workflows (Vasa, 2021). The software stack is responsible for managing the memory buffers, scheduling tasks between the CPU and FPGA, and performing final normalization of the importance scores. 
We followed a hardware-software co-design approach similar to that described by Wali (2022), focusing on power efficiency by dynamically scaling the frequency of the FPGA fabric according to workload intensity. This coordination helps keep the system within the tight thermal envelopes of edge devices.</p><h4>Evaluation Metrics</h4><p>To evaluate the effectiveness of our co-design, we measured three primary metrics: 1) Latency, defined as the time taken to generate an explanation for a single input; 2) Power Consumption, measured in milliwatts (mW) using an external power monitor; and 3) Explanation Fidelity, which checks that the accelerated version preserves the interpretability accuracy of a standard software implementation. We compared our results against an optimized software baseline running on a mobile-grade CPU (ARM Cortex-A72).</p>
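<p>To make the offloaded computation concrete, the following Python sketch mimics in software the three steps the XAE is described as parallelizing: feature perturbation, kernel weight calculation, and the multiply-accumulate work of the regression phase with zero-valued operands skipped. It is an illustrative model only, not the RTL or the driver code of this framework; the function names, the exponential kernel, the kernel width of 0.75, and the random binary masks are all assumptions chosen for the example.</p>

```python
import math
import random


def perturbation_masks(num_features, num_samples, seed=0):
    """Random binary masks: 1 keeps a feature, 0 switches it off (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(num_features)]
            for _ in range(num_samples)]


def kernel_weight(mask, kernel_width=0.75):
    """Exponential kernel on the fraction of features switched off (assumed kernel)."""
    distance = mask.count(0) / len(mask)
    return math.exp(-(distance ** 2) / (kernel_width ** 2))


def weighted_gram(masks, weights):
    """Accumulate X^T W X for the weighted regression, skipping zero operands.

    Returns the Gram matrix and the number of multiply-accumulates issued,
    so the saving from zero-skipping can be compared with a dense pass.
    """
    n = len(masks[0])
    gram = [[0.0] * n for _ in range(n)]
    macs = 0
    for row, w in zip(masks, weights):
        # Only non-zero entries can contribute to the outer product.
        active = [i for i, x in enumerate(row) if x != 0]
        for i in active:
            for j in active:
                gram[i][j] += w * row[i] * row[j]
                macs += 1
    return gram, macs


if __name__ == "__main__":
    features = 16
    masks = perturbation_masks(num_features=features, num_samples=1000)
    weights = [kernel_weight(m) for m in masks]
    _, issued = weighted_gram(masks, weights)
    dense = len(masks) * features * features
    print(f"MACs issued: {issued} of {dense} dense ({100 * issued / dense:.1f}%)")
```

<p>In this toy model the skipped operations are simply counted in software; in hardware the corresponding saving would depend on how the data controller schedules the PE array, which the sketch does not attempt to capture.</p>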

<h2>Results</h2> <h4>Performance and Latency Analysis</h4><p>The implementation of the XAI-Acceleration Engine (XAE) resulted in a dramatic reduction in processing time across various model architectures. As shown in Table 1, the co-design approach consistently outperformed the software-only baseline. For a standard ResNet-18 model, the time required to generate a LIME-based explanation dropped from 1240 ms to just 51 ms, representing a 24.3x speedup. This improvement is critical for real-time applications where a delay of over a second would be unacceptable.</p><figure class="table-figure"><table><thead><tr><th>Model Architecture</th><th>Software Baseline (ms)</th><th>Co-Design Framework (ms)</th><th>Speedup Factor</th></tr></thead><tbody><tr><td>MobileNetV2</td><td>850</td><td>34</td><td>25.0x</td></tr><tr><td>ResNet-18</td><td>1240</td><td>51</td><td>24.3x</td></tr><tr><td>DenseNet-121</td><td>3100</td><td>142</td><td>21.8x</td></tr><tr><td>Custom GCN</td><td>450</td><td>18</td><td>25.0x</td></tr></tbody></table><figcaption>Table 1. Latency comparison for XAI explanation generation (LIME, 1000 samples).</figcaption></figure><p>The efficiency of the hardware accelerator is further illustrated in the throughput analysis. By offloading the perturbation kernel, we freed the CPU to handle concurrent tasks, such as network communication or user interface updates. Figure 1 illustrates the comparative speedup across different sample sizes, showing that the hardware advantage increases as the complexity of the explanation grows.</p><figure class="article-figure"><figcaption>Figure 1. Speedup factor of the co-design framework over the software baseline for XAI sample counts from 100 to 5000 (bar chart).</figcaption></figure><h4>Energy Efficiency and Resource Utilization</h4><p>Power consumption is a primary constraint for edge intelligence. Our co-design strategy focused on maximizing the performance-per-watt ratio. 
Table 2 details the power profiles of the system under different operating modes. The total power consumption of the Zynq SoC remained below 3.5 W even during peak XAI acceleration, which is well within the limits for battery-powered edge devices (Wali, 2022).</p><figure class="table-figure"><table><thead><tr><th>Component</th><th>Idle Power (mW)</th><th>Software Peak (mW)</th><th>Co-Design Peak (mW)</th></tr></thead><tbody><tr><td>Processing System (ARM)</td><td>450</td><td>1850</td><td>950</td></tr><tr><td>Programmable Logic (FPGA)</td><td>120</td><td>N/A</td><td>2100</td></tr><tr><td>Memory/IO</td><td>200</td><td>450</td><td>380</td></tr><tr><td><strong>Total</strong></td><td><strong>770</strong></td><td><strong>2300</strong></td><td><strong>3430</strong></td></tr></tbody></table><figcaption>Table 2. Power consumption profiles across different system states.</figcaption></figure><p>Despite the higher peak power compared to the software-only mode, the total energy consumed per explanation is significantly lower, because energy is the product of power and execution time and the execution time falls far more than the power rises. Specifically, the energy per explanation for ResNet-18 was reduced by approximately 85%. This efficiency is further supported by the optimized resource utilization on the FPGA, as shown in Table 3.</p><figure class="table-figure"><table><thead><tr><th>Resource Type</th><th>Available</th><th>Used</th><th>Utilization (%)</th></tr></thead><tbody><tr><td>Lookup Tables (LUT)</td><td>53,200</td><td>34,580</td><td>65.0%</td></tr><tr><td>Flip-Flops (FF)</td><td>106,400</td><td>41,200</td><td>38.7%</td></tr><tr><td>DSP Slices</td><td>220</td><td>154</td><td>70.0%</td></tr><tr><td>Block RAM (BRAM)</td><td>140</td><td>88</td><td>62.9%</td></tr></tbody></table><figcaption>Table 3. FPGA resource utilization on the Zynq-7020 platform.</figcaption></figure><figure class="article-figure"><figcaption>Figure 2. Energy-delay product (EDP) of the software-only implementation compared with the hardware-software co-design (line graph).</figcaption></figure>
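<p>The arithmetic behind Tables 1 and 2 can be checked with a few lines of Python. The sketch below recomputes the speedup factors and latency reductions from Table 1, then forms a rough energy-per-explanation estimate as power multiplied by latency. The function names are illustrative, and the energy estimate rests on a strong assumption: it pairs the peak (not average) power totals of Table 2 with the Table 1 latencies, so it indicates the direction of the saving and should not be expected to reproduce the reported reduction of approximately 85%.</p>

```python
# Latencies in ms from Table 1: (software baseline, co-design framework).
LATENCY_MS = {
    "MobileNetV2": (850, 34),
    "ResNet-18": (1240, 51),
    "DenseNet-121": (3100, 142),
    "Custom GCN": (450, 18),
}

# Total peak power in mW from Table 2.
SOFTWARE_PEAK_MW = 2300
CODESIGN_PEAK_MW = 3430


def speedup(baseline_ms, accelerated_ms):
    """Ratio of baseline latency to accelerated latency."""
    return baseline_ms / accelerated_ms


def latency_reduction_pct(baseline_ms, accelerated_ms):
    """Percentage of the baseline latency removed by acceleration."""
    return 100.0 * (1.0 - accelerated_ms / baseline_ms)


def energy_mj(power_mw, latency_ms):
    """Energy in millijoules: mW * ms gives microjoules, so divide by 1000."""
    return power_mw * latency_ms / 1000.0


if __name__ == "__main__":
    for model, (sw_ms, hw_ms) in LATENCY_MS.items():
        print(f"{model}: {speedup(sw_ms, hw_ms):.1f}x speedup, "
              f"{latency_reduction_pct(sw_ms, hw_ms):.1f}% lower latency")

    # Peak-power estimate for ResNet-18 (assumption: peak power held for the
    # whole explanation, which overstates both energies).
    sw_ms, hw_ms = LATENCY_MS["ResNet-18"]
    sw_energy = energy_mj(SOFTWARE_PEAK_MW, sw_ms)
    hw_energy = energy_mj(CODESIGN_PEAK_MW, hw_ms)
    print(f"ResNet-18 peak-power estimate: {sw_energy:.0f} mJ software, "
          f"{hw_energy:.0f} mJ co-design")
```

<p>The point of the estimate is qualitative: a roughly 1.5x rise in peak power is outweighed by a latency reduction of more than 20x, which is why the energy per explanation falls even though the co-design draws more power while it runs.</p>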

<h2>Discussion</h2> <h4>Implications for Trustworthy Edge AI</h4><p>The results presented in this study indicate that hardware-software co-design is not merely an optimization but a practical necessity for deploying explainable models at the edge. The roughly 24x speedup enables a new class of interactive AI applications where users can receive immediate feedback on the reasoning behind an automated decision. This is particularly relevant in human-AI teaming environments, where trust is built through consistent and timely transparency (Kay, 2023). By reducing the latency of XAI, we move closer to the vision of "Zero Touch Management" in complex systems, where AI can be monitored and audited without interrupting the flow of operations (Coronado et al., 2022).</p><h4>Trade-offs and Limitations</h4><p>While the performance gains are substantial, they come at a cost in hardware complexity and development time. Designing custom RTL for XAI requires specialized knowledge that may not be available in all data science teams. Furthermore, the current framework is optimized for post-hoc interpretability methods like LIME and SHAP. While these are versatile, they may not be the most efficient choice for all model types, such as the large-scale graph attention networks discussed by Wang et al. (2024). Future work should explore more generalized hardware primitives that can support a wider array of XAI techniques, including gradient-based methods and intrinsic model visualizations (Paula, 2023).</p><h4>Future Directions in Embedded Intelligence</h4><p>As we look toward the future of modern computing, the integration of AI into every facet of the digital-physical world, from underwater communication (Ali et al., 2022) to the Metaverse (Chengoden et al., 2023), will require even more sophisticated co-design strategies. The rise of Large Language Models and AI-generated content (Wang et al., 2023) will likely push the boundaries of what current edge hardware can handle, necessitating a shift toward more advanced silicon-software integration (Gill et al., 2024). Our research provides a foundational step in this direction, showing that interpretability can be an integrated feature of the hardware rather than a costly software afterthought (Pillai, 2024).</p>

<h2>Conclusion</h2> <p>This research has presented a comprehensive hardware-software co-design framework for accelerating Explainable AI on edge devices. By leveraging the parallel processing capabilities of FPGA fabric alongside the flexibility of ARM-based software, we have demonstrated that it is possible to achieve real-time interpretability within the strict power and thermal constraints of embedded systems. Our findings indicate that the proposed acceleration engine provides a significant leap in performance, reducing explanation latency by over 95% compared to conventional software approaches. This work directly addresses the critical need for transparency in AI-driven edge applications, ensuring that the "black box" nature of deep learning does not hinder its adoption in sensitive fields like healthcare and legal services. As AI continues to evolve, the principles of co-design will remain central to building systems that are not only intelligent but also accountable and trustworthy. Future research will focus on extending this framework to support multi-modal XAI and investigating its scalability for next-generation 6G-enabled edge environments.</p>

<h2>References</h2> <ol class="references"> <li>Paula, M. d. O. P. (2023). Explainable AI (XAI) na Detecção de Deepfakes: Transparência e Interpretação em Modelos de Visão Computacional. <em>RCMOS - Revista Científica Multidisciplinar O Saber</em>, <em>1</em>(1). https://doi.org/10.51473/rcmos.v1i1.2023.1867</li> <li>Li, C., Zhang, K., Li, Y., Shang, J., Zhang, X., Qian, L. (2023). ANNA: Accelerating Neural Network Accelerator through software-hardware co-design for vertical applications in edge systems. <em>Future Generation Computer Systems</em>, <em>140</em>, 91-103. https://doi.org/10.1016/j.future.2022.10.001</li> <li>Farcas, A., Marculescu, R. (2023). Teaching Edge AI at the Undergraduate Level: A Hardware–Software Co-Design Approach. <em>Computer</em>, <em>56</em>(11), 30-38. https://doi.org/10.1109/mc.2023.3295755</li> <li>Gunning, D., Vorm, E., Wang, J. Y., Turek, M. (2021). DARPA's explainable AI (XAI) program: A retrospective. <em>Applied AI Letters</em>, <em>2</em>(4). https://doi.org/10.1002/ail2.61</li> <li>Açar, M. (2022). Explainable AI (XAI). <em>Journal of AI, Robotics & Workplace Automation</em>, <em>1</em>(4), 323. https://doi.org/10.69554/avxp5177</li> <li>Ran, S., Zhao, B., Dai, X., Cheng, C., Zhang, Y. (2023). Software-hardware co-design for accelerating large-scale graph convolutional network inference on FPGA. <em>Neurocomputing</em>, <em>532</em>, 129-140. https://doi.org/10.1016/j.neucom.2023.02.032</li> <li>Chen, A. T., Gupta, R., Borzenko, A., Wang, K. I., Biglari-Abhari, M. (2018). Accelerating SuperBE with Hardware/Software Co-Design. <em>Journal of Imaging</em>, <em>4</em>(10), 122. https://doi.org/10.3390/jimaging4100122</li> <li>Unknown (2020). xAI-EWS — an explainable AI model predicting acute critical illness. <em>Research Outreach</em>, (118). https://doi.org/10.32907/ro-118-3033</li> <li>Hulsen, T. (2023). Explainable Artificial Intelligence (XAI): Concepts and Challenges in Healthcare. 
<em>AI</em>, <em>4</em>(3), 652-666. https://doi.org/10.3390/ai4030034</li> <li>Mankodiya, H., Jadav, D., Gupta, R., Tanwar, S., Alharbi, A., Tolba, A. (2022). XAI-Fall: Explainable AI for Fall Detection on Wearable Devices Using Sequence Models and XAI Techniques. <em>Mathematics</em>, <em>10</em>(12), 1990. https://doi.org/10.3390/math10121990</li> <li>Nahm, E. (2023). A Study on Modeling of Activated Sludge Process in Wastewater Treatment System Utilizing XAI(eXplainable AI). <em>The transactions of The Korean Institute of Electrical Engineers</em>, <em>72</em>(2), 263-269. https://doi.org/10.5370/kiee.2023.72.2.263</li> <li>Vasa, Y. (2021). Develop Explainable AI (XAI) Solutions For Data Engineers. <em>NVEO - Natural Volatiles & Essential Oils</em>. https://doi.org/10.53555/nveo.v8i3.5769</li> <li>Holzinger, A., Müller, H. (2020). Verbinden von Natürlicher und Künstlicher Intelligenz: eine experimentelle Testumgebung für Explainable AI (xAI). <em>HMD Praxis der Wirtschaftsinformatik</em>, <em>57</em>(1), 33-45. https://doi.org/10.1365/s40702-020-00586-y</li> <li>Wang, R., Li, S., Tang, E., Lan, S., Liu, Y., Yang, J. (2024). SH-GAT: Software-hardware co-design for accelerating graph attention networks on FPGA. <em>Electronic Research Archive</em>, <em>32</em>(4), 2310-2322. https://doi.org/10.3934/era.2024105</li> <li>Wali, K. (2022). A Novel Approach to Hardware-Software Co-Design for Power-Efficient AI Systems. <em>Journal of Artificial Intelligence, Machine Learning and Data Science</em>, <em>1</em>(1), 2769-2775. https://doi.org/10.51219/jaimld/karthik-wali/579</li> <li>Chennamsetty, C. S. (2023). Hardware-Software Co-Design for Sparse and Long-Context AI Models: Architectural Strategies and Platforms. <em>International Journal of Multidisciplinary Research in Science, Engineering and Technology</em>. https://doi.org/10.15680/ijmrset.2022.0510021</li> <li>Eliot, L. (2021). The Need For Explainable AI (XAI) Is Especially Crucial In The Law. 
<em>SSRN Electronic Journal</em>. https://doi.org/10.2139/ssrn.3975778</li> <li>Pillai, V. (2024). Enhancing the Transparency of Data and ML Models Using Explainable AI (XAI). <em>SSRN Electronic Journal</em>. https://doi.org/10.2139/ssrn.4991713</li> <li>Kay, J. (2023). Foundations for Human-AI teaming for self-regulated learning with explainable AI (XAI). <em>Computers in Human Behavior</em>, <em>147</em>, 107848. https://doi.org/10.1016/j.chb.2023.107848</li> <li>Sharma, D., Koundilya, V., Verma, S. (2020). Explainable AI(XAI): A Review. <em>International Journal of Psychosocial Rehabilitation</em>, 56498-56502. https://doi.org/10.61841/v24i5/400345</li> <li>Unknown (2021). Hardware/Software Co-Design using ZYNQ SoC. <em>Journal of VLSI circuits and systems</em>, <em>3</em>(1). https://doi.org/10.31838/jvcs/03.01.03</li> <li>Aldoseri, A., Al‐Khalifa, K. N., Hamouda, A. (2023). Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges. <em>Applied Sciences</em>, <em>13</em>(12), 7082. https://doi.org/10.3390/app13127082</li> <li>Jiang, Y., Li, X., Luo, H., Yin, S., Kaynak, O. (2022). Quo vadis artificial intelligence? <em>Discover Artificial Intelligence</em>, <em>2</em>(1). https://doi.org/10.1007/s44163-022-00022-8</li> <li>Chengoden, R., Victor, N., Huynh‐The, T., Yenduri, G., Jhaveri, R. H., Alazab, M. (2023). Metaverse for Healthcare: A Survey on Potential Applications, Challenges and Future Directions. <em>IEEE Access</em>, <em>11</em>, 12765-12795. https://doi.org/10.1109/access.2023.3241628</li> <li>Wang, Y., Pan, Y., Yan, M., Su, Z., Luan, T. H. (2023). A Survey on ChatGPT: AI–Generated Contents, Challenges, and Solutions. <em>IEEE Open Journal of the Computer Society</em>, <em>4</em>, 280-302. https://doi.org/10.1109/ojcs.2023.3300321</li> <li>Askr, H., Elgeldawi, E., Ella, H. A., Elshaier, Y. A. M. M., Gomaa, M. M., Hassanien, A. E. (2022). 
Deep learning in drug discovery: an integrative review and future challenges. <em>Artificial Intelligence Review</em>, <em>56</em>(7), 5975-6037. https://doi.org/10.1007/s10462-022-10306-1</li> <li>Ali, M. F., Jayakody, D. N. K., Li, Y. (2022). Recent Trends in Underwater Visible Light Communication (UVLC) Systems. <em>IEEE Access</em>, <em>10</em>, 22169-22225. https://doi.org/10.1109/access.2022.3150093</li> <li>Omitaomu, O. A., Niu, H. (2021). Artificial Intelligence Techniques in Smart Grid: A Survey. <em>Smart Cities</em>, <em>4</em>(2), 548-568. https://doi.org/10.3390/smartcities4020029</li> <li>Gill, S. S., Wu, H., Patros, P., Ottaviani, C., Arora, P., Pujol, V. C. (2024). Modern computing: Vision and challenges. <em>Telematics and Informatics Reports</em>, <em>13</em>, 100116. https://doi.org/10.1016/j.teler.2024.100116</li> <li>Coronado, E., Behravesh, R., Subramanya, T., Fernández–Fernández, A., Siddiqui, M. S., Costa‐Pérez, X. (2022). Zero Touch Management: A Survey of Network Automation Solutions for 5G and 6G Networks. <em>IEEE Communications Surveys & Tutorials</em>, <em>24</em>(4), 2535-2578. https://doi.org/10.1109/comst.2022.3212586</li> </ol> </article>

Published by Academic Ink Review Journal. Open Access under CC BY 4.0.