# LoAS: Fully Temporal-Parallel Dataflow for Dual-Sparse Spiking Neural Networks

Ruokai Yin Yale University New Haven, USA ruokai.yin@yale.edu

Youngeun Kim Yale University New Haven, USA youngeun.kim@yale.edu

Di Wu University of Central Florida Orlando, USA di.wu@ucf.edu

Priyadarshini Panda Yale University New Haven, USA priya.panda@yale.edu



<span id="page-0-1"></span>Fig. 1. An illustrative example of FTP dataflow and LoAS. FTP dataflow is shown along with the prior dataflow design for SNNs. Temporal sequential tick-batch is from SpinalFlow [\[36\]](#page-12-8), and partially temporal parallel is from PTB [\[29\]](#page-12-9). Each arrow loop indicates the processing of one timestep. The vertical line indicates that the processing is in parallel.

Certain works have managed to achieve approximately 98% weight sparsity and 90% spike sparsity [\[23\]](#page-12-5), leveraging the lottery ticket hypothesis [\[13\]](#page-12-10). These works have outlined the potential of dual-sparse SNNs in reaching unprecedented energy efficiency and memory footprint with little to no compromise in accuracy.

Challenge. Although dual-sparse SNNs have made strides with algorithmic advancements, the hardware has not yet caught up to make full use of such dual-sparsity. In general, existing SNN accelerators can be categorized into two main groups. First, multi-core neuromorphic systems<sup>[1](#page-0-0)</sup> employ a plethora of cores, even chips, to exploit the inherent parallelism in spiking neuron dynamics [\[1\]](#page-12-11), [\[7\]](#page-12-12), [\[14\]](#page-12-13), [\[48\]](#page-13-5). Though capable of capturing the massive parallelism and sparse activities across neurons, multi-core neuromorphic systems require all neurons (including weights) to be mapped on-chip. This undoubtedly wastes a huge amount of hardware resources on the neurons that are not involved in any computations due to the dual-sparsity [\[40\]](#page-12-14). Second, dataflow-based SNN accelerators draw inspiration from dataflow-based ANN accelerators and take advantage of the rich data reuse among the array of processing elements [\[29\]](#page-12-9), [\[33\]](#page-12-15), [\[36\]](#page-12-8). Nonetheless, these designs have mainly focused on processing dense SNN workloads. Currently, there is a lack of dataflow architectures that uniquely

<span id="page-0-0"></span><sup>1</sup>We are not comparing with those systems due to our focus on single-core dataflow SNN accelerator designs.

*Abstract*—Spiking Neural Networks (SNNs) have gained significant research attention in the last decade due to their potential to drive resource-constrained edge devices. Though existing SNN accelerators offer high efficiency in processing sparse spikes with dense weights, opportunities are less explored in SNNs with sparse weights, i.e., dual-sparsity. In this work, we study the acceleration of dual-sparse SNNs, focusing on their core operation, sparse-matrix-sparse-matrix multiplication (spMspM). We observe that naively running a dual-sparse SNN on existing spMspM accelerators designed for dual-sparse Artificial Neural Networks (ANNs) exhibits sub-optimal efficiency. The main challenge is that processing timesteps, a natural property of SNNs, introduces an extra loop to ANN spMspM, leading to longer latency and more memory traffic. To address the problem, we propose a fully temporal-parallel (FTP) dataflow, which minimizes both data movement across timesteps and the end-to-end latency of dual-sparse SNNs. To maximize the efficiency of FTP dataflow, we propose an FTP-friendly spike compression mechanism that efficiently compresses single-bit spikes and ensures contiguous memory access. We further propose an FTP-friendly inner-join circuit that can lower the cost of the expensive prefix-sum circuits with almost no throughput penalty. All the above techniques for FTP dataflow are encapsulated in LoAS, a Low-latency inference Accelerator for dual-sparse SNNs. With FTP dataflow, compression, and inner-join, running dual-sparse SNN workloads on LoAS demonstrates significant speedup (up to $8.51\times$) and energy reduction (up to $3.68\times$) compared to running them on prior dual-sparse accelerators.

#### I. INTRODUCTION

Spiking Neural Networks (SNNs) have attracted considerable interest as potential energy-efficient substitutes for Artificial Neural Networks (ANNs) [\[5\]](#page-12-0), [\[11\]](#page-12-1), [\[43\]](#page-13-0). Inspired by the biological neuron, SNNs leverage highly sparse unary-coded ($\{0,1\}$) spikes to compute and communicate information [\[54\]](#page-13-1). Thus, running SNNs on hardware significantly reduces computation and data movement, making it suitable for edge computing. Therefore, SNNs have been widely used in computer vision tasks, such as image classification [\[46\]](#page-13-2), [\[56\]](#page-13-3), optical flow estimation [\[28\]](#page-12-2), semantic segmentation [\[21\]](#page-12-3), and object detection [\[20\]](#page-12-4).

Opportunity. As the need for edge devices with limited memory capacity increases, recent research on SNNs highlights the significance of dual-sparsity (both spikes and weights are sparse), which can be achieved by neural pruning techniques [\[5\]](#page-12-0), [\[23\]](#page-12-5). Pruning the weight connections of SNNs has been explored during both training [\[4\]](#page-12-6), [\[49\]](#page-13-4) and inference [\[38\]](#page-12-7).

<span id="page-1-0"></span>TABLE I COMPARISON OF LOAS WITH PRIOR SNN ACCELERATORS. S AND T DENOTE THE SPATIAL AND TEMPORAL DIMENSIONS. SPATIAL PARALLELISM MEANS PE-LEVEL PARALLELISM.

| Accelerator     | Spike<br>Sparsity | Weight<br>Sparsity | Parallel<br>support | Neuron<br>support |
|-----------------|-------------------|--------------------|---------------------|-------------------|
| SpinalFlow [36] |                   | x                  |                     | LIF               |
| PTB [29]        |                   | x                  | S+partial-T         | LIF               |
| Stellar [33]    |                   | x                  | S+fully-T           | <b>FS</b>         |
| LoAS (ours)     |                   |                    | S+fully-T           | LIF               |

target dual-sparsity in SNNs. Table [I](#page-1-0) summarizes existing dataflow SNN accelerators.

Insight. Though spikes and weights have varying bitwidth, in dual-sparse SNNs, *their interactions follow the pattern in sparse-matrix-sparse-matrix multiplication (spMspM)*, which has been extensively studied in ANNs [\[9\]](#page-12-16), [\[15\]](#page-12-17), [\[18\]](#page-12-18), [\[19\]](#page-12-19), [\[39\]](#page-12-20), [\[41\]](#page-13-6), [\[42\]](#page-13-7), [\[51\]](#page-13-8), [\[62\]](#page-13-9), [\[64\]](#page-13-10). However, naively running dual-sparse SNNs on existing spMspM accelerators is inefficient. The reason is multifaceted. First, the timesteps in SNNs complicate the dataflow design for existing spMspM accelerators. spMspM operations in ANNs are triple-nested for-loops [\[41\]](#page-13-6), [\[55\]](#page-13-11). Different spMspM dataflows are obtained by permuting the order of loops. However, in SNNs, the timesteps introduce an extra level of for-loop, leading to extra latency and memory traffic. What's worse, it constrains dataflow dependency and doubles the dataflow design space, delaying the time-to-solution. Second, the asymmetric bitwidth of spikes and weights in SNNs makes it inefficient to use conventional compression formats in ANN spMspM accelerators. Existing ANN spMspM accelerators store sparse matrices with popular compressed formats like compressed sparse row (CSR). These formats usually use multiple bits to record the coordinates of the non-zero values, and the hardware is designed accordingly. Consequently, using multiple bits to compress single-bit spikes (valued at either 1 or 0) is extremely inefficient for dual-sparse SNNs.

Proposal. To solve these problems and unleash the potential of dual-sparse SNNs in the presence of spMspM, we propose fully temporal-parallel (FTP) dataflow, illustrated in Figure [1](#page-0-1). FTP dataflow parallelizes all timesteps to avoid complicated dataflow dependency, minimizing latency and memory traffic. To maximize the efficiency of FTP dataflow on memory and computation, we design FTP-friendly spike compression and inner-join mechanisms. The proposed compression packs spikes along timesteps and can access the relevant memory space in a contiguous manner. The proposed inner-join nearly halves the cost of cumbersome prefix-sum circuits with almost no throughput penalty compared to prior inner-join designs. To validate FTP dataflow, we design LoAS, a Low-latency Inference Accelerator for Dual-Sparse Spiking Neural Networks. Our contributions are listed below:

1) We observe that SNNs with rich dual-sparsity from both input spikes and weight connections are sub-optimal on existing hardware. SNN hardware usually does not support sparse weights, while ANN spMspM hardware fails to efficiently process timesteps in SNNs with low latency and memory traffic.
2) To improve the efficiency of processing timesteps, we propose a fully temporal-parallel (FTP) dataflow. FTP avoids extra memory traffic across timesteps and minimizes the latency penalty in processing timesteps sequentially.
3) To make the most of FTP, we propose FTP-friendly spike compression for efficient yet contiguous memory access and an FTP-friendly inner-join mechanism for low-cost computation with almost no latency penalty.
4) We build LoAS, a novel architecture that exemplifies the FTP dataflow. With both FTP-friendly compression and inner-join, LoAS is able to achieve high speedup and energy efficiency against other sequential-running spMspM baselines.

<span id="page-1-2"></span>Fig. 2. Difference between the LIF-based SNN neuron and ReLU-based ANN neuron. We compare the behavior of LIF-based and ReLU-based neurons and their hardware implementations.

The remainder of the text is organized as follows. Section II reviews the background and justifies the motivation. Sections [III](#page-4-0) and [IV](#page-5-0) articulate our proposed FTP dataflow and LoAS architecture. Next, Sections [V](#page-7-0) and [VI](#page-8-0) evaluate our design. Finally, Sections [VII](#page-11-0) and [VIII](#page-11-1) discuss and conclude this work.

#### II. BACKGROUND AND MOTIVATION

# <span id="page-1-3"></span><span id="page-1-1"></span>*A. Preliminary of SNNs*

*1) Leaky-Integrate-and-Fire Neuron:* The Leaky-Integrate-and-Fire (LIF) neuron is a classical neuron model [\[8\]](#page-12-21) and is widely adopted by prior SNN works [\[23\]](#page-12-5), [24], [\[63\]](#page-13-12), [\[65\]](#page-13-13), thanks to its bio-plausibility and high accuracy. In this work, we focus on accelerating the workloads of dual-sparse SNNs that use LIF neurons.

During inference, each layer has an input spike tensor $A \in \mathbb{U}^{M \times K \times T}$, where $\mathbb{U} = \{0,1\}$, and a weight matrix $B \in \mathbb{Z}^{K \times N}$. Here $T$ is the total number of timesteps; $M$, $N$, and $K$ are the spatial dimensions of the input and weight matrices. The behavior of an SNN layer can be described below:

Step 1: Sparse Matrix Multiplication. Sparse matrix multiplication across all timesteps is performed to obtain the full output matrix $O \in \mathbb{Z}^{M \times N \times T}$, which will be sent to LIF neurons.

$$
O_{m,n}[t_i] = \sum_{k=0}^{K-1} A_{m,k}[t_i] B_{k,n}, \tag{1}
$$

where $t_i$ is the current timestep. With dual-sparsity, sparse matrix multiplication becomes spMspM.

Step 2: LIF firing. LIF neurons take the snapshot of $O$ at timestep $t_i$ and generate a snapshot of the output spike tensor $C \in \mathbb{U}^{M \times N \times T}$ for the current timestep $t_i$:

$$
C_{m,n}[t_i] = \begin{cases} 1 & X_{m,n}[t_i] > v_{th} \\ 0 & \text{else,} \end{cases} \tag{2}
$$

where

$$
X_{m,n}[t_i] = O_{m,n}[t_i] + U_{m,n}[t_{i-1}].
$$

Here,  $U[t_{i-1}]$  is the membrane potential that carries over the temporal information from previous timestep  $t_{i-1}$ , and  $v_{th}$  is the firing threshold, a pre-defined scalar value.

Step 3: Membrane Potential Update. After the output spikes are generated, we update the membrane potential that will carry residual information to the next timestep according to the equation below.<sup>[2](#page-2-0)</sup>

$$
U_{m,n}[t_i] = \tau X_{m,n}[t_i](1 - C_{m,n}[t_i]),\tag{3}
$$

where  $\tau \in (0, 1)$  is the leaky factor. From the above equations, we observe that to generate the output spike matrix  $C$  for timestep  $t_i$ , we need to know the information from the previous timestep  $U[t_{i-1}]$ . This brings temporal dependency between output spike matrices across timesteps. The behavior of a LIF neuron can be found in Figure [2.](#page-1-2)
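The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `snn_layer`, the nested-list layout `A[m][k][t]`, and the zero initial membrane potential are assumptions made for the example.

```python
def snn_layer(A, B, v_th, tau):
    """One SNN layer with LIF neurons and hard reset (Equations (1)-(3)).

    A: input spikes, A[m][k][t] in {0, 1}
    B: integer weights, B[k][n]
    Returns output spikes C[m][n][t].
    """
    M, K, T = len(A), len(A[0]), len(A[0][0])
    N = len(B[0])
    # Membrane potential U, assumed to start at zero (not stated in the text).
    U = [[0.0] * N for _ in range(M)]
    C = [[[0] * T for _ in range(N)] for _ in range(M)]
    for t in range(T):  # timesteps must run in order: U carries over
        for m in range(M):
            for n in range(N):
                # Step 1: accumulate weights wherever the input spike is 1.
                O = sum(B[k][n] for k in range(K) if A[m][k][t])
                # Step 2: add the carried-over potential and compare to v_th.
                X = O + U[m][n]
                C[m][n][t] = 1 if X > v_th else 0
                # Step 3: leak, and hard-reset to zero if the neuron fired.
                U[m][n] = tau * X * (1 - C[m][n][t])
    return C


# One output neuron, two inputs, three timesteps.
A = [[[1, 0, 1], [0, 1, 0]]]   # input 0 fires at t0 and t2, input 1 at t1
B = [[2], [3]]
C = snn_layer(A, B, v_th=3, tau=0.5)
print(C)  # [[[0, 1, 0]]]
```

In this example the spike at $t_1$ only occurs because of the potential left over from $t_0$ ($3 + 0.5 \times 2 = 4 > 3$), which is the temporal dependency discussed above.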

*2) Spike Encoding and SNN Training:* One key step in leveraging SNNs in conventional machine learning tasks is encoding the input source data (e.g., image pixels or text embeddings) into spike trains across multiple timesteps. The input spike trains are then sequentially sent to the SNN for processing. Recent SNN works adopt direct encoding (a special case of rate encoding) to achieve high accuracy on conventional computer vision tasks in very few timesteps ($\leq 4$) [?], [\[23\]](#page-12-5), [\[25\]](#page-12-23), [\[57\]](#page-13-14), [\[65\]](#page-13-13). In direct encoding, the source data, instead of being directly converted into spike trains, first goes through one ANN layer. The output from the ANN layer is then converted into spike trains. We will focus on accelerating direct-coded dual-sparse SNNs in this work. The SNNs are trained using backpropagation-through-time (BPTT) [53] with surrogate gradient [37] to achieve very close performance to ANNs on many complex tasks [\[57\]](#page-13-14), [\[65\]](#page-13-13).

### <span id="page-2-3"></span>*B. Distinctive Features and Challenge of SNNs*

Several distinctive features make SNNs favorable for low-power edge deployment, but they also come with challenges.

Feature 1: Unary Activation. One of the most distinctive features of SNNs is their unary spike activation. More specifically, SNNs leverage single-bit non-weighted activation to propagate information through layers. The primary benefit of the unary activation is the simplified low-power arithmetic units that they require. As shown in Figure [2](#page-1-2), compared to the multiply-accumulate (MAC) of ANNs, SNNs only require simple bitwise-AND and accumulate (AC) operations during inference time.<sup>[3](#page-2-1)</sup> Without the expensive multipliers [16], the computations for SNNs require extremely low power and area.

Feature 2: Sparse Spike Activity. The second feature of SNNs is their highly sparse spike-firing activity. In ANNs, upon completion, MAC results go through the ReLU unit, which filters out non-positive outputs. Different from ANNs, AC results in SNNs go through the Leaky-Integrate-and-Fire (LIF) unit, which only fires (generates an output of 1) when the input is greater than a pre-set threshold. As a result, the output sparsity in SNNs is usually much higher ($\sim 90\%$) [\[60\]](#page-13-16), [\[61\]](#page-13-17), [\[63\]](#page-13-12), [\[65\]](#page-13-13) than that of ANNs ($\sim 50\%$) [\[41\]](#page-13-6), [\[45\]](#page-13-18). Sparser outputs lead to more computation and memory savings in the context of spMspM acceleration.

Challenge: Repeated Timesteps. Despite the aforementioned hardware-friendly features, one main challenge of deploying SNNs on hardware is their intrinsic repeated timesteps. A timestep is the minimum unit of time in SNNs and is therefore discrete.<sup>[4](#page-2-2)</sup> In one timestep, each neuron needs to complete the AC operations for all inputs, fire a spike if necessary, and update its membrane potential (see Section [II-A](#page-1-3)). The SNN needs to run across multiple timesteps to capture the temporal dynamics from the input data, as shown in Figure [2](#page-1-2). Running multiple timesteps increases latency and energy consumption, diluting the advantage of low-power circuits unless we have a specialized architecture design [\[36\]](#page-12-8).

# <span id="page-2-4"></span>*C. spMspM Dataflows in SNNs*

There are various ways to map spMspM onto hardware, each with unique efficiency [\[31\]](#page-12-26), [\[34\]](#page-12-27). Three different spMspM dataflows have been proposed in existing dual-sparse ANN accelerators: Inner-product (**IP**) [\[15\]](#page-12-17), [\[18\]](#page-12-18), [\[19\]](#page-12-19), [\[42\]](#page-13-7), Outer-product (**OP**) [\[9\]](#page-12-16), [\[39\]](#page-12-20), [\[41\]](#page-13-6), [\[64\]](#page-13-10), and Gustavson's (**Gust**) [\[51\]](#page-13-8), [\[62\]](#page-13-9). In Figure [3](#page-3-0), we illustrate these three dataflows in SNNs for two input matrices $A$ and $B$, and an output matrix $C$. We also formulate their abstract loop nests on the right-hand side. As we discussed in Section [II-B](#page-2-3), the multiple timesteps cannot be ignored for spMspM operations in SNNs.

Inside the black box in Figure [3](#page-3-0), the dataflow is for one timestep, thus identical to ANN dataflow. Outside the black box, multiple input matrices $A$ (blurred) represent the input spike matrices across different timesteps, which need to be processed. Meanwhile, multiple output spike matrices $C$ that have temporal dependency between each other are also generated. Specifically, to accommodate the timesteps in SNNs, we need to consider one more loop dimension ($t$ dimension) in the original triple-nested for-loop. The $t$ dimension (annotated in the blue box) brings temporal dependency to each output pixel in SNNs. For example, to process the SNN using **IP** dataflow as shown in Figure [3](#page-3-0), we first calculate the output cell at position $(0,0)$ for timestep 0 ($C[0,0,0]$), then instead of moving to the position $(0,1)$, we move on to process the output cell at $(0,0)$ for timestep 1 ($C[0,0,1]$). Since the output cell $C[0,0,1]$ is temporally dependent on the result of the output cell $C[0,0,0]$, we cannot process $C[0,0,1]$ before $C[0,0,0]$.

<span id="page-2-0"></span><sup>2</sup>We focus on the hard reset (membrane potential is reset to zero if there is an output spike of one) in this work. Though there exist other reset schemes, sticking with one of them will not lose generality in the hardware design.

<span id="page-2-1"></span><sup>3</sup>There exist other implementations using multiplexers instead [\[29\]](#page-12-9), [\[36\]](#page-12-8). We focus on using bitwise-AND gates in this work.

<span id="page-2-2"></span><sup>4</sup>Timestep is also called tick [\[36\]](#page-12-8) or time-point [\[29\]](#page-12-9) in other works. We follow the naming convention adopted by the latest SNN algorithm works.

<span id="page-3-0"></span>Fig. 3. Comparison of different spMspM dataflows for SNNs. Here, for illustration purposes, we put $C[t_i]$ as the spMspM result between $A[t_i]$ and $B$ to align with spMspM in ANNs. In SNNs, we need to go through one more LIF step (Equation (2)) to get $C[t_i]$. The circled numbers illustrate the order of computation for the specific spMspM dataflow. Please note that we fix the position of the **t** dimension for illustration purposes. In practice, there will be a total of 16 possible permutations of spMspM dataflow in SNNs, which we will discuss in Section [III](#page-4-0).

#### <span id="page-3-3"></span>*D. ANN spMspM Hardware for dual-sparse SNNs*

We review existing ANN spMspM accelerators to understand why naively running dual-sparse SNNs on these accelerators is sub-optimal.

Inner-join Design: For the **IP** dataflow, prior accelerators usually adopt the inner-join-based design [\[9\]](#page-12-16), [\[15\]](#page-12-17). In such designs, non-zero values in rows of matrix $A$ and columns of matrix $B$ are compressed using bitmask representation (a bit string that has 1's for positions with non-zero values and 0's otherwise). An inner-join unit scans two bitmasks on the fly to determine if there's a matched position (both multiplicands are non-zero) and then sends the matched pairs to the compute units. Running dual-sparse SNNs on an inner-join-based design does not require the extra bitmasks for the input spike matrix $A$ (the unary spike train itself can be viewed as a bitmask). However, as shown in Figure [4](#page-3-1), the timesteps will impose multiple extra rounds of running the expensive inner-join units (e.g., occupying roughly 46% of the system-level power [\[15\]](#page-12-17)), thus incurring high energy cost. Moreover, since the spike trains are used as bitmasks, all the spikes, whether 1 or 0, must be fetched from off-chip DRAM. This brings no memory traffic saving on the sparse spike matrix $A$.
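To make the bitmask matching concrete, the following is a small software sketch of an inner-join, assuming list-based bitmasks and compressed value arrays. The names `inner_join` and `snn_accumulate` are hypothetical, and the sequential scan with running offsets only stands in for the prefix-sum circuits that hardware designs evaluate in parallel.

```python
def inner_join(bm_a, vals_a, bm_b, vals_b):
    """Return (a, b) value pairs at positions where both bitmasks are 1.

    bm_a, bm_b: lists of 0/1 over the uncompressed positions
    vals_a, vals_b: the non-zero values only, in position order
    """
    pairs = []
    off_a = off_b = 0  # running prefix sums = offsets into the value arrays
    for bit_a, bit_b in zip(bm_a, bm_b):
        if bit_a and bit_b:
            pairs.append((vals_a[off_a], vals_b[off_b]))
        off_a += bit_a
        off_b += bit_b
    return pairs


def snn_accumulate(spike_rows, bm_b, vals_b):
    """Run one inner-join per timestep, using each spike row as its own bitmask."""
    sums = []
    for spikes in spike_rows:            # one extra inner-join round per timestep
        ones = [1] * sum(spikes)         # every non-zero spike has value 1
        pairs = inner_join(spikes, ones, bm_b, vals_b)
        sums.append(sum(b for _, b in pairs))
    return sums


# ANN-style join: both operands carry multi-bit values.
ann_pairs = inner_join([1, 0, 1, 1], [5, 7, 2], [0, 1, 1, 1], [3, 4, 6])
print(ann_pairs)   # [(7, 4), (2, 6)]

# SNN-style join: two timesteps of spikes against the same weight column.
snn_sums = snn_accumulate([[1, 0, 1, 0], [0, 0, 1, 1]], [0, 1, 1, 1], [3, 4, 6])
print(snn_sums)    # [4, 10]
```

Note that `snn_accumulate` reads every spike bit, zeros included, and repeats the join once per timestep, which mirrors the two costs described above.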

Merger-based Design: Unlike **IP** dataflow designs that exhibit full output matrix (C) reuse, **OP** and **Gust** dataflow designs focus on the reuse of input matrix A and B. In **OP**,



Fig. 4. An example of the inner-join design. The difference between the behavior of ANN and SNN is shown. \*Data from SparTen [\[15\]](#page-12-17).

<span id="page-3-1"></span>

<span id="page-3-2"></span>Fig. 5. Off-chip traffic of partial sum matrices on different SNN layers. We envision SNNs with a timestep of 1 and 4 running on GoSPA [\[9\]](#page-12-16), an OP dataflow spMspM accelerator.

once, leading to efficient input data reuse. However, one partial sum is generated at a time and merged later. While these two dataflows have better data reuse on the input matrix, the partial sum matrices (rows) potentially bring more off-chip data traffic. To amortize the large memory traffic of partial sums, some designs implement large and costly mergers (e.g., $38\times$ more area than multipliers [\[64\]](#page-13-10)) to merge as many partial sum matrices (rows) as possible before sending them back to the off-chip DRAM. Due to the extra $t$ dimension, running dual-sparse SNNs on a merger-based design either requires a more complex merger that is capable of digesting the extra partial sum traffic or incurs more off-chip memory traffic. As shown in Figure [5](#page-3-2), with four timesteps, on average, $4\times$ more partial sum traffic will be induced compared to a single timestep.



<span id="page-4-1"></span>Fig. 6. Example of PTB's partially temporal parallel design. Each column of the PE array processes a time-window that consists of multiple timesteps. Different time-windows run in parallel, but the timesteps inside the window are still processed sequentially.

## <span id="page-4-3"></span>*E. Dataflow Architecture for SNNs*

SpinalFlow: Temporal Sequential Design. SpinalFlow [\[36\]](#page-12-8) is the first SNN-tailored accelerator for extracting the efficiency from the single-bit activation and the extremely sparse spike activity. The authors identified the challenge of sequentially processing the entire SNN network through timesteps. To overcome the challenge, SpinalFlow processes all timesteps for one layer and then proceeds to the next layer, as shown in Figure [1](#page-0-1). SpinalFlow dispatches LIF neurons across different processing elements (PEs) and parallelizes the computation. Within each layer, the timesteps are processed sequentially. SpinalFlow is optimized exclusively for temporal-coded SNNs, which potentially lag behind rate-coded SNNs in accuracy [\[29\]](#page-12-9). In this work, we focus on accelerating spMspM for general rate-coded SNNs, which achieve accuracy competitive with ANNs in various tasks.

PTB: Partially Temporal Parallel. While SpinalFlow's design is tailored to temporal-coded SNNs, PTB [\[29\]](#page-12-9) proposes a general architecture design for rate-coded SNNs. By leveraging the high data-reuse pattern across different PEs in the systolic array architecture [\[27\]](#page-12-28), PTB breaks the processing of all timesteps into multiple time-windows (each consisting of several contiguous timesteps) and runs these time-windows in parallel, as shown in Figure [1](#page-0-1). PTB maps multiple time-windows in parallel across different columns of the systolic array. The computation of different LIF neurons is also parallelized across the rows of the systolic array. We illustrate this hardware mapping strategy in detail in Figure [6](#page-4-1). Though PTB tries to parallelize the processing of timesteps, the parallelization is at the granularity of the time-window. Inside each time-window (column of PEs), the timesteps are still processed sequentially. Consequently, we categorize PTB as a partially temporal parallel design. One aspect that distinguishes LoAS from PTB is that LoAS places the temporal dimension in the innermost loop, enabling all optimizations.

Prior SNN accelerators with LIF neurons process timesteps in a sequential or partially parallel manner. As we discussed in Sections [II-C](#page-2-4) and [II-D](#page-3-3), this makes it very challenging for existing SNN designs to perform well on spMspM SNN acceleration. Thus, we need an spMspM-friendly strategy to process timesteps.

<span id="page-4-2"></span>**Algorithm 1** Fully Temporal-Parallel dataflow (**FTP**)

**Input:** Input spike matrix $A \in \mathbb{U}^{M \times K \times T}$ ($\mathbb{U} = \{0, 1\}$); weight matrix $B \in \mathbb{Z}^{K \times N}$

**Output:** Output spike matrix $C \in \mathbb{U}^{M \times N \times T}$

$$
\begin{array}{l}
\textbf{for } m \in M \textbf{ do} \\
\quad \textbf{for } n \in N \textbf{ do} \\
\quad\quad \textbf{for } k \in K \textbf{ do} \\
\quad\quad\quad \textbf{parallel-for } t \in T \textbf{ do} \quad \triangleright \text{ Spatially unrolled} \\
\quad\quad\quad\quad O[m, n, t] \mathrel{+}= A[m, k, t] \times B[k, n] \\
\quad\quad \textbf{end for} \\
\quad\quad \textbf{parallel-for } t \in T \textbf{ do} \quad \triangleright \text{ Spatially unrolled} \\
\quad\quad\quad C[m, n, t] = \mathrm{LIF}(O[m, n, t]) \\
\quad \textbf{end for} \\
\textbf{end for}
\end{array}
$$

Stellar: Fully Temporal Parallel but with non-LIF neurons. Stellar [\[33\]](#page-12-15) is another systolic array SNN accelerator which attempts to process timesteps in a fully parallel manner. Nonetheless, Stellar focuses on optimizing for the Few Spikes (FS) neuron [\[52\]](#page-13-19), as shown in Table [I.](#page-1-0) FS neurons behave differently from LIF neurons by detaching the spike accumulating and firing stages. Therefore, FS neurons naturally do not have temporal dependency among the input data at the spike accumulation stage. This makes fully parallel temporal processing straightforward in Stellar. On the contrary, as discussed in Section [II-A,](#page-1-3) temporal dependency naturally exists in the input data for the LIF neuron, which makes its design space different from the one in Stellar for fully temporal parallel processing. Unlike the widely adopted LIF neurons, supporting FS neurons also requires non-trivial algorithm-hardware codesign, which is out of the scope of this work.

## III. FULLY TEMPORAL PARALLEL DATAFLOW

<span id="page-4-0"></span>We propose a *fully temporal-parallel dataflow* (**FTP**) that targets reducing the negative effects of repeatedly processing the timesteps on spMspM accelerators (Section [II-D\)](#page-3-3). The proposed **FTP** is formulated in Algorithm [1.](#page-4-2)

An SNN-friendly spMspM dataflow should satisfy three goals: (1) avoid as much data refetch as possible across the timesteps; (2) generate as few partial sums as possible on the temporal dimension (timesteps); (3) reduce the latency as much as possible on the temporal dimension to reduce the extra cost of sparsity handling units.

Our first observation is that, for all three spMspM dataflows (Section [II-C](#page-2-4)), unless the temporal dimension ($t$-dim) is placed at the innermost loop, it brings at least $T$ times more data refetch to the dimensions below it, compared to the original dataflow. For example, in **OP**, if $t$-dim is placed between $m$ and $n$, $T$ times more access to $B$'s rows is required. If $t$-dim is placed between $k$ and $m$, $T$ times more access to $A$'s columns and $B$'s rows is required. Depending on the on-chip buffer capacity, repeated memory access might lead to more expensive access to the off-chip memory, which opposes goal (1).
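The refetch effect can be checked with a toy access counter. The sketch below uses an **IP**-style loop nest ($m$, $n$, $k$) rather than the **OP** example in the text, and it assumes a deliberately simple cost model with no on-chip buffering: each time the loop nest reaches a weight $B[k][n]$ counts as one fetch. The function name `count_b_fetches` and the two `t_position` labels are made up for this illustration.

```python
def count_b_fetches(M, N, K, T, t_position):
    """Count weight fetches for two placements of the t loop in an (m, n, k) nest.

    "innermost": t sits below k, so one fetched weight serves all T timesteps.
    "above_k":   t sits between n and k, so the k loop is replayed per timestep.
    """
    fetches = 0
    for m in range(M):
        for n in range(N):
            if t_position == "innermost":
                for k in range(K):
                    fetches += 1      # B[k][n] fetched once, reused for every t
            elif t_position == "above_k":
                for t in range(T):
                    for k in range(K):
                        fetches += 1  # B[k][n] fetched again for each timestep
            else:
                raise ValueError("unknown t_position: " + t_position)
    return fetches


inner = count_b_fetches(M=2, N=3, K=4, T=4, t_position="innermost")
outer = count_b_fetches(M=2, N=3, K=4, T=4, t_position="above_k")
print(inner, outer, outer // inner)   # 24 96 4
```

Under this model the ratio between the two placements is exactly $T$; a real design with on-chip buffers may hide part of that cost, as the paragraph above notes.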

Our second observation is that both **OP** and **Gust** dataflows are not suitable for dual-sparse SNNs since they oppose goal (2). In **OP** dataflow, no matter where we insert the $t$-dim into the original triple-nested loop, we always produce $T$ times more partial sum matrices compared to the original **OP** dataflow. The partial sums need to be stored in an on-chip cache until all partial sums along both the spatial ($k$) and temporal ($t$-dim) dimensions are accumulated. This adds extra memory overhead in **OP**. The same problem also exists for **Gust** dataflow: the $t$-dim will either generate $T$ times more partial sum rows or have $T$ times more access to both the $k$ and $n$ dimensions. The last observation is that regardless of the position of $t$-dim, as long as we process it sequentially, it always incurs $T$ times more processing latency, which opposes goal (3).

Our solution is straightforward but effective. We first choose to position the $t$-dim at the innermost loop of the **IP** dataflow, as given in Algorithm [1](#page-4-2). This design choice has several advantages. Firstly, putting the $t$-dim at the innermost loop ensures that no extra data movement will be incurred (goal (1)). Secondly, since **IP** dataflow has efficient output reuse, no extra partial sums will be generated on the $t$-dim (goal (2)). Lastly, we fully parallelize the $t$-dim and eliminate the latency brought by sequentially processing timesteps. This is equivalent to transforming the *for-loop* of $t$ into a *parallel-for* loop [\[55\]](#page-13-11). This *parallel-for* loop parallelizes the operation across different spatial instances and requires minimal hardware overhead, because only cheap accumulators are duplicated and the number of timesteps in direct-coded SNNs is small (Section [II-A](#page-1-3)). We later show in the ablation studies that **FTP** scales well with increasing timesteps.
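As a rough software analogue of Algorithm 1, the sketch below keeps one accumulator per timestep and updates all of them for each fetched weight. It is an illustration under stated assumptions, not the LoAS datapath: the names `ftp_layer` and `lif_over_time` are hypothetical, the spikes of each input neuron are assumed to be packed into one integer with $t_0$ as the most significant bit, and the `for t` loop stands in for the spatially unrolled *parallel-for*.

```python
def lif_over_time(O, v_th, tau):
    """Apply LIF with hard reset to the T accumulated sums of one output neuron."""
    U, spikes = 0.0, []            # membrane potential assumed to start at zero
    for o in O:                    # the potential carries from one timestep to the next
        X = o + U
        s = 1 if X > v_th else 0
        spikes.append(s)
        U = tau * X * (1 - s)
    return spikes


def ftp_layer(A_packed, B, T, v_th, tau):
    """IP loop nest with the t dimension innermost.

    A_packed[m][k]: int holding the T spikes of input (m, k), t0 in the MSB
    B[k][n]: integer weight
    Returns C[m][n] as a list of T output spikes.
    """
    M, K, N = len(A_packed), len(B), len(B[0])
    C = [[None] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            O = [0] * T                       # one accumulator per timestep
            for k in range(K):
                w, word = B[k][n], A_packed[m][k]
                if w == 0 or word == 0:       # skip zero weights and silent inputs
                    continue
                for t in range(T):            # parallel-for in hardware
                    if (word >> (T - 1 - t)) & 1:
                        O[t] += w             # weight fetched once, used for all t
            C[m][n] = lif_over_time(O, v_th, tau)
    return C


# Input 0 fires at t0 and t2 (0b101), input 1 fires at t1 (0b010).
C_ftp = ftp_layer([[0b101, 0b010]], [[2], [3]], T=3, v_th=3, tau=0.5)
print(C_ftp)   # [[[0, 1, 0]]]
```

Each weight is read once per $(m, n, k)$ and no partial sums leave the $T$ local accumulators, which corresponds to goals (1) and (2). In this sketch the LIF step still walks the accumulators in timestep order because Equation (3) carries the membrane potential forward.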



<span id="page-5-1"></span>Fig. 7. Architecture of LoAS and the microarchitecture of the TPPE. Red arrows are the enable signals that skip the computation on 0 spikes [\[59\]](#page-13-20).

#### IV. LOAS

<span id="page-5-0"></span>An overview of LoAS is shown in Figure [7.](#page-5-1) LoAS consists of multiple temporal parallel processing elements (TPPEs) and parallel Leaky-Integrate-Fire units (P-LIFs) that are tailored to run the FTP dataflow; a scheduler that distributes workloads across TPPEs; and a compressor that compresses the output



<span id="page-5-3"></span>Fig. 8. Example showing how input spikes are compressed in LoAS. bm stands for the bitmask, and ptr stands for the pointer.

spikes from P-LIFs and writes them back to the on-chip memory. An on-chip SRAM is equipped to capture data reuse.

#### <span id="page-5-4"></span>*A. Spikes Compression*

We first discuss how sparse input spikes (matrix  $A$ ) across timesteps are compressed in LoAS. Efficiently compressing matrix A in SNNs necessitates solving two challenges:

How to maximize the compression ratio of 1-bit spikes? Assume that the input spike matrix $A$ has a size of $128 \times 128$ for each timestep. Then for either CSR or CSC, we need to use two 7-bit coordinates to compress each 1-bit non-zero spike.<sup>[5](#page-5-2)</sup> Furthermore, SNNs naturally run for multiple timesteps, which means that for the same coordinate, different spike values may occur at different timesteps (e.g., 0 at timesteps 1 and 3, and 1 at timesteps 2 and 4). To faithfully capture all the non-zero spikes, we need separate coordinate values for each timestep.
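As a rough worked version of this arithmetic, the coordinate cost per spike can be written out as below. The efficiency measure (spike bits divided by coordinate bits) is an assumption chosen to match how the 25% figure is computed later in this section, offsets are ignored as in the footnote, and the four-timestep case is only an illustrative choice.

$$
\underbrace{2 \times \lceil \log_2 128 \rceil}_{\text{coordinate bits}} = 14 \text{ bits per 1-bit spike}, \qquad \frac{1}{14} \approx 7.1\%.
$$

$$
\text{A neuron firing in all } T = 4 \text{ timesteps: } 4 \times 14 = 56 \text{ coordinate bits for } 4 \text{ spike bits.}
$$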

How to maintain contiguous memory access of non-zero spikes across timesteps? The **FTP** dataflow we proposed in Section [III](#page-4-0) requires spatial unrolling of the input spike matrix A across all timesteps beneath the k dimension. Consequently, a non-contiguous memory layout of $A$ along the $t$ dimension will cause fragmented memory access at all levels of the memory hierarchy, leading to higher data movement costs.

To better illustrate these two points, we provide an example in Figure [8.](#page-5-3) Suppose that, among the input spikes sent to the system, the pre-synaptic neuron $a_{0,0}$ (first element of row-0 in matrix A) fires a spike at $t_0$ and $t_2$. As shown in step 1, to represent this behavior, a single-bit 1 needs to be stored in memory at row-0, column-0 of matrix A for both timesteps 0 and 2, as shown in the box of 'unpacked real data.' If we then use a coordinate value (e.g., 4-bit for CSR) to record the position of each non-zero spike in row-0 of matrix A at each timestep, we need $2 \times 4 = 8$ bits to compress 2 bits (2 spikes). The compression efficiency in this case is only 25%. Furthermore, memory access to spikes across different timesteps is discontinuous (different rows of A are accessed sequentially). We propose the following spikes compression format for LoAS to solve these

<span id="page-5-2"></span><sup>&</sup>lt;sup>5</sup>For 128 columns, we need  $log_2(128) = 7$  bits for coordinates. We neglect the offsets in the discussion, which will further increase the number of bits used for coordinates.

two challenges. In our method, as shown in step 2, we pack all the spikes (both 0 and 1) across all timesteps into one continuous data block in the system for each pre-synaptic neuron. In the example of Figure [8,](#page-5-3) we store a 4-bit value 1010 at the first position of row-0 of matrix A for $a_{0,0}$ and 0111 at the fourth position for $a_{0,3}$. Since neurons $a_{0,1}$ and $a_{0,2}$ do not spike at any timestep, their packed value would be 0000 (shown in the box of 'packed real data'). We define these neurons as silent neurons.<sup>[6](#page-6-0)</sup> With this strategy, only the non-silent neurons will be treated as non-zero values and stored in the memory for matrix $A$, as shown in step 3. In our example, we end up using 4 bits to compress 5 bits. The compression efficiency in this case is 125%.

To accommodate our **FTP** dataflow, we compress the input spike matrix $A$ in a row-wise manner and use the bitmask format [\[9\]](#page-12-16), [\[15\]](#page-12-17), [\[42\]](#page-13-7) to represent the coordinates of the non-zero values. The bitmask format uses a 1-bit coordinate value for each position in the row. In our example, the bitmask is 1001 since the first and the fourth elements in the row are non-zero. The second and third elements are silent neurons, so we do not store them in the memory (each is represented by a 0 in the bitmask). Following the bitmask, a pointer is stored to provide the starting location of the non-zero values of the row. We call this compressed row a fiber [\[34\]](#page-12-27), [\[62\]](#page-13-9).

The key to our compression method is the ratio of silent neurons in the SNN. Fortunately, empirical studies have shown that SNNs have a significant fraction of silent neurons ($60\% \sim 70\%$, as shown in Table [II\)](#page-8-1). We further use a similar bitmask-based technique to compress weights in a column-wise manner. Each compressed weight column is also called a fiber.
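The packing scheme can be illustrated with a short sketch that reproduces the Figure 8 example (packed values 1010 and 0111, bitmask 1001). The function name `pack_row` and the list-of-lists input layout are assumptions for illustration, and the per-row pointer that follows the bitmask in the actual format is omitted here.

```python
def pack_row(spikes_per_timestep):
    """Pack one row of matrix A across timesteps.

    spikes_per_timestep[t][k] is the 0/1 spike of neuron k at timestep t.
    Returns (bitmask, values): bitmask[k] is 1 for non-silent neurons, and
    values holds the T-bit packed words of the non-silent neurons only,
    with t0 as the most significant bit.
    """
    T = len(spikes_per_timestep)
    K = len(spikes_per_timestep[0])
    bitmask, values = [], []
    for k in range(K):
        packed = 0
        for t in range(T):
            packed = (packed << 1) | spikes_per_timestep[t][k]
        bitmask.append(1 if packed else 0)
        if packed:  # silent neurons (all zeros) are not stored
            values.append(packed)
    return bitmask, values


# Row 0 of Figure 8: a00 fires at t0 and t2, a03 fires at t1, t2, and t3.
row0 = [
    [1, 0, 0, 0],  # t0
    [0, 0, 0, 1],  # t1
    [1, 0, 0, 1],  # t2
    [0, 0, 0, 1],  # t3
]
bitmask, values = pack_row(row0)
```

In this example the 4-bit bitmask is the only coordinate overhead for the five spikes in the row, which is where the 125% figure in the text comes from.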

### *B. Temporal Parallel Processing Elements*

The fundamental building blocks of LoAS's compute engine are Temporal Parallel Processing Elements (TPPEs) and Parallel Leaky-Integrate-Fire units (P-LIFs), which we describe next. Figure [7](#page-5-1) also details the design of the TPPE. Each TPPE produces the full sum for one output neuron across all timesteps (Line 5 in Algorithm [1\)](#page-4-2). Before the computation starts, the bitmask (bm-B) of a fiber from weight matrix $B$ (fiber-B) and its non-zero data are read from SRAM and broadcast into the small bitmask buffers (128 bits in our design) inside each TPPE. The bitmask (bm-A) of a fiber from input spike matrix $A$ (fiber-A) is also fetched and sent to the TPPEs. Each TPPE holds the bitmask for a distinct fiber along the row of A. After the data are loaded, an *inner-join* operation [\[9\]](#page-12-16), [\[15\]](#page-12-17), [\[18\]](#page-12-18) is performed between the two bitmasks. Depending on the inner-join result, the matched non-zero data of fiber-A are fetched from the global cache and sent to the *pseudo-accumulator* (discussed shortly) to perform the accumulation (AC) operation. After the TPPE completes the full computation of one output neuron, it sends the result to the P-LIF unit to generate output spikes for all timesteps in one shot.



<span id="page-6-2"></span>Fig. 9. Illustration of the proposed FTP-friendly inner join unit.

# *C. Inner-join Unit*

The inner-join operation has been extensively studied by prior works [\[9\]](#page-12-16), [\[15\]](#page-12-17), [\[18\]](#page-12-18) for spMspM acceleration in ANNs. The inner-join mechanism with a prefix-sum circuit has been efficiently implemented with the bitmask representation [\[15\]](#page-12-17). In [\[15\]](#page-12-17), a logical-AND operation is first applied to the two bitmasks to get the *AND-result*, which marks the locations where both operands are non-zero. The *AND-result* is then sent to a priority encoder to convert the *matched positions* into integer values. The *matched positions* are sent to two separate prefix-sum circuits to get the number of 1s in front of each *matched position* for each bitmask. This yields the offset of each non-zero value in memory.
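A software analogue of this bitmask inner-join is sketched below. It is a sequential illustration only: the hardware in [15] uses a priority encoder and parallel prefix-sum circuits, whereas the hypothetical function `inner_join` here keeps two running counters that play the role of the prefix sums.

```python
def inner_join(bm_a, bm_b):
    """Return (position, offset_a, offset_b) for every matched position.

    bm_a and bm_b are equal-length lists of 0/1 bits. offset_x is the
    number of 1s in bm_x before the matched position, i.e. the index of
    the matched non-zero value inside the compressed fiber.
    """
    matches = []
    count_a = count_b = 0  # running (exclusive) prefix sums
    for pos, (a, b) in enumerate(zip(bm_a, bm_b)):
        if a & b:  # AND-result is 1: both operands are non-zero here
            matches.append((pos, count_a, count_b))
        count_a += a
        count_b += b
    return matches
```

For example, joining bitmask 1001 with 1101 matches positions 0 and 3, with offsets (0, 0) and (1, 2) into the two compressed fibers.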

During the above process, the use of two fast prefix-sum circuits is expensive (taking more than 45% of the power and area in [\[15\]](#page-12-17)).<sup>[7](#page-6-1)</sup> To reduce the overhead brought by the prefix-sum circuits, we propose an FTP-friendly inner-join unit that is detailed in Figure [9.](#page-6-2)

We first observe that in ANNs, the MAC operation requires both inputs to be explicitly known at computation time. Therefore, we need two fast prefix-sum circuits to match the processing speed between two inputs. However, this is not the case with SNNs. In SNNs, we only have two cases for the input (1 or 0), meaning we either accumulate or discard the weight. This provides the opportunity to have an imbalanced processing speed for two inputs at the prefix-sum stage.

In our design, instead of using two fast prefix-sum circuits as in ANNs, we have one fast and one laggy prefix-sum circuit, as shown in Figure [9.](#page-6-2) Recall that our compression method only fetches the non-silent neurons (that fire at least once across timesteps) from DRAM for A. Thus, as soon as we find a matched position in *AND-result*, we are confident that the corresponding non-zero value in fiber-B will be accumulated at least once (at least one timestep). Therefore, we can begin accumulating the non-zero value in fiber-B without knowing the exact spike information from fiber-A. In this way, we can ensure the throughput of consuming fiber-B is always high regardless of the processing speed of fiber-A.

In our efficient inner-join unit, each time the fast prefix-sum circuit generates an offset, the corresponding non-zero value of fiber-B will be directly sent to a *pseudo-accumulator* for accumulation. This mechanism opportunistically presumes

<span id="page-6-0"></span><sup>6</sup>We follow the same terminology used in [\[29\]](#page-12-9).

<span id="page-6-1"></span><sup>7</sup>In [\[15\]](#page-12-17), the design of the prefix-sum circuit is not described. We assume it to be a tree-like prefix-sum circuit with $O(\log n)$ complexity that can run in one clock cycle, where $n$ is the size of the input and output of the prefix-sum circuit, which is set to 128 in both [\[15\]](#page-12-17) and our work.

the matched non-zero value of fiber-A is all 1s (the pre-synaptic neuron fires at all timesteps) to fully leverage the throughput of the fast prefix-sum circuit. Since the non-zero value in fiber-A is not always all 1s, we need a mechanism to ensure that the accumulation results are correct. Instead of using the expensive fast prefix-sum circuit to access and check the matched non-zero value in fiber-A, we use a much simpler circuit to generate the offset of fiber-A. We define this simpler prefix-sum circuit as the *laggy prefix-sum circuit*, illustrated on the left of Figure [9.](#page-6-2) We use a group of adders to sequentially add up the prefix-sum results and store them inside a small buffer. These adders run in parallel, and hence the latency of generating all the offsets is equal to len(bm-A) / (# of adders).
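The latency relation above can be checked with a small model of the laggy prefix-sum circuit. The segmenting scheme is an assumption made for this sketch: each adder is taken to scan one contiguous segment of the bitmask, one bit per cycle, and the way per-segment totals are combined into global offsets is not described in the text and is modeled here by a simple running base. The name `laggy_prefix_sum` is hypothetical.

```python
def laggy_prefix_sum(bitmask, num_adders):
    """Model a segmented prefix sum over a bitmask.

    Returns (offsets, cycles): offsets[i] is the number of 1s before
    position i, and cycles is the segment length, i.e. the number of
    steps each adder needs when all adders run concurrently.
    """
    n = len(bitmask)
    seg_len = -(-n // num_adders)  # ceiling division
    offsets = [0] * n
    base = 0  # total number of 1s in all earlier segments (assumed combine step)
    for start in range(0, n, seg_len):
        running = 0
        for i in range(start, min(start + seg_len, n)):
            offsets[i] = base + running
            running += bitmask[i]
        base += running
    return offsets, seg_len
```

With a 128-bit bitmask and 16 adders the model reports 8 cycles, which agrees with the 8-cycle latency quoted for the laggy prefix-sum circuit in the hardware configuration.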



<span id="page-7-1"></span>Fig. 10. A walk-through example of the proposed FTP-friendly inner-join unit. This example assumes the laggy prefix-sum circuit will be ready after 2 cycles. FIFO-mp is the FIFO to buffer the matched position. FIFO-B is the FIFO to buffer the matched non-zero value of B. Acc stands for accumulator.

We provide a simple walk-through example in Figure [10.](#page-7-1) We first run the fast prefix-sum circuit; in every cycle, we accumulate the matched non-zero value of fiber-B and buffer it together with the matched position in small FIFOs. When the laggy prefix-sum circuit finishes running, a ready signal is sent out. We then check the non-zero value in fiber-A according to the buffered position from FIFO-mp. If the matched value is all 1s, we simply discard the current value in FIFO-B. Otherwise, we send the buffered non-zero value of fiber-B from FIFO-B to the correction accumulators. As illustrated in Figure [10,](#page-7-1) at cycle 4, we check $a_2$ and find its value is 1111. Thus, we simply discard $b_2$. At cycle 5, we check $a_4$ and find its value is 1010. Thus, we send $b_4$ to the correction accumulators for $t_1$ and $t_3$. This example shows the motivation for and benefits of combining fast and laggy prefix sums. With a fast prefix sum, we can consume B as early as possible by first accumulating it into the pseudo-accumulator. While waiting for the laggy prefix sum to correct the accumulation results, we can proceed to fetch the next fiber-B's data into the buffer. This way, the latency of fetching fiber-B can be overlapped with the laggy prefix sum and correction to improve the overall throughput. At the same time, replacing one fast prefix sum with a laggy one reduces the overall power and area of our TPPE.

# *D. Other Units*

After the computation of the *pseudo-accumulator* completes, its accumulation results are duplicated and sent to each correction accumulator. The correction value inside each accumulator will be subtracted from the pseudo accumulation results for each timestep. Finally, we send the corrected results to the P-LIF units to generate the output spikes. As shown inside the purple box in Figure [7,](#page-5-1) we spatially unroll the LIF operations so that the output spikes for all timesteps will be generated at once.
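The interplay between the pseudo-accumulator and the correction accumulators can be expressed as a short functional sketch. It ignores cycle timing and the FIFOs, and the function name `tppe_accumulate` and the input format (a list of matched `(packed_spikes, weight)` pairs) are assumptions for illustration.

```python
def tppe_accumulate(matched, T):
    """Compute per-timestep sums for one output neuron.

    matched is a list of (packed_spikes, weight) pairs produced by the
    inner join, where packed_spikes is a non-zero T-bit word with t0 as
    the most significant bit.
    """
    full_mask = (1 << T) - 1
    pseudo = 0
    correction = [0] * T
    for a, w in matched:
        pseudo += w  # fast path: presume a spike at every timestep
        if a != full_mask:
            # Slow path: record the weight for every timestep without a spike.
            for t in range(T):
                if not (a >> (T - 1 - t)) & 1:
                    correction[t] += w
    # Subtract the corrections from the duplicated pseudo result.
    return [pseudo - c for c in correction]
```

For the two matches in the walk-through, a word 1111 contributes only to the pseudo-accumulator, while a word 1010 also adds its weight to the correction accumulators for $t_1$ and $t_3$, so the corrected sums equal what a per-timestep accumulation would have produced.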

LoAS uses a unified global buffer for holding compressed fiber-A and fiber-B with their bitmask representations. We adopt a FiberCache design [\[62\]](#page-13-9). A unified shared cache exhibits better utilization compared to separate ones. Each line in the global cache consists of two parts. The first part is the bitmask representation of a fiber, followed by a pointer. The second part is the non-zero values of that fiber. If the line manages to hold all the non-zero values, the pointer will be a NULL pointer. Otherwise, it will point to the location of the line where the rest of the data are held. Each PE is responsible for generating one output neuron. Therefore, we use a highly banked global cache to ensure multiple PEs can access their data concurrently. Inside each bank, we fetch as many chunks as possible for one fiber in matrix A and hold them as long as possible to maximize the data reuse of A. This can be achieved by adopting a replacement policy for the global cache as in [\[31\]](#page-12-26), [\[62\]](#page-13-9). Only one compressed row fiber of matrix $B$ is fetched into the global cache and broadcast to all TPPEs. We adopt a compression unit similar to that of [\[15\]](#page-12-17), where an inverted prefix-sum circuit is used to compress the output spikes and generate their bitmask representations. Similar to the observation in [\[15\]](#page-12-17), this compression step does not need to be performed fast. Therefore, we equip an inverted *laggy prefix-sum* circuit to perform the compression. The scheduler is responsible for distributing the data to each TPPE through a simple swizzle-switch-based crossbar [\[47\]](#page-13-21).

#### V. EXPERIMENTAL METHODOLOGY

<span id="page-7-0"></span>Software Configuration: For the dual-sparse SNNs, we train and compress AlexNet [\[26\]](#page-12-29), VGG16 [\[50\]](#page-13-22), and ResNet19 [\[17\]](#page-12-30). We use the open-source toolchains for lottery-ticket-hypothesis (LTH)-based SNN pruning [\[13\]](#page-12-10), [\[22\]](#page-12-31). We set the default number of timesteps $T$ to 4 across all experiments. We use 15 rounds of LTH searching, and all SNNs are trained to convergence with accuracy similar to the state-of-the-art dense baselines [\[22\]](#page-12-31). We further select representative layers from each network to provide single-layer insights. The summary of the workloads is in Table [II.](#page-8-1) We further use a simple yet effective preprocessing technique: zeroing out all presynaptic neurons that have low firing activity to further increase the number of silent neurons. We take the trained SNN and mask the neurons with only one output spike throughout all timesteps. We find that with a small amount of fine-tuning ($<5$ epochs), the accuracy can be fully recovered, as shown in

#### TABLE II

<span id="page-8-1"></span>SNN WORKLOADS. NL = # OF LAYERS. T = TIMESTEPS. AVSP{A, B} = AVERAGE SPARSITY OF MATRICES {A, B} IN %. AVSPA-ORIGIN IS THE ORIGINAL SPIKE SPARSITY ACROSS TIMESTEPS, AVSPA-PACKED IS THE DENSITY OF SILENT NEURONS, AND AVSPA-PACKED+FT IS THE DENSITY AFTER FINE-TUNED PREPROCESSING. M/N/K DENOTES MATRIX SHAPE.



Figure [11.](#page-8-2) Please note that this preprocessing technique aims to maintain the accuracy of the original workload instead of improving it. During hardware execution, the compressor will discard the output neurons that have 0 or only 1 output spike. From Table [II,](#page-8-1) we see that preprocessing effectively creates up to  $1.1 \times$  more silent neurons<sup>[8](#page-8-3)</sup>.
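The masking step of this preprocessing can be sketched on packed spike words as follows. This shows only the masking rule (drop neurons with at most one spike across all timesteps); the fine-tuning that recovers accuracy is not shown, and the helper names `mask_low_activity` and `silent_ratio` are hypothetical.

```python
def mask_low_activity(packed_values):
    """Zero every packed spike word that contains at most one spike."""
    return [v if bin(v).count("1") > 1 else 0 for v in packed_values]


def silent_ratio(packed_values):
    """Fraction of neurons whose packed spike word is all zeros."""
    return sum(1 for v in packed_values if v == 0) / len(packed_values)


# Four neurons with T = 4: two active, one firing once, one already silent.
before = [0b1010, 0b0100, 0b0000, 0b0111]
after = mask_low_activity(before)
```

In this toy example the neuron that fires only once becomes silent, so the silent ratio rises from 0.25 to 0.5 and one fewer packed word has to be stored or fetched.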



<span id="page-8-2"></span>Fig. 11. Accuracy trends of the fine-tuned preprocessing. Mask means masking out all presynaptic neurons that fire only once during the inference. FT-ex means fine-tuning for x epochs.

Hardware Configuration: We evaluate LoAS with the configuration in Table [III.](#page-8-4) In our experiments, we configure LoAS to support SNNs running with 4 timesteps. We use 16 TPPEs, each with 5 accumulators (one 12-bit pseudo-accumulator and four 10-bit correction accumulators) and 1 inner-join unit. Inside each inner-join unit, there is 1 fast prefix-sum circuit and 1 laggy prefix-sum circuit. The fast prefix-sum circuit can generate the offsets in a single cycle. The laggy prefix-sum circuit contains 16 adders and a 128-bit buffer. It generates the offset results in 8 cycles. The TPPE also has 2 depth-8 FIFOs (for correction purposes) and 2 128-bit buffers (for holding bitmasks). Finally, a 128-byte buffer is equipped inside the TPPE to hold the non-zero weights from fiber-B. We allocate 256 KB (double-buffered) for the global cache. For our workloads, this memory size is enough to capture good on-chip data reuse and keep all TPPEs busy.

Baseline: As discussed previously, there are currently very limited spMspM accelerators available for dual-sparse SNNs. As a result, we construct our baselines in the following way. We first pick three popular ANN spMspM accelerators that use the **IP**, **OP**, and **Gust** dataflows: SparTen [\[15\]](#page-12-17), GoSPA [\[9\]](#page-12-16), and Gamma [\[62\]](#page-13-9). We then envision that a dual-sparse SNN

TABLE III CONFIGURATION OF THE LOAS SYSTEM.

<span id="page-8-4"></span>

| Component       | Configuration                                            |
|-----------------|----------------------------------------------------------|
| TPPEs           | 16 TPPEs, 8-bit weight                                   |
| Inner-join unit | 16 Inner-join units                                      |
| Global cache    | 256 KB, 16 banks, 16-way associative                     |
| Crossbars       | $16 \times 16$ and $16 \times 16$, swizzle-switch based  |
| Main memory     | 128 GB/s over 16 64-bit HBM channels                     |

(with 4 timesteps and 8-bit weights) is naively running (sequentially processing its timesteps) on these accelerators. To be conservative, we place the $t$ dimension at the innermost loop of the original **IP**, **OP**, and **Gust** dataflows.<sup>[9](#page-8-5)</sup> We then make essential simplifications to these accelerators. For example, we remove the multipliers in these designs. To make a fair comparison, we configure all designs to have 16 PEs and the same global SRAM size. We call these three baselines SparTen-SNN, GoSPA-SNN, and Gamma-SNN.

We implement the key components of LoAS and our hardware baselines in RTL and synthesize them using Synopsys Design Compiler at 800 MHz with 32 nm technology. A 128 GB/s High-Bandwidth Memory (HBM) module is connected to LoAS as the off-chip memory. We use CACTI 7.0 [\[35\]](#page-12-32) to model the memory components. We built a simulator in Python to model the cycle-level behavior of LoAS and the baselines by tiling the loop and mapping it to hardware.

# VI. EXPERIMENTAL RESULTS

# <span id="page-8-0"></span>*A. Hardware Evaluation*

Overall Performance: Figure [12](#page-9-0) compares the performance of three dual-sparse SNN accelerator baselines (SparTen-SNN, GoSPA-SNN, and Gamma-SNN) and LoAS (with and without fine-tuned preprocessing) on three SNNs (speedup w.r.t. the cycle count of SparTen-SNN).

The first observation is that LoAS significantly outperforms the three accelerator baselines in all cases, obtaining average speedups of $6.79 \times$ (vs. SparTen-SNN), $5.99 \times$ (vs. GoSPA-SNN), and $3.25 \times$ (vs. Gamma-SNN). This is because LoAS leverages the **FTP** dataflow, which completely frees LoAS from the intra-PE latency penalty of sequentially running the timesteps. It also enables LoAS to incur less on-chip and off-chip data communication across timesteps. The second observation is that LoAS's performance gain is highly correlated with the sparsity of matrix A. This relationship is expected since our workloads are extremely sparse on matrix B; thus, the overall computation is matrix-A-bounded. Consequently, the performance of the baselines suffers more from sequentially running timesteps through matrix A when it is less sparse, whereas LoAS does not incur this sequential-execution penalty. As a result, LoAS achieves from $4.08\times$ speedup (vs. SparTen-SNN) on VGG16 (highest matrix A sparsity) to $8.51 \times$ speedup (vs. SparTen-SNN) on ResNet19 (lowest matrix A sparsity). Finally, we observe that with the help of pre-processing (removing the neurons that spike only once), LoAS further improves the performance by 20% on average. This is because the pre-processing technique helps to

<span id="page-8-3"></span><sup>8</sup>The source code can be found at <https://github.com/RuokaiYin/LoAS>

<span id="page-8-5"></span> $9$ Adding the t dimension anywhere else will bring more data traffic, thus worsening the performance.



<span id="page-9-0"></span>Fig. 12. Performance and efficiency comparison between SparTen-SNN, GoSPA-SNN, Gamma-SNN, and LoAS (with and without fine-tuned (FT) pre-processed) architectures across three SNN workloads. All numbers are normalized to that of the SparTen-SNN baseline.



<span id="page-9-1"></span>Fig. 13. Off-chip traffic (KB) and on-chip memory traffic (MB) for SparTen-SNN, GoSPA-SNN, Gamma-SNN, and LoAS (with and without preprocessed) architectures across three SNN workloads.

increase the density of silent neurons (Section [IV-A](#page-5-4)), for which LoAS completely avoids both data communication and computation. Figure [12](#page-9-0) also compares the energy efficiency of LoAS and the three baselines on different SNN workloads. LoAS (with preprocessing) achieves (3.68×, 3.09×, 2.40×), (3.17×, 1.50×, 2.33×), and (3.54×, 1.34×, 2.47×) higher energy efficiency over (SparTen-SNN, GoSPA-SNN, and Gamma-SNN) on AlexNet, VGG16, and ResNet19, respectively.

Detailed Analysis: We next explain the performance gains of LoAS. Owing to the **FTP** dataflow, LoAS has much less on-chip and off-chip memory traffic than the baselines. As shown in Figure [13,](#page-9-1) compared to SparTen-SNN (**IP**), LoAS has $3.93\times(3.70\times)$, $3.57\times(2.22\times)$, and $4.07\times(2.24\times)$ less on-chip SRAM (off-chip DRAM) access on AlexNet, VGG16, and ResNet19, respectively. This behavior is expected since an **IP** dataflow design like SparTen is known for having poor



<span id="page-9-2"></span>Fig. 14. Normalized Off-chip traffic with breakup for SparTen-SNN, GoSPA-SNN, Gamma-SNN, and LoAS (with pre-processed) architectures across three SNN layer workloads. The normalized SRAM cache miss rate is also provided for the ResNet19 layer workload. All numbers are normalized to that of LoAS.

input data reuse. This inefficient input data reuse pattern is exacerbated by the extra temporal dimension  $(t$ -dim) in SNN workloads. While **FTP** dataflow is a variant of inner-product, it does not incur any extra executions on the t-dim since it parallelizes the t-dim at the inner-most loop.

Not surprisingly, compared to GoSPA-SNN (**OP**), LoAS still achieves $2.87 \times (4.49 \times)$, $2.19 \times (2.78 \times)$, and $2.98 \times (3.03 \times)$ less on-chip SRAM (off-chip DRAM) access on AlexNet, VGG16, and ResNet19, respectively. This behavior is also expected even though the **OP** dataflow design is known to have excellent input data reuse (on average, GoSPA-SNN has 1.45× less SRAM traffic than SparTen-SNN). The inefficiency of GoSPA-SNN comes from the partial sum (psum) matrices. Because of the extra $t$-dim in SNNs, the size of the psum matrices expands with the number of timesteps. GoSPA's design allocates a small on-chip memory for the psums. The psum matrices that cannot fit on-chip must be written to off-chip DRAM and read back later for reduction. This incurs significant off-chip memory traffic.

Finally, compared to Gamma-SNN (**Gust**), LoAS achieves $2.16 \times$, $1.76 \times$, and $1.91 \times$ fewer DRAM accesses. This result is aligned with the Gust dataflow's ability to reduce off-chip partial row accesses through on-chip SRAM and mergers. While Gamma reduces the DRAM accesses, its SRAM accesses are exacerbated by the $t$-dim in SNNs. Gamma-SNN ends up with, on average, $13.4\times$ more SRAM traffic than LoAS.

To better visualize the aforementioned analysis, we provide a memory traffic breakup in Figure [14](#page-9-2) for the three SNN layers in Table [II.](#page-8-1) As shown in the figure, SparTen-SNN has the largest input off-chip traffic, and GoSPA-SNN has the largest psum off-chip traffic across all workloads. Among the three baselines, Gamma-SNN has the smallest off-chip traffic footprint due to the Gust dataflow's on-chip reuse of partial rows. GoSPA-SNN has the largest off-chip traffic for the compressed format due to its CSR format for each spike. We notice that LoAS has slightly larger $(2.1\times)$ off-chip traffic for the compressed format compared to SparTen-SNN. This is because we need extra bitmasks to mark the position of non-silent neurons, while in SparTen-SNN, we can directly leverage the input spike trains. Nevertheless, this overhead is negligible compared

<span id="page-10-0"></span>TABLE IV AREA AND POWER BREAKDOWN OF LOAS (LEFT) AND ONE TPPE (RIGHT).



<span id="page-10-2"></span>Fig. 15. On-chip power breakup of LoAS. Accs stands for the accumulators, which include 1 pseudo-accumulator and 4 correction-accumulators.

to LoAS's saving on off-chip traffic for other quantities. Figure [14](#page-9-2) also provides the normalized SRAM cache miss rate for the layer workload in ResNet19. SparTen-SNN has a $16\times$ higher miss rate (1.47%) compared to LoAS. GoSPA-SNN has the lowest miss rate due to its output-stationary dataflow. However, the tradeoff is the higher off-chip traffic of psums. Gamma-SNN has a higher SRAM miss rate than GoSPA-SNN and LoAS. The reason is that the extra $t$-dim enlarges the partial row traffic by $t$ times. Some of the extra traffic cannot be held in the on-chip SRAM, leading to cache evictions. Overall, the cache miss rate results align with the off-chip traffic trends. Since we set all the baselines to have the same global cache size, the reduction in memory traffic reflects LoAS's improvement in both speedup and energy efficiency.

Area and Power: Table [IV](#page-10-0) shows the area and power breakdown of LoAS with the configuration in Table [III.](#page-8-4) Inside each TPPE, the single fast prefix-sum circuit dominates both the area (66.7%) and power (51.8%). The original SparTen [\[15\]](#page-12-17) even requires two fast prefix-sum circuits for both inputs and weights.<sup>[10](#page-10-1)</sup> Thanks to the laggy prefix-sum circuits (8.3% of area and 11.4% of power) we proposed, LoAS only requires one fast prefix-sum circuit inside each TPPE. At the system level, the global SRAM cache dominates both the power and area, which aligns with previous works [\[31\]](#page-12-26), [\[34\]](#page-12-27), [\[62\]](#page-13-9). Figure [15](#page-10-2) provides a visualization of the power breakup.

# *B. Ablation Studies*

Temporal Scalability Studies: In our experimental settings, we configured the TPPE inside LoAS to run the SNNs with 4 timesteps. Most state-of-the-art SNN algorithms [\[10\]](#page-12-33), [\[12\]](#page-12-34) use 8 or fewer timesteps. So, we want to understand how the TPPE scales with the timesteps. Figure [16](#page-10-3)(a) shows that the TPPE scales well with the timesteps. The reason

<span id="page-10-1"></span><sup>10</sup>This is not the case in SparTen-SNN. Since the input spikes serve as bitmasks and data at the same time, SparTen-SNN only requires one fast prefix-sum circuit.



<span id="page-10-3"></span>Fig. 16. (a) The scalability of TPPE with increasing timesteps. The yellow region denotes the portion that grows with the timesteps. (b) The scalability of the ratio of silent neurons (sparsity of matrix A) with increasing timesteps. All values are normalized to the original silent neuron ratio at the timestep of  $4$ .



<span id="page-10-4"></span>Fig. 17. Scalability of LoAS across different sparsity patterns of matrix B, number of timesteps, and layer size.

is that all TPPE components other than the accumulators and the input data buffer are agnostic to the number of timesteps. Even at 16 timesteps, the TPPE only increases its area (power) by $1.37\times$ ($1.25\times$) compared to 4 timesteps. We also showcase how the ratio of silent neurons in VGG16 scales with the number of timesteps. Figure [16](#page-10-3)(b) shows that with the help of the pre-processing technique, even at 8 timesteps, we can still have a similar ratio of silent neurons as at 4 timesteps. However, we are very likely to have fewer silent neurons with even more timesteps ($> 8$). This is one of the challenges that LoAS faces when scaling up the number of timesteps.

Scalability Study: Figure [17](#page-10-4) further shows how the overall performance of LoAS scales with different quantities. We first test LoAS running on VGG16 with the average sparsity of B (weight) at 98.2% (High), 68.4% (Medium), and 25% (Low). The result shows that LoAS's performance is highly sensitive to the sparsity level of B. When we scale the sparsity from 98.2% to 25%, the performance drops by roughly 88%. We also find that LoAS's performance scales well with timesteps. LoAS only loses roughly 14% of performance when increasing the number of timesteps by $2\times$. Finally, we test LoAS's scalability on layer size. We compare one layer from VGG16 and the hidden feed-forward (HFF) layer from SpikeTransformer [\[58\]](#page-13-23). The results show that LoAS scales well, even on the layer with a larger parameter size.

Dual-sparse SNN vs. Dual-sparse ANN: In this work, we focus on providing insights for the community on how the spMspM acceleration works on SNNs. However, it is unavoidable to discuss the comparison between SNNs and ANNs. In Figure [18,](#page-11-2) we show the comparison of normalized energy efficiency and memory traffic between SNNs (LoAS)



<span id="page-11-2"></span>Fig. 18. Normalized energy efficiency and memory traffic between SNNs (LoAS, T=4) vs. ANN baselines (SparTen, Gamma).

and ANNs (SparTen [\[15\]](#page-12-17) and Gamma [\[62\]](#page-13-9)) running the VGG16 workload. We use the VGG16 workload in Table [II](#page-8-1) for LoAS. The ANN version of VGG16 has 8-bit weights (98.2% sparsity) and activations (43.9% sparsity). Overall, the SNN running on LoAS has roughly $2.5\times$ and $1.2\times$ the energy efficiency of the ANNs running on SparTen and Gamma, respectively. We observe that around $60\%$ of the energy goes to data movement for both networks. We, therefore, also include the DRAM and SRAM traffic comparison in Figure [18.](#page-11-2) It shows that SNNs, on average, have $\sim 60\%$ less memory traffic compared to SparTen-ANN. The lower memory traffic comes from the smaller input bitwidth (4-bit vs. 8-bit) and higher input sparsity (79.6% vs. 43.9%), thanks to the SNN's features of unary activation and sparse spike activity (Section II-B). Not surprisingly, Gamma-ANN has lower overall DRAM accesses compared to LoAS due to its Gust dataflow [\[62\]](#page-13-9). The tradeoff is $3.5\times$ more SRAM traffic, which explains why LoAS has a slightly higher overall energy efficiency.

Dual-sparse SNN vs. Dense SNN: To show the benefits of dual-sparsity in SNNs, we compare LoAS with the prior dense SNN systolic-array accelerators, PTB [\[29\]](#page-12-9) and Stellar [\[33\]](#page-12-15), running dense VGG16 with 4 timesteps. For a fair comparison, we set the array size for PTB to be $16 \times 4$, which generates 16 full-sum outputs for 4 timesteps in parallel (same as LoAS). We further configure Stellar to the same array size. We leverage ScaleSim [\[44\]](#page-13-24) to estimate both baselines' memory traffic and cycle counts. We show the comparison in Figure [19.](#page-11-3) We first observe that LoAS has roughly $6\times$ higher energy efficiency compared to PTB, mainly resulting from the $3\times$ ($12.5\times$) less DRAM (SRAM) traffic. Compared to Stellar, LoAS has roughly $2.5\times$ higher energy efficiency, as well as $2.7\times$ ($6.6\times$) less DRAM (SRAM) traffic. We also observe that LoAS has a $46.9\times$ speedup against PTB. This is primarily due to the data sparsity and the difference between PTB's partially temporal parallel (Section [II-E](#page-4-3)) and LoAS's fully temporal parallel mechanism. We observe that Stellar outperforms PTB across all metrics. This is mainly due to Stellar's optimized spatiotemporal row-stationary dataflow and its spike-skipping technique. However, compared to Stellar, we are still able to achieve roughly $7.1 \times$ speedup due to LoAS's capability to leverage the dual-sparsity. Please note that we do not compare with SpinalFlow [\[36\]](#page-12-8) because its temporal encoding achieves



<span id="page-11-3"></span>Fig. 19. Normalized performance comparison between dual-sparse SNN accelerator (LoAS) vs. dense SNN accelerator baselines (PTB, Stellar).

<span id="page-11-0"></span>limited accuracy on challenging learning tasks [\[6\]](#page-12-35), [\[29\]](#page-12-9).

# VII. RELATED WORK

Besides the prior dense SNN accelerators discussed in Section [II-E](#page-4-3), there also exist prior works that try to leverage sparsity in SNNs. In [\[3\]](#page-12-36), a neuron filter unit is used to fetch a weight only if there is a 1-spike. However, dual-sparsity (both spike and weight sparsity) is not considered. In [\[2\]](#page-12-37), the dual-sparsity of SNNs is used to skip unmatched computations. However, the weights and spikes are fetched from off-chip memory in a dense format without any compression, thus failing to save data movement costs. In this work, LoAS leverages the dual-sparsity in SNNs for both computation and data movement.

As discussed, PTB processes timesteps in a partially parallel manner. Even if one reconfigures PTB to run all timesteps in parallel (time-window=1), it still differs from LoAS in loop ordering. In PTB's loop ordering, the $t$-dim is placed between the $m$-dim and the $n$-dim, while LoAS places the $t$-dim in the inner-most loop. As discussed in Section [III](#page-4-0), LoAS's loop ordering makes the spMspM operation more efficient. Moreover, PTB targets workloads with time-series data from DVS sensors [\[30\]](#page-12-38), where the number of timesteps is usually large ($> 100$). On our workloads, where the number of timesteps is small ($< 8$), PTB experiences low hardware utilization. Processing timesteps in parallel is also studied in [\[32\]](#page-12-39). However, that work targets temporal-coded SNN workloads and does not discuss loop ordering. Finally, as discussed in Section [II-E](#page-4-3), Stellar [\[33\]](#page-12-15) also tries to process timesteps in parallel. However, it targets non-LIF, FS-coded SNNs and does not support dual-sparsity.
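The difference between the two loop orderings can be sketched in software. The example below is only an illustration under stated assumptions: the function names, the nested-list tensor layout (`spikes[m][k][t]`, `weights[k][n]`), and the use of weight reads as a cost proxy are all hypothetical and do not model the PTB or LoAS hardware.

```python
def t_innermost(spikes, weights):
    """Loop order m -> n -> k -> t (t-dim in the inner-most loop)."""
    M, K, N, T = len(spikes), len(weights), len(weights[0]), len(spikes[0][0])
    out = [[[0] * T for _ in range(N)] for _ in range(M)]
    weight_reads = 0
    for m in range(M):
        for n in range(N):
            for k in range(K):
                w = weights[k][n]
                if w == 0 or not any(spikes[m][k]):
                    continue  # zero weight, or neuron silent in every timestep
                weight_reads += 1  # one read is shared by all timesteps
                for t in range(T):
                    if spikes[m][k][t]:
                        out[m][n][t] += w  # binary spikes: accumulate only
    return out, weight_reads


def t_between_m_and_n(spikes, weights):
    """Loop order m -> t -> n -> k (t-dim between m-dim and n-dim)."""
    M, K, N, T = len(spikes), len(weights), len(weights[0]), len(spikes[0][0])
    out = [[[0] * T for _ in range(N)] for _ in range(M)]
    weight_reads = 0
    for m in range(M):
        for t in range(T):
            for n in range(N):
                for k in range(K):
                    w = weights[k][n]
                    if w == 0 or not spikes[m][k][t]:
                        continue  # zero weight, or no spike at this timestep
                    weight_reads += 1  # the weight is read again per timestep
                    out[m][n][t] += w
    return out, weight_reads


# Toy input: M=2 rows, K=3 input neurons, T=4 timesteps, N=2 output neurons.
spikes = [
    [[1, 0, 1, 1], [0, 0, 0, 0], [0, 1, 0, 0]],
    [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0]],
]
weights = [[2, 0], [0, -1], [3, 4]]

out_a, reads_a = t_innermost(spikes, weights)
out_b, reads_b = t_between_m_and_n(spikes, weights)
print("same result:", out_a == out_b)
print("weight reads (t inner-most):", reads_a)
print("weight reads (t between m and n):", reads_b)
```

Both orderings produce the same outputs. In this toy count, the ordering with $t$ inner-most reads a weight once per matched $(m, n, k)$ position, while the other ordering reads it once per matched spike, which is one way to see why the position of the $t$-dim matters for spMspM.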

#### VIII. CONCLUSION

<span id="page-11-1"></span>In this work, we observe that naively running dual-sparse SNNs on existing spMspM accelerators yields sub-optimal efficiency due to the latency and memory traffic penalties incurred by processing timesteps. To improve efficiency, we propose a fully temporal-parallel dataflow (FTP), which avoids these problems. To maximize the benefits of FTP, we propose FTP-friendly spike compression and inner-join mechanisms. We also build LoAS, a novel architecture that exemplifies the FTP dataflow. With the help of both FTP-friendly compression and inner-join, LoAS demonstrates significant speedup (up to $8.51\times$) and energy reduction (up to $3.68\times$) compared to prior dual-sparse accelerator baselines.

#### **REFERENCES**

- <span id="page-12-11"></span>[1] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam *et al.*, "Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip," *IEEE transactions on computer-aided design of integrated circuits and systems*, vol. 34, no. 10, pp. 1537–1557, 2015.
- <span id="page-12-37"></span>[2] Q. Chen, C. Gao, and Y. Fu, "Cerebron: a reconfigurable architecture for spatiotemporal sparse spiking neural networks," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 30, no. 10, pp. 1425–1437, 2022.
- <span id="page-12-36"></span>[3] Q. Chen, G. He, X. Wang, J. Xu, S. Shen, H. Chen, Y. Fu, and L. Li, "A 67.5 μJ/prediction accelerator for spiking neural networks in image segmentation," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 69, no. 2, pp. 574–578, 2021.
- <span id="page-12-6"></span>[4] Y. Chen, Z. Yu, W. Fang, T. Huang, and Y. Tian, "Pruning of deep spiking neural networks through gradient rewiring," *arXiv preprint arXiv:2105.04916*, 2021.
- <span id="page-12-0"></span>[5] D. V. Christensen *et al.*, "2022 roadmap on neuromorphic computing and engineering," *Neuromorphic Computing and Engineering*, vol. 2, no. 2, p. 022501, 2022.
- <span id="page-12-35"></span>[6] I. M. Comsa, K. Potempa, L. Versari, T. Fischbacher, A. Gesmundo, and J. Alakuijala, "Temporal coding in spiking neural networks with alpha synaptic function," in *ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. IEEE, 2020, pp. 8529–8533.
- <span id="page-12-12"></span>[7] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain *et al.*, "Loihi: A neuromorphic manycore processor with on-chip learning," *Ieee Micro*, vol. 38, no. 1, pp. 82–99, 2018.
- <span id="page-12-21"></span>[8] P. Dayan and L. F. Abbott, *Theoretical neuroscience: computational and mathematical modeling of neural systems*. MIT press, 2005.
- <span id="page-12-16"></span>[9] C. Deng, Y. Sui, S. Liao, X. Qian, and B. Yuan, "Gospa: an energy-efficient high-performance globally optimized sparse convolutional neural network accelerator," in *2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)*. IEEE, 2021, pp. 1110–1123.
- <span id="page-12-33"></span>[10] S. Deng, Y. Li, S. Zhang, and S. Gu, "Temporal efficient training of spiking neural network via gradient re-weighting," *arXiv preprint arXiv:2202.11946*, 2022.
- <span id="page-12-1"></span>[11] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang, H. Zhou, G. Li, and Y. Tian, "Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence," *Science Advances*, vol. 9, no. 40, p. eadi1480, 2023.
- <span id="page-12-34"></span>[12] W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian, "Deep residual learning in spiking neural networks," *Advances in Neural Information Processing Systems*, vol. 34, pp. 21 056–21 069, 2021.
- <span id="page-12-10"></span>[13] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," *arXiv preprint arXiv:1803.03635*, 2018.
- <span id="page-12-13"></span>[14] S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana, "The spinnaker project," *Proceedings of the IEEE*, vol. 102, no. 5, pp. 652–665, 2014.
- <span id="page-12-17"></span>[15] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. Vijaykumar, "Sparten: A sparse tensor accelerator for convolutional neural networks," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 151–165.
- <span id="page-12-25"></span>[16] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "Eie: Efficient inference engine on compressed deep neural network," *ACM SIGARCH Computer Architecture News*, vol. 44, no. 3, pp. 243–254, 2016.
- <span id="page-12-30"></span>[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *CVPR*, 2016.
- <span id="page-12-18"></span>[18] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, "Extensor: An accelerator for sparse tensor algebra," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 319–333.
- <span id="page-12-19"></span>[19] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. Fletcher, "Ucnn: Exploiting computational reuse in deep neural networks via weight repetition," in *2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)*. IEEE, 2018, pp. 674–687.
- <span id="page-12-4"></span>[20] S. Kim, S. Park, B. Na, and S. Yoon, "Spiking-yolo: spiking neural network for energy-efficient object detection," in *Proceedings of the AAAI conference on artificial intelligence*, vol. 34, no. 07, 2020, pp. 11 270–11 277.
- <span id="page-12-3"></span>[21] Y. Kim, J. Chough, and P. Panda, "Beyond classification: Directly training spiking neural networks for semantic segmentation," *Neuromorphic Computing and Engineering*, vol. 2, no. 4, p. 044015, 2022.
- <span id="page-12-31"></span>[22] Y. Kim *et al.*, "Exploring lottery ticket hypothesis in spiking neural networks," in *Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII*. Springer, 2022, pp. 102–120.
- <span id="page-12-5"></span>[23] Y. Kim, Y. Li, H. Park, Y. Venkatesha, R. Yin, and P. Panda, "Exploring lottery ticket hypothesis in spiking neural networks," in *European Conference on Computer Vision*. Springer, 2022, pp. 102–120.
- <span id="page-12-22"></span>[24] Y. Kim and P. Panda, "Revisiting batch normalization for training low-latency deep spiking neural networks from scratch," *Frontiers in neuroscience*, 2021.
- <span id="page-12-23"></span>[25] Y. Kim, H. Park, A. Moitra, A. Bhattacharjee, Y. Venkatesha, and P. Panda, "Rate coding or direct coding: Which one is better for accurate, robust, and energy-efficient spiking neural networks?" in *ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. IEEE, 2022, pp. 71–75.
- <span id="page-12-29"></span>[26] A. Krizhevsky et al., "Imagenet classification with deep convolutional neural networks," *NeurIPS*, 2012.
- <span id="page-12-28"></span>[27] H.-T. Kung, "Why systolic architectures?" *Computer*, vol. 15, no. 1, pp. 37–46, 1982.
- <span id="page-12-2"></span>[28] C. Lee, A. K. Kosta, A. Z. Zhu, K. Chaney, K. Daniilidis, and K. Roy, "Spike-flownet: event-based optical flow estimation with energy-efficient hybrid neural networks," in *European Conference on Computer Vision*. Springer, 2020, pp. 366–382.
- <span id="page-12-9"></span>[29] J.-J. Lee, W. Zhang, and P. Li, "Parallel time batching: Systolic-array acceleration of sparse spiking neural computation," in *2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 2022, pp. 317–330.
- <span id="page-12-38"></span>[30] H. Li et al., "Cifar10-dvs: an event-stream dataset for object classification," *Frontiers in neuroscience*, vol. 11, p. 309, 2017.
- <span id="page-12-26"></span>[31] Z. Li, J. Li, T. Chen, D. Niu, H. Zheng, Y. Xie, and M. Gao, "Spada: Accelerating sparse matrix multiplication with adaptive dataflow," in *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2*, 2023, pp. 747–761.
- <span id="page-12-39"></span>[32] F. Liu, W. Zhao, Z. Wang, Y. Chen, T. Yang, Z. He, X. Yang, and L. Jiang, "Sato: spiking neural network acceleration via temporal-oriented dataflow and architecture," in *Proceedings of the 59th ACM/IEEE Design Automation Conference*, 2022, pp. 1105–1110.
- <span id="page-12-15"></span>[33] R. Mao, L. Tang, X. Yuan, Y. Liu, and J. Zhou, "Stellar: Energy-efficient and low-latency snn algorithm and hardware co-design with spatiotemporal computation," in *2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 2024, pp. 172–185.
- <span id="page-12-27"></span>[34] F. Muñoz-Martínez, R. Garg, M. Pellauer, J. L. Abellán, M. E. Acacio, and T. Krishna, "Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing," in *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3*, 2023, pp. 252–265.
- <span id="page-12-32"></span>[35] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "Cacti 6.0: A tool to model large caches," *HP laboratories*, 2009.
- <span id="page-12-8"></span>[36] S. Narayanan, K. Taht, R. Balasubramonian, E. Giacomin, and P.-E. Gaillardon, "Spinalflow: An architecture and dataflow tailored for spiking neural networks," in *2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)*. IEEE, 2020, pp. 349–362.
- <span id="page-12-24"></span>[37] E. O. Neftci, H. Mostafa, and F. Zenke, "Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks," *IEEE Signal Processing Magazine*, vol. 36, no. 6, pp. 51–63, 2019.
- <span id="page-12-7"></span>[38] E. O. Neftci, B. U. Pedroni, S. Joshi, M. Al-Shedivat, and G. Cauwenberghs, "Stochastic synapses enable efficient brain-inspired learning machines," *Frontiers in neuroscience*, vol. 10, p. 241, 2016.
- <span id="page-12-20"></span>[39] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, "Outerspace: An outer product based sparse matrix multiplication accelerator," in *2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 2018, pp. 724–736.
- <span id="page-12-14"></span>[40] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "Scnn: An accelerator for compressed-sparse convolutional neural networks," *ACM SIGARCH computer architecture news*, vol. 45, no. 2, pp. 27–40, 2017.
- <span id="page-13-6"></span>[41] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "Scnn: An accelerator for compressed-sparse convolutional neural networks," *ACM SIGARCH computer architecture news*, vol. 45, no. 2, pp. 27–40, 2017.
- <span id="page-13-7"></span>[42] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, "Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training," in *2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 2020, pp. 58–70.
- <span id="page-13-0"></span>[43] K. Roy, A. Jaiswal, and P. Panda, "Towards spike-based machine intelligence with neuromorphic computing," *Nature*, 2019.
- <span id="page-13-24"></span>[44] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "A systematic methodology for characterizing scalability of dnn accelerators using scale-sim," in *2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. IEEE, 2020, pp. 58–68.
- <span id="page-13-18"></span>[45] A. Sarma, S. Singh, H. Jiang, A. Pattnaik, A. K. Mishra, V. Narayanan, M. T. Kandemir, and C. R. Das, "Exploiting activation based gradient output sparsity to accelerate backpropagation in cnns," *arXiv preprint arXiv:2109.07710*, 2021.
- <span id="page-13-2"></span>[46] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, "Going deeper in spiking neural networks: Vgg and residual architectures," *Frontiers in neuroscience*, vol. 13, p. 95, 2019.
- <span id="page-13-21"></span>[47] K. Sewell, R. G. Dreslinski, T. Manville, S. Satpathy, N. Pinckney, G. Blake, M. Cieslak, R. Das, T. F. Wenisch, D. Sylvester *et al.*, "Swizzle-switch networks for many-core systems," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, vol. 2, no. 2, pp. 278–294, 2012.
- <span id="page-13-5"></span>[48] L. Shi, J. Pei, N. Deng, D. Wang, L. Deng, Y. Wang, Y. Zhang, F. Chen, M. Zhao, S. Song *et al.*, "Development of a neuromorphic computing system," in *2015 IEEE international electron devices meeting (IEDM)*. IEEE, 2015, pp. 4–3.
- <span id="page-13-4"></span>[49] Y. Shi, L. Nguyen, S. Oh, X. Liu, and D. Kuzum, "A soft-pruning method applied during training of spiking neural networks for in-memory computing applications," *Frontiers in neuroscience*, vol. 13, p. 405, 2019.
- <span id="page-13-22"></span>[50] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," *arXiv:1409.1556*, 2014.
- <span id="page-13-8"></span>[51] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, "Matraptor: A sparse-sparse matrix multiplication accelerator based on row-wise product," in *2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*. IEEE, 2020, pp. 766–780.
- <span id="page-13-19"></span>[52] C. Stöckl and W. Maass, "Optimized spiking neurons can classify images with high accuracy through temporal coding with two spikes," *Nature Machine Intelligence*, vol. 3, no. 3, pp. 230–238, 2021.
- <span id="page-13-15"></span>[53] P. J. Werbos, "Backpropagation through time: what it does and how to do it," *Proceedings of the IEEE*, vol. 78, no. 10, pp. 1550–1560, 1990.
- <span id="page-13-1"></span>[54] D. Wu, J. Li, R. Yin, H. Hsiao, Y. Kim, and J. San Miguel, "Ugemm: Unary computing architecture for gemm applications," in *2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)*. IEEE, 2020, pp. 377–390.
- <span id="page-13-11"></span>[55] Y. N. Wu, P.-A. Tsai, A. Parashar, V. Sze, and J. S. Emer, "Sparseloop: An analytical approach to sparse tensor accelerator modeling," in *2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. IEEE, 2022, pp. 1377–1395.
- <span id="page-13-3"></span>[56] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, "Spatio-temporal backpropagation for training high-performance spiking neural networks," *Frontiers in neuroscience*, vol. 12, p. 331, 2018.
- <span id="page-13-14"></span>[57] Y. Wu, L. Deng, G. Li, J. Zhu, Y. Xie, and L. Shi, "Direct training for spiking neural networks: Faster, larger, better," in *Proceedings of the AAAI conference on artificial intelligence*, vol. 33, no. 01, 2019, pp. 1311–1318.
- <span id="page-13-23"></span>[58] M. Yao, J. Hu, Z. Zhou, L. Yuan, Y. Tian, B. Xu, and G. Li, "Spike-driven transformer," *Advances in neural information processing systems*, vol. 36, 2024.
- <span id="page-13-20"></span>[59] R. Yin *et al.*, "Sata: Sparsity-aware training accelerator for spiking neural networks," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 42, no. 6, pp. 1926–1938, 2022.
- <span id="page-13-16"></span>[60] R. Yin *et al.*, "Workload-balanced pruning for sparse spiking neural networks," *IEEE Transactions on Emerging Topics in Computational Intelligence*, 2024.
- <span id="page-13-17"></span>[61] R. Yin, Y. Li, A. Moitra, and P. Panda, "Mint: Multiplier-less integer quantization for energy efficient spiking neural networks," in *2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC)*. IEEE, 2024, pp. 830–835.
- <span id="page-13-9"></span>[62] G. Zhang, N. Attaluri, J. S. Emer, and D. Sanchez, "Gamma: Leveraging gustavson's algorithm to accelerate sparse matrix multiplication," in *Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*, 2021, pp. 687–701.
- <span id="page-13-12"></span>[63] W. Zhang and P. Li, "Temporal spike sequence learning via backpropagation for deep spiking neural networks," *NeurIPS*, 2020.
- <span id="page-13-10"></span>[64] Z. Zhang, H. Wang, S. Han, and W. J. Dally, "Sparch: Efficient architecture for sparse matrix multiplication," in *2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 2020, pp. 261–274.
- <span id="page-13-13"></span>[65] H. Zheng et al., "Going deeper with directly-trained larger spiking neural networks," in *AAAI*, 2021.