

# Computer Architecture and Memory System Design for Deep Neural Networks



Presenter

Hyun Kim | Associate Professor

Affiliation

Seoul National University of Science and Technology  
Electrical and Information Engineering

Contact

[hyunkim@seoultech.ac.kr](mailto:hyunkim@seoultech.ac.kr) / [idsl.seoultech.ac.kr](http://idsl.seoultech.ac.kr)

# Key Issue of On-device AI Accelerators: Memory



## Three main goals of AI accelerators

High accuracy + High speed (Throughput) + Low power (Energy-Efficiency)



## Necessity of memory-level approach

Recent AI models demand a significant amount of data, so memory systems are becoming increasingly critical → **Low-power and high-speed memory platforms** dedicated to AI models lead to power savings and speed-up of AI accelerators



### Key Point

**Minimization of memory capacity and access overhead** by reflecting the model structure

## Overhead for memory access compared to operations in processing units



## Energy per operation [pJ] 45nm vs 7nm

| Operation | Type       | 45 nm [pJ]        | 7 nm [pJ]            | 45 nm / 7 nm |
|-----------|------------|-------------------|----------------------|--------------|
| +         | Int 8      | 0.03              | 0.007                | 4.3          |
|           | Int 32     | 0.1               | 0.03                 | 3.3          |
|           | BFloat 16  | --                | 0.11                 | --           |
|           | IEEE FP 16 | 0.4               | 0.16                 | 2.5          |
|           | IEEE FP 32 | 0.9               | 0.38                 | 2.4          |
| ×         | Int 8      | 0.2               | 0.07                 | 2.9          |
|           | Int 32     | 3.1               | 1.48                 | 2.1          |
|           | BFloat 16  | --                | 0.21                 | --           |
|           | IEEE FP 16 | 1.1               | 0.34                 | 3.2          |
|           | IEEE FP 32 | 3.7               | 1.31                 | 2.8          |
| SRAM      | 8 KB SRAM  | 10                | 7.5                  | 1.3          |
|           | 32 KB SRAM | 20                | 8.5                  | 2.4          |
| DRAM      | DDR3/4     | 1300 <sup>2</sup> | 1300 <sup>2</sup>    | 1.0          |
|           | HBM2       | --                | 250-450 <sup>2</sup> | --           |
|           | GDDR6      | --                | 350-480 <sup>2</sup> | --           |
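
To make the gap between memory access and arithmetic concrete, the following minimal sketch divides the 7 nm energy figures from the table by one another. The dictionary keys and the function name are illustrative, not part of the original slides, and the table does not state the access width, so the ratios should be read as order-of-magnitude estimates only.

```python
# Energy per operation in picojoules at 7 nm, copied from the table above.
# The key names are illustrative labels chosen for this sketch.
ENERGY_PJ_7NM = {
    "int8_add": 0.007,
    "int8_mul": 0.07,
    "fp32_mul": 1.31,
    "sram_8kb": 7.5,
    "dram_ddr": 1300.0,
}


def access_to_compute_ratio(access: str, op: str) -> float:
    """Return how many `op` operations cost as much energy as one `access`."""
    return ENERGY_PJ_7NM[access] / ENERGY_PJ_7NM[op]


if __name__ == "__main__":
    for op in ("int8_add", "int8_mul", "fp32_mul"):
        ratio = access_to_compute_ratio("dram_ddr", op)
        print(f"one DDR3/4 access costs about as much as {ratio:,.0f} x {op}")
```

Under these figures a single DDR3/4 access costs roughly four orders of magnitude more energy than an Int 8 multiplication, which is the motivation for minimizing memory accesses.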



# Scaling of Model Size / PU / Memory / Interface



# Overview of On-device AI Accelerators

## Mobile Characteristic

- Both Inference & Training
- Low-Power FPGA/ASIC for Mobile
- Low Precision: 2b/4b/8b (INT)
- Sparse network
- Application-specific accelerator design



## Architecture Platform



## Memory System for DNNs



## Self-Learning



## SW-based low complexity schemes for low-power & speed-up



# Low-Power CNN Inference Using Prefetcher-Assisted NVM Systems

## Goal

Improve energy efficiency and speed by using a prefetcher-aware memory controller within the NVM system for low-power on-device AI

## Motivation 1

- Computation-intensive CNNs have long memory idle time → Huge portion of unnecessary standby energy in main memory

## Motivation 2

- $P_{idle}$ of PRAM is $100\times$ lower than that of DRAM, and PRAM has a higher density than DRAM

## Solution/ Contribution

### 1 Adaptive prefetcher for CNN inference

Records block addresses in main memory and predicts future accesses from hot/cold access frequency and from temporal and spatial locality

### 2 Prefetcher-aware memory controller within the NVM system

36 bits of the DRAM-AIT are reserved as a counter for the prefetcher and data management. The proposed architecture handles 128 B of data as a block group to eliminate additional data latency in read-modify-write (RMW) operations
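
The idea of combining access-frequency counters with spatial locality can be sketched as below. This is a hypothetical illustration, not the EDeN prefetcher itself: the class name, the hot threshold, and the next-block prediction rule are all assumptions made for the example. Only the 128 B block-group size comes from the slide.

```python
from collections import defaultdict
from typing import Dict, Optional

BLOCK_BYTES = 128  # block-group size mentioned on the slide


class HotBlockPrefetcher:
    """Toy prefetcher: per-block counters plus next-block prediction.

    Hypothetical sketch; the threshold and prediction rule are assumptions.
    """

    def __init__(self, hot_threshold: int = 2) -> None:
        self.hot_threshold = hot_threshold
        self.counters: Dict[int, int] = defaultdict(int)

    def access(self, addr: int) -> Optional[int]:
        """Record an access; return a block address to prefetch, or None."""
        block = addr // BLOCK_BYTES
        self.counters[block] += 1  # frequency counter separates hot from cold
        if self.counters[block] >= self.hot_threshold:
            # Block is hot: assume spatial locality and fetch its neighbor.
            return (block + 1) * BLOCK_BYTES
        return None


if __name__ == "__main__":
    prefetcher = HotBlockPrefetcher(hot_threshold=2)
    for address in (0, 64, 256):
        print(address, "->", prefetcher.access(address))
```

In this sketch a block becomes "hot" once its counter reaches the threshold, at which point the adjacent block is proposed for prefetching; cold blocks trigger nothing.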

## Prefetcher accuracy and coverage comparison



## Normalized IPC comparison



## Overview of the EDeN Platform



## Internal Architecture of EDeN Prefetcher



## Control Flow of EDeN prefetcher



## Energy consumption comparison



# Processing-in-Memory (PIM)

## Motivation of PIM

### Memory Bottleneck or Memory “Wall”



**Memory bottleneck** (or memory wall) arises from the serial communication between the processor and the memory  
 → **PIM** is a technology to overcome the memory bottleneck (wall)  
 → **Energy efficiency** and **high speed** are essential!

## PIM Features

[Figure: xPU with Normal DRAM vs. PIM DRAM; C: Compute]

### 1 Performance Improvement

By utilizing the higher bandwidth inside the memory

### 2 Energy Efficiency Improvement

By minimizing data movement between host and memory

# LLM: Prefill + Generation (Decode)



## LLM operation | Compute-bound+Memory-bound

- 01 LLM Inference consists of input processing (Prompt, Prefill) and answer generating (Generation, Decode)
- 02 Prefill: Compute-bound / Generation: Memory-bound
- 03 The larger the model, the larger the share of memory-intensive operations (GEMV), so memory bandwidth for GEMV has a greater impact on system performance than processor compute capability
- 04 Good motivation for Hybrid NPU-PIM Acceleration Platform
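
The compute-bound versus memory-bound distinction above can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte moved) for a single weight matrix. The layer dimensions, the 2-byte element size, and the simple traffic model (weights, inputs, and outputs each moved once, no caching, no KV cache) are assumptions chosen for this sketch.

```python
def arithmetic_intensity(d_out: int, d_in: int, n_tokens: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a (d_out x d_in) weight applied to n_tokens inputs.

    n_tokens > 1 models prefill (GEMM); n_tokens == 1 models decode (GEMV).
    """
    flops = 2 * d_out * d_in * n_tokens  # one multiply and one add per MAC
    elems_moved = d_out * d_in + d_in * n_tokens + d_out * n_tokens
    return flops / (bytes_per_elem * elems_moved)


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection layer with 16-bit weights.
    print("prefill, 512 tokens:", arithmetic_intensity(4096, 4096, 512))
    print("decode, 1 token:   ", arithmetic_intensity(4096, 4096, 1))
```

With these assumed dimensions, prefill performs hundreds of FLOPs per byte while decode performs about one, so decode throughput is set by memory bandwidth rather than by the processor.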

# A Versatile and Cycle-Accurate Simulator for DRAM PIM Systems

## Goal

A cycle-accurate and scalable PIM-CPU simulator that can faithfully model CPU-only, PIM-only, and hybrid DRAM-based PIM-CPU configurations, thereby enabling system-level exploration of emerging memory-centric architectures

## Motivation

The lack of versatile and cycle-accurate simulation frameworks significantly limits the evaluation and optimization of diverse PIM designs → Existing in-house tools are narrowly tailored to specific architectures, lacking generality and extensibility

## Solution/ Contribution

### 1 Comprehensive PIM-CPU configuration support

Validated cycle-level fidelity ($\leq 6\%$ latency & power error) across CPU-, PIM- and hybrid modes versus Newton and PIM-HBM baselines

### 2 Multi-threaded PIM operation compiler

Automatically maps x86 micro-ops in PIM\_ROI to various PIM opcodes (MAC, MAXPOOL, ReLU, Softmax), delivering $46.6\times$ speed-up and $15.2\times$ higher energy efficiency on Llama2-7B



## Overview of HyDRASim

### Application-Level



### Architecture-Level



## Relative performance & power consumption



Negligible increase in computations

# Emulation Frameworks Supporting Adaptive Configurations in DRAM PIMs

## Goal

A fast, cycle-accurate, and scalable FPGA-based PIM emulation framework

## Motivation 1

- CPU-based software-level simulators limit throughput

## Motivation 2

- Limited FPGA BRAM capacity for PIM emulation

## Solution/ Contribution

### 1 Trace-driven, cycle-accurate FPGA emulation

- **FSM-based controller** decouples macro-command scheduling
- **Offline traces** provide reproducible macro-commands
- **Modular design** provides per-cycle accuracy and simplifies exploration of various DRAM-PIM architectures

### 2 Memblock-based memory management & Optimizations

- **Memblock** caches active DRAM regions, reducing BRAM usage
- **Hit/Miss logic** avoids redundant reloads
- **Dynamic latency hiding** overlaps operations
- **Adaptive memblock utilization** optimizes efficiency via BRAM remapping
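
The memblock idea, keeping only the active DRAM regions resident in limited BRAM and skipping reloads on a hit, can be sketched in software as follows. This is a hypothetical model: the class name, the region size, and the least-recently-used replacement policy are assumptions, since the slides do not specify how a victim region is chosen.

```python
from collections import OrderedDict


class MemblockCache:
    """Toy model of BRAM-resident DRAM regions with hit/miss tracking.

    LRU replacement is an assumption made for this sketch.
    """

    def __init__(self, num_blocks: int, block_bytes: int) -> None:
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        self.resident: "OrderedDict[int, None]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, load the region (evicting if full)."""
        region = addr // self.block_bytes
        if region in self.resident:
            self.resident.move_to_end(region)  # mark as most recently used
            self.hits += 1
            return True  # hit: no reload needed
        self.misses += 1
        if len(self.resident) >= self.num_blocks:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[region] = None
        return False


if __name__ == "__main__":
    cache = MemblockCache(num_blocks=2, block_bytes=1024)
    for address in (0, 512, 1024, 0, 2048, 1024):
        cache.access(address)
    print("hits:", cache.hits, "misses:", cache.misses)
```

Only misses would trigger a reload from the backing store, so a trace with good locality touches far less BRAM than the full DRAM address space would require.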

### Timing diagram of the proposed dynamic latency hiding method



### Runtime speedup evaluation (vs in-house PIM simulators)



### Overview of proposed PIM emulation framework



### Proposed adaptive memblock utilization



### Runtime speedup evaluation (vs DRAMsim3)



### Performance reliability evaluation (vs in-house PIM simulators)



# General-Purpose Programmable PIM with a Dual-Tier ISA

## Goal

Develop a **general-purpose PIM architecture** that bridges **functionality** and **programmability** gaps for end-to-end AI acceleration

## Motivation

- Existing PIMs fail to support the end-to-end execution of modern heterogeneous AI workloads due to critical gaps in functionality and programmability

## Solution/ Contribution

### 1 Synergistic Heterogeneous Architecture

Co-locating 128 In-Bank PEs (for bandwidth) and 8 Logic-Die Clusters (for compute) within a single HBM3 stack

### 2 Dual-Tier Fused ISA

A unified instruction set that fuses complex subgraphs into single commands, minimizing host interaction and data movement

### 3 HW/SW Co-Design & Performance

Reduces off-chip traffic by up to 83% and achieves $19.1\times$ higher energy efficiency and up to $43.5\times$ speedup (LLaMA3) compared to NVIDIA A100

### Energy efficiency improvement of GPPIM over GPU (A100)



### Performance speedup of GPPIM over GPU and Baseline PIM (IBPE-Only)



## Overall architecture of proposed GPPIM



## The fused ISA execution model



# Heterogeneous NPU-PIM Execution

## Goal

To alleviate decode-phase memory bottlenecks and enable low-latency LLM/VLM inference by **dynamically rebalancing work between an NPU and PIM on a unified memory subsystem with co-execution**

## Motivation 1

- LLM and VLM inference is phase heterogeneous, with compute-bound prefill and memory-bound decode, and KV-cache traffic that scales with sequence length stresses memory bandwidth

## Motivation 2

- Static accelerator mappings and blocking PIM execution are not robust to variation in batch size or sequence length, producing low utilization and unstable latency

## Solution/ Contribution

### 1 Unified memory and programming model

A unified address space with memory-mapped control enables the RISC-V core, NPU, and PIM to share buffers without copies and access memory concurrently

### 2 Asynchronous co-execution

A callback-driven nonblocking runtime overlaps NPU compute and DMA with in-memory PIM operations to improve concurrency and reduce stalls

### 3 Dynamic partitioning with tile alignment

Runtime counters drive adaptive NPU-PIM offloads per operator that stay within a stable range and align to NPU tiles, with residual rows directed to the PIM
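
The dynamic partitioning with tile alignment described in this section can be sketched as a small helper that splits the rows of one operator between the NPU and the PIM. The function name, the clamping bounds, and the example tile size are hypothetical; the slides only state that the offload ratio stays within a stable range, that the NPU share aligns to NPU tiles, and that residual rows go to the PIM.

```python
from typing import Tuple


def partition_rows(total_rows: int, npu_fraction: float, tile_rows: int,
                   min_fraction: float = 0.2,
                   max_fraction: float = 0.8) -> Tuple[int, int]:
    """Split rows into (npu_rows, pim_rows).

    The requested fraction is clamped to [min_fraction, max_fraction]
    (assumed bounds), the NPU share is rounded down to a whole number of
    tiles, and every remaining row is assigned to the PIM.
    """
    fraction = min(max(npu_fraction, min_fraction), max_fraction)
    npu_rows = int(total_rows * fraction) // tile_rows * tile_rows
    pim_rows = total_rows - npu_rows
    return npu_rows, pim_rows


if __name__ == "__main__":
    # Hypothetical operator with 1000 rows and 64-row NPU tiles.
    print(partition_rows(1000, 0.5, 64))
    print(partition_rows(1000, 0.95, 64))
```

In a real runtime the requested fraction would presumably be updated from runtime counters per operator; here it is simply passed in as an argument.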

### VLM performance scalability analysis of NEXUS



### Energy Efficiency Analysis of NEXUS



### An overall architecture of NEXUS



### Asynchronous execution timeline of NEXUS



# PIM-Aware Efficient Compression for Embedding Layers in sLLMs

## Goal

To enhance the efficiency of embedding layer processing in sLLM environments through PIM-based optimizations

## Motivation 1

- In quantized sLLMs, embedding layers constitute a significant share of the overall model parameters

## Motivation 2

- Embedding layer computations exhibit low arithmetic intensity (operations per byte), rendering them memory-bound

### 1 XOR-based masking compressor (XMC)

A lossless compression scheme tailored for sLLM embedding layers, achieving a $1.49\times$ higher compression ratio

### 2 Hardware-friendly decompressor

Achieves a 3-cycle decompression latency with only 0.05% area overhead compared to conventional HBM

### 3 PIM optimization with XMC

Embedding-layer-aware dataflow optimization enables a 3.95 $\times$  speedup over GPUs and an 11.5 $\times$  improvement over conventional HBM-PIM

## Overview of proposed PriME architecture



## Data transformation process with optimal mask value



## HW overhead & performance of PriME Decomp.

| Metrics      | Area ( $\mu\text{m}^2$ ) | Power (mW) | Throughput (GB/s) | Delay (cycles) |
|--------------|--------------------------|------------|-------------------|----------------|
| Decompressor | 40110                    | 4.7944     | 10.67             | 3              |

## Comp. ratio comparison of each method



## Energy reduction/efficiency Comparison



## GPU normalized perf. PriME & HBM-PIM



# Sparsity-Preserving Attention Replacement for PIM-based LLM Inference

## Goal

Accelerate scaled dot-product attention (SDPA) inference in LLMs by leveraging dynamic sparsity and bank-level parallelism in HBM-based PIM

## Motivation

- Because SDPA causes severe bottlenecks in NPUs, there is a strong need for PIM that effectively exploits sparsity

## Solution/ Contribution

### 1 Elemax: A Highly Parallelizable Softmax Alternative

Introduce Elemax, a ReLU-based and highly parallelizable alternative to Softmax → reduce attention latency by eliminating expensive operations such as exponentiation and division
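
To illustrate why a ReLU-based replacement exposes dynamic sparsity, the sketch below compares standard Softmax with a generic ReLU-based scoring function. The ReLU variant shown here is a stand-in chosen for illustration and is not the Elemax definition, which the slides do not give; the function names and the constant scaling by sequence length are assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Standard Softmax: needs exponentials, a row-wide sum, and division."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_scores(scores: List[float]) -> List[float]:
    """Generic ReLU-based stand-in (NOT Elemax): element-wise, no exponentials."""
    scale = 1.0 / len(scores)  # constant scale, an assumption for this sketch
    return [max(s, 0.0) * scale for s in scores]


def sparsity(weights: List[float]) -> float:
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5, -3.0]
    print("softmax sparsity:", sparsity(softmax(scores)))
    print("relu sparsity:   ", sparsity(relu_scores(scores)))
```

Softmax outputs are never exactly zero for finite scores of this magnitude, whereas the ReLU-based variant zeroes every negative score, and each output depends only on its own input, so the elements can be computed in parallel and the zeros can be skipped downstream.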

### 2 SPARE-PIM: Dynamic-Sparsity-Aware PIM Architecture

Design SPARE-PIM, a dedicated PIM architecture for Elemax → hide the latency of sparsity detection, exploit it to prune unnecessary operations, and maximize parallelism

### 3 Single-Stream SDPA Acceleration Dataflow

Develop a single-stream SDPA acceleration dataflow → skip bank writes for intermediate activation results and reuse data in place to improve PIM throughput

## Overall architecture of SPARE-PIM for exploiting dynamic sparsity



## Comparison of Softmax Replacements



## Timing Diagram of Sparsity Detection



## Core Dataflow of Accelerating SDPA Inference



## Comparison with Existing PIM-Based Accelerators

