

Workshop Session on:

**3D Emerging Memories and New Architecture Paradigms** 

### 3D Memories: Now and Then!

Hybrid, Cubes, Approximate, and Custom

Dr.-Ing. Christian Weis







## Why do we care about DRAM?

#### Power Break-Down for Suspended 3G State



Source: The systems hackers guide to the galaxy: Energy usage in a modern smartphone

## Power Break-down for Big Data Application



Source: Power Consumption of Green Wave Architecture 2011

## Power Break-Down Google Datacenter



Source: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. 2009

#### Power Break-Down eBrain



Source: A Scalable Custom Simulation Machine for the Bayesian Confidence Propagation Neural Network model of the Brain, 2014

## **Comparison of DRAM Subsystems**

#### **DIMM Based:**

General Purpose Computers *e.g. DDR3, DDR4* 



#### **Device Based:**

Embedded / Tablets / Graphic Cards e.g. LPDDR3, GDDR5



#### Package on Package (PoP):

Soldered on top of the MPSoC. Smartphones e.g. LPDDR3, LPDDR4



#### **Buffer on Board:**

Memory Controller on Buffer Chip, Serial Connection e.g. FBDIMM, IBM CDIMM, Intel SMI/SMB



#### 3D/2.5D-Integrated:

Stacked on Logic or Silicon Interposer by means of TSVs e.g. Wide I/O, HBM



#### **Memory Cube:**

3D-Stacked, Memory Controller on Bottom Layer, Serial Interconnect (SerDes) e.g. HMC, SMC



Source: Matthias Jung

## **Comparison of DRAM Subsystems**



Best case - 100% usage of the available BW

## 3D DRAM's starting point ...

Reducing the I/O loads – a performance and power advantage:

much smaller capacitances for a TSV stack than a Quad die

Conventional quad-die stack



TSV stack:



source: Qimonda, 2008

## 3D DRAM packaging example

- 3D Packaging with a commodity 2Gb DDR3 SDRAM chip (4x 2Gb = 8Gb)
- With areas reserved for TSVs





12.8 GB/s
DIMM Bandwidth



source: Samsung'09

## **3D Integration: State-of-the-art**



#### **Graphics Memory**



High Bandwidth Memory (HBM)



512 GB/s Bandwidth

1024bit I/O per HBM cube

Light-weight Logic layer interface @500MHz DDR

source: AMD, June 2015
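As a rough consistency check of the figures above: assuming "500MHz DDR" means two transfers per clock on each of the 1024 I/Os, and assuming the 512 GB/s figure is the aggregate over several cubes (the cube count is inferred here, not stated on the slide), the per-cube bandwidth works out as

$$BW_{cube} = \frac{1024\,\mathrm{bit} \times 2 \times 500\,\mathrm{MHz}}{8\,\mathrm{bit/B}} = 128\,\mathrm{GB/s}, \qquad \frac{512\,\mathrm{GB/s}}{128\,\mathrm{GB/s}} = 4\ \text{cubes}$$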

## 3D Integration: State-of-the-art

#### **DRAM** Cube with Abstracted Interface



160+ GB/s Bandwidth



Hybrid Memory Cube (HMC)



32 bit DDR I/O per Vault



Source: Micron, 2014

## 3D Integration: State-of-the-art

Chip architecture of

1Gb Wide-IO DRAM and

SEM image of microbumps

#### **Chip photograph**





| Device                              | MDDR            | LPDDR2            | Wide IO           |
|-------------------------------------|-----------------|-------------------|-------------------|
| Density                             | 1Gb             | 1Gb               | 1Gb               |
| Organization                        | 4 Bank<br>/ x32 | 8 Bank<br>/ x32   | 16 Bank<br>/ x512 |
| VDD [V]                             | 1.8             | 1.2               | 1.2               |
| Data Rate [MHz]                     | 400             | 800               | 200               |
| Data Bandwidth [GB/s]               | 1.6<br>(100%)   | 3.2<br>(200%)     | 12.8<br>(800%)    |
| Meas. Power [mW]: Standby           | 0.32<br>(100%)  | 0.27<br>(83.3%)   | 0.27<br>(83.3%)   |
| Meas. Power [mW]: Read, DQ          | 215.8<br>(100%) | 221.2<br>(102.5%) | 73.7<br>(34.2%)   |
| Meas. Power [mW]: Read, Total       | 322.3<br>(100%) | 372.1<br>(115.4%) | 330.6<br>(102.6%) |
| Read: I/O per pin [mW/Gbps]         | 17.33<br>(100%) | 8.71<br>(50.3%)   | 0.78<br>(4.5%)    |
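The bandwidth row of the table can be reproduced from the data rate and the I/O width. The following is a minimal sketch, assuming that "Data Rate [MHz]" is the per-pin transfer rate and that the "x32" / "x512" in the organization row is the number of data I/Os; the function name `peak_bandwidth_gb_per_s` is hypothetical.

```python
def peak_bandwidth_gb_per_s(data_rate_mhz: float, io_width_bits: int) -> float:
    """Peak bandwidth in GB/s for io_width_bits data pins at data_rate_mhz per pin."""
    bits_per_second = data_rate_mhz * 1e6 * io_width_bits
    return bits_per_second / 8 / 1e9


# (data rate in MHz, I/O width in bits), values taken from the table
devices = {
    "MDDR": (400, 32),
    "LPDDR2": (800, 32),
    "Wide IO": (200, 512),
}

bandwidths = {
    name: peak_bandwidth_gb_per_s(rate, width)
    for name, (rate, width) in devices.items()
}

for name, bw in bandwidths.items():
    relative = bw / bandwidths["MDDR"]
    print(f"{name}: {bw:.1f} GB/s ({relative:.0%} of MDDR)")
```

The point of the comparison is that Wide IO reaches its bandwidth through width rather than speed: it has the lowest per-pin data rate of the three devices but 16 times as many I/Os.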

### **Different Die Flavors of DRAMs**

#### Commodity Samsung **2G DDR3** die:



#### WIDE I/O 1Gb SDR JEDEC based die:



#### Micron's **Hybrid Memory Cube** (HMC):



**DRAM Layer** 

## Does 3D help to do it better?

Three severe problems have appeared in recent years:

1. DRAMs don't like heat → 2.5D integration or very good heat control in the underlying logic layer (uProc)
2. When not using direct 3D stacking (on top of uProc), how to get this huge bandwidth out of the devices?
3. Memory-centric computing, such as neuromorphic computing, NNs, or DL, makes it even worse ...

~ 30 GB/s < 2 Channel Bandwidth

70 Gops/W - 2 Tops/W





Google, 2017

## DRAM Energy Distribution

- DRAM Power Breakdown for Twitter Memcached Application\*
- 2GB LPDDR3



\* O. Naji, A. Hansson, C. Weis, M. Jung, N. Wehn. A High-Level DRAM Timing, Power and Area Exploration Tool. IEEE International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), July 2015

## Impact of Refresh for Future DRAMs

#### **Refresh Performance Impact**

#### **Refresh Energy Overhead**



#### → High Temperatures Worsen The Behaviour

- J. Liu, et al. RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012
- I. Bhati, et al. DRAM Refresh Mechanisms, Trade-offs and Penalties, IEEE Trans. 2015

## Refresh at High Temperatures

The <u>exponential</u> leakage current behavior must be counterbalanced by <u>shorter</u> refresh periods!
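A minimal sketch of such a temperature-dependent refresh period follows. It assumes the period halves for every 10°C of temperature increase, a rule inferred from two operating points used elsewhere in this talk (8 ms at 100°C, 16 ms at 90°C), and that it is capped at the 64 ms nominal value. The halving step and the function name `refresh_period_ms` are assumptions for illustration, not device specifications.

```python
def refresh_period_ms(temp_c: float,
                      t_ref_hot_ms: float = 8.0,
                      temp_hot_c: float = 100.0,
                      halving_step_c: float = 10.0,
                      t_ref_max_ms: float = 64.0) -> float:
    """Refresh period that doubles per halving_step_c below temp_hot_c, capped at t_ref_max_ms."""
    if temp_c > temp_hot_c:
        raise ValueError("temperature above the modelled range")
    # Exponential dependence: each step of cooling doubles the allowed period.
    t_ref = t_ref_hot_ms * 2.0 ** ((temp_hot_c - temp_c) / halving_step_c)
    return min(t_ref, t_ref_max_ms)


for temp in (25, 70, 80, 90, 100):
    print(f"{temp:>3} °C -> t_REF = {refresh_period_ms(temp):.0f} ms")
```

A real controller would use characterized retention data per device rather than a single exponential, but the sketch shows why the refresh rate has to rise so quickly near the top of the temperature range.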



## Refreshing WIDE I/O DRAM Stacks

#### **Worst Case Assumptions:**

- Temperature = 100°C → t<sub>REF</sub> = 8 ms
- Number of rows = 32768
- Bank parallel refresh (with 2 rows concurrently refreshed in one bank)
- Refresh command issued every:

$$t_{REFI} = \frac{8\,\mathrm{ms}}{32768 / 2} \approx 488\,\mathrm{ns}$$

- Refresh duration = $t_{RFC} = 130\,\mathrm{ns}$



~25% of time spent in refresh!
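The arithmetic behind this figure can be sketched as follows, using only the worst-case values listed above. The function name `refresh_overhead` is hypothetical, and the model assumes that refresh commands are spread evenly over the refresh period and that the bank is unavailable for the full t_RFC of each command.

```python
def refresh_overhead(t_ref_s: float, rows: int, rows_per_command: int, t_rfc_s: float):
    """Return (t_REFI in seconds, fraction of time spent refreshing)."""
    commands_per_period = rows // rows_per_command
    t_refi = t_ref_s / commands_per_period
    return t_refi, t_rfc_s / t_refi


t_refi, busy = refresh_overhead(
    t_ref_s=8e-3,        # t_REF at 100 °C
    rows=32768,
    rows_per_command=2,  # two rows refreshed concurrently per command
    t_rfc_s=130e-9,
)
print(f"t_REFI = {t_refi * 1e9:.0f} ns, refresh share = {busy:.1%}")
```

With these inputs the share comes out slightly above a quarter of the total time, consistent with the rounded figure on the slide.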



# We can do better ...

Response to the 1st problem: DRAMs don't like heat

→ Fine-granular refresh control &

→ Approximate DRAM





## Approximate DRAM


#### **Reduce the number of Refreshes**

- Lowering rate or completely switching off refresh
- Possible risk of data errors
- Example Case Studies:
  - Flikker<sup>1</sup>
    Lowers the refresh rate in a non-critical memory region
  - Omitting Refresh<sup>2</sup>
     Disables refresh completely for a specific memory region
  - ...
- Further Approaches:
  - RAPID, RAIDR, RIO, SECRET, ProactiveDRAM, AVATAR ...
  - But: VRT, DPD, Temperature, Characterization Time, Storage ...
- Thorough analysis of retention errors mandatory

<sup>1</sup> Song Liu, et al. 2011. Flikker: saving DRAM refresh-power through critical data partitioning.

<sup>2</sup> Matthias Jung, et al. 2015. Omitting Refresh: A Case Study for Commodity and Wide I/O DRAMs.



### **Retention Error Analysis**

### Wide I/O 3D-DRAM



#### **WIOMING MPSoC:**

- CMOS 65nm, 72mm<sup>2</sup>, 1250 TSVs
- 4 Channels, 1Gb, 512 I/Os, 50nm
- Heaters
- Temperature sensors

### **Commodity DDR3 DRAM**



#### **DRAMMeasure:**

- Precise heating control of DDR3 SO-DIMMs
- Measuring currents and retention errors
- Applicable to any DDR3 SO-DIMM based platform (FPGAs, CPUs, ...)



## Wide I/O Retention Error Analysis

Observations: Variable Retention Times (VRT) & Data Pattern Dependency (DPD)



Unique bit error at 90°C

Different patterns cause different error rates!

## DRAMMeasure

### Commodity DDR3 Measurements<sup>1</sup>



[Figure: error rate (log scale, 1E-09 to 1E-01) versus retention time in s (log scale, 10 to 10000), measured at 80°C and 90°C; annotated values: 57% and 15%]

- Main reference in literature about retention errors published by Samsung<sup>2</sup>
- Measurements: 1-3 orders of magnitude better retention error behaviour
- DRAM can hold data much longer than specified, even at high temperatures.
- Can be exploited for Approximate Computing (DRAM)

<sup>1</sup> Values normalized to total DDR3 DRAM size: 512 MiB (Total number of DRAM cells: 4294967296)

<sup>2</sup> Kim and Lee, A New Investigation of Data Retention Time in Truly Nanoscaled DRAMs, 2009



### **Commodity DDR3 Scaling Trends**



- A DRAM from 2009 (50nm) is compared with a DRAM from 2013 (30nm)
- Scaling down DRAMs results in more errors
- We observe bends in the curves between 10 and 100 s

## DRAM Retention Error Model

## **Calibration from Measurements**



- Data Pattern Dependency (DPD)
- Variable Retention Times (VRT)
- Wide I/O and DDR3 DRAM
- Can be used in any C++ Simulator (e.g. gem5)

C. Weis, et al. Retention Time Measurements and Modelling of Bit Error Rates of WIDE-I/O DRAM in MPSoCs, DATE, 2015
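To illustrate how a retention error model plugs into a simulator, the sketch below flips stored bits at a given bit error rate. It is deliberately much simpler than the calibrated model cited above: it assumes independent, uniformly distributed bit flips and ignores DPD and VRT. It is written in Python for brevity rather than C++, and the names `inject_retention_errors` and `count_bit_errors` are hypothetical.

```python
import random


def inject_retention_errors(data: bytes, bit_error_rate: float, seed: int = 0) -> bytes:
    """Flip each bit of data independently with probability bit_error_rate."""
    if not 0.0 <= bit_error_rate <= 1.0:
        raise ValueError("bit_error_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible experiments
    corrupted = bytearray(data)
    for byte_index in range(len(corrupted)):
        for bit in range(8):
            if rng.random() < bit_error_rate:
                corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


def count_bit_errors(original: bytes, corrupted: bytes) -> int:
    """Number of bit positions that differ between the two buffers."""
    return sum(bin(a ^ b).count("1") for a, b in zip(original, corrupted))


frame = bytes(range(256)) * 16  # 4 KiB of example data
noisy = inject_retention_errors(frame, bit_error_rate=1e-3, seed=42)
print(f"{count_bit_errors(frame, noisy)} of {len(frame) * 8} bits flipped")
```

An application-level quality metric can then be evaluated on the corrupted buffer to judge how much retention error a workload tolerates.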

## **Approximate DRAM Simulation Framework**



### **Temperature Variation Aware Bank-Wise Refresh**




## **Switch off Refresh: Image Processing**

- Streamed image processing on Xilinx FPGA
- DDR3 SO-DIMM
- Application Specific Memory Controller (ASMC)
- Frame deadline = 9ms < $t_{REF}$ = 64ms @ 25°C
- Refresh disabled in the memory controller
- No retention errors occur
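The reasoning behind this setup is that reading or rewriting a row restores its charge, so data that lives for less than the refresh period never needs an explicit refresh. A minimal sketch of that check follows, assuming every row holding frame data is rewritten or consumed within one frame deadline; the function name `refresh_can_be_omitted` is hypothetical.

```python
def refresh_can_be_omitted(data_lifetime_ms: float, t_ref_ms: float) -> bool:
    """True if data is consumed or overwritten before the retention guarantee expires."""
    return data_lifetime_ms < t_ref_ms


# Values from the slide: 9 ms frame deadline, 64 ms refresh period.
print(refresh_can_be_omitted(data_lifetime_ms=9.0, t_ref_ms=64.0))

# With an 8 ms refresh period (the 100 °C worst case used elsewhere in this
# talk), the same frame deadline would no longer be safe.
print(refresh_can_be_omitted(data_lifetime_ms=9.0, t_ref_ms=8.0))
```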







## A Per Layer Refresh Policy for 3D DRAMs

#### Separation of 3D DRAM Stack into unreliable and reliable regions

- Reliable regions: higher DRAM layers with temperature aware refresh
- Unreliable region: bottom DRAM layer with disabled refresh → Omit Refresh (OR)
- Access unreliable region while reliable region is refreshed

#### **Example applications**

- Graph processing
- Image processing
- Baseband processing
- → Saves 100% refresh power in the unreliable layer
- → Increases bandwidth
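The policy above can be sketched as a small per-layer configuration. This is an illustration under simplifying assumptions: the temperature-aware refresh of the upper layers is reduced to one fixed period, all layers are the same size, and the four-layer stack in the usage example is hypothetical. The names `LayerPolicy`, `per_layer_policy` and `refresh_commands_saved` are not from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerPolicy:
    layer: int                 # 0 = bottom DRAM layer, closest to the logic die
    reliable: bool
    t_ref_ms: Optional[float]  # None means refresh is disabled (Omit Refresh)


def per_layer_policy(num_layers: int, t_ref_ms: float) -> List[LayerPolicy]:
    """Bottom layer: unreliable, refresh omitted. Upper layers: reliable, refreshed."""
    if num_layers < 2:
        raise ValueError("need at least one reliable and one unreliable layer")
    return [
        LayerPolicy(layer=i, reliable=i > 0, t_ref_ms=t_ref_ms if i > 0 else None)
        for i in range(num_layers)
    ]


def refresh_commands_saved(policies: List[LayerPolicy]) -> float:
    """Fraction of refresh commands avoided, assuming equal-sized layers."""
    disabled = sum(1 for p in policies if p.t_ref_ms is None)
    return disabled / len(policies)


policies = per_layer_policy(num_layers=4, t_ref_ms=8.0)
for p in policies:
    state = "refresh off" if p.t_ref_ms is None else f"t_REF = {p.t_ref_ms} ms"
    print(f"layer {p.layer}: {'reliable' if p.reliable else 'unreliable'}, {state}")
print(f"refresh commands saved: {refresh_commands_saved(policies):.0%}")
```

Error-tolerant data would be allocated in layer 0, which can then serve accesses while the upper layers are busy refreshing.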



Simulation Results

## Example Applications

28 nm ASIC, 400 MHz, 51 mm<sup>2</sup>





#### **Recommendation Systems:**

- Netflix Dataset Graph:
  - 100,480,507 User Ratings
  - 480,189 Users
  - 17,700 Movies



- Graph is stored as matrix in unreliable region (sparse)
- Worst Case Assumptions: 90°C (actually required t<sub>REF</sub>=16ms)
- → No noticeable loss in quality of recommendations

#### **Baseband Processing:**

- Simulation of Low-Density Parity Check Coding (LDPC)
- Channel data is stored in unreliable region
- Worst case assumptions: 100°C (actually required t<sub>REF</sub>=8ms)
- Influence of retention errors much smaller than channel errors during transmission
- → No noticeable loss in communications performance

# We can do better ...

Response to the 2nd problem: **How to get the huge bandwidth out of the device?**

- → More clever usage &
- → Maybe not needed at all








## **Applied bandwidth to the HMC**

- → Taken from M. Gokhale
- → At LLNL, the response times of the HMC were measured on an FPGA board (round-trip ~24ns)
- → Here we used 40 active threads with different data granularities (64B and 256B)
- → BW was very similar to M. Gokhale's results:



| Workload           | Short name | Description                                          |
|--------------------|------------|------------------------------------------------------|
| Page Rank          | pager_s22  | A benchmark to rank web pages in popularity          |
| Image Diff. (full) | image_x1   | Pixel-wise diff-computation of two images (full)     |
| Image Diff. (x16)  | image_x16  | Pixel-wise diff-computation of two images (x16 dec.) |
| Sparse Mat. Vec.   | spmv_s21   | Multiply a sparse matrix with a dense vector         |
| Random Access      | randa_s29  | Reads and updates random locations in a table        |
| Mixed              | mixed      | A mix of all listed benchmarks                       |



## **HMC Latency – not always predictable**



### Average access latency (a):

- 22nm DRAM HMC
- Page size = 256 Bytes and
- Packet size = 256 Bytes

### Average access latency (b):

- 22nm DRAM HMC
- Page size = 512 Bytes and
- Packet size = 64 Bytes

## HMC Power – 11W++

M. Gokhale et al., 2015



# We can do better ...

Response to the 3rd problem: **Memory centric computing** makes it worse...?

- → New Architectures &
- → Custom 3D-DRAMs







## The Smart Memory Cube (SMC)

Used for **CNNs**



~22 GFlops/W

**RISC-V Processors** 

Erfan Azarkhish et al. Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes, 2017




### **Custom 3D-DRAM for eBRAIN II**

- A custom multi-chip design to simulate the human brain in real time using the spiking BCPNN (Bayesian Confidence Propagation Neural Network)
- The architecture for this algorithm is based on Hyper Column Units (HCUs) and Mini Column Units (MCUs)
- The parallel computability of HCUs and MCUs makes this architecture hardware friendly
- Each HCU is an aggregation of 100 MCUs

The hyper column unit has 10000 input connections and 100 output connections





- Custom-optimized **3D-DRAM architecture** => 48 I/O DDR microChannel per HCU (1-2 mm², depending on the DRAM tech.) with 500MHz freq.
- Tailored access → using a technique called "Row merge", where we balanced the BW between Row-updates and Col-updates (from the HCUs).



| Species | # of HCUs           | Average Power |
|---------|---------------------|---------------|
| Mouse   | $1.6 \times 10^{3}$ | 13 W          |
| Rat     | $5.0 \times 10^{3}$ | 44 W          |
| Cat     | $6.0 \times 10^{4}$ | 522 W         |
| Macaque | $2.0 \times 10^{5}$ | 1700 W        |
| Human   | $2.0 \times 10^{6}$ | 17 kW         |
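Dividing the average power by the number of HCUs shows how the design scales. The sketch below uses only the values from the table; the dictionary layout and variable names are illustrative.

```python
# species: (number of HCUs, average power in W), values from the table
configs = {
    "Mouse": (1.6e3, 13.0),
    "Rat": (5.0e3, 44.0),
    "Cat": (6.0e4, 522.0),
    "Macaque": (2.0e5, 1700.0),
    "Human": (2.0e6, 17000.0),
}

power_per_hcu_mw = {
    species: watts / hcus * 1e3
    for species, (hcus, watts) in configs.items()
}

for species, mw in power_per_hcu_mw.items():
    print(f"{species:<8} {mw:.1f} mW per HCU")
```

All five configurations land between roughly 8 and 9 mW per HCU, so the average power grows approximately linearly with the number of HCUs across three orders of magnitude.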


#### Matrix – Bank mapping of 4 HCUs: → optimized data layout

|                   |                   |                   |                   |
|-------------------|-------------------|-------------------|-------------------|
| i cells HCU 2     | i cells HCU 2     | i cells HCU 3     | i cells HCU 3     |
| ij cells<br>HCU 1 | ij cells<br>HCU 1 | ij cells<br>HCU 1 | ij cells<br>HCU 1 |
| ij cells<br>HCU 0 | ij cells<br>HCU 0 | ij cells<br>HCU 0 | ij cells<br>HCU 0 |
| Bank 0            | Bank 1            | Bank 2            | Bank 3            |
| i cells HCU 0     | i cells HCU 0     | i cells HCU 1     | i cells HCU 1     |
## The Future is Heterogeneous



- New memory technologies:
  - PCM
  - 3DXPoint
  - STT-MRAM
  - RRAM
- DRAM won't be dead, but will change its role → maybe used as Cache ...
- New memory ECC techniques
- Heterogeneous main memory systems:
  - NVDIMM-P
  - 3D MPSoCs / 3D Memory Stacks
- New requirements on:
  - Compiler
  - OS
- Processing in memory (PIM)



## Summary – Take-away messages

- Approximate DRAM can be used to trade off BW vs. reliability
  - Fine-granular refresh control in 3D DRAM stacks is required



- HMC is good for high concurrency and highly distributed threads
  - Latency (contentions on the vault accesses) & Power are large drawbacks



- HBM: highest BW possible, but at the cost of a 1000mm² Si interposer
- Custom 3D-DRAMs have a large potential



- Hybrid architectures and Near/In-memory processing (e.g. NeuroStream or UPMEM's processor) will be key



Thank you for listening!

For more information: //ems.eit.uni-kl.de