PhD Thesis Defense

Jul 14, 2017 (Friday)

# Addressing Prolonged Restore Challenges in Further Scaling DRAMs

# **Xianwei Zhang**

Committee:



Youtao Zhang (advisor) CS, Pitt



Bruce R. Childers CS, Pitt



Jun Yang ECE, Pitt



Wonsun Ahn CS, Pitt



Guangyong Li ECE, Pitt

# MAIN MEMORY



Processor

#### Memory

#### Storage


# Main memory is critical for system performance



### DIMM/Chip



2D Array

**DRAM Cell** 



# The simplicity of the cell enabled DRAM to scale continuously

# SCALING








## Do we still need DRAM to continue scaling?



CPU/GRAPHICS

- Increasing computation
- Data-intensive apps
- Tight power budgets

# **DRAM must keep scaling to meet demands**

























# WHY DIFFICULT?




# **RESTORE ISSUE**



# More cells will violate the JEDEC specifications

# THESIS STATEMENT

# Enable further DRAM scaling without low yield or degraded performance

# CANDIDATE SOLUTIONS



#### **Relax standard**












#### **Expose slow cells to architectural levels**

### **Address Restore Issues in Further Scaling DRAMs**

- Partial restore based on refresh distance [RT-Next'HPCA16]
- Fast restore via reorganization and page allocation [CkRemap'DATE15, Alloc'TODAES17]
- Mitigate restore with approximate computing [DrMP'PACT17, Award'MemSys16]

# OUTLINE

#### **RT-Next**

Partial restore based on refresh distance

#### **CkRemap**

Fast restore via reorganization and allocation

#### DrMP

Mitigate restore with approximate computing

#### **Summary and Research Directions**





#### Post-access restore

- Fully charge cells
- Read (tRAS), Write (tWR)

#### **Prolonged restore leads to slow read/write**







#### Refresh

- Periodically fully charge cells to avoid data loss

















#### Linear restore curve

- Data is safe as long as the voltage stays above the decay curve

#### Use four sub-windows

- Save a set of timings for each
- Charging goal: Vmax of each sub-window

#### RT-next: RESTORE W.R.T NEXT REFRESH



Apply the timings to achieve the charging goal

Example: with 40 ms to the next refresh, the access falls in the 2nd sub-window, so charge to V2
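The sub-window lookup behind this example can be sketched as follows. This is only an illustration, assuming four equal-width sub-windows counted from the previous refresh (which reproduces the 40 ms case) and a linear curve falling from a normalized `v_full` to `v_min`. The function names and the voltage values are hypothetical and not taken from the thesis.

```python
REFRESH_WINDOW_MS = 64.0  # standard refresh window
NUM_SUBWINDOWS = 4        # four sub-windows, as on the slide


def subwindow_index(ms_to_next_refresh):
    """Return the 1-based sub-window, counted from the previous refresh."""
    if not 0.0 <= ms_to_next_refresh <= REFRESH_WINDOW_MS:
        raise ValueError("distance must lie within the refresh window")
    elapsed = REFRESH_WINDOW_MS - ms_to_next_refresh
    width = REFRESH_WINDOW_MS / NUM_SUBWINDOWS
    # Clamp so that a distance of exactly 0 ms stays in the last sub-window.
    return min(int(elapsed // width) + 1, NUM_SUBWINDOWS)


def charging_goal(index, v_full=1.0, v_min=0.5):
    """Voltage on the assumed linear curve at the start of a sub-window.

    v_full and v_min are placeholder normalized voltages.
    """
    start_fraction = (index - 1) / NUM_SUBWINDOWS
    return v_full - (v_full - v_min) * start_fraction


window = subwindow_index(40.0)   # 40 ms to the next refresh
goal = charging_goal(window)     # charging goal for that sub-window
```

Under these assumptions the later sub-windows get lower charging goals, which is what lets the restore be truncated.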

### MULTI-RATE REFRESH

#### Multi-rate refresh

- Rows refreshed less often than every 64 ms use the same four-window division

### **REFRESH UPGRADE**



#### Refresh upgrade

- The more frequent the refresh, the shorter the distance to the next refresh
- Lower charging goal for restore

### UPGRADE REFRESH DESIGNS

#### Blindly upgrade (*RT-all*)

- More refreshes, increasing overheads on performance and energy

#### Selectively upgrade (*RT-sel*)

- Only upgrade touched row/bin
- Back to low-rate afterwards
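The selective upgrade can be pictured as a small bookkeeping structure. The sketch below is a simplified illustration, not the thesis's controller design: the class name, the two refresh periods, and the choice to downgrade a row at its next refresh are all assumptions made for the example.

```python
LOW_RATE_MS = 256   # hypothetical low-rate refresh period
HIGH_RATE_MS = 64   # hypothetical upgraded refresh period


class SelectiveUpgrade:
    """Tracks which rows are temporarily refreshed at the high rate."""

    def __init__(self):
        self.upgraded = set()

    def on_access(self, row):
        # Only a touched row is upgraded, so its next refresh comes sooner
        # and the restore after this access can aim at a lower charging goal.
        self.upgraded.add(row)

    def refresh_period_ms(self, row):
        return HIGH_RATE_MS if row in self.upgraded else LOW_RATE_MS

    def on_refresh(self, row):
        # Assumption: the row falls back to the low rate once refreshed.
        self.upgraded.discard(row)


tracker = SelectiveUpgrade()
tracker.on_access(7)                        # row 7 is touched and upgraded
period_touched = tracker.refresh_period_ms(7)
period_untouched = tracker.refresh_period_ms(8)
```

Untouched rows keep the low rate, which is how this design avoids the extra refresh overhead of upgrading everything.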







- RT-next is 15% over Baseline because of restore truncation
- RT-all becomes worse because of refresh penalty
- RT-sel achieves the best result by balancing refresh and restore

### COMPARE TO STATE-OF-THE-ART

While ArchShield+ is close to PRT-free, RT-sel is 5.2% better

Despite losing 50% of capacity, MCR is still worse

# SUMMARY: RT-



- Prolonged restore issue in future DRAM
- Restore and refresh are strongly correlated

- RT-next: truncate restore w/ refresh distance
- RT-sel: expose more restore opportunities

- Balances refresh and restore, beats state-of-the-art
- Performance: 19.5% improvement

# OUTLINE



#### **RT-Next**

Partial restore based on refresh distance



#### **CkRemap**

Fast restore via reorganization and allocation



#### DrMP

Mitigate restore with approximate computing



#### **Summary and Research Directions**











Physical bank: chip level, a portion of memory arrays

Logical bank: rank level, one physical bank from each chip

### How can the organization be used to address restore?









Single set of timings for the whole memory

Cell behavior becomes more statistical (more varied) at smaller nodes

#### Deciding timings by the worst case is too pessimistic





Partition each chip bank into multiple chunks

Set chunk-level timings

Expose timings to memory controller (MC)

# Slow & fast chunks can still be combined together
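Why slow and fast chunks end up combined can be seen from a small calculation. The sketch assumes that a logical chunk spans the same chunk index on every chip of the rank, so its timing is set by the slowest chip. The function name and the timing values (in cycles) are hypothetical.

```python
def logical_chunk_timings(chip_chunk_timings):
    """Timing of each logical chunk, limited by the slowest chip.

    chip_chunk_timings[c][k] is the restore timing of chunk k on chip c.
    """
    return [max(per_chip) for per_chip in zip(*chip_chunk_timings)]


# Two chips, four chunks each; every chip has two fast (15) and two slow (30) chunks.
chips = [
    [15, 30, 15, 30],
    [30, 15, 30, 15],
]
timings = logical_chunk_timings(chips)
```

In this made-up layout every logical chunk pairs a fast chip-chunk with a slow one, so no logical chunk is fast even though half of the chip-chunks are.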

# FAST CHUNK W/ REMAPPING



Partition bank into chunks

Detect chip-chunk timings

Remap chunks within each chip-bank

# Bad chip leads to slow rank even w/ remapping
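One simple way to picture the remapping is to sort the chunks of each chip-bank by timing so that fast chunks line up across chips. This is an illustrative policy chosen for the sketch, not necessarily the one used in the thesis; the function names and timing values (in cycles) are hypothetical.

```python
def remap_chunks(chip_chunk_timings):
    """Per-chip remap table: logical chunk i -> physical chunk table[i]."""
    tables = []
    for timings in chip_chunk_timings:
        # Fastest physical chunks are assigned to the lowest logical indices.
        tables.append(sorted(range(len(timings)), key=lambda k: timings[k]))
    return tables


def remapped_timings(chip_chunk_timings, tables):
    """Timing of each logical chunk after remapping (slowest chip wins)."""
    num_chunks = len(tables[0])
    return [
        max(t[table[i]] for t, table in zip(chip_chunk_timings, tables))
        for i in range(num_chunks)
    ]


chips = [
    [15, 30, 15, 30],
    [30, 15, 30, 15],
]
after = remapped_timings(chips, remap_chunks(chips))

# A chip whose chunks are all slow drags every logical chunk down.
with_bad_chip = chips + [[30, 30, 30, 30]]
after_bad = remapped_timings(with_bad_chip, remap_chunks(with_bad_chip))
```

With the two balanced chips, remapping recovers two fast logical chunks; adding the all-slow chip removes them again, which motivates choosing which chips share a rank.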

# **RANK CONSTRUCTION (BIN)**



**Cluster** chips into bins using similarity

Construct ranks using chips from each bin
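A rough sketch of the binning step follows. It assumes that similarity is measured by the number of fast chunks per chip and that each rank is filled with chips from the same bin; both are assumptions made for illustration, as are the function name, the threshold, and the timing values.

```python
def construct_ranks(chip_chunk_timings, chips_per_rank, fast_threshold):
    """Group chip indices into ranks of chips with similar fast-chunk counts."""

    def fast_chunks(timings):
        return sum(1 for t in timings if t <= fast_threshold)

    # Order chips from most to fewest fast chunks, then cut into ranks.
    ordered = sorted(
        range(len(chip_chunk_timings)),
        key=lambda c: fast_chunks(chip_chunk_timings[c]),
        reverse=True,
    )
    return [
        ordered[i:i + chips_per_rank]
        for i in range(0, len(ordered), chips_per_rank)
    ]


chips = [
    [15, 15],  # chip 0: two fast chunks
    [30, 30],  # chip 1: none
    [15, 30],  # chip 2: one
    [30, 30],  # chip 3: none
]
ranks = construct_ranks(chips, chips_per_rank=2, fast_threshold=15)
```

Here the two chips with fast chunks share one rank and the two slow chips share the other, so the slow chips no longer cancel out the fast ones.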

# How to fully utilize the exposed fast regions?

### **RESTORE-AWARE PAGE ALLOCATION**



### Accesses come from a small set of pages
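Because a few pages receive most accesses, a restore-aware allocator can place the hottest pages in the fastest chunks. The sketch below is a minimal greedy version, assuming per-page access counts are available from profiling; the function name, page labels, and timing values are hypothetical.

```python
def allocate_pages(page_access_counts, chunk_timings, pages_per_chunk):
    """Map each page to a chunk index, hottest pages to fastest chunks."""
    hot_first = sorted(
        page_access_counts, key=page_access_counts.get, reverse=True
    )
    fast_first = sorted(range(len(chunk_timings)), key=lambda k: chunk_timings[k])
    # One slot per page frame, listed from the fastest chunk to the slowest.
    slots = [chunk for chunk in fast_first for _ in range(pages_per_chunk)]
    if len(hot_first) > len(slots):
        raise ValueError("not enough page frames for all pages")
    return dict(zip(hot_first, slots))


counts = {"a": 100, "b": 5, "c": 50}   # profiled accesses per page
placement = allocate_pages(counts, chunk_timings=[30, 15], pages_per_chunk=2)
```

In this example the two hottest pages land in the fast chunk (index 1) and the cold page in the slow one.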









With chunk remap and rank construction, avg 15% shorter

# PAGE ALLOCATION EFFECTS



Spec-All\_rand

Spec-All\_prof

Chunk-remap & rank-construction expose more fast chunks

- provide more opportunities for page-allocation

Restore-aware page allocation effectively reduces time

# SUMMARY: CkRemap



- Restore under further scaling suffers serious process variation (PV) effects
- Worst-case-based approaches are ineffective

- CkRemap: construct fast chunks via remapping
- PageAlloc: fully utilize the exposed fast regions

- Performance: as high as 25% avg improvement
- Page alloc: hotness-aware allocation maximizes gains

# OUTLINE



#### **RT-Next**

Partial restore based on refresh distance



### CkRemap

Fast restore via reorganization and allocation



#### DrMP

Mitigate restore with approximate computing



### **Summary and Research Directions**


## **APPLICATION CHARACTERISTICS**

- Machine Learning
- Computer Vision
- Big Data Analytics

### Many applications can tolerate accuracy loss

Image credits: image-net.org, www.itbusiness.ca/, www-d0.fnal.gov











### Will the final output always be acceptable?





Accuracy loss steadily grows as tWR decreases (panels: KMEANS, LU)



Applications show vastly different behaviors



# Final output quality must be controlled

### **CRITICAL DATA**



### Critical data cannot be approximated

















### There is a tradeoff between accuracy and overhead












### DrMP: APPROXIMATE DRAM ROW



### What if there isn't that much approximate data?



#### **Precise + Approx**



Pair two rows to re-combine chip segments

- Choose smaller one from each location to form a fast one (Precise)

Guarantee partial precise for the other slow row
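The row pairing can be sketched as an element-wise selection over chip segments. This is a simplified illustration: the function names and timing values (in cycles) are hypothetical, and reading "partial precise" as the segments of the slow row that still meet the applied timing is an assumption of the sketch.

```python
def pair_rows(row_a, row_b):
    """Recombine two rows' chip segments into a fast row and a slow row.

    row_a[i] and row_b[i] are the restore timings of the segment on chip i.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must span the same number of chips")
    fast = [min(a, b) for a, b in zip(row_a, row_b)]
    slow = [max(a, b) for a, b in zip(row_a, row_b)]
    return fast, slow


def precise_segments(row, applied_timing):
    """Chip positions whose segment is fully restored under applied_timing."""
    return [i for i, t in enumerate(row) if t <= applied_timing]


fast, slow = pair_rows([15, 30, 15, 30], [30, 15, 20, 15])
still_precise = precise_segments(slow, applied_timing=20)
```

The fast row takes the smaller timing at every position and can hold precise data; in the slow row, only the positions that meet the applied timing stay precise.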
















DrMP achieves 19.8% performance improvement

- For apps with dominant approx data accesses, DrMP outperforms PRT-free

Orthogonal to RT

- RT+DrMP is 8.7% better than PRT-free

### SUMMARY: DrMP



- Many applications can tolerate output quality loss
- Restore can be used for approximate computing

- DrMP: balance restore reductions and accuracy
- DrMP': support both approximate and precise

- Output quality: no more than 1% accuracy loss
- Performance: 19.8% improvement

# OUTLINE



#### **RT-Next**

Partial restore based on refresh distance



#### CkRemap

Fast restore via reorganization and allocation



#### DrMP

Mitigate restore with approximate computing



#### **Summary and Research Directions**


### SUMMARY



- DRAM must keep scaling to meet increasing demands
- Prolonged restore time has become a major hurdle

- RT-next: truncate restore using the time distance to next refresh
- CkRemap: construct fast access regions using DRAM organization
- DrMP: mitigate restore while guaranteeing acceptable output quality loss

- Performed pioneering studies on restore via modeling & simulation
- Developed comprehensive schemes to mitigate the restore issue

Supported under NSF grants: CCF-1422331, CNS-1012070, CCF-1535755 and CCF-1617071



#### Sharing/Sensing timing reduction

- Optimize DRAM internal structures [CHARM'ISCA13, TL-DRAM'HPCA13, etc.]
- Utilize existing timing margins [NUAT'HPCA14, AL-DRAM'HPCA15, etc.]

We work on the orthogonal restore issue in future DRAMs

#### **DRAM restore studies**

- Identify the restore scaling issue [Co-arch'MEM14, tWR'Patent15, etc.]
- Reduce restore timings [AL-DRAM'HPCA15, MCR'ISCA15, etc.]

We target future DRAMs with more effective solutions

#### Memory-based approximate computing

- Optimize storage density and lifetime [PCM/SSD'MICRO13, PCM'ASPLOS16, etc.]
- Skip DRAM refresh [Flikker'ASPLOS11, Alloc'CASES15, etc.]

Ours is the first work on restore-based approximation

### FUTURE RESEARCH DIRECTIONS



#### Solve restore from **reliability** perspective

- Treat slow-restore cells as faulty ones
- Design stronger error correction codes



#### Study security issues of restore variation

- Restore variation info is DRAM's fingerprint
- Solve both info leakage and slow restore



#### Explore restore in 3D stacked DRAM

- Stacking has thermal management issues
- Reduce restore with temperature-aware solutions

## PUBLICATIONS



Xianwei Zhang, Youtao Zhang, Bruce Childers and Jun Yang

- [HPCA'2016] Restore Truncation for Performance Improvement in Future DRAM Systems

Xianwei Zhang, Youtao Zhang, Bruce Childers and Jun Yang

- [TODAES'2017] On the Restore Time Variations of Future DRAM Memory
- [DATE'2015] Exploiting DRAM Restore Time Variations in Deep Sub-micron Scaling

Xianwei Zhang, Youtao Zhang, Bruce Childers and Jun Yang

- [PACT'2017] DrMP: Mixed Precision-aware DRAM for High Performance Approximate and Precise Computing
- [MemSys'2016] AWARD: Approximation-aWAre Restore in Further Scaling DRAM

Xianwei Zhang, Lei Zhao, Youtao Zhang and Jun Yang

- [ICCD'2015] Exploit Common Source-Line to Construct Energy Efficient Domain Wall Memory based Caches

Xianwei Zhang, Youtao Zhang and Jun Yang

- [ICCD'2015] DLB: Dynamic Lane Borrowing for Improving Bandwidth and Performance in Hybrid Memory Cube
- [ICCD'2015] TriState-SET: Proactive SET for Improved Performance in MLC Phase Change Memories

Xianwei Zhang, Lei Jiang, Youtao Zhang, Chuanjun Zhang and Jun Yang

- [ISLPED'2013] WoM-SET: Lowering Write Power of Proactive-SET based PCM Write Strategy Using WoM Code

### ACKNOWLEDGEMENTS



- Profs. Youtao Zhang, Bruce Childers and Jun Yang
  - great guidance, and all resources
- Profs. Wonsun Ahn and Guangyong Li
  - valuable input into research studies
- UPitt and NSF
  - financial support (TA/Fellowship and research grants)
- All members in the lab
  - insightful discussions
- Friends and colleagues
  - help both in and outside research
- Family
  - endless support and understanding


