Design of Adaptive Communication Channel Buffers for Low-Power Area-Efficient Network-on-Chip Architecture

Avinash Kodi†, Ashwini Sarathy* and Ahmed Louri*
†Department of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701
*Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85719
E-mail: kodi@ohio.edu, sarathya@ece.arizona.edu, louri@ece.arizona.edu

Sponsored: National Science Foundation (NSF) grant ECCS-0725765 (at the High Performance Computing Architectures and Technologies Lab, University of Arizona, Tucson)

ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS’07)
Dec 3-4, 2007
Talk Outline

• Motivation & Introduction

• iDEAL – Inter-router Dual-function Energy and Area-efficient Links for NoC architectures
  – Link and Router Architecture

• Performance Evaluation
  – Power & Area estimation for the Links & Routers
  – Simulation results for Throughput, Latency & Overall network power

• Conclusions
Motivation

System-on-Chip (SoC) paradigm

- Increasing wire delay with decreasing feature size
- Scalable, modular interconnect – Network-on-Chip (NoC)
Motivation

Recent NSF-sponsored workshop on On-Chip Interconnection Networks:

- "The most important technology constraint for on-chip networks is **power consumption**".
- Power consumption of on-chip interconnection networks (OCINs) implemented with current techniques exceeds expected needs by a **factor of 10**.

### iDEAL – Inter-router Dual-function Energy and Area-efficient Links for NoC architectures

**iDEAL Methodology (circuit and architectural techniques)**
- Reduce the number of router buffers
- To prevent performance degradation, use adaptive channel buffers to store data along the links when required
- Dynamic buffer allocation within the router buffers

![Diagram of Generic NoC architecture vs iDEAL architecture](image)

- Adaptive channel buffers along the link
- Reduced router buffer size

**Generic NoC architecture**

**iDEAL architecture**
Conventional Links

Output Port of Router A

---

Input Port of Router B
iDEAL – Channel Buffer Design (1/2)
iDEAL – Channel Buffer Design (2/2)

Functions as a conventional repeater when there is no congestion.

Control block is turned ‘OFF’.

During congestion, the repeater is tri-stated and holds the sampled value.

Control block is turned ‘ON’.
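
The dual behavior described on this slide can be sketched at the flit level. This is a minimal behavioral model, not the circuit itself: the class name `DualFunctionLink`, the `cycle` method, and the one-flit-per-stage capacity are assumptions made for illustration, and the propagation delay of the congestion signal is not modeled.

```python
from collections import deque
from typing import Deque, Optional


class DualFunctionLink:
    """Flit-level model of a link whose repeaters can double as storage.

    Hypothetical abstraction: each channel buffer stage holds one flit.
    """

    def __init__(self, num_channel_buffers: int) -> None:
        self.capacity = num_channel_buffers
        self.held: Deque[str] = deque()  # flits currently stored along the link

    def cycle(self, data_in: Optional[str], congested: bool) -> Optional[str]:
        """Advance one clock cycle and return the flit delivered downstream."""
        if data_in is not None:
            self.held.append(data_in)
        data_out = None
        if not congested and self.held:
            # No congestion: the oldest flit is driven to the receiver.
            # With an empty link this is the incoming flit itself, so the
            # link behaves like a conventional repeated wire.
            data_out = self.held.popleft()
        if len(self.held) > self.capacity:
            raise OverflowError("channel buffers full; sender must stall")
        return data_out


link = DualFunctionLink(num_channel_buffers=2)
trace = [
    link.cycle("f0", congested=False),  # passes straight through
    link.cycle("f1", congested=True),   # held on the link
    link.cycle("f2", congested=True),   # held on the link
    link.cycle(None, congested=False),  # congestion released: drain in order
    link.cycle(None, congested=False),
]
```

In this sketch the flits leave in arrival order once congestion is released, which is the property the reduced router buffers rely on.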
iDEAL – Control Block

- Power efficient
- Stable at varying frequencies
iDEAL: Dual-function Link

[Figure: dual-function link shown over Cycles 1 to 3, with Data-In, Congestion Signal, Data-Out and Congestion Release]
Link - Power & Area Estimation

- \( P_{\text{segment}(\text{repeater})} \)
  (Dynamic, leakage, short-circuit)

- \( P_{\text{segment}(\text{ch1-buffer})} \)
  (leakage, control block)

- \( P_{\text{control-blk}} \)
  (inverters, clock, switched-cap.)

[Figure: link from the output port of Router A to the input port of Router B, with the control block driven by CLK1, CLK2 and the Congestion signal]
iDEAL – Router Buffer Design

- Static buffer allocation
  - Fixed number of buffers per VC
  - HoL blocking

<table>
<thead>
<tr>
<th>VC</th>
<th>RP</th>
<th>WP</th>
<th>OP</th>
<th>OVC</th>
<th>CR</th>
<th>C*</th>
<th>Status</th>
</tr>
</thead>
</table>

RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, C* = congestion control
Status = status of the VC (idle, waiting, RC, VA, SA, ST)
**iDEAL – Router Buffer Design**

- Dynamic buffer allocation

- Approximately \((z + c)/v\) buffers per VC (\(z = \) router buffers, \(c = \) channel buffers, \(v = \) # of VCs)

<table>
<thead>
<tr>
<th>VC</th>
<th>RP</th>
<th>WP</th>
<th>OP</th>
<th>OVC</th>
<th>CR</th>
<th>Status</th>
<th>(F_0)</th>
<th>(F_1)</th>
<th>...</th>
<th>(F_{(z+c)/v})</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>3</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>3</td>
<td>3</td>
<td>...</td>
<td>N</td>
</tr>
<tr>
<td>1</td>
<td>6</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>6</td>
<td>6</td>
<td>...</td>
<td>N</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>v</td>
<td>5</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>5</td>
<td>5</td>
<td>...</td>
<td>N</td>
</tr>
</tbody>
</table>

**Unified VC State Table**

- **Buffer Slot Availability**
- **Congestion Control**

<table>
<thead>
<tr>
<th>Buffer Slot</th>
<th>Free</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Y</td>
</tr>
<tr>
<td>2</td>
<td>N</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>(z)</td>
<td>N</td>
</tr>
</tbody>
</table>

RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, \(C^*\) = congestion control
Status = status of the VC (idle, waiting, RC, VA, SA, ST)
**iDEAL – Router Buffer Design**

- Example illustrating Dynamic buffer allocation in iDEAL

### Unified VC State Table

<table>
<thead>
<tr>
<th>Buffer Slot</th>
<th>Free</th>
<th>Write Pointer</th>
<th>Read Pointer</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>1</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>2</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>3</td>
<td>Y</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>4</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>5</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>6</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>7</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
</tbody>
</table>

### Buffer Slot Availability

- Input Flit Tracking
- Output Flit Tracking
- Congestion Control

### Congestion Control

Incoming flit (VCID = 1)

### VC Control Table

<table>
<thead>
<tr>
<th>VC</th>
<th>RP</th>
<th>WP</th>
<th>OP</th>
<th>OVC</th>
<th>CR</th>
<th>Status</th>
<th>( F_0 )</th>
<th>( F_1 )</th>
<th>( F_3 )</th>
<th>( F_4 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>3</td>
<td>N</td>
<td>0</td>
<td>1</td>
<td>4</td>
<td>ST</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>7</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>N</td>
<td>2</td>
<td>N</td>
<td>4</td>
<td>VA</td>
<td>1</td>
<td>6</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>2</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>4</td>
<td>Idle</td>
<td>N</td>
<td>N</td>
<td>N</td>
<td>N</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>N</td>
<td>5</td>
<td>2</td>
<td>4</td>
<td>SA</td>
<td>0</td>
<td>5</td>
<td>N</td>
<td>N</td>
</tr>
</tbody>
</table>

RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, C* = congestion control
Status = status of the VC (idle, waiting, RC, VA, SA, ST)
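
The dynamic allocation illustrated above can be sketched as a shared slot pool plus a per-VC ordered list of slot indices (the role played by the F columns). This is a minimal sketch under assumptions: the class `DynamicBufferPool` and its method names are hypothetical, and pointer fields, credits and the VC pipeline status are left out.

```python
from typing import Dict, List, Optional


class DynamicBufferPool:
    """Shared router buffer slots handed out to VCs on demand."""

    def __init__(self, num_slots: int, num_vcs: int) -> None:
        self.free: List[bool] = [True] * num_slots             # buffer slot availability
        self.data: List[Optional[str]] = [None] * num_slots
        # Per-VC list of occupied slots, oldest flit first (hypothetical layout).
        self.vc_slots: Dict[int, List[int]] = {vc: [] for vc in range(num_vcs)}

    def write_flit(self, vc: int, flit: str) -> int:
        """Store an incoming flit in the first free slot and return its index."""
        for slot, is_free in enumerate(self.free):
            if is_free:
                self.free[slot] = False
                self.data[slot] = flit
                self.vc_slots[vc].append(slot)
                return slot
        # No slot left: this is where congestion would be signalled upstream.
        raise RuntimeError("no free buffer slot")

    def read_flit(self, vc: int) -> str:
        """Remove and return the oldest flit of a VC, freeing its slot."""
        slot = self.vc_slots[vc].pop(0)
        flit = self.data[slot]
        self.data[slot] = None
        self.free[slot] = True
        return flit


pool = DynamicBufferPool(num_slots=4, num_vcs=2)
used = [pool.write_flit(0, f) for f in ("a", "b", "c")]  # VC 0 takes 3 of 4 slots
used.append(pool.write_flit(1, "x"))
first_out = pool.read_flit(0)         # frees slot 0
reused = pool.write_flit(1, "y")      # VC 1 reuses the freed slot
```

With static allocation each of the two VCs would be limited to two slots; here VC 0 temporarily holds three, which is the flexibility that reduces head-of-line blocking.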
Router - Power & Area Estimation

- Buffer Power ($P_{\text{write}} + P_{\text{read}}$)
- Crossbar Power (Switch + Arbiter)

- Power reduces on decreasing the buffer size

**Getting started on power/area estimation**

- Virtual Channel (VC)
- Switch Allocator (SA)
- Crossbar Switch
- Input Buffers
- Processing Element (PE)

**6T SRAM cell**

- Bitlines
- Wordlines
- Sense Amp
Performance Evaluation

- Evaluated on a cycle-accurate on-chip network simulator
- Simulated 8 x 8 Mesh and 8 x 8 Folded Torus topologies
- Evaluated synthetic workloads: uniform traffic and non-uniform patterns (Butterfly, Complement, Perfect Shuffle, Matrix Transpose, Bit Reversal)
- Metrics evaluated: throughput, latency and overall network power
- Considered 5 different configurations, labeled \(v n_V\)-\(r n_R\)-\(c n_C\) and abbreviated to the three digits \(n_V n_R n_C\)
  \(n_V\) = number of VCs per input port, \(n_R\) = number of router buffers per VC, \(n_C\) = number of channel buffers
  - Baseline = 440
  - 434, 428, 344, 531
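
The configuration labels can be decoded mechanically. The sketch below assumes each digit of a label is \(n_V\), \(n_R\) and \(n_C\) in that order, and that the storage available per input port is \(n_V \times n_R + n_C\) (router slots plus channel buffers); the names `Config` and `parse_config` are hypothetical.

```python
from typing import NamedTuple


class Config(NamedTuple):
    vcs: int                    # n_V: VCs per input port
    router_buffers_per_vc: int  # n_R
    channel_buffers: int        # n_C

    @property
    def router_slots(self) -> int:
        """Router buffer slots per input port."""
        return self.vcs * self.router_buffers_per_vc

    @property
    def total_slots(self) -> int:
        """Router slots plus channel buffers (assumed per-port total)."""
        return self.router_slots + self.channel_buffers


def parse_config(label: str) -> Config:
    """Decode a three-digit label such as '428' into its parameters."""
    n_v, n_r, n_c = (int(ch) for ch in label)
    return Config(n_v, n_r, n_c)


configs = {label: parse_config(label) for label in ("440", "434", "428", "344", "531")}
totals = {label: cfg.total_slots for label, cfg in configs.items()}
```

Under this reading every listed configuration keeps the same total storage as the baseline while moving part of it from the router onto the link; 428, for instance, halves the router slots from 16 to 8.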
## Power Estimation - Summary

<table>
<thead>
<tr>
<th>vnV - rnR - cnC</th>
<th>Buffer Power (mW)</th>
<th>% Change</th>
<th>Mesh Link + Control Power (mW)</th>
<th>% Change</th>
<th>Folded Torus Link + Control Power (mW)</th>
<th>% Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>v4-r4-c0</td>
<td>2.020</td>
<td>-</td>
<td>2.032 + 0</td>
<td>-</td>
<td>4.068 + 0</td>
<td>-</td>
</tr>
<tr>
<td>v4-r3-c4</td>
<td>1.646</td>
<td>-18.51</td>
<td>2.164 + 0.0122</td>
<td>+ 7.0</td>
<td>4.195 + 0.0122</td>
<td>+ 3.4</td>
</tr>
<tr>
<td>v4-r2-c8</td>
<td>1.272</td>
<td>-37.02</td>
<td>2.296 + 0.0205</td>
<td>+ 13.9</td>
<td>4.437 + 0.0205</td>
<td>+ 6.8</td>
</tr>
<tr>
<td>v3-r4-c4</td>
<td>1.646</td>
<td>-18.51</td>
<td>2.164 + 0.0122</td>
<td>+ 7.0</td>
<td>4.195 + 0.0122</td>
<td>+ 3.4</td>
</tr>
<tr>
<td>v3-r3-c7</td>
<td>1.365</td>
<td>-32.41</td>
<td>2.263 + 0.0184</td>
<td>+ 12.2</td>
<td>4.294 + 0.0184</td>
<td>+ 6.0</td>
</tr>
<tr>
<td>v5-r2-c6</td>
<td>1.459</td>
<td>-27.76</td>
<td>2.230 + 0.0164</td>
<td>+ 10.5</td>
<td>4.261 + 0.0164</td>
<td>+ 5.1</td>
</tr>
<tr>
<td>v5-r3-c1</td>
<td>1.926</td>
<td>-4.65</td>
<td>2.065 + 0.0059</td>
<td>+ 1.8</td>
<td>4.096 + 0.0059</td>
<td>+ 0.8</td>
</tr>
</tbody>
</table>

\( n_V \) = number of VCs per input port, \( n_R \) = number of router buffers per VC, \( n_C \) = number of channel buffers
**Buffer Power – 8x8 Mesh and Folded Torus**

- Uniformly distributed traffic

⇒ Nearly 40% power savings for 50% buffer size reduction (428), using Dynamic buffer allocation

(428 = 4 VCs per port, 2 router buffers per VC, 8 channel buffers)
Uniformly distributed traffic

⇒ Only about 5% drop in throughput for the 428 case (Dynamic buffer allocation)

(428 = 4 VCs per port, 2 router buffers per VC, 8 channel buffers)
• Total power consumed for a network load of 0.5

⇒ Nearly 20% savings for the 428, using Dynamic buffer allocation

(428 = 4 VCs per port, 2 router buffers per VC, 8 channel buffers)
• Reduction in power for all configurations, under all traffic patterns, compared to the baseline (440)
• For example, under Complement traffic the 428 configuration achieves 45% savings under Static allocation and 37.5% savings under Dynamic allocation
Throughput – 8x8 Mesh – all Traffic Patterns

Throughput (8x8 Mesh) at an offered load = 0.5

- No significant decrease in throughput under any traffic pattern, using Dynamic allocation
Conclusion

• **iDEAL** architecture provides a Low-Power Area-efficient solution for NoCs, by reducing power consumption through circuit-level and architecture-level techniques.

• Simulation results show that halving the buffer size yields 40-52% power savings and a significant reduction in router area, with only a marginal 1-5% drop in performance under dynamic buffer allocation.

• Future work will involve (a) simulation using real-application traces and (b) exploring architectural improvements such as aggressive speculation in the credit loop.
Backup Slides
## Area Estimation – Summary with values from Synopsys Design Compiler

<table>
<thead>
<tr>
<th>vnV – rnR - cnC</th>
<th>Buffer Area (μm²)</th>
<th>Link Repeater Area (μm²)</th>
<th>Total Buffer + Link Area (μm²)</th>
<th>% Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>v4-r4-c0</td>
<td>81,407</td>
<td>32</td>
<td>81,439</td>
<td>-</td>
</tr>
<tr>
<td>v4-r3-c4</td>
<td>63,991</td>
<td>52</td>
<td>64,011</td>
<td>-21.40</td>
</tr>
<tr>
<td><strong>v4-r2-c8</strong></td>
<td><strong>48,066</strong></td>
<td><strong>80</strong></td>
<td><strong>48,146</strong></td>
<td><strong>-40.88</strong></td>
</tr>
<tr>
<td>v3-r4-c4</td>
<td>63,250</td>
<td>52</td>
<td>63,302</td>
<td>-22.27</td>
</tr>
<tr>
<td>v3-r3-c7</td>
<td>50,373</td>
<td>73</td>
<td>50,446</td>
<td>-38.05</td>
</tr>
<tr>
<td>v5-r2-c6</td>
<td>53,712</td>
<td>66</td>
<td>53,778</td>
<td>-33.96</td>
</tr>
<tr>
<td>v5-r3-c1</td>
<td>73,797</td>
<td>38</td>
<td>73,803</td>
<td>-9.37</td>
</tr>
</tbody>
</table>

\( n_V \) = number of VCs per input port, \( n_R \) = number of router buffers per VC, \( n_C \) = number of channel buffers
Latency – 8x8 Mesh and Folded Torus

Average Latency (8x8 Mesh) - UN - Dynamic

Average Latency (8x8 Folded Torus) UN - Dynamic

- Uniformly distributed traffic

⇒ For all cases (except 531), the network saturates at a load of about 0.3 for the Mesh and about 0.4 for the Folded Torus
Comparison with FC-CB and DAMQ

- FC-CB performs similarly to the dynamically allocated 440 case
- 434 and 428 achieve a nearly 4% increase in saturation throughput compared to FC-CB
- 428 achieves a nearly 12.5% improvement in saturation throughput compared to DAMQ
Power calculations using Synopsys Power Compiler

- 428 case shows nearly 40% reduction in buffer power alone
- Nearly 30% decrease in overall network power for the 428 case
Data flow Control Simulated with Synopsys VCS

[Figure: simulated waveforms against Time (ns), from 5 to 40 ns: 500 MHz clock signal, Data_in, congestion input, congestion at stages 1 to 4, and Data_out1 to Data_out4 from stages 1 to 4]
# Router - Power Estimation

<table>
<thead>
<tr>
<th>Component</th>
<th>Power / Area Calculation</th>
<th>Explanation</th>
</tr>
</thead>
<tbody>
<tr>
<td>$C_{buf}$</td>
<td>$(1/2 \times W \times L \times C_{ox}) + (W \times L_{ov} \times C_{ox})$</td>
<td>$C_{buf}$ = additional capacitance due to three-state repeater along the links; $W$, $L$ = width &amp; length of min. sized inverter; $C_{ox}$ = oxide capacitance; $L_{ov}$ = gate-drain/source overlap length</td>
</tr>
<tr>
<td>$P_{\text{dynamic}}$</td>
<td>$a \times [k(C_o + C_p + C_{buf}) + \ell C_w] \times V_{DD}^2 \times \text{freq}$</td>
<td>$a$ = activity factor; $k$ = repeater sizing; $\ell$ = repeater spacing; $C_o$ = diffusion capacitance; $C_p$ = gate capacitance; $C_w$ = wire capacitance; $V_{DD}$ = supply voltage; freq = operating frequency</td>
</tr>
<tr>
<td>$P_{\text{leakage}}$</td>
<td>$2 \times [1/2 \times V_{DD} \times (I_{off}(W_n + W_p)k)]$</td>
<td>$I_{off}$ = subthreshold leakage current; $W_n$ ($W_p$) = width of the NMOS (PMOS) in the repeater</td>
</tr>
<tr>
<td>$P_{\text{short-ckt}}$</td>
<td>$a \times t_{\text{rise}} \times W_n \times k \times V_{DD} \times I_{sc} \times \text{freq}$</td>
<td>$t_{\text{rise}}$ = rise time of the short-ckt current $I_{sc}$</td>
</tr>
</tbody>
</table>
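
The expressions in the table translate directly into code. The sketch below is only a transcription of those formulas; the function names are hypothetical, and the numeric arguments in the example call are placeholder values chosen for easy arithmetic, not technology parameters from the talk.

```python
def c_buf(w: float, l: float, c_ox: float, l_ov: float) -> float:
    """Extra capacitance of the three-state repeater: (1/2 W L C_ox) + (W L_ov C_ox)."""
    return 0.5 * w * l * c_ox + w * l_ov * c_ox


def p_dynamic(a: float, k: float, c_o: float, c_p: float, c_buffer: float,
              spacing: float, c_w: float, v_dd: float, freq: float) -> float:
    """a * [k (C_o + C_p + C_buf) + l C_w] * V_DD^2 * freq."""
    return a * (k * (c_o + c_p + c_buffer) + spacing * c_w) * v_dd ** 2 * freq


def p_leakage(v_dd: float, i_off: float, w_n: float, w_p: float, k: float) -> float:
    """2 * [1/2 * V_DD * (I_off (W_n + W_p) k)]."""
    return 2 * (0.5 * v_dd * (i_off * (w_n + w_p) * k))


def p_short_circuit(a: float, t_rise: float, w_n: float, k: float,
                    v_dd: float, i_sc: float, freq: float) -> float:
    """a * t_rise * W_n * k * V_DD * I_sc * freq."""
    return a * t_rise * w_n * k * v_dd * i_sc * freq


# Placeholder inputs (assumed, unitless) purely to exercise the formulas.
extra_cap = c_buf(w=2.0, l=1.0, c_ox=1.0, l_ov=0.5)
segment_power = (
    p_dynamic(a=0.5, k=2.0, c_o=1.0, c_p=1.0, c_buffer=extra_cap,
              spacing=3.0, c_w=1.0, v_dd=2.0, freq=10.0)
    + p_leakage(v_dd=2.0, i_off=1.0, w_n=1.0, w_p=2.0, k=3.0)
    + p_short_circuit(a=0.5, t_rise=2.0, w_n=1.0, k=2.0, v_dd=2.0, i_sc=1.0, freq=10.0)
)
```

Keeping `c_buf` as a separate term makes it easy to compare a conventional repeater (`c_buffer=0`) with the three-state version.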
iDEAL – Control Block

- Self-checking Double-sampling technique for the Control block

- Slightly more power (0.02 µW vs. 0.06 µW) and area, but more reliable
Aggressive Speculation

- Aggressive speculation by increasing the number of credits available to 8
- Additional credits are accounted for by the channel buffers

⇒ Saturation throughput improves by 10% for the 428 case
# Power Estimation – Summary

with values from Synopsys Power Compiler

<table>
<thead>
<tr>
<th>vnV – rnR - cnC</th>
<th>Buffer Power (mW)</th>
<th>Mesh Link + Control Power (mW)</th>
<th>Total Power (Buffer + Link) (mW)</th>
<th>% Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>v4-r4-c0</td>
<td>19.54</td>
<td>2.45</td>
<td>21.99</td>
<td>-</td>
</tr>
<tr>
<td>v4-r3-c4</td>
<td>14.51</td>
<td>2.91</td>
<td>17.42</td>
<td>-20.78</td>
</tr>
<tr>
<td>v4-r2-c8</td>
<td><strong>11.57</strong></td>
<td><strong>3.57</strong></td>
<td><strong>15.14</strong></td>
<td><strong>-31.15</strong></td>
</tr>
<tr>
<td>v3-r4-c4</td>
<td>15.09</td>
<td>2.91</td>
<td>18.00</td>
<td>-18.14</td>
</tr>
<tr>
<td>v3-r3-c7</td>
<td>12.56</td>
<td>3.50</td>
<td>16.06</td>
<td>-26.96</td>
</tr>
<tr>
<td>v5-r2-c6</td>
<td>14.41</td>
<td>3.31</td>
<td>17.72</td>
<td>-19.41</td>
</tr>
<tr>
<td>v5-r3-c1</td>
<td>19.29</td>
<td>2.81</td>
<td>22.10</td>
<td>+ 0.50</td>
</tr>
</tbody>
</table>

\( n_V \) = number of VCs per input port, \( n_R \) = number of router buffers per VC, \( n_C \) = number of channel buffers
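
The Total Power and % Change columns follow from the first two columns. The sketch below recomputes them from the buffer and link values in the table, taking v4-r4-c0 as the baseline; the function name `summarize` is hypothetical, and the recomputed percentages can differ from the table in the last digit because the table inputs are already rounded.

```python
from typing import Dict, Tuple

# (buffer power, mesh link + control power) in mW, copied from the table.
POWER_MW: Dict[str, Tuple[float, float]] = {
    "v4-r4-c0": (19.54, 2.45),
    "v4-r3-c4": (14.51, 2.91),
    "v4-r2-c8": (11.57, 3.57),
    "v3-r4-c4": (15.09, 2.91),
    "v3-r3-c7": (12.56, 3.50),
    "v5-r2-c6": (14.41, 3.31),
    "v5-r3-c1": (19.29, 2.81),
}


def summarize(rows: Dict[str, Tuple[float, float]],
              baseline: str = "v4-r4-c0") -> Dict[str, Tuple[float, float]]:
    """Return {config: (total power in mW, % change versus the baseline)}."""
    base_total = sum(rows[baseline])
    summary = {}
    for name, (buffer_mw, link_mw) in rows.items():
        total = buffer_mw + link_mw
        change = 100.0 * (total - base_total) / base_total
        summary[name] = (round(total, 2), round(change, 2))
    return summary


summary = summarize(POWER_MW)
```

The same function applies to any other baseline row by passing a different `baseline` key.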