elastic systems

143
Elastic Systems Jordi Cortadella Universitat Politècnica de Catalunya Marc Galceran-Oms Universitat Politècnica de Catalunya Mike Kishinevsky Intel Corp.

Upload: vanig

Post on 24-Feb-2016

55 views

Category:

Documents


5 download

DESCRIPTION

Elastic Systems. Jordi Cortadella Universitat Politècnica de Catalunya Marc Galceran-Oms Universitat Politècnica de Catalunya Mike Kishinevsky Intel Corp. Elasticity. Leonardo da Vinci’s catapult. Asynchronous elastic pipeline. ReqIn. ReqOut. C. C. C. C. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Elastic Systems

Elastic Systems

Jordi Cortadella Universitat Politècnica de Catalunya

Marc Galceran-Oms Universitat Politècnica de Catalunya

Mike Kishinevsky Intel Corp.

Page 2: Elastic Systems

Elasticity

Leonardo da Vinci’s catapult

Page 3: Elastic Systems
Page 4: Elastic Systems
Page 5: Elastic Systems
Page 6: Elastic Systems
Page 7: Elastic Systems
Page 8: Elastic Systems

Asynchronous elastic pipeline

C

ReqIn ReqOut

AckIn AckOut

C C C

David Muller’s pipeline (late 50’s)Sutherland’s Micropipelines (1989)

Page 9: Elastic Systems

The specification of a complex system is usuallyasynchronous (functional units, messages, queues, …),

… however the clock appears when we move downto the implementation levels

(Bill Grundmann, 2004)

Page 10: Elastic Systems

Asynchronous elasticity

req

ack

CLK

Page 11: Elastic Systems

Synchronous elasticity

valid

stop

CLKLatency-insensitive systems (Carloni et al., 1999)Synchronous handshake circuits (Peeters et al, 2001)Synchronous elastic systems (Cortadella et al., 2006)Latency-Insensitive Bounded Dataflow Networks (Vijayaraghavan et al., 2009)Synchronous emulation of asynchronous circuits (O’Leary, 1997)

Page 12: Elastic Systems

Many systems are already elastic

AMBA AXI bus protocol

Handshake signals

Page 13: Elastic Systems

Time uncertainty in chip design

How manycycles ?

Page 14: Elastic Systems

Why elastic circuits now ?

Need to live with time uncertainty

Need to formalize time uncertainty– For synthesis– For verification

Need for modularity

Page 15: Elastic Systems

Behavioral equivalence in Elastic Circuits

+…

147

+e

… 348201…

48…147…201… 3

Page 16: Elastic Systems

Behavioral equivalence in Elastic Circuits

+…

147

+e

… 348201…

348…147…201…

tokenbubble

Traces a preserved after hiding bubbles(stream-based equivalence)

Page 17: Elastic Systems

Unpipelined system

Page 18: Elastic Systems

Pipelined system

Page 19: Elastic Systems

Write Buffer

Page 20: Elastic Systems

Communication channelreceiversender

Data Data

Long wires: slow transmission

Page 21: Elastic Systems

Pipelined communicationsender receiver

DataData

Page 22: Elastic Systems

sender receiver

DataData

Pipelined communication

Page 23: Elastic Systems

The Valid bitsender receiver

Data Data

Valid Valid

Page 24: Elastic Systems

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Page 25: Elastic Systems

The Stop bit

11000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Page 26: Elastic Systems

The Stop bit

11100

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Page 27: Elastic Systems

The Stop bit

11111

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Back-pressure

Page 28: Elastic Systems

The Stop bit

01111

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Page 29: Elastic Systems

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Page 30: Elastic Systems

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Page 31: Elastic Systems

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Page 32: Elastic Systems

The Stop bit

10000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Long combinational path

Page 33: Elastic Systems

Carloni’s relay stations (double storage)

shell

pearl

sender

shell

pearl

receiver

main main main

aux aux aux

Page 34: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

Page 35: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

Page 36: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

Page 37: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Page 38: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Page 39: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Page 40: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Page 41: Elastic Systems

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

• Handshakes with short wires• Double storage required

Page 42: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

FF FF

Page 43: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Page 44: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Page 45: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Page 46: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Page 47: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Flip-flops already have adouble storage capability, but …

Page 48: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Not allowed in conventionalFF-based design !

Page 49: Elastic Systems

Flip-flops vs. latches

sender receiver

1 cycle

H L LH

Let’s make the master/slave latches independent

Page 50: Elastic Systems

Flip-flops vs. latches

sender receiverH L H L

½ cycle ½ cycle

Let’s make the master/slave latches independent

Only half of the latches (H or L) can move tokens

Page 51: Elastic Systems

Latch-based elasticitysender receiver

V V V V

S S S S

En En En En

1 1

Data

Valid

Stop

Data

Valid

Stop

1 1

Page 52: Elastic Systems

Elastic netlists

ForkJoin

Join / Fork

EB

EBEB

EB

Enable signalto data latches

Page 53: Elastic Systems

Basic VS block

Si

Eni

Vi

Si-1

Vi-1

VSSi

Eni

Vi

Si-1

Vi-1

Page 54: Elastic Systems

Join

VS

+

V1

V2

S1

S2

V

S

VS

VS

Page 55: Elastic Systems

(Lazy) Fork

V1

V2

S1

S2

V

S

Page 56: Elastic Systems

Eager Fork

V1

V2

S1

S2

^

^

V

S

Page 57: Elastic Systems

Variable Latency Units

[0 - k] cycles

V/S V/S

donego clear

Page 58: Elastic Systems

Generalization: FIFOs(Bounded Dataflow Networks)

InOut

B1 B3

B2

Page 59: Elastic Systems

Elastic Buffers

=

=

Elastic Buffer with Token

Elastic Buffer with Bubble

m =Skid-Buffer (zero latency)

m

-kAnti-token injector

Page 60: Elastic Systems

Let’s do transformationsGoal:– Transform the system to improve performance, either

preserving or not preserving time (but preserving behavior)

Few transformations have been re-invented from asynchronous design and dataflow computation:

– Adding bubbles preserves behavior– Early evaluation and anti-tokens– Buffer resizing and slack matching to balance fork/join

structures

Page 61: Elastic Systems

Performance is abouttokens and bubbles

Page 62: Elastic Systems

How many bubbles do we need?

Page 63: Elastic Systems

How many bubbles do we need?

Page 64: Elastic Systems

How many bubbles do we need?

Page 65: Elastic Systems

How many bubbles do we need?

Page 66: Elastic Systems

How many bubbles do we need?

Page 67: Elastic Systems

How many bubbles do we need?

Page 68: Elastic Systems

How many bubbles do we need?

O(n2) cycles

Page 69: Elastic Systems

How many bubbles do we need?

O(n2) cycles

Page 70: Elastic Systems

How many bubbles do we need?

O(n2) cycles

Page 71: Elastic Systems

How many bubbles do we need?

O(n2) cycles

Page 72: Elastic Systems

How many bubbles do we need?

O(n2) cycles

Page 73: Elastic Systems

How many bubbles do we need?

O(n2) cycles

Page 74: Elastic Systems

How many bubbles do we need?

O(n2) cycles

Page 75: Elastic Systems

How many bubbles do we need?

O(n2) cycles

O(n) cycles

Page 76: Elastic Systems

How many bubbles do we need?

At least one bubble and one token per cycle, otherwise neither tokens nor bubbles can move (deadlock)

n/2 bubbles for optimum performance(in a balanced cycle)

Page 77: Elastic Systems

Performance of an N-stage ring

tokens

0 NN / 2

Throughput

Deadlock

Page 78: Elastic Systems

Adding bubbles (retiming & recycling)

9 8 6

4 4

39

10

4

10 - Combinational block with delay 10

- Initialized register (dot)

Cycle time is

Throughput is 1

Effective cycle time is 21

211916

Retiming can not do better!

Retiming and Recycling (R&R) can

Effective cycle time is 19

Effective cycle time is 16

Throughput is 4/5

Effective cycle time is 12 * 5/4 = 15

12

Find a minimal effective cycle time of the circuit represented as retiming graph (RG)!

The longest combinational path delayThe number of valid data/clock cyclecycle time/throughput

Retiming graph

5 registers, 4 tokens

Page 79: Elastic Systems

Any integer solution for r:

Final token assignment

Initial register assignment

Retiming vectors

R' contain tokens and anti-tokens

Retiming

Page 80: Elastic Systems

Any integer solution for R:

Retiming subset

R – max(R' ,0) = num of bubbles

Retiming & Recycling

-1

Page 81: Elastic Systems

Retiming & Recycling configuration

Throughput upper bound of R (Júlvez et al 2006)

Cycle time of R is at most (Bufistov et al 2007)

Effective cycle time(delay/throughput)

Non-convex quadratic optimization problem

Can be transformed into a Mixed IntegerLinear Programming model

Mixed Integer Program for R&R

Page 82: Elastic Systems

PC+4

Branch targetaddress

Example: next-PC calculation

Take branch

• Only wait for required inputs• Late arriving tokens are cancelled by anti-

tokens

No branch

Early Evaluation

Page 83: Elastic Systems

How to implement anti-tokens ?

Valid+ Valid+

Valid–

Valid+

Stop+ Stop+

Valid–

Stop–Stop–

+

-

Page 84: Elastic Systems

Dual elastic controllers

S+

V+

V-

S-

S+

V+

V-

S-

En En

Page 85: Elastic Systems

Fork/join

Dual fork/join Join with early evaluation

Page 86: Elastic Systems

Re-designing for average performance

FE

arly evaluation

slow / fast

Fslow

Ffast

Page 87: Elastic Systems

How can elasticity be used for design optimization?

In “regular” designs: – Take advantage of Don’t Cares = behaviors that never occur

In elastic designs: – Take advantage of “Little Cares” (LCs) = behaviors that rarely occur and “Critical Cores” (CC) = behaviors that occur often

– Can use variable latency: LCs can be made slower

Page 88: Elastic Systems

Exploiting “Little cares”

F100 G

100

Goal: minimize time per token (operation) executionMeasured by Effective Cycle Time, ECT

ECT = Clock Period / Throughput = 100 / 1 = 100

Design executes one token per 100 time units

Page 89: Elastic Systems

Exploiting “Little cares”

F100 G

100

CC1100 CC2

100

LC1100 LC2

100

p

1-p

CC150

CC250

Page 90: Elastic Systems

Exploiting “Little cares”

F100 G

100

LC1’’50 LC2’

50

p

1-p

CC150

CC250

LC2’’50

LC1’50

Page 91: Elastic Systems

0

20

40

60

80

100

120

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Performance as function of “critical core” probability

0

0.2

0.4

0.6

0.8

1

1.2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Throughput

Effective CT

Probability

Probability

original design

new design @ p = 0.9

1.67x performance improvement

Page 92: Elastic Systems

H.264 CABAC decoder

Gotmanov, Kishinevsky and Galceran-OmsEvaluation of flexible latencies: designing synchronous elastic H.264 CABAC decoderProc. Problems in design of micro- and nano-electronic systemsMoscow, Oct. 2010 (in Russian)

Page 93: Elastic Systems

Profiling

Page 94: Elastic Systems

H.264 CABAC decoder

Page 95: Elastic Systems

0,90

1,00

1,10

1,20

1,30

1,40

1,50

1,60

1,70

0,50 0,70 0,90 1,10 1,30 1,50 1,70 1,90

Original Optimized

Area

Effective Cycle Time

Page 96: Elastic Systems

Elastic Transforms

=

=

-i -j -(i+j)

-1=

-k

=

...

k

Bubble Insertion :(recycling)

Anti-token Insertion :

Add Capacity:

-1=

= k

-1 -1-1

Anti-token Grouping :

Anti-token retiming:

Anti-token Insertion:Multiple

kernel

derivative

Page 97: Elastic Systems

Retiming

F FA

B

CA

B

C

FA

B

CFA

B

C

2

2

Capacity sizing may be needed in case of sharing

Page 98: Elastic Systems

Register File Bypass

rdwd

=rdwd

=

ra

0

1

Read data rd is forwarded from previous write wd’iff read address ra is the same as previous write address wa’.

wd’

wa’RE

AD

WRI

TEwa

RE

AD

WR

ITE

rawa

Page 99: Elastic Systems

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Kam, et al. Correct-by-construction Microarchitectural Pipelining, ICCAD 08

Sequentialexecution: R B1 B2R A R B1 B2 R A

Page 100: Elastic Systems

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Pipelinedexecution: B2B1R

R A

B2B1RR A

B2B1R

Bypass if data dependency

???

B2B1R

Page 101: Elastic Systems

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Pipelinedexecution: B2B1R

R A

B2B1RR A

B2B1RB2B1R

Page 102: Elastic Systems

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Page 103: Elastic Systems

Pipelining using elasticity

A

B2B1

2 bypasses

rd

wd

=

wd’

wa’

RE

AD

WR

ITE

rawa

wd’’

wa’’

Page 104: Elastic Systems

Pipelining using elasticity

A

B2B1

Forwarding

rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

Page 105: Elastic Systems

Pipelining using elasticity

A

B2B1

Retiming

rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

Page 106: Elastic Systems

Pipelining using elasticity

A

B2B1

Retiming with anti-tokens

rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

-1

Anti-token insertion allows retiming combinations that are not possible in a conventional synchronous circuit

Page 107: Elastic Systems

Pipelining using elasticity

A

B2B1rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

-10

System only stalls in case of RAWdependencies with B1-B2

Latency=2, Tokens=1

Page 108: Elastic Systems

Initial Graph

Add bypasses to one or more

memory elements

Improve?

Set of near-

optimal design points

YesNo

Since throughput analysis methods are not exact for early evaluation, the best design points found during exploration must be simulated in a second phase of the algorithm to determine the best one.

Exploration Algorithm

R&R MILP method

Simulate near-optimal design points to obtain

actual performance

Page 109: Elastic Systems

Write Buffer

Page 110: Elastic Systems

Micro-architectural exploration

Page 111: Elastic Systems

ConclusionsRigid systems preserve timing equivalence(data always valid at every cycle)

Elastic systems waive timing equivalence to enablemore concurrency

(bubbles decrease throughput, but reduce cycle time)

A new avenue of performance optimizations can emerge to build correct-by-construction pipelines

Θ Θ

Page 112: Elastic Systems

Backup slides

Page 113: Elastic Systems

Retiming and Recycling

-2

MILP-based approach

Bufistov, et al. Retiming and Recycling for Elastic Systems withEarly Evaluation. DAC 09

- delays at nodes- elastic buffers and anti-tokens at edges

R&R finds a set of pareto-point designs with different cycle time / throughput trade-offs

0 1 1

0 1 1

1

1

-1

3/1=3

2/0.66=3

1/0.5=2

Page 114: Elastic Systems

Coarse grain elasticity

114

Page 115: Elastic Systems

Deadlocks

Optimal throughput

Page 116: Elastic Systems

Notation for elastic systemsElastic buffer with one token of information

Empty elastic buffer (bubble)

Channel with an injector of k negative tokens

-k

Empty elastic buffer with bypass (Skid-buffer with no tokens of information)

m

Latches=2Capacity=2Tokens=1

Latches=2Capacity=2Tokens=0

Latches=0Capacity=0Tokens= -k

Latches=0Capacity=mTokens=0

Page 117: Elastic Systems

Marked Graph model

Page 118: Elastic Systems

Marked Graph model

Page 119: Elastic Systems

Marked Graph model

Page 120: Elastic Systems

Dual Marked Graph model

Enabled !

Page 121: Elastic Systems

How to implement anti-tokens ?

Positive tokens

Negative tokens

Page 122: Elastic Systems

How to implement anti-tokens ?

Positive tokens

Negative tokens

Page 123: Elastic Systems

Elastic controllers

V

S

V

S

Data

H

H

L

L

L

H

V

S

V

S

En En

Page 124: Elastic Systems

Complex systems need to be elastic

Intel IXP422 Network Processor

Page 125: Elastic Systems

Example: DLX Pipeline

Block mux2 EB ID nextPC ALU F RF W RF R

Delay 1.5 3.15 6.0 3.75 13.0 80.0 6.0 11.0

Area 1.5 4.5 72 24 1600 8000 6000

• Memory read latency = 10

• P(ALU)=0.35• P(F) = 0.2• P(MLOAD)=0.25• P(MSTORE)=0.075• P(BR) = 0.125• Mem dependency = 0.5• RF dependency = 0.2• Depth(F) = [1,…,8]

Page 126: Elastic Systems

Pipelined DLX

Page 127: Elastic Systems

Preserving behavioral equivalence

Combinational logic synthesisCombinational equivalence checking

Page 128: Elastic Systems

Preserving behavioral equivalence

Sequential logic synthesisSequential equivalence checking

Page 129: Elastic Systems

Different flavors of Elastic Buffers

Flip-flop = Master & Slave

main

aux(Carloni, 1999)

Jacobson 2002Synchronous Interlocked Pipelines

Page 130: Elastic Systems

PC+4

Branch targetaddress

Example: next-PC calculation

Take branch

• Only wait for required inputs• Late arriving tokens are cancelled by anti-

tokens

No branch

Early Evaluation

Page 131: Elastic Systems

F

Critical cycle(late arrival)

F

F Shorter cycle(but F duplicated)

Shannon decomposition

Early evaluation is useless!!!

0

1

0

1

Page 132: Elastic Systems

Speculation with Shared Units

F

Shared unit

Scheduler

1. Speculate which channel the mux will choose next cycle and exec F on it

2. Stop the other elastic channel

3. If next cycle we realize a mistake has been made, exec F on the other channel

0

1

Page 133: Elastic Systems

Shared unit

Scheduler

00

p=0.95

p=0.05

1

Cycle : 11

Sequence of correct predictions

F

0

1

Page 134: Elastic Systems

Shared unit

00

1

1

Cycle : 2

Scheduler

F

1

2

2

Sequence of correct predictions

0

1

Page 135: Elastic Systems

Shared unit

10

Cycle : 3

0

2

Scheduler

F

1

Shared unit

3

2

Misprediction

0

1

Page 136: Elastic Systems

Shared unit

Cycle : 4

Scheduler

F

Shared unit

4 23

110 Correction

23

Correction

0

1

Page 137: Elastic Systems

2

Shared unit

Cycle : 5

Scheduler

F

Shared unit

23

1

45

34 2

Misprediction stalled 2 cyclesCorrection

0

1

Page 138: Elastic Systems

Error Correction using Speculation

wd rd

F1 F2ECC

REGFILE

REA

D

WR

ITE

Page 139: Elastic Systems

Error Correction using Speculation

wd rd

F2ECC

REGFILE

REA

D

WR

ITE

F1

shared

Page 140: Elastic Systems

Pipelining using elasticity

A

B2B1

Retiming

=wa’

RE

AD

WR

ITE

rawa

wa’’

Page 141: Elastic Systems

Pipelining using elasticity

A

B2B1

=wa’

RE

AD

WR

ITE

rawa

wa’’

Page 142: Elastic Systems

Pipelining using elasticity

A

B2B1

=wa’

RE

AD

WR

ITE

rawa

wa’’

-1

Page 143: Elastic Systems

Pipelining using elasticity

A

B2B1

2 bypasses

=wa’

RE

AD

WR

ITE

rawa

wa’’

-1

System only stalls in case of RAWdependencies with B1-B2