elastic systems

Post on 24-Feb-2016

55 Views

Category:

Documents

5 Downloads

Preview:

Click to see full reader

DESCRIPTION

Elastic Systems. Jordi Cortadella Universitat Politècnica de Catalunya Marc Galceran-Oms Universitat Politècnica de Catalunya Mike Kishinevsky Intel Corp. Elasticity. Leonardo da Vinci’s catapult. Asynchronous elastic pipeline. ReqIn. ReqOut. C. C. C. C. - PowerPoint PPT Presentation

TRANSCRIPT

Elastic Systems

Jordi Cortadella Universitat Politècnica de Catalunya

Marc Galceran-Oms Universitat Politècnica de Catalunya

Mike Kishinevsky Intel Corp.

Elasticity

Leonardo da Vinci’s catapult

Asynchronous elastic pipeline

C

ReqIn ReqOut

AckIn AckOut

C C C

David Muller’s pipeline (late 50’s)Sutherland’s Micropipelines (1989)

The specification of a complex system is usuallyasynchronous (functional units, messages, queues, …),

… however the clock appears when we move downto the implementation levels

(Bill Grundmann, 2004)

Asynchronous elasticity

req

ack

CLK

Synchronous elasticity

valid

stop

CLKLatency-insensitive systems (Carloni et al., 1999)Synchronous handshake circuits (Peeters et al, 2001)Synchronous elastic systems (Cortadella et al., 2006)Latency-Insensitive Bounded Dataflow Networks (Vijayaraghavan et al., 2009)Synchronous emulation of asynchronous circuits (O’Leary, 1997)

Many systems are already elastic

AMBA AXI bus protocol

Handshake signals

Time uncertainty in chip design

How manycycles ?

Why elastic circuits now ?

Need to live with time uncertainty

Need to formalize time uncertainty– For synthesis– For verification

Need for modularity

Behavioral equivalence in Elastic Circuits

+…

147

+e

… 348201…

48…147…201… 3

Behavioral equivalence in Elastic Circuits

+…

147

+e

… 348201…

348…147…201…

tokenbubble

Traces a preserved after hiding bubbles(stream-based equivalence)

Unpipelined system

Pipelined system

Write Buffer

Communication channelreceiversender

Data Data

Long wires: slow transmission

Pipelined communicationsender receiver

DataData

sender receiver

DataData

Pipelined communication

The Valid bitsender receiver

Data Data

Valid Valid

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

The Stop bit

11000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

The Stop bit

11100

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

The Stop bit

11111

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Back-pressure

The Stop bit

01111

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

The Stop bit

00000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

The Stop bit

10000

sender

Data

Valid

Stop

receiver

Data

Valid

Stop

Long combinational path

Carloni’s relay stations (double storage)

shell

pearl

sender

shell

pearl

receiver

main main main

aux aux aux

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

sender

shell

pearl

receiver

Carloni’s relay stations (double storage)

main main main

aux aux aux

shell

pearl

receiver

shell

pearl

sender

• Handshakes with short wires• Double storage required

Flip-flops vs. latches

sender receiver

1 cycle

FF FF

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Flip-flops already have adouble storage capability, but …

Flip-flops vs. latches

sender receiver

1 cycle

H L H L

Not allowed in conventionalFF-based design !

Flip-flops vs. latches

sender receiver

1 cycle

H L LH

Let’s make the master/slave latches independent

Flip-flops vs. latches

sender receiverH L H L

½ cycle ½ cycle

Let’s make the master/slave latches independent

Only half of the latches (H or L) can move tokens

Latch-based elasticitysender receiver

V V V V

S S S S

En En En En

1 1

Data

Valid

Stop

Data

Valid

Stop

1 1

Elastic netlists

ForkJoin

Join / Fork

EB

EBEB

EB

Enable signalto data latches

Basic VS block

Si

Eni

Vi

Si-1

Vi-1

VSSi

Eni

Vi

Si-1

Vi-1

Join

VS

+

V1

V2

S1

S2

V

S

VS

VS

(Lazy) Fork

V1

V2

S1

S2

V

S

Eager Fork

V1

V2

S1

S2

^

^

V

S

Variable Latency Units

[0 - k] cycles

V/S V/S

donego clear

Generalization: FIFOs(Bounded Dataflow Networks)

InOut

B1 B3

B2

Elastic Buffers

=

=

Elastic Buffer with Token

Elastic Buffer with Bubble

m =Skid-Buffer (zero latency)

m

-kAnti-token injector

Let’s do transformationsGoal:– Transform the system to improve performance, either

preserving or not preserving time (but preserving behavior)

Few transformations have been re-invented from asynchronous design and dataflow computation:

– Adding bubbles preserves behavior– Early evaluation and anti-tokens– Buffer resizing and slack matching to balance fork/join

structures

Performance is abouttokens and bubbles

How many bubbles do we need?

How many bubbles do we need?

How many bubbles do we need?

How many bubbles do we need?

How many bubbles do we need?

How many bubbles do we need?

How many bubbles do we need?

O(n2) cycles

How many bubbles do we need?

O(n2) cycles

How many bubbles do we need?

O(n2) cycles

How many bubbles do we need?

O(n2) cycles

How many bubbles do we need?

O(n2) cycles

How many bubbles do we need?

O(n2) cycles

How many bubbles do we need?

O(n2) cycles

How many bubbles do we need?

O(n2) cycles

O(n) cycles

How many bubbles do we need?

At least one bubble and one token per cycle, otherwise neither tokens nor bubbles can move (deadlock)

n/2 bubbles for optimum performance(in a balanced cycle)

Performance of an N-stage ring

tokens

0 NN / 2

Throughput

Deadlock

Adding bubbles (retiming & recycling)

9 8 6

4 4

39

10

4

10 - Combinational block with delay 10

- Initialized register (dot)

Cycle time is

Throughput is 1

Effective cycle time is 21

211916

Retiming can not do better!

Retiming and Recycling (R&R) can

Effective cycle time is 19

Effective cycle time is 16

Throughput is 4/5

Effective cycle time is 12 * 5/4 = 15

12

Find a minimal effective cycle time of the circuit represented as retiming graph (RG)!

The longest combinational path delayThe number of valid data/clock cyclecycle time/throughput

Retiming graph

5 registers, 4 tokens

Any integer solution for r:

Final token assignment

Initial register assignment

Retiming vectors

R' contain tokens and anti-tokens

Retiming

Any integer solution for R:

Retiming subset

R – max(R' ,0) = num of bubbles

Retiming & Recycling

-1

Retiming & Recycling configuration

Throughput upper bound of R (Júlvez et al 2006)

Cycle time of R is at most (Bufistov et al 2007)

Effective cycle time(delay/throughput)

Non-convex quadratic optimization problem

Can be transformed into a Mixed IntegerLinear Programming model

Mixed Integer Program for R&R

PC+4

Branch targetaddress

Example: next-PC calculation

Take branch

• Only wait for required inputs• Late arriving tokens are cancelled by anti-

tokens

No branch

Early Evaluation

How to implement anti-tokens ?

Valid+ Valid+

Valid–

Valid+

Stop+ Stop+

Valid–

Stop–Stop–

+

-

Dual elastic controllers

S+

V+

V-

S-

S+

V+

V-

S-

En En

Fork/join

Dual fork/join Join with early evaluation

Re-designing for average performance

FE

arly evaluation

slow / fast

Fslow

Ffast

How can elasticity be used for design optimization?

In “regular” designs: – Take advantage of Don’t Cares = behaviors that never occur

In elastic designs: – Take advantage of “Little Cares” (LCs) = behaviors that rarely occur and “Critical Cores” (CC) = behaviors that occur often

– Can use variable latency: LCs can be made slower

Exploiting “Little cares”

F100 G

100

Goal: minimize time per token (operation) executionMeasured by Effective Cycle Time, ECT

ECT = Clock Period / Throughput = 100 / 1 = 100

Design executes one token per 100 time units

Exploiting “Little cares”

F100 G

100

CC1100 CC2

100

LC1100 LC2

100

p

1-p

CC150

CC250

Exploiting “Little cares”

F100 G

100

LC1’’50 LC2’

50

p

1-p

CC150

CC250

LC2’’50

LC1’50

0

20

40

60

80

100

120

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Performance as function of “critical core” probability

0

0.2

0.4

0.6

0.8

1

1.2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Throughput

Effective CT

Probability

Probability

original design

new design @ p = 0.9

1.67x performance improvement

H.264 CABAC decoder

Gotmanov, Kishinevsky and Galceran-OmsEvaluation of flexible latencies: designing synchronous elastic H.264 CABAC decoderProc. Problems in design of micro- and nano-electronic systemsMoscow, Oct. 2010 (in Russian)

Profiling

H.264 CABAC decoder

0,90

1,00

1,10

1,20

1,30

1,40

1,50

1,60

1,70

0,50 0,70 0,90 1,10 1,30 1,50 1,70 1,90

Original Optimized

Area

Effective Cycle Time

Elastic Transforms

=

=

-i -j -(i+j)

-1=

-k

=

...

k

Bubble Insertion :(recycling)

Anti-token Insertion :

Add Capacity:

-1=

= k

-1 -1-1

Anti-token Grouping :

Anti-token retiming:

Anti-token Insertion:Multiple

kernel

derivative

Retiming

F FA

B

CA

B

C

FA

B

CFA

B

C

2

2

Capacity sizing may be needed in case of sharing

Register File Bypass

rdwd

=rdwd

=

ra

0

1

Read data rd is forwarded from previous write wd’iff read address ra is the same as previous write address wa’.

wd’

wa’RE

AD

WRI

TEwa

RE

AD

WR

ITE

rawa

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Kam, et al. Correct-by-construction Microarchitectural Pipelining, ICCAD 08

Sequentialexecution: R B1 B2R A R B1 B2 R A

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Pipelinedexecution: B2B1R

R A

B2B1RR A

B2B1R

Bypass if data dependency

???

B2B1R

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Pipelinedexecution: B2B1R

R A

B2B1RR A

B2B1RB2B1R

Pipelining using elasticity

rdwd

ra

READ

WRI

TE

wa

A

B2B1

Pipelining using elasticity

A

B2B1

2 bypasses

rd

wd

=

wd’

wa’

RE

AD

WR

ITE

rawa

wd’’

wa’’

Pipelining using elasticity

A

B2B1

Forwarding

rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

Pipelining using elasticity

A

B2B1

Retiming

rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

Pipelining using elasticity

A

B2B1

Retiming with anti-tokens

rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

-1

Anti-token insertion allows retiming combinations that are not possible in a conventional synchronous circuit

Pipelining using elasticity

A

B2B1rd

=wa’

RE

AD

WR

ITE

rawa

wa’’

-10

System only stalls in case of RAWdependencies with B1-B2

Latency=2, Tokens=1

Initial Graph

Add bypasses to one or more

memory elements

Improve?

Set of near-

optimal design points

YesNo

Since throughput analysis methods are not exact for early evaluation, the best design points found during exploration must be simulated in a second phase of the algorithm to determine the best one.

Exploration Algorithm

R&R MILP method

Simulate near-optimal design points to obtain

actual performance

Write Buffer

Micro-architectural exploration

ConclusionsRigid systems preserve timing equivalence(data always valid at every cycle)

Elastic systems waive timing equivalence to enablemore concurrency

(bubbles decrease throughput, but reduce cycle time)

A new avenue of performance optimizations can emerge to build correct-by-construction pipelines

Θ Θ

Backup slides

Retiming and Recycling

-2

MILP-based approach

Bufistov, et al. Retiming and Recycling for Elastic Systems withEarly Evaluation. DAC 09

- delays at nodes- elastic buffers and anti-tokens at edges

R&R finds a set of pareto-point designs with different cycle time / throughput trade-offs

0 1 1

0 1 1

1

1

-1

3/1=3

2/0.66=3

1/0.5=2

Coarse grain elasticity

114

Deadlocks

Optimal throughput

Notation for elastic systemsElastic buffer with one token of information

Empty elastic buffer (bubble)

Channel with an injector of k negative tokens

-k

Empty elastic buffer with bypass (Skid-buffer with no tokens of information)

m

Latches=2Capacity=2Tokens=1

Latches=2Capacity=2Tokens=0

Latches=0Capacity=0Tokens= -k

Latches=0Capacity=mTokens=0

Marked Graph model

Marked Graph model

Marked Graph model

Dual Marked Graph model

Enabled !

How to implement anti-tokens ?

Positive tokens

Negative tokens

How to implement anti-tokens ?

Positive tokens

Negative tokens

Elastic controllers

V

S

V

S

Data

H

H

L

L

L

H

V

S

V

S

En En

Complex systems need to be elastic

Intel IXP422 Network Processor

Example: DLX Pipeline

Block mux2 EB ID nextPC ALU F RF W RF R

Delay 1.5 3.15 6.0 3.75 13.0 80.0 6.0 11.0

Area 1.5 4.5 72 24 1600 8000 6000

• Memory read latency = 10

• P(ALU)=0.35• P(F) = 0.2• P(MLOAD)=0.25• P(MSTORE)=0.075• P(BR) = 0.125• Mem dependency = 0.5• RF dependency = 0.2• Depth(F) = [1,…,8]

Pipelined DLX

Preserving behavioral equivalence

Combinational logic synthesisCombinational equivalence checking

Preserving behavioral equivalence

Sequential logic synthesisSequential equivalence checking

Different flavors of Elastic Buffers

Flip-flop = Master & Slave

main

aux(Carloni, 1999)

Jacobson 2002Synchronous Interlocked Pipelines

PC+4

Branch targetaddress

Example: next-PC calculation

Take branch

• Only wait for required inputs• Late arriving tokens are cancelled by anti-

tokens

No branch

Early Evaluation

F

Critical cycle(late arrival)

F

F Shorter cycle(but F duplicated)

Shannon decomposition

Early evaluation is useless!!!

0

1

0

1

Speculation with Shared Units

F

Shared unit

Scheduler

1. Speculate which channel the mux will choose next cycle and exec F on it

2. Stop the other elastic channel

3. If next cycle we realize a mistake has been made, exec F on the other channel

0

1

Shared unit

Scheduler

00

p=0.95

p=0.05

1

Cycle : 11

Sequence of correct predictions

F

0

1

Shared unit

00

1

1

Cycle : 2

Scheduler

F

1

2

2

Sequence of correct predictions

0

1

Shared unit

10

Cycle : 3

0

2

Scheduler

F

1

Shared unit

3

2

Misprediction

0

1

Shared unit

Cycle : 4

Scheduler

F

Shared unit

4 23

110 Correction

23

Correction

0

1

2

Shared unit

Cycle : 5

Scheduler

F

Shared unit

23

1

45

34 2

Misprediction stalled 2 cyclesCorrection

0

1

Error Correction using Speculation

wd rd

F1 F2ECC

REGFILE

REA

D

WR

ITE

Error Correction using Speculation

wd rd

F2ECC

REGFILE

REA

D

WR

ITE

F1

shared

Pipelining using elasticity

A

B2B1

Retiming

=wa’

RE

AD

WR

ITE

rawa

wa’’

Pipelining using elasticity

A

B2B1

=wa’

RE

AD

WR

ITE

rawa

wa’’

Pipelining using elasticity

A

B2B1

=wa’

RE

AD

WR

ITE

rawa

wa’’

-1

Pipelining using elasticity

A

B2B1

2 bypasses

=wa’

RE

AD

WR

ITE

rawa

wa’’

-1

System only stalls in case of RAWdependencies with B1-B2

top related