UDB parallel in/out: confusion about internal timing/states


RaAl_264636

Hi,

in preparation for migrating some custom components from Verilog to the UDB ALU function block, I used the parallel PI/PO example of AN82156 as a base. This example is an 8-bit adder in which one summand comes from the D0 register, written by the CPU, and the other comes from the UDB parallel input. The example component switches the ALU input dynamically between A0/A1 and the parallel input PI, and uses two states: fetch PI and add.

AN82156 uses the state diagram below:

pastedImage_0.png

The functional Verilog code:

STATE_LOAD:
begin
    state <= STATE_ADD;
    /* we must latch the PO value here, because in the next state PO is not valid */
    Parallel_Out <= po;
end

STATE_ADD:
begin
    state <= STATE_LOAD;
end

So, from the above I'd expect the component to need two clock cycles to get the result ready. However, it seems to take three clock cycles until the result is ready. Can anyone explain why three clock pulses are needed?

In the datapath description, the "Ax WR source" entry explicitly states that the Ax register is written _after_ the ALU operation has completed, so an ALU operation can't be expected to finish immediately when the clock rising edge occurs. But from the documentation, I'd expect the result of an ALU instruction to be ready when the next clock rising edge occurs, so the above should need only two clock cycles. Where am I wrong? Does each ALU operation need two clock cycles until it is ready, with the next operation starting on the next clock edge (interleaved operation)?

Regards

1 Solution

Hello BharadhwajaS_91,

I've made some additional tests. My initial statement is partially wrong: the component doesn't need three clock cycles in general, but the result is ready with an offset of one clock pulse. I wrote a small test program to verify this; the flow is as follows:

1) set the parallel input value, starting at 0x00 and incrementing for each run; D0 (Add_Value) remains at 0x02 as prepared in the example

2) force a clock pulse  //transition from state = load to state = add

3) read the output

4) force a clock pulse  //transition from state = add to state = load

5) read the output

6) back to #1

The output:

First clock pulse, input   0, result   0    //after state = load
Second clock pulse, input   0, result   0    //after state = add (here's the point where I'd expect result = input + 2)
First clock pulse, input   1, result   2    //next run, input incremented, but result from previous run => offset of one clock cycle
Second clock pulse, input   1, result   2
First clock pulse, input   2, result   3
Second clock pulse, input   2, result   3
First clock pulse, input   3, result   4
...
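The trace above can be reproduced with a small behavioral model. This is only a sketch under assumptions that are not confirmed anywhere in this thread: that State_Load writes PI into A0, that State_Add writes A0 + D0 back into A0, and that PO always shows A0. The names `OriginalAdder` and `run_test` are made up for illustration and are not part of AN82156 or any PSoC API.

```python
from dataclasses import dataclass

STATE_LOAD, STATE_ADD = 0, 1


@dataclass
class OriginalAdder:
    """Behavioral guess at the example adder (not the real datapath netlist)."""
    d0: int = 0x02          # summand written by the CPU (Add_Value)
    a0: int = 0             # accumulator; assumed to be what PO shows
    parallel_out: int = 0   # register latched from PO in STATE_LOAD
    state: int = STATE_LOAD

    def clock(self, pi: int) -> None:
        """Apply one rising clock edge with `pi` on the parallel input."""
        if self.state == STATE_LOAD:
            # Like Verilog nonblocking assignments, both updates see the
            # pre-edge A0, so PO still carries the previous run's sum.
            self.parallel_out = self.a0
            self.a0 = pi & 0xFF                    # assumed: A0 <= PI
            self.state = STATE_ADD
        else:
            self.a0 = (self.a0 + self.d0) & 0xFF   # assumed: A0 <= A0 + D0
            self.state = STATE_LOAD


def run_test(dut, inputs):
    """Test flow from the post: set PI, pulse, read, pulse, read."""
    trace = []
    for pi in inputs:
        for _ in range(2):
            dut.clock(pi)
            trace.append((pi, dut.parallel_out))
    return trace


if __name__ == "__main__":
    for pi, result in run_test(OriginalAdder(), range(4)):
        print(f"input {pi:3d}, result {result:3d}")
```

In this model the sum is already in A0 after the second edge, but Parallel_Out is only refreshed in State_Load, so the sum first becomes visible on the third edge. That would account for the one-pulse offset, provided the assumptions hold.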

So, I modified the parallel adder UDB implementation:

1) the dynamic reg 0 configuration is changed to ADD, SRCA = A0, SRCB = D0, A0_WR_SRC = ALU, A1_WR_SOURCE = NONE

2) the dynamic reg 0 configuration is copied into reg 1, they're now both the same

3) State_Add is modified so that it now also latches PO

4) a ready output signal is added, assigned the inverted value of the state machine's state

The output:

First clock pulse, input   0, result   0
Second clock pulse, input   0, result   2    //output ready on the second clock pulse as expected
First clock pulse, input   1, result   0
Second clock pulse, input   1, result   3
First clock pulse, input   2, result   0
Second clock pulse, input   2, result   4
First clock pulse, input   3, result   0
...

Note that the result is read through a status register in sticky configuration whose clock input is the ready signal. This verifies that the result changes with the rising edge of the ready signal (which is why the result after the first clock pulse is always zero).
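The modified component and its readout can be sketched in the same behavioral style. Several things here are assumptions rather than facts from the thread: that PI still replaces A0 as SRCA in State_Load only, that State_Load is encoded as 0 (so ready, the inverted state, rises on the add-to-load transition), and that the sticky status register ORs in its input on each of its clock edges and clears on read. `StickyStatus`, `ModifiedAdder` and `run_test` are hypothetical names.

```python
STATE_LOAD, STATE_ADD = 0, 1   # assumed encoding, so ready = not state


class StickyStatus:
    """Sticky status register model: OR in the input when clocked, clear on read."""

    def __init__(self):
        self.value = 0

    def sample(self, data):
        self.value |= data & 0xFF

    def read(self):
        value, self.value = self.value, 0
        return value


class ModifiedAdder:
    """Behavioral guess at the modified adder (both dynamic configs = ADD)."""

    def __init__(self, d0=0x02):
        self.d0 = d0
        self.a0 = 0
        self.parallel_out = 0
        self.state = STATE_LOAD
        self.status = StickyStatus()

    @property
    def ready(self):
        return int(self.state == STATE_LOAD)   # inverted state bit

    def clock(self, pi):
        ready_before = self.ready
        # Assumed: PI is routed to SRCA only while in STATE_LOAD.
        src_a = pi if self.state == STATE_LOAD else self.a0
        self.parallel_out = self.a0            # PO latched in both states
        self.a0 = (src_a + self.d0) & 0xFF     # ADD with D0 in both states
        self.state = STATE_ADD if self.state == STATE_LOAD else STATE_LOAD
        if not ready_before and self.ready:    # rising edge of ready
            self.status.sample(self.parallel_out)


def run_test(dut, inputs):
    """Same flow as before, but reading through the sticky status register."""
    trace = []
    for pi in inputs:
        for _ in range(2):
            dut.clock(pi & 0xFF)
            trace.append((pi, dut.status.read()))
    return trace
```

Under these assumptions, `run_test(ModifiedAdder(), range(4))` yields zero after each first pulse and input + 2 after each second pulse, matching the trace above. The extra addition performed in State_Add leaves a stale value in A0, which the next State_Load overwrites.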

So, it seems that this modification improves the parallel adder functionality. I assume(!) the PI/PO example uses the two different state configurations for simplification, but the clock offset was not taken into account.

The above modifications are quick'n'dirty, but this was really helpful in learning UDB. The next step is to extend the adder to get both summands by parallel input. Not sure if this can also be done in two cycles.

Regards

bharadhwajas_91

Hi,

I'm guessing the reason for 3 clock cycles is that the second state still has A1 as the source 1 (left-hand side) input of the ALU, while the actual result is stored in A0. The parallel output value is always the left input to the datapath ALU, and the left input to the ALU can only be A0 or A1. You can add a pin toggle in State_Load and probe it together with the clock to check if this is true.


Hello BharadhwajaS_91,

I'm not sure this explanation is right: at first glance, a two-state machine can't have the result ready in three states, right? I'll do the test with the pin toggle.

However, it's not clearly described how the ALU works when PI and PO are used. You wrote that the PO output is the left input of the ALU, which is A0, A1 or PI. What is PO when the ALU input is PI? Is it also PI, or is it the value of SRCA?

Regards


PO is tapped from the accumulator branch on the input side of the mux between PI and the accumulators, so it can only be A0 or A1.

You can see this if you hook up the PO output to a control reg and observe it while changing the PI value.

A0/A1 ------+-------> PO
            |
           Mux -----> SRCA
            |
PI ---------+
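As a minimal sketch of the routing described above, assuming it is exactly as stated in this reply, the function below returns PO and the ALU's left input for one set of signals. `datapath_taps` and its parameter names are hypothetical and do not correspond to any PSoC register or macro.

```python
def datapath_taps(a0, a1, pi, src_a_is_a1=False, pi_selected=False):
    """Return (po, src_a) for the routing in the diagram."""
    acc = a1 if src_a_is_a1 else a0      # accumulator branch (A0 or A1)
    po = acc                             # PO is tapped before the PI mux
    src_a = pi if pi_selected else acc   # mux output feeds the ALU
    return po, src_a


if __name__ == "__main__":
    # With PI selected, the ALU sees PI but PO still shows the accumulator.
    print(datapath_taps(a0=0x10, a1=0x20, pi=0x55, pi_selected=True))
```

In this model, changing PI never changes PO, which is the behavior the suggested experiment would check.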

Thanks
