UDB parallel in/out: confusion about internal timing/states


RaAl_264636

Hi,

In preparation for migrating some custom components from Verilog to the UDB ALU function block, I used the parallel PI/PO example from AN82156 as a base. This example is an 8-bit adder: one summand comes from the D0 register, written by the CPU, and the other comes from the UDB parallel input. The example component dynamically switches the ALU input between A0/A1 and the parallel input PI, and uses two states: fetch PI and add.

AN82156 uses the state diagram below:

[Image: state diagram from AN82156]

The functional Verilog code:

STATE_LOAD:
begin
    state <= STATE_ADD;
    /* We must latch the PO value here, because in the next state PO is not valid. */
    Parallel_Out <= po;
end

STATE_ADD:
begin
    state <= STATE_LOAD;
end

So, from the above I'd expect the component needs two clock cycles to get the result ready. However, it seems it takes three clock cycles until the result is ready. Can anyone explain why three clock pulses are needed?

In the datapath description, the "Ax WR source" entry explicitly states that the Ax register is written _after_ the ALU operation has completed, so an ALU operation cannot be expected to finish immediately at the rising clock edge. But from the documentation, I'd expect the result of an ALU instruction to be ready by the next rising clock edge, so the above should need only two clock cycles. Where am I wrong? Does each ALU operation need two clock cycles to complete, with the next operation starting on the next clock edge (interleaved operation)?

Regards


Hello BharadhwajaS_91,

I've made some additional tests. My initial statement is partially wrong: the component doesn't need three clock cycles in general, but the result is ready with an offset of one clock pulse. I wrote a small test program to verify this; the flow is as follows:

1) set the parallel input value, starting at 0x00 and incrementing on each run; D0 (Add_Value) remains at 0x02 as prepared in the example

2) force a clock pulse  //transition from state = load to state = add

3) read the output

4) force a clock pulse  //transition from state = add to state = load

5) read the output

6) back to #1

The output:

First clock pulse, input   0, result   0        //after state = load

Second clock pulse, input   0, result   0    //after state = add (here's the point where I'd expect result = input + 2)

First clock pulse, input   1, result   2       //next run, input incremented, but result from previous run => offset of one clock cycle

Second clock pulse, input   1, result   2

First clock pulse, input   2, result   3

Second clock pulse, input   2, result   3

First clock pulse, input   3, result   4

...
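
The one-clock offset in this output can be reproduced with a small behavioral model in C. This is only a sketch under stated assumptions, not the actual datapath: it assumes that PO mirrors the A0 register, that the load state writes PI into A0, that the add state writes A0 + D0 into A0, and that all registers update together on the rising clock edge. The names adder_t, clock_pulse and run_test are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Behavioral model only. Assumption: PO mirrors the A0 register and all
 * registers update together on the rising clock edge. */
typedef enum { STATE_LOAD, STATE_ADD } state_t;

typedef struct {
    state_t state;
    uint8_t a0;            /* datapath accumulator A0 */
    uint8_t d0;            /* summand written by the CPU */
    uint8_t parallel_out;  /* Verilog register latched from po */
} adder_t;

/* One rising clock edge; pi is the value on the parallel input. */
void clock_pulse(adder_t *m, uint8_t pi)
{
    assert(m != NULL);
    uint8_t po = m->a0;                    /* value visible before the edge */
    if (m->state == STATE_LOAD) {
        m->parallel_out = po;              /* latch the previous result */
        m->a0 = pi;                        /* assumed: A0 <= PI */
        m->state = STATE_ADD;
    } else {
        m->a0 = (uint8_t)(m->a0 + m->d0);  /* assumed: A0 <= A0 + D0 */
        m->state = STATE_LOAD;             /* Parallel_Out is not latched here */
    }
}

/* Repeats the test flow: set input, pulse, read, pulse, read. */
void run_test(uint8_t runs, uint8_t first[], uint8_t second[])
{
    adder_t m = { STATE_LOAD, 0, 0x02, 0 };
    for (uint8_t in = 0; in < runs; in++) {
        clock_pulse(&m, in);               /* load -> add */
        first[in] = m.parallel_out;
        clock_pulse(&m, in);               /* add -> load */
        second[in] = m.parallel_out;
    }
}
```

Calling run_test(4, first, second) fills both arrays with 0, 2, 3, 4, which matches the output above. In this model Parallel_Out is latched only in the load state, so each sum becomes visible one clock pulse after the add state has run.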

So, I modified the parallel adder UDB implementation:

1) the dynamic reg 0 configuration is changed to ADD, SRCA = A0, SRCB = D0, A0_WR_SRC = ALU, A1_WR_SOURCE = NONE

2) the dynamic reg 0 configuration is copied into reg 1, they're now both the same

3) State_Add is modified so that it now also latches PO

4) a ready output signal is added and assigned the inverted state of the state machine

The output:

First clock pulse, input   0, result   0

Second clock pulse, input   0, result   2  //output ready on the second clock pulse as expected

First clock pulse, input   1, result   0

Second clock pulse, input   1, result   3

First clock pulse, input   2, result   0

Second clock pulse, input   2, result   4

First clock pulse, input   3, result   0

...

Note that the result is read through a status register in sticky configuration whose clock input is the ready signal. This verifies that the result changes on the rising edge of the ready signal (which is why the result after the first clock pulse is always zero).
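
The zero after every first clock pulse can be illustrated with a minimal C model of the readout. It is a sketch under assumptions: the status register is modeled as sticky with clear on read, capturing only on a rising edge of its clock input; ready is assumed to fall on the first pulse and rise on the second; and the captured value input + 2 is taken from the reported results rather than derived from a datapath model. The names sticky_reg_t, sticky_clock, sticky_read and run_sticky_test are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed behavior: sticky bits, set on a rising clock edge, cleared on read. */
typedef struct {
    uint8_t bits;      /* captured (sticky) bits */
    uint8_t last_clk;  /* previous level of the clock input */
} sticky_reg_t;

/* Present a new clock level and data value to the register. */
void sticky_clock(sticky_reg_t *r, uint8_t clk, uint8_t data)
{
    assert(r != NULL);
    if (clk && !r->last_clk) {
        r->bits |= data;   /* rising edge: ones in data stick */
    }
    r->last_clk = clk;
}

/* CPU read: return the captured bits and clear them. */
uint8_t sticky_read(sticky_reg_t *r)
{
    assert(r != NULL);
    uint8_t value = r->bits;
    r->bits = 0;
    return value;
}

/* Two pulses per run: ready falls on the first and rises on the second.
 * The data value input + 2 is taken from the reported results, not derived. */
void run_sticky_test(uint8_t runs, uint8_t first[], uint8_t second[])
{
    sticky_reg_t r = { 0, 1 };            /* ready starts high (assumed) */
    for (uint8_t in = 0; in < runs; in++) {
        uint8_t sum = (uint8_t)(in + 2);
        sticky_clock(&r, 0, sum);         /* first pulse: ready falls */
        first[in] = sticky_read(&r);
        sticky_clock(&r, 1, sum);         /* second pulse: ready rises */
        second[in] = sticky_read(&r);
    }
}
```

With three runs, the first reads are all 0 and the second reads are 2, 3, 4, the same pattern as in the output above: the previous read cleared the register, and nothing new is captured until ready rises again.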

So, it seems that this modification improves the parallel adder functionality. I assume(!) the PI/PO example uses the two different state configurations for simplification, but the clock offset was not taken into account.

The above modifications are quick and dirty, but they were really helpful in learning UDB. The next step is to extend the adder so that both summands come from the parallel input. I'm not sure whether this can also be done in two cycles.

Regards

