My guess for the 3 clock cycles is that the second state still has A1 as the source 1 (left-hand side) input of the ALU, while the actual result is stored in A0. The parallel output value is always the left input to the datapath ALU, and that left input can only be A0 or A1. You can toggle a pin in State_Load and probe it together with the clock to check whether this is true.
I'm not sure this explanation is right. At first glance, a two-state machine can't take three states to have the result ready, right? I'll do the pin-toggle test.
However, it's not clearly described how the ALU works when PI and PO are used. You wrote that the PO output is the left input of the ALU, which is A0, A1 or PI. What is PO when the ALU input is PI? Is it also PI, or is it the value of SRCA?
The PO output is tapped from the accumulator-side input of the mux that selects between PI and the accumulators, so it can only be A0 or A1.
You can see this if you hook up the output of PO to control reg and observe it while changing the PI value.
A0 / A1 ---+---> mux (PI or accumulator) ---> ALU left input
           |
           +--------> PO
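To make the mux arrangement concrete, here is a minimal Python sketch of the routing as described above. It's a toy model, not the real datapath: the function name `datapath_routing` and its parameters are made up for illustration, and it only assumes what the explanation states, namely that PO is taken from the accumulator side of the mux.

```python
def datapath_routing(a0, a1, pi, srca, pi_select):
    """Toy model of the routing described above (all names are hypothetical).

    srca      -- "A0" or "A1", the accumulator chosen as ALU source A
    pi_select -- True if the parallel input replaces the accumulator
                 at the ALU's left input
    Returns (alu_left_input, po).
    """
    if srca not in ("A0", "A1"):
        raise ValueError("srca must be 'A0' or 'A1'")
    accumulator = a0 if srca == "A0" else a1
    po = accumulator  # assumed: PO is tapped before the PI mux
    alu_left = pi if pi_select else accumulator
    return alu_left, po


# Changing PI changes the ALU input, but PO keeps showing the accumulator.
for pi in (0x00, 0x55, 0xFF):
    print(datapath_routing(a0=0x12, a1=0x34, pi=pi, srca="A0", pi_select=True))
```

In this sketch PO never follows PI, which is the behaviour the suggested experiment should show if the explanation is correct.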
I've made some additional tests. My initial statement is partially wrong: the component doesn't need three clock cycles in general, but the result is ready with an offset of one clock pulse. I wrote a small test program to verify this; the flow is as follows:
1) set the parallel input value, starting at 0x00 and incrementing for each run; D0 (Add_Value) remains at 0x02 as prepared in the example
2) force a clock pulse //transition from state = load to state = add
3) read the output
4) force clock pulse //transition from state = add to state = load
5) read the output
6) back to #1
First clock pulse, input 0, result 0 //after state = load
Second clock pulse, input 0, result 0 //after state = add (here's the point where I'd expect result = input + 2)
First clock pulse, input 1, result 2 //next run, input incremented, but result from previous run => offset of one clock cycle
Second clock pulse, input 1, result 2
First clock pulse, input 2, result 3
Second clock pulse, input 2, result 3
First clock pulse, input 3, result 4
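The offset in this log can be reproduced with a small behavioural model. This is only a Python sketch under an assumption that matches the guess earlier in the thread: the sum is computed in the add state, but the visible output only picks it up on the next load edge. The class `OffsetAdder` and the helper `run_test` are invented for illustration and do not correspond to any PSoC API.

```python
ADD_VALUE = 0x02  # D0 as prepared in the example


class OffsetAdder:
    """Toy two-state adder whose visible output lags by one run (assumed)."""

    def __init__(self):
        self.state = "load"
        self.pi = 0       # parallel input
        self.latched = 0  # input captured in the load state
        self.result = 0   # sum held internally
        self.output = 0   # what the test program reads back

    def clock(self):
        if self.state == "load":
            self.output = self.result  # previous run's sum becomes visible
            self.latched = self.pi
            self.state = "add"
        else:
            self.result = (self.latched + ADD_VALUE) & 0xFF  # 8-bit datapath
            self.state = "load"


def run_test(dev, runs):
    """Follow steps 1) to 6): set input, pulse, read, pulse, read."""
    log = []
    for value in range(runs):
        dev.pi = value
        for pulse in ("First", "Second"):
            dev.clock()
            log.append((pulse, value, dev.output))
    return log


for pulse, value, result in run_test(OffsetAdder(), 4):
    print(f"{pulse} clock pulse, input {value}, result {result}")
```

If the real hardware behaves differently inside, the model is wrong in its mechanism, but it shows the point of the log: each run reads back the previous run's input plus 2.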
So, I modified the parallel adder UDB implementation:
1) the dynamic reg 0 configuration is changed to ADD, SRCA = A0, SRCB = D0, A0_WR_SRC = ALU, A1_WR_SRC = NONE
2) the dynamic reg 0 configuration is copied into reg 1, they're now both the same
3) State_Add is modified so that it now also latches PO
4) a ready output signal is added, assigned the inverted value of the state machine's state
First clock pulse, input 0, result 0
Second clock pulse, input 0, result 2 //output ready on the second clock pulse as expected
First clock pulse, input 1, result 0
Second clock pulse, input 1, result 3
First clock pulse, input 2, result 0
Second clock pulse, input 2, result 4
First clock pulse, input 3, result 0
Note that the result is read through a status register in sticky configuration whose clock input is the ready signal. This verifies that the result changes with the rising edge of the ready signal (that's why the result after the first clock pulse is always zero).
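The readout behaviour described in this note can be sketched as a toy model too. The Python below is full of assumptions and is not the UDB itself: it assumes ready is high in the load state (so it rises on the add-to-load transition), that the sum is available on PO at that edge, and that the sticky status register is cleared by every read. `ModifiedAdder` and `read_status` are hypothetical names.

```python
ADD_VALUE = 0x02  # D0 as prepared in the example


class ModifiedAdder:
    """Toy model of the modified adder with a ready-clocked sticky readout."""

    def __init__(self):
        self.state = "load"
        self.pi = 0      # parallel input
        self.po = 0      # parallel output (the sum)
        self.sticky = 0  # sticky status register contents

    @property
    def ready(self):
        # Assumed encoding: ready is the inverted state bit, high in "load".
        return self.state == "load"

    def clock(self):
        was_ready = self.ready
        if self.state == "load":
            self.state = "add"
        else:
            self.po = (self.pi + ADD_VALUE) & 0xFF  # 8-bit datapath
            self.state = "load"
        if self.ready and not was_ready:
            self.sticky |= self.po  # captured on the rising edge of ready

    def read_status(self):
        # Assumed: reading a sticky status register clears it.
        value, self.sticky = self.sticky, 0
        return value


dev = ModifiedAdder()
for value in range(4):
    dev.pi = value
    for pulse in ("First", "Second"):
        dev.clock()
        print(f"{pulse} clock pulse, input {value}, result {dev.read_status()}")
```

Under these assumptions the read after the first pulse always returns zero, because the previous read cleared the register and ready has not risen again yet.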
So, it seems that this modification improves the parallel adder functionality. I assume(!) the PI/PO example uses the two different state configurations for simplification, but the clock offset was not taken into account.
The above modifications are quick'n'dirty, but this was really helpful in learning UDB. The next step is to extend the adder so that both summands come in through the parallel input. Not sure if this can also be done in two cycles.