In preparation for migrating some custom components from Verilog to the UDB ALU function block, I used the parallel PI/PO example from AN82156 as a base. This example is an 8-bit adder: one summand comes from the D0 register, which is written by the CPU, and the other comes from the UDB parallel input. The example component switches dynamically between the ALU input from A0/A1 and the parallel input PI, and uses two states: fetch PI, and add.
AN82156 uses the state diagram below:
The functional Verilog code:
state <= STATE_ADD;
/*we must latch the PO value here, because in the next state PO is not valid*/
Parallel_Out <= po;
state <= STATE_LOAD;
So, from the above, I'd expect the component to need two clock cycles to get the result ready. However, it seems to take three clock cycles until the result is ready. Can anyone explain why three clock pulses are needed?
In the datapath documentation, the "Ax WR source" description explicitly states that the Ax register is written _after_ the ALU operation has completed, so an ALU operation can't be expected to be finished at the instant the rising clock edge occurs. But from the documentation, I'd expect the result of an ALU instruction to be ready by the next rising clock edge, so the sequence above should need only two clock cycles. Where am I wrong? Does each ALU operation need two clock cycles until it is ready, with the next operation already started on the next clock edge (interleaved operation)?
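
To make the expectation above concrete, here is a minimal behavioural sketch in plain Verilog. It is not the UDB datapath primitive and not the AN82156 source. The module name `expected_adder`, the register `a0`, the testbench and the test values are all hypothetical. It assumes that the ALU is purely combinational, that A0 is written on the rising edge that ends each state, and that `po` carries the ALU result. The real datapath may violate any of these assumptions, which is exactly what I'm asking about.

```verilog
// Behavioural sketch only: models the timing I EXPECT, not the real UDB datapath.
// Assumptions (mine): combinational ALU, A0 written on the rising edge that
// ends each state, po = ALU result.
`timescale 1ns/1ps

module expected_adder (
    input  wire       clk,
    input  wire [7:0] pi,            // parallel input (first summand)
    input  wire [7:0] d0,            // D0 register, written by the CPU
    output reg  [7:0] Parallel_Out
);
    localparam STATE_LOAD = 1'b0;
    localparam STATE_ADD  = 1'b1;

    reg       state = STATE_LOAD;
    reg [7:0] a0    = 8'd0;          // stands in for the A0 accumulator

    // Combinational ALU: pass PI in LOAD, compute A0 + D0 in ADD
    wire [7:0] po = (state == STATE_LOAD) ? pi : (a0 + d0);

    always @(posedge clk) begin
        a0 <= po;                    // ALU result is written back at the edge
        case (state)
            STATE_LOAD: state <= STATE_ADD;
            STATE_ADD: begin
                Parallel_Out <= po;  // latch the sum, as in the example code
                state <= STATE_LOAD;
            end
        endcase
    end
endmodule

// Hypothetical testbench: counts rising edges and prints the latched output.
module tb;
    reg        clk = 1'b0;
    reg  [7:0] pi  = 8'd5;
    reg  [7:0] d0  = 8'd3;
    wire [7:0] result;
    integer    edges = 0;

    expected_adder dut (.clk(clk), .pi(pi), .d0(d0), .Parallel_Out(result));

    always #5 clk = ~clk;

    always @(posedge clk) begin
        edges = edges + 1;
        #1 $display("edge %0d: Parallel_Out = %0d", edges, result);
        if (edges == 3) $finish;
    end
endmodule
```

In this model the sum (5 + 3 = 8) is latched into `Parallel_Out` on the second rising edge, which matches my two-cycle expectation. The real component needs a third edge, so at least one of the assumptions listed above apparently does not hold for the actual hardware.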