It seems that my measurements are more or less consistent with Cypress' own application note
"Toggle GPIOs Faster with Data Registers" (please refer to figures 28 and 29). The API-based toggler produces a waveform with a 2 us period, while a loop exactly equivalent to my code produces a 530 ns period. I don't see their clock settings mentioned, but my period is 250 ns for a 64 MHz clock.
So again, what mechanism causes even the faster toggler to be so slow? Please note that I am not asking about the fastest way to toggle a pin, because I can accomplish that easily with a UDB; I just want to fill a gap in my understanding of the circuits and activities behind software pin access.
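For reference, the loop I'm measuring is essentially a direct read-modify-write of the port data register. Here is a host-runnable sketch of it; the `DR` variable is a stand-in for the real PSoC 5LP port data register (which would be a volatile register access on the actual part), and the pin mask is an assumption:

```c
#include <stdint.h>

/* Stand-in for the port data register. On the real PSoC 5LP this would
 * be a volatile access to the port's DR register; using a plain static
 * here only so the sketch runs on a host. */
static volatile uint8_t DR;

#define PIN_MASK 0x01u  /* hypothetical pin bit within the port */

/* Toggle the pin n times by XOR-ing the data register directly,
 * bypassing the Pin_Write()/Pin_Read() API overhead. */
static void toggle_loop(uint32_t n)
{
    while (n--)
        DR ^= PIN_MASK;  /* read-modify-write of the data register */
}
```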
Toggling a pin your way involves CPU action only, so the SYSCLK speed has the most influence on execution time; the required instruction sequence comes next.
I would suggest setting the build mode from "Debug" to "Release", which will enable some code optimizations.
Nonetheless, toggle-testing is more or less meaningless: you have no CPU time left to do anything else, and the pin will never give a 50% duty cycle (due to the needed branch instruction). So how does it help you, or what difference does it make, when you find a method to toggle a pin faster?
As already mentioned, the numbers come from a release build with all optimization options at their highest levels (even including LTO, which should not change anything in this particular test case, by the way). The 64 MHz 32-bit ARM CPU is able to produce exactly 4M pulses per second, which equals the abilities of an ATmega168 clocked at 16 MHz. In the Atmel's case the performance exactly matches calculations based on the opcode definitions; in the ARM's case it is beyond my current ability to grep the tons of documentation. The math clearly shows that this loop on a PSoC 5LP needs 16 cycles per iteration, which is quite a lot. I'd like to know where these cycles go, hence the posted question.
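Spelling the arithmetic out (both pulse rates are from my scope measurements, one period per loop iteration):

```c
#include <stdint.h>

/* CPU cycles spent per output period, derived from the measured pulse
 * rate: cycles = f_cpu / f_pulse. */
static uint32_t cycles_per_period(uint32_t f_cpu_hz, uint32_t f_pulse_hz)
{
    return f_cpu_hz / f_pulse_hz;
}
```

At 64 MHz and 4M pulses/s this gives 16 cycles per iteration on the PSoC 5LP; the same 4M pulses/s on a 16 MHz ATmega168 works out to only 4 cycles per iteration.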
The real problem comes from my research on whether a dedicated 3-wire SPI bus should be handled by fully software bit-banging or by a UDB-based accelerator (albeit handcrafted, as the SPI blocks from the Cypress library have an insane resource footprint, not to mention the UARTs). My fear was that this implementation would be too fast, not too slow, and that I would have to manage the setup/hold times somehow. It turns out that the software implementation is so unexpectedly slow that no wait-state management will be necessary. This is a dedicated system bus with real-time latency requirements (namely, an RTC-to-external-atomic-clock-reference synchronizer with 15 us accuracy), so the CPU would otherwise have to poll the UDB. Currently the software implementation wins, because it consumes no precious UDB resources and is slow enough for the slave chip to handle.
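For context, the software path under discussion is a plain bit-banged shift-out along these lines. This is a host-runnable sketch: the pin-write helpers and the tiny simulated slave (which samples MOSI on each rising SCLK edge) are hypothetical stand-ins; on the real part the writes would go straight to the port data registers.

```c
#include <stdint.h>

/* Simulated pin state plus a tiny simulated slave for testing; on the
 * PSoC 5LP the two write helpers would be direct data-register writes
 * (the names here are hypothetical). */
static uint8_t mosi_pin;
static uint8_t captured;  /* what the "slave" sampled, MSB-first */
static uint8_t nbits;     /* number of rising clock edges seen */

static void mosi_write(uint8_t v) { mosi_pin = v & 1u; }

static void sclk_write(uint8_t v)
{
    if (v) {  /* rising edge: slave samples MOSI (SPI mode 0) */
        captured = (uint8_t)((captured << 1) | mosi_pin);
        ++nbits;
    }
}

/* Shift one byte out MSB-first, mode 0: data valid on the rising edge. */
static void spi_shift_out(uint8_t byte)
{
    for (uint8_t bit = 0; bit < 8; ++bit) {
        mosi_write((uint8_t)(byte >> (7 - bit)));  /* present data bit */
        sclk_write(1);                             /* slave samples here */
        sclk_write(0);                             /* return clock low */
    }
}
```

Since each bit costs a data-register write plus two clock writes and the loop overhead, the measured toggle numbers above put an upper bound on the bit rate this can reach, which is what settled the too-fast/too-slow question for me.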