Bitbanging Cycle Count

Skoe

 

For a project that needs "intelligent bitbanging", the PSoC 4 seems to be perfect. I'm evaluating the following chain:

Pin => combinatorial logic 1 => status register => CPU processing => control register => combinatorial logic 2 => Pin

I connected the two pins externally and put an inverter into the schematic as a minimal combinatorial logic 1. The registers/pins are set to transparent for minimal latency (I'll add synchronisation as appropriate later; this is just an experiment). Combinatorial logic 2 does not exist yet; it's a direct connection for now.

The code contains an unrolled loop like this (compiled from C):

ldrb r1, [r2]
uxtb r1, r1
strb r1, [r3]
ldrb r1, [r2]
uxtb r1, r1
strb r1, [r3]

or handwritten assembly:

ldrb r0, [r5]
strb r0, [r6]
ldrb r0, [r5]
strb r0, [r6]
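
For reference, here is a minimal C sketch of a copy loop that would plausibly compile to load/store pairs like the ones above. The register pointers are passed in as parameters because the real UDB Status/Control Register addresses are not given in this thread; `copy_status_to_control` and its parameter names are made up for illustration. The `volatile` qualifier is what forces the compiler to emit every single access instead of merging them.

```c
#include <stdint.h>

/* Hypothetical helper: copies the status register to the control register
 * four times, unrolled by hand. The pointers stand in for the memory-mapped
 * UDB registers; their real addresses are not shown in this thread. */
static inline void copy_status_to_control(const volatile uint8_t *status,
                                          volatile uint8_t *control)
{
    /* Each line is one volatile byte load followed by one volatile byte store. */
    *control = *status;
    *control = *status;
    *control = *status;
    *control = *status;
}
```

On a host machine the function can be exercised with two ordinary `volatile uint8_t` variables in place of the registers; on the target, the pointers would be the addresses generated for the Status and Control Register components.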

This roundtrip takes approximately 11 cycles (230 ns) at 48 MHz, maybe 10 cycles plus propagation delay.

Is this correct? Where do the cycles go?

Skoe

One addition: Writing 1/0/1/0 to the Control Register with a straight-line sequence of strb instructions results in a pin change every ~105 ns (5 cycles). Are there wait states involved? I thought everything was clocked with HCLK. A chain of flip-flops could explain a delay, but not wait states.
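
To double-check the cycle counts against the measured times, the conversion can be sketched in C. The helper name `cycles_to_ns` is made up for illustration; it only assumes that the CPU clock is the 48 MHz mentioned above.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a given clock frequency in Hz.
 * Uses 64-bit integer arithmetic and rounds to the nearest nanosecond. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    uint64_t numerator = (uint64_t)cycles * 1000000000ull;
    return (uint32_t)((numerator + clock_hz / 2u) / clock_hz);
}
```

At 48 MHz one cycle is about 20.8 ns, so `cycles_to_ns(5, 48000000u)` gives 104 and `cycles_to_ns(11, 48000000u)` gives 229, which is consistent with the measured ~105 ns per store and ~230 ns per roundtrip.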


Hello Skoe.

A quick Cypress search brings up MANY discussions on this topic with explanations.
Here's a video explaining the 5 clock cycles behind the maximum software-based GPIO toggle speed:
"C++ vs Assembly vs Verilog" on Vimeo

 

Skoe

Thank you for your response. Unfortunately there are lots of posts with similar questions and answers, but either they don't cover the same issue or I didn't find the ones that do.

The video you linked explains 2 cycles for a store (see screenshot from 07:22), resulting in 2 * 2 for two stores + 1 for the branch => 5. What I am talking about is 5 for _one_ store, without a branch.

I guess the wait state (stall) shown in the video is related to the way the peripherals are connected in the PSoC. Just FYI, there are µCs that have the GPIOs connected to the AHB bus of the Cortex-M3; there you can toggle a GPIO with 0 wait states. But they don't work in my application, because I want to implement at least part of the logic in HDL while keeping the CPU in the path for data processing. That's why a PSoC or an FPGA would work.

To come back to the question: Okay, there's one additional cycle for a write to a peripheral (GPIO in the video). But why do I get 5? It feels like the peripherals are clocked slower than the CPU, so the UDB Control Register causes additional wait states?
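
The cycle budget being compared here is simple arithmetic. As a sketch (the function names are made up for illustration; the 2 cycles per store and 1 cycle per branch are the figures quoted from the video, and 5 is the per-store figure measured on the Control Register):

```c
/* Cycles for one loop iteration made of identical stores plus a branch,
 * using the per-instruction figures from the video. */
static unsigned loop_cycles(unsigned stores, unsigned store_cycles,
                            unsigned branch_cycles)
{
    return stores * store_cycles + branch_cycles;
}

/* Cycles per store that the expected figure does not account for.
 * Returns 0 if the measurement is not above the expectation. */
static unsigned unexplained_cycles(unsigned measured, unsigned expected)
{
    return (measured > expected) ? (measured - expected) : 0u;
}
```

With the video's figures, `loop_cycles(2, 2, 1)` gives 5 cycles for a loop of two stores and a branch, while `unexplained_cycles(5, 2)` gives 3 cycles per Control Register store that the video's model does not account for.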

If I understood this, the question about the 10/11-cycle "roundtrip" would be easier to answer, I guess.


Hi Skoe.

I ran a couple of tests on a CY8C4245 running at 48 MHz, toggling 10101010 out of a port pin, with HFCLK/2 on another port pin for reference.

ldr r3, .L3   @ address of Control_Reg
mov r1, #1
mov r2, #0
strb r1, [r3]
strb r2, [r3]
strb r1, [r3]
strb r2, [r3]
strb r1, [r3]
strb r2, [r3]
strb r1, [r3]
strb r2, [r3]
Each strb instruction took 5 HFCLK cycles, the same as your results.

Using the system port pin Write APIs:
ldr r1, [r3]
orr r1, r2
str r1, [r3]
ldr r1, [r3]
bic r1, r2
str r1, [r3]
repeat 3 more times...
Each "1" took 9 HFCLK cycles, and so did each "0".
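
In C, the two instruction sequences above correspond to a read-modify-write on the port register. A minimal sketch, assuming the generated pin Write API boils down to something like this; `pin_set`, `pin_clear` and the register pointer are placeholders for illustration, not the actual Cypress API:

```c
#include <stdint.h>

/* Set the bits in mask: one load, one OR, one store (ldr / orr / str). */
static inline void pin_set(volatile uint32_t *port_reg, uint32_t mask)
{
    *port_reg |= mask;
}

/* Clear the bits in mask: one load, one bit-clear, one store (ldr / bic / str). */
static inline void pin_clear(volatile uint32_t *port_reg, uint32_t mask)
{
    *port_reg &= ~mask;
}
```

Each call touches the peripheral bus twice (one read, one write), compared to a single write for the plain strb sequence on the Control Register.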

The faster GPIO toggle seen in the video uses a PSoC 5LP (with a small cache). We know the PSoC 4 has a short pipeline, so this could be a factor in the performance difference.

I don't know how the Control Reg is implemented, but there's at least 1 clock edge to latch the data into the register. Maybe an additional clock cycle is used to clock Control Reg output into GPIO output latch (even if set to transparent)? Architecture TRM doesn't have much on clocking Control Reg (that I could find).

I was really surprised to see 9 cycles to set (or clear) the port pin.

So, still a mystery where these extra clock cycles are coming from.
Sorry I couldn't help out more.


Thank you for looking into it. So you share my surprise 🙂

> Maybe an additional clock cycle is used to clock Control Reg output into GPIO output latch (even if set to transparent)?  Architecture TRM doesn't have much on clocking Control Reg (that I could find).

That's what I thought too, but I don't see why the path from the control register to the port register (if any) should stall the CPU. I wouldn't be surprised if it delayed the output, but it actually stretches the "toggle cycles" themselves.

Not that it matters, but the CM0 in the PSoC 4200M has a three-stage pipeline too, like the CM3, but no branch prediction. The CM0+ has 2 stages, which reduces the impact of branches.

> So, still a mystery where these extra clock cycles are coming from.
> Sorry I couldn't help out more.


Thank you nevertheless. I bet a PSoC engineer could explain it 🙂 I thought I had made a mistake, but you obviously came to similar results, which is also helpful to know.


Hi Skoe.

Curiosity got the better of me.  I did a few more tests.
I moved the code to SRAM.  Rats!!! Exactly the same result.
I tried running at 24MHz=HFCLK=SYSCLK, same result.
Tried 48MHz=HFCLK, 24MHz=SYSCLK, same result.
Tried Control Reg mode set to Clock (instead of transparent), same result.

My conclusion: since the M0 processor (CY8C4245) has only 1 bus (unlike the 5LP), it's bus-bound when writing to hardware. There doesn't appear to be any Cypress documentation explaining where the clock cycles are consumed.

I found it interesting to investigate this.
Good luck with your project.