PSoC 1 Software PWM Performance and Advice Wanted

Pauven

Background info for my question: 

I am an amateur/hobby circuit designer (self taught), so apologies in advance.   Also sorry for the long post.  I have many skills, but brevity is not one of them.

I've designed a pinball lighting control circuit board.  When I originally spec'ed out my requirements, and searched for components, the PSoC 1 rose to the top of my list.  Many may think the PSoC 1 is not the best choice for this application (and they're probably right), but at this point I'm too heavily invested in time and money to do anything different.  

I'm using the PSoC 1 with 56 GPIO, and configuring all 56 as LED outputs.  I will say that my main goal is 100% achieved, I have a working solution that meets my original objectives of directly controlling all 56 outputs without matrixing, and minimal latency, and decent PWM for LED's (good enough for pinball, anyway, though a little less flicker would be a nice-to-have).  Not only that, but my self-designed board is significantly cheaper than the commercial options that are available, so win-win even if my PSoC 1 choice was ill-conceived.

These same LED outputs are also optionally used to control a separate Power Driver board, allowing me to turn on/off pinball solenoids just like turning on/off an LED, except I don't use PWM for those signals, just on/off.

Keep in mind that for my following questions, each of the 56 outputs needs to be individually controllable for on/off and PWM duty cycle.

It was only after I designed my pinball lighting solution that someone asked if I could use it to control servos, and at first I thought sure, absolutely!  But upon further inspection I don't think the PWM performance is high enough for accurate servo control, bringing me here.

 

Question 1:  Hardware PWM vs. Software PWM?

I know the PSoC 1 can be configured with hardware PWM by using a digital block.   But even with only 8-bit PWM, I have to dedicate a full digital block to PWM calculations.  And best I can tell, I can only drive 1 GPIO individually with hardware PWM.  I did see some options for using one PWM signal to drive multiple pins, but it seems like this needs to use shared PWM duty cycles, so this wouldn't allow me to individually control multiple GPIO using a single PWM digital block.

Am I correct?  So a software PWM solution is the only option if I need 56 GPIO individually valued?

My chosen PSoC only has 4 digital blocks, so I quickly settled on software PWM, but I'm wondering if I misunderstand how best to leverage the PSoC 1's capabilities.

I would love it if there was a way to use a digital block to improve PWM performance for all 56 GPIO simultaneously, but while maintaining individual addressability.

 

Question 2:  Software PWM Performance?

After significant code refinement, I have gotten my software PWM performance up to a staggering 8Hz (laugh), using 64 PWM levels (6-bit), while setting all 56 pins on each pass.   

To provide a bit more detail: my software does 64 loops through the on/off settings for each PWM Period, and is able to complete 8 Period loops in 1 second (8*64 = 512 total loops per second).  And it is doing this for all 56 GPIO, so technically it's processing on/off states at a rate of around 28,672 GPIO pin updates per second (512*56).

8Hz is pretty low.  Like I wrote above, this is good enough for basic pinball lighting, and triggering solenoids on/off, but a bit slow for more advanced RGB lighting, and way too low for servos which commonly need around 50Hz.  Also, servos need very fine pulse duration control often measured in fractions of a millisecond.  My smallest duration right now is around 2ms, far too long to control servos.  Even if my basic loop was magically running 6x faster at 50Hz, I don't think the duration control is fine enough for servos, and would require a hardware PWM implementation for adequate control.

So does the performance I'm achieving sound reasonable for a PSoC 1?  Somehow I thought a 24MHz processor would be a little faster.  It seems my code is taking about 837 CPU cycles per pin update (24,000,000 / 28,672), which seems high to me.
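The cycle estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic using the figures quoted in the post (24 MHz CPU clock, 8 periods per second, 64 steps per period, 56 pins); the function name is made up for illustration.

```c
#include <stdint.h>

/* CPU cycles available per GPIO update, given the figures quoted above. */
static uint32_t cycles_per_pin_update(uint32_t cpu_hz, uint32_t periods_per_sec,
                                      uint32_t steps_per_period, uint32_t pins)
{
    uint32_t updates_per_sec = periods_per_sec * steps_per_period * pins;
    return cpu_hz / updates_per_sec;   /* integer division, rounds down */
}
```

Calling cycles_per_pin_update(24000000, 8, 64, 56) gives 837, matching the number in the post.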

The PSoC 1 Clocks and Global Resources documentation states that M8C assembly language instructions take between 4 and 15 cycles of CPUCLK to execute.  I'm using C, not assembly, so I'm not sure if C has a performance overhead.  For each GPIO, I use each GPIO's target PWM level (0-64) to query a 65x65 constant array of predetermined 6-bit PWM values, then use GetState to compare if the new On/Off state is different, and if it is different then I issue the LED_1_On or Off Pragma command.  That's it, the code is super simple, so I didn't think it would take so many CPU clocks to look up a boolean value in a 2-dimensional array and see if it is different than the current On/Off state.

It also seems that if I scaled back my grand ambitions from 56 GPIO to just 1, at best this would only be 56x faster, essentially a 448 Hz software PWM solution for 1 GPIO. 

Since I need 56 PWM LED outputs, I've never tested the digital block to see how fast that PWM solution performs, though I'd wager it's quite a bit faster.  Still, 448 Hz for a single 6-bit software PWM coded GPIO seems pitiful.

 

Question 3:  Best Practices for Accessing GPIO - Registers vs. Pragmas?

In my early test code, I was directly accessing the GPIO using the registers, like PRT4DR, thinking that would be fastest.  But I've since changed to using Pragmas, like LED_01_Start, LED_01_On or LED_01_Off.  From my tests, it seems performance is the same, and the Pragmas make for easier coding.  Am I wrong?

Similarly, I was originally setting the GPIO ON or OFF on every pass, even if there was no change from the previous pass.  I refined my code to only set the state if it had changed, and this seemed to boost performance.  I started with an array of 56 booleans to track the On/Off states manually, but then started using the LED_01_GetState instead, and this Pragma seemed to have no adverse effect on performance.  The only downside I discovered to using GetState is that it only reported correctly if I used the Pragmas for On and Off, otherwise it returned the wrong value if I set On/Off directly via registers. 

Am I right that Pragmas have no performance impact, and possibly even memory benefits since I don't need to track an array of on/off states in my code?

 

Question 4:  Can I Set All GPIO with 1 Command?

In my software PWM solution, I am stepping through each GPIO, one-by-one, to set their On/Off state (but only if it has changed).  This seems really inefficient.  In a worst case scenario of a 50% duty cycle, I'm flipping each GPIO's state with every pass, one-by-one.

It seems it would be more efficient if I could pass a single command with 7-byte value that represents the on/off state of all 56 GPIO. 

I couldn't find any commands like this.  Perhaps I'm overestimating the potential impact, as in my testing it seems that my code sets 1 GPIO at the same rate as all 56 GPIO, so perhaps toggling the GPIO state doesn't have much of a performance penalty.  I guess this makes sense, as I would expect the PSoC to be highly optimized for setting the GPIO states.  But if I could eliminate a loop through all 56 GPIO one-by-one, perhaps PWM frequency would improve due to code efficiency.
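One way to picture the idea in this question: the 56 on/off states are packed into 7 bytes, and each byte is then written to one port's data register in a single store. The sketch below runs on a PC, so a plain array (sim_port_dr, a made-up name) stands in for the real PRT0DR to PRT6DR registers, and it assumes the states are already ordered by port and bit.

```c
#include <stdint.h>

#define NUM_PORTS 7
#define PINS_PER_PORT 8

/* Stand-in for PRT0DR..PRT6DR so the sketch runs on a PC (assumption). */
static uint8_t sim_port_dr[NUM_PORTS];

/* Pack 56 on/off flags (0 or 1, in port/bit order) into 7 port bytes. */
static void pack_states(const uint8_t on[NUM_PORTS * PINS_PER_PORT],
                        uint8_t packed[NUM_PORTS])
{
    for (int port = 0; port < NUM_PORTS; port++) {
        uint8_t value = 0;
        for (int bit = 0; bit < PINS_PER_PORT; bit++) {
            if (on[port * PINS_PER_PORT + bit]) {
                value |= (uint8_t)(1u << bit);
            }
        }
        packed[port] = value;
    }
}

/* One write per port sets and clears all eight pins together. */
static void write_all_ports(const uint8_t packed[NUM_PORTS])
{
    for (int port = 0; port < NUM_PORTS; port++) {
        sim_port_dr[port] = packed[port];
    }
}
```

On the chip this would still be 7 separate register writes rather than 1 command, but that is 7 stores per pass instead of up to 56 individual pin updates.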

 

Question 5:  Best Global Resource Settings?

Performance is my main concern for this device.  Low latency is #1, followed by higher PWM frequency.  For that reason, I've pretty much maxed out the Global Resource settings (or, at least I think I have maxed them out).  But I understand that some of these settings consume more power without any benefit to my device, so minimizing wasted power drain seems wise if possible.

Below is what I've configured, and my reasoning.  Does this look right?

Power Setting is set to 5.0v / 24MHz - I believe this is the fastest option.

CPU_Clock is SysClk/1 - I figure this is what is controlling my code speed, and that this is the fastest clock.

Sleep_Time is 1_Hz - I'm never "sleeping" in my code, it simply runs my simple loop forever as fast as possible, no delays, so I'm not sure if this has any impact at all for my purposes.  Though perhaps I'm confused how this affects performance and it is slowing down my code execution by waking up the CPU every 1Hz!!!

VC1 is set to 16 - I don't see any performance difference vs. setting VC1 to 1

VC2 is set to 16 - I don't see any performance difference vs. setting VC2 to 1

VC3 is set to VC2/16 - I don't see any performance difference vs. setting to 1.  The documentation indicates I can set this lower than VC1/2, going as low as 256.  Should I change this to 256 for lower power drain?  Diminishing returns?

SysClk Source is set to Internal - I'm using the built in IMO, not an external crystal solution

SysClk*2 Disable is set to No - I don't think the SYSCLK Doubler affects software PWM, as I don't think software code is able to take advantage of the special 48MHz clock timings that are available to the digital blocks, so I thought I should set this to Yes for more power savings. But every time I set it to Yes, the PSoC 1 failed to connect properly through USB.  Perhaps that indicates a hardware issue, but leaving it set to No works just fine, even though this seems like a higher performance setting.

Analog Power is set to All Off - It was originally the default of SC On/Ref Low. All 56 GPIO are using the LED module configured to Active High.  Since I'm using digital IO, my assessment is that these Analog Power settings don't apply to me, and it seems to behave the same with the All Off setting, which should have the least power drain.

Ref Mux is set to (Vdd/2)+/-BandGap - This too seems to be related to Analog Power for use in Analog Blocks, so I'm thinking this value has no impact at all for my design.

AGndBypass is set to Disable - This seems to be related to Analog Power's Ground, for use in Analog Blocks, so I think this has no impact on my design.

Op_Amp Bias is set to Low - Another analog setting that doesn't apply to my design?

A_Buff_Power is set to Low - And another analog setting that doesn't affect my design, right?

Trip Voltage [LVD] is set to 4.81V - The fact that my tests are operating without restart hiccups suggests that my USB 5V power delivery is working correctly, otherwise I would expect even minor voltage sag to cause restarts with this being the most aggressive setting.  Side note, this is the 2nd revision of my design.  My 1st design used an external power source shared with the LED's, and lighting more than a handful at once caused enough voltage sag to trigger a PSoC reset.  I'm now using USB power for the PSoC, and separate power planes for my LED's, and I'm ecstatic that it's working well.

LVDThrottleBack is set to Disable - Combined with my take on the LVD behavior, I think this suggests that I am successfully running at 24MHz, and not scaling back to 12MHz due to low voltage.

Watchdog Enable is set to Disable - My understanding of the Watchdog is that it can check if the PSoC is hung, and restart it if it is non-responsive.  Other than low software PWM performance, my board seems to be working flawlessly, even without the Watchdog enabled.  I'm thinking that I might need to set this to Enabled for a production version of my design, to make it more reliable for end-users, but I'm not sure if that's the right way to think about this feature.  But since I have a software loop that runs non-stop forever without sleep, I'm not sure that this even gives the Watchdog an opportunity to be effective.

 

Question 6 - What am I overlooking?

Sorry to be so greedy for input, but I thought it wise to ask the most general of questions - what am I completely overlooking?

All I need is for the PSoC 1 to individually address all 56 GPIO with a PWM signal, at the highest possible frequency, and with the most accurate duty cycle period duration.

Should I be using timers?  Interrupts? Assembly instead of C?

I have a very refined loop that simply updates all GPIO as fast as possible with the current On/Off state based upon predefined PWM values. The code occasionally checks if the USB buffer has new PWM data to process (this is a tiny 56 byte record), a query that I've varied from once per PWM Period to 64 times per period with no discernible impact.  I'm currently checking the USB buffers for data once every 2 passes through the loop, approximately 256 checks/sec, or every 4ms.  This is to keep latency low for my pinball control software.

Because it is simply running a loop as fast as possible, I didn't see a need for timers or interrupts, even though these would be typical in a PWM solution.  I do understand that my timer-less approach leads to some performance variability, though it seems consistent enough as the processing load is essentially the same with every loop.

 

You made it to the end!

If you actually read everything above, you're awesome! Thank you!  Hopefully you have some good advice to share back my way...  

If I need to post code, let me know.

-Paul

1 Solution
Pauven

USB Data Transfer Bugs Fixed

I worked around my USB data transfer issues, though I feel my solution is a bit of a hack and maybe not the preferred methodology.

The problem was that I needed to receive and reassemble 4 packets of 64 bytes into a 256 byte 2-dimensional array.  I got really close by looping through bReadOutEP 4 times and loading the data into the array using offsets, but I was still getting data corruption, most likely caused by packets arriving out of order.  The solution to this is probably using control bytes in the data to assist in re-assembly, but this wasn't a direction I was too eager to try.  

Besides, I scoured the internet and this forum for hours looking for any real-world examples of this technique, and could find none.

My hack was that I realized that my PSoC 1 has 4 Endpoints, so I split my data array into 4 smaller 64 byte arrays, and send each on a separate Endpoint.  Luckily I don't need any more Endpoints for anything else, as this solution is working perfectly.
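For comparison, the control-byte idea mentioned above could look roughly like the sketch below. It is not the approach used in this post: the packet layout (byte 0 as a chunk index, 63 payload bytes after it) and all names are assumptions made for illustration. Note the cost: 4 packets then carry only 252 payload bytes, so a full 256-byte table would need a fifth packet.

```c
#include <stdint.h>
#include <string.h>

#define PACKET_SIZE   64
#define PAYLOAD_SIZE  63            /* byte 0 carries the chunk index (assumed) */
#define NUM_CHUNKS    4

static uint8_t state_table[NUM_CHUNKS * PAYLOAD_SIZE];
static uint8_t chunks_seen;         /* bit n set once chunk n has arrived */

/* Copy one received packet into place; returns 1 when all chunks are in. */
static int accept_packet(const uint8_t packet[PACKET_SIZE])
{
    uint8_t index = packet[0];
    if (index >= NUM_CHUNKS) {
        return 0;                   /* ignore malformed packets */
    }
    memcpy(&state_table[index * PAYLOAD_SIZE], &packet[1], PAYLOAD_SIZE);
    chunks_seen |= (uint8_t)(1u << index);
    if (chunks_seen == (1u << NUM_CHUNKS) - 1u) {
        chunks_seen = 0;            /* ready for the next table */
        return 1;
    }
    return 0;
}
```

Because each packet names its own destination, the table is assembled correctly regardless of arrival order.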

 

Final Results

With the data transmission bugs resolved, for the first time, I am seeing all 31 brightness levels on my LEDs!  And my scope is showing perfectly formed and spaced duty cycles.  Perfection!!!

And was I able to achieve that hoped-for 210Hz refresh rate?  Not exactly... I blew it away!  For PWM Level 1 the scope sees a very constant 780 Hz!!!  This climbs by 780Hz per level all the way to 11,700 Hz at PWM Levels 15 & 16, then gradually decreases back to 780Hz at PWM Level 30, before going to solid ON (no Hz) at Level 31!!!

I don't know why these frequencies are even faster than my hardcoded tests, but I'm not complaining one bit.

That's still not good enough for servos, so I'm still using 4 digital blocks for hardware PWM, so 4 of the 56 LED's are running true PWM at 48MHz for servo control.  

What's most amazing to me is that the 52 software PWM LED's now look identical to the hardware PWM's at all 5-bit brightness levels, and everything is perfectly flicker free!

Can you tell I'm quite excited?!?!  I had hoped to achieve 60Hz minimums, and somehow I've achieved 13x better frequencies!  And the latency from the host PC is the best I've observed yet, easily under 40ms and probably closer to 10ms.

Hopefully this novel of data I've posted here will help some other users in the future.

 

Executive Summary

By pre-calculating on the host PC the On/Off states for all GPIO to simulate 5-bit PWM (32 brightness levels), and sending the new array of On/Off states over USB every time there is a change, you can get software PWM for LEDs that rivals hardware PWM on a PSoC 1, with minimum frequencies of 780Hz (if your PSoC isn't busy doing other things).

EDIT:  To be clear, this approach is technically Pulse-Frequency Modulation (PFM), as the pulse width is fixed based upon the loop speed, and the frequency is adjusted by controlling the spacing to the next pulse.
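The host-side pre-calculation is not shown in this thread, so the following is only a sketch of how such a table might be built. It assumes 31 passes per period, levels 0 to 31, evenly spread pulses, and a table laid out as one byte per port per pass; the names is_on and build_table are made up. With this spreading, level 1 gives one pulse per period and level 15 gives fifteen isolated pulses, which is consistent with the scope readings above.

```c
#include <stdint.h>

#define STEPS     31   /* passes per period for 5-bit levels 0..31 */
#define NUM_PORTS 7

/* 1 if an output at `level` (0..STEPS) is on during pass `step` (0..STEPS-1).
   On whenever the running total level*(step+1) crosses a multiple of STEPS,
   which spreads `level` pulses evenly over the period (an assumed scheme). */
static int is_on(unsigned level, unsigned step)
{
    return ((level * (step + 1)) / STEPS) != ((level * step) / STEPS);
}

/* Build table[step][port]: one byte per port for every pass of the period. */
static void build_table(const uint8_t level[NUM_PORTS * 8],
                        uint8_t table[STEPS][NUM_PORTS])
{
    for (unsigned step = 0; step < STEPS; step++) {
        for (unsigned port = 0; port < NUM_PORTS; port++) {
            uint8_t value = 0;
            for (unsigned bit = 0; bit < 8; bit++) {
                if (is_on(level[port * 8 + bit], step)) {
                    value |= (uint8_t)(1u << bit);
                }
            }
            table[step][port] = value;
        }
    }
}
```

In this layout the table is 31 x 7 = 217 bytes, which fits within the 256 bytes transferred. The PSoC side then only has to copy one row of 7 bytes to the port registers on each pass.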

This does require a constant connection to a host PC, so this won't be feasible for all applications.  And 5-bit PFM is the limit for controlling all 56 GPIO of the top-end PSoC 1,  as the state array for all 56 GPIO is 256 bytes, and there is not enough RAM to hold larger arrays for 8, 7 or even 6-bit PFM.

If you don't need 56 PWM outputs, scaling back the solution actually improves performance and PWM bit-depths.  For example, for only 8 software PWM outputs (one full Port), you should see minimum frequencies 7x faster (about 5,460Hz), and you can use full 8-bit PFM as that array size is also 256 bytes.

If you don't have a host PC to calculate the On/Off states, then the PSoC 1 can at best achieve about 52Hz for 52 GPIO if it has to do all calculations itself, which increases to about 350Hz if you only need software PWM on 1 Port of 8 GPIOs.  At 350Hz, you will get accurate LED dimming, nearly identical visually to hardware PWM.

Note that software PWM is not accurate enough for servo control, which requires precise frequencies and duty cycles that this approach can't replicate.  

14 Replies
Pauven

While I'm still hopeful for feedback on my questions above, I've made significant progress and thought I would share my results.

With further testing, I found that querying the GPIO status and only updating the status if it needed to change was actually slowing performance, contrary to my earlier results.  By skipping this query and simply updating every GPIO on every pass, I got about 5% more performance.

I tried 5-bit PWM, using only 32 values, but of course I got no speed improvement as the time to update a GPIO state remained constant regardless of the brightness fidelity.  So I'm sticking with 6-bit.  Because I'm using a lookup table for predefined PWM values, 6-bit is as large as I can go.  The table for 7-bit is 4x larger and won't fit in ROM.  I had hoped to see if treating constants as RAM instead of ROM would bring a speed boost, but even the 5-bit array data was too big for RAM. 

I could potentially use bits instead of bytes to represent the PWM LUT, and that should be small enough to fit in RAM, but this may also introduce a performance hit to check the individual bit values instead of full bytes.  I may play around with this, but I'm not optimistic.

My big breakthrough is that by updating all GPIO in a Port at the same time, I got double the speed!  Apparently looping through each GPIO one-by-one, issuing separate register updates, has a performance impact.  So now I calculate all the GPIO values and update the entire Port at once.  Here's a sample of how I'm doing this:

PRT0DR = 0x01*PWMStates[LEDReport.PWMLevel[44]][PWMCycle] + 0x02*PWMStates[LEDReport.PWMLevel[51]][PWMCycle] + 0x04*PWMStates[LEDReport.PWMLevel[45]][PWMCycle] + 0x08*PWMStates[LEDReport.PWMLevel[50]][PWMCycle] +
0x10*PWMStates[LEDReport.PWMLevel[46]][PWMCycle] + 0x20*PWMStates[LEDReport.PWMLevel[49]][PWMCycle] + 0x40*PWMStates[LEDReport.PWMLevel[47]][PWMCycle] + 0x80*PWMStates[LEDReport.PWMLevel[48]][PWMCycle];


The cool thing about the above code is that it's both setting or clearing each GPIO at the same time.  I do the above 7 times, once for each of the 7 Ports.

This tweak brought my worst case PWM Frequency up to 16Hz.

Next I went ahead and added 4 PWM digital blocks, linked them to 4 LED's, and skipped calculating those 4 GPIO in my code.  That reduced the # of software PWM GPIO from 56 to 52, and my worst case Frequency rose to 18Hz (scoped at a very steady 17.82Hz).

Since my pinball controller actually has 2 of these PSoC's, that will give me support for up to 8 servos using those hardware driven PWM outputs.  The other 104 software PWM outputs won't work for servos, but they will still be great for LED's and perfect for triggering solenoids.

Now, you may have picked up on my use of the term "Worst Case PWM Frequency" and wondered what that meant.  Typical hardware PWM has a variable duration duty cycle, while the output frequency stays constant.  My software PWM solution has a constant duration ON pulse (approximately 0.89ms, one pass through the loop), so outputs can be switched ON or OFF at most 1123 times per second.

So worst case is a PWM Level of 1/63, at 18Hz - tons of flicker.  But at 2/63 it doubles to 36Hz, and at 3/63 the flicker is almost gone at 53 Hz.  Beginning with PWM Level 4/63, the 71Hz equivalent frequency is flicker free to my eyes, and it keeps improving from there.

Due to the relatively long ON time of 0.89ms, the apparent brightness at low levels does not match true hardware PWM.  Even though PWM Level 6/63 is completely flicker free at 107Hz, visually it is significantly brighter than PWM at the same level.  So dim levels are a weakness of this approach.  But by PWM Level 8/63, or about 1/8th PWM duty cycle, my software PWM solution is matching the apparent brightness of hardware PWM.

The equivalent PWM frequency peaks at 50% duty cycle, reached at 6-bit PWM Levels 31 and 32,  where the ON/OFF state flips on each pass.  This equates to a rather decent 552 Hz.  Beyond that level, equivalent frequency actually drops again, heading back towards 18Hz at 62/63, but to the human eye you can't tell and it looks equivalent to the hardware PWM.
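The equivalent frequencies quoted above follow a simple pattern: the number of pulses per period times the number of periods per second, counting pulses below 50% and gaps above it. A small sketch of that arithmetic, with a made-up function name and assuming the pulses are spread so that none are adjacent:

```c
/* Equivalent frequency in Hz at a given level: pulses per period times
   periods per second. Below 50% the pulses are counted, above 50% the gaps.
   Assumes level <= max_level. */
static double equivalent_hz(unsigned level, unsigned max_level, double period_hz)
{
    unsigned pulses = (level < max_level - level) ? level : max_level - level;
    return pulses * period_hz;
}
```

With max_level 63 and a period rate of 17.82 Hz, level 1 gives 17.82 Hz, levels 31 and 32 both give about 552 Hz, and level 62 drops back to 17.82 Hz.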

I also determined that checking the USB data buffer for data has no performance impact, so now I am checking between every GPIO Port update to ensure maximum responsiveness to the connected PC.  That should be equivalent to 1123Hz, for that same 0.89ms latency.  Of course, there's probably a timing mis-alignment between the USB polling speed and my program loop, so I imagine latency is somewhat variable and a bit higher.  At some point in the future I plan to measure round-trip latency, but for now I'll just say it looks instantaneous.

The best news is that this methodology can go much, much faster if you only need a few PWM GPIO.

While I was developing my new code, at one point I was only updating a single Port, instead of all 7 on my PSoC.  This actually resulted in equivalent PWM frequencies that were 7x faster, but just for 8 GPIO.

So worst case would be a 125 Hz equivalent PWM frequency at 1/63.  When I tested this I couldn't believe my eyes, as this looked really great, nicely dim and flicker-free.  This climbs to over 3800Hz at 50% duty cycle.  This still wouldn't work for servos, but if you have a project that needs a lot of PWM output for LED's, this will get you 8 more than what your digital blocks can deliver and will still look great.

-Paul


Paul,

I made it through 1/3 of your original post. Can you summarize the main issue that needs resolution? My guess is that you need ~50 PWM outputs to drive servos, is that correct?

I recommend to check this solution using PSoC4, which utilizes a single hardware PWM to drive 4 independent servos.

Cypress PSoC4M 4x10-bit PWM outputs using single TCPWM block 


 

You can find the project files and discussion here:

Using a single PWM block to drive multiple pins using DEMUX

 

The idea behind this approach is that most servos need a pulse of at most 2ms every 20ms (50Hz), so a single PWM can potentially drive up to 10 independent servos if the output can be multiplexed while the PWM duty cycle is synchronously updated. Using this approach one can potentially control 48-60 servos using 6 hardware PWMs.
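The timing behind that claim can be modelled in a few lines. This is only a sketch of the slot arithmetic, not the linked PSoC 4 project: it assumes fixed 2ms slots inside a 20ms frame, and the names active_servo and servo_pin_level are made up.

```c
#include <stdint.h>

#define FRAME_US 20000u   /* 50 Hz servo frame */
#define SLOT_US   2000u   /* longest pulse a servo is assumed to need */
#define NUM_SLOTS (FRAME_US / SLOT_US)   /* 10 servos per PWM */

/* Which servo owns the PWM output at time t_us (microseconds, free-running). */
static unsigned active_servo(uint32_t t_us)
{
    return (t_us % FRAME_US) / SLOT_US;
}

/* Level of one servo's pin at time t_us, given each servo's pulse width. */
static int servo_pin_level(unsigned servo, uint32_t t_us,
                           const uint16_t pulse_us[NUM_SLOTS])
{
    uint32_t in_slot = (t_us % FRAME_US) % SLOT_US;
    return active_servo(t_us) == servo && in_slot < pulse_us[servo];
}
```

Each servo still sees one pulse every 20ms, because its slot comes around once per frame.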

 

I will second others opinion that it may be safer to switch to PSoC4 now. I believe that to port the code is not that hard and you will get a wider support.

/odissey1 

 

 


Hi odissey1.  Thanks for trying to make it through my novel.  You missed a few details in skimming.

 

Executive Summary:

I am looking for best possible software PWM performance for PSoC 1.

I am hardware heavy in PSoC 1 ($$$$ in PCB's and PSoC's on hand).  While PSoC 1 might not be the best choice, my focus here is on making the best of what I have, so PSoC 1 guidance only.

I do not expect software PWM to handle servos.  Even in the best case, it is far too slow and does not replicate the precise duty cycles and frequency required to control servos.  But hardware PWM can handle a few servos so that's perfect.  I still want better software PWM for LED lighting effects, especially RGB color blending.

 

Project Goals:

I have designed and fabricated a pinball controller using multiple PSoC 1's.  To control a pinball machine, you need:

  • 16-24 Solenoid triggers (very basic & fast LED on/off signals to control a Power Driver)
  • 70+ LED outputs (On/Off for core functionality, PWM for advanced effects or RGB)
  • 0-8 Servo outputs (to move ramps, doors, toys, etc. - most pinball machines have few)

At this point, I believe I have mostly achieved my goals.  With 2x PSoC 1's dedicated to GPIO outputs, I can have 8 hardware PWM GPIO to handle servos, 24 GPIO allocated to solenoids, and still have 80 GPIO left over for LED's. 

The only real limitation at this time is that the software PWM frequency is still a bit slow to replicate dim lighting.  For PWM levels 7-63, my software PWM matches the effect of hardware PWM.  But for PWM levels 3-6, my software PWM remains flicker free but is too bright compared to hardware PWM.  And for PWM levels 1-2 there is too much flicker to be useful. 

But these limitations are because I am driving 52 software PWM simultaneously.  If you only need 8 outputs, then it works 7x faster and looks as good as hardware PWM.  Obviously my design is an unusual way to utilize a PSoC 1, and I didn't realize how challenging it would be to achieve the desired results.

 

Replying to your PSoC 4 Answer


@odissey1 wrote:

I will second others opinion that it may be safer to switch to PSoC4 now. I believe that to port the code is not that hard and you will get a wider support.

/odissey1 

 

Am I missing something?  I don't see any other responses to second, so I wasn't aware of any other opinions on this. Your answer was completely unexpected, and you certainly had me doing some research on pin compatibility (it's not pin compatible).

PSoC 4 is simply not an option.  I've already committed to a run of PSoC 1 based boards, and the PSoC 4 is not a drop in replacement, so I would have to throw my entire investment away.  That's not going to happen.

 

Is Improving PSoC 1 Software PWM Even Possible?

Regarding improving PSoC 1 software PWM performance, this is definitely possible.  When I first posted, I was achieving a 6-bit variable PWM frequency range of 7.5 - 233 Hz, and through code tweaks I have now increased this to 23.5  - 730 Hz when driving 52 software PWM simultaneously, so a more than 200% improvement.  I'll post a separate reply on how I reached this new benchmark.

For a user that only needs a few software PWM, you could drive a full Port's worth (8 GPIO) at 153 - 4700+ Hz.  That could free up hardware blocks for other purposes.  Of course, my code is only doing software PWM, and is not burdened by other tasks, so users doing other tasks in code would likely do worse.

Perhaps more is still possible.  While I'm really happy with what I have achieved, some of my code still looks computationally expensive, so I'm hopeful that smarter PSoC 1 programmers can offer suggestions.

 

Thanks again for taking the time to reply.

-Paul

Pauven

With some more code refinement, I was able to boost my effective worst case PWM frequency (at PWM Level 1/63) from 17.82Hz to 23.52Hz!  And the best case (at PWM Level 31/63) climbed from around 550 Hz to over 750 Hz.

Flicker free software PWM on 52 GPIO concurrently is now possible with 6-bit PWM levels as low as 3/63 (equivalent duty cycle of around 5%), with an equivalent frequency of around 70Hz. PWM level 2/63 still has noticeable flicker at 47Hz.  Brightness at low duty cycles is still an issue, with anything up to 6/63 (about 10% duty cycle) still abnormally bright compared to true hardware PWM.  7/63 is very similar to hardware PWM, and by 8/63 it appears to visually match the brightness of hardware PWM.

The biggest improvement was from changing a 2-dimensional const array, which holds the predefined PWM On/Off States,  from using full bytes for each value to using just a single bit for each value.  I really had low hopes for this change, as it looks computationally more expensive to get the value of each bit versus just grabbing the whole byte, but this was responsible for 95% of the improvement.  It's possible that this is really a side effect of freeing up memory, but I have no idea.

 

Code Tweaks

Before I had a constant array of On/Off PWM states for 64 PWM levels using bytes like this:

 

const bool PWMStates[64][63] = {
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
... skipping to PWM Level 31 ...
{0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1},
... skipping to PWM Level 63 ...
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} };

 

 

Even though I had configured that array to use Booleans, in C those are full bytes not bits, a wasteful excess to prevent potential software bugs.  So that 64x63 array is consuming 4032 Bytes.

Now I have the same data stored as bits, using just 8 bytes per PWM level:

 

const BYTE PWMStates[64][8] = {
{0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00},
... skipping to PWM Level 31 ...
{0x54, 0x55, 0x55, 0x55, 0xAA, 0xAA, 0xAA, 0xAA},
... skipping to PWM Level 63 ...
{0xFE, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF} };

 

 

And magically that change gave me the boost from 17.82 Hz to 23.25 Hz.  This might be from freeing up memory, as the new constant array only takes up 512 Bytes.  I can't see any other reason this would be faster.
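The byte-to-bit conversion can be checked with a short sketch. The bit layout below is inferred from the sample rows in this post rather than stated in it: flag i of the old 63-entry row is stored at bit position i + 1, so that a loop counting PWMCycle from 1 to 63 can index the packed row directly and bit 0 of byte 0 stays unused. The function names are made up.

```c
#include <stdint.h>

#define CYCLES 63

/* Pack one row of 63 byte-wide flags into 8 bytes. Flag i (0..62) lands at
   bit position i + 1, matching a loop that counts PWMCycle from 1 to 63;
   bit 0 of byte 0 stays unused. */
static void pack_row(const uint8_t flags[CYCLES], uint8_t packed[8])
{
    for (int b = 0; b < 8; b++) {
        packed[b] = 0;
    }
    for (int i = 0; i < CYCLES; i++) {
        int pos = i + 1;
        if (flags[i]) {
            packed[pos / 8] |= (uint8_t)(1u << (pos % 8));
        }
    }
}

/* Read the flag for a given PWMCycle (1..63) back out of a packed row. */
static int row_bit(const uint8_t packed[8], int pwm_cycle)
{
    return (packed[pwm_cycle / 8] >> (pwm_cycle % 8)) & 1;
}
```

Packing the Level 31 row shown earlier with this layout reproduces 0x54, 0x55, 0x55, 0x55, 0xAA, 0xAA, 0xAA, 0xAA, and an all-ones row gives 0xFE followed by seven 0xFF bytes, matching the packed table above.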

I was able to get a further boost to 23.52 Hz by turning on math optimizations in the project settings - previously this drastically hurt performance when using the larger Boolean array - I'm not sure why this helps now with the smaller Byte array, but it's free performance so I'll take it.

 

Rest of the Code

My main procedure just does a simple loop, 1 to 63, checking for new USB control data packets on each pass, and updating the 52 GPIO for software PWM for each PWM On/Off state according to the PWMStates array.  When new USB data arrives, my loop also updates the 4 hardware PWM states.  From my testing, checking for data seems to be computationally free as I see no impact on observed PWM frequencies, so I check this as often as I can.  With my current code, this check is now occurring around 1480 times per second, further reducing my USB data check latency to 0.68ms (actual USB latency is still to be measured but that's a separate issue).

 

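	while(1) {
		// One pass per PWM step; 63 passes make up one PWM period.
		for (PWMCycle = 1; PWMCycle <= 63; PWMCycle++) {
			// Has a new 56-byte report from the host arrived on endpoint 1?
			if ( (USBFS_bGetEPState(1)) == OUT_BUFFER_FULL) {
				bLength = USBFS_bReadOutEP(1, (char*)&LEDRpt, 56);
				USBFS_EnableOutEP(1);	// re-arm the endpoint for the next report
				// Pass the four hardware PWM levels on to their digital blocks.
				LED_01_PWM_WritePulseWidth(LEDRpt.PWMLev[ 0]);
				LED_02_PWM_WritePulseWidth(LEDRpt.PWMLev[ 1]);
				LED_31_PWM_WritePulseWidth(LEDRpt.PWMLev[30]);
				LED_32_PWM_WritePulseWidth(LEDRpt.PWMLev[31]);
			}
			UpdateAllPins();	// refresh the 52 software PWM outputs
		}
	}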

 

 

And here's part of my UpdateAllPins() procedure.  To keep the post shorter, I'm only showing a single Port PRT0DR, but the code actually processes all 7 Ports:

 

void UpdateAllPins(void) {
	int X    = PWMCycle / 8;	// byte within the packed PWMStates row
	int XBit = PWMCycle % 8;	// bit within that byte
	int P;				// mask selecting that bit
	switch(XBit) {
		case 0: P = 0x01; break;
		case 1: P = 0x02; break;
		case 2: P = 0x04; break;
		case 3: P = 0x08; break;
		case 4: P = 0x10; break;
		case 5: P = 0x20; break;
		case 6: P = 0x40; break;
		case 7: P = 0x80; break;			
	}

	PRT0DR = 0x01*!!(PWMStates[LEDRpt.PWMLev[44]][X] & P) + 0x02*!!(PWMStates[LEDRpt.PWMLev[51]][X] & P) + 0x04*!!(PWMStates[LEDRpt.PWMLev[45]][X] & P) + 0x08*!!(PWMStates[LEDRpt.PWMLev[50]][X] & P) + 
			 0x10*!!(PWMStates[LEDRpt.PWMLev[46]][X] & P) + 0x20*!!(PWMStates[LEDRpt.PWMLev[49]][X] & P) + 0x40*!!(PWMStates[LEDRpt.PWMLev[47]][X] & P) + 0x80*!!(PWMStates[LEDRpt.PWMLev[48]][X] & P);
	
	return ;
}

 

 

As you can see, there's a lot of hard-coded indexing into the LED report's PWM levels array, since I numbered my LED ports counter-clockwise around the chip's physical pinout instead of following the PSoC 1's internal port/pin layout.  While changing my physical port numbering to align with the PSoC's logical port numbering might allow me to use a simpler loop and reduce code, I'm not sure that would be faster.  From a hardware troubleshooting perspective, I like my current physical numbering.

Perhaps my code is as efficient as it can get, but it looks computationally expensive, and by my napkin math it is still taking about 312 CPU cycles per GPIO state update (I'm running the PSoC 1 at 24MHz and updating around 76,986 GPIO states every second, and 24,000,000 / 76,986 is about 312).

Part of what looks expensive to me is the double-negative, the !! around grabbing the bit value from the PWMStates array.  If I don't do the !!, I was getting the full bit value (i.e. 32, 64, 128) instead of just a 1 or 0.  The !! turns it into a True/False instead of the bit value, so it works for turning on each bit in the Port register, but that's a lot of math going on.

There's a simpler way to get the bit address for querying the PWMStates table, P = (1<<XBit), but I haven't tested it yet to see  if it will improve performance versus the current case statement.  That case statement only runs once per cycle, or around 1480 times per second, so any gain there will be small compared to potential gains from calculating the individual GPIO On/Off states.
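The two ideas in the last two paragraphs can be sketched together: compute the mask with a shift, and set each output bit with a conditional OR instead of multiplying by `!!`. This is a host-testable sketch rather than drop-in PSoC code. `port_byte` and its `pins` argument are hypothetical names, and whether it is actually faster on the PSoC 1 has to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* Build one port's output byte for the current PWM cycle.
   states : bit-packed table, 8 bytes per level (same layout as PWMStates)
   pins   : hypothetical map; pins[b] is the PWMLev index driving port bit b
   levels : current PWM level of every LED (LEDRpt.PWMLev in the post) */
static uint8_t port_byte(uint8_t states[][8], const uint8_t pins[8],
                         const uint8_t *levels, uint8_t cycle)
{
    uint8_t x, mask, out = 0, bit;

    assert(cycle < 64);
    x    = (uint8_t)(cycle >> 3);          /* same as cycle / 8 */
    mask = (uint8_t)(1u << (cycle & 7));   /* replaces the switch statement */

    for (bit = 0; bit < 8; bit++) {
        /* Any non-zero result means ON, so no !! or multiply is needed. */
        if (states[levels[pins[bit]]][x] & mask) {
            out |= (uint8_t)(1u << bit);
        }
    }
    return out;   /* the caller writes this once, e.g. PRT0DR = out; */
}
```

For PRT0DR the map would be {44, 51, 45, 50, 46, 49, 47, 48}, taken from the expression above. On the PSoC itself the inner loop would probably be unrolled by hand with constant masks (0x01, 0x02, ...), since variable shifts tend to be slow on small 8-bit cores.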

 

Next Steps

I've mostly exhausted everything I know to do and can think to try.  I will try that alternate way of grabbing the bit address instead of using that case statement. I may also test to see if aligning my USB data record with the PSoC 1's Port/Register layout provides any further performance benefits, though that seems unlikely.  But I also thought changing from Bytes to Bits in the PWMStates table was unlikely to improve performance, so that only proves I have to test to be sure.

But everything else looks about as optimal as I can get it.  If I'm not able to extract any further performance gains, I may simply turn off PWM levels below 7, since levels 1-6 have too much flicker or are simply too bright (the pulse width is too wide for accurate dimming).  This may even allow me to simplify some logic and the PWMStates array, and tweak out a bit more performance.

But I'm not a C programmer, so this has been a major struggle for me.  I'm still hopeful that someone can provide some tips for further optimizing the code or tweaking the PSoC1 configuration settings to further improve performance.  

 

-Paul

Pauven

I'm very close to a major performance breakthrough.  But first a recap of my performance improvements:

In earlier tests, I was using a 64x63 array of booleans to store predefined PWM On/Off states (wasting a full byte per state).  Hoping to reclaim memory but not expecting any performance gain, I changed to a 64x8 array of bytes and stored the On/Off states as bits.  Huge performance jump from 17.82 to 23.6Hz.  This was very surprising, because the extra steps to access the bit values looked computationally expensive, so perhaps performance was impacted by memory paging with the larger array.

I then tried a smaller 48x6 array of 5.5-bit PWM values.  This should have bumped performance to about 31Hz, but oddly performance dropped to 15.8Hz, almost exactly 50% of what should have been.  I disabled math optimizations and it climbed to 19.3Hz.  My best guess here is that this is some kind of weird harmonic anomaly.  I tested and troubleshot this for hours, but no change.  Keep in mind that this is just a simple software loop processing values as fast as possible, the PSoC doesn't actually know I'm simulating 48 levels of PWM, so it's not any kind of weird 5.5-bit PWM compatibility limitation.

Next I dropped down to 5-bit PWM, using a 32x4 array of bytes.  This time performance jumped back up to 49.7Hz.  Now, that's not actually twice as fast, as 5-bit PWM level 1 is the same as 6-bit PWM level 2, which was 47.0Hz before.  So this is a minor improvement.  The 5-bit PWM levels change 3% with every level, big enough steps to be noticeable to the human eye, especially in slow fades.

While the 48x6 array wouldn't fit in RAM, I was finally able to use the option to treat Const as RAM with the smaller 32x4 byte array.  And sure enough, performance jumped up to 51.6Hz.

And finally I re-enabled the math optimizations and topped out at 52.2Hz.  Considering that when I first posted this topic I was getting 16 Hz at 6-bit PWM level 2, this is a really nice performance boost.  But this might be the limit for fully on-board software PWM.

 

The Breakthrough

Looking at my code to set the PWM states individually per pin for all 52 GPIO, I couldn't help but think it was taxing the little PSoC 1 processor a bit too much.  Wondering what it could do without all my code in the way, I wrote a little hardcoded 5-bit PWM level 1 loop and let it run as fast as it could go:

for (PWMCycle = 1; PWMCycle <= 31; PWMCycle++) {
	if (PWMCycle % 31 == 0) {
		PRT0DR = 0xFF;
		PRT1DR = 0xFF;
		PRT2DR = 0xFF;
		PRT3DR = 0xFF;
		PRT4DR = 0xFF;
		PRT5DR = 0xFF;
		PRT7DR = 0xFF;
	} else {
		PRT0DR = 0x00;
		PRT1DR = 0x00;
		PRT2DR = 0x00;
		PRT3DR = 0x00;
		PRT4DR = 0x00;
		PRT5DR = 0x00;
		PRT7DR = 0x00;
	}
}

Amazingly, the little PSoC 1 achieved 417.5 Hz!!! And that's for all 56 GPIO!!!

This gave me a new idea, what if I could offload the PWM state calculations to the host PC?  In theory, I could put the logic to create the individual PWM on/off states in the DLL, so that my main pinball program sends the PWM levels to the DLL, which builds an array of 5-bit on/off states for all GPIO, and sends that over USB to the PSoC.  

Wondering what performance might look like, I made a minor change to my PSoC code.  I created a new 31x7 byte array for holding the 31 On/Off states for each of the 7 Ports.  Then I looped through them like this:

struct { BYTE PWMLev[31][7]; } LEDRpt;

for (PWMCycle = 0; PWMCycle <= 30; PWMCycle++) {
	PRT0DR = LEDRpt.PWMLev[PWMCycle][0];	
	PRT1DR = LEDRpt.PWMLev[PWMCycle][1];	
	PRT2DR = LEDRpt.PWMLev[PWMCycle][2];	
	PRT3DR = LEDRpt.PWMLev[PWMCycle][3];	
	PRT4DR = LEDRpt.PWMLev[PWMCycle][4];	
	PRT5DR = LEDRpt.PWMLev[PWMCycle][5];	
	PRT7DR = LEDRpt.PWMLev[PWMCycle][6];
}

There was a performance hit from using this 2-dimensional array, as performance dropped to 221.4 Hz.  Still, this is a huge improvement versus 52.2 Hz, so I felt this method had great potential.

I did some quick testing to see if I could use a larger array, hoping to get closer to 48x6 for 48 PWM levels.  But even a minor bump to 36x5 was too large for RAM.  Seems like 5-bit PWM is the best I can hope for.

In my test application, I ported over my PSoC C routine for building the PWM on/off states.  I loaded that data into an array, and sent that over USB to the PSoC.  One thing I was worried about was how much latency it would add to calculate all 217 bytes, so I added a timer to my code and was pleasantly surprised that it only takes about 0.005 ms to calculate them on my 10 year old laptop - that's right, only 5 µs, so fast it would have zero impact on my program speed.
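The host-side calculation described above can be sketched like this. It is written in C for consistency with the PSoC code (the actual test application is a separate program), and the function names, the `pin_port`/`pin_bit` map, and the even spacing of ON steps are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STEPS 31   /* 5-bit PWM: 31 time steps per period */
#define PORTS  7   /* PRT0..PRT5 and PRT7 */
#define PINS  56

/* 1 if a pin at `level` (0..31) is ON during `step` (0..30).
   ASSUMPTION: ON steps are spread evenly, `level` ON steps per period. */
static int step_on(unsigned level, unsigned step)
{
    return ((step + 1) * level) / STEPS != (step * level) / STEPS;
}

/* Build the STEPS x PORTS frame that the PSoC loop copies to the ports.
   pin_port[i] and pin_bit[i] give the port column (0..6) and bit (0..7)
   of LED i; this mapping is hypothetical and board-specific. */
static void build_frame(uint8_t frame[STEPS][PORTS],
                        const uint8_t levels[PINS],
                        const uint8_t pin_port[PINS],
                        const uint8_t pin_bit[PINS])
{
    unsigned step, pin;
    memset(frame, 0, STEPS * PORTS);
    for (step = 0; step < STEPS; step++) {
        for (pin = 0; pin < PINS; pin++) {
            assert(levels[pin] <= STEPS);
            assert(pin_port[pin] < PORTS && pin_bit[pin] < 8);
            if (step_on(levels[pin], step)) {
                frame[step][pin_port[pin]] |= (uint8_t)(1u << pin_bit[pin]);
            }
        }
    }
}
```

The result is the 31 x 7 = 217 byte array, laid out so that row `step` holds the seven port bytes the PSoC writes during that step.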

USB Data Transfer Problems

Argh, it's frustrating to be so close to a working solution yet have major roadblocks.  Originally I had the PSoC USBFS's Endpoint configured as INT.  While some data was making it from my application to the PSoC, it seemed to be getting corrupted at delivery, and I was getting very weird GPIO activations.

So I switched over to using BULK transfers.  I configured the PSoC's EP1 for BULK (Dir OUT, Interval 1, Max Packet Size 64).  In my program code I switched to LibUSB0's BULK write method:

TYPE
    tLEDRptV2 = packed record
      PWMLevels: packed Array[0..30] of packed Array[0..6] of Byte;
    end;

VAR
    LEDRpt: tLEDRptV2;

<code populates LEDRpt.PWMLevels with desired values>

if CIOCLED1Claimed then 
     USB_Bulk_Write(CIOCLED1.handle, 1, @LEDRpt, SizeOf(LEDRpt), 100);

I've triple-checked the data in the LEDRpt.  It's exactly 217 bytes, has the correct 31x7 byte array structure, and the data inside looks pristine.

If I flash all 56 GPIO on/off simultaneously, it looks perfect on the PSoC.  But if I try to flash just a single GPIO, crazy stuff happens.

For example, if I try to flash LED 03, which is PRT4DR 0x08, the PSoC instead flashes both PRT2DR 0x08 and PRT3DR 0x08!!!  This weird behavior continues for all pins, so the LED lighting animations I've programmed do not look right.

I don't know what I am doing wrong, but somehow I am messing up the USB bulk transfer.

I did double-check the PSoC 1's spec sheet for my chip, and it has a 256 Byte USB transfer buffer, so that should be more than enough for my 217 byte array.

Here's my USB data processing routine:

BYTE bLength;
WORD wCount;

if ( (USBFS_bGetEPState(1)) == OUT_BUFFER_FULL) {
	wCount = USBFS_wGetEPCount(1);
	bLength = USBFS_bReadOutEP(1, (BYTE*)&LEDRpt, wCount);
	USBFS_EnableOutEP(1);
}

Any suggestions?  I'm at a loss.  I've done BULK writes to other  USB devices before and never had an issue, so my guess is I've misconfigured the PSoC for BULK.  I've followed the tutorial example for BULK Ping, by the way.

My only guess is maybe the byte array is somehow transposed.  Even though I set it up as 31x7 in both my program and the PSoC code, perhaps it's actually treated as 7x31 in one of the systems so the data gets misaligned upon receipt.

I really want to get this solution to work, as it is achieving around 210 Hz refresh rates at 5-bit PWM level 1, very close to the hard-coded array's performance of 221 Hz.  Not only does this represent 4x better PWM dimming versus my previous attempts, but also theoretically reduces USB data transfer latency, an important benefit for solenoid response time on a pinball machine.

Paul

Pauven

Success!

In the bulk transfer examples I found, they always created a 512 byte buffer variable to pass the data into from the host.  I didn't even try this, as I knew the PSoC 1 couldn't handle an array so large.

But on a hunch, and since my PSoC 1 has a 256 byte USB buffer, I increased the size of my 31x7 2-dimensional 217 byte array to 256 bytes, 32x8.  I did this both in the PSoC's main.c and in my program.  And like magic, now it works!

I'm getting humongous frequencies out of the PSoC 1 now.  My scope is reporting a solid 712 Hz for PWM level 1/31, which seems abnormally high.  After all, I was only achieving 417 Hz when I hard-coded the solution, and 221 Hz when I hardcoded using the byte array.  Earlier host PC controlled testing, when things weren't working quite right, produced 210 Hz.  I don't see how my fix more than tripled those speeds.  I think it's safe to assume I'm really only getting around 210 Hz at PWM level 1.  That also means frequencies peak out in the 3000 Hz+ range for PWM level 15.  Amazing!

I think that means I still have a bug to work out.  Perhaps I'm sending the wrong PWM values from my source program.  It's also possible there's still some data transfer shenanigans going on.  Perhaps what I'm sending as PWM states for level 1 are somehow turning into level 4.  I can't fathom how, but it only seems impossible until you find the cause.

When I scope out PWM level 15, which should be a consistent 0101010101010101 ON/OFF pattern, I instead see something like 01010101010000000000000001010101010000000000, with bunching of ON/OFF values, followed by long gaps of OFF values.  That might be what's fooling my scope's Hz calculation.

I also have a fade animation that slowly goes from 1 to 31 in increments of 1, and I only see 7 very distinct brightness levels, with obvious big jumps and only on every 4th value.  This seems to be further confirmation that somehow my values are being passed incorrectly or factored incorrectly.

The good news is that my journey is near complete, and the final solution is at hand.  These PWM rates are smooth as butter now, for all 52 GPIO.  I'm still setting up the other 4 GPIO as hardware PWM for servo applications.

When I first started this project, I merely wanted at least 60Hz for all 56 LED outputs.  And when I first posted for help a little over a week ago, I was worried I was stuck looking at 8Hz values.  Now I've significantly exceeded my own goals.  Hooray!

And come to think of it, those extra 39 bytes that I had to add to the array will come in useful for passing config parameters and servo PWM values separately from the 217 PWMState bytes.

 

Pauven

I got a little closer on debugging my USB data transfer issue.  I added debug lights (convenient, since I have created an LED controller with 56 outputs...), and here's what's happening:

My source program sends the 256 byte array over USB.

My PSoC program sees there is data and loads the first 64 bytes into the array, then continues processing the loop.  

Then on the next pass, it loads the next 64 bytes into the array, overlaying the first 64 bytes, then continues processing the loop.

What I don't see is it ever processing bytes 129 - 256.

This is why my PWM behavior looks like 3-bit PWM with only 8 values, I've lost 75% of the data.

Here's my current code to grab data off the USB bus, not sure how to fix this:

if ( (USBFS_bGetEPState(1)) == OUT_BUFFER_FULL) { 
	wCount = USBFS_wGetEPCount(1);
	bLength = USBFS_bReadOutEP(1, (BYTE*)&LEDRpt, wCount);    	
	USBFS_EnableOutEP(1);
}

EP1 is configured as BULK, and I'm doing a BULK write from my host PC.

Any ideas?

Pauven

USB Data Transfer Bugs Fixed

I worked around my USB data transfer issues, though I feel my solution is a bit of a hack and maybe not the preferred methodology.

The problem was that I needed to receive and reassemble 4 packets of 64 bytes into a 256 byte 2-dimensional array.  I got really close by looping through bReadOutEP 4 times and loading the data into the array using offsets, but I was still getting data corruption, most likely caused by packets arriving out of order.  The solution to this is probably using control bytes in the data to assist in re-assembly, but this wasn't a direction I was too eager to try.  

Besides, I scoured the internet and this forum for hours looking for any real-world examples of this technique, and could find none.
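For anyone searching for such an example later, here is a minimal sketch of offset-based reassembly on a single endpoint. It is not the code used in this project. It assumes that packets on one endpoint arrive in the order sent (which is how USB pipes are specified to behave) and that a 256 byte transfer is split into 64 byte packets. The type `frame_rx_t` and the function `frame_rx_push` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 256   /* full report size used in this thread */
#define MAX_PACKET  64   /* endpoint max packet size */

typedef struct {
    uint8_t  data[FRAME_SIZE];
    uint16_t fill;            /* bytes received so far in this transfer */
} frame_rx_t;

/* Append one received packet at the current offset.
   Returns 1 when a whole frame has been assembled, 0 if more packets are
   expected, and -1 if the transfer was malformed (state is then reset). */
static int frame_rx_push(frame_rx_t *rx, const uint8_t *pkt, uint16_t len)
{
    assert(rx != NULL && pkt != NULL);
    if (len > MAX_PACKET || rx->fill + len > FRAME_SIZE) {
        rx->fill = 0;                  /* oversized: drop and resync */
        return -1;
    }
    memcpy(&rx->data[rx->fill], pkt, len);
    rx->fill = (uint16_t)(rx->fill + len);

    if (rx->fill == FRAME_SIZE) {      /* frame complete */
        rx->fill = 0;
        return 1;
    }
    if (len < MAX_PACKET) {            /* short packet ended the transfer early */
        rx->fill = 0;
        return -1;
    }
    return 0;
}
```

On the PSoC, each packet could be read into a 64 byte scratch buffer with the same USBFS_bReadOutEP call shown earlier and then passed to `frame_rx_push`. The port loop would only switch to the new data once the function returns 1. The early reset on a short packet is a design choice for resynchronising after a dropped or truncated transfer.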

My hack was that I realized that my PSoC 1 has 4 Endpoints, so I split my data array into 4 smaller 64 byte arrays, and send each on a separate Endpoint.  Luckily I don't need any more Endpoints for anything else, as this solution is working perfectly.

 

Final Results

With the data transmission bugs resolved, for the first time, I am seeing all 31 brightness levels on my LEDs!  And my scope is showing perfectly formed and spaced duty cycles.  Perfection!!!

And was I able to achieve that hoped for 210Hz refresh rate?  Not exactly... I blew it away!  For PWM Level 1 the scope sees a very constant 780 Hz!!!  This climbs by 780Hz per level all the way to 11,700 Hz at PWM Levels 15 & 16, after which it gradually decreases back to 780Hz at PWM Level 30, before going to solid ON (no Hz) at Level 31!!!
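These figures are consistent with a simple model: if the ON steps are evenly spaced, a pin produces min(level, 31 - level) pulses per 31-step period. The sketch below expresses that relationship. It is inferred from the numbers in this post rather than taken from the PSoC code, and `pulse_rate_hz` is a hypothetical helper.

```c
#include <assert.h>

/* Pulses per second seen on one pin for a 5-bit level (0..31).
   base_hz is the measured level 1 rate (780 Hz in this post). */
static unsigned pulse_rate_hz(unsigned level, unsigned base_hz)
{
    unsigned off;
    assert(level <= 31);
    off = 31u - level;   /* number of OFF steps per period */
    return base_hz * (level < off ? level : off);
}
```

With base_hz = 780 this gives 780 Hz at levels 1 and 30, 11,700 Hz at levels 15 and 16, and 0 at level 31 (solid ON). It also implies the port loop is running at roughly 780 x 31 = 24,180 steps per second.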

I don't know why these frequencies are even faster than my hardcoded tests, but I'm not complaining one bit.

That's still not good enough for servos, so I'm still using 4 digital blocks for hardware PWM, so 4 of the 56 LED's are running true PWM at 48MHz for servo control.  

What's most amazing to me is that the 52 software PWM LED's now look identical to the hardware PWM's at all 5-bit brightness levels, and everything is perfectly flicker free!

Can you tell I'm quite excited?!?!  I had hoped to achieve 60Hz minimums, and somehow I've achieved 13x better frequencies!  And the latency from the host PC is the best I've observed yet, easily under 40ms and probably closer to 10ms.

Hopefully this novel of data I've posted here will help some other users in the future.

 

Executive Summary

By pre-calculating on the host PC the On/Off states for all GPIO to simulate 5-bit PWM (32 brightness levels), and sending the new array of On/Off states over USB every time there is a change, you can get software PWM for LEDs that rivals hardware PWM on a PSoC 1, with minimum frequencies of 780Hz (if your PSoC isn't busy doing other things).

EDIT:  To be clear, this approach is technically Pulse-Frequency Modulation (PFM), as the pulse width is fixed based upon the loop speed, and the frequency is adjusted by controlling the spacing to the next pulse.

This does require a constant connection to a host PC, so this won't be feasible for all applications.  And 5-bit PFM is the limit for controlling all 56 GPIO of the top-end PSoC 1,  as the state array for all 56 GPIO is 256 bytes, and there is not enough RAM to hold larger arrays for 8, 7 or even 6-bit PFM.

If you don't need 56 PWM outputs, scaling back the solution actually improves performance and PWM bit-depths.  For example, for only 8 software PWM outputs (one full Port), you should see minimum frequencies 7x faster (7 x 780Hz = 5,460Hz), and you can use full 8-bit PFM as that array size is also 256 bytes.
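The array sizes quoted in this summary follow from simple arithmetic: one byte per port for every time step. A small sketch, assuming (2^bits - 1) steps per period as in the 31 x 7 array above, with `frame_bytes` as a hypothetical helper name:

```c
#include <assert.h>

/* Bytes needed for one frame of on/off states: one byte per port for each
   time step. A depth of `bits` gives (2^bits - 1) steps per period. */
static unsigned frame_bytes(unsigned bits, unsigned ports)
{
    assert(bits >= 1 && bits <= 8);
    return ((1u << bits) - 1u) * ports;
}
```

So 5-bit on 7 ports needs 217 bytes (padded to 256 here), 6-bit on 7 ports would need 441 bytes, and 8-bit on a single port needs 255 bytes (again about 256).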

If you don't have a host PC to calculate the On/Off states, then the PSoC 1 can at best achieve about 52Hz for 52 GPIO if it has to do all calculations itself, which increases to about 350Hz if you only need software PWM on 1 Port of 8 GPIOs.  At 350Hz, you will get accurate LED dimming, nearly identical visually to hardware PWM.

Note that software PWM is not accurate enough for servo control, which requires precise frequencies and duty cycles that this approach can't replicate.  

odissey1

Paul,

It is possible to overclock PSoC by up to 50%, which should improve PWM speed.

0 Likes

Thanks Odissey1, but how would I overclock the PSoC 1? 

I just googled and searched this site, and can find no reference to overclocking them.

I've also maxed out the SysClk in PSoC Designer 5.4, the highest setting being 24MHz.

Since I've turned down every setting I don't need for my application, even with it running at 24MHz in a non-stop loop as fast as possible, I can't detect any heat increase from the PSoC package.  Certainly seems like there is some safe margin for overclocking. 

Though I do have concerns about shortening the lifespan of the chip.  I have no prior experience with PSoC 1's, and don't know how reliable they are.  My goal is to have a finished product that lasts at least 10+ years, running for 12 hours a day or more.  I'm close enough to my goals that I don't want to take unnecessary risks.

0 Likes

One way to boost the frequency is to fit an external XTAL of, say, 12MHz, but tell the PSoC that it is 8MHz. Then, when the PLL is set to 24 MHz, it will actually produce 36MHz. I tested this with PSoC5, and was able to overclock it to 114MHz.

That's absolutely brilliant.

But I designed my PCB around using the internal clock source, both for simplicity and cost savings.  I'll keep this in mind for any future revisions, but that's a long-long way off.  Got to work through my current inventory first.

I would really appreciate any help on the 217 byte USB bulk data transfer issue I described above in my most recent post.  I was able to verify it wasn't a 2-dimensional array translation issue - I see the multiple pins lighting up issue either way, and many of the LED animations are visually worse if I flip the array around in my source program.  Seems I already had it correct.

0 Likes

There is one more way to cheat PSoC. When setting the PLL output frequency using Cypress IDE it automatically selects P and Q dividers to limit the output not to exceed the maximum allowed frequency (e.g. 24MHz). After code execution has started, however, it is possible to apply custom P and Q values, producing higher PLL frequency output, as the associated API does not check for frequency limit at this point. I can't check that on PSoC1, but it works on PSoC5.

DennisS_46
Employee

Paul:

Questions first.

1. You apparently selected CY8C24x94, the one part with only 4 digital blocks but 56 I/Os. I guess you picked it for the USB connection. This absolutely limits you to software PWM. If you had selected CY8C29x66, you could have 16 hardware PWMs and only have to do 40 in software, but then you don't have USB. You can use one of the PWMs as an interrupt trigger to start all of your PWM calculations. I would keep a pin free to use as a timing test toggle.

2. Software speed: Data retrieval out of RAM is faster than out of Flash; that said, there isn't enough RAM to execute your 65x65 table lookup. It is asserted by software engineers (particularly compiler writers) that C compilers make better, faster code than hand-written assembly. BS. Carefully written assembly with a minimum of function calls will be faster than C. (Opinion of this old-school analog engineer who has done hundreds of PSoC projects)

3. Writing the Port(x) DR port-wise will always be faster than individual pin writes. My recommendation is to calculate each individual port/pin value, then concatenate the DRs and do a single write. Pragmas execute the same as macros, they still spend a lot of time jumping around. Speed of hardware PWM is moot, you only have 4.

4. All PSoC registers are 8 bit, data is 8 bit, commands are 8 bit, there is no way to address multiple registers simultaneously.

5. Global resources are fine. I saw a comment in one of the answers to try to overclock the system at 36 MHz, "because it worked on PSoC5." PSoC5 will run to 100 MHz by spec, so the writer suggested 14% overclock .... might work. This absolutely will not work on PSoC1. The IMO and the logic train run at 24 MHz max, not with 50% over-clock. Pushing more than a few % past 24 MHz will not work reliably. Period.

6. My inclination is to check the USB input data once per scan of 56 PWMs. You don't need faster than that because you aren't changing the PWMs any faster than that. To check PWM write time, test it 3 ways:

a. continuous code loop: write PortDR_(0:7). 
b. separate lines of code: write PortDR(0.0), write PortDR(0.1) . . . . write PortDR(0.7)
real men write assembly:
c.    mov A,[expr for port0.0]
       mov reg[add for port0.0], A
       and so on for ports 0.1 .. 0.7
Set up a toggle and test execution time for each. My guess is that c (not C code) is faster.

Device TRM will list the execution time for each op code.

Finally:
You can check the list file to see how efficiently the compiler is converting to assembly.
If you want direct help and review, contact me at dennis.seguine@infineon.com.
I was the first PSoC applications manager starting in 2000; I like PSoC1 because of the analog capability. I wrote  most of the analog user modules, I wrote the block to gpio auto-router, I did the first cap-sense demo. I wrote the spec for the CY8C28xxx, I wrote the spec for the PSoC3/5 digital filter block, I've done most of the analog training for PSoC. I'm happy to help with the way-back machine on PSoC1.