PSOC 4 inline assembly ARM or Thumb?

Anonymous
Not applicable

Hi folks,

   

I'm trying to embed some inline assembly in my project to work around areas where I can't get the compiler and optimizer to do what I want.

   

I pulled the lst file for the main loop, and I can see a couple of places that I want to tweak, to save a few precious clock cycles. What has me confused is that the lst formatting looks like ARM assembly, but AN89610 (Code Optimization document for PSOC 4) says that PSOC 4 uses Thumb 2.

   

Is there a way to choose ARM or Thumb?

   

I modified the main loop portion of the lst code, and swapped it in within asm(" .... "); The result is a "Cannot represent THUMB_OFFSET relocation in this object file" error during build. There's a referenced .s file and line number, but I can't find the actual file. I'm guessing it's just temporary during the build process....? 

   

At this point, I'm getting almost the performance I need from my solution in C, but I have tried many variants, and only been able to find options that don't work. I do think I need the extra savings I can get from using assembly in the critical sections.

   

Thanks for the help,

   


Edit: I think I see the cause of the error: I missed that an array I'm using is represented as a label within the assembly block. I'll need to fix the assembly to correctly address the array. I'm still rather confused about the ARM vs. Thumb stuff, though.

Paul

0 Likes
15 Replies
Anonymous
Not applicable

OK. Here's what I'm seeing now.

   

The label I had overlooked (.L61 in my project) is the base address for the addresses of the port status and data registers. Some of these are pre-loaded into registers before the block of code that I'd like to modify, but a couple of them are read "on the fly" (.L61+16 or .L61+20).

   

Any recommendations on accessing the port registers reliably through inline assembly? I would think even the pre-loaded ones should be something I'm not counting on.

0 Likes
cadi_1014291
Level 6

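I think on Cortex-M devices you can only use the Thumb instruction set (the ARM instruction set belongs to other cores, such as the older ARM7/ARM9 parts). Note that the PSoC 4 core is a Cortex-M0, which is ARMv6-M rather than ARMv7-M, so it supports the 16-bit Thumb instructions plus a small number of 32-bit Thumb-2 ones.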
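A quick way to check which instruction set the compiler is actually targeting is to look at GCC's predefined macros. The sketch below is only an illustration (the function name `target_instruction_set` is made up for this example). On a PSoC 4 build it should take the `__thumb__` branch; on a desktop compiler it falls through to the last branch.

```c
/* Report which instruction set this translation unit is compiled for.
   __thumb__ and __arm__ are macros that GCC predefines on ARM targets. */
const char *target_instruction_set(void)
{
#if defined(__thumb__)
    return "thumb";      /* Cortex-M parts always end up here */
#elif defined(__arm__)
    return "arm";        /* 32-bit ARM core compiled in ARM state */
#else
    return "not-arm";    /* e.g. a desktop build */
#endif
}
```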

   

Labels like .L61 seem to be literal pools, and I think using literal pools is the best way to access registers. You cannot load an arbitrary 32-bit immediate value with a single Thumb instruction, so the constant has to come from somewhere else. That is what the ldr rX, =value form does under the hood: it is a pseudo-instruction, not a real instruction, and the assembler places the constant in a literal pool and emits a PC-relative load from it.

   

Here are some useful links:

   

https://community.arm.com/docs/DOC-7869

   

http://www.ethernut.de/en/documents/arm-inline-asm.html

   

 

   

I'm trying to learn inline asm too. Remember to use asm volatile ( ... ); that way the optimizer will not discard your inline asm.

   

Hope it helps

0 Likes
Anonymous
Not applicable

Thanks.

   

I started with the ethernut cookbook before posting here. It seemed helpful, but it appears to contradict the inline assembly comments in http://www.cypress.com/file/46521/download.

   

According to the Cypress documentation, something like this:

   

            "ldr r4, =CYREG_PRT2_PS\n"    //read address high
            "ldr r3, [r4]\n"

   

Should work, but, in fact, causes a no-information compiler failure. That's the piece I'm trying to figure out at the moment.
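One likely explanation: the C preprocessor does not expand macros inside string literals, so the assembler sees the bare name CYREG_PRT2_PS as an undefined symbol instead of a numeric address. A way around this is to pass the address in as an operand and let the compiler load it. The sketch below assumes GCC extended asm; the helper name `read_reg32` is hypothetical, and the non-ARM branch is only there so the function can be tried on a desktop.

```c
#include <stdint.h>

/* Read a 32-bit register through an address supplied as an asm operand. */
uint32_t read_reg32(volatile uint32_t *addr)
{
    uint32_t value;
#if defined(__arm__)
    /* The compiler chooses low registers for both operands ("l")
       and loads the address before the asm statement runs. */
    __asm__ volatile ("ldr %[val], [%[reg]]"
                      : [val] "=l" (value)
                      : [reg] "l" (addr)
                      : "memory");
#else
    value = *addr;   /* portable fallback for non-ARM hosts */
#endif
    return value;
}
```

On the target, the call would then be something like read_reg32((volatile uint32_t *)CYREG_PRT2_PS), where the macro is expanded by the preprocessor because it sits in ordinary C code.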

0 Likes
Anonymous
Not applicable

Then again, I'm not too sure about the Cypress doc. I tried this example from it (page 15), and it won't even compile, let alone build:

   

    int foo = 5L;
    int bar;
 
    bar = foo + 1;
 
    /* bar = foo + 1 */   
    asm("LDR r0, =foo\n"        
        "LDR r1, =bar\n"        
        "LDR r2, [r0]\n"        
        "ADD r2, r2 #1\n"        
        "STR r2, [r1]"); 

   

Removing the extra(?) r2 in the ADD line allows it to pass the first pass of the compiler, but still fails with the same general error as I was getting from my code.
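Two things may be going on in the quoted example. The ADD line looks like it is missing a comma (ADD r2, r2, #1). Also, if foo and bar are local variables, as they appear to be here, they have no symbol the assembler can see, so LDR r0, =foo cannot be resolved. A hedged sketch of the same "bar = foo + 1" using operands instead of names follows. The function name `add_one` is made up, and the ARM branch assumes the assembler accepts the same non-unified mnemonics used elsewhere in this thread (for example mov r2, #125).

```c
/* Compute foo + 1 with a single ADD; the compiler picks the registers. */
int add_one(int foo)
{
    int bar;
#if defined(__arm__)
    __asm__ ("add %[out], %[in], #1"
             : [out] "=l" (bar)    /* result register, chosen by GCC */
             : [in] "l" (foo)      /* input register, chosen by GCC */
             : "cc");              /* the 16-bit ADD updates the flags */
#else
    bar = foo + 1;                 /* portable fallback for non-ARM hosts */
#endif
    return bar;
}
```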
 

0 Likes
Anonymous
Not applicable

This almost works:

   

        asm(
            "ldr r3, [%[datawrite]]\n"  //Get address of data write function
            "mov    r2, #125\n"
            "str r2, [r3]\n"  //write data bus from r2
            :
            :   [addhigh] "l" (Pin_Address_High_PS),
                [addlow] "l" (Pin_Address_Low_PS),
                [datawrite] "l" (Pin_Data_DR),
                [dataread] "l" (Pin_Data_PS),
                [pinrw] "l" (CYREG_PRT0_PS),
                [buffer] "l" (buffer)
            );
 

   

It compiles, and builds, and programs, but it doesn't actually set the 125 value on the data bus. However, the resultant lst output does look pretty similar, so I think I'm close to getting this right.

   

On the other hand, one difference I observed between what's in the lst files and what will compile is f after the label of a forward jump. The generated assembly in the lst files had this, but it caused a compile failure when I used it in my code.

0 Likes
Anonymous
Not applicable

OK. This is not making much sense. I tried really simplifying the C code, to compare the lst output to the lst output from my really simplified inline assembly. To my eye, the resultant lst content looks logically equivalent, but the C works, and the inline assembly does not. By "works", I mean that the C writes 125 to the register associated with Pins_Data, and the assembly leaves that as all high values (255).

   

The two results (source and lst) are attached (because of the Cypress spam filter that seems to fire whenever too much example code is embedded in a post).
 

0 Likes
Anonymous
Not applicable

I got the really simple case working: writing a byte to a pin set:

   

         asm(
            "mov r3, %[datawrite]\n"  //Get address of data write function
            "mov    r2, #125\n"
            "str r2, [r3]\n"  //write data bus from r2
            :
            : [datawrite] "l" (&Pin_Data_DR)
            );

   

The optimizer still messes it up a bit, but it works. Now just to figure out all the other issues....
 

0 Likes
Bob_Marlowe
Level 10

Well, Paul, I can assure you that it is a challenge to be better than the GCC optimizer! Did you try to set (you can do that on a .c file basis) the optimization level to "speed" or "size"? There are even other settings to try mentioned in the GNU compiler manual.

Bob

0 Likes
Anonymous
Not applicable

Yes. It's built using speed optimization. I also had to add some noinline optimizer hints to keep the optimizer from really messing up parts of it.

   

The inline assembly is to deal with areas where an if/else would be more efficient than the current code, but the optimizer won't accept it. The optimizer is also doing things like checking an if condition both at the top and bottom of a block of code.

0 Likes
Anonymous
Not applicable

OK. So, the current piece I'm fighting with is:

   

1. Variables passed through using symbolic names (or position identifiers) in the input section get mapped automatically with ldr statements into registers the compiler picks

   

2. I don't see a way to control these ldr commands or predict which registers will be used for which variables

   

3. My code picks up with the mov commands to put the variable addresses into particular registers

   

4. The compiler doesn't care which registers I've selected.

   

The current state of things is that I want to use r3 and r4 for the variable addresses, but the compiler uses r2 and r3 for its ldr commands, and then executes my mov commands in such a way that r2 overwrites r3 before I ever get a chance to use it.

   

So, the main question I have is, is there a way to know or to symbolically use the registers that the compiler is going to select for its ldrs? 

   

Edit: If I look at the lst code, and then swap my selected registers to match what the compiler picked for its ldr targets, things work (although there's a pointless mov r3, r3); but this seems like the wrong way to do things, and likely to break easily when any code is changed.

   

Edit 2: Never mind. I figured it out. I was using the mov commands because the ARM Thumb2 documentation was pretty clear about needing to first load a variable address into a register before you could do anything with it, but the compiler is actually doing that for me, so the mov commands are not needed. The correct simple example is this:

   

         asm(
            "mov    r2, #125\n"
            "str r2, [%[datawrite]]\n"  //write data bus from r2
            :
            : [datawrite] "l" (&Pin_Data_DR)
            );

   

The compiler will pick a register for &Pin_Data_DR, add an ldr into it, and then will substitute that register into the str command.
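The same idea can be taken one step further: the example above still writes r2 by hand without listing it as a clobber, so the compiler does not know that register was changed. Passing the value as an operand too avoids naming any register at all. This is a minimal sketch assuming GCC extended asm; `write_reg32` is a hypothetical helper name, and the non-ARM branch exists only so it can be tried off-target.

```c
#include <stdint.h>

/* Store a value to a register address; GCC allocates both registers. */
void write_reg32(volatile uint32_t *addr, uint32_t value)
{
#if defined(__arm__)
    __asm__ volatile ("str %[val], [%[reg]]"
                      :                            /* no outputs */
                      : [val] "l" (value),         /* data to write */
                        [reg] "l" (addr)           /* destination address */
                      : "memory");                 /* the asm writes memory */
#else
    *addr = value;   /* portable fallback for non-ARM hosts */
#endif
}
```

On the target, the call would be something like write_reg32(&Pin_Data_DR, 125), matching the operand used in the example above.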

0 Likes
Anonymous
Not applicable

However, I'm still having problems with the optimizer (even with volatile) doing annoying things, like using r3 for a variable address when I'm using r3 in my loop. The address gets overwritten on the first iteration of the loop.

   

Edit: I've tried working around this by using the registers the compiler is not taking, but this isn't working out. The compiler is only leaving me r6 and r7 to work with. I need more than two registers. I would have thought push and pop would help, but, when I add a push call, the code beyond that call stops behaving correctly, and there's no indication why in the lst code.

   

In an overlapping problem, I can't figure out the correct notation for addressing an array in the inline assembly. The code to read a byte from the array is an ldrb with an offset, e.g.

   

ldrb r6, [%[buffer], r7]

   

Does buffer link to &buffer, or buffer, or buffer[0], or &buffer[0], or ....?

   


Edit 2: Though push and pop don't seem to work, moving low registers into high registers, e.g. mov r8, r4 , does work.

   

As for the buffer address: the compiled C code is doing something with the stack pointer, inside the assembly block for its own code and well outside of it for my code. It's difficult to see where it's pointing, but at least, now that I have a solution for the register shortage, I can try a bunch of things to see if one of them works.

0 Likes
Anonymous
Not applicable

OK. Pretty well stuck on this part. I've searched, but have only really found people facing the same problem with unhelpful answers, or no answers.

   

I can pass a regular variable value into a register using something like MOV r6, %[buffer] where [buffer] is mapped to a regular variable (test) or a particular value in the array (buffer[0]). I can't find the correct way, though, to pass &buffer into the assembly so I can index it with ldrb.

0 Likes
Bob_Marlowe
Level 10

ARM is a RISC design. You need to define a location containing the base address, load it into a register and then issue your ldrb instruction.

Bob

0 Likes
Anonymous
Not applicable

Do you have an example of that?

   

For instance, this works:

   

uint8 test = 125;

   

asm ("mov r6, %[test]\n" : : [test] "l" (test));

   

This does not:

   

uint8 test = 125;

   

asm ("ldrb r6, [%[test]]\n" : : [test] "l" (&test));

   


The second example seems like it should work, but it gets 0 into r6 instead of the value of test. I also tried adding uint8 *testptr = &test and passing testptr into [test]. No difference.

   

------------------------------------------------------------

   

And... the answer is.... Clobber list!!!!

   

Even though I couldn't see any reordering or clobbering happening in the lst files, the final code was apparently doing other things than what the lst shows. I had to add everything to the clobber list, and then I started getting correct results.

   


In short, things look like this:

   

....

   

            "ldrb   r7, [%[buffer], r6]\n"  //read byte from buffer at offset r6
            "str r7, [%[datawrite]]\n"  //write data bus from r7
...

   

            :
            :   [addhigh] "l" (&Pin_Address_High_PS),
                [addlow] "l" (&Pin_Address_Low_PS),
                [datawrite] "l" (&Pin_Data_DR),
                [dataread] "l" (&Pin_Data_PS),
                [pinrw] "l" ((reg32*)CYREG_PRT0_PS),
                [buffer] "l" (buffer)
            : "r6", "r7", "r8", "cc", "memory"
            );

   

With the key difference being my calling out that I may modify r6, r7, r8, flags, and memory in my code.
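For a single indexed byte read, the clobber list can shrink if every register is an operand rather than a hard-coded name: only "memory" is left to declare, which tells the compiler the asm reads data it cannot see. The sketch below assumes GCC extended asm; `read_buffer_byte` is a hypothetical helper, and the fallback branch is there only so it can be run on a non-ARM host.

```c
#include <stdint.h>

/* Read buffer[offset] with ldrb; no fixed registers, so none to clobber. */
uint8_t read_buffer_byte(const uint8_t *buffer, uint32_t offset)
{
    uint32_t byte;
#if defined(__arm__)
    __asm__ volatile ("ldrb %[out], [%[buf], %[off]]"
                      : [out] "=l" (byte)          /* loaded byte, zero-extended */
                      : [buf] "l" (buffer),        /* base address of the array */
                        [off] "l" (offset)         /* byte offset into the array */
                      : "memory");                 /* the asm reads memory */
#else
    byte = buffer[offset];   /* portable fallback for non-ARM hosts */
#endif
    return (uint8_t)byte;
}
```

Passing the array name (buffer) as the operand gives the asm the address of its first element, which is what the register-offset form of ldrb expects as a base.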
 

0 Likes
Anonymous
Not applicable

And, at the end(?) of all this, I have some assembly that's more compact and (on paper) more efficient than what the compiler produced... yet it doesn't run as fast. I think the compiler is cheating (by not sharing what it's "really" doing). 😉

   

Actually, there are a couple of things it's doing that I couldn't get to work, that might make the compiler faster. For instance, the C code uses a status register with hardware logic to check for chip select (1000 in upper four bits of address), and I'm doing that in assembly by ANDing and CMPing. Probably three clock cycles for that... though I can't believe that would be much faster by reading the status register. The other thing the compiler can do that I can't is use a pool of addresses so it doesn't have to pre-allocate all the variable addresses to registers... on the other hand, I was able to code my stuff to just use the two available low registers (r6 and r7) and having the addresses pre-loaded in the other register I would have thought would make things faster.
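For reference, the AND-and-compare chip select test described above can be written in C as a single mask and comparison. This is only a sketch under assumptions: the function name `chip_selected` is made up, and it assumes the upper address lines arrive as one 8-bit value whose top four bits must be 1000.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: address_high holds the upper address byte, and the chip
   is selected when its top four bits are 1000 (binary). */
bool chip_selected(uint8_t address_high)
{
    /* Keep only the upper nibble, then compare it against 1000 0000. */
    return (address_high & 0xF0u) == 0x80u;
}
```

Comparing the lst output of a function like this against the hand-written AND/CMP sequence may show where the extra cycles go.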

   

So, overall, I'm a bit mystified that my code runs slower than the compiled C. I don't think I can squeeze any more performance out of the C... although, maybe I can use the lst file as a tool to get a better idea of impact of some other code changes and see if I can trick the optimizer into something that runs just a little bit faster. What I have now is C code that can serve up bytes fast enough for the 1.3MHz pocket PC processor to run machine language code from it... for anywhere from 10 to maybe 30 seconds before glitching. The assembly code won't even allow the first ML call to start. I'm actually astounded that the C code can run as long as it does before crashing... that's probably a few million bytes fetched before one is misfetched... meaning that the current performance must be just on the cusp of what's needed.

0 Likes