How to improve code efficiency

Anonymous · ‎Mar 02, 2016

Answer:

Using the 20-bit address mode
Option specification when using the division instruction (div step)
Adjust the number of local variables so that the stack use of a function does not exceed 512 bytes
Avoid using a lot of signed 1-byte/2-byte data
Control of loop-unrolling optimization
Review of the necessity of inline expansion
Control of standard library expansion
Others

Using the 20-bit address mode

Generally, the FR performs an operation on memory data in the following 3 steps.

Set the memory address in a register
Load the data into a register
Perform the operation

In particular, when many external variables are used, code size can become large because many instructions that load a 32-bit address are generated.

[C source]	[In case of FR]
a=b+c;	LDI:32	#_b, R12
	LD	@R12, R0
	LDI:32	#_c, R12
	LD	@R12, R1
	ADD	R1, R0
	LDI:32	#_a, R12
	ST	R0, @R12

Therefore, when the code or data can be located in RAM/ROM within the 20-bit address space (0x0 to 0xFFFFF), setting the 20-bit address mode (-K shortaddress option) is recommended. If such a location is impossible, change external variables to local variables where possible.

[C source]	[default]		[-Kshortaddress specified]
a=b+c;	LDI:32	#_b, R12	LDI:20	#_b, R12
	LD	@R12, R0	LD	@R12, R0
	LDI:32	#_c, R12	LDI:20	#_c, R12
	LD	@R12, R1	LD	@R12, R1
	ADD	R1, R0	ADD	R1, R0
	LDI:32	#_a, R12	LDI:20	#_a, R12
	ST	R0, @R12	ST	R0, @R12
	----------------		----------------
	26 bytes		20 bytes
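
Where the 20-bit address mode cannot be used, the suggestion of replacing external variables with local variables can be sketched as follows. This is only an illustration: the names samples, total, sum_external and sum_local are hypothetical, and the actual saving depends on the compiler and its optimization level, since an optimizing compiler may already keep the value in a register.

```c
#define SAMPLE_COUNT 4

/* Hypothetical external variables, used only for illustration. */
int samples[SAMPLE_COUNT] = {1, 2, 3, 4};
int total;

/* Every update names the external variable total, so the compiler may
   have to load its address and access memory on each iteration. */
void sum_external(void)
{
    int i;

    total = 0;
    for (i = 0; i < SAMPLE_COUNT; i++) {
        total += samples[i];
    }
}

/* The running sum is kept in a local variable and the external
   variable total is written only once, at the end. */
void sum_local(void)
{
    int i;
    int sum = 0;

    for (i = 0; i < SAMPLE_COUNT; i++) {
        sum += samples[i];
    }
    total = sum;
}
```

Both functions leave the same value in total; only the number of accesses to the external variable differs.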

Option specification when using the division instruction (div step)

The FR has a division step (div step) instruction for division. However, one division takes 36 instructions, so each division expanded this way produces more than 72 bytes of code.
By default, the compiler generates code that calls a library routine for division. Therefore, when a program contains several divisions, the default setting gives the smaller code size.
However, if speed-priority optimization (-Kspeed) is specified, each division is expanded directly into div step instructions. If the resulting increase in code size is a problem, do not specify speed-priority optimization.

[C source]	[In case of speed priority]		[default]
a=b/c;	LDI:20	#_b, R12	LDI:20	#_b, R12
	LD	@R12, R0	LD	@R12, R4
	LDI:20	#_c, R12	LDI:20	#_c, R12
	LD	@R12, R1	LD	@R12, R5
	MOV	R0, MDL	CALL20	_divi, R12
	DIV0S	R1	LDI:20	#_a, R12
	DIV1	R1	ST	R4, @R12
	DIV1	R1
	DIV1	R1
	DIV1	R1
	DIV1	R1
	-----------------------		-----------------------
	74 bytes		20 bytes*

*: The divi function (78 bytes) is generated separately.
(Because the divi function is shared as a library routine, a reduction in code size can be expected when several divisions are executed.)
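
As a rough worked example using only the figures quoted here (74 bytes per directly expanded division, 20 bytes per library call site, and 78 bytes for the shared divi function), the total size for n divisions can be compared as below. The function names are hypothetical, and the figures apply only to this a=b/c example.

```c
/* Byte counts taken from the a=b/c listing in this section. */
#define INLINE_BYTES_PER_DIV 74u  /* -Kspeed: div step expanded in place */
#define CALL_BYTES_PER_DIV   20u  /* default: call to the divi routine   */
#define DIVI_ROUTINE_BYTES   78u  /* divi routine, generated only once   */

/* Total bytes when every division is expanded in place. */
unsigned int inline_total(unsigned int n)
{
    return n * INLINE_BYTES_PER_DIV;
}

/* Total bytes when every division calls the shared divi routine. */
unsigned int library_total(unsigned int n)
{
    if (n == 0u) {
        return 0u;  /* no division, so divi is not needed at all */
    }
    return n * CALL_BYTES_PER_DIV + DIVI_ROUTINE_BYTES;
}
```

With these figures, a single division is smaller when expanded directly (74 against 98 bytes), while from two divisions onward the library call is smaller (148 against 118 bytes).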

Adjust the number of local variables so that the stack use of a function does not exceed 512 bytes

LD/ST instructions can use FP-relative addressing. However, because of the 16-bit instruction length, the offset that can be specified is limited to -512 to +508 (for 4-byte types). Therefore, when a function uses a local variable area larger than 512 bytes, extra operations are needed to calculate the stack address, so code size grows and access efficiency drops.
Adjusting the number of local variables so that the stack use of a function does not exceed 512 bytes therefore reduces code size and improves access efficiency.

The stack use of each function can be checked with SOFTUNE C/C++ Analyzer.

(Note) When a local variable is a 2-byte or 1-byte type, the offset that can be specified is -256 to +254 or -128 to +127, respectively. Therefore the size of the local area for which efficient code can be generated differs by type.

[C source]	[Offset of -520]		[Offset of -4]
	(local area larger than the limit above)		(local area within the limit above)
a=10;	LDI	#10, R0	LDI	#10, R0
	LDI	#-520, R13	ST	R0, @(FP,-4)
	ST	R0, @(R13,FP)
	-------------------------		-------------------------
	8 bytes		4 bytes
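
The offset ranges quoted in this section can be written down as a small check, shown here as a sketch. The function name fp_offset_is_direct is hypothetical, and the requirement that the offset be a multiple of the access size is an assumption that the text does not state.

```c
/* Returns 1 when an FP-relative offset falls inside the range quoted
   in this section for the given access size (4, 2 or 1 bytes), and 0
   otherwise. Assumption: the offset must be a multiple of the access
   size. */
int fp_offset_is_direct(int offset, int access_size)
{
    switch (access_size) {
    case 4:
        return offset >= -512 && offset <= 508 && offset % 4 == 0;
    case 2:
        return offset >= -256 && offset <= 254 && offset % 2 == 0;
    case 1:
        return offset >= -128 && offset <= 127;
    default:
        return 0;  /* other access sizes are not covered here */
    }
}
```

For the two cases in the listing, fp_offset_is_direct(-4, 4) returns 1 and fp_offset_is_direct(-520, 4) returns 0.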

Avoid using a lot of signed 1-byte/2-byte data

The FR architecture has no load instruction for signed data. Therefore, when signed 1-byte/2-byte data is loaded, a sign extension is needed after the load. When a lot of signed 1-byte/2-byte data is used, code size is larger than with unsigned data.
Using unsigned types wherever possible therefore reduces code size and improves access efficiency.

(Note) The Softune compiler treats the char type as unsigned char, so char can be used as it is.

[C source]	[In case of signed char type]		[In case of char type]
a=b+c;	LDI:20	#_b, R12	LDI:20	#_b, R12
	LDUB	@R12, R0	LDUB	@R12, R0
	EXTSB	R0	LDI:20	#_c, R12
	LDI:20	#_c, R12	LDUB	@R12, R1
	LDUB	@R12, R1	ADD	R1, R0
	EXTSB	R1	LDI:20	#_a, R12
	ADD	R1, R0	STB	R0, @R12
	LDI:20	#_a, R12
	STB	R0, @R12
	----------------------		----------------------
	24 bytes		20 bytes
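
A minimal sketch of the same advice in C, assuming data that never goes negative. The names plain_char_is_unsigned and sum_samples are hypothetical. In standard C the signedness of plain char depends on the compiler, so the first function only reports what the compiler in use does.

```c
#include <limits.h>

/* Returns 1 when plain char is unsigned on the compiler in use
   (the note above says this is the case for the Softune compiler). */
int plain_char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}

/* Sums n samples stored as unsigned bytes. Each element is loaded as
   an unsigned value, so no sign extension is needed per element. */
unsigned int sum_samples(const unsigned char *samples, unsigned int n)
{
    unsigned int i;
    unsigned int sum = 0;

    for (i = 0; i < n; i++) {
        sum += samples[i];
    }
    return sum;
}
```

When values can be negative, a signed type is still required; the saving applies only to data whose range allows an unsigned type.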

Control of loop-unrolling optimization

Loop-unrolling optimization improves execution speed by reducing the number of loop iterations, but it increases object size.
How the code is written should be reviewed according to the aim: speed priority or code size priority.

[Before unrolling]
	for(i=0;i<6;i++){ a[i]=0;}
[After unrolling]
	for(i=0;i<6;i+=3){
		a[i]=0;
		a[i+1]=0;
		a[i+2]=0;
	}

Even when the source is written as in [Before unrolling], the compiler may unroll the loop if unrolling is not suppressed, and code size becomes larger. To have the compiler favor code size, specify size-priority optimization (-Ksize) or loop-unrolling suppression (-Knounroll).

[C source]

for(i=0;i<6;i++){a[i]=0;}

[Loop unrolling optimization]			[Unrolling suppressed]
	LDI:20	#_a, R6		LDI	#0, R4
	LDI	#0, R4	L_26:	LDI	#0, R0
	LDI	#2, R5		LDI:20	#_a, R13
L_32:	LDI	#0, R0		STB	R0, @(R13,R4)
	MOV	R4, R13		ADD	#1, R4
	STB	R0, @(R13,R6)		CMP	#6, R4
	MOV	R6, R0		BLT20	L_26, R12
	ADD	R4, R0		LDI	#0, R4
	LDI	#0, R1
	LDI	#1, R13
	STB	R1, @(R13,R0)
	MOV	R6, R0
	ADD	R4, R0
	LDI	#0, R1
	LDI	#2, R13
	STB	R1, @(R13,R0)
	ADD	#3, R4
	ADD	#-1, R5
	CMP	#1, R5
	BGE20	L_32, R12
	---------------------			---------------------
	42 bytes			18 bytes

Review of the necessity of inline expansion

Inline expansion optimization expands the body of a function defined in the C source at the call site instead of calling it. When the expanded function is very small, the code may become smaller after inline expansion, but in general object size increases.

When object size has priority, this optimization is not recommended.
(Do not use the -xauto option, the -x option, #pragma inline, or the inline qualifier (C++ only).)

[C source]
/* Saturating 16-bit addition: the result is clamped to 0xffff. */
unsigned short ADD_sat16(unsigned short a, unsigned short b){
	int tmp;
	if((tmp=a+b)>0xffff) return 0xffff;
	return (unsigned short)tmp;
}
unsigned short a,b,c,d,e,f;
void func(void){
	a=ADD_sat16(b,c);
	d=ADD_sat16(e,f);
}

[Inline expansion optimization]			[Inline expansion suppressed]
_func:	LDI:20	#_b, R12	_func:	ST	RP, @-SP
	LDUH	@R12, R4		LDI:20	#_b, R12
	LDI:20	#_c, R12		LDUH	@R12, R4
	LDUH	@R12, R5		LDI:20	#_c, R12
	ADD	R5, R4		LDUH	@R12, R5
	LDI	#65535, R0		CALL20	_ADD_sat16, R12
	CMP	R0, R4		LDI:20	#_a, R12
	BLE20	L_32, R12		STH	R4, @R12
	LDI	#65535, R4		LDI:20	#_e, R12
	BRA20	L_28, R12		LDUH	@R12, R4
L_32:	EXTUH	R4		LDI:20	#_f, R12
L_28:	LDI:20	#_a, R12		LDUH	@R12, R5
	STH	R4, @R12		CALL20	_ADD_sat16, R12
	LDI:20	#_e, R12		LDI:20	#_d, R12
	LDUH	@R12, R4		STH	R4, @R12
	LDI:20	#_f, R12		LD	@SP+, RP
	LDUH	@R12, R5		RET
	ADD	R5, R4
	LDI	#65535, R0
	CMP	R0, R4
	BLE20	L_36, R12
	LDI	#65535, R4
	BRA20	L_34, R12
L_36:	EXTUH	R4
L_34:	LDI:20	#_d, R12
	STH	R4, @R12
	RET
	-----------------------			-----------------------
	74 bytes			46 bytes

However, depending on the arguments, code size can be reduced by declaring a function with a small body as static or by specifying #pragma inline for it. Using the "inline candidate selection function" in Softune C Analyzer is recommended.
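
A minimal sketch of the static small-function case, assuming a hypothetical helper low_byte whose body is only a mask, and a hypothetical caller checksum2. Whether the compiler expands it, and whether the result is smaller than a call, depends on the options in use. #pragma inline is not shown, because its exact syntax should be taken from the compiler manual.

```c
/* Hypothetical helper with a very small body. Because it is static,
   it is visible only in this file, so the compiler need not keep an
   externally callable copy when every call has been expanded. */
static unsigned char low_byte(unsigned short value)
{
    return (unsigned char)(value & 0xffu);
}

/* Caller: each use of low_byte is a candidate for inline expansion. */
unsigned char checksum2(unsigned short x, unsigned short y)
{
    return (unsigned char)(low_byte(x) + low_byte(y));
}
```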

Control of standard library expansion

Standard library expansion recognizes what a standard function does and either expands it inline or replaces it with a faster standard function that performs the same operation. When object size has priority, do not use this optimization; specify standard library inline expansion suppression (-Knolib).

Others

Place frequently referenced structure members at the head.

A structure member is accessed at the address obtained by calculating head address + offset.
The head member needs no such calculation because its offset is 0. When a member has a high static access frequency, review whether it can be placed at the head.
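
As an illustration, assuming a hypothetical structure channel in which status is the most frequently accessed member, the member order could look like this. The function status_offset is also hypothetical and only confirms the offset.

```c
#include <stddef.h>

/* Hypothetical structure: status is assumed to be the member with the
   highest access frequency, so it is declared first (offset 0). */
struct channel {
    unsigned int  status;
    unsigned int  error_count;
    unsigned char name[16];
};

/* Offset of the frequently used member; 0 because it is the first member. */
size_t status_offset(void)
{
    return offsetof(struct channel, status);
}
```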

Make functions that return a structure void.

A function that returns a structure causes a structure transfer into a work area. By passing the address of the destination structure as an argument and assigning to it directly, the function can be made a void function.
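
The two styles can be sketched as follows, with the hypothetical names point, make_point_value and make_point. How much code the by-value version adds depends on the compiler.

```c
struct point {
    int x;
    int y;
};

/* Returns the structure by value: the result may first be built in a
   work area and then transferred to the destination. */
struct point make_point_value(int x, int y)
{
    struct point p;

    p.x = x;
    p.y = y;
    return p;
}

/* void version: the caller passes the address of the destination and
   the members are assigned directly. */
void make_point(struct point *dest, int x, int y)
{
    dest->x = x;
    dest->y = y;
}
```

A call such as `p = make_point_value(3, 4);` then becomes `make_point(&p, 3, 4);`.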

Use 4 arguments or fewer

With 4 arguments or fewer, the arguments are passed in registers, so no code for memory access is needed and execution speed improves. When a function has arguments that are passed needlessly, review whether they can be removed.
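
A minimal sketch of removing needless arguments, using the hypothetical functions outer_area5 and outer_area3. In the first version left and top are passed but never used, which brings the count to five; the second version drops them and stays within four.

```c
/* Five arguments, two of which (left, top) are never used. */
int outer_area5(int left, int top, int width, int height, int border)
{
    (void)left;  /* unused */
    (void)top;   /* unused */
    return (width + 2 * border) * (height + 2 * border);
}

/* Same calculation with the needless arguments removed: three
   arguments, within the limit of four. */
int outer_area3(int width, int height, int border)
{
    return (width + 2 * border) * (height + 2 * border);
}
```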