(ARM).ARM6 in DSP applications.Use of the MUL instruction
.pdfApplication Note 19
ARM6 in DSP Applications :
Use of the MUL Instruction
Document Number: ARM DAI 0019D
Issued: December 1994
Copyright Advanced RISC Machines Ltd (ARM) 1994
All rights reserved
ARM
Advanced RISC Machines
Proprietary Notice
ARM, the ARM Powered logo, BlackICE and ICEbreaker are trademarks of Advanced RISC Machines Ltd.
Neither the whole nor any part of the information contained in, or the product described in, this application note may be adapted or reproduced in any material form except with the prior written permission of the copyright holder.
The product described in this application note is subject to continuous developments and improvements. All particulars of the product and its use contained in this application note are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties or merchantability, or fitness for purpose, are excluded.
This application note is intended only to assist the reader in the use of the product. ARM Ltd shall not be liable for any loss or damage arising from the use of any information in this application note, or any error or omission in such information, or any incorrect use of the product.
Change Log
Issue |
Date |
By |
Change |
A |
Jan. 93 |
IR |
Doc creation in Frame |
B |
Jul. 93 |
IR |
Formally signed off |
C |
Nov. 94 |
AW |
Change of ARM Ltd address |
D |
Dec. 94 |
AW |
Reformatted into new style |
ii
Application Note 19
ARM DAI 0019D
ARM6 in DSP Applications : Use of the MUL Instruction
1 Introduction
The performance of the MUL instruction is important for many DSP applications where multiply is used extensively (eg. digital filters, correlation, Fast Fourier Transforms etc.)
The ARM6 family of RISC processors is based on a core architecture which has features that can be used for DSP applications. One of these architectural features is a hardware Booth’s Multiplier which performs multiply operations on data. The design of the multiplier is based on an algorithm to reduce the number of stages needed to perform a multiplication.
This document explains how to predict the timing of a MUL instruction, and suggests some ideas to improve the performance of speed-critical programs, when using the ARM6 family of processors from Advanced RISC Machines (ARM).
2 Overview
The multiply instructions have the following syntax:
MUL |
Rd, |
Rm, |
Rs |
MLA |
Rd, |
Rm, |
Rs, Ra |
|
|
|
^^ |
The MUL calculates the least-significant 32 bits of the 64-bit product of Rm and Rs, and places the result in Rd (the destination register). MLA specifies an additional register which is added to the product (with zero overhead) before the result is stored in Rd.
The speed of the multiply depends on the value in Rs. It is important to place the smallest value in Rs so that the multiply takes the least number of cycles. See Appendix A for a detailed explanation of Booth’s multiplication.
The following table shows how the speed of the multiply depends on the value in Rs:
Min |
Max |
Speed |
|
|
|
0 |
1 |
1S + 1I |
|
|
|
2 |
7 |
1S + 2I |
|
|
|
8 |
31 |
1S + 3I |
|
|
|
32 |
127 |
1S + 4I |
|
|
|
128 |
511 |
1S + 5I |
|
|
|
512 |
2047 |
1S + 6I |
|
|
|
2048 |
8191 |
1S + 7I |
|
|
|
8192 |
32767 |
1S + 8I |
|
|
|
Table 2-1: Multiply Speed Variations with Differing values of Rs
|
|
Application Note 19 |
1 |
|
|
ARM DAI 0019D |
|
|
|
|
|
ARM6 in DSP Applications : Use of the MUL Instruction
Min |
Max |
Speed |
|
|
|
|
|
32768 |
131071 |
1S |
+ 9I |
|
|
|
|
131072 |
524287 |
1S |
+ 10I |
|
|
|
|
524288 |
2097151 |
1S |
+ 11I |
|
|
|
|
2097152 |
8388607 |
1S |
+ 12I |
|
|
|
|
8388608 |
33554431 |
1S |
+ 13I |
|
|
|
|
33554432 |
134217727 |
1S |
+ 14I |
|
|
|
|
134217728 |
536870911 |
1S |
+ 15I |
|
|
|
|
536870912 |
2147483648 |
1S |
+ 16I |
|
|
|
|
Table 2-1: Multiply Speed Variations with Differing values of Rs
The general formula is that multiplication by values between 2(2n-3) and 2(2n-1)-1 inclusive takes 1S + nI-cycles (n>1, multiplication by 0 or 1 takes 1S + 1I-cycle).
The cycle timings used throughout this document refer to the cycle types as seen by the ARM6 core. It is important to convert these numbers into actual times within your system in order to correctly estimate the performance. For an ARM6 or ARM60, this document specifies the precise cycle types visible in your system. The cached processors (ARM600 and ARM610) contain the ARM6 macrocell linked to a cache, so both S- and I-cycles may run at the fast core clock (FCLK) rate since they are internal to the processor. In this case, both S-cycles and I-cycles are F-cycles, so simply add together the number of S- and I-cycles to calculate the number of processor cycles.
This document explains several methods which may be used to improve the speed of multiplication:
•Place smallest operand in Rs
•Ensure that Rs is positive so that early termination occurs
•If one operand is a constant, use the multiply-by-constant technique
3 Overflow issues
It is common to use the current multiply instruction in, for instance a 16-bit x 16-bit -> 32-bit mode to avoid overflow. For DSP work, a 32-bit result is all that is required, but it is important to ensure that overflow does not cause an incorrect answer.
This situation can be generalised, as the number of result bits is just the sum of the operand bits. Thus, the MUL can perform 16-bit x 16-bit -> 32-bit, 8-bit x 24-bit -> 32-bit etc. all without overflow. For MLA, the total is accumulated (overflow of the total must be avoided). Hence, MLA would be used as, for example 12x12->24, leaving 8 bits to accumulate up to 256 values without the possibility of overflow.
2 |
Application Note 19 |
|
|
|
ARM DAI 0019D |
|
|
|
|
|
|
|
|
|
|
ARM6 in DSP Applications : Use of the MUL Instruction
4 Use the smallest operand as Rs
In practice it is possible to arrange for a worst case which is at most 1S + 9I-cycles for unsigned operands (by putting the operand with the least number of bits into Rs, so that Rs <= 16bits). In most cases it is known that one of the operands is at most 16bits, as the sum of the number of operand bit must be no greater than 32.
5 Negative operands
The multiplier yields correct results for negative operand values, since it calculates the low 32-bits of the full 64-bit product. For positive values of Rs, a 16x16->32 MUL takes at most 1S + 9I-cycles (the average should be better than this). But, the MUL always takes 1S + 16I-cycles if Rs is negative. Early termination cannot take place because the top bit of Rs is a 1, so the Booth's multiplier register never contains all 0s (the maximum-of-16 limit is reached instead). A negative value for Rs is considered to be a very large positive number from the point of view of the Booth’s multiplier.
The example below shows how to guard the multiply instruction, but this does not really improve things unless Rs is very small so that the gain exceeds the (3- instruction) overhead.
; |
|
Cycles |
|
CMP |
Rs, #0 |
; 1S |
|
RSBMI |
Rs, Rs, #0 |
; 1S |
make Rs positive |
MUL |
Rd, Rm, Rs |
; 1S+nI |
|
RSBMI |
Rd, Rd, #0 |
; 1S |
make result negative if Rs |
|
|
; |
was negative |
It is sometimes possible to do away with the CMP by incorporating this into another instruction.
In the special case when squaring, the result does not need to be negated after the multiplication as it will always be positive (thus the second RSB instruction can be eliminated).
For example, consider this critical bit of code which uses MUL to square the difference between pairs of signed 5-bit input values. This demonstrates the importance of ensuring that Rs is positive, as the worst-case performance is improved to just 1S + 3I-cycles for the MUL.
; on entry, |
r8 & r9 contain packed binary data |
||||
; |
|
|
Cycles |
|
|
AND |
r1, r8, #31 ; |
1S |
extract 5-bit field of interest |
||
AND |
r2, r9, #31 ; |
1S |
extract 5-bit field of interest |
||
SUBS |
r1, r1, r2 |
; |
1S |
|
|
RSBMI |
r1, r1, #0 |
; |
1S |
|
|
MUL |
r0, r1, r1 |
; |
1S+nI |
n<=3 |
; on exit, r0 contains result
As you can see, it has been arranged to cost only 1 S-cycle (for the RSBMI instruction) to ensure the multiply is fast.
|
|
Application Note 19 |
3 |
|
|
ARM DAI 0019D |
|
|
|
|
|
ARM6 in DSP Applications : Use of the MUL Instruction
The issue of negating the value in Rs is more complicated if MLA is used, as it is not possible to negate the product before the accumulate. There are two possible solutions to this:
1 Negating the total, then negating it again after the MLA
; |
|
|
Cycles |
CMP |
Rs, #0 |
|
; 1S |
RSBMI |
Rs, Rs, |
#0 |
; 1S |
MLA |
Ra, Rm, |
Rs, Ra |
; 1S+nI |
RSBMI |
Ra, Ra, |
#0 |
; 1S |
; side-effect: Rs is |
now positive |
|
2 Negating the other operand (Rm)
; |
|
Cycles |
|
CMP |
Rs, #0 |
; 1S |
|
RSBMI |
Rs, Rs, #0 |
; 1S |
|
RSBMI |
Rm, Rm, #0 |
; 1S |
negate Rm to |
|
|
; make product correct |
|
MLA |
Ra, Rm, Rs, Ra |
; 1S+nI |
|
;side-effect: Rs is now positive
;side-effect: Rm may have been negated
Method 2 uses fewer instructions which is preferable (although it has more sideeffects). It is possible to incorporate the CMP #0 into a previous instruction as with the MUL example above.
It is questionable whether these techniques are useful in general for MUL or MLA, as there must be a performance increase on average to make it worth the effort. However, these methods can help for some specific applications, for example, squaring small signed values where the overhead can be reduced and the saving is greatest.
6 Multiplication by constant
This final section shows you:
•how to construct a sequence of ARM instructions to multiply by a constant
•how to create ARM assembly output from the C-compiler (using 'armcc -S')
In many cases where a multiply is used, one of the values is a constant (eg. weeks*7). In some applications (eg. 8-point DCT algorithm by Loeffler, Ligtenberg & Moschytz, ICASSP '89 p988 used for JPEG coding) multiplication-by-constant is used extensively.
A programmer might assume that the only way to calculate weeks*7 would be to use the MUL instruction. However, it is possible to express the multiplication as a sequence of arithmetic instructions, which in many circumstances can be faster.
When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts which have the same effect. For instance, multiply by 5 could be achieved using a single instruction:
; this instruction takes 1S-cycle
4 |
Application Note 19 |
|
|
|
ARM DAI 0019D |
|
|
|
|
|
|
|
|
|
|
ARM6 in DSP Applications : Use of the MUL Instruction
ADD |
Rd, Rm, Rm, LSL #2 |
; Rd = Rm + (Rm * 4) = Rm * 5 |
|
Compare this with the MUL version below: |
|
||
; this sequence takes a total of |
1S+2I-cycles |
||
MOV |
Rs, |
#5 |
|
MUL |
Rd, |
Rm, Rs |
|
The cost of the general multiply includes the instructions needed to load the constant into a register (up to 4 may be needed, or an LDR from a literal pool) as well as the multiply itself. In some cases, such as multiply-by-5 above, the multiply-by-constant uses fewer instructions as well!
7 The problem of finding the optimum sequence
The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each.
Consider multiply by 105. This could be achieved by decomposing as shown below:
105 == 128 - 13
==128 - 16 + 3
==128 - 16 + (2 + 1)
;this sequence takes 3S-cycles
ADD |
Rd, Rm, |
Rm, LSL |
#1 ; |
Rd = |
Rm + |
Rm *2 |
|
||||
SUB |
Rd, |
Rd, |
Rm, |
LSL |
#4 |
; |
Rd |
= |
Rm*3 |
- Rm*16 |
|
ADD |
Rd, |
Rd, |
Rm, |
LSL |
#7 |
; |
Rd |
= |
Rm*(-13) + |
Rm*128 |
Or, decomposing differently:
105 == 15 * 7
==(16 - 1) * (8 - 1)
;this sequence takes 2S-cycles
;uses temporary register Rt to store intermediate value
RSB |
Rt, |
Rm, |
Rm, |
LSL |
#4 |
; |
Rt |
= |
-Rm + Rm*16 = Rm*15 |
RSB |
Rd, |
Rt, |
Rt, |
LSL |
#3 |
; |
Rd |
= |
Rt*7 = Rm*105 |
The second method is the optimal solution which is not difficult to find by hand for small values such as 105. However, the problem of finding the optimum becomes much more difficult for larger constant values. There are no known algorithms which solve this problem quickly.
Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a very large constant, more than one temporary register may be needed to find the shortest sequence.
Looking carefully at the 2-instruction optimal code sequence, it is can be seen that Rt could be the same register as Rd. Rd is usually the first-choice for a temporary register since it will be used to store the final result in the last instruction of the sequence.
|
|
Application Note 19 |
5 |
|
|
ARM DAI 0019D |
|
|
|
|
|
ARM6 in DSP Applications : Use of the MUL Instruction
8 Experimenting with ARMCC assembly output
When writing a speed-critical ARM assembler program, it is a good idea to code it in C first (to check the algorithm) before converting it to handtuned assembler. It is helpful to see the ARM code which the compiler generates as a starting point for your work. armcc -S will generate an assembly file instead of an object file. For example, consider the following simple C code:
int mulby105( int num )
{
return num * 105;
}
Compile this using:
armcc -S mulby105.c
Now, examine the file "mulby105.s" which has been created:
; generated by Norcroft ARM C vsn 4.41 [Aug 26 1992] AREA |C$$code|, CODE, READONLY
|x$codeseg|
EXPORT |
mulby105 |
mulby105 |
|
RSB |
a1,a1,a1,LSL #4 |
RSB |
a1,a1,a1,LSL #3 |
MOV |
pc,lr |
AREA |C$$data|,DATA |x$dataseg|
END
Notice that the compiler has found the short multiply-by-constant sequence.
The C-compiler restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N. This is to avoid the sequence becoming too long.
9 Discussion of speed improvement
To evaluate the speed gains from using multiply-by-constant, consider multiplying by 11585 as an example.
A normal multiply consists of:
;first load the constant 11585 (0x2D41 in hex) into Rs
;these 2 instructions take 1S-cycle each
MOV |
Rs, #0x2D |
<< 8 |
||
ORR |
Rs, |
Rs, |
#0x41 |
|
; now do the multiply, |
which takes 1S+8I-cycles for Rs=11585 |
|||
MUL |
Rd, |
Rm, |
Rs |
; do the multiply |
6 |
Application Note 19 |
|
|
|
ARM DAI 0019D |
|
|
|
|
|
|
|
|
|
|
ARM6 in DSP Applications : Use of the MUL Instruction
The load-a-constant stage may take up to four 4 instructions (in this case 2) or an LDR, and the multiply takes 1 instruction fetch plus a number of internal cycles to calculate the multiply.
The optimal multiply-by-constant sequence consists of:
11585 == ((Rm*3)*15)*256 + Rm + Rm*64
; this sequence takes 4S-cycles
ADD Rd, Rm, Rm, LSL #1 ; Rd = Rm*3
RSB Rd, Rd, Rd, LSL #4 ; Rd = Rd*15 = Rm*45
ADD Rd, Rm, Rd, LSL #8 ; Rd = Rm + Rd*256 = Rm*11521 ADD Rd, Rd, Rm, LSL #6 ; Rd = Rd + Rm*64 = Rm*11585
This is just 4 data processing instructions.
Method |
Cycles |
|
|
Normal multiply: |
3 instructions + MUL internal cycles |
|
|
Multiply-by-constant: |
4 instructions |
|
|
Table 9-2: Comparison between MUL and multiply-by-constan
With slow memory systems and non-cached processors, I-cycles can be much faster than other cycles because they are internal to the ARM core. This means that the general multiply can sometimes be the fastest option (for large constants where an efficient solution cannot be found). It should also use less memory. If the load-a- constant stage could be moved outside a loop, the balance would tip further in favour of the general multiply as there is only the MUL to execute.
Method |
Cycles on ARM60 |
Cycles on ARM610 |
|
|
|
Normal multiply: |
3S + 8I |
11F |
|
|
|
Multiply-by-constant: |
4S |
4F |
|
|
|
Table 9-3: MUL Cycle Timing for ARM6 Processors
10 Conclusions
This application note explains how the speed of the multiply depends on the value in Rs. It is important to ensure that the smallest positive number is used for Rs so that MUL is fastest.
If MUL is used to generate a single result of at most 32-bits, it is usually possible to ensure that Rs contains at most 16 bits so that the MUL takes at most 1S + 9I-cycles, providing that the operands are unsigned.
|
|
Application Note 19 |
7 |
|
|
ARM DAI 0019D |
|
|
|
|
|
ARM6 in DSP Applications : Use of the MUL Instruction
MUL can be used with negative operands; the result is the bottom 32-bits of the 64-bit product as expected. If Rs is negative, the MUL will take the full 1S + 16I-cycles as early termination does not occur. It is possible to ensure that Rs is always positive by putting the negative operand in Rm, or by negating Rs to make it positive.
When multiplying by a constant, a technique can be used to decompose the multiply into a sequence of arithmetic instruction which achieves the same as a MUL. The performance of this method depends on the precise value of the constant (smaller values are faster). The sequence of arithmetic operations is likely to be faster than the equivalent MUL, but it depends on the memory system and variant of ARM processor used.
This document can offer no universal solution to speed up multiplication. It has been shown that there are several techniques which can be applied, but the speed gains are always depend on the precise values involved (which are application-specific). Even then, it is not always clear which method will be faster, as the actual timings depend on whether the processor is cached or uncached, and on the performance of the memory system.
8 |
Application Note 19 |
|
|
|
ARM DAI 0019D |
|
|
|
|
|
|
|
|
|
|