## Fixed Point Multiplication

Fixed point addition and subtraction are straightforward. Additions and subtractions are performed using integer operations. For example, if two 16 bit Q15 format numbers are added, the result is a Q15 number. But what about fixed about multiplication? What happens if two Q15 numbers are multiplied?

Let’s try an example. Take 0.5 multiplied by 0.25. In Q15 the number 0.5 is represented (in hexadecimal) as 0x8000 times 0.5 or 0x4000. Similarly, 0.25 is 0x2000. When we multiply these together, the product is 0x08000000. Obviously the result is not a Q15 number since the number of bits required is more than 16. The expected product, 0.125, is 0x1000 in Q15.

To see what is going on, define the following two Q15 numbers a and b:

where and are the integer representations of our numbers (0x4000 and 0x2000 in our example). The product of a and b is:

From the above, it can be seen that the product is a Q30 number. Going back to our example, 0x4000 times 0x2000 is 0x08000000, which is 0.125 times .

A general rule when multiplying a Qm format number by a Qn format number, is that the product will be a Q(m+n) number. The number of bits required to represent the product is at least (n+m) for unsigned multiplication and (n+m+1) for signed (twos complement) multiplication.

For the more general case of a Qa.b number times a Qc.d number, the product is Q(a+c).(b+d). The number of bits needed for the result is (a + b + c + d + 1) for signed numbers (and one less for unsigned numbers).

Consider the example of a Q16 unsigned multiplication between the two largest unsigned numbers that can be represented. The largest Q16 number is 65535/65536 = 0.9999847412109375. The product is 0xffff times 0xffff or 0xfffe0001. The result is a Q32 number requiring at least 32 bits. If we divide by then we get 0.99996948265470564365386962890625, the expected result.

There are a number of things that are done with the product of a multiplication, depending on the application. Some of the commonly seen options are:

- Convert the product to a different Q format.
- Use the product in the resulting Q format.
- Add the product to a running sum in an accumulator register.
- Convert the product to a different Q format, then add to a running sum.

Let’s look at some of these options for the case of signed multiplication using Q15 format numbers. For case 1, assume we want to multiply two Q15 numbers and get a Q15 result. The required operation is to take the Q30 product, and shift it right by 15 bits. The result can then be stored in 16 bits. There is also the option of rounding the product before shifting out the lower 15 bits (I may discuss rounding in a future post). Some CPU architectures are better set up to shift the product left by 1, and then store the upper 16 bits. This is almost exactly the same as shifting right by 15 bits and keeping the lower 16 bits.

Multiply-accumulate (MAC) operations are used a lot in many DSP algorithms. Many processors have one or more dedicated accumulator registers for this purpose (often with 32 or 40 bits). For the case of Q15 multiplies, each Q30 product can be summed to the accumulator.

I have seen a lot of code that shifts each product left by 1 when performing the MAC operations. Some DSP chips can do the left shift in hardware using a special mode of the ALU. In this case, the value in the accumulator is in Q31 format. Although very common, this method has a greater chance of overflow problems since each product is effectively two times bigger. I think this method became popular because certain older DSP chip architectures required the storing of the high 16 bits of the accumulator or product register, rather than having a single cycle instruction allowing a shift by 15 bits.

In summary, because multiplication operations are often a chief component of signal processing implementations, it is important to understand how they work. This is especially true for fixed point operations, where one must know the effect of multiplication on the format of the numbers themselves.

August 26, 2014 at 3:19 am

i am using MAC instruction in DSP and it is difficult to understand may you please explain below

w4 = 0x0055

w5 = 0xDA51

AccumulatorA = 0x00000000

after

MAC w4,w5,a

AccumulatorA = 0xFFFFFFE6F9CA

please explain with details.

August 27, 2014 at 10:15 am

Hi Naresh. Using 32 bits, 0x00000055 times 0xFFFFDA51 is equal to 0xFFF37CE5. If you multiply this result by 2, you will get 0xFFE6F9CA. So your DSP is shifting left by one after the multiplication. If your inputs are Q.15, then the result is Q.31. What part do you not understand?

December 2, 2014 at 5:42 pm

Hi Shawn,

quick question:

two signed Q0.15 multiplication.

#1: 0xFF00 = sign bit = negative & fraction part = (0x7f00/32768) = -0.9921875

#2:0xFE00 = sign bit = negative & fraction part = (0x7e00/32768) = -0.984375

0xFF00 * 0xFE00 = 0xFD020000 (Q30)

conversion to Q15 ==> 0xFA04

number = negative (0x7A04)/32768 = 0.953247.

Two questions:

1. why is my result negative?

2. answer seems to be wrong. correct answer being 0.9766845

December 3, 2014 at 10:20 am

Hi Jatin,

You have a misunderstanding about how two’s complement arithmetic works. Try searching online and reading up on it. It looks like you are trying to do calculations with the “sign magnitude” format.

And your multiplication result is wrong as well. I get:

0xFF00 * 0xFE00 = 0x20000

You can get that result by doing 32 bit multiplications. In the Microsoft Windows Calculator for example, you can set the word size to “Dword” and do:

0xFFFFFF00 * 0xFFFFFE00

Or you can convert both the numbers to positive numbers first:

0x100 * 0x200 = 0x20000

Which I get by doing:

0x10000 – 0xFF00 = 0x100

and

0x10000 – 0xFE00 = 0x200

Thanks for visiting the site!

Cheers,

Shawn

December 3, 2014 at 11:38 pm

Thanks Shawn!

I totally missed that my numbers are 16-bit. I did a 32-bit multiplication without sign-extension.

However, result = 0x20000 (Q30) ==> 0x0004 (Q15) = 0.00012207 which is not what I expected.

Did I convert my operands correctly below?

#1: 0xFF00 = sign bit = negative & fraction part = (0x7f00/32768) = -0.9921875

#2:0xFE00 = sign bit = negative & fraction part = (0x7e00/32768) = -0.984375

Jatin

December 4, 2014 at 10:02 am

You did not convert the operands correctly. The 16 bit integer 0xFF00 is -256, and 0xFE00 is -512, leading to -0.0078125 * -0.015625 = 0.0001220703125. Check out this page for how to use two’s complement:

http://en.wikipedia.org/wiki/Two%27s_complement

December 18, 2014 at 9:16 am

Excellent information Shawn.

Regards

Fakhre Alam

January 5, 2017 at 11:02 pm

Hi Shawn,

Thank You for your explanation on fixed point multiplication. This is really helpful ! Its easy to understand.

But I have a doubt. When we multiply two Q.15(15 fractional bits + 1 signed bits = 16-bits) inputs, the answer will be Q.30 or Q.31?

For ex: if we have two Q.15 inputs as

0xFF00 (-0.0078125)

0xFE00 (-0.015625)

-0.0078125 * -0.015625 = 0.0001220703125.

Now if I convert 0.0001220703125 to Q.31(31 fractional bits + 1 sign bit = 32-bits) format I would get,

0.0001220703125 * 2^31 = 262144 = 0x40000.

Instead if I use Q.15 * Q.15 = Q.30 (30 fractional bits + 2 signed bits = 32-bits) then I get

0.0001220703125 * 2^30 = 131072 = 0x20000.

It is the correct answer per your above post.

So is the logic for Q.15 * Q.15 = Q.30 (all signed nos) correct?

Here we have 2 signed bits instead of 1.

Correct me if I am wrong.

Regards,

Rohit

January 7, 2017 at 11:58 am

Hi Rohit,

Thank you for your interest in my tutorials. When two Q.15 values are multiplied, the result will be Q.30. So if you were using a 32 bit integer for the result, there would in fact be two bit positions for the sign. I have seen hardware that automatically shifts the multiplication product left by one bit (sometimes this is a configuration option), in which case the result would be Q.31. I prefer to keep things in Q.30 if I can, since this doesn’t waste the least significant bit in the integer. Let me know if you need further clarification.

Thanks,

Shawn

January 9, 2017 at 12:07 am

Hi Shawn,

Thanks for your quick explanation.

I also read your next blog on “Overflow Handling in Fixed Point Computations”.

I want to understand how to convert Q.15 * Q.15 = Q1.30 result to overcome the overflow condition.

The special case where we multiply -1 * -1 = -1 (ans in Q.30).

But as you suggested we would get +1 (in Q1.30).

Can you give an example on how to get Q.15 * Q.15 = Q1.30 (or any other example like add Q.15 + Q.15) result to overcome the overflow problem?

Regards,

Rohit

January 9, 2017 at 10:27 am

Hi Rohit,

There are a number of things that can be done to deal with overflow. One is to use built in saturation of the hardware. Another is to explicitly check the size of the number in software. If you multiply two Q.15 numbers and get a Q1.30 result, there is no overflow. But if you want to convert the result to Q.31 or Q.15 for example, there is a potential problem. Let’s say you have multiplied Q.15 values of -1 * -1 to get 0x40000000. That result fits in a 32 bit register and is the expected positive value to represent +1 in Q1.30. Now, if you wanted to convert that number to Q.31, you would need to shift left by 1, at which point +1 would become -1 (that is 0x80000000). If your ALU saturates to 0x7fffffff automatically, then the sign of the result will be maintained correctly. Otherwise, you could consider checking for the particular value of 0x40000000 before shifting left by one, and treat that as a special case.

if (value == 0x40000000L)

{

output = 0x7fffffffL;

}

else

{

output = (int32_t)value << 1;

}

Now if your registers are more than 32 bits, then you will have more room for representing larger numbers. For example, it is common to have processors with 40 bit "accumulator" registers. The best way to deal with overflow is to avoid saturating the result until it is necessary, since the result (for example an accumulated sum) can get smaller as the calculation proceeds.

I hope that clears things up a little bit.

Cheers,

Shawn

January 10, 2017 at 2:14 am

Hi Shawn,

Thank You for your explanations to all my questions.

Now I am able to understand the technique to overcome the overflow problems.

I am working on a DSP related project in an FPGA, where I need to use max. 32-bits register due to resource constraint. We have designed a 1024-point FFT module for spectrum analysis and are facing this overflow problem as we need to truncate results to 32-bits.

The results after many multiplication and addition goes way beyond the range and so we need to introduce some scaling factor.

Can you provide some information/links on how to do scaling as per results obtained?

Sorry If I am asking you too many questions. Really appreciate your effort in spending your precious time in responding to all my queries.

Regards,

Rohit

January 10, 2017 at 10:30 am

Hi Rohit,

I’m somewhat familiar with scaling problems with fixed point FFTs, but have not implemented an FFT algorithm myself. Take a look at this link on “stage scaling”:

https://www.dsprelated.com/showthread/comp.dsp/23865-1.php

The solution is that the output of each stage of the FFT must be scaled down by some factor. The downside is that small inputs can become overly tiny. To overcome that problem, the input can be scaled to a larger value, and the amount of scaling (a shift value for example) can be saved and applied later. Kind of like having an exponent. Good luck.

Cheers,

Shawn

January 15, 2017 at 11:33 pm

Hi Shawn,

Thanks for your quick reply.

I understand I need to be careful when multiplying or adding two different Q format nos.

When the input is high we can discard the fractional bits. But as you mentioned small inputs need to be scaled to a larger value using exponent.

Thank You once again for explaining fixed point arithmetic in such a

simple manner. This concept is really helping me in my project.

Regards,

Rohit

June 30, 2017 at 3:25 am

” […] For example, if two 16 bit Q15 format numbers are added, the result is a Q15 number. […] ”

How can this be correct? The result would be a Q16-number. Both addends need to be extended by their MSB to avoid information loss.

Example with Q3’s:

7 + 4 =

0111 + 0100 = 1011 = {-5}_10

Again:

0111 + 0100

00111 + 00100 = 01011 = {11}_10

June 30, 2017 at 10:41 pm

Hi Alex. Your example shows the effect of overflow, which is something you need to watch out any time you are doing addition or subtraction with integer operations. See my page on overflow (https://sestevenson.wordpress.com/overflow-handling-in-fixed-point-computations/). In your example, you are adding the Q.3 numbers 0.875 and 0.5, which sum to 1.375. (Note that 1.375 * 8 = 11, which is the result you show in your second example.)

So you need a Q1.3 number to represent the sum, which requires at least 5 bits for signed numbers, but could be done with 4 bits using unsigned integer math. The same is true with Q.15 numbers. If the sum is greater than 32767/32768 then you need more than 16 bits to store the result, if using signed integers. The result would not be Q.16, but Q1.15. The least significant bit still has the weight 2^(-15).

I hope that clarifies things a bit.

Cheers,

Shawn

July 3, 2017 at 6:21 am

Hi Shawn, my bad. I mixed up the decimal interpretation Q-number of the binary numbers with their integer representation.

Correction:

0.875 + 0.5 = 1.375

Example with Q3’s:

0111 + 0100 = 1011 = -0.625

Again with Q1.3:

00111 + 00100 = 01011 = +1.011 = 1.375

Still I find your the first text passage in this article very misleading, because you also have to take a possible overflow into account, no?

Also I want to say to say that I like your tutorial page in general and appreciate your efforts.

July 7, 2017 at 2:14 am

Hi Shawn,

Thanks for your explanation on fixed point multiplication.

I have a question :

Say I have two 16 bit number S.8.7 and S,9.6 formats. And I need to store the data in 16 bit itself (consider S.7.8). In this case how will the multiplication take place. the output format considers only 7bit for integer and has more bits for fraction unlike input formats.

July 7, 2017 at 4:15 pm

Hi Omkar,

Thanks for checking out my blog.

If you multiply S8.7 and S9.6 numbers, then the integer part alone will require 17 bits. So with a 16 bit integer to store the result, you would be short one bit, plus you would need more bits for the sign. If it is okay to lose the upper 10 integer bits, then you could multiply to get Q17.13 (stored in 32 bits), shift right by 5 bits, and then keep the lower 16 bits. Two things to keep in mind: #1 The right shift should maintain the sign of the number by shifting in ones if the number is negative. #2 If the upper 17 bits (of the 32 bit shifted number) are not all zeros or all ones, then overflow has occurred and the result will not fit in 16 bits.

I hope you understand what I mean. Let me know if you don’t.

Cheers,

Shawn