Given a fixed-point number, there is usually no way to identify which Q format it is in unless you have some further information. For example, if you know the number represents the rational value 0.5, you can check whether x/0.5 is a power of two. Say the (integer) number x is 2048. Then x/0.5 equals 4096, and log2(4096) = log10(4096)/log10(2) = 12, so the number would be in Q.12.
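The check described above can be sketched in C. This is only an illustration: the function name `infer_frac_bits` is made up for this example, and it assumes the known real value is exactly representable as a double (as 0.5 is), so that the ratio comes out as an exact power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given a raw fixed-point integer and the real value it
 * is known to represent, return the number of fractional bits n such that
 * raw / real_value == 2^n. Returns -1 if the ratio is not a power of two
 * between 2^0 and 2^62. */
int infer_frac_bits(int64_t raw, double real_value)
{
    assert(real_value != 0.0);
    double ratio = (double)raw / real_value;   /* e.g. 2048 / 0.5 = 4096 */
    for (int n = 0; n <= 62; ++n) {
        /* Powers of two are exact in a double, so == is safe here. */
        if (ratio == (double)(INT64_C(1) << n)) {
            return n;
        }
    }
    return -1;
}
```

With x = 2048 and a known value of 0.5, this returns 12, matching the Q.12 result worked out above. Without the known value there is nothing to compare against, which is why the raw number alone cannot tell you the format.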

If you are looking at a standard DSP library, for example the CMSIS library for ARM Cortex-M CPUs, you will often see types defined in the code (such as q15_t) which can give you a clue.

If you have written the code yourself, then you are the one deciding what the Q format is, and you need to figure out how the Q format changes after addition, multiplication, etc.
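As a rough sketch of how the format changes, assuming values are stored in plain integers and the programmer tracks the number of fractional bits by hand (the helper names `add_mixed_q` and `mul_mixed_q` are made up for illustration): addition needs the binary points aligned first, while multiplication simply adds the fractional bit counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add a value with a_frac fractional bits to a value
 * with b_frac fractional bits (a_frac >= b_frac). b is scaled up so the
 * binary points line up. The result has a_frac fractional bits and is
 * returned in 64 bits so the extra integer bits cannot overflow. */
int64_t add_mixed_q(int32_t a, int a_frac, int32_t b, int b_frac)
{
    assert(a_frac >= b_frac && a_frac - b_frac < 31);
    int64_t b_aligned = (int64_t)b * ((int64_t)1 << (a_frac - b_frac));
    return (int64_t)a + b_aligned;
}

/* Hypothetical helper: multiply two fixed-point values. No alignment is
 * needed; the result has (a_frac + b_frac) fractional bits. */
int64_t mul_mixed_q(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}
```

For example, 0.5 in Q.15 is 16384 and 0.25 in Q.12 is 1024. Adding them with `add_mixed_q(16384, 15, 1024, 12)` gives 24576, which is 0.75 with 15 fractional bits. Multiplying them gives 16777216, which is 0.125 with 27 fractional bits.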

Does this answer your question? If not, can you be more specific about what you are asking?

Cheers,

Shawn

I am new to DSP. Given a fixed-point number (in hex), can you please tell me how to identify which Q format it is in?

Thanks in advance.

Thank you for the quick reply. Here I kept 25 bits and truncated the 15 LSBs. It is working.

Cheers,

Ravi.

The number of bits to keep depends upon your particular application, and how much precision is needed in subsequent operations. You need to analyze your algorithm to see how small the numbers need to be. Maybe it makes sense to keep 15 or 25 bits? I hope that helps.
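To make the trade-off concrete, here is a minimal sketch of the Q.25 by Q.15 multiply discussed in this thread, with the number of discarded LSBs as a parameter. The function name is made up for this example, and it assumes the compiler implements right shift of a negative signed value as an arithmetic shift (which is implementation-defined in C, though nearly universal).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: multiply a Q.25 value by a Q.15 value and drop
 * drop_bits LSBs from the product. The full product has 40 fractional
 * bits, so the result has (40 - drop_bits) fractional bits.
 * Assumes >> on a negative int64_t is an arithmetic shift. */
int64_t mul_q25_q15_truncate(int32_t a_q25, int16_t b_q15, int drop_bits)
{
    assert(drop_bits >= 0 && drop_bits <= 40);
    int64_t product = (int64_t)a_q25 * (int64_t)b_q15;  /* 40 fractional bits */
    return product >> drop_bits;
}
```

With 0.5 in Q.25 (16777216) times 0.5 in Q.15 (16384) and 15 bits dropped, the result is 8388608, which is 0.25 with 25 fractional bits. With the smallest Q.25 input (1) times 0.5, the same truncation gives 0: that is the kind of precision loss you need to check for when deciding how many bits to keep.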

-Shawn

Thanks for helping me understand overflow in different arithmetic operations. If I want to multiply Q.25 and Q.15, then my result will be Q2.40. Please explain the maximum number of LSBs I am allowed to truncate.

As I understand it, I could keep the full bit precision, but that is not possible because of my limited resources.

Cheers,

Ravi.

Thanks for your quick reply.

I understand I need to be careful when multiplying or adding two numbers in different Q formats.

When the input is high we can discard the fractional bits. But as you mentioned, small inputs need to be scaled to a larger value using an exponent.

Thank you once again for explaining fixed point arithmetic in such a simple manner. This concept is really helping me in my project.

Regards,

Rohit

I’m somewhat familiar with scaling problems with fixed point FFTs, but have not implemented an FFT algorithm myself. Take a look at this link on “stage scaling”:

https://www.dsprelated.com/showthread/comp.dsp/23865-1.php

The solution is that the output of each stage of the FFT must be scaled down by some factor. The downside is that small inputs can become overly tiny. To overcome that problem, the input can be scaled to a larger value, and the amount of scaling (a shift value for example) can be saved and applied later. Kind of like having an exponent. Good luck.
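The input scaling idea can be sketched roughly as follows. This is not taken from any particular FFT library; `normalize_block` and its `target_bits` parameter are made up for illustration. It shifts a block of samples left as far as possible while keeping every magnitude below 2^target_bits, and returns the shift so it can be undone (or combined with the per-stage scaling) later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale a block of samples up in place so that the
 * largest magnitude stays below 2^target_bits, and return the left shift
 * that was applied (the saved "exponent"). Returns 0 if the block is all
 * zeros or already at or above the limit. */
int normalize_block(int32_t *x, size_t n, int target_bits)
{
    assert(x != NULL && target_bits > 0 && target_bits <= 30);

    /* Find the largest magnitude in the block. */
    int64_t peak = 0;
    for (size_t i = 0; i < n; ++i) {
        int64_t mag = x[i] < 0 ? -(int64_t)x[i] : (int64_t)x[i];
        if (mag > peak) {
            peak = mag;
        }
    }

    /* Find the largest shift that keeps the peak below the limit. */
    int64_t limit = (int64_t)1 << target_bits;
    int shift = 0;
    while (peak != 0 && (peak << (shift + 1)) < limit) {
        ++shift;
    }

    /* Apply the same shift to every sample (multiply avoids shifting
     * negative values, which is undefined behavior in C). */
    for (size_t i = 0; i < n; ++i) {
        x[i] = (int32_t)((int64_t)x[i] * ((int64_t)1 << shift));
    }
    return shift;
}
```

For the block {3, -5, 1} and target_bits = 15, the returned shift is 12 and the block becomes {12288, -20480, 4096}. After the FFT, the true result is the computed output scaled by 2^(total stage shifts - 12), so the saved shift plays the role of a shared exponent for the whole block.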

Cheers,

Shawn

Thank you for your explanations to all my questions.

Now I am able to understand the technique to overcome the overflow problems.

I am working on a DSP-related project in an FPGA, where I can use registers of at most 32 bits due to resource constraints. We have designed a 1024-point FFT module for spectrum analysis and are facing this overflow problem, as we need to truncate results to 32 bits.

The results after many multiplications and additions go way beyond the range, so we need to introduce some scaling factor.

Can you provide some information/links on how to do scaling as per results obtained?

Sorry if I am asking you too many questions. I really appreciate the effort and time you are spending in responding to all my queries.

Regards,

Rohit

There are a number of things that can be done to deal with overflow. One is to use the built-in saturation of the hardware. Another is to explicitly check the size of the number in software. If you multiply two Q.15 numbers and get a Q1.30 result, there is no overflow. But if you want to convert the result to Q.31 or Q.15, for example, there is a potential problem. Let’s say you have multiplied Q.15 values of -1 * -1 to get 0x40000000. That result fits in a 32-bit register and is the expected positive value to represent +1 in Q1.30. Now, if you wanted to convert that number to Q.31, you would need to shift left by 1, at which point +1 would become -1 (that is, 0x80000000). If your ALU saturates to 0x7fffffff automatically, then the sign of the result will be maintained correctly. Otherwise, you could check for the particular value of 0x40000000 before shifting left by one and treat it as a special case, since it is the only product of two Q.15 values that overflows on this shift.

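if (value == 0x40000000L)
{
    /* -1 * -1 in Q.15: shifting left would wrap to -1, so saturate */
    output = 0x7fffffffL;
}
else
{
    /* shift as unsigned to avoid undefined behavior for negative values */
    output = (int32_t)((uint32_t)value << 1);
}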

Now if your registers are more than 32 bits wide, then you will have more room for representing larger numbers. For example, it is common to have processors with 40-bit "accumulator" registers. The best way to deal with overflow is to avoid saturating the result until it is necessary, since the result (for example an accumulated sum) can get smaller again as the calculation proceeds.
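Here is a minimal sketch of that idea in C, using a 64-bit integer as a stand-in for a wide hardware accumulator. The names `saturate_q15` and `dot_q15` are made up for this example, and the final shift assumes the compiler uses an arithmetic right shift for negative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide value to the range of a 16-bit Q.15 number. */
static int16_t saturate_q15(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Dot product of two Q.15 vectors. Each product has 30 fractional bits.
 * The products are summed in a wide accumulator with no saturation, and
 * the sum is converted back to Q.15 and saturated only once, at the end. */
int16_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    /* Drop 15 fractional bits to get back to Q.15.
     * Assumes >> on a negative int64_t is an arithmetic shift. */
    return saturate_q15(acc >> 15);
}
```

For example, with a = {-1, 0.5} and b = {-1, -1} in Q.15 (that is, {-32768, 16384} and {-32768, -32768}), the running sum reaches +1.0 after the first product, which does not fit in Q.15, but the final sum is 0.5 (16384) and comes out exactly. Saturating after every step would have given a slightly wrong answer. If the final sum really is out of range, as with -1 * -1 on its own, the result saturates to 32767.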

I hope that clears things up a little bit.

Cheers,

Shawn

Thanks for your quick explanation.

I also read your next blog on “Overflow Handling in Fixed Point Computations”.

I want to understand how to convert Q.15 * Q.15 = Q1.30 result to overcome the overflow condition.

In the special case, multiplying -1 * -1 gives -1 (answer in Q.30).

But as you suggested we would get +1 (in Q1.30).

Can you give an example on how to get Q.15 * Q.15 = Q1.30 (or any other example like add Q.15 + Q.15) result to overcome the overflow problem?

Regards,

Rohit