## Rounding in Fixed Point Number Conversions

When converting from one fixed point representation to another, there is often a right shift operation to eliminate bits. (Or higher order bits are just stored without keeping the lower order bits.) This occurs, for example, when converting a number from Q31 to Q15 format, since 16 bits need to be eliminated. Before throwing away the unused bits, it is sometimes desirable to perform a rounding operation first. This can improve the accuracy of results, and can prevent the introduction of a bias during conversion of a signal. Rounding is also an important operation when generating fixed point filter coefficients from floating point values, but that is not the subject of this post.

To illustrate rounding, I will use an example where six different signed Q7.8 numbers are converted to a signed Q15.0 number (a regular 16 bit integer). I will illustrate truncation (throwing away the least significant eight bits) and rounding. Recall that a signed Q7.8 number has a sign bit, seven integer bits and eight fractional bits. For the example, the six numbers will be 1.25, 1.5, 1.75, -1.25, -1.5 and -1.75.

The first thing to determine is how these numbers will be represented in a 16 bit integer register. Multiplying each by 256 (which is two to the power eight) gives the following result (in hexadecimal):

1.25 = 0x0140

1.5 = 0x0180

1.75 = 0x01C0

-1.25 = 0xFEC0

-1.5 = 0xFE80

-1.75 = 0xFE40
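The encoding step can be sketched in C as follows. This is only an illustration: the helper names `to_q7_8` and `q7_8_bits` are made up for this example, and the code assumes the input is within the Q7.8 range (roughly -128 to just under 128).

```c
#include <stdint.h>

/* Hypothetical helper: encode a real value as signed Q7.8.
   Scales by 2^8 = 256 and rounds to the nearest code (ties away from zero).
   Assumes the value is within the representable Q7.8 range. */
int16_t to_q7_8(double value)
{
    double scaled = value * 256.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    return (int16_t)scaled;   /* the cast drops the remaining fraction */
}

/* The same value viewed as the raw 16 bit register pattern. */
uint16_t q7_8_bits(double value)
{
    return (uint16_t)to_q7_8(value);
}
```

For the six example values the scaling is exact, so the rounding inside `to_q7_8` has no effect on them.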

Now if the numbers are truncated, the result is found by shifting right by eight. Here are the results:

truncate(1.25) = 0x0001 = 1

truncate(1.5) = 0x0001 = 1

truncate(1.75) = 0x0001 = 1

truncate(-1.25) = 0xFFFE = -2

truncate(-1.5) = 0xFFFE = -2

truncate(-1.75) = 0xFFFE = -2

For the positive numbers, the result of truncation is that the fractional part is discarded. The negative number results are more interesting. The result is that the fractional part is lost, and the integer part has been reduced by one. If a series of these numbers had a mean of zero before truncation, then the series would have a mean of less than zero after truncation. Rounding is used to avoid this problem of introduced bias and to make results more accurate.

Truncation is not really the correct term for the example above. More accurately, a “floor” operation is being executed. A floor operation returns the greatest integer that is not greater than the operand.
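To make the distinction concrete, here is a small C sketch contrasting the two operations on Q7.8 inputs. The function names are invented for this example. It assumes that `>>` on a negative signed value is an arithmetic shift, which C leaves implementation-defined but which is what common compilers do.

```c
#include <stdint.h>

/* Floor: an arithmetic right shift discards the 8 fractional bits
   and rounds toward negative infinity.
   Assumption: >> on a negative value is an arithmetic shift. */
int16_t q7_8_floor(int16_t x)
{
    return (int16_t)(x >> 8);
}

/* True truncation: C integer division rounds toward zero,
   so the fractional part is dropped for both signs. */
int16_t q7_8_trunc(int16_t x)
{
    return (int16_t)(x / 256);
}
```

With -1.25 (stored as -320), `q7_8_floor` gives -2 while `q7_8_trunc` gives -1. For positive inputs the two agree.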

A common method of rounding adds a binary one at the position of the most significant of the bits that are to be thrown away, and then performs the truncation. In the current example, we would add 0.5, represented as 128 decimal or 0x0080 in our 16 bit integer word. So the results in our example are as follows:

round(1.25) = (0x0140 + 0x80) >> 8 = 0x0001 = 1

round(1.5) = (0x0180 + 0x80) >> 8 = 0x0002 = 2

round(1.75) = (0x01C0 + 0x80) >> 8 = 0x0002 = 2

round(-1.25) = (0xFEC0 + 0x80) >> 8 = 0xFFFF = -1

round(-1.5) = (0xFE80 + 0x80) >> 8 = 0xFFFF = -1

round(-1.75) = (0xFE40 + 0x80) >> 8 = 0xFFFE = -2
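The add-then-shift method might look like this in C. This is a sketch rather than a reference implementation: the name `q7_8_round_half_up` is my own, the sum is widened to 32 bits so that inputs near 0x7FFF do not overflow, and an arithmetic right shift for negative values is assumed.

```c
#include <stdint.h>

/* Round a Q7.8 value to an integer by adding 0.5 (0x80) and then
   discarding the 8 fractional bits. Ties go toward positive infinity.
   Assumption: >> on a negative value is an arithmetic shift. */
int16_t q7_8_round_half_up(int16_t x)
{
    int32_t sum = (int32_t)x + 0x80;   /* widened to avoid overflow */
    return (int16_t)(sum >> 8);
}
```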

These results are less problematic than using simple truncation, but there is still a bias due to the non-symmetry of the 1.5 and -1.5 cases. The amount of bias depends on the data set. Even if a set of data to be converted contained only positive values, there is still a bias introduced, because all of the values that end in exactly .5 are rounded to the next higher integer. One way to eliminate this bias is to round even and odd values differently (even and odd referring to the bit just to the left of the rounding bit position).
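One way to treat even and odd values differently is "round half to even": do the usual add and shift, and when the discarded bits were exactly one half, force the result to be even. The sketch below is one possible C formulation for the Q7.8 example, not the only one. The function name is invented, and two's complement with an arithmetic right shift is assumed.

```c
#include <stdint.h>

/* Round a Q7.8 value to an integer, sending exact ties (.5) to the
   nearest even integer instead of always upward.
   Assumptions: two's complement, arithmetic >> for negative values. */
int16_t q7_8_round_half_even(int16_t x)
{
    int32_t result = ((int32_t)x + 0x80) >> 8;   /* round half up first */

    /* Exact tie: the 8 discarded bits were exactly 0x80. */
    if ((x & 0xFF) == 0x80) {
        result &= ~1;   /* clear the LSB so the result is even */
    }
    return (int16_t)result;
}
```

With this version 1.5 still rounds to 2, but -1.5 rounds to -2 and 2.5 rounds to 2, so ties no longer all move in the same direction.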

For the more common conversion of Q31 to Q15 numbers, the rounding constant is one shifted left by fifteen, or 32768 decimal, or 0x8000 hexadecimal.
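A C sketch of the Q31 to Q15 case follows. The function name is hypothetical. Two things here go beyond what is described above and are my own assumptions: the sum is formed in 64 bits so that adding 0x8000 cannot overflow, and the result is saturated to 0x7FFF, because the largest Q31 inputs would otherwise round up to a value that does not fit in Q15.

```c
#include <stdint.h>

/* Convert Q31 to Q15 with round half up.
   Assumptions: arithmetic >> for negative values; saturation on the
   positive side is an added safeguard, not part of the basic method. */
int16_t q31_to_q15_round(int32_t x)
{
    int64_t sum = (int64_t)x + 0x8000;   /* rounding constant, 1 << 15 */
    int64_t shifted = sum >> 16;         /* drop the 16 low bits */

    if (shifted > INT16_MAX) {
        shifted = INT16_MAX;             /* e.g. x = 0x7FFFFFFF */
    }
    return (int16_t)shifted;
}
```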

Some of the Texas Instruments DSPs have rounding instructions that can be performed on the accumulator register prior to saving a result to memory. For example, the TMS320C55x processor includes the ROUND instruction (full name is “round accumulator content”). The instruction has two different modes. The “biased” mode adds 0x8000 to the 40 bit accumulator register. The “unbiased” mode conditionally adds 0x8000 based on the value of the least significant 17 bits. It is designed to address the bias problems I described above. Wikipedia has a good discussion of rounding and bias errors (http://en.wikipedia.org/wiki/Rounding). The TMS320C55x is using the “round half to even” method of rounding for the unbiased mode, and “round half up” for the biased mode.

Although it seems simple on the surface, rounding in fixed point conversions has some important effects on the bias of resulting computations.


**Tags:** dsp, fixed point, Q15, Q31, rounding

January 18, 2011 at 7:50 pm

Excellent article!

I am using the TMS320VC5506 and have already written several assembly routines to make use of the ROUND instruction, as well as the automatic rounding modes for several other instructions. My only problem was that Texas Instruments’ description of the rounding was a little difficult to understand completely. Your article made things much clearer. It’s too bad TI doesn’t have someone with your writing skills on staff…

January 18, 2011 at 10:24 pm

Thanks for the compliment. I often find that the TI manuals explain what the instruction does, but don’t supply much detail of how or why to use the instruction.

September 6, 2011 at 10:33 pm

I am using an ARM processor.

I am converting Q15 data to Q31 format as follows:

Q31 value = (Q15 data)<<16;

For low amplitude values this works fine (the data is received from an ADC in Q15 format; the ADC supports 0-3.3 V).

If my input is high, then my Q31 value saturates, i.e. the Q31 value exceeds the limit (2^31).

Kindly suggest how I can solve this.

September 7, 2011 at 9:16 pm

Hi Venkatesh,

If your Q15 data is really only 16 bits, then shifting left by 16 bits shouldn’t be causing an overflow. I would suggest that you check the range of your Q15 data and verify that they fall in the range 0x8000 (most negative) to 0x7fff (most positive). Is the data stored in a 32 bit register or variable? Are you coding in C or assembly? It’s hard for me to suggest much more without knowing exactly what you’re doing.
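As a rough illustration, assuming C and that the sample really is held in a signed 16 bit type, the conversion could be written as below. The function name is invented. It multiplies by 65536 rather than shifting, because left-shifting a negative signed value is undefined in C, while the multiplication gives the same bit pattern and cannot overflow for any 16 bit input.

```c
#include <stdint.h>

/* Convert Q15 to Q31. Equivalent to a left shift by 16, written as a
   multiply so that negative inputs stay well-defined in C.
   The extremes map to 0x7FFF0000 and INT32_MIN, both within 32 bits. */
int32_t q15_to_q31(int16_t x)
{
    return (int32_t)x * 65536;
}
```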

May 3, 2012 at 10:36 pm

Hi, I’m trying to truncate a 32 bit integer like this: (num>>16)<<16, and was getting wrong results while accumulating. I guess I have to take care of rounding.

May 6, 2012 at 9:18 pm

Hi Vikas. If you are accumulating, and don’t have any “guard” bits on your processor, then yes you may need to truncate your numbers when accumulating (assuming your inputs can be large, or if you are accumulating a lot of values). Guard bits are extra bits in hardware, common on processors used for DSP, that allow results to overflow, and be scaled down later if needed. Another option is to use a larger word size (64 bits for example) but that has consequences for code speed.

A common method I’ve seen for accumulating numbers is to shift right by a small amount (with or without rounding, depending on need) before each add and store the results in an array. And then do the same thing with the new array of numbers (perhaps with a different shift). For example, say we are adding up 64 numbers. This could be done by accumulating 8 sets of 8 numbers, shifting right by 3 each time. Then, these 8 numbers would be added together, shifting right by 3 each time before adding. The result is the average of the 64 numbers (without overflowing, even for maximum input size). The shift right by 3 will cause a loss in precision, but this is often a better choice than going to a larger word size for the additions (depends on the requirements for the application).
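The 64-number example could be sketched in C like this. It is just one way to lay it out: the function name is made up, the shifts are done without rounding, and an arithmetic right shift for negative values is assumed.

```c
#include <stdint.h>

/* Average 64 values using only 32 bit additions.
   Stage 1: 8 groups of 8, each input shifted right by 3 before adding.
   Stage 2: the 8 partial sums, each shifted right by 3 before adding.
   No stage can overflow, even when every input is at full scale.
   Assumption: >> on a negative value is an arithmetic shift. */
int32_t average_64(const int32_t in[64])
{
    int32_t partial[8];

    for (int g = 0; g < 8; g++) {
        int32_t acc = 0;
        for (int i = 0; i < 8; i++) {
            acc += in[g * 8 + i] >> 3;
        }
        partial[g] = acc;
    }

    int32_t total = 0;
    for (int g = 0; g < 8; g++) {
        total += partial[g] >> 3;
    }
    return total;
}
```

The precision loss shows up at full scale: if all 64 inputs are 0x7FFFFFFF, this returns 0x7FFFFFF8 rather than 0x7FFFFFFF, because the low three bits are dropped by each shift.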