Fixed Point Extensions to the C Programming Language

Posted September 10, 2009 by Shawn
Categories: dsp, fixed point

Tags: , , ,

Recently I ran across an ISO specification for extensions to the C programming language to support fixed point types. The types are defined in a header file called stdfix.h. I have attached an early draft of the ISO spec (from 2006) here:


I don’t think the extensions simplify the use of fixed types very much. The programmer still needs to know how many bits are allocated to integer and fractional parts, and how the number and positions of bits may change (during multiplication for example). What the extensions do provide is a way to access the saturation and rounding modes of the processor without writing assembly code. With this level of access, it is possible to write much more efficient C code to handle these operations.

The advantages of C code over assembly are quicker coding and debugging, and more portable code (that is, code that can run on more than one type of processor). However, I noticed that details such as fixed point fractional points and handling of rounding are implementation dependent. So the portability may only be applicable for “similar” processors.

I have never coded anything using the stdfix.h definitions. As far as I can see, the GCC compiler and the Dinkumware libraries are the only tools using these extensions. I’m not sure if or when it will come into popular use, but it’s something to consider if one is coding fixed point math operations in C.

Rounding in Fixed Point Number Conversions

Posted August 19, 2009 by Shawn
Categories: dsp, fixed point

Tags: , , , ,

When converting from one fixed point representation to another, there is often a right shift operation to eliminate bits. (Or higher order bits are just stored without keeping the lower order bits.) This occurs when converting from a Q31 to a Q15 format number for example, since 16 bits need to be eliminated. Before throwing away the unused bits, sometimes it is desirable to perform a rounding operation first. This can improve the accuracy of results, and can prevent the introduction of a bias during conversion of a signal. Rounding is also an important operation when generating fixed point filter coefficients from floating point values, but that is not the subject of this post.

To illustrate rounding, I will use an example where six different signed Q7.8 numbers are converted to a signed Q15.0 number (a regular 16 bit integer). I will illustrate truncation (throwing away the least significant eight bits) and rounding. Recall that a Q7.8 number has seven integer bits and eight fractional bits. For the example, the six numbers will be 1.25, 1.5, 1.75, -1.25, -1.5 and -1.75.

The first thing to determine is how these numbers will be represented in a 16 bit integer register. Multiplying each by 256 (which is two to the power eight) gives the following result (in hexadecimal):

1.25 = 0x0140

1.5 = 0x0180

1.75 = 0x01C0

-1.25 = 0xFEC0

-1.5 = 0xFE80

-1.75 = 0xFE40

Now if the numbers are truncated, the result is found by shifting right by eight. Here are the results:

truncate(1.25) = 0x0001 = 1

truncate(1.5) = 0x0001 = 1

truncate(1.75) = 0x0001 = 1

truncate(-1.25) = 0xFFFE = -2

truncate(-1.5) = 0xFFFE = -2

truncate(-1.75) = 0xFFFE = -2

For the positive numbers, the result of truncation is that the fractional part is discarded. The negative number results are more interesting. The result is that the fractional part is lost, and the integer part has been reduced by one. If a series of these numbers had a mean of zero before truncation, then the series would have a mean of less than zero after truncation. Rounding is used to avoid this problem of introduced bias and to make results more accurate.

Truncation is not really the correct term for the example above. More accurately, a “floor” operation is being executed. A floor operation returns the greatest integer that is not greater than the operand.

In a common method of rounding, a binary one is added to the most significant bit of the bits that are to be thrown away. And then a truncation is performed. In the current example, we would add 0.5, represented as 128 decimal or 0x0080 in our 16 bit integer word. So the results in our example are as follows:

round(1.25) = (0x0140 + 0x80) >> 8 = 0x0001 = 1

round(1.5) = (0x0180 + 0x80) >> 8 = 0x0002 = 2

round(1.75) = (0x01C0 + 0x80) >> 8 = 0x0002 = 2

round(-1.25) = (0xFEC0 + 0x80) >> 8 = 0xFFFF = -1

round(-1.5) = (0xFE80+ 0x80) >> 8 = 0xFFFF = -1

round(-1.75) = (0xFE40 + 0x80) >> 8 = 0xFFFE = -2

These results are less problematic than using simple truncation, but there is still a bias due to the non-symmetry of the 1.5 and -1.5 cases. The amount of bias depends on the data set. Even if a set of data to be converted contained only positive values, there is still a bias introduced, because all of the values that end in exactly .5 are rounded to the next highest integer. One way to eliminate this bias is to round even and odd values differently (even and odd to the left of the rounding bit position).

For the more common conversion of Q31 to Q15 numbers, the rounding constant is one shifted left by fifteen, or 32768 decimal, or 0x8000 hexadecimal.

Some of the Texas Instrument DSPs have rounding instructions that can be performed on the accumulator register prior to saving a result to memory. For example, the TMS320C55x processor includes the ROUND instruction (full name is “round accumulator content”). The instruction has two different modes. The “biased” mode adds 0x8000 to the 40 bit accumulator register. The “unbiased” mode conditionally adds 0x8000 based on the value of the least significant 17 bits. It is designed to address the bias problems I described above. Wikipedia has a good discussion of rounding and bias errors ( The TMS320C55x is using the “round half to even” method of rounding for the unbiased mode, and “round half up” for the biased mode.

Although it seems simple on the surface, rounding in fixed point conversions has some important effects on the bias of resulting computations.

Overflow Handling in Fixed Point Computations

Posted August 10, 2009 by Shawn
Categories: fixed point

Tags: , , , ,

Overflow handling is an important consideration when implementing signal processing algorithms. If overflow is not controlled appropriately it can lead to problems such as detection errors, or poor quality audio output. Typical digital signal processing CPUs include hardware support for handling overflow. Some RISC processors may include these modes as well. (In fact I helped define and implement such modes for the 32 bit MIPS processor core used in many Broadcom products). These processors often have a “saturating” mode that sets an instruction result to a minimum or maximum value on an overflow condition. (The term “saturating” comes from analog electronics, in which an amplifier output will be limited, or clipped, between fixed values when a large input is applied.) Commonly the CPU will limit the result to a 32 bit twos complement integer (0x7FFFFFFF or 0x80000000). For unsigned operations, the result would be limited to 0xFFFFFFFF. There are a number of situations in which overflow can occur, and I will discuss some of them below.

Addition and Subtraction

Overflow with twos complement integers occurs when the result of an addition or subtraction is larger the largest integer that can be represented, or smaller than the smallest integer. In fixed point representation, the largest or smallest value depends on the format of the number. I will assume Q31 in a 32 bit register for any examples that follow. In this case, a CPU with saturation arithmetic would set the result to -1 or (just below) +1 on an overflow, corresponding to the integer values 0x80000000 and 0x7FFFFFFF.

Overflow in addition can only occur when the sign of the two numbers being added is the same. Overflow in subtraction can occur only when a negative number is subtracted from a positive number, or when a positive number is subtracted from a negative number.


There is one case where negation of a number causes an overflow condition. When the smallest negative number is negated, there is no way to represent the corresponding positive value in twos complement. For example, the value -1 in Q31 is 0x80000000. When this number is negated (flip the bits and add one) the result is again -1. If the saturation mode is set, then the CPU will set the result to 0x7FFFFFFF (just less than +1).

Arithmetic Shift

Overflow can occur when shifting a number left by 1 to n bits. In fixed point computations, left shifting is used to multiply a fixed point value by a power of two, or to change the format of a number (Q15 to Q31 for example). Again, many CPUs have saturation modes to set the output to the minimum or maximum 32 bit integer (depending on whether the original number was positive or negative). Furthermore, a common feature is an instruction that counts the number of leading ones or zeros in a number. This helps the programmer avoid overflow since the number of leading sign bits determines how large a shift can be done without causing overflow.

Overflow will not occur when right shifting a number.


Overflow doesn’t really occur during multiplication if the result register has enough bits (32 bits if two 16 bit numbers are multiplied). But it is partly a matter of interpretation. When multiplying a fixed point value of -1 by -1 (0x8000 by 0x8000 using Q15 numbers), the result is +1. If the result is interpreted as a Q1.30 number (one integer bit and 30 fractional bits) then there is no problem. If the result is to be a Q30 number (no integer bits) then an overflow condition has occurred. And if the number was to be converted to Q31 (by shifting the result left by 1) then an overflow would occur during the left shift. The overall affect would be that -1 times -1 equals -1.

I have used a CPU that handles this special case with saturation hardware. Some CPUs have a multiplication mode that shifts the product left by one bit after a multiply operation. The reason for doing so is to create a Q31 result when two Q15 numbers are multiplied. Then if a Q15 result is desired, it can be found by storing the upper 16 bits of the result register (if the register is only 32 bits). The saturating mode automatically sets the result to 0x7FFFFFFF when the number 0x8000 is multiplied by itself, and the “shift left by one” multiplication mode is enabled.

A very often used operation in DSP algorithms is the “multiply accumulate” or “MAC”, where a series of numbers is multiplied and added to a running sum. I would recommend not using the “left shift by one” mode if possible when doing MACs, since this only increases the chance for overflow. A better technique is to keep the result as Q1.30, and then handle overflow if converting the final result to Q31 or Q15 (or whatever). This is also a good technique to use on CPUs without saturation modes, since the number of overflow checks can be greatly reduced in some cases.


Overflow in division can occur when the result would have more bits than was calculated. For example, if the magnitude of the numerator is several times larger than that of the denominator, than the result must have enough bits to represent numbers larger than one. Overflow can be avoided by carefully considering the range of numbers being operated on, and calculating enough bits for the result. I have not seen a CPU that implements a saturation mode for division.

Division by 0 is undefined, and not really an overflow case.


Many CPUs include hardware supported handling of overflow using saturation modes. These modes are useful, but it is better to avoid overflow in the first place if possible. This can lead to more accurate results in computations. And when using a CPU without saturation arithmetic, it is best to design the arithmetic operations so that the number of overflow checks is minimized.

Blurry Equations

Posted August 4, 2009 by Shawn
Categories: Uncategorized

Tags: ,

I noticed that some of the equations look a bit blurry in my blog. If they look excessively blurry, try changing the web browser zoom value to default (ctrl+0 in Internet Explorer or Firefox). Also, the font for the C code examples may look small in some cases. If this is a problem, try clicking on the title of the post.

Fixed Point Division

Posted July 21, 2009 by Shawn
Categories: Uncategorized

I have created a new post to replace this one.  Click here to see it.

Fixed Point Multiplication

Posted July 14, 2009 by Shawn
Categories: fixed point

Tags: , , ,

Fixed point addition and subtraction are straightforward. Additions and subtractions are performed using integer operations. For example, if two 16 bit Q15 format numbers are added, the result is a Q15 number. But what about fixed about multiplication? What happens if two Q15 numbers are multiplied?

Let’s try an example. Take 0.5 multiplied by 0.25. In Q15 the number 0.5 is represented (in hexadecimal) as 0x8000 times 0.5 or 0x4000. Similarly, 0.25 is 0x2000. When we multiply these together, the product is 0x08000000. Obviously the result is not a Q15 number since the number of bits required is more than 16. The expected product, 0.125, is 0x1000 in Q15.

To see what is going on, define the following two Q15 numbers a and b:

a = \frac{\hat{a}}{{2}^{15}}

b = \frac{\hat{b}}{{2}^{15}}

where \hat{a} and \hat{b} are the integer representations of our numbers (0x4000 and 0x2000 in our example). The product of a and b is:

c = ab=\frac{\hat{a}\hat{b}}{{2}^{30}}

From the above, it can be seen that the product is a Q30 number. Going back to our example, 0x4000 times 0x2000 is 0x08000000, which is 0.125 times {2}^{30} .

A general rule when multiplying a Qm format number by a Qn format number, is that the product will be a Q(m+n) number. The number of bits required to represent the product is at least (n+m) for unsigned multiplication and (n+m+1) for signed (twos complement) multiplication.

For the more general case of a Qa.b number times a Qc.d number, the product is Q(a+c).(b+d).  The number of bits needed for the result is (a + b + c + d + 1) for signed numbers (and one less for unsigned numbers).

Consider the example of a Q16 unsigned multiplication between the two largest unsigned numbers that can be represented. The largest Q16 number is 65535/65536 = 0.9999847412109375. The product is 0xffff times 0xffff or 0xfffe0001. The result is a Q32 number requiring at least 32 bits. If we divide by {2}^{32} then we get 0.99996948265470564365386962890625, the expected result.

There are a number of things that are done with the product of a multiplication, depending on the application. Some of the commonly seen options are:

  1. Convert the product to a different Q format.

  2. Use the product in the resulting Q format.

  3. Add the product to a running sum in an accumulator register.

  4. Convert the product to a different Q format, then add to a running sum.

Let’s look at some of these options for the case of signed multiplication using Q15 format numbers. For case 1, assume we want to multiply two Q15 numbers and get a Q15 result. The required operation is to take the Q30 product, and shift it right by 15 bits. The result can then be stored in 16 bits. There is also the option of rounding the product before shifting out the lower 15 bits (I may discuss rounding in a future post). Some CPU architectures are better set up to shift the product left by 1, and then store the upper 16 bits. This is almost exactly the same as shifting right by 15 bits and keeping the lower 16 bits.

Multiply-accumulate (MAC) operations are used a lot in many DSP algorithms. Many processors have one or more dedicated accumulator registers for this purpose (often with 32 or 40 bits). For the case of Q15 multiplies, each Q30 product can be summed to the accumulator.

I have seen a lot of code that shifts each product left by 1 when performing the MAC operations. Some DSP chips can do the left shift in hardware using a special mode of the ALU. In this case, the value in the accumulator is in Q31 format. Although very common, this method has a greater chance of overflow problems since each product is effectively two times bigger. I think this method became popular because certain older DSP chip architectures required the storing of the high 16 bits of the accumulator or product register, rather than having a single cycle instruction allowing a shift by 15 bits.

In summary, because multiplication operations are often a chief component of signal processing implementations, it is important to understand how they work. This is especially true for fixed point operations, where one must know the effect of multiplication on the format of the numbers themselves.

Introduction to Fixed Point Representation

Posted July 10, 2009 by Shawn
Categories: fixed point

Fixed point representation is a method of storing numbers in binary format. It is widely used in DSP products for telecommunications. One reason to use to use fixed point format (rather than floating point) is for cost savings in the digital signal processing chips used for implementing a system. Another reason is to have greater precision than floating point for a given number of bits per number represented.

So what is fixed point? Fixed point refers to a method of representing numbers with a fractional part on an ALU that only handles integer operations. Companies will often market their DSP processors as either “fixed point” or “floating point” models. Fractional digits of a number are handled with integers by using an assumed scaling factor, such as {2}^{15} . This is commonly called Q15 notation.

To understand how Q15 numbers are handled, it helps to understand the twos complement representation of integers. I won’t explain it here, but there are plenty of explanations on the web (on Wikipedia for example). Now assume we have a 16 bit twos complement integer a as follows, where s is a sign bit and {a}_{n} is the digit at bit n:

a = <s {a}_{14} {a}_{13} ... {a}_{1} {a}_{0}>

For a positive integer a, the value of the number a is:

a = {a}_{14}{2}^{14} + {a}_{13}{2}^{13} + ... + {a}_{1}{2}^{1} + {a}_{0}{2}^{0}

For a Q15 number, the weightings of each bit change. Let b = a/{2}^{15}. Then the value of a positive Q15 number b is:

b = {a}_{14}{2}^{-1} + {a}_{13}{2}^{-2} + ... + {a}_{1}{2}^{-14} + {a}_{0}{2}^{-15}

The range of numbers that can be handled is from -1 to 0.999969482421875, corresponding to the integers -32768 and 32767, which are the smallest and largest integers that can be represented in 16 bit twos complement.

Negative numbers are found by taking the twos complement of a given positive value, in the same manner as for integers. One way to do this is to flip all the bits (change all the ones to zeros, change all the zeros to ones) and then add one. For example, assume we have the Q15 number 0.5, which is 0x4000 in hexadecimal. Flipping the bits gives 0xBFFF, and adding one produces 0xC000. So a Q15 value of -0.5 is stored as 0xC000. One case that does not work is if one tries to negate a Q15 value of -1, since +1 is not possible in Q15. If you try flipping the bits and adding one, you will end up with a result of -1 instead of +1. This is an overflow case, and some DSP chips will produce the largest positive Q15 number (32767/32768) if -1 is negated.

Q15 isn’t the only possible representation of course. It is possible to use different scaling factors and different bit widths. For example, Q31 is common for 32 bit number widths. When using unsigned operations, it is possible to store a Q16 number in a 16 bit word. In this case we have:

a = {a}_{15}{2}^{-1} + {a}_{14}{2}^{-2} + ... + {a}_{1}{2}^{-15} + {a}_{0}{2}^{-16}

It is also possible to mix integer and fractional parts in one word. For example, we could have a number with 3 integer bits and 12 fractional bits stored in a 16 bit word. Lets call this format Q3.12. The range of numbers in Q3.12 using 16 bit twos complement is -32768/{2}^{12} to 32767/{2}^{12}, or -8 to 7.999755859375.

And finally, there is no rule that says a Q15 number can’t be stored in a 32 bit register. In this case, there would be 17 bits used for the sign part of the number (for twos complement).

So in conclusion, there are a large variety of number formats that can be handled when using a processor or hardware with integer-based arithmetic. Fixed point representation relies on an assumed scaling factor and a given register width. These two variables determine where the decimal point is. Fixed point arithmetic was taught to me by my first boss during the first few days at my first full time job. As my boss used to say “real men know where to find the decimal point.” The difficulty in finding the decimal point will become more clear when I cover fixed point multiplication.