Introduction to Fixed Point Representation

Fixed point representation is a method of storing numbers in binary format. It is widely used in DSP products for telecommunications. One reason to use fixed point format (rather than floating point) is the cost savings in the digital signal processing chips used to implement a system. Another reason is that, for a given number of bits per number represented, fixed point can offer greater precision than floating point.

So what is fixed point? Fixed point refers to a method of representing numbers with a fractional part on an ALU that only handles integer operations. Companies will often market their DSP processors as either “fixed point” or “floating point” models. Fractional digits of a number are handled with integers by using an assumed scaling factor, such as 2^15. This is commonly called Q15 notation.
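As a quick sketch of the scaling idea (the helper names here are my own, not from the article), converting between a fractional value and its Q15 integer is just multiplying or dividing by the assumed scale factor 2^15:

```python
def float_to_q15(x):
    """Convert a float in [-1, 1) to a Q15 integer by scaling by 2**15."""
    return int(round(x * (1 << 15)))

def q15_to_float(q):
    """Recover the fractional value by dividing out the 2**15 scale factor."""
    return q / (1 << 15)
```

So 0.5 stores as the integer 16384, and 16384 read back as Q15 gives 0.5 again; the ALU itself only ever sees the integers.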

To understand how Q15 numbers are handled, it helps to understand the two's complement representation of integers. I won’t explain it here, but there are plenty of explanations on the web (on Wikipedia, for example). Now assume we have a 16 bit two's complement integer a as follows, where s is a sign bit and a_n is the digit at bit n:

a = <s a_14 a_13 ... a_1 a_0>

For a positive integer a, the value of the number a is:

a = a_14·2^14 + a_13·2^13 + ... + a_1·2^1 + a_0·2^0

For a Q15 number, the weightings of each bit change. Let b = a/2^15. Then the value of a positive Q15 number b is:

b = a_14·2^-1 + a_13·2^-2 + ... + a_1·2^-14 + a_0·2^-15

The range of numbers that can be handled runs from -1 to 0.999969482421875, corresponding to the integers -32768 and 32767, which are the smallest and largest integers that can be represented in 16 bit two's complement.
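The bit-weight formula above can be transcribed directly. One point the formula leaves implicit (but the -1 lower bound requires) is that in two's complement the sign bit carries a negative weight, -2^0 for Q15. A sketch, with the bit list ordered sign bit first:

```python
def q15_value(bits):
    """Value of a Q15 number from its bits [s, a14, a13, ..., a0].

    The sign bit has weight -1 (standard two's complement); bit a_(15-k)
    has weight 2**-k, matching the weighting formula for Q15.
    """
    value = -bits[0]                      # sign bit: weight -2**0
    for k, b in enumerate(bits[1:]):      # a14 down to a0
        value += b * 2.0 ** -(k + 1)      # weights 2**-1 ... 2**-15
    return value
```

Feeding in all ones except the sign bit reproduces the largest Q15 value, 32767/32768 = 0.999969482421875, and a lone sign bit gives -1.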

Negative numbers are found by taking the two's complement of a given positive value, in the same manner as for integers. One way to do this is to flip all the bits (change all the ones to zeros, change all the zeros to ones) and then add one. For example, assume we have the Q15 number 0.5, which is 0x4000 in hexadecimal. Flipping the bits gives 0xBFFF, and adding one produces 0xC000. So a Q15 value of -0.5 is stored as 0xC000. One case that does not work is negating a Q15 value of -1, since +1 is not representable in Q15. If you try flipping the bits and adding one, you will end up with a result of -1 instead of +1. This is an overflow case, and some DSP chips will produce the largest positive Q15 number (32767/32768) if -1 is negated.
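The flip-and-add-one recipe, and the saturating behavior some DSP chips apply to the overflow case, can be sketched as follows (the masking to 16 bits stands in for a 16 bit register; the function names are mine):

```python
def negate16(word):
    """Two's complement negation of a 16 bit pattern: flip all bits, add one."""
    return (~word + 1) & 0xFFFF   # mask emulates a 16 bit register

def q15_negate_sat(q):
    """Negate a Q15 value (as a signed integer) with saturation on overflow,
    the way some DSP hardware handles negating -1."""
    if q == -32768:               # Q15 -1: negating would overflow past +1
        return 32767              # saturate to 32767/32768
    return -q
```

Note that plain negate16 maps 0x8000 (Q15 -1) back to 0x8000, which is exactly the wrap-around described above; the saturating version returns 32767 instead.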

Q15 isn’t the only possible representation, of course. It is possible to use different scaling factors and different bit widths. For example, Q31 is common for 32 bit number widths. When using unsigned operations, it is possible to store a Q16 number in a 16 bit word. In this case we have:

a = a_15·2^-1 + a_14·2^-2 + ... + a_1·2^-15 + a_0·2^-16

It is also possible to mix integer and fractional parts in one word. For example, we could have a number with 3 integer bits and 12 fractional bits stored in a 16 bit word. Let’s call this format Q3.12. The range of numbers in Q3.12 using 16 bit two's complement is -32768/2^12 to 32767/2^12, or -8 to 7.999755859375.
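Interpreting a mixed format like Q3.12 is the same game with a different scale factor, 2^12 instead of 2^15. A small sketch (again with hypothetical helper names):

```python
def q3_12_to_float(word):
    """Interpret a signed 16 bit integer as Q3.12: divide by 2**12."""
    return word / (1 << 12)

def float_to_q3_12(x):
    """Store a value in [-8, 8) as a Q3.12 signed integer."""
    return int(round(x * (1 << 12)))
```

The 16 bit extremes -32768 and 32767 come out as -8 and 7.999755859375, matching the range given above.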

And finally, there is no rule that says a Q15 number can’t be stored in a 32 bit register. In this case, there would be 17 bits used for the sign part of the number (for two's complement, the upper 17 bits would all match the sign).
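Widening a Q15 number into a 32 bit register is just sign extension: copy the sign bit into the upper 16 bits, so all 17 leading bits agree. A sketch operating on raw bit patterns:

```python
def sign_extend_q15(word16):
    """Sign-extend a 16 bit two's complement pattern into a 32 bit word.

    Positive values keep zeros in the upper 16 bits; negative values
    get ones, so the upper 17 bits all repeat the sign.
    """
    if word16 & 0x8000:               # sign bit set: number is negative
        return word16 | 0xFFFF0000    # fill the upper 16 bits with ones
    return word16                     # positive: upper bits stay zero
```

For example, Q15 -0.5 (0xC000) becomes 0xFFFFC000, while +0.5 (0x4000) is unchanged.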

So in conclusion, there is a large variety of number formats that can be handled when using a processor or hardware with integer-based arithmetic. Fixed point representation relies on an assumed scaling factor and a given register width. These two variables determine where the decimal point is. Fixed point arithmetic was taught to me by my first boss during the first few days at my first full time job. As my boss used to say, “real men know where to find the decimal point.” The difficulty in finding the decimal point will become clearer when I cover fixed point multiplication.

6 Comments on “Introduction to Fixed Point Representation”


  2. Pure1 Says:

    Can you please explain the part where you talk about putting a Q15 number into 32 bits? Frankly, I am new to DSP, so I just want to know. Thank you

    • Shawn Says:

      What I was trying to get across is that a Q15 number could be stored in any register (memory location, etc.) that is at least 16 bits. It could be stored in one that is more than 16 bits, such as one of the registers of a 32 bit processor. In this case, the upper 16 bits would be used only for the sign (so 16 zero bits for positive numbers or 16 one bits for negative numbers), and the lower 16 bits would have one sign bit and the 15 fractional bits.

      You could also store 16 bit numbers in the highest 16 bits of a 32 bit register (and just leave the lowest 16 as zeros). But in this case, I think it is easier to think of the resulting numbers as Q31 numbers than as Q15 numbers stored in the upper part of a 32 bit register.
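The equivalence Shawn describes here is easy to check: shifting a Q15 integer up by 16 bits multiplies the stored integer by 2^16, and reading the result with a Q31 scale factor (2^31) divides that back out, leaving the same fractional value. A sketch (function name is mine):

```python
def q15_to_q31(q15_word):
    """Place a Q15 value in the top 16 bits of a 32 bit word.

    The shift multiplies the integer by 2**16, so reading the result
    at Q31 scale (dividing by 2**31) yields the same fraction as
    reading the original at Q15 scale (dividing by 2**15).
    """
    return q15_word << 16
```

So Q15 0.5 (integer 16384) becomes the Q31 integer for 0.5, with the low 16 bits zero.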

  3. Christian Says:

    I guess I found a way to calculate the decimal representation of a Qm.n number from its binary representation.
    For everyone interested, I want to share it; I hope this is alright with you? If not, just delete this post, please.
    Given the binary representation of a Qm.n number ( m >= 1, n >= 0 (?) ), first take a look at the leftmost bit ( assuming that the most significant bit is on the left of the binary representation (?) ). If it is zero, just take the number as the representation of an unsigned integer and calculate its decimal value ( normal conversion, like many calculator programs offer; be careful to treat all zeros as zero, so start counting from zero and not from one: e.g. for a 2 bit unsigned integer, 00 -> 0, 01 -> 1, 10 -> 2, 11 -> 3 ). Afterwards multiply the result by 2^(-1 * n) and you are done.
    However, if the leftmost bit is one, do the following:
    Subtract (!) 1 ( = binary! ) from the binary representation of the number. E.g. subtracting one from 110 yields 101, and subtracting one from 111 yields 110.
    Next, toggle / invert (?) all the bits of the result; in the example, 101 changes to 010 and 110 changes to 001.
    Now we have to check for a very special case: is the new binary number we obtained exactly the same as the one we started with in the very beginning?
    If this is the case ( e.g. it is for 100: 100 - 1 = 011, and inverting / toggling yields 100, which is what we started with ), no positive number exists in the range of our Qm.n format with the same absolute value as that of our number.
    This is the case for the mathematically smallest number we can represent in our Qm.n format. So in this case we have to determine the smallest number of our Qm.n format, and then we are done; it is the result of our conversion. Anyway, it is always good to know the smallest number available in the used format.
    You can calculate / determine it just from the m of Qm.n, given that m is larger than or equal to one ( the assumption from the very beginning ). It is -1*( 2^( m-1 ) ) ( taken from , equation 6 ).
    ( Note: m needs to be greater than or equal to one in the Qm.n format because at least the sign bit ( = leftmost bit ) has to be part of m and not of n, as far as I understand. )
    In all other cases, that is, if the binary number you obtained differs from the original one, take the obtained number, treat it as an unsigned integer, and convert it to its decimal representation ( as explained for the case when the leftmost bit is 0 ). Afterwards multiply the result by 2^(-1 * n) and you are done.

    All this assumes 2's complement format ( and thereby also signed numbers )!

    For understanding this I used after reading this page mainly and for the limits ( range ) of a Qm.n number .

    I’m open to better solutions, but this is at least enough to solve my task. Again, thank you for the report, which helped me get started with my current approach.

    BTW: The other way round can be found on Wikipedia under Float to Q. As a third step, in addition to the two explained there, you need to find the 2's complement binary representation of the nearest integer ( cut off the sign and remember it, find the unsigned integer binary representation, and fill it with zeros on the left until it has a total length of m + n; if the sign was negative, invert it and add a binary 1, see or Wikipedia's 2's complement text ).


    • Shawn Says:

      Hi Christian. Thank you for your comments. I have a couple of things to add. First, when negating a 2's complement binary number, an easy way to do this is to first invert all the bits and then add one. Subtracting one and then inverting the bits also works, but is slightly harder. Second, the special case where negating the largest negative integer gives back the same number is a common issue in using 2's complement. Some processors have hardware that will output the largest positive number when negating the largest possible negative number.
