Introduction to Fixed Point Representation

Fixed point representation is a method of storing numbers in binary format. It is widely used in DSP products for telecommunications. One reason to use to use fixed point format (rather than floating point) is for cost savings in the digital signal processing chips used for implementing a system. Another reason is to have greater precision than floating point for a given number of bits per number represented.

So what is fixed point? Fixed point refers to a method of representing numbers with a fractional part on an ALU that only handles integer operations. Companies will often market their DSP processors as either “fixed point” or “floating point” models. Fractional digits of a number are handled with integers by using an assumed scaling factor, such as {2}^{15} . This is commonly called Q15 notation.

To understand how Q15 numbers are handled, it helps to understand the twos complement representation of integers. I won’t explain it here, but there are plenty of explanations on the web (on Wikipedia for example). Now assume we have a 16 bit twos complement integer a as follows, where s is a sign bit and {a}_{n} is the digit at bit n:

a = <s {a}_{14} {a}_{13} ... {a}_{1} {a}_{0}>

For a positive integer a, the value of the number a is:

a = {a}_{14}{2}^{14} + {a}_{13}{2}^{13} + ... + {a}_{1}{2}^{1} + {a}_{0}{2}^{0}

For a Q15 number, the weightings of each bit change. Let b = a/{2}^{15}. Then the value of a positive Q15 number b is:

b = {a}_{14}{2}^{-1} + {a}_{13}{2}^{-2} + ... + {a}_{1}{2}^{-14} + {a}_{0}{2}^{-15}

The range of numbers that can be handled is from -1 to 0.999969482421875, corresponding to the integers -32768 and 32767, which are the smallest and largest integers that can be represented in 16 bit twos complement.

Negative numbers are found by taking the twos complement of a given positive value, in the same manner as for integers. One way to do this is to flip all the bits (change all the ones to zeros, change all the zeros to ones) and then add one. For example, assume we have the Q15 number 0.5, which is 0x4000 in hexadecimal. Flipping the bits gives 0xBFFF, and adding one produces 0xC000. So a Q15 value of -0.5 is stored as 0xC000. One case that does not work is if one tries to negate a Q15 value of -1, since +1 is not possible in Q15. If you try flipping the bits and adding one, you will end up with a result of -1 instead of +1. This is an overflow case, and some DSP chips will produce the largest positive Q15 number (32767/32768) if -1 is negated.

Q15 isn’t the only possible representation of course. It is possible to use different scaling factors and different bit widths. For example, Q31 is common for 32 bit number widths. When using unsigned operations, it is possible to store a Q16 number in a 16 bit word. In this case we have:

a = {a}_{15}{2}^{-1} + {a}_{14}{2}^{-2} + ... + {a}_{1}{2}^{-15} + {a}_{0}{2}^{-16}

It is also possible to mix integer and fractional parts in one word. For example, we could have a number with 3 integer bits and 12 fractional bits stored in a 16 bit word. Lets call this format Q3.12. The range of numbers in Q3.12 using 16 bit twos complement is -32768/{2}^{12} to 32767/{2}^{12}, or -8 to 7.999755859375.

And finally, there is no rule that says a Q15 number can’t be stored in a 32 bit register. In this case, there would be 17 bits used for the sign part of the number (for twos complement).

So in conclusion, there are a large variety of number formats that can be handled when using a processor or hardware with integer-based arithmetic. Fixed point representation relies on an assumed scaling factor and a given register width. These two variables determine where the decimal point is. Fixed point arithmetic was taught to me by my first boss during the first few days at my first full time job. As my boss used to say “real men know where to find the decimal point.” The difficulty in finding the decimal point will become more clear when I cover fixed point multiplication.


23 Comments on “Introduction to Fixed Point Representation”

  1. […] libraries, and it is not too difficult to write your own, although some Assembly maybe required. Introduction to Fixed Point Representation Fixed-Point DSP and Algorithm Implementation BigDog The Edge… there is no honest way to […]

  2. Pure1 Says:

    Can you please explain to me that part where you say about putting 15 numbers into the 32 bits, frankly I am new to DSP so I just want to know. Thank You

    • Shawn Says:

      What I was trying to get across is that a Q15 number could be stored in any register (memory location, etc.) that is at least 16 bits. It could be stored in one that is more than 16 bits, such as one of the registers of a 32 bit processor. In this case, the upper 16 bits would be used only for the sign (so 16 zero bits for positive numbers or 16 one bits for negative numbers), and the lower 16 bits would have one sign bit and the 15 fractional bits.

      You could also store 16 bit numbers in the highest 16 bits of a 32 bit register (and just leave the lowest 16 as zeroes). But in this case, I think it is easier to think of the resulting numbers as Q31 numbers, than as Q15 numbers stored in the upper part of a 32 bit register.

  3. Christian Says:

    I guess I found a way to calculate the decimal representation of a Qm.n number out of its binary representation.
    For everyone interested I want to share the way, I hope this is alright for you? If not, just delete this post, please.
    Given you have the binary representation of a Qm.n number ( m >= 1, n>= 0 (?) ), take at first a look at the leftmost bit ( assuming that the most significant bit is on the left of the binary number representation (?)). If it is zero, just take the number as the representation of an unsigned integer and calculate its decimal value ( normal conversion like many calculator-software offer, be careful to treat all zeros as zero and not as one, so start with zero and not with one if you are counting: eg. for 2 bits unsigned integer 00 -> 0, 01 -> 1, 10 -> 2, 11 -> 3 ). Afterwards multiply the result with 2^(-1 * n ) and you are done.
    However if the leftmost bit is one, do the following:
    Substract (!) 1 ( = binary ! ) from the binary representation of the number. E.g. if you have 110 substracting one yields 101 or as an 111 substracting one yields 110.
    Next toggle / invert (?) all the bits of the result, in the example 101 changes to 010 or 110 changes to 001.
    Now we have to check for a very special case: is the new binary number we now obtained exactly the same like the one we started with in the very beginning?
    If this is the case ( e.g. it is for 100 – 1 = 011 , inverting / toggling yields 100 which is what we started with. ), no positive number in the range of our Qm.n format exists with the same absolute value as that of our number.
    This is the case for the mathematical smallest number we can represent in our Qm.n format. So in this case we have to determine the smallest number of our Qm.n format and then we are done, it is the result of our conversion. Anyway it is always good to know the smallest number avaible in the used format.
    You can calculate / determine it just by using m of Qm.n, given that m is larger or equal than one ( assumption from the very beginning ). It is -1*( 2^( m-1 ) ) ( taken from , equation 6 ).
    ( Note: m needs to be bigger or equal to one in Qm.n format because at least the sign bit ( = leftmost bit ) has to be part of m and not of n, as far as I understand. )
    In all other cases, this is if the binary number you obtained differs from the original one, take the obtained number, treat it as an unsigned integer and convert it to its decimal representation ( as explained in the case when the very leftmost bit is 0) . Afterwards multiply the result by 2^(-1 * n ) and you are done.

    All this assumes 2s complement format ( and thereby also signed numbers ) !

    For understanding this I used after reading this page mainly and for the limits ( range ) of a Qm.n number .

    I’m open for better solutions, but this is at least enough to solve my task. Again thank you for the report, which helped me getting started with my current approach.

    BTW.: The other way round can be found on wikipedia: under Float to Q. As a third step in addition to the two explained there you need to find the 2’s complement binary representation of the nearest integer ( cut off the sign and remember it, find the unsigned integer binary representation, and fill it with zeros on the left until it has the total length of m + n and if the sign was negative invert it and add a binary 1, see or wikipedia’s 2’s complement text.


    • Shawn Says:

      Hi Christian. Thank you for your comments. I have a couple of things to add. First, when negating a 2’s complement binary number, an easy way to do this is by first inverting all the bits, and then adding one. Subtracting one and then inverting the bits also works, but is slightly harder. Second, the special case where negating the largest negative integer gives the same number is a common issue in using 2’s complement. Some processors have hardware that will output the largest positive number when negating the largest possible negative number.

  4. Aravind Says:

    Hi Shawn,
    Can you please let me know how do I convert from Q1.15 to Q4.28? Your reply will be highly helpful.


    • Shawn Says:

      Hi Aravind,

      To convert from Q1.15 to Q4.28 you need to shift the value left by 13 places. So if A is your Q1.15 number and B is your Q4.28 number then B = A << 13. You could also use multiplication if that is more convenient: B = A * 8192. Is your Q4.28 number signed? If so, it must be stored in a register that is at least 33 bits wide.

  5. Hari Says:

    Hi Shawn,

    I was trying to convert 32767 (1.15 -> 0.999969482421875)
    to 4.28 and was checking the results i did not get the same results can you clarify the operation.

    ( 32767 x 16384 )/ 268435456 (which is 2^28)

    Result = 1.99993896484375

    But if i use 2^13 which is (28-15) and repeat the steps

    ( 32767 x 8192 ) / 268435456 = 0.999969482421875

    Can you confirm

    • Shawn Says:

      Oops! I obviously meant 13 instead of 14 in the reply to Aravind above. So multiplying by 8192 is the correct thing to do. Thanks for pointing that out.

  6. Nicholas Says:

    Hi Shawn, great post! It helped me a lot for understanding how fixed-point numbers work! 🙂 One thing is still unclear…. You said that the minimum value for Q15 is -1; how is that so? The closest number I could achieve is 1111 1111 1111 1111, which is really close to -1, but not exactly.

    • Shawn Says:

      Hi Nicholas. Thanks for visiting the site. The binary number 1111 1111 1111 1111 is -1 in integer format if you are using two’s complement. It’s Q.15 value would be -0.000030517578125, which is the Q.15 negative number closest to zero. The Q.15 number -1 is 1000 0000 0000 0000 in binary, corresponding to the integer -32768. The number -32768 is the most negative (minimum value) that can be representing using 16 bits with two’s complement. I hope this clears things up for you.

      • Nicholas Says:

        Yes that clears things up. By any chance, could you explain how to calculate the negative Q.15 values similar to how you calculated positive Q.15 values? In the post, you said that b = a/(2^15) is only for positive; so how do you determine the negative values?

      • Shawn Says:

        For a negative number, it is still true that b = a/(2^15). However, expressing the number as a sum of powers of two for each bit in the number does not work.

        If for example you wanted to calculate an integer to represent -0.25 in Q.15, you would do the following:

        -0.25 * 2^15 = -0.25 * 32768 = -8192

        Going the other way, if you had a negative integer value, for example -256 then, you could find what number that represents in Q.15 by doing the following:

        -256/32768 = -0.0078125

        I hope that helps.

      • Nicholas Says:

        Indeed that does help! I appreciate all of your assistance greatly! 🙂
        By the way, I believe the sum of powers looks like this for negative numbers (following the Q.15 format):

        The value a= is the binary representation of an integer.

        If s, the sign bit, is asserted, then the following computes the Q.15 number:

        b = -1 + a14×2^(-1) + a13×2^(-2) + a12×2^(-3) + …. + a0×x^(-15)

  7. vinaydivakar Says:

    Hello Shawn,

    Thank you for this wonderful blog.

    Between, any Idea on how to specify time in terms of unsigned binary fixed point integer in units of the sampling period?.

    • Shawn Says:

      I’m glad you like the blog. If you are specifying time in units of the sampling period, then why is there a need to use fixed point? Is it because you are doing some arithmetic on the times? If this is the case, you need to figure out how much precision you require after the decimal point (or really the binary point). That is, do you want numbers precise to 1 decimal place, or 2, or more? (e.g. 4.5 or 4.45 or 4.445) Once you have decided that, you need to figure out how many bits to use for the fractional part. For example, if it were three decimal places, you would need enough bits so that the lsb has a significance of <= 0.001. To get the approximate value, calculate log10(0.001)/log10(2) = -9.9; so 10 fractional bits are okay. I hope that explanation isn't too complicated…

      • Yes, Its because I am working on developing a fixed point resampling algorithm. I get the part of fixing the precision width after the binary point. The problem is C++ does not have any fixed point data types and so I was wondering how to define this binary fixed point number in terms of time in C++. For example, If my sampling period Ts = 1ms. Then what is the relationship between the binary point and this time? For more clarity : –

        Source Reference – Julius 0, Smith* and Phil Gossett**, 1984 “A Flexible Sampling Rate Conversion Method”, IEEE.

        The above is the source of the paper. You can check it out sometime when you are free. One of the best and highly optimized resampling techniques I have ever come across & even today some of the top companies are still using it. because of its specialty to to sample signals at arbitrary times.

  8. Shawn Says:

    In C++ you can use an ‘int’ type. Fixed point addition and subtraction are just like with regular integers. Multiplication and division cause a change of the binary point as described in other pages on this website. In C++ you could also define your own fixed point type (or types, such as q15, q30, q31, etc.), and define operators for +,-,* and /.

  9. rajesh kumar Says:

    Hi Shawn,

    i am new to DSP, can you please let me know given a fixed point number (hex) how to identify it is in which Q format?

    thanks in advance.

    • Shawn Says:

      Hi Rajesh,

      Given a fixed point number, usually there is no way to identify what Q format it is in, unless you have some further information. For example, if you know the number is to represent the rational number 0.5, then you could check to see if x/0.5 is a power of two. For example let’s say the (integer) number x is 2048. Then x/0.5 is equal to 4096. The log2(4096) = log10(4096)/log10(2) = 12, so the number would be Q.12.

      If you are looking at a standard DSP library, for example the CMSIS library for ARM Cortex M CPU’s, you will often see types defined in the code (such as q15_t) which can give you a clue.

      If you have written the code yourself, than you are the one deciding what the Q format is, and you need to figure out how the Q format changes after addition, multiplication, etc.

      Does this answer your question? If not, can you be more specific about what you are asking?


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: