Posted tagged ‘Q15’

Fixed Point Division of two Q15 Numbers

September 20, 2010

Fixed point division is quite a bit more complicated than fixed point multiplication, and usually takes a lot more cycles than performing a multiplication. When dividing by a known value (a constant), it is usually better to multiply by the reciprocal than to do a division. And when dividing a fixed point number by an integer that is a power of two, a right shift can be used to implement a division. For example, to divide by 16, just shift your dividend right by 4 bits.

But there are cases where it is necessary to do a division by a calculated value. The easiest way to picture how the division should proceed is to think of the inverse of multiplying two Q15 numbers. The multiplication of two Q15 numbers produces a Q30 product. It then makes sense that a Q30 number divided by a Q15 number produces a Q15 result.

Let a = \hat{a}/{2}^{30}, b = \hat{b}/{2}^{15}

then a/b = \frac{\hat{a}/{2}^{30}}{\hat{b}/{2}^{15}}=\left(\hat{a}/\hat{b} \right)/{2}^{15}

So one procedure for finding a/b has the following steps:
1.convert the dividend a from Q15 to Q30 by shifting left by 15
2.divide the Q30 format number a by the Q15 format number b to get result in Q15

Let’s try an example. Let a = 0.03125 and b = 0.25, then c = a/b = 0.125. The Q15 numbers as hexadecimal integers will be a = 0x0400 and b = 0x2000. In step 1, a becomes 0x02000000 in Q30. In step 2, divide 0x02000000 by 0x2000 to get c = 0x1000 which is 4096 in decimal. As a check, find 4096/32768 = 0.125, the expected result.

In C language code, fixed point Q15 division can be coded as follows:

int16 a;
int16 b;
int16 c;

if ( abs(b) > abs(a) ) {
    c = (int16)(((int32)a << 15) / ((int32)b));

The casting is very ugly, but this works. Note that I have restricted the result of the division to be less than one. Removing the restriction that the magnitude of the divisor is larger than that of the dividend has an effect on the number of bits required for the result. To see this, try dividing the largest positive Q15 number by the smallest positive Q15 number, which results in a large number with 15 digits in front of the fractional point:

(0x7FFF/0x8000) / (1/0x8000) = (0x7FFF * 0x8000 ) / 0x8000 = 0x3FFF8000 / 0x8000

The result (0x3FFF8000) requires 30 bits, and it will have 15 bits to the left of the fractional point and 15 bits to the right. That is, the most significant bit has a weighting of {2}^{14} and the least significant bit has a weight of {2}^{-15} . In my work, I have almost always used Q15 division where the magnitude of the divisor is smaller than that of the dividend.

Along with looking ugly, the C code above for division is often inefficient. The C compiler will likely implement this as a division between two 32 bit numbers. When implementing division on a fixed point DSP chip, I have usually used assembly language coding and made use of a special purpose division instruction.

For example, the Texas Instruments TMS320C55x processor has the “subc” instruction or “conditional subtract.” To perform the type of division I have just described do the following:

1.make the dividend and divisor both positive and note original sign of each
2.load the dividend shifted left by 15 into an accumulator register
3.execute the conditional subtract of the divisor 16 times the result (in the lower 16 bits of the accumulator )
5.determine the correct sign for the result, and negate it if necessary

Note that a short cut is to load the dividend shifted left by 16 in the first step, and then execute the subc instruction 15 times. This works because it is known that the result will be positive.

Fixed point division is not difficult, but it can take a lot of cycles, and one needs to recognize the need to consider the range of the resulting output.

The C code and output below shows a couple of examples from this tutorial.

Example code:

#include <stdio.h>

typedef short int16;
typedef int int32;

int main( void )
    int16 a;
    int16 b;
    int16 c;
    int32 d;

    // example 1: magnitude of divisor is greater than magnitude of dividend
    printf("example 1: magnitude of divisor is greater than magnitude of dividend\n");

    a = 0x0400;
    b = 0x2000;

    if ( abs(b) > abs(a) ) {
        c = (int16)(((int32)a << 15) / ((int32)b));
    } else {
        printf("division error\n");

    printf("a = %d, b = %d, c = %d\n",a,b,c);
    printf("a = 0x%x, b = 0x%x, c = 0x%x\n",a,b,c);

    // example 2: no restrictions on divisor other than not 0
    printf("\nexample 2: no restrictions on divisor other than not 0\n");

    a = 0x7fff;
    b = 0x0001;

    if ( b != 0 ) {
        d = ((int32)a << 15) / ((int32)b);
    } else {
        printf("division by zero error\n");

    printf("a = %d, b = %d, d = %d\n",a,b,d);
    printf("a = 0x%x, b = 0x%x, d = 0x%x\n",a,b,d);

   return 0;

Code output:

example 1: magnitude of divisor is greater than magnitude of dividend
a = 1024, b = 8192, c = 4096
a = 0x400, b = 0x2000, c = 0x1000

example 2: no restrictions on divisor other than not 0
a = 32767, b = 1, d = 1073709056
a = 0x7fff, b = 0x1, d = 0x3fff8000

Implementation of FIR Filtering in C (Part 2)

October 8, 2009

In Part 1 I showed how to code a FIR filter in C using floating point. In this lesson I will show how to do the same thing using fixed point operations. The code example below will demonstrate the application of fixed point multiplication, rounding and saturation. The code has definitions for the FIR filtering function, followed by an example test program.

The following link is a PDF version of the code example:


And here is the code example:

#include <stdio.h>
#include <stdint.h>

//  Filter Code Definitions

// maximum number of inputs that can be handled
// in one function call
#define MAX_INPUT_LEN   80
// maximum length of filter than can be handled
#define MAX_FLT_LEN     63
// buffer to hold all of the input samples

// array to hold input samples
int16_t insamp[ BUFFER_LEN ];

// FIR init
void firFixedInit( void )
    memset( insamp, 0, sizeof( insamp ) );

// the FIR filter function
void firFixed( int16_t *coeffs, int16_t *input, int16_t *output,
       int length, int filterLength )
    int32_t acc;     // accumulator for MACs
    int16_t *coeffp; // pointer to coefficients
    int16_t *inputp; // pointer to input samples
    int n;
    int k;

    // put the new samples at the high end of the buffer
    memcpy( &insamp[filterLength - 1], input,
            length * sizeof(int16_t) );

    // apply the filter to each input sample
    for ( n = 0; n < length; n++ ) {
        // calculate output n
        coeffp = coeffs;
        inputp = &insamp[filterLength - 1 + n];
        // load rounding constant
        acc = 1 << 14;
        // perform the multiply-accumulate
        for ( k = 0; k < filterLength; k++ ) {
            acc += (int32_t)(*coeffp++) * (int32_t)(*inputp--);
        // saturate the result
        if ( acc > 0x3fffffff ) {
            acc = 0x3fffffff;
        } else if ( acc < -0x40000000 ) {
            acc = -0x40000000;
        // convert from Q30 to Q15
        output[n] = (int16_t)(acc >> 15);

    // shift input samples back in time for next time
    memmove( &insamp[0], &insamp[length],
            (filterLength - 1) * sizeof(int16_t) );


//  Test program

// bandpass filter centred around 1000 Hz
// sampling rate = 8000 Hz
// gain at 1000 Hz is about 1.13

#define FILTER_LEN  63
int16_t coeffs[ FILTER_LEN ] =
 -1468, 1058,   594,   287,    186,  284,   485,   613,
   495,   90,  -435,  -762,   -615,   21,   821,  1269,
   982,    9, -1132, -1721,  -1296,    1,  1445,  2136,
  1570,    0, -1666, -2413,  -1735,   -2,  1770,  2512,
  1770,   -2, -1735, -2413,  -1666,    0,  1570,  2136,
  1445,    1, -1296, -1721,  -1132,    9,   982,  1269,
   821,   21,  -615,  -762,   -435,   90,   495,   613,
   485,  284,   186,   287,    594, 1058, -1468

// number of samples to read per loop
#define SAMPLES   80

int main( void )
    int size;
    int16_t input[SAMPLES];
    int16_t output[SAMPLES];
    FILE   *in_fid;
    FILE   *out_fid;

    // open the input waveform file
    in_fid = fopen( "input.pcm", "rb" );
    if ( in_fid == 0 ) {
        printf("couldn't open input.pcm");

    // open the output waveform file
    out_fid = fopen( "outputFixed.pcm", "wb" );
    if ( out_fid == 0 ) {
        printf("couldn't open outputFixed.pcm");

    // initialize the filter

    // process all of the samples
    do {
        // read samples from file
        size = fread( input, sizeof(int16_t), SAMPLES, in_fid );
        // perform the filtering
        firFixed( coeffs, input, output, size, FILTER_LEN );
        // write samples to file
        fwrite( output, sizeof(int16_t), size, out_fid );
    } while ( size != 0 );

    fclose( in_fid );
    fclose( out_fid );

    return 0;

The first thing to notice is that the definitions for the input sample storage and handling are nearly the same as for the code in Part 1. The only difference is that the storage type is a 16 bit integer instead of double precision floating point.

The next difference is the inclusion of rounding in the calculation of each output. Rounding is used when converting the calculated result from a Q30 format number to Q15. Notice that I have loaded the rounding constant into the accumulator value (acc) at the beginning of the loop rather than adding it at the end. This is a small optimization commonly seen in code for FIR filters. If you are coding in assembly language, and your chip has a rounding instruction, it may be better to do the rounding at the end (depending on what the instruction actually does).

The multiplication itself is now an integer multiplication of two 16 bit values, each of which is a Q15 number. The accumulator variable is 32 bits, and holds a Q1.30 format number. There is one bit for the sign, one integer bit, and thirty fractional bits. Notice that I have cast each multiplier to a 32 bit value. Failure to do so should result in a 16 bit product and produce wrong results.

Next comes the overflow handling for converting from Q1.30 to Q30. The code checks for values beyond the limits of the largest/smallest Q30 number (no integer bits), and saturates if necessary.

Finally, a right shift by 15 is used to convert the MAC result from Q30 to Q15 and the result is stored to the output array.

The test program is simpler than the one in Part 1 because there is no need to convert the input and output samples to or from floating point. The most important thing to note is the change in the filter coefficient array. To generate these coefficients, I took the floating point coefficients from Part 1, multiplied by 32768, and then rounded to the nearest integer. The coefficients are in Q15 format, and note that none of the original floating point coefficients are close to one. Multiplying by 32768 would cause a problem for any coefficients larger than 32767/32768 or less than -1.

As in Part 1, the test input file should be 16 bit samples at a sampling rate of 8000 Hz. Try using a 1000 Hz tone with noise added to it. The bandpass filter should remove a large portion of the noise.

The filter has greater than unity gain at 1000 Hz and the gain is about 1.13. Admittedly it is not a great filter design, but it tests the saturation code if a full scale 1000 Hz sine wave is used as an input. Try a full scale 1000 Hz tone input with and without the saturation check and see what the difference is. There should be a small amount of distortion with the saturation code present (due to slight flattening of tone) and a large amount of distortion without it (due to overflow).

A fixed point FIR filter is fairly easy to implement. Personally I find the sample storage and movement to be the most difficult part. In Part 3 of this lesson I will demonstrate how to run multiple FIR filters on the same input.

Rounding in Fixed Point Number Conversions

August 19, 2009

When converting from one fixed point representation to another, there is often a right shift operation to eliminate bits. (Or higher order bits are just stored without keeping the lower order bits.) This occurs when converting from a Q31 to a Q15 format number for example, since 16 bits need to be eliminated. Before throwing away the unused bits, sometimes it is desirable to perform a rounding operation first. This can improve the accuracy of results, and can prevent the introduction of a bias during conversion of a signal. Rounding is also an important operation when generating fixed point filter coefficients from floating point values, but that is not the subject of this post.

To illustrate rounding, I will use an example where six different signed Q7.8 numbers are converted to a signed Q15.0 number (a regular 16 bit integer). I will illustrate truncation (throwing away the least significant eight bits) and rounding. Recall that a Q7.8 number has seven integer bits and eight fractional bits. For the example, the six numbers will be 1.25, 1.5, 1.75, -1.25, -1.5 and -1.75.

The first thing to determine is how these numbers will be represented in a 16 bit integer register. Multiplying each by 256 (which is two to the power eight) gives the following result (in hexadecimal):

1.25 = 0x0140

1.5 = 0x0180

1.75 = 0x01C0

-1.25 = 0xFEC0

-1.5 = 0xFE80

-1.75 = 0xFE40

Now if the numbers are truncated, the result is found by shifting right by eight. Here are the results:

truncate(1.25) = 0x0001 = 1

truncate(1.5) = 0x0001 = 1

truncate(1.75) = 0x0001 = 1

truncate(-1.25) = 0xFFFE = -2

truncate(-1.5) = 0xFFFE = -2

truncate(-1.75) = 0xFFFE = -2

For the positive numbers, the result of truncation is that the fractional part is discarded. The negative number results are more interesting. The result is that the fractional part is lost, and the integer part has been reduced by one. If a series of these numbers had a mean of zero before truncation, then the series would have a mean of less than zero after truncation. Rounding is used to avoid this problem of introduced bias and to make results more accurate.

Truncation is not really the correct term for the example above. More accurately, a “floor” operation is being executed. A floor operation returns the greatest integer that is not greater than the operand.

In a common method of rounding, a binary one is added to the most significant bit of the bits that are to be thrown away. And then a truncation is performed. In the current example, we would add 0.5, represented as 128 decimal or 0x0080 in our 16 bit integer word. So the results in our example are as follows:

round(1.25) = (0x0140 + 0x80) >> 8 = 0x0001 = 1

round(1.5) = (0x0180 + 0x80) >> 8 = 0x0002 = 2

round(1.75) = (0x01C0 + 0x80) >> 8 = 0x0002 = 2

round(-1.25) = (0xFEC0 + 0x80) >> 8 = 0xFFFF = -1

round(-1.5) = (0xFE80+ 0x80) >> 8 = 0xFFFF = -1

round(-1.75) = (0xFE40 + 0x80) >> 8 = 0xFFFE = -2

These results are less problematic than using simple truncation, but there is still a bias due to the non-symmetry of the 1.5 and -1.5 cases. The amount of bias depends on the data set. Even if a set of data to be converted contained only positive values, there is still a bias introduced, because all of the values that end in exactly .5 are rounded to the next highest integer. One way to eliminate this bias is to round even and odd values differently (even and odd to the left of the rounding bit position).

For the more common conversion of Q31 to Q15 numbers, the rounding constant is one shifted left by fifteen, or 32768 decimal, or 0x8000 hexadecimal.

Some of the Texas Instrument DSPs have rounding instructions that can be performed on the accumulator register prior to saving a result to memory. For example, the TMS320C55x processor includes the ROUND instruction (full name is “round accumulator content”). The instruction has two different modes. The “biased” mode adds 0x8000 to the 40 bit accumulator register. The “unbiased” mode conditionally adds 0x8000 based on the value of the least significant 17 bits. It is designed to address the bias problems I described above. Wikipedia has a good discussion of rounding and bias errors ( The TMS320C55x is using the “round half to even” method of rounding for the unbiased mode, and “round half up” for the biased mode.

Although it seems simple on the surface, rounding in fixed point conversions has some important effects on the bias of resulting computations.

Fixed Point Multiplication

July 14, 2009

Fixed point addition and subtraction are straightforward. Additions and subtractions are performed using integer operations. For example, if two 16 bit Q15 format numbers are added, the result is a Q15 number. But what about fixed about multiplication? What happens if two Q15 numbers are multiplied?

Let’s try an example. Take 0.5 multiplied by 0.25. In Q15 the number 0.5 is represented (in hexadecimal) as 0x8000 times 0.5 or 0x4000. Similarly, 0.25 is 0x2000. When we multiply these together, the product is 0x08000000. Obviously the result is not a Q15 number since the number of bits required is more than 16. The expected product, 0.125, is 0x1000 in Q15.

To see what is going on, define the following two Q15 numbers a and b:

a = \frac{\hat{a}}{{2}^{15}}

b = \frac{\hat{b}}{{2}^{15}}

where \hat{a} and \hat{b} are the integer representations of our numbers (0x4000 and 0x2000 in our example). The product of a and b is:

c = ab=\frac{\hat{a}\hat{b}}{{2}^{30}}

From the above, it can be seen that the product is a Q30 number. Going back to our example, 0x4000 times 0x2000 is 0x08000000, which is 0.125 times {2}^{30} .

A general rule when multiplying a Qm format number by a Qn format number, is that the product will be a Q(m+n) number. The number of bits required to represent the product is at least (n+m) for unsigned multiplication and (n+m+1) for signed (twos complement) multiplication.

For the more general case of a Qa.b number times a Qc.d number, the product is Q(a+c).(b+d).  The number of bits needed for the result is (a + b + c + d + 1) for signed numbers (and one less for unsigned numbers).

Consider the example of a Q16 unsigned multiplication between the two largest unsigned numbers that can be represented. The largest Q16 number is 65535/65536 = 0.9999847412109375. The product is 0xffff times 0xffff or 0xfffe0001. The result is a Q32 number requiring at least 32 bits. If we divide by {2}^{32} then we get 0.99996948265470564365386962890625, the expected result.

There are a number of things that are done with the product of a multiplication, depending on the application. Some of the commonly seen options are:

  1. Convert the product to a different Q format.

  2. Use the product in the resulting Q format.

  3. Add the product to a running sum in an accumulator register.

  4. Convert the product to a different Q format, then add to a running sum.

Let’s look at some of these options for the case of signed multiplication using Q15 format numbers. For case 1, assume we want to multiply two Q15 numbers and get a Q15 result. The required operation is to take the Q30 product, and shift it right by 15 bits. The result can then be stored in 16 bits. There is also the option of rounding the product before shifting out the lower 15 bits (I may discuss rounding in a future post). Some CPU architectures are better set up to shift the product left by 1, and then store the upper 16 bits. This is almost exactly the same as shifting right by 15 bits and keeping the lower 16 bits.

Multiply-accumulate (MAC) operations are used a lot in many DSP algorithms. Many processors have one or more dedicated accumulator registers for this purpose (often with 32 or 40 bits). For the case of Q15 multiplies, each Q30 product can be summed to the accumulator.

I have seen a lot of code that shifts each product left by 1 when performing the MAC operations. Some DSP chips can do the left shift in hardware using a special mode of the ALU. In this case, the value in the accumulator is in Q31 format. Although very common, this method has a greater chance of overflow problems since each product is effectively two times bigger. I think this method became popular because certain older DSP chip architectures required the storing of the high 16 bits of the accumulator or product register, rather than having a single cycle instruction allowing a shift by 15 bits.

In summary, because multiplication operations are often a chief component of signal processing implementations, it is important to understand how they work. This is especially true for fixed point operations, where one must know the effect of multiplication on the format of the numbers themselves.