## Posts tagged ‘rounding’

### Implementation of FIR Filtering in C (Part 2)

October 8, 2009

In Part 1 I showed how to code a FIR filter in C using floating point. In this lesson I will show how to do the same thing using fixed point operations. The code example below will demonstrate the application of fixed point multiplication, rounding and saturation. The code has definitions for the FIR filtering function, followed by an example test program.

And here is the code example:

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>   // memset, memcpy, memmove

//////////////////////////////////////////////////////////////
//  Filter Code Definitions
//////////////////////////////////////////////////////////////

// maximum number of inputs that can be handled
// in one function call
#define MAX_INPUT_LEN   80
// maximum length of filter that can be handled
#define MAX_FLT_LEN     63
// buffer to hold all of the input samples
#define BUFFER_LEN      (MAX_FLT_LEN - 1 + MAX_INPUT_LEN)

// array to hold input samples
int16_t insamp[ BUFFER_LEN ];

// FIR init
void firFixedInit( void )
{
    memset( insamp, 0, sizeof( insamp ) );
}

// the FIR filter function
void firFixed( int16_t *coeffs, int16_t *input, int16_t *output,
               int length, int filterLength )
{
    int32_t acc;     // accumulator for MACs
    int16_t *coeffp; // pointer to coefficients
    int16_t *inputp; // pointer to input samples
    int n;
    int k;

    // put the new samples at the high end of the buffer
    memcpy( &insamp[filterLength - 1], input,
            length * sizeof(int16_t) );

    // apply the filter to each input sample
    for ( n = 0; n < length; n++ ) {
        // calculate output n
        coeffp = coeffs;
        inputp = &insamp[filterLength - 1 + n];
        // load rounding constant
        acc = 1 << 14;
        // perform the multiply-accumulate
        for ( k = 0; k < filterLength; k++ ) {
            acc += (int32_t)(*coeffp++) * (int32_t)(*inputp--);
        }
        // saturate the result
        if ( acc > 0x3fffffff ) {
            acc = 0x3fffffff;
        } else if ( acc < -0x40000000 ) {
            acc = -0x40000000;
        }
        // convert from Q30 to Q15
        output[n] = (int16_t)(acc >> 15);
    }

    // shift input samples back in time for next time
    memmove( &insamp[0], &insamp[length],
             (filterLength - 1) * sizeof(int16_t) );

}

//////////////////////////////////////////////////////////////
//  Test program
//////////////////////////////////////////////////////////////

// bandpass filter centred around 1000 Hz
// sampling rate = 8000 Hz
// gain at 1000 Hz is about 1.13

#define FILTER_LEN  63
int16_t coeffs[ FILTER_LEN ] =
{
    -1468,  1058,   594,   287,   186,   284,   485,   613,
      495,    90,  -435,  -762,  -615,    21,   821,  1269,
      982,     9, -1132, -1721, -1296,     1,  1445,  2136,
     1570,     0, -1666, -2413, -1735,    -2,  1770,  2512,
     1770,    -2, -1735, -2413, -1666,     0,  1570,  2136,
     1445,     1, -1296, -1721, -1132,     9,   982,  1269,
      821,    21,  -615,  -762,  -435,    90,   495,   613,
      485,   284,   186,   287,   594,  1058, -1468
};

// number of samples to read per loop
#define SAMPLES   80

int main( void )
{
    int size;
    int16_t input[SAMPLES];
    int16_t output[SAMPLES];
    FILE   *in_fid;
    FILE   *out_fid;

    // open the input waveform file
    in_fid = fopen( "input.pcm", "rb" );
    if ( in_fid == 0 ) {
        printf("couldn't open input.pcm\n");
        return 1;
    }

    // open the output waveform file
    out_fid = fopen( "outputFixed.pcm", "wb" );
    if ( out_fid == 0 ) {
        printf("couldn't open outputFixed.pcm\n");
        fclose( in_fid );
        return 1;
    }

    // initialize the filter
    firFixedInit();

    // process all of the samples
    do {
        size = fread( input, sizeof(int16_t), SAMPLES, in_fid );
        // perform the filtering
        firFixed( coeffs, input, output, size, FILTER_LEN );
        // write samples to file
        fwrite( output, sizeof(int16_t), size, out_fid );
    } while ( size != 0 );

    fclose( in_fid );
    fclose( out_fid );

    return 0;
}

```

The first thing to notice is that the definitions for the input sample storage and handling are nearly the same as for the code in Part 1. The only difference is that the storage type is a 16 bit integer instead of double precision floating point.

The next difference is the inclusion of rounding in the calculation of each output. Rounding is used when converting the calculated result from a Q30 format number to Q15. Notice that I have loaded the rounding constant into the accumulator value (acc) at the beginning of the loop rather than adding it at the end. This is a small optimization commonly seen in code for FIR filters. If you are coding in assembly language, and your chip has a rounding instruction, it may be better to do the rounding at the end (depending on what the instruction actually does).
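As a rough illustration of why preloading the rounding constant is equivalent to adding it at the end, here is a minimal sketch. The function names `mac_round_at_end` and `mac_round_preloaded` are hypothetical helpers, not part of the listing above, and saturation is left out to keep the comparison short.

```c
#include <stdint.h>

// Hypothetical helper: accumulate Q15 x Q15 products, then add the
// rounding constant just before the shift from Q30 to Q15.
int16_t mac_round_at_end(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int k = 0; k < n; k++) {
        acc += (int32_t)a[k] * (int32_t)b[k];
    }
    acc += 1 << 14;               // half of one Q15 LSB, in Q30 units
    return (int16_t)(acc >> 15);
}

// Hypothetical helper: same sum, but the rounding constant is loaded
// into the accumulator before the loop, saving one addition per output.
int16_t mac_round_preloaded(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 1 << 14;        // rounding constant preloaded
    for (int k = 0; k < n; k++) {
        acc += (int32_t)a[k] * (int32_t)b[k];
    }
    return (int16_t)(acc >> 15);
}
```

Because integer addition is associative, both functions return the same value for any input that does not overflow the 32 bit accumulator.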

The multiplication itself is now an integer multiplication of two 16 bit values, each of which is a Q15 number. The accumulator variable is 32 bits, and holds a Q1.30 format number. There is one bit for the sign, one integer bit, and thirty fractional bits. Notice that I have cast each operand to a 32 bit value before multiplying. On a compiler where int is only 16 bits (as on some DSPs and microcontrollers), omitting the casts can give a 16 bit product and wrong results. Where int is 32 bits the operands are promoted anyway, but the explicit casts keep the code portable.
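To make the Q15 times Q15 arithmetic concrete, the following small sketch multiplies two Q15 values and returns the full Q30 product. The helper name `q15_mul_q30` is an assumption made for illustration.

```c
#include <stdint.h>

// Hypothetical helper: multiply two Q15 values, keeping the full
// 32 bit product. The result has 30 fractional bits (Q1.30).
int32_t q15_mul_q30(int16_t a, int16_t b)
{
    // cast first so the multiply is done in 32 bits on any compiler
    return (int32_t)a * (int32_t)b;
}
```

For example, 0.5 times 0.5 is `q15_mul_q30(16384, 16384)`, which gives 0x10000000, or 0.25 in Q30. The case -1 times -1, `q15_mul_q30(-32768, -32768)`, gives 0x40000000, which is +1.0 and is the reason the accumulator needs one integer bit.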

Next comes the overflow handling for converting from Q1.30 to Q30. The code checks for values beyond the limits of the largest/smallest Q30 number (no integer bits), and saturates if necessary.

Finally, a right shift by 15 is used to convert the MAC result from Q30 to Q15 and the result is stored to the output array.
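The saturate-then-shift step can be pulled out into a small function, sketched below. The name `q30_acc_to_q15` is hypothetical, and the code assumes that right shifting a negative value is an arithmetic shift, which is the usual behaviour but is implementation-defined in C.

```c
#include <stdint.h>

// Hypothetical helper: clamp a Q1.30 accumulator to the Q30 range
// (no integer bits), then drop 15 fractional bits to get Q15.
int16_t q30_acc_to_q15(int32_t acc)
{
    if (acc > 0x3fffffff) {
        acc = 0x3fffffff;         // largest Q30 value
    } else if (acc < -0x40000000) {
        acc = -0x40000000;        // smallest Q30 value
    }
    // assumes arithmetic right shift for negative values
    return (int16_t)(acc >> 15);
}
```

With this helper, an accumulator value of 0x40000000 (+1.0) saturates to 32767, and anything below -0x40000000 saturates to -32768.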

The test program is simpler than the one in Part 1 because there is no need to convert the input and output samples to or from floating point. The most important thing to note is the change in the filter coefficient array. To generate these coefficients, I took the floating point coefficients from Part 1, multiplied by 32768, and then rounded to the nearest integer. The coefficients are in Q15 format. Note that none of the original floating point coefficients are close to one. Multiplying by 32768 would cause a problem for any coefficient larger than 32767/32768 or less than -1, because the result would not fit in a 16 bit integer.
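The coefficient conversion described above might be sketched as follows. The function name `float_to_q15` is hypothetical, and the clamping of out-of-range values is an added safeguard rather than something the original conversion needed. Ties are rounded away from zero here, which is one of several reasonable choices.

```c
#include <stdint.h>

// Hypothetical helper: convert a floating point coefficient to Q15
// by scaling by 32768 and rounding to the nearest integer.
int16_t float_to_q15(double x)
{
    double scaled = x * 32768.0;

    // round to nearest, ties away from zero (the cast truncates)
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;

    // added safeguard: clamp values that do not fit in 16 bits
    if (scaled > 32767.0) {
        return 32767;
    }
    if (scaled < -32768.0) {
        return -32768;
    }
    return (int16_t)scaled;
}
```

A coefficient of 0.25 becomes 8192, while 1.0 cannot be represented and is clamped to 32767.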

As in Part 1, the test input file should be 16 bit samples at a sampling rate of 8000 Hz. Try using a 1000 Hz tone with noise added to it. The bandpass filter should remove a large portion of the noise.

The filter has greater than unity gain at 1000 Hz and the gain is about 1.13. Admittedly it is not a great filter design, but it tests the saturation code if a full scale 1000 Hz sine wave is used as an input. Try a full scale 1000 Hz tone input with and without the saturation check and see what the difference is. There should be a small amount of distortion with the saturation code present (due to slight flattening of the tone's peaks) and a large amount of distortion without it (due to overflow).
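To see why the overflow case sounds so much worse, here is a small sketch of the Q30 to Q15 conversion with the saturation check removed. The helper name `q30_to_q15_no_saturation` is hypothetical. The narrowing cast is implementation-defined in C; the behaviour described assumes the common case where only the low 16 bits are kept.

```c
#include <stdint.h>

// Hypothetical helper: Q30 accumulator to Q15 WITHOUT saturation.
// Values at or above +1.0 wrap around to large negative numbers
// (assuming the cast keeps the low 16 bits, as on typical compilers).
int16_t q30_to_q15_no_saturation(int32_t acc)
{
    return (int16_t)(acc >> 15);
}
```

An accumulator value of 0x48000000 represents 1.125. Without saturation it converts to -28672, or about -0.875, so a peak that should be slightly too large becomes a large value of the opposite sign. With the saturation check, the same input gives 32767.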

A fixed point FIR filter is fairly easy to implement. Personally I find the sample storage and movement to be the most difficult part. In Part 3 of this lesson I will demonstrate how to run multiple FIR filters on the same input.

### Rounding in Fixed Point Number Conversions

August 19, 2009

When converting from one fixed point representation to another, there is often a right shift operation to eliminate bits. (Or higher order bits are just stored without keeping the lower order bits.) This occurs when converting from a Q31 to a Q15 format number for example, since 16 bits need to be eliminated. Before throwing away the unused bits, sometimes it is desirable to perform a rounding operation first. This can improve the accuracy of results, and can prevent the introduction of a bias during conversion of a signal. Rounding is also an important operation when generating fixed point filter coefficients from floating point values, but that is not the subject of this post.

To illustrate rounding, I will use an example where six different signed Q7.8 numbers are converted to a signed Q15.0 number (a regular 16 bit integer). I will illustrate truncation (throwing away the least significant eight bits) and rounding. Recall that a Q7.8 number has seven integer bits and eight fractional bits. For the example, the six numbers will be 1.25, 1.5, 1.75, -1.25, -1.5 and -1.75.

The first thing to determine is how these numbers will be represented in a 16 bit integer register. Multiplying each by 256 (which is two to the power eight) gives the following result (in hexadecimal):

1.25 = 0x0140

1.5 = 0x0180

1.75 = 0x01C0

-1.25 = 0xFEC0

-1.5 = 0xFE80

-1.75 = 0xFE40

Now if the numbers are truncated, the result is found by shifting right by eight. Here are the results:

truncate(1.25) = 0x0001 = 1

truncate(1.5) = 0x0001 = 1

truncate(1.75) = 0x0001 = 1

truncate(-1.25) = 0xFFFE = -2

truncate(-1.5) = 0xFFFE = -2

truncate(-1.75) = 0xFFFE = -2

For the positive numbers, the result of truncation is that the fractional part is discarded. The negative number results are more interesting. The result is that the fractional part is lost, and the integer part has been reduced by one. If a series of these numbers had a mean of zero before truncation, then the series would have a mean of less than zero after truncation. Rounding is used to avoid this problem of introduced bias and to make results more accurate.

Truncation is not really the correct term for the example above. More accurately, a “floor” operation is being executed. A floor operation returns the greatest integer that is not greater than the operand.
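The difference between a floor and a true truncation can be shown in C with a short sketch. Both helper names are hypothetical. The shift version assumes an arithmetic right shift for negative values (implementation-defined, but the usual behaviour), while integer division in C99 and later always truncates toward zero.

```c
#include <stdint.h>

// Hypothetical helper: Q7.8 to integer by shifting out the eight
// fractional bits. With an arithmetic shift this is a floor.
int16_t q7_8_floor(int16_t x)
{
    return (int16_t)(x >> 8);
}

// Hypothetical helper: Q7.8 to integer by dividing by 256.
// C division truncates toward zero, so this is a true truncation.
int16_t q7_8_trunc(int16_t x)
{
    return (int16_t)(x / 256);
}
```

For -1.25 (the value -320, or 0xFEC0), `q7_8_floor` returns -2 while `q7_8_trunc` returns -1. For positive inputs the two agree.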

In a common method of rounding, a binary one is added at the position of the most significant of the bits that are to be thrown away, and then a truncation is performed. In the current example, we would add 0.5, represented as 128 decimal or 0x0080 in our 16 bit integer word. So the results in our example are as follows:

round(1.25) = (0x0140 + 0x80) >> 8 = 0x0001 = 1

round(1.5) = (0x0180 + 0x80) >> 8 = 0x0002 = 2

round(1.75) = (0x01C0 + 0x80) >> 8 = 0x0002 = 2

round(-1.25) = (0xFEC0 + 0x80) >> 8 = 0xFFFF = -1

round(-1.5) = (0xFE80 + 0x80) >> 8 = 0xFFFF = -1

round(-1.75) = (0xFE40 + 0x80) >> 8 = 0xFFFE = -2
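The rounding results above can be reproduced with a short function. The name `q7_8_round_half_up` is hypothetical, the 32 bit intermediate is an added precaution against overflow near the top of the 16 bit range, and the shift is assumed to be arithmetic for negative values.

```c
#include <stdint.h>

// Hypothetical helper: Q7.8 to integer using "round half up".
// Adds 0.5 (0x80 in Q7.8) and then shifts out the fractional bits.
int16_t q7_8_round_half_up(int16_t x)
{
    int32_t t = (int32_t)x + 0x80;   // widen so the add cannot overflow
    return (int16_t)(t >> 8);        // assumes arithmetic right shift
}
```

Calling it with 384 (1.5) gives 2, and with -384 (-1.5) gives -1, matching the table.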

These results are less problematic than using simple truncation, but there is still a bias due to the non-symmetry of the 1.5 and -1.5 cases. The amount of bias depends on the data set. Even if a set of data to be converted contained only positive values, there is still a bias introduced, because all of the values that end in exactly .5 are rounded to the next higher integer. One way to eliminate this bias is to round even and odd values differently (even and odd to the left of the rounding bit position).
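One way to round even and odd values differently is "round half to even". The sketch below is one possible implementation for the Q7.8 example, with a hypothetical function name. It is not taken from any particular processor. It assumes an arithmetic right shift for negative values.

```c
#include <stdint.h>

// Hypothetical helper: Q7.8 to integer using "round half to even".
// Values ending in exactly .5 go to the nearest even integer;
// everything else rounds to the nearest integer as usual.
int16_t q7_8_round_half_even(int16_t x)
{
    int32_t t = (int32_t)x + 0x80;   // add 0.5 as in round half up

    // If the low eight bits are now zero, the discarded fraction was
    // exactly 0.5 (a tie). Clear the lowest result bit to force even.
    if ((t & 0xFF) == 0) {
        t &= ~(int32_t)0x100;
    }
    return (int16_t)(t >> 8);        // assumes arithmetic right shift
}
```

With this version, 1.5 rounds to 2 and -1.5 rounds to -2, so the two cases are symmetric. Similarly 2.5 rounds to 2 and 0.5 rounds to 0.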

For the more common conversion of Q31 to Q15 numbers, the rounding constant is one shifted left by fifteen, or 32768 decimal, or 0x8000 hexadecimal.
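A Q31 to Q15 conversion with this rounding constant could look like the sketch below. The function name `q31_to_q15_round` is hypothetical. The 64 bit intermediate and the upper clamp are additions: without them, adding 0x8000 to a value near the top of the Q31 range would overflow. The shift is assumed to be arithmetic for negative values.

```c
#include <stdint.h>

// Hypothetical helper: convert Q31 to Q15 with "round half up".
// Sixteen bits are discarded, so the rounding constant is 1 << 15.
int16_t q31_to_q15_round(int32_t x)
{
    int64_t t = (int64_t)x + 0x8000;  // widen so the add cannot overflow
    t >>= 16;                         // assumes arithmetic right shift
    if (t > 32767) {
        t = 32767;                    // added clamp for values near +1.0
    }
    return (int16_t)t;
}
```

For example, 0x00018000 (exactly 1.5 Q15 LSBs) rounds to 2, while 0x00017FFF rounds to 1.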

Some of the Texas Instruments DSPs have rounding instructions that can be performed on the accumulator register prior to saving a result to memory. For example, the TMS320C55x processor includes the ROUND instruction (full name is “round accumulator content”). The instruction has two different modes. The “biased” mode adds 0x8000 to the 40 bit accumulator register. The “unbiased” mode conditionally adds 0x8000 based on the value of the least significant 17 bits. It is designed to address the bias problems I described above. Wikipedia has a good discussion of rounding and bias errors (http://en.wikipedia.org/wiki/Rounding). The TMS320C55x is using the “round half to even” method of rounding for the unbiased mode, and “round half up” for the biased mode.

Although it seems simple on the surface, rounding in fixed point conversions has some important effects on the bias of resulting computations.