Let me know if there are any problems with formatting in the code (something that always seems to be a problem). Copying the code is easy. If you put your mouse pointer over any of the code listings, a number of symbols will be shown at the top right. One of these will allow the code to be copied to the clipboard.
Using pages instead of regular posts, parts one and two of the FIR filter tutorials now use a better control for displaying the C code, which prevents lines from being chopped off, and makes it easy to copy and paste the code.
I have lots of ideas for new tutorials, but I haven’t had a lot of time to work on them. I have some interesting code for frequency detection that I will hopefully put up soon.
But there are cases where it is necessary to do a division by a calculated value. The easiest way to picture how the division should proceed is to think of the inverse of multiplying two Q15 numbers. The multiplication of two Q15 numbers produces a Q30 product. It then makes sense that a Q30 number divided by a Q15 number produces a Q15 result.
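Let $a = A / 2^{15}$ and $b = B / 2^{15}$, where $A$ and $B$ are the 16 bit integers that store the Q15 values,
then $c = a / b = A / B$, and the integer that stores $c$ in Q15 is $C = c \cdot 2^{15} = (A \cdot 2^{15}) / B$.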
So one procedure for finding a/b has the following steps:
1. convert the dividend a from Q15 to Q30 by shifting left by 15
2. divide the Q30 format number a by the Q15 format number b to get the result in Q15
Let’s try an example. Let a = 0.03125 and b = 0.25, then c = a/b = 0.125. The Q15 numbers as hexadecimal integers will be a = 0x0400 and b = 0x2000. In step 1, a becomes 0x02000000 in Q30. In step 2, divide 0x02000000 by 0x2000 to get c = 0x1000 which is 4096 in decimal. As a check, find 4096/32768 = 0.125, the expected result.
In C language code, fixed point Q15 division can be coded as follows:
int16 a;
int16 b;
int16 c;

if ( abs(b) > abs(a) ) {
    c = (int16)(((int32)a << 15) / ((int32)b));
}
The casting is very ugly, but this works. Note that I have restricted the result of the division to be less than one. Removing the restriction that the magnitude of the divisor is larger than that of the dividend has an effect on the number of bits required for the result. To see this, try dividing the largest positive Q15 number by the smallest positive Q15 number, which results in a large number with 15 digits in front of the fractional point:
(0x7FFF/0x8000) / (1/0x8000) = (0x7FFF * 0x8000 ) / 0x8000 = 0x3FFF8000 / 0x8000
The result (0x3FFF8000) requires 30 bits, and it will have 15 bits to the left of the fractional point and 15 bits to the right. That is, the most significant bit has a weight of $2^{14}$ and the least significant bit has a weight of $2^{-15}$. In my work, I have almost always used Q15 division where the magnitude of the divisor is larger than that of the dividend.
Along with looking ugly, the C code above for division is often inefficient. The C compiler will likely implement this as a division between two 32 bit numbers. When implementing division on a fixed point DSP chip, I have usually used assembly language coding and made use of a special purpose division instruction.
For example, the Texas Instruments TMS320C55x processor has the “subc” instruction or “conditional subtract.” To perform the type of division I have just described do the following:
1. make the dividend and divisor both positive and note the original sign of each
2. load the dividend shifted left by 15 into an accumulator register
3. execute the conditional subtract of the divisor 16 times
4. store the result (in the lower 16 bits of the accumulator)
5. determine the correct sign for the result, and negate it if necessary
Note that a short cut is to load the dividend shifted left by 16 in the first step, and then execute the subc instruction 15 times. This works because it is known that the result will be positive.
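The five steps above can be sketched in plain C. This is only a rough model of a conditional subtract loop (restoring division), not a cycle-exact emulation of the TMS320C55x instruction; the names cond_subtract and q15DivCondSub are made up for this illustration, and the divisor magnitude is assumed to be larger than the dividend magnitude.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One conditional-subtract step (a loose model of a "subc"-style
   instruction): compare the accumulator with the divisor aligned at
   bit 15, then shift in one quotient bit. */
static uint32_t cond_subtract( uint32_t acc, uint32_t divisor )
{
    uint32_t aligned = divisor << 15;

    if ( acc >= aligned ) {
        return ((acc - aligned) << 1) | 1u;
    }
    return acc << 1;
}

/* Q15 division a/b for |a| < |b| (hypothetical helper). */
int16_t q15DivCondSub( int16_t a, int16_t b )
{
    int negative = (a < 0) != (b < 0);   /* step 1: note the signs   */
    uint32_t acc;
    uint32_t divisor;
    int16_t q;
    int i;

    assert( b != 0 && abs(a) < abs(b) ); /* result must be below one */

    divisor = (uint32_t)abs(b);
    acc = (uint32_t)abs(a) << 15;        /* step 2: dividend << 15   */
    for ( i = 0; i < 16; i++ ) {         /* step 3: 16 iterations    */
        acc = cond_subtract( acc, divisor );
    }
    q = (int16_t)(acc & 0xFFFF);         /* step 4: low 16 bits      */
    return negative ? (int16_t)-q : q;   /* step 5: restore the sign */
}
```

Calling q15DivCondSub( 0x0400, 0x2000 ) reproduces the earlier worked example and returns 0x1000.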
Fixed point division is not difficult, but it can take a lot of cycles, and one needs to recognize the need to consider the range of the resulting output.
The C code and output below show a couple of examples from this tutorial.
Example code:
#include <stdio.h>
#include <stdlib.h>   // for abs()

typedef short int16;
typedef int int32;

int main( void )
{
    int16 a;
    int16 b;
    int16 c;
    int32 d;

    // example 1: magnitude of divisor is greater than magnitude of dividend
    printf("example 1: magnitude of divisor is greater than magnitude of dividend\n");
    a = 0x0400;
    b = 0x2000;
    if ( abs(b) > abs(a) ) {
        c = (int16)(((int32)a << 15) / ((int32)b));
    } else {
        printf("division error\n");
    }
    printf("a = %d, b = %d, c = %d\n",a,b,c);
    printf("a = 0x%x, b = 0x%x, c = 0x%x\n",a,b,c);

    // example 2: no restrictions on divisor other than not 0
    printf("\nexample 2: no restrictions on divisor other than not 0\n");
    a = 0x7fff;
    b = 0x0001;
    if ( b != 0 ) {
        d = ((int32)a << 15) / ((int32)b);
    } else {
        printf("division by zero error\n");
    }
    printf("a = %d, b = %d, d = %d\n",a,b,d);
    printf("a = 0x%x, b = 0x%x, d = 0x%x\n",a,b,d);

    return 0;
}
Code output:
example 1: magnitude of divisor is greater than magnitude of dividend
a = 1024, b = 8192, c = 4096
a = 0x400, b = 0x2000, c = 0x1000
example 2: no restrictions on divisor other than not 0
a = 32767, b = 1, d = 1073709056
a = 0x7fff, b = 0x1, d = 0x3fff8000
The code example is shown below:
#include <stdio.h>
#include <stdint.h>
#include <string.h>   // for memset(), memcpy(), memmove()

//////////////////////////////////////////////////////////////
//  Filter Code Definitions
//////////////////////////////////////////////////////////////

// maximum number of inputs that can be handled
// in one function call
#define MAX_INPUT_LEN   80
// maximum length of filter that can be handled
#define MAX_FLT_LEN     63
// buffer to hold all of the input samples
#define BUFFER_LEN      (MAX_FLT_LEN - 1 + MAX_INPUT_LEN)

// array to hold input samples
int16_t insamp[ BUFFER_LEN ];

// FIR init
void firFixedInit( void )
{
    memset( insamp, 0, sizeof( insamp ) );
}

// store new input samples
int16_t *firStoreNewSamples( int16_t *inp, int length )
{
    // put the new samples at the high end of the buffer
    memcpy( &insamp[MAX_FLT_LEN - 1], inp,
            length * sizeof(int16_t) );
    // return the location at which to apply the filtering
    return &insamp[MAX_FLT_LEN - 1];
}

// move processed samples
void firMoveProcSamples( int length )
{
    // shift input samples back in time for next time
    memmove( &insamp[0], &insamp[length],
             (MAX_FLT_LEN - 1) * sizeof(int16_t) );
}

// the FIR filter function
void firFixed( int16_t *coeffs, int16_t *input, int16_t *output,
               int length, int filterLength )
{
    int32_t acc;     // accumulator for MACs
    int16_t *coeffp; // pointer to coefficients
    int16_t *inputp; // pointer to input samples
    int n;
    int k;

    // apply the filter to each input sample
    for ( n = 0; n < length; n++ ) {
        // calculate output n
        coeffp = coeffs;
        inputp = &input[n];
        // load rounding constant
        acc = 1 << 14;
        // perform the multiply-accumulate
        for ( k = 0; k < filterLength; k++ ) {
            acc += (int32_t)(*coeffp++) * (int32_t)(*inputp--);
        }
        // saturate the result
        if ( acc > 0x3fffffff ) {
            acc = 0x3fffffff;
        } else if ( acc < -0x40000000 ) {
            acc = -0x40000000;
        }
        // convert from Q30 to Q15
        output[n] = (int16_t)(acc >> 15);
    }
}

//////////////////////////////////////////////////////////////
//  Test program
//////////////////////////////////////////////////////////////

// bandpass filter centred around 1000 Hz
// sampling rate = 8000 Hz
// gain at 1000 Hz is about 1.13

#define FILTER_LEN  63
int16_t coeffs[ FILTER_LEN ] =
{
 -1468, 1058,   594,   287,    186,  284,   485,   613,
   495,   90,  -435,  -762,   -615,   21,   821,  1269,
   982,    9, -1132, -1721,  -1296,    1,  1445,  2136,
  1570,    0, -1666, -2413,  -1735,   -2,  1770,  2512,
  1770,   -2, -1735, -2413,  -1666,    0,  1570,  2136,
  1445,    1, -1296, -1721,  -1132,    9,   982,  1269,
   821,   21,  -615,  -762,   -435,   90,   495,   613,
   485,  284,   186,   287,    594, 1058, -1468
};

// Moving average (lowpass) filter of length 8
// There is a null in the spectrum at 1000 Hz
#define FILTER_LEN_MA   8
int16_t coeffsMa[ FILTER_LEN_MA ] =
{
    32768/8, 32768/8, 32768/8, 32768/8,
    32768/8, 32768/8, 32768/8, 32768/8
};

// number of samples to read per loop
#define SAMPLES   80

int main( void )
{
    int size;
    int16_t input[SAMPLES];
    int16_t output[SAMPLES];
    int16_t *inp;
    FILE   *in_fid;
    FILE   *out_fid;
    FILE   *out_fid2;

    // open the input waveform file
    in_fid = fopen( "input.pcm", "rb" );
    if ( in_fid == 0 ) {
        printf("couldn't open input.pcm");
        return 1;
    }

    // open the output waveform files
    out_fid = fopen( "outputFixed.pcm", "wb" );
    if ( out_fid == 0 ) {
        printf("couldn't open outputFixed.pcm");
        return 1;
    }
    out_fid2 = fopen( "outputFixedMa.pcm", "wb" );
    if ( out_fid2 == 0 ) {   // check the second output file handle
        printf("couldn't open outputFixedMa.pcm");
        return 1;
    }

    // initialize the filter
    firFixedInit();

    // process all of the samples
    do {
        // read samples from file
        size = fread( input, sizeof(int16_t), SAMPLES, in_fid );
        // store new samples in working array
        inp = firStoreNewSamples( input, size );

        // apply each filter
        firFixed( coeffs, inp, output, size, FILTER_LEN );
        fwrite( output, sizeof(int16_t), size, out_fid );

        firFixed( coeffsMa, inp, output, size, FILTER_LEN_MA );
        fwrite( output, sizeof(int16_t), size, out_fid2 );

        // move processed samples
        firMoveProcSamples( size );
    } while ( size != 0 );

    fclose( in_fid );
    fclose( out_fid );
    fclose( out_fid2 );

    return 0;
}
There are a few differences from the code example of Part 2. First, I have created a function to store the input samples to the input sample array (firStoreNewSamples). This function is called once for every block of input samples that are processed. The calling function passes in a pointer to the new input samples, and the number of new samples to copy. The function returns the address at which to apply the FIR filter.
Second, I have added a function to move the samples after processing a block of samples (firMoveProcSamples). Again, this function is called once per block of samples, not once per FIR filter applied.
The FIR filtering function (firFixed) has the same argument list as in the Part 2 example, but the “input” argument is a bit different in this case. The input pointer passed in should be the address returned from the firStoreNewSamples function, rather than a pointer to the input sample buffer.
The test program shows an example where two different FIR filters are applied to the same input data. First one input file is opened (for input samples) and two output files are opened (one for each filter). In the sample processing loop, a block of up to 80 samples is read and stored into the working array for the filters. Next the 63 tap bandpass filter is applied by calling firFixed, and the block of output samples is written to file. Afterwards, the 8 tap moving average filter is applied, and the output samples are written to a different file. Finally, the sample buffer is shifted to prepare for the next block of input samples.
The code I have shown works for as many filters as you want to implement. Remember to keep track of the maximum filter tap length and the input sample block size, and change the #define statements appropriately. That concludes my tutorial on basic FIR filters.
In Part 1 I showed how to code a FIR filter in C using floating point. In this lesson I will show how to do the same thing using fixed point operations. The code example below will demonstrate the application of fixed point multiplication, rounding and saturation. The code has definitions for the FIR filtering function, followed by an example test program.
And here is the code example:
#include <stdio.h>
#include <stdint.h>
#include <string.h>   // for memset(), memcpy(), memmove()

//////////////////////////////////////////////////////////////
//  Filter Code Definitions
//////////////////////////////////////////////////////////////

// maximum number of inputs that can be handled
// in one function call
#define MAX_INPUT_LEN   80
// maximum length of filter that can be handled
#define MAX_FLT_LEN     63
// buffer to hold all of the input samples
#define BUFFER_LEN      (MAX_FLT_LEN - 1 + MAX_INPUT_LEN)

// array to hold input samples
int16_t insamp[ BUFFER_LEN ];

// FIR init
void firFixedInit( void )
{
    memset( insamp, 0, sizeof( insamp ) );
}

// the FIR filter function
void firFixed( int16_t *coeffs, int16_t *input, int16_t *output,
               int length, int filterLength )
{
    int32_t acc;     // accumulator for MACs
    int16_t *coeffp; // pointer to coefficients
    int16_t *inputp; // pointer to input samples
    int n;
    int k;

    // put the new samples at the high end of the buffer
    memcpy( &insamp[filterLength - 1], input,
            length * sizeof(int16_t) );

    // apply the filter to each input sample
    for ( n = 0; n < length; n++ ) {
        // calculate output n
        coeffp = coeffs;
        inputp = &insamp[filterLength - 1 + n];
        // load rounding constant
        acc = 1 << 14;
        // perform the multiply-accumulate
        for ( k = 0; k < filterLength; k++ ) {
            acc += (int32_t)(*coeffp++) * (int32_t)(*inputp--);
        }
        // saturate the result
        if ( acc > 0x3fffffff ) {
            acc = 0x3fffffff;
        } else if ( acc < -0x40000000 ) {
            acc = -0x40000000;
        }
        // convert from Q30 to Q15
        output[n] = (int16_t)(acc >> 15);
    }

    // shift input samples back in time for next time
    memmove( &insamp[0], &insamp[length],
             (filterLength - 1) * sizeof(int16_t) );
}

//////////////////////////////////////////////////////////////
//  Test program
//////////////////////////////////////////////////////////////

// bandpass filter centred around 1000 Hz
// sampling rate = 8000 Hz
// gain at 1000 Hz is about 1.13

#define FILTER_LEN  63
int16_t coeffs[ FILTER_LEN ] =
{
 -1468, 1058,   594,   287,    186,  284,   485,   613,
   495,   90,  -435,  -762,   -615,   21,   821,  1269,
   982,    9, -1132, -1721,  -1296,    1,  1445,  2136,
  1570,    0, -1666, -2413,  -1735,   -2,  1770,  2512,
  1770,   -2, -1735, -2413,  -1666,    0,  1570,  2136,
  1445,    1, -1296, -1721,  -1132,    9,   982,  1269,
   821,   21,  -615,  -762,   -435,   90,   495,   613,
   485,  284,   186,   287,    594, 1058, -1468
};

// number of samples to read per loop
#define SAMPLES   80

int main( void )
{
    int size;
    int16_t input[SAMPLES];
    int16_t output[SAMPLES];
    FILE   *in_fid;
    FILE   *out_fid;

    // open the input waveform file
    in_fid = fopen( "input.pcm", "rb" );
    if ( in_fid == 0 ) {
        printf("couldn't open input.pcm");
        return 1;
    }

    // open the output waveform file
    out_fid = fopen( "outputFixed.pcm", "wb" );
    if ( out_fid == 0 ) {
        printf("couldn't open outputFixed.pcm");
        return 1;
    }

    // initialize the filter
    firFixedInit();

    // process all of the samples
    do {
        // read samples from file
        size = fread( input, sizeof(int16_t), SAMPLES, in_fid );
        // perform the filtering
        firFixed( coeffs, input, output, size, FILTER_LEN );
        // write samples to file
        fwrite( output, sizeof(int16_t), size, out_fid );
    } while ( size != 0 );

    fclose( in_fid );
    fclose( out_fid );

    return 0;
}
The first thing to notice is that the definitions for the input sample storage and handling are nearly the same as for the code in Part 1. The only difference is that the storage type is a 16 bit integer instead of double precision floating point.
The next difference is the inclusion of rounding in the calculation of each output. Rounding is used when converting the calculated result from a Q30 format number to Q15. Notice that I have loaded the rounding constant into the accumulator value (acc) at the beginning of the loop rather than adding it at the end. This is a small optimization commonly seen in code for FIR filters. If you are coding in assembly language, and your chip has a rounding instruction, it may be better to do the rounding at the end (depending on what the instruction actually does).
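To see why preloading the rounding constant is equivalent to adding it at the end, here is a small sketch with two hypothetical helpers (macRoundFirst and macRoundLast are not part of the tutorial code). For brevity they index both arrays forwards, like a plain dot product, rather than walking the input backwards as the filter does.

```c
#include <stdint.h>

/* Rounding constant for a Q30 to Q15 conversion (half of 1 << 15). */
#define ROUND_Q30_TO_Q15 (1 << 14)

/* Variant 1: preload the accumulator with the rounding constant. */
int32_t macRoundFirst( const int16_t *c, const int16_t *x, int len )
{
    int32_t acc = ROUND_Q30_TO_Q15;
    int k;

    for ( k = 0; k < len; k++ ) {
        acc += (int32_t)c[k] * (int32_t)x[k];
    }
    return acc;
}

/* Variant 2: accumulate first, then add the constant afterwards. */
int32_t macRoundLast( const int16_t *c, const int16_t *x, int len )
{
    int32_t acc = 0;
    int k;

    for ( k = 0; k < len; k++ ) {
        acc += (int32_t)c[k] * (int32_t)x[k];
    }
    return acc + ROUND_Q30_TO_Q15;
}
```

Both variants return the same accumulator value as long as the sum stays within 32 bits; the first one simply saves an addition per output sample.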
The multiplication itself is now an integer multiplication of two 16 bit values, each of which is a Q15 number. The accumulator variable is 32 bits, and holds a Q1.30 format number. There is one bit for the sign, one integer bit, and thirty fractional bits. Notice that I have cast each multiplier to a 32 bit value. Without the casts, a compiler whose int type is 16 bits wide (as on some DSPs and microcontrollers) would compute a 16 bit product and produce wrong results.
Next comes the overflow handling for converting from Q1.30 to Q30. The code checks for values beyond the limits of the largest/smallest Q30 number (no integer bits), and saturates if necessary.
Finally, a right shift by 15 is used to convert the MAC result from Q30 to Q15 and the result is stored to the output array.
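The saturation and conversion steps can be pulled out into a small helper to experiment with. This is a sketch only; q30AccToQ15 is a name invented here, and it assumes that the compiler implements a right shift of a negative value as an arithmetic shift.

```c
#include <stdint.h>

/* Saturate a Q1.30 accumulator to the Q30 range, then drop 15 bits
   to get a Q15 result (hypothetical helper). */
int16_t q30AccToQ15( int32_t acc )
{
    if ( acc > 0x3fffffff ) {
        acc = 0x3fffffff;
    } else if ( acc < -0x40000000 ) {
        acc = -0x40000000;
    }
    /* right shift of a negative value is assumed to be arithmetic */
    return (int16_t)(acc >> 15);
}
```

For example, the product of -1 and -1 in Q15 (0x8000 times 0x8000, plus the rounding constant) exceeds the Q30 range, so the helper returns 32767 rather than wrapping around.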
The test program is simpler than the one in Part 1 because there is no need to convert the input and output samples to or from floating point. The most important thing to note is the change in the filter coefficient array. To generate these coefficients, I took the floating point coefficients from Part 1, multiplied by 32768, and then rounded to the nearest integer. The coefficients are in Q15 format, and note that none of the original floating point coefficients are close to one. Multiplying by 32768 would cause a problem for any coefficients larger than 32767/32768 or less than -1.
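The coefficient conversion described above could be sketched as follows. The helper name doubleToQ15 is hypothetical, and the handling of exact ties (rounding half away from zero) and the clamping are my assumptions rather than a description of how the table was originally generated.

```c
#include <stdint.h>

/* Convert a floating point coefficient to Q15: scale by 32768, round
   to the nearest integer, and clamp to the 16 bit range. */
int16_t doubleToQ15( double x )
{
    double scaled = x * 32768.0;

    /* round half away from zero (an assumption for this sketch) */
    scaled += ( scaled >= 0.0 ) ? 0.5 : -0.5;
    if ( scaled > 32767.0 ) {
        scaled = 32767.0;
    } else if ( scaled < -32768.0 ) {
        scaled = -32768.0;
    }
    return (int16_t)scaled;   /* the cast truncates toward zero */
}
```

With this sketch the first two floating point coefficients from Part 1, -0.0448093 and 0.0322875, map to -1468 and 1058, which matches the fixed point table.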
As in Part 1, the test input file should be 16 bit samples at a sampling rate of 8000 Hz. Try using a 1000 Hz tone with noise added to it. The bandpass filter should remove a large portion of the noise.
The filter has greater than unity gain at 1000 Hz and the gain is about 1.13. Admittedly it is not a great filter design, but it tests the saturation code if a full scale 1000 Hz sine wave is used as an input. Try a full scale 1000 Hz tone input with and without the saturation check and see what the difference is. There should be a small amount of distortion with the saturation code present (due to slight flattening of tone) and a large amount of distortion without it (due to overflow).
A fixed point FIR filter is fairly easy to implement. Personally I find the sample storage and movement to be the most difficult part. In Part 3 of this lesson I will demonstrate how to run multiple FIR filters on the same input.
A FIR filter is one in which the output depends only on the inputs (and not on previous outputs). The process is a discrete-time convolution between the input signal and the impulse response of the filter to apply. I will call this impulse response “the filter coefficients.” The following equivalent equations describe the convolution process:
equation 1: $$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k]$$
equation 2: $$y[n] = \sum_{k=n-(N-1)}^{n} h[n-k]\, x[k]$$
where h is the impulse response, x is the input, y[n] is the output at time n and N is the length of the impulse response (the number of filter coefficients). For each output calculated, there are N multiply-accumulate operations (MACs). There is a choice of going through the filter coefficients forward and the input backwards (equation 1), or through the filter coefficients backwards and the input forwards (equation 2). In the example that follows, I will use the form in equation 1.
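As a quick check that the two orderings give the same output, here is a minimal sketch that computes a single output sample both ways. The function names are hypothetical, x is assumed to be an ordinary array indexed by time, and n is assumed to be at least N - 1 so that no index goes negative.

```c
/* Coefficients forward, input backwards (the form used in the code). */
double firOutputEq1( const double *h, const double *x, int n, int N )
{
    double acc = 0.0;
    int k;

    for ( k = 0; k < N; k++ ) {
        acc += h[k] * x[n - k];
    }
    return acc;
}

/* Coefficients backwards, input forwards. */
double firOutputEq2( const double *h, const double *x, int n, int N )
{
    double acc = 0.0;
    int k;

    for ( k = n - (N - 1); k <= n; k++ ) {
        acc += h[n - k] * x[k];
    }
    return acc;
}
```

With h = {1, 2, 3} and x = {1, 0, 0, 4, 5}, both functions return 13 for n = 4. For general floating point data the two results can differ in the last bits because the terms are summed in a different order.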
In my code example, samples will be stored in a working array, with lower indices being further back in time (at the “low” end). The input samples will be arranged in an array long enough to apply all of the filter coefficients. The length of the array required to process one input sample is N (the number of coefficients in the filter). The length required for M input samples is N – 1 + M.
A function will be defined that filters several input samples at a time. For example, a realtime system for processing telephony quality speech might process eighty samples at a time. This is done to improve the efficiency of the processing, since the overhead of calling the function is eighty times smaller than if it were called for every single sample. The drawback is that additional delay is introduced in the output.
The basic steps for applying a FIR filter are the following:
1. Arrange the new samples at the high end of the input sample buffer.
2. Loop through an outer loop that produces each output sample.
3. Loop through an inner loop that multiplies each filter coefficient by an input sample and adds to a running sum.
4. Shift the previous input samples back in time by the number of new samples that were just processed.
The code example defines an initialization function and a filtering function, and includes a test program. The code is shown in-line below:
#include <stdio.h>
#include <stdint.h>
#include <string.h>   // for memset(), memcpy(), memmove()

//////////////////////////////////////////////////////////////
//  Filter Code Definitions
//////////////////////////////////////////////////////////////

// maximum number of inputs that can be handled
// in one function call
#define MAX_INPUT_LEN   80
// maximum length of filter that can be handled
#define MAX_FLT_LEN     63
// buffer to hold all of the input samples
#define BUFFER_LEN      (MAX_FLT_LEN - 1 + MAX_INPUT_LEN)

// array to hold input samples
double insamp[ BUFFER_LEN ];

// FIR init
void firFloatInit( void )
{
    memset( insamp, 0, sizeof( insamp ) );
}

// the FIR filter function
void firFloat( double *coeffs, double *input, double *output,
               int length, int filterLength )
{
    double acc;     // accumulator for MACs
    double *coeffp; // pointer to coefficients
    double *inputp; // pointer to input samples
    int n;
    int k;

    // put the new samples at the high end of the buffer
    memcpy( &insamp[filterLength - 1], input,
            length * sizeof(double) );

    // apply the filter to each input sample
    for ( n = 0; n < length; n++ ) {
        // calculate output n
        coeffp = coeffs;
        inputp = &insamp[filterLength - 1 + n];
        acc = 0;
        for ( k = 0; k < filterLength; k++ ) {
            acc += (*coeffp++) * (*inputp--);
        }
        output[n] = acc;
    }

    // shift input samples back in time for next time
    memmove( &insamp[0], &insamp[length],
             (filterLength - 1) * sizeof(double) );
}

//////////////////////////////////////////////////////////////
//  Test program
//////////////////////////////////////////////////////////////

// bandpass filter centred around 1000 Hz
// sampling rate = 8000 Hz

#define FILTER_LEN  63
double coeffs[ FILTER_LEN ] =
{
  -0.0448093,  0.0322875,  0.0181163,  0.0087615,
   0.0056797,  0.0086685,  0.0148049,  0.0187190,
   0.0151019,  0.0027594, -0.0132676, -0.0232561,
  -0.0187804,  0.0006382,  0.0250536,  0.0387214,
   0.0299817,  0.0002609, -0.0345546, -0.0525282,
  -0.0395620,  0.0000246,  0.0440998,  0.0651867,
   0.0479110,  0.0000135, -0.0508558, -0.0736313,
  -0.0529380, -0.0000709,  0.0540186,  0.0766746,
   0.0540186, -0.0000709, -0.0529380, -0.0736313,
  -0.0508558,  0.0000135,  0.0479110,  0.0651867,
   0.0440998,  0.0000246, -0.0395620, -0.0525282,
  -0.0345546,  0.0002609,  0.0299817,  0.0387214,
   0.0250536,  0.0006382, -0.0187804, -0.0232561,
  -0.0132676,  0.0027594,  0.0151019,  0.0187190,
   0.0148049,  0.0086685,  0.0056797,  0.0087615,
   0.0181163,  0.0322875, -0.0448093
};

void intToFloat( int16_t *input, double *output, int length )
{
    int i;

    for ( i = 0; i < length; i++ ) {
        output[i] = (double)input[i];
    }
}

void floatToInt( double *input, int16_t *output, int length )
{
    int i;

    for ( i = 0; i < length; i++ ) {
        // saturate to the 16 bit range
        if ( input[i] > 32767.0 ) {
            input[i] = 32767.0;
        } else if ( input[i] < -32768.0 ) {
            input[i] = -32768.0;
        }
        // convert
        output[i] = (int16_t)input[i];
    }
}

// number of samples to read per loop
#define SAMPLES   80

int main( void )
{
    int size;
    int16_t input[SAMPLES];
    int16_t output[SAMPLES];
    double floatInput[SAMPLES];
    double floatOutput[SAMPLES];
    FILE   *in_fid;
    FILE   *out_fid;

    // open the input waveform file
    in_fid = fopen( "input.pcm", "rb" );
    if ( in_fid == 0 ) {
        printf("couldn't open input.pcm");
        return 1;
    }

    // open the output waveform file
    out_fid = fopen( "outputFloat.pcm", "wb" );
    if ( out_fid == 0 ) {
        printf("couldn't open outputFloat.pcm");
        return 1;
    }

    // initialize the filter
    firFloatInit();

    // process all of the samples
    do {
        // read samples from file
        size = fread( input, sizeof(int16_t), SAMPLES, in_fid );
        // convert to doubles
        intToFloat( input, floatInput, size );
        // perform the filtering
        firFloat( coeffs, floatInput, floatOutput, size, FILTER_LEN );
        // convert to ints
        floatToInt( floatOutput, output, size );
        // write samples to file
        fwrite( output, sizeof(int16_t), size, out_fid );
    } while ( size != 0 );

    fclose( in_fid );
    fclose( out_fid );

    return 0;
}
First is the definition of the working array to hold the input samples. I have defined the array length to handle up to 80 input samples per function call, and a filter length up to 63 coefficients. In other words, the filter function that uses this working array will support between 1 and 80 input samples, and a filter length between 1 and 63.
Next is the initialization function, which simply zeroes the entire input sample array. This ensures that all of the inputs that come before the first actual input are set to zero.
Then comes the FIR filtering function itself. The function takes pointers to the filter coefficients, the new input samples, and the output sample array. The “length” argument specifies the number of input samples to process (must be between 0 and 80 inclusive). The “filterLength” argument specifies the number of coefficients in the filter (must be between 1 and 63 inclusive). Note that the same filter coefficients and filterLength should be passed in each time the firFloat function is called, until the entire input is processed.
The first step in the function is the storage of the new samples. Note closely where the samples are placed in the array.
Next comes the outer loop that processes each input sample to produce one output sample. The outer loop sets up pointers to the coefficients and the input samples, and zeroes an accumulator value for the multiply-accumulate operation in the inner loop. It also stores each calculated output. The inner loop simply performs the multiply-accumulate operation.
Finally comes the step to move the input samples back in time. This is probably the most difficult step in the FIR filter and the part that always takes me the longest to work out. Pay close attention to the position of the samples to move, and the number of samples moved. The amount of time shift is equal to the number of input samples just processed. The number of samples to move is one less than the number of coefficients in the filter. Also note the use of the memmove function rather than the memcpy function, since there is a potential for moving overlapping data (whether or not the data overlaps depends on the relative lengths of the input samples and number of coefficients).
That concludes the code example for the FIR filter. In Part 2, I will show how to do the same thing using fixed point operations.
Recently I ran across an ISO specification for extensions to the C programming language to support fixed point types. The types are defined in a header file called stdfix.h. I have attached an early draft of the ISO spec (from 2006) here:
I don’t think the extensions simplify the use of fixed types very much. The programmer still needs to know how many bits are allocated to integer and fractional parts, and how the number and positions of bits may change (during multiplication for example). What the extensions do provide is a way to access the saturation and rounding modes of the processor without writing assembly code. With this level of access, it is possible to write much more efficient C code to handle these operations.
The advantages of C code over assembly are quicker coding and debugging, and more portable code (that is, code that can run on more than one type of processor). However, I noticed that details such as fixed point fractional points and handling of rounding are implementation dependent. So the portability may only be applicable for “similar” processors.
I have never coded anything using the stdfix.h definitions. As far as I can see, the GCC compiler and the Dinkumware libraries are the only tools using these extensions. I’m not sure if or when it will come into popular use, but it’s something to consider if one is coding fixed point math operations in C.
When converting from one fixed point representation to another, there is often a right shift operation to eliminate bits. (Or higher order bits are just stored without keeping the lower order bits.) This occurs when converting from a Q31 to a Q15 format number for example, since 16 bits need to be eliminated. Before throwing away the unused bits, sometimes it is desirable to perform a rounding operation first. This can improve the accuracy of results, and can prevent the introduction of a bias during conversion of a signal. Rounding is also an important operation when generating fixed point filter coefficients from floating point values, but that is not the subject of this post.
To illustrate rounding, I will use an example where six different signed Q7.8 numbers are converted to a signed Q15.0 number (a regular 16 bit integer). I will illustrate truncation (throwing away the least significant eight bits) and rounding. Recall that a Q7.8 number has seven integer bits and eight fractional bits. For the example, the six numbers will be 1.25, 1.5, 1.75, -1.25, -1.5 and -1.75.
The first thing to determine is how these numbers will be represented in a 16 bit integer register. Multiplying each by 256 (which is two to the power eight) gives the following result (in hexadecimal):
1.25 = 0x0140
1.5 = 0x0180
1.75 = 0x01C0
-1.25 = 0xFEC0
-1.5 = 0xFE80
-1.75 = 0xFE40
Now if the numbers are truncated, the result is found by shifting right by eight. Here are the results:
truncate(1.25) = 0x0001 = 1
truncate(1.5) = 0x0001 = 1
truncate(1.75) = 0x0001 = 1
truncate(-1.25) = 0xFFFE = -2
truncate(-1.5) = 0xFFFE = -2
truncate(-1.75) = 0xFFFE = -2
For the positive numbers, the result of truncation is that the fractional part is discarded. The negative number results are more interesting. The result is that the fractional part is lost, and the integer part has been reduced by one. If a series of these numbers had a mean of zero before truncation, then the series would have a mean of less than zero after truncation. Rounding is used to avoid this problem of introduced bias and to make results more accurate.
Truncation is not really the correct term for the example above. More accurately, a “floor” operation is being executed. A floor operation returns the greatest integer that is not greater than the operand.
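The difference between a floor and a true truncation (toward zero) is easy to see in C. In this sketch the helper names are made up, and the right shift of a negative value is assumed to be an arithmetic shift, which is implementation-defined in C but is what common compilers do.

```c
#include <stdint.h>

/* Q7.8 to integer by discarding the eight fractional bits.
   With an arithmetic shift this behaves as a floor operation. */
int16_t q7_8Floor( int16_t x )
{
    return (int16_t)(x >> 8);
}

/* Truncation toward zero, for comparison: C integer division
   always rounds toward zero. */
int16_t q7_8TruncTowardZero( int16_t x )
{
    return (int16_t)(x / 256);
}
```

For -1.25 (the integer -320, or 0xFEC0) the shift gives -2 while the division gives -1; for positive inputs the two agree.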
In a common method of rounding, a binary one is added to the most significant bit of the bits that are to be thrown away. And then a truncation is performed. In the current example, we would add 0.5, represented as 128 decimal or 0x0080 in our 16 bit integer word. So the results in our example are as follows:
round(1.25) = (0x0140 + 0x80) >> 8 = 0x0001 = 1
round(1.5) = (0x0180 + 0x80) >> 8 = 0x0002 = 2
round(1.75) = (0x01C0 + 0x80) >> 8 = 0x0002 = 2
round(-1.25) = (0xFEC0 + 0x80) >> 8 = 0xFFFF = -1
round(-1.5) = (0xFE80 + 0x80) >> 8 = 0xFFFF = -1
round(-1.75) = (0xFE40 + 0x80) >> 8 = 0xFFFE = -2
These results are less problematic than using simple truncation, but there is still a bias due to the non-symmetry of the 1.5 and -1.5 cases. The amount of bias depends on the data set. Even if a set of data to be converted contained only positive values, there is still a bias introduced, because all of the values that end in exactly .5 are rounded to the next highest integer. One way to eliminate this bias is to round even and odd values differently (even and odd to the left of the rounding bit position).
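Both rounding styles can be sketched in a few lines of C for the Q7.8 example. The function names are hypothetical, the second function is one possible way to treat even and odd values differently (round half to even), and an arithmetic right shift of negative values is assumed.

```c
#include <stdint.h>

/* Round half up: add 0.5 (0x80 in Q7.8), then discard the fraction. */
int16_t q7_8RoundHalfUp( int16_t x )
{
    int32_t v = (int32_t)x + 0x80;   /* widen so the add cannot overflow */

    return (int16_t)(v >> 8);
}

/* Round half to even: same as above, except that an exact tie
   (fraction equal to 0x80) is forced to the even neighbour. */
int16_t q7_8RoundHalfEven( int16_t x )
{
    int32_t r = ((int32_t)x + 0x80) >> 8;

    if ( (x & 0xFF) == 0x80 ) {
        r &= ~1;                     /* clear the lowest bit on a tie */
    }
    return (int16_t)r;
}
```

With round half up, 1.5 and -1.5 become 2 and -1; with round half to even they become 2 and -2, which restores the symmetry, and 2.5 becomes 2 instead of 3.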
For the more common conversion of Q31 to Q15 numbers, the rounding constant is one shifted left by fifteen, or 32768 decimal, or 0x8000 hexadecimal.
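A sketch of that Q31 to Q15 conversion in C might look like the following. The name q31ToQ15Round is invented for this example; the 64 bit intermediate and the final clamp are my additions to keep the rounding add from overflowing near +1, and an arithmetic right shift is assumed.

```c
#include <stdint.h>

/* Q31 to Q15 with round half up: add 0x8000, then drop 16 bits. */
int16_t q31ToQ15Round( int32_t x )
{
    int64_t v = (int64_t)x + 0x8000;   /* widen so the add cannot overflow */

    v >>= 16;                          /* arithmetic shift assumed */
    if ( v > 32767 ) {
        v = 32767;                     /* values just below +1 round up to +1 */
    }
    return (int16_t)v;
}
```

For example, 0x00017FFF converts to 1 while 0x00018000 converts to 2, and 0x7FFFFFFF saturates to 0x7FFF.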
Some of the Texas Instruments DSPs have rounding instructions that can be performed on the accumulator register prior to saving a result to memory. For example, the TMS320C55x processor includes the ROUND instruction (full name is “round accumulator content”). The instruction has two different modes. The “biased” mode adds 0x8000 to the 40 bit accumulator register. The “unbiased” mode conditionally adds 0x8000 based on the value of the least significant 17 bits. It is designed to address the bias problems I described above. Wikipedia has a good discussion of rounding and bias errors (http://en.wikipedia.org/wiki/Rounding). The TMS320C55x is using the “round half to even” method of rounding for the unbiased mode, and “round half up” for the biased mode.
Although it seems simple on the surface, rounding in fixed point conversions has some important effects on the bias of resulting computations.
Overflow with twos complement integers occurs when the result of an addition or subtraction is larger than the largest integer that can be represented, or smaller than the smallest integer. In fixed point representation, the largest or smallest value depends on the format of the number. I will assume Q31 in a 32 bit register for any examples that follow. In this case, a CPU with saturation arithmetic would set the result to -1 or (just below) +1 on an overflow, corresponding to the integer values 0x80000000 and 0x7FFFFFFF.
Overflow in addition can only occur when the sign of the two numbers being added is the same. Overflow in subtraction can occur only when a negative number is subtracted from a positive number, or when a positive number is subtracted from a negative number.
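On a CPU without saturation hardware, the same behavior can be approximated in C. The sketch below shows one possible way to do it. The function names are my own. The first function uses a 64 bit intermediate, and the second detects overflow from the signs alone, following the rule that the operands must have the same sign and the sum must have the opposite sign.

```c
#include <stdint.h>

/* Saturating Q31 addition using a 64 bit intermediate sum. */
int32_t q31_add_sat(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;   /* just below +1 */
    if (s < INT32_MIN) return INT32_MIN;   /* -1 */
    return (int32_t)s;
}

/* Returns 1 if a + b would overflow 32 bits, 0 otherwise. The sum is
   formed with unsigned arithmetic so that the wraparound is well defined. */
int q31_add_overflows(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t us = ua + ub;
    /* The top bit is set only if both inputs differ in sign from the sum. */
    return (int)(((ua ^ us) & (ub ^ us)) >> 31);
}
```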
There is one case where negation of a number causes an overflow condition. When the smallest negative number is negated, there is no way to represent the corresponding positive value in twos complement. For example, the value -1 in Q31 is 0x80000000. When this number is negated (flip the bits and add one) the result is again -1. If the saturation mode is set, then the CPU will set the result to 0x7FFFFFFF (just less than +1).
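In C, negating the most negative 32 bit integer is undefined behavior, so the special case has to be tested for before negating. A minimal sketch, with a function name of my own choosing:

```c
#include <stdint.h>

/* Saturating Q31 negation: -(-1) cannot be represented, so return the
   largest positive value (just less than +1) in that one case. */
int32_t q31_neg_sat(int32_t x)
{
    return (x == INT32_MIN) ? INT32_MAX : -x;
}
```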
Overflow can occur when shifting a number left by 1 to n bits. In fixed point computations, left shifting is used to multiply a fixed point value by a power of two, or to change the format of a number (Q15 to Q31 for example). Again, many CPUs have saturation modes to set the output to the minimum or maximum 32 bit integer (depending on whether the original number was positive or negative). Furthermore, a common feature is an instruction that counts the number of leading ones or zeros in a number. This helps the programmer avoid overflow since the number of leading sign bits determines how large a shift can be done without causing overflow.
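The following C sketch shows how a count of redundant sign bits can be used to guard a left shift. Both function names are hypothetical, and the loop is a plain software stand-in for what would be a single instruction on many DSPs.

```c
#include <stdint.h>

/* Count the redundant sign bits of a 32 bit value, that is, the number
   of bits below the sign bit that are equal to it. This is the largest
   left shift that cannot overflow. */
int q31_norm(int32_t x)
{
    uint32_t u = (uint32_t)(x < 0 ? ~x : x);  /* fold negatives onto positives */
    int n = 0;
    while (n < 31 && ((u << (n + 1)) & 0x80000000u) == 0) {
        n++;
    }
    return n;
}

/* Saturating left shift for 0 <= n <= 31. */
int32_t q31_shl_sat(int32_t x, int n)
{
    if (n <= q31_norm(x)) {
        /* Multiply instead of shifting, since left shifting a negative
           signed value is undefined in C. The result is known to fit. */
        return (int32_t)((int64_t)x * ((int64_t)1 << n));
    }
    return (x < 0) ? INT32_MIN : INT32_MAX;
}
```

For example, q31_norm returns 0 for 0x40000000 (0.5 in Q31), so shifting it left by even one bit saturates to 0x7FFFFFFF.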
Overflow will not occur when right shifting a number.
Overflow doesn’t really occur during multiplication if the result register has enough bits (32 bits if two 16 bit numbers are multiplied). But it is partly a matter of interpretation. When multiplying a fixed point value of -1 by -1 (0x8000 by 0x8000 using Q15 numbers), the result is +1. If the result is interpreted as a Q1.30 number (one integer bit and 30 fractional bits) then there is no problem. If the result is to be a Q30 number (no integer bits) then an overflow condition has occurred. And if the number was to be converted to Q31 (by shifting the result left by 1) then an overflow would occur during the left shift. The overall effect would be that -1 times -1 equals -1.
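The -1 times -1 case is easy to reproduce in C. In this sketch (function names are mine) the shift is done on an unsigned value so that the wraparound is well defined. Converting the wrapped value back to a signed integer is implementation defined in older C standards, although it gives the twos complement result on common compilers.

```c
#include <stdint.h>

/* Multiply two Q15 numbers and keep the full 32 bit product, which can
   be read as Q1.30 (one integer bit and 30 fractional bits). */
int32_t q15_mul_q30(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Shift a Q1.30 product left by one to get Q31, with no saturation. */
int32_t q30_to_q31_wrap(int32_t p)
{
    return (int32_t)((uint32_t)p << 1);
}
```

Multiplying 0x8000 by 0x8000 gives 0x40000000, which is +1 when read as Q1.30. After the left shift it becomes 0x80000000, which is -1 in Q31.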
I have used a CPU that handles this special case with saturation hardware. Some CPUs have a multiplication mode that shifts the product left by one bit after a multiply operation. The reason for doing so is to create a Q31 result when two Q15 numbers are multiplied. Then if a Q15 result is desired, it can be found by storing the upper 16 bits of the result register (if the register is only 32 bits). The saturating mode automatically sets the result to 0x7FFFFFFF when the number 0x8000 is multiplied by itself, and the “shift left by one” multiplication mode is enabled.
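A software version of this saturating "shift left by one" multiply might look like the following sketch. The function names are assumptions of mine, and storing the upper 16 bits relies on an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* Q15 times Q15 giving Q31, with saturation. The only product that
   cannot be doubled is 0x40000000, which comes from 0x8000 * 0x8000. */
int32_t q15_mul_q31_sat(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q1.30 product */
    if (p == 0x40000000) {
        return INT32_MAX;                 /* -1 times -1 saturates to just below +1 */
    }
    return p * 2;                         /* same as shifting left by one */
}

/* Store the upper 16 bits of a Q31 value as a (truncated) Q15 result. */
int16_t q31_high_q15(int32_t x)
{
    return (int16_t)(x >> 16);            /* assumes an arithmetic right shift */
}
```

For example, 0.5 times 0.5 (0x4000 by 0x4000) gives 0x20000000 in Q31, and the upper 16 bits are 0x2000, or 0.25 in Q15.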
A very often used operation in DSP algorithms is the “multiply accumulate” or “MAC”, where a series of numbers is multiplied and added to a running sum. I would recommend not using the “left shift by one” mode if possible when doing MACs, since this only increases the chance for overflow. A better technique is to keep the result as Q1.30, and then handle overflow if converting the final result to Q31 or Q15 (or whatever). This is also a good technique to use on CPUs without saturation modes, since the number of overflow checks can be greatly reduced in some cases.
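Here is a rough C sketch of that approach for a sum of products of Q15 numbers. The function name is hypothetical, and the 64 bit accumulator is my own assumption, standing in for the guard bits of a wider DSP accumulator. The products are accumulated without the left shift, and rounding and saturation are applied only once, when the final sum is converted to Q15.

```c
#include <stdint.h>

/* Sum of products of two Q15 arrays, returned as a Q15 value. */
int16_t q15_dot_sat(const int16_t *x, const int16_t *h, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)h[i];  /* 30 fractional bits, no shift */
    }
    acc = (acc + 0x4000) >> 15;  /* round half up, 30 to 15 fractional bits;
                                    assumes an arithmetic right shift */
    if (acc > INT16_MAX) return INT16_MAX;     /* single overflow check at the end */
    if (acc < INT16_MIN) return INT16_MIN;
    return (int16_t)acc;
}
```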
Overflow in division can occur when the result would have more bits than were calculated. For example, if the magnitude of the numerator is several times larger than that of the denominator, then the result must have enough bits to represent numbers larger than one. Overflow can be avoided by carefully considering the range of numbers being operated on, and calculating enough bits for the result. I have not seen a CPU that implements a saturation mode for division.
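Saturation for division can be added in software. The sketch below is one way to do it for Q15 numbers, computing the quotient with 32 bits and then clamping it to the Q15 range. The function name is my own, and the caller is assumed to have already checked that the divisor is not zero.

```c
#include <stdint.h>

/* Q15 divided by Q15 giving Q15, with saturation. b must not be zero. */
int16_t q15_div_sat(int16_t a, int16_t b)
{
    /* Multiply by 32768 rather than shifting left by 15, since left
       shifting a negative signed value is undefined in C. */
    int32_t q = ((int32_t)a * 32768) / (int32_t)b;
    if (q > INT16_MAX) return INT16_MAX;  /* quotient of +1 or more */
    if (q < INT16_MIN) return INT16_MIN;  /* quotient below -1 */
    return (int16_t)q;
}
```

Dividing 0x0400 by 0x2000 (0.03125 by 0.25) gives 0x1000, or 0.125, while dividing 0x7FFF by 0x0001 saturates to 0x7FFF instead of producing a wrapped result.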
Division by 0 is undefined, and not really an overflow case.
Many CPUs include hardware supported handling of overflow using saturation modes. These modes are useful, but it is better to avoid overflow in the first place if possible. This can lead to more accurate results in computations. And when using a CPU without saturation arithmetic, it is best to design the arithmetic operations so that the number of overflow checks is minimized.