### ALGORITHM-BASED LOW-POWER DSP SYSTEM DESIGN: METHODOLOGY AND VERIFICATION

An-Yeu Wu\* K. J. Ray Liu\* Zhongying Zhang Kazuo Nakajima Arun Raghupathy\* Shang-Chieh Liu\* Electrical Engineering Department \* also with Institute for Systems Research University of Maryland College Park, MD 20742

Abstract - We present a low-power design methodology based on the multirate approach for DSP systems. Since the data rate in the resulting multirate implementation is M-times slower (where M is a positive integer) than the original data rate while maintaining the same throughput rate, we can apply this feature to either the low-power implementation, or the speed-up of the DSP systems. This design methodology provides VLSI designers a systematic tool to design low-power DSP systems at the algorithmic/architectural level. The proposed low-power multirate design scheme is verified by the implementation of two FIR VLSI chips with different architectures: One is the normal pipelined design and the other is the multirate FIR design with downsampling rate equal to two. Using the CMOS power dissipation model, we can predict that the multirate FIR chip consumes only 29% power of the normal FIR chip given the same throughput rate. The predicted results will be verified by measuring real power consumption of both chips.

### INTRODUCTION

Due to the limited power-supply capability of current battery technology, the stateof-the-art personal communication services (PCS) devices call for low-power VLSI design at all aspects (algorithmic, architectural, circuit, logic, and device levels) to minimize the total power consumption, while maintaining the system performance such as data throughput rate [1]. It is also imperative to develop low-power design techniques using systematic approaches, rather than heuristic solutions, for the fast prototyping of low-power PCS devices in the fast-changing PCS markets. Recently, several systematic approaches for low-power designs have been proposed [2][3][4]. They basically focus on reducing the occurrence of the switching events in the VLSI circuits. On the other hand, although parallel processing, pipelined design, and the multirate approach, are known as efficient ways to compensate the increased delay under low supply voltage at the architectural/algorithmic level [1][5][6], there is still no systematic way to apply those compensation techniques to general DSP problems

In this paper, we present a systematic approach for the low-power design of a general linear time-invariant (LTI) FIR/IIR system based on the multirate approach. In general, the direct implementation of the system transfer function H(z) (see Fig. 1(a)) has the constraint that the speed of the processing elements must be as fast as the input data rate. As a result, it cannot compensate the speed penalty under low supply voltage [1]. On the other hand, the multirate system in Fig. 1(b)

This work was supported in part by the ONR grant N00014-93-10566 and the NSF grant MIP9457397.

0-7803-2612-1/95 \$4.00 © 1995 IEEE



Figure 1: (a) A LTI FIR/IIR system. (b) The equivalent multirate implementation, where  $f_s$  is the data sampling rate.

requires only low-speed processing elements at one-third of the original clock rate to maintain the same throughput. Therefore, the processing elements can be operated at a lowest possible supply voltage to reduce the power dissipation while the data throughput rate is still maintained. Hence, the multirate implementation provides a direct and efficient way to compensate the speed penalty in the low-power design. In this paper, we present a design methodology for the design of low-power DSP systems. The users can simply follow the design steps to convert a speed-demanding system function into an equivalent multirate transfer function. Note that the data processing rate in the multirate implementation is M-times slower (where M is a positive integer) than the input data rate. Therefore, we can apply this feature to either the low-power implementation of the FIR/IIR system, or to the processing speed-up of the system [5][6]. The proposed multirate approach exploits the inherent properties of DSP systems and reformulates algorithms in such a way that the speed penalty can be compensated at the architectural level.

We also verify the effectiveness of the proposed multirate low-power design by the implementation of two FIR VLSI chips with different architectures. One is the normal pipelined design and the other is the multirate design with downsampling rate equal to two. We implement both chips using the same CAD synthesis tool and the same VLSI technology. The only difference lies in the architectural design. Therefore, the effectiveness of the algorithm-based low-power design can be observed. The selected FIR system is a Quadrature Mirror Filter (QMF) which is widely used in the applications of image compression and subband coding [7]. We consider the architectural design issues for the QMF. We employ power-of-two coefficients to reduce the total chip area [8] and find the most suitable wordlength based on the fixed-point simulation results. These two chip designs have been sent to MOSIS for fabrication. After the chips are back, we will measure the real power consumption of both chips to verify our low-power design.

## MULTIRATE DESIGN METHODOLOGY

In what follows, we present the design methodology to derive the multirate FIR system of Fig. 1. Without loss of generality, we assume that M = 3 in our derivation. The results can be easily extended for an arbitrary M.



Figure 2: Design procedure for the multirate LTI system.

#### The Design Procedure:

Given an LTI FIR system H(z) with order N and decimation factor M, the design procedure is as follows (see Fig. 2).

- **Step(a):** Insert M-1 unit delays after the transfer function H(z).
- Step(b): Replace the delay element with its equivalent "delay chain perfect reconstruction system."
- Step(c): Move H(z) to the right till reaching the decimation operators.
- **Step(d):** Merge the delay elements with the transfer functions. Group the resulting new transfer functions  $(H(z), z^{-1}H(z), \text{ and } z^{-2}H(z))$  with their associated decimation operators.
- **Step(e):** Replace each decimation circuit in the circle shown in Fig.2(d) with its polyphase implementation [9]. Then we have Fig. 2(e), where  $E_i(z)$ , i =





Step(f): Note that the data inputs at points designated by a are the same, an so are those at points b and c. After merging the common data paths is Fig. 2(e), we obtain Fig. 2(f) in which

$$\mathbf{E}(z) \stackrel{\Delta}{=} \begin{bmatrix} E_0(z) & E_1(z) & E_2(z) \\ z^{-1}E_2(z) & E_0(z) & E_1(z) \\ z^{-1}E_1(z) & z^{-1}E_2(z) & E_0(z) \end{bmatrix}.$$
(1)

The general form of  $\mathbf{E}(z)$  with an arbitrary decimation factor M can be shown to be

$$\mathbf{E}(z) = \begin{bmatrix} E_0(z) & E_1(z) & \cdots & E_{M-1}(z) \\ z^{-1}E_{M-1}(z) & E_0(z) & \cdots & E_{M-2}(z) \\ \vdots & \vdots & \ddots & \vdots \\ z^{-1}E_1(z) & z^{-1}E_2(z) & \cdots & E_0(z) \end{bmatrix},$$
(2)

which is also known as the *pseudocirculant matrix* in the context of alias-free QMF filter banks [10].

The design steps described in Fig. 2 provide a systematic way to design a lowpower FIR system. Since each  $E_i(z)$  in Fig. 2(f) represents a subfilter of order N/M, the total hardware complexity to realize the multirate FIR system is MNmultipliers and  $(MN + M^2)$  adders. Basically, we pay a linear increase of hardware overhead in exchange for the advantage of an M-times slower processing speed.

The methodology can also be applied to multirate IIR system design. We first find out the polyphase components  $E'_i(z)$ ,  $i = 0, 1, \ldots, M - 1$ , of the given IIR function H'(z) (see Appendix). After replacing  $E_i(z)$  in Fig. 2 with  $E'_i(z)$ , for  $i = 0, 1, \ldots, M - 1$ , we can use the methodology in Fig. 2 to convert H'(z) into its equivalent multirate transfer function. Note that the complexity of each  $E'_i(z)$  can be as high as that of the original transfer function H'(z). Hence, we may pay up to  $O(M^2)$  hardware complexity for the implementation of the multirate IIR filter.



Figure 3: Diagonalization of the pseudocirculant matrix using the FFT approach.

## **Diagonalization of the Pseudocirculant Matrix**

Although the multirate implementation of Fig. 2(f) can be readily applied to lowpower design, the global communication of this structure is not desirable in the VLSI implementation. Therefore, we want to diagonalize the pseudocirculant matrix of Eq. (2) so as to eliminate global communication in the multirate implementation.

One way to diagonalize  $\mathbf{E}(z)$  of Eq. (2) is to use the DFT approach [11]. For example, the multirate system of Fig. 2(f) can be diagonalized by using 6-point DFT/IDFT networks as depicted in Fig. 3. The DFT approach requires complexnumber operations for the filtering operations of those subfilters  $D_i(z)$ 's. Besides, it still has global communications in the DFT/IDFT networks.

Besides the DFT approach, the diagonalization approaches based on polynomial convolution techniques have been studied in [12][13]. As an example, for the case the decimation factor equal to two, Eq. (2) can be rewritten as

$$\begin{bmatrix} E_0(z) & E_1(z) \\ z^{-1}E_1(z) & E_0(z) \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}}_{\mathbf{B}} \begin{bmatrix} E_0(z) & 0 & 0 \\ 0 & E_0(z) + E_1(z) & 0 \\ 0 & 0 & E_1(z) \end{bmatrix} \underbrace{\begin{bmatrix} 1 & -1 \\ 0 & 1 \\ z^{-1} & -1 \end{bmatrix}}_{\mathbf{A}}$$
(3)

The resulting diagram is depicted in Fig. 4, where the downsampling circuit and the upsampling circuit are used to realize the pre-processing matrix  $\mathbf{A}$  and the post-processing matrix  $\mathbf{B}$ , respectively, of Eq. (3). This diagonalization approach involves only real-number operations to process the decimated sequences. However, as M increases, the derivation becomes complicated and the resulting architecture is highly irregular (as opposed to the DFT approach) [13]. In the verification part, we will use the multirate FIR structure in Fig. 4 for our low-power FIR chip design.

#### Power Estimation for the Multirate FIR Architecture

Before we proceed to our chip design, we consider the power dissipation of the multirate design. For the normal FIR architecture, it requires N multipliers and N adders. For the low-power multirate FIR architecture depicted in Fig. 4, 3N/2 multipliers and 3N/2 adders are required. Since all operators are running at a 2-times slower clock rate,  $V'_{dd}$  can be as low as 3.1V [6]. Provided that the capacitance due to the multipliers is dominant in the circuit and is roughly proportional to the



Figure 4: Multirate FIR architecture with M = 2, where  $f_s$  is the data sampling rate.

number of multipliers and adders, the power consumption of the multirate FIR design can be estimated as

$$(\frac{\frac{3N}{2}}{N}C_{eff})(\frac{3.1V}{5V})^2(\frac{1}{2}f) \approx 0.29P_0, \tag{4}$$

where  $P_0$  denotes the power consumption of the original system. Although the multirate architecture requires about 50% hardware overhead, it consumes only 29% of the power of the original pipelined design. Basically, we trade hardware complexity for low-power consumption.

# MULTIRATE LOW-POWER CHIP DESIGN

We now verify the power saving of the multirate low-power FIR design. The selected FIR design is a Quadrature Mirror Filter (QMF) which is widely used in the image compression and subband coding [7]. The design and implementation issues of these two FIR chips are discussed below.

### Selection of QMF Coefficients and System Wordlength

The implementation of the QMF filter requires a large number of multipliers. In order to save the chip area, we choose the QMF design with power-of-two coefficients [8] for our chip implementation. For the purpose of further lowering the hardware complexity, we modify the QMF design in [8] by truncating some boundary tap coefficients and by dropping some relatively small components in each coefficient. Shown below are the QMF coefficients used in our FIR chip design:

| h(10) |   | h(11) |   | $2^{-2} + 2^{-4} + 2^{-6}$ , | h(5) | = | h(16) | = | $2^{-7} + 2^{-8}$ , |
|-------|---|-------|---|------------------------------|------|---|-------|---|---------------------|
| h(9)  | = | h(12) |   | $2^{-4} + 2^{-5}$ ,          | h(4) | = | h(17) | = | $-2^{-6}-2^{-7}$    |
| h(8)  | = | h(13) | = | $-2^{-4}-2^{-7}$ ,           | h(3) | = | h(18) | = | $-2^{-8}$ ,         |
| h(7)  | = | h(14) | = | $-2^{-5}$ ,                  | h(2) | = | h(19) | = | $2^{-6}$ ,          |
| h(6)  | - | h(15) | = | $2^{-5} + 2^{-7}$ ,          | h(1) | = | h(20) | = | 0,                  |
|       |   |       |   |                              | h(0) | = | h(21) | = | $-2^{-7}$ .         |
|       |   |       |   |                              |      |   |       |   | (5)                 |

To verify the performance of this modified QMF, we carried out simulations by passing the LENA image through the subband coding structure (see [8, Fig.5]). Table 1

| Filter type                | Filter<br>length | PSNR     | Coefficient type | Fixed-point<br>adders |
|----------------------------|------------------|----------|------------------|-----------------------|
| Filter (Type 32D) in [7]   | 32               | 44.82 dB | floating-point   | N/A                   |
| Filter in [8]              | 32               | 38.79 dB | power-of-two     | 84                    |
| Modified Filter of Eq. (5) | 22               | 37.01 dB | power-of-two     | 36                    |

Table 1: SNR results for different QMF's.



Figure 5: SNR results for the modified QMF as a function of system wordlength B.

lists the SNR results. Compared with the original design in [8], our modified design suffers only a little degradation in SNR but has much less hardware complexity.

Next we want to determine the system wordlength used in our chip design. Since the wordlength would directly affect the resulting chip area as well as the total number of switching events in the logic circuits, it is important to determine the minimum wordlength without degrading the SNR performance. We conducted computer simulations by feeding the LENA image into the subband coding structure under fixed-point arithmetic. The results are shown in Fig. 5. We see that the SNR curve saturates around B = 12. Thus, we use the wordlength of 12 bits in our design.

#### Chip Design

Having decided the system specifications of our design, we use PARTHENON [14] to synthesize both the normal and the multirate FIR filters. The resulting chip layouts are shown in Fig. 6. There are four modules in the multirate design. The upper right module realizes the upsampling and downsampling circuits of Fig. 4. The signal data rate is reduced to  $f_s/2$  after this module. The other three modules realize the three N/2-tap FIR filters,  $E_0(z)$ ,  $E_1(z)$ , and  $E_0(z) + E_1(z)$ , of Fig. 4, and the operating frequency of these filters is only  $f_s/2$ . Their output signals are sent back to the up/downsampling module to reconstruct the filtering output y(n) running at  $f_s$ . The chip area of the multirate design is about 50% more than that of the normal design as we expected. Therefore, our estimation of the effective capacitance in Eq. (4) is very accurate.

In order to see the effect of supply voltage on the speed of the FIR design, we conducted SPICE simulation for the critical path of the FIR structure. The simulation result depicted in Fig. 7 shows that the propagation delay is approximately doubled as the supply voltage reduces form 5V to 3V. This is consistent with the



Figure 6: (a) Final layout of the normal FIR filter (dimension =  $4400 \times 6600\lambda^2$ ). (b) Final layout of the multirate FIR filter (dimension =  $6500 \times 6600\lambda^2$ ).

results presented in [1]. Since the delay in the critical path generally determines the maximum clock rate of the chip, we can predict that the performance of the filtering operations will be degraded by 50% under the 3V supply voltage. Nevertheless, the data throughput rate of the multirate FIR will not be affected by such a speed penalty since the slowed-down devices are in the  $f_s/2$  region (see Fig. 4). The I/O data rate will remain at  $f_s$  which is the same as the normal FIR design operated at 5V system.

The results shown in Fig. 6 and 7 have confirmed the parameter values of Eq. (4) that are used to estimate the power consumption of the multirate FIR chip under 3.1V operation. Currently, these two chip designs are being fabricated by MOSIS using  $2\mu$  double metal CMOS technology. Once the chips are back, we will measure and report the real power consumption of each chip to verify the proposed algorithm-based low-power design.

### CONCLUSIONS

In this paper, we have presented an algorithm-based low-power design methodology using the multirate approach for FIR/IIR filtering. To verify the effectiveness of the multirate architecture in the application of low-power design, we have also implemented two designs onto VLSI chips so as to compare the real power savings. From the CMOS power model, it can be shown that the multirate low-power FIR design can save up to 71% power compared with the normal FIR design. The predicted results will be verified once the chips are back. Another attractive application of the multirate design is in very high-speed filtering. If we do not lower down the supply voltage to save chip power, the maximum speed of the multirate design can



Figure 7: Timing analysis for one pipelined stage in the FIR design.

be M-times faster than the normal design.

### APPENDIX

Polyphase decomposition of IIR functions

Given an IIR function

$$H'(z) = \frac{N(z)}{D(z)} = \frac{1 + \sum_{i=1}^{P} p_i z^{-i}}{1 + \sum_{i=1}^{Q} q_i z^{-i}},$$
(6)

its polyphase components can be computed as follows. For each pole of H'(z), we introduce (M-1) extra pole-zero pairs that are equally spaced on a circle with the same radius as that of the original pole. Then, H'(z) can be rewritten as [15]

$$H'(z) = \frac{N(z) \prod_{k=1}^{M-1} D(ze^{j2\pi k/M})}{D(z) \prod_{k=0}^{M-1} D(ze^{j2\pi k/M})} = \frac{N'(z)}{D'(z^M)}.$$
(7)

Let the Type I polyphase representation of N'(z) be  $\sum_{l=0}^{M-1} z^{-l} G'_l(z^M)$ . The polyphase decomposition of H'(z) can be represented as

$$H'(z) = \sum_{l=0}^{M-1} z^{-l} E'_l(z^M)$$
(8)

with  $E'_l(z^M) \stackrel{\triangle}{=} \frac{G'_l(z^M)}{D'(z^M)}$ . The complexity of the IIR polyphase decomposition is higher than that of FIR. Note that the multiplication complexity of  $D'(z^M)$  is Q, and that of N'(z) is (P + (M - 1)Q). Hence, we need a total multiplication complexity of (P + (2M - 1)Q) for implementing all  $E'_l(z^M)$ 's.

# References

 A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," *IEEE J. Solid-State Circuits*, vol. 27, pp. 473–484, April 1992.

- [2] L. Benini, P. Siegel, and G. D. Micheli, "Saving power by synthesizing gated clocks for sequential circuits," *IEEE Design & Test of Computers*, vol. 11, pp. 32-41, Winter 1994.
- [3] C.-L. Su, C.-Y. Tsui, and A. M. Despain, "Saving power in the control path of embedded processors," *IEEE Design & Test of Computers*, vol. 11, pp. 24–30, Winter 1994.
- [4] S. A. Khan and V. K. Madisetti, "System partitioning of MCM for low power," IEEE Design & Test of Computers, vol. 12, pp. 41-52, Spring 1995.
- [5] A.-Y. Wu and K. J. R. Liu, "A low-power and low-complexity DCT/IDCT VLSI architecture based on Backward Chebyshev Recursion," in *Proc. IEEE Int'l Symp. Circuits and Systems*, (London), May 1994, pp. 4.155-4.158.
- [6] A.-Y. Wu and K. J. R. Liu, "Algorithm-based low-power transform coding architectures," in Proc. IEEE Int'l Conf. Acoust. Speech, Signal Processing, (Detroit), May 1995, pp. 3267–3270.
- [7] J. D. Johnston, "A filter family designed for use in quadrature mirror filter banks," Proc. IEEE Int'l Conf. Acoust. Speech, Signal Processing, 1980, pp. 291-294.
- [8] C.-K. Chen and J.-H. Lee, "Design of linear-phase quadrature mirror filters with power-of-two coefficients," *IEEE Trans. Circuits Syst. II*, vol. 41, pp. 445– 456, July 1994.
- [9] P. P. Vaidyanathan, Multirate systems and filter banks. Englewood Cliffs, NJ: Prentice Hall, 1993.
- [10] P. P. Vaidyanathan and S. K. Mitra, "Polyphase networks, block digital filtering, LPTV systems, and alias-free QMF banks: A unified approach based on pseudocirculants," *IEEE Trans. Acoust. Speech, Signal Processing*, vol. 36, pp. 381–391, March 1988.
- [11] H. K. Kwan and M. T. Tsim, "High speed 1-D FIR digital filtering architectures using polynomial convolution," in Proc. IEEE Int'l Conf. Acoust. Speech, Signal Processing, (Dallas, TX), April 1987, pp. 1863–1866.
- [12] Z.-J. Mou and P. Duhamel, "Fast FIR filtering: Algorithm and implementations," Signal Processing, vol. 13, pp. 377-384, 1987.
- [13] Z.-J. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," *IEEE Trans. Signal Processing*, vol. 39, pp. 1322-1332, June 1991.
- [14] Y. Nakamura, K. Oguri, and A. Nagoya, "Synthesis from pure behavioral descriptions," in *High-level VLSI Synthesis* (R. Camposano and W. Wolf, eds.), Kluwer Academic Publishers, 1991, pp. 205–229.
- [15] K. K. Parhi and D. G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters-Part I: Pipelining using scattered look-ahead and decomposition," *IEEE Trans. Acoust. Speech, Signal Processing*, vol. 37, pp. 1099-1117, July 1989.