# PIPELINED DIVISION OF SIGNED NUMBERS WITH THE USE OF RESIDUE ARITHMETIC FOR SMALL NUMBER RANGE WITH THE PROGRAMMABLE GATE ARRAY 


#### Abstract

In this work an architecture of the pipelined signed residue divider for the small number range is presented. Its operation is based on reciprocal calculation and multiplication by the dividend. The divisor in the signed binary form is used to compute the approximated reciprocal in the residue form by the table look-up. In order to limit the look-up table address an algoritm based on segmentation of the divisor into two segments is used. The approximate reciprocal transformed to residue representation with the proper sign is stored in look-up tables. During operation it is multiplied by the dividend in the residue form and subsequently scaled. The pipelined realization of the divider in the FPGA environment is also shown.


## 1. INTRODUCTION

In the digital signal processing the division is usually performed when the quotient of two signals has to be determined, for example, in computation of the phase shift before arctangent calculation. The residue arithmetic $[1,2,3]$ is a tool that can be used for realization of $D S P$ algorithms due to its decompositional properties with respect to addition, subtraction and especially to multiplication because multiplication in one large integer ring can be replaced by a set of multiplications performed in small integer rings in parallel. The other important feature is the possibility to decompose the complex multiplication of Gaussian numbers in similar manner as that for integers using derivative systems such as the MQRNS (Modified Quadratic Residue Number System) [4]. However, other operations in residue arithmetic such as reverse conversion, sign detection, magnitude comparison, scaling and division are difficult. The division of residue numbers can be carried out by converting them to a weighted system, performing division and converting back to the residue form. However, division of residue numbers partly or fully in residue arithmetic can be more effective. The algorithms

[^0]of residue division belong mainly to a group of subtractive [5, 6, 7] or multiplicative $[8,9,10]$ algorithms. The multiplicative algorithms compute, using the Mixed-Radix Conversion ( $M R C$ ) [1], the reciprocal of the divisor which is subsequently multiplied by the dividend. Also two algorithms were presented [11, 12], where the $M R C$, sign detection, overflow detection are not needed but in the former the conversion of the divisor and dividend to the binary system is necessary. They have better time-hardware complexity, however, they are iterative what makes them not suitable for pipelined processing. The algorithm based on iterative reciprocal computation was given in [13]. In this work an architecture of the pipelined residue divider of signed number is shown. The implementation uses a non-iterative residue multiplicative division algorithm. The approximate reciprocal of the divisor is computed by the look-up with the use of the algorithm from [14] based on segmentation of the divisor in the binary form in two segments that address the look-up tables. In this way the size of look-up tables is reduced. The algorithm has fixed division time. It is assumed that the architecture will use 6-bit look-up tables available in the Xilinx $F P G A$. The algorithm permits to implement the division for signed 12-bit numbers with the maximum division error smaller than 2.

## 2. THE RESIDUE NUMBER SYSTEM(RNS)

The $R N S$ with the base $B=\left\{m_{1}, m_{2}, . ., m_{p}\right\}$, where $m_{j}, j=1,2, . ., p$, are named moduli and the number range $M=\prod_{j=1}^{p} m_{j}$, allows to represent the nonnegative integer $N$ from $[0, M-1]$ by the digit vector $\left.\left.\left\langle\left. N\right|_{m_{1}},\right| N\right|_{m_{2}}, \ldots,|N|_{m_{p}}\right)=\left(n_{1}, n_{2} \ldots, n_{p}\right)$, where $|N|_{m_{j}}$ is the least nonnegative residue from division of $N$ by $m_{j}$, $j=1,2,3, \ldots, p$. This representation is one-to-one correspondence if the moduli are pairwise relatively prime. In such a case there is a unique mapping given by the Chinese Remainder Theorem [1-2]. The main advantage of the $R N S$ is due to the fact that addition, subtraction and multiplication of two $R N S$ numbers can be performed independently on the corresponding pairs of residues. For the numbers with the sign denoted as $X$, if $M$ is even, $X=N$ for $N<M / 2$, and $X=N-M$, if $N \geq M / 2$. If $M$ is odd, $X=N$ for $N<(M-1) / 2$, and $X=N-M$, if $N \geq(M-1) / 2$. As the multiplication of signed numbers is used in the method of division presented below, we shall illustrate it with an example.

Example 1. Multiplication of signed numbers in residue arithmetic.
Let $B=\{32,31,29,27,25,23)$, we have $M=446623200$ and let
$z_{1}=35 \leftrightarrow\{3,4,6,8,10,12\}$
$z_{2}=-70 \rightarrow M-70=446623130 \leftrightarrow\{26,23,17,11,5,22\}$
We want to obtain the product $P=z_{1} \cdot z_{2}$. By performing the multiplications in the individual rings we obtain the residues of $P$
$P=\left(|3 \cdot 26|_{32},|4 \cdot 23|_{31},|6 \cdot 17|_{29},|8 \cdot 11|_{27},|10 \cdot 5|_{25},|12 \cdot 22|_{23}\right)=(14,30,15,7,0,11)$
These residues are the residues of the number $M-2450=46620750$, that represents the product $P$ in the $M$ ring.

## 3. DIVISION ALGORITHM

In the residue division algorithm we have to find an integer $\widetilde{Q}$ that approximates $Q=X / Y$ with the maximum acceptable division error, $\varepsilon_{d i v}^{\max }$. The reciprocal of the divisor has to be determined with such accuracy that after multiplication by dividend $X$, the resulting division error is smaller than the assumed maximum acceptable error. The additional requirement imposed on the algorithm may be the use of small tables for the reciprocal computation. In the algorithm initially $m$-bit divisor $(m \leq 12)$ is decomposed into $m$ - $k$ bit segment and $k$-bit segment with $k$ not exceeding 7 bits. For computation of the divisor by lookup such segmentation allows to use smaller look-up tables than in the case when the look-up table is addressed with the full representation of the divisor. The reciprocal $R$ can be decomposed into two parts in the following manner

$$
\begin{equation*}
R=\frac{1}{Y}=\frac{1}{a+b}=\frac{1}{a}-\frac{b}{a \cdot(a+b)} \tag{1}
\end{equation*}
$$

The transformation of (1) into the form that allows to use small look-up tables was presented in [14] along the reciprocal computation algorithm. In the following a short review is provided. It is seen that the computation of $1 /(a \cdot(a+b))$ requires $m$-bit address, in order to replace it with $\left\lceil\log _{2} a\right\rceil$-bit address we may try to replace $b$ by a suitably chosen constant, $K$ that leads to

$$
\begin{equation*}
\frac{b}{a \cdot(a+b)} \cong \frac{b}{a \cdot(a+K)}, \tag{2}
\end{equation*}
$$

and in effect we obtain the following reciprocal approximation

$$
\begin{equation*}
\widetilde{R}=\frac{1}{a}-\frac{b}{a \cdot(a+K)} \tag{3}
\end{equation*}
$$

where $a=\left\lfloor Y / 2^{k}\right\rfloor$ and $b=|Y|_{2^{k}}$. We see that $1 / a$ can be computed by the look-up using $m-k$ bits. For $a>0$ we replace $b$ by $K$ that approximates $b$ in $\left[0,2^{k}-1\right]$. Remark that for $a=0,1 / b$ can be looked up using $k$-bits.

The reciprocal approximation error resulting from using $K$ instead of $b$ is expressed as [14]

$$
\begin{equation*}
R-\widetilde{R}=\varepsilon(a, b, K)=\frac{b \cdot(K-b)}{a \cdot(a+b) \cdot(a+K)} . \tag{4}
\end{equation*}
$$

As $K$ has to approximate $b$, it should belong to $\left[0, b_{k}\right]$, where $b_{k}=2^{k}-1$ is the end of the interval. It is evident that $\varepsilon(a, b, K)$ is maximal with respect to $a$ when $a=a_{\text {min }}$ and with respect to $b$ when $b=b_{k}$ or for certain $b=b_{\text {max }}$. Using (4) and $a=a_{\text {min }}$, the maximum division error for the maximum dividend, $X_{\text {max }}$ can be written as

$$
\begin{equation*}
\varepsilon_{d i v}^{\max }=\varepsilon\left(a_{\min }, b, K\right) \cdot X_{\max } . \tag{5}
\end{equation*}
$$

The extreme of (4) with respect to $b_{\text {max }}$ is obtained as

$$
\begin{equation*}
b_{\max }=-a+\sqrt{a \cdot(a+K)} \tag{6}
\end{equation*}
$$

Using this $b_{\text {max }}$, we want to equalize the division error, for $b_{k}$ and $b_{\max }$

$$
\begin{equation*}
-\varepsilon\left(a_{\min }, b_{\max }, K\right)=\varepsilon\left(a_{\min }, b_{k}, K\right) \tag{7}
\end{equation*}
$$

(7) using (4) can be written in the following form

$$
\begin{equation*}
-\frac{b_{\max } \cdot\left(b_{\max }-K\right)}{a+b_{\max }}=\frac{b_{k} \cdot\left(b_{k}-K\right)}{a+b_{k}} \tag{8}
\end{equation*}
$$

Inserting (6) into (8), we obtain the equation for $K$ that allows to determine K that provides the fulfilment of (8)

$$
\begin{equation*}
\left(1+\frac{4 \cdot b_{k}}{a_{\min }}+\frac{4 \cdot b_{k}^{2}}{a_{\min }^{2}}\right) \cdot K^{2}+\left(\frac{2 \cdot b_{k}^{2}}{a_{\min }}-\frac{4 \cdot b_{k}^{3}}{a_{\min }^{2}}+4 \cdot b_{k}\right) \cdot K-4 \cdot b_{k}^{2}-4 \cdot \frac{b_{k}^{3}}{a_{\min }}+\frac{b_{k}^{4}}{a_{\min }^{2}}=0 \tag{9}
\end{equation*}
$$

Sample solutions of (9) are given in Example 2.
Example 2. Assume the length of the divisor $Y$ equal to $m=12$ bits and the lengths of $a$ and $b$ equal to 6 bits. We have $a_{\min }=64$ and $b_{k}=63$. The coefficients of the quadratic equation (9) are $A=8.81, B=131.84, C=-27658.0$. Moreover, we have $b_{\text {max }}=21.05$ and the optimum $K=49.07$. In effect we obtain for $X_{\max }=2^{12}-1$ and for $b_{\max }$, the maximum division error equal to $\varepsilon\left(K, b_{\max }\right) \cdot X_{\text {max }}=3.92$ and for $b_{k}$ to $\varepsilon\left(K, b_{k}\right) \cdot X_{\max }=-3.92$.

In order to reduce this error we may increase the length of $a$ to 7 bits and shorten $b$ to 5 bits. We then have $a_{\min }=128$ and $b_{k}=31$. The coefficients of (9) are $A=2.2, B=131.74, C=-4718.6$, and the optimum $K=25.21$ and $b_{\text {max }}=12.03$. In effect we have for $X_{\text {max }}, \varepsilon\left(K, b_{\text {max }}\right) \cdot X_{\text {max }}=0.236$ and $\varepsilon\left(K, b_{k}\right) \cdot X_{\text {max }}=-0.236$.

## 4. HARDWARE REALIZATION

Now we shall consider the realization of the divider with the use of residue arithmetic. Such realization requires the transformation of the approximated reciprocal values to integers. This transformation is done by the multiplication by a constant $K_{s}$ and rounding off the result. After transforming of (3) to integers we get

$$
\begin{equation*}
\left\{\frac{1}{Y} \cdot K_{s}\right\} \cong\left\{\frac{1}{a} \cdot K_{s}\right\}-b \cdot\left\{\frac{1}{a \cdot(a+K)} \cdot K_{s}\right\} \tag{10}
\end{equation*}
$$

where $\left\}\right.$ denotes rounding off to nearest integer. $K_{s}$ in (10) should give the appropriate dynamic number range to represent the both terms and provide for the allowable error value that arises after multiplication of the round-off error of the second term by $b$. The maximum value of this error should not cause the unacceptable division error. The upper bound of this error is reached for $b_{\text {max }}$ and $X_{\text {max }}$ and maximal value of $a, a_{\max c}$ for which the compensation of the reciprocal approximation error is still needed. It is easy to verify that for the considered number range of division of $2^{12}$, we have $a_{\max c}=2^{8}$. The error of the second term of (10) has to fulfill the following condition

$$
\begin{equation*}
b_{\max } \cdot\left(\frac{1}{a_{\max } \cdot\left(a_{\max }+K\right)} \cdot K_{s}-\left\{\frac{1}{a_{\max } \cdot\left(a_{\max }+K\right)} \cdot K_{s}\right\}\right)<0.5 \cdot K_{s} . \tag{11}
\end{equation*}
$$

Representing the second term in (11) as $1 /(a \cdot(a+K))+\varepsilon_{r}$ where $\varepsilon_{r}$ is the rounding error, we obtain the bound on $\varepsilon_{r}$ to limit the error of transformation to integers to 0.5

$$
\begin{equation*}
\varepsilon_{r}<\frac{0.5 \cdot K_{s}}{b_{\max } \cdot X_{\max }}, \tag{12}
\end{equation*}
$$

moreover, $K_{s}$ has to fulfill the condition

$$
\begin{equation*}
K_{s}>2 \cdot a_{\max c} \cdot\left(a_{\max c}+K\right), \tag{13}
\end{equation*}
$$

For example, for $a_{\max \mathrm{c}}=256$ we have $K_{s}>156195.84$, that gives after inserting into (12) $\varepsilon_{r}<0.3072$. We may avoid the round-off error, by assumming $K_{s}$ as the multiple of the $a_{\min } \cdot\left(a_{\min }+K\right)=7236.48$, for example, 159202.56, that slighly extends the error bound. However, there can be additional requirements imposed on $K_{s}$, because certain values may facilitate the design of the scaling circuit that performs scaling after division.

Example 3. Realization of division for three divisor values 127, 191 and 319 with $a=64,128,256$, respectively, and maximum of $b=63$, for which the highest level of error compensation is needed and the error due to round-off of the second term in (10) may reach its maximum.

First we shall we consider $Y_{1}=127$. We have $a=64$ and $b=63$.
$a \cdot(a+K)=64 \cdot(64+49.07)=7236.48$, and we will adopt $K_{s}=159712$. Such choice of $K_{s}=32 \cdot 29 \cdot 23 \cdot 7$ results from the requirements of scaling after division, scaling becomes more simple when the scaling factor is a product of the moduli of the $R N S$ base.

Using (10) we get

$$
\left\{\frac{1}{Y_{1}} \cdot K_{s}\right\}=\left\{\frac{159712}{64}\right\}-63 \cdot\left\{\frac{159712}{7236.48}\right\}=\{2495.5\}-63 \cdot\{22.07\}=1110
$$

We obtain the approximate quotient as

$$
\widetilde{Q}_{1}=\left\{\frac{1}{Y_{1}} \cdot K_{s}\right\} \cdot X_{\max } / K_{s}=\frac{1110 \cdot 4095}{159712}=\{28.46\}=28
$$

whereas $Q_{1}=\frac{X_{\max }}{Y_{1}}=32.34$.
In the second case we shall consider $a$ in the middle of its interval.
$Y_{2}=193$. We have $a=128$ and $b=63$.
Here $a \cdot(a+K)=128 \cdot(128+49.07)=22664.96$. We get

$$
\left\{\frac{1}{Y_{2}} \cdot K_{s}\right\}=\left\{\frac{159712}{128}\right\}-63 \cdot\left\{\frac{159712}{22664.96}\right\}=\{1247.5\}-63 \cdot\{7.04\}=807
$$

We obtain the approximate quotient as

$$
\widetilde{Q}_{2}=\left\{\left\{\frac{1}{Y_{2}} \cdot K_{s}\right\} \cdot X_{\max } / K_{s}\right\}=\left\{\frac{807 \cdot 4095}{159712}\right\}=\{20.69\}=21
$$

whereas $Q_{2}=\frac{X_{\text {max }}}{Y_{2}}=21.39$.
Finally we consider the division for $Y_{3}=319$, where $a=256$, that means that it reaches the end point of interval in which the reciprocal approximation error is compensated. We have $a=256$ and $b=63$. Here
$a \cdot(a+K)=256 \cdot(256+49.07)=78097.92$.
$\left\{\frac{1}{Y_{3}} \cdot K_{s}\right\}=\left\{\frac{159712}{256}\right\}-63 \cdot\left\{\frac{159712}{78097.92}\right\}=\{623.87\}-63 \cdot\{2.04\}=498$
We obtain the approximate quotient as

$$
\widetilde{Q}_{3}=\left\{\frac{1}{Y_{3}} \cdot K_{s}\right\} \cdot X_{\max } / K_{s}=\frac{498 \cdot 4095}{159712}=\{12.76\}=13
$$

whereas $Q=\frac{X_{\text {max }}}{Y_{3}}=12.83$.
We can estimate the required number range by (14).

$$
\begin{equation*}
M=\left(\left\lceil\frac{1}{a_{\min }} \cdot K_{s}\right\rceil-b_{\max } \cdot\left\lfloor\frac{K_{s}}{a_{\max } \cdot\left(a_{\max }+K\right)}\right\rfloor\right) \tag{14}
\end{equation*}
$$

In our case we have $a_{\text {min }}=64, b_{\max }=63, K_{s}=159712$ and $K=49.03$. The dynamic range of the first term in (14) is equal to 2495.5 and of the second term 441. Finally we may estimate the require dynamic range as $M_{D}=\left\lceil\log _{2}(2495.5 \cdot 4095\rceil=\left\lceil\log _{2} 10219072.5\right\rceil=24\right.$ bits.

We see that after scaling the binary size of quotient obtained from this residue channel will not exceed 7 bits.
The $R N S$ base has been chosen as $B=\{32,31,29,23,21\}$ with $M=13894944$. and, given above, $K_{s}=32 \cdot 31 \cdot 23 \cdot 7=159712$.

For the $R N S$ architecture we assume that $-2048 \leq X \leq 2047$, and has the residue representation $X \leftrightarrow\left(x_{1}, x_{2}, x_{3}, x_{4}, x_{5}\right)$, where $x_{j}=|X|_{m_{j}}, j=1,2, \ldots, 5$. and $Y$ is represented in 12-bit signed binary form.

$$
\begin{equation*}
\left|\widetilde{Q} \cdot K_{s}\right|_{m}=\left\lvert\,\left\{\left.\left\{\frac{1}{a} \cdot K_{s}\right\}\right|_{m}-|b|_{m} \cdot\left(\left.\left|\left\{\frac{1}{a \cdot(a+K)} \cdot K_{s}\right\}\right|_{m_{m}}\right|_{m}\right.\right.\right. \tag{15}
\end{equation*}
$$

In Fig. 1 an architecture that implements (15) is depicted.


Fig. 1. The architecture of the residue divider
The dividend $X$ is represented at the input as $\left(|X|_{m_{1}},|X|_{m 2}, \ldots,|X|_{m_{6}}\right)$. The scaling converter scales $Y$ to the range $\left[-2^{11}, 2^{11}-1\right]$ and outputs 12 -bit binary word where the most significant bit is the sign bit, the next 5 bits form operand $a$ and six least significant bits represent operand $b$. For each residue channel the same configuration of components are used. ROM1 mod $m_{i}$ compute $\left\{\frac{\left\{\frac{k_{s}}{a}\right\}}{\left.\right|_{m_{i}}}\right.$, ROM2 $\bmod m_{i}$ compute $\left|\left\{\frac{1}{a \cdot(a+K)} \cdot K_{s}\right\}\right|_{m_{i}}$ and ROM3 mod $m_{i}$ compute $|b|_{m_{i}}$. In the next stage the multiplication is performed $\left(\right.$ MULT1 $\left.\bmod m_{i}\right)$ and in the following stage the subtraction is performed $\left(B A \bmod m_{i}\right)$. In the final stage the obtained residues are scaled by $K_{s}$. If $a=0$ the $R O M 4 \bmod m_{i}$ are applied that compute $\left|\left\{\frac{K_{s}}{b}\right\}\right|_{m_{i}}$ and ROM5 detects the sign. The outputs of these circuits are multiplexed with these obtained from (15). In this simplified divider architecture there is no divisor zero detection.

The architecture has been implemented in the Xilinx environment using the device from the Virtex-6 family. Below the synthesis report is shown. The pipelining rate of 2.74 ns has been attained. It is possible to obtain 1.52 ns that corresponds to 658.610 MHz . The pipelining rate is greater because of reduction of the number of pipeline stages.

Selected Device : 6vcx240tff784-2
Slice Logic Utilization:

Number of slice registers:
Number of slice LUTs:
Number used as logic:
Number used as memory:
Number used as SRL:

443 out of 301440
908 out of 150720
834 out of 150720 74 out of 58400 74

Timing Summary:
Minimum period: 2.747 ns (maximum frequency: 363.980 MHz )
Minimum input arrival time before clock: 0.550 ns
Maximum output required time after clock: 0.659 ns

## 5. CONCLUSIONS

The paper presents the implementation of the pipelined residue divider for 12bit number range in the Xilinx $F P G A$ environment. The divider makes use of the multiplicative division algorithm with the two-term reciprocal approximation. The residue error belongs to $[-3.92,3.92$ ], however for two's complement coding the error is halved. The divider architecture uses 5 -bit moduli so that easy implementation is possible as in this environment 6 -bit $L U T$ s are available. The architecture use neither large memories nor multipliers.

## REFERENCES

[1] Szabo N.S., Tanaka R.I.: Residue Arithmetic and its Applications to Computer Technology, McGraw-Hill, New York, 1967.
[2] Soderstrand M. et al., Residue Number System Arithmetic, Modern Applications in Digital Signal Processing, IEEE Press, NY, 1986.
[3] Omondi A., Premkumar B., Residue Number Systems: Theory and Implementation, London, Imperial College Press, 2007.
[4] Jenkins W.K., Krogmeier J.V.: The design of dual-mode complex signal processors based on quadratic modular number codes, IEEE Trans.on Circuits and Systems, Volume 34, Number 4, pp.354-364, 1987.
[5] Keir, Y.A, Cheney P.W., Tanenbaum M.: Division and overflow detection in residue number systems, IRE Trans. Electron. Comput., Volume EC-11, pp. 501507, 1962.
[6] Kinoshita E., Kosako H., Koyima Y.: General division in symmetric residue number systems, IEEE Trans. on Computers, Volume C-22, pp.134-142, 1973.
[7] Banerji D.K., Cheung T.Y., Ganesan V.: A high speed division method in residue arithmetic, Proc. of 5th IEEE Symp.on Comput. Arithm., pp. 331-342, 1981.
[8] Lin, M. L., Leiss, E., McInnis B.: Division and sign detection algorithms for residue number systems, Comput. Math. Appl. Volume 10, Number4/5, pp. 331-342, 1984.
[9] Chren W.A., Jr.: A new residue number system division algorithm, Comput. Math. Appl., vol.19, Number 7, pp.13-29, 1990.
[10] Lu M, Chiang Jen-Shiun: A novel division algorithm for the residue number system, IEEE Trans. on Comput., Volume C-41, pp.1026-1032, 1992.
[11] Hiasat A.A., Zohdy,H.A.A.: Semi-custom VLSI design and implementation of a new efficient RNS division algorithm, Computer Journal, Volume 42, Number3, pp.232-240, 1999.
[12] Talameh S., Siy P.: Arithmetic division in RNS using Galois field $G F(p)$, Comput. Math. Appl., Volume 39, pp. 227-238, 2000.
[13] Hitz, M.A., Kaltofen, E: Integer division in residue number system, IEEE Trans. on Computers, Volume C-44, pp.983-989, 1995.
[14] Czyzak, M.: Noniterative small range residue division, RADIOELEKTRONIKA 2002, May 14-16, Bratislava, pp.111-114, 2002.


[^0]:    * Gdańsk University of Technology.

