# A High-Speed and Energy-Efficient Compressor-Technique based 4-bit Signed Multiply Accumulate (MAC) Unit

Parthiv Bhau <sup>A</sup>, Vijay Savani <sup>B</sup> (<sup>B</sup> Corresponding Author)

Abstract—This work presents the design of a novel 2-stage pipelined 4-bit signed Multiply and Accumulate (MAC) unit (called 2s4BSMAC), which is low-power, high-speed, and high-throughput. The unit incorporates a unique compressor-based 4-bit Vedic Signed Multiplier (VSM) architecture, a compressor-based 16-bit Ripple Carry Adder (RCA) architecture, and a saturation unit. Without adding buffers at intermediate stages, the proposed compressor-based 4-bit VSM and 16-bit RCA overcome the driving limitation of transmission gates. This paper also introduces a novel low-power, high-speed hybrid Full Adder (FTGHFA-21T) that uses FinFET-Transmission Gate technology and 21 transistors. Operating at the nominal supply voltage, the FTGHFA-21T adder exhibits improvements of 39.02% in maximum propagation delay, 13.4% in average power consumption, and 47.18% in Power Delay Product (PDP) compared to the Transmission Gate Adder (TGA). Process, voltage, and temperature (PVT) analysis is conducted to characterise the circuit's variability under environmental and fabrication variations. The proposed compressor-based 16-bit RCA and the proposed compressor-based 4-bit VSM utilising the FTGHFA-21T adder show 5.41% and 11.92% reductions in PDP, respectively, compared to the same structures utilising the TGA. At the nominal supply voltage, the proposed 2s4BSMAC unit using the suggested adder exhibits a 7.14% enhancement in PDP and a 7.22% increase in throughput compared to the TGA-based design. The simulation is carried out using 18 nm Fin Field Effect Transistor (FinFET) technology in the Cadence Virtuoso tool at the nominal supply voltage of 0.8 V and a nominal temperature of 27°C.

*Index Terms*—Fin Field Effect Transistor (FinFET), Multiply and Accumulate Unit (MAC), Ripple Carry Adder (RCA), Vedic Signed Multiplier (VSM), Transmission Gate (TG), Transmission Gate Adder (TGA), Power-Delay-Product (PDP).

#### I. INTRODUCTION

The Multiply and Accumulate (MAC) unit is a fundamental building block extensively utilised in various complex signal processing applications. Its importance lies in enhancing the efficiency and performance of computational tasks such as Discrete Fourier Transform (DFT), Fast Fourier Transform (FFT), filtering, convolution, Finite Impulse Response (FIR) filters, and more, thereby creating a strong dependency of the Digital Signal Processor (DSP) on this unit. Given the pivotal role of the multiplier and adder within the MAC unit, meticulous design is essential to optimise power and speed in digital systems [1].

Manuscript Submitted: 28-05-2024, Accepted: 05-09-2024

The fundamental components of MAC units are a multiplier and a parallel adder, both of which are built from a 1-bit adder circuit. There are two types of Full Adder designs: dynamic and static [2]. Dynamic Full Adders provide faster performance due to their lower input capacitance but consume more power because of high clock load and increased leakage current, rendering them unsuitable for battery-powered devices [3]. In contrast, static Full Adders are more energy-efficient than dynamic ones [4], which makes them the preferred choice for battery-operated electronics. The six main types of static Full Adders include Conventional CMOS (CCMOS) style, Pass Transistor Logic (PTL) style, Complementary Pass Transistor Logic (CPL) style, Gate Diffusion Input (GDI) style, Transmission Gate (TG) style, and Hybrid style.

The conventional CMOS (CCMOS) style utilises two complementary networks: a PMOS pull-up network and an NMOS pull-down network. Its key advantage is resilience to voltage scaling, allowing it to maintain robustness. Additionally, it features short rise and fall times, enabling high-frequency operation. However, the increased input capacitance in CMOS logic can hinder Full Adder performance when cascaded. To address this issue, Pass Transistor Logic (PTL) is used to reduce the number of transistors in Full Adder designs, thereby minimising power dissipation. Despite this advantage, PTL-based Full Adders have limited driving capabilities and suffer from threshold voltage drops, making them unsuitable for chain and tree structures [3, 5, 6].

The Complementary Pass-Transistor Logic (CPL) style utilises two independent NMOS networks, which contribute to its full-swing operation and strong driving capability, as noted in reference [3]. However, this approach increases the number of internal nodes, resulting in higher power dissipation. Additionally, the design's larger transistor count requires more silicon area and leads to a more complex layout [3].

In contrast, the Transmission Gate (TG) logic style, a variation of pass-transistor (PT) logic, utilises parallel combinations of NMOS and PMOS transistors. To alleviate the threshold voltage drop typically associated with PT logic, additional NMOS or PMOS transistors are integrated into Full Adder designs using this style. The key advantage of TG logic is its ability to achieve full-swing output while reducing power dissipation. However, its driving capability in chain or tree structures is limited [7, 8].

First A. Author is with Department of Electronics and Communication Engg. (Working as a Ph.D. Scholar), Institute of Technology, Gujarat, India (email: 20ftphd40@nirmauni.ac.in)

Second B. Author is with Department of Electronics and Communication Engg., Institute of Technology, Gujarat, India (email: vijay.savani@nirmauni.ac.in)  $^B$  Corresponding Author

According to references [9, 10, 11, 12], the GDI logic style has several advantages, including lower power consumption, higher speed, and fewer transistors. However, it is subject to a threshold voltage drop issue, which limits its ability to achieve full-swing output. The hybrid CMOS logic style utilises the strengths of several static CMOS design approaches to minimise the number of transistors used while maintaining satisfactory driving capacity and speed, as explained in references [2, 3].

The rest of the paper is organised as follows: Section II provides details of the proposed compressor technique-based 16-bit RCA adder, which utilises a proposed 1-bit hybrid adder. Section III covers a proposed compressor technique-based 4-bit Vedic Signed Multiplier (VSM) that employs the classical Urdhva Tiryakbhayam (UT) algorithm. Section IV explains the proposed 2-stage pipelined 4-bit signed Multiply and Accumulate (MAC) (2s4BSMAC) unit architecture. Section V presents simulation results using the testbench setup. Finally, Section VI provides a summary of the findings.

#### II. PROPOSED COMPRESSOR BASED RIPPLE CARRY ADDER (RCA) STRUCTURE USING FULL ADDER (FTGHFA-21T)

This section presents the proposed compressor-based RCA structure, which incorporates the proposed full-swing, low-voltage, low-power FinFET-Transmission Gate Hybrid Full Adder (FTGHFA-21T). Figure 1 illustrates the architecture of the proposed adder, which uses 21 transistors and comprises four modules: an 11T XOR/XNOR gate (Module 1), an inverter (Module 2), a Sum module (Module 3), and a carry-out (C<sub>out</sub>) module (Module 4). Modules 3 and 4 are implemented as Transmission Gate (TG) multiplexers. Module 2 generates the complement of the input carry, which is used by Module 3.

The operation of Module 1, the 11-transistor XOR-XNOR gate, is described for different input combinations as follows:

(1) Input Pattern AB = 00: In this scenario, transistors P1, P5, P6, and N2 are ON, while the remaining transistors are in the subthreshold region, effectively OFF. A strong logic '1' is transmitted through transistor P1, turning on transistor N2, which passes a strong logic '0' at the A  $\oplus$  B output. Concurrently, the ON states of transistors P5 and P6 result in a strong logic '1' at the A  $\odot$  B output.

(2) Input Pattern AB = 01: With inputs AB = 01, transistors P3, N1, and N5 are in the ON state, operating in the linear region, while the remaining transistors remain OFF. The ON condition of transistor P3 ensures that the A  $\oplus$  B output is set to a strong logic '1'. Simultaneously, the ON state of transistor N5 leads to a strong logic '0' at the A  $\odot$  B output.

(3) Input Pattern AB = 10: For AB = 10, transistors P1, P2, P6, and N4 are ON, with all other transistors OFF. Transistor P2, functioning as part of the transmission gate, passes a strong logic '1' at the A  $\oplus$  B output, while the ON state of transistor N4 drives a strong logic '0' at the A  $\odot$  B output.

(4) Input Pattern AB = 11: When AB = 11, transistors P4, N1, and N3 are activated, while the others are OFF. Transistor N1 transmits a strong logic '0' to the source of transistor N3, resulting in an A  $\oplus$  B output of logic '0'. At the same time, transistor P4 passes a strong logic '1', producing an A  $\odot$  B output of logic '1'.

Fig. 1: The Proposed Full Adder Architecture (FTGHFA-21T)

The logical expressions for Sum and  $C_{out}$  for the proposed adder are provided in Equations 1 and 2, respectively.

$$Sum = (A \oplus B) \cdot C'_{in} + (A \odot B) \cdot C_{in} \tag{1}$$

$$C_{out} = (A \oplus B) \cdot C_{in} + (A \odot B) \cdot B \tag{2}$$

The Sum output is obtained by multiplexing $C_{in}$ and its complement $C'_{in}$, with the XOR and XNOR signals serving as the selection lines. Similarly, the $C_{out}$ output is obtained by multiplexing input B and $C_{in}$, with the XNOR signal (and its complement, the XOR signal) serving as the selection line.
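As a quick sanity check, Equations 1 and 2 can be evaluated for all eight input combinations and compared with ordinary binary addition. The following is a minimal behavioural sketch only; the function names are hypothetical and the code models the logic equations, not the transistor-level circuit.

```python
from itertools import product


def ftghfa_sum_cout(a: int, b: int, cin: int) -> tuple[int, int]:
    """Evaluate Equations 1 and 2 on single-bit inputs (each 0 or 1)."""
    xor = a ^ b        # Module 1 output: A XOR B
    xnor = 1 - xor     # Module 1 output: A XNOR B
    cin_n = 1 - cin    # Module 2 output: complement of the input carry
    s = (xor & cin_n) | (xnor & cin)   # Equation 1
    cout = (xor & cin) | (xnor & b)    # Equation 2
    return s, cout


def matches_binary_addition() -> bool:
    """Return True if the equations agree with A + B + Cin for all inputs."""
    for a, b, cin in product((0, 1), repeat=3):
        total = a + b + cin
        if ftghfa_sum_cout(a, b, cin) != (total & 1, total >> 1):
            return False
    return True


print(matches_binary_addition())
```

The check confirms that selecting between $C'_{in}$ and $C_{in}$ for Sum, and between $C_{in}$ and B for $C_{out}$, reproduces the full-adder truth table.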

Certain Transmission Gate Hybrid Full-adder designs in the literature experience inadequate driving strength when implemented in chain and tree architectures. This limitation arises primarily due to the increased load on the input carry of the least significant bit (LSB) Full Adder [4]. The proposed compressor-based Ripple Carry Adder (RCA) structure addresses this issue effectively by reducing the load on the input carry. This is accomplished by breaking the carry propagation path without the need for extra buffers, thereby improving driving strength.

Two architectures have been proposed for this compressor-based scheme: Full Adder Type 0-1 (FA Type 0-1) and Full Adder Type 1-0 (FA Type 1-0). The FA Type 0-1 generates an inverted output carry ( $C_{out}$ ), while the FA Type 1-0 receives an inverted carry ( $C_{in}$ ) as input. The FA Type 0-1 architecture of the FTGHFA-21T design is shown in Fig. 2, and the FA Type 1-0 architecture of the FTGHFA-21T design is shown in Fig. 3.



Fig. 2: Full Adder Type 0-1 of FTGHFA-21T



Fig. 3: Full Adder Type 1-0 of FTGHFA-21T

Equation 3 gives the logical equation for *Sum*, and Equation 4 gives the logical equation for  $C_{out}$  in the FA Type 0-1 architecture. Equation 5 provides the logical equation for *Sum*, and Equation 6 provides the logical equation for  $C_{out}$  in the FA Type 1-0 architecture.

$$Sum = (A \oplus B) \cdot C'_{in} + (A \odot B) \cdot C_{in} \tag{3}$$

$$C_{out} = (A \oplus B) \cdot C'_{in} + (A \odot B) \cdot B' \tag{4}$$

$$Sum = (A \oplus B) \cdot C_{in} + (A \odot B) \cdot C'_{in} \tag{5}$$

$$C_{out} = (A \oplus B) \cdot C'_{in} + (A \odot B) \cdot B \tag{6}$$

Figure 4 depicts the proposed 16-bit RCA structure based on the compressor technique. Buffers have been added to all inputs to simulate realistic input signals. Likewise, buffers are used on all outputs to emulate realistic load conditions.
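The interplay of the two cell types can be illustrated with a short behavioural model. The sketch below assumes that the 16-bit chain alternates FA Type 0-1 and FA Type 1-0 cells, starting with Type 0-1 at the least significant bit; this ordering is an assumption made for illustration, and the function names are hypothetical. In Equations 4 to 6, the symbol for the inverted signal is read as the value actually present on the wire.

```python
def fa_type_01(a: int, b: int, cin: int) -> tuple[int, int]:
    """True carry in, inverted carry out (Equations 3 and 4)."""
    xor = a ^ b
    xnor = 1 - xor
    s = (xor & (1 - cin)) | (xnor & cin)             # Equation 3
    cout_n = (xor & (1 - cin)) | (xnor & (1 - b))    # Equation 4
    return s, cout_n


def fa_type_10(a: int, b: int, cin_n: int) -> tuple[int, int]:
    """Inverted carry in, true carry out (Equations 5 and 6)."""
    xor = a ^ b
    xnor = 1 - xor
    s = (xor & cin_n) | (xnor & (1 - cin_n))         # Equation 5
    cout = (xor & (1 - cin_n)) | (xnor & b)          # Equation 6
    return s, cout


def rca16(x: int, y: int, cin: int = 0) -> tuple[int, int]:
    """Add two 16-bit words; returns (16-bit sum, carry out)."""
    carry = cin      # true polarity entering bit 0
    result = 0
    for i in range(16):
        a = (x >> i) & 1
        b = (y >> i) & 1
        # Assumed ordering: even positions Type 0-1, odd positions Type 1-0,
        # so the carry polarity flips at every stage without an inverter.
        cell = fa_type_01 if i % 2 == 0 else fa_type_10
        s, carry = cell(a, b, carry)
        result |= s << i
    return result, carry     # bit 15 is Type 1-0, so the carry is true polarity


print(rca16(1234, 4321))
```

Because every Type 0-1 cell hands an inverted carry to a Type 1-0 cell, the chain computes the same result as a conventional ripple-carry adder while no stage needs a restoring buffer or inverter on the carry path.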

#### III. PROPOSED COMPRESSOR TECHNIQUE-BASED 4-BIT VEDIC SIGNED MULTIPLIER (VSM)

Vedic mathematics is an ancient Indian mathematical system that emphasises using Vedic equations and their applications across various mathematical domains. Through his extensive study of the Vedas, Sri Bharati Krishna Tirtha revived Vedic mathematics by drawing upon ancient Indian texts. His work on the Atharva Veda culminated in the formulation of 16 sutras and 16 upa-sutras. Among these, the Urdhva Tiryakbhayam sutra is recognised as the most efficient for computational use.

Vedic mathematics is renowned for its elegance, as it simplifies complex computations that would typically be more cumbersome using conventional mathematical methods. The underlying Vedic equations are thought to align with the fundamental principles of human cognition, contributing to their computational efficiency. This efficiency translates into faster multiplication operations in digital signal processing (DSP) blocks. The powerful algorithms, derived from Vedic mathematics, have proven to be highly applicable across various engineering disciplines, as demonstrated in studies [13, 14].

The Urdhva Tiryakbhayam (UT) algorithm performs unsigned Vedic multiplication, generating the partial products through vertical and crosswise operations [15, 16, 17, 18, 19, 20]. Figure 5 depicts the vertical and crosswise operations of the UT Vedic multiplication algorithm.

Equations 7a through 7h represent the partial products generated by the UT algorithm during the 4-bit unsigned Vedic multiplication procedure.

$$P_0 = (X_0 \cdot Y_0) \tag{7a}$$

$$P_1 = (X_1 \cdot Y_0) + (X_0 \cdot Y_1) \tag{7b}$$

$$P_2 = (X_2 \cdot Y_0) + (X_1 \cdot Y_1) + (X_0 \cdot Y_2) + \text{Carry of } P_1 \tag{7c}$$

$$P_3 = (X_3 \cdot Y_0) + (X_2 \cdot Y_1) + (X_1 \cdot Y_2) + (X_0 \cdot Y_3) + \text{Carry of } P_2 \tag{7d}$$

$$P_4 = (X_3 \cdot Y_1) + (X_2 \cdot Y_2) + (X_1 \cdot Y_3) + \text{Carry of } P_3 \tag{7e}$$

$$P_5 = (X_3 \cdot Y_2) + (X_2 \cdot Y_3) + \text{Carry of } P_4 \tag{7f}$$

$$P_6 = (X_3 \cdot Y_3) + \text{Carry of } P_5 \tag{7g}$$

$$P_7 = \text{Carry of } P_6 \tag{7h}$$
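Equations 7a through 7h can be read as a column-by-column procedure: each product bit is the least significant bit of the column sum, and the remainder is carried into the next column. The following minimal sketch, with a hypothetical function name, follows that reading and is a behavioural model rather than a description of the hardware.

```python
def ut_multiply_4bit(x: int, y: int) -> int:
    """Unsigned 4-bit x 4-bit multiplication using UT column sums."""
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> i) & 1 for i in range(4)]
    carry = 0
    product = 0
    for k in range(8):
        # Crosswise terms X_i * Y_(k-i) of column k, plus the incoming carry.
        column = carry + sum(
            xb[i] & yb[k - i] for i in range(4) if 0 <= k - i <= 3
        )
        product |= (column & 1) << k   # P_k is the low bit of the column sum
        carry = column >> 1            # the rest is the "Carry of P_k"
    return product


print(ut_multiply_4bit(13, 11))  # 143
```

Column 7 contains no crosswise terms, so $P_7$ is simply the carry out of column 6, as in Equation 7h.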



\* All outputs are passed through buffers to get realistic load

Fig. 4: Proposed Compressor technique-based 16-bit RCA structure



Fig. 5: Urdhva Tiryakbhayam Vedic Multiplication Algorithm

This paper proposes a modification of the UT algorithm for Vedic Signed Multiplication, resulting in the generation of partial products shown in equations 8a through 8h.

$$P_0 = X_0 \cdot Y_0 \tag{8a}$$

$$P_1 = (X_1 \cdot Y_0) + (X_0 \cdot Y_1) \tag{8b}$$

$$P_2 = (X_2 \cdot Y_0) + (X_1 \cdot Y_1) + (X_0 \cdot Y_2) + \text{Carry of } P_1 \tag{8c}$$

$$P_3 = (X_3 \cdot Y_0)' + (X_2 \cdot Y_1) + (X_1 \cdot Y_2) + (X_0 \cdot Y_3)' + \text{Carry of } P_2 \tag{8d}$$

$$P_4 = 1 + (X_3 \cdot Y_1)' + (X_2 \cdot Y_2) + (X_1 \cdot Y_3)' + \text{Carry of } P_3 \tag{8e}$$

$$P_5 = (X_3 \cdot Y_2)' + (X_2 \cdot Y_3)' + \text{Carry of } P_4 \tag{8f}$$

$$P_6 = X_3 \cdot Y_3 + \text{Carry of } P_5 \tag{8g}$$

$$P_7 = 1 + \text{Carry of } P_6 \tag{8h}$$
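The signed column equations can be checked exhaustively against two's-complement multiplication. The sketch below rests on two assumptions that follow the usual Baugh-Wooley arrangement: the complemented terms are exactly those products that involve one sign bit ($X_3$ or $Y_3$, but not both), and a constant 1 is added in columns 4 and 7. The function name is hypothetical.

```python
def vsm_multiply_4bit(x: int, y: int) -> int:
    """Signed 4-bit x 4-bit multiplication; x and y lie in [-8, 7]."""
    # Python's >> on negative integers yields two's-complement bits.
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> i) & 1 for i in range(4)]
    carry = 0
    raw = 0
    for k in range(8):
        column = carry
        for i in range(4):
            j = k - i
            if 0 <= j <= 3:
                bit = xb[i] & yb[j]
                if (i == 3) != (j == 3):   # exactly one sign bit: complement
                    bit ^= 1
                column += bit
        if k in (4, 7):                    # assumed correction constants
            column += 1
        raw |= (column & 1) << k
        carry = column >> 1                # carry out of column 7 is discarded
    return raw - 256 if raw & 0x80 else raw   # interpret as 8-bit signed


print(vsm_multiply_4bit(-8, 7))  # -56
```

Under these assumptions the model agrees with ordinary signed multiplication for all 256 operand pairs, which supports reading the constant terms as the correction needed once the sign-bit products are complemented.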

Figure 6 depicts the architecture of the proposed compressor technique-based 4-bit Vedic Signed Multiplier (VSM).

According to [21], the Transmission Gate-based Hybrid Full Adder suffers from signal degradation when cascaded. The compressor-based 4-bit VSM structure offers a significant advantage by producing a full-swing output while reducing the load on the  $C_{in}$  input of FA Type 0-1. The P1 signal is thus generated with the carry propagation path interrupted and without buffers at the intermediate stages, thereby reducing the overall PDP.

#### IV. PROPOSED COMPRESSOR-TECHNIQUE BASED 2-STAGE PIPELINE 4-BIT SIGNED MAC UNIT

The architecture of the proposed 2-stage pipelined, compressor-based 4-bit Signed Multiply and Accumulate (2s4BSMAC) unit is presented in Figure 7. The MAC unit consists of five main blocks. The first block uses the proposed 4-bit compressor-based Vedic Signed Multiplier to generate the partial products. The second block is an 8-bit Parallel Input, Parallel Output (PIPO) shift register, which stores the partial products. In the third block, a 16-bit compressor-based Ripple Carry Adder (RCA) adds partial products and guard bits. The MAC unit incorporates 8 guard bits to prevent overflow during addition. The inclusion of guard bits is crucial to prevent data loss due to overflow, which can occur when executing multiple consecutive MAC operations with a large accumulated sum [22].

The MAC unit can preserve precision across multiple operations by incorporating guard bits into the accumulated sum. The fourth block is a 17-bit Parallel Input, Parallel Output (PIPO) register, which stores the accumulated sum. Finally, a saturation unit ensures that the output of the MAC operation stays within predefined maximum and minimum limits. This unit is crucial for preventing overflow or underflow due to the MAC operation. Typically, the saturation unit is implemented as a simple comparator circuit that compares the MAC output with the maximum and minimum representable values. Additionally, the 1-bit storage element, implemented as a D-flip flop with set and reset functionality, is used to build the PIPO register, as shown in Fig. 8.

The saturation unit outputs the maximum value if the result exceeds the upper limit and the minimum value if the result falls below the lower limit. Positioned between the MAC unit and the subsequent processing stage, the saturation unit ensures that the output of the MAC operation always stays within the representable range, thereby preventing errors due to overflow or underflow in downstream processing [23]. The architecture of the saturation unit, shown in Fig. 9, constrains the output to a range between -128 and 127.
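The data flow through the five blocks can be summarised with a cycle-level behavioural model. This is a sketch under stated assumptions, not the authors' netlist: the accumulator is modelled as a 16-bit two's-complement value (the 8-bit product plus 8 guard bits), the extra bit of the 17-bit register is not modelled, and the class and function names are hypothetical.

```python
def saturate(value: int, lo: int = -128, hi: int = 127) -> int:
    """Clamp the accumulated sum to the output range of the saturation unit."""
    return max(lo, min(hi, value))


def wrap16(value: int) -> int:
    """Wrap an integer to 16-bit two's complement (assumed adder width)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


class Mac2Stage:
    """Two pipeline registers: the product register and the accumulator."""

    def __init__(self) -> None:
        self.product_reg = 0   # stage 1: output of the signed multiplier
        self.acc_reg = 0       # stage 2: accumulated sum with guard bits

    def clock(self, x: int, y: int) -> int:
        """Apply one clock edge and return the saturated output after it."""
        # Both registers capture values computed from the pre-edge state.
        next_acc = wrap16(self.acc_reg + self.product_reg)
        next_product = x * y            # 4-bit signed operands in [-8, 7]
        self.acc_reg, self.product_reg = next_acc, next_product
        return saturate(self.acc_reg)


mac = Mac2Stage()
outputs = [mac.clock(x, y) for x, y in [(7, 7), (7, 7), (7, 7), (0, 0)]]
print(outputs)  # [0, 49, 98, 127]
```

In this run the internal sum reaches 147 on the fourth edge while the output is clamped to 127. The largest product magnitude is 64, so under the 16-bit assumption the guard bits absorb several hundred accumulations before the internal sum itself could overflow.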

#### V. RESULTS AND ANALYSIS

This section details the implementation of the 1-bit full adder (FTGHFA-21T), including the analysis of its process corner simulation and supply variation simulation. The simulation results of the proposed adder are compared with those of state-of-the-art hybrid adders, as reported in the literature. Additionally, the simulation results and analysis of a 16-bit compressor-based RCA that utilises the 1-bit adder (FTGHFA-21T) are included. Simulations of a 4-bit compressor-based Vedic Signed Multiplier were performed, and the results are presented. Finally, the simulations and results of a 4-bit Signed Multiply and Accumulate (MAC) unit with two pipelining stages are presented and analysed.

Fig. 6: Proposed Compressor technique-based 4-bit VSM architecture

\* All inputs are passed through buffers to get realistic input signal

\* All outputs are passed through buffers to get realistic load



Fig. 9: Saturation Unit of 2s4BSMAC

### A. Simulation of the Proposed 1-bit full adder (FTGHFA-21T)

A testbench setup is shown in Fig. 10 to simulate the full adders. The input signals are generated using buffers, taking the input capacitances into account, thus providing more realistic inputs. On the output side, the buffers act as a load to ensure appropriate loading conditions.

A simulation was conducted using a supply voltage of 0.8 V to evaluate the proposed adder. The transient response of the FTGHFA-21T design is analysed, and the outcomes are presented in Figure 11. An input test pattern consisting of 57 transitions was employed to ensure the adder's performance and functional accuracy. This comprehensive test pattern was designed to capture the maximum propagation delay, estimate power dissipation fairly, and verify the adder's correct functionality. The test pattern has a rise time and fall time of 1 ps, a period of 1 ns, and a stop time of 58 ns [24].

Researchers have proposed several hybrid full-adder designs, and their findings are documented in the literature. To establish a standard framework for comparison, the adders were simulated using the Cadence Virtuoso tool. The simulations were performed using 18 nm FinFET technology and the cds_ff_mpt_1.1 libraries. The proposed adder architecture was optimised to ensure accurate simulation results by incorporating low-threshold p1lvt and n1lvt transistors, each featuring two fins.

The simulation results for various hybrid adder circuits utilising Transmission Gates with a power supply of 0.8 V are presented in Table I. In comparison to the other adder circuits discussed in the document, the proposed adder incorporates module 1, featuring a low-power, high-speed full-swing 11T XOR-XNOR gate. This design choice contributes to a reduced figure of merit (FoM) in terms of the power-delay product (PDP), highlighting its advantages over other adder circuits.

TABLE I: Simulation results of different 1-bit Full Adder architectures (Supply voltage = 0.8 V)

| Sr. No | Adder       | Transistor<br>Count | Avg.<br>Power<br>(µW) | Max.<br>Delay<br>(ps) | PDP<br>(aJ) |
|--------|-------------|---------------------|------------------------|----------------------|-------------|
| 1      | LPHS22T [4] | 22                  | 1.95                   | 86                   | 167.87      |
| 2      | TGA [6]     | 20                  | 1.94                   | 123                  | 238.74      |
| 3      | Kamsani [7] | 22                  | 1.97                   | 77                   | 152         |
| 4      | FTGHFA-21T  | 21                  | 1.68                   | 75                   | 126.08      |
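The PDP column is the product of average power and maximum delay, and the relative improvements over the TGA follow directly from the tabulated values. The short sketch below recomputes them from Table I; the helper names are hypothetical. Because the tabulated power and delay are rounded, the recomputed PDP can differ slightly from the PDP column (for example 126.0 aJ against the listed 126.08 aJ).

```python
# (average power in µW, maximum delay in ps) taken from Table I
TABLE_I = {
    "LPHS22T": (1.95, 86),
    "TGA": (1.94, 123),
    "Kamsani": (1.97, 77),
    "FTGHFA-21T": (1.68, 75),
}


def pdp_aj(power_uw: float, delay_ps: float) -> float:
    """Power-delay product in attojoules: 1 µW x 1 ps = 1e-18 J = 1 aJ."""
    return power_uw * delay_ps


def improvement_pct(reference: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `reference`, in %."""
    return 100.0 * (reference - proposed) / reference


tga_power, tga_delay = TABLE_I["TGA"]
new_power, new_delay = TABLE_I["FTGHFA-21T"]

print(f"delay improvement: {improvement_pct(tga_delay, new_delay):.2f}%")
print(f"power improvement: {improvement_pct(tga_power, new_power):.2f}%")
print(f"recomputed PDP:    {pdp_aj(new_power, new_delay):.2f} aJ")
```

The delay and power improvements evaluate to about 39.02% and 13.40%, matching the figures quoted in the abstract.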

Figure 12 presents a comprehensive comparison between the proposed adder and other leading hybrid full adders, focusing explicitly on their figure of merit (FoM) in terms of the power-delay product (PDP) under supply voltage variation. It is widely recognised that reducing the supply voltage leads to reduced power consumption; however, it also introduces increased delay due to the limited current available for charging and discharging the load capacitance. Remarkably, the proposed adder demonstrates the lowest FoM and performs competitively with state-of-the-art adders when the supply voltage fluctuates from 0.72 V to 0.88 V.

Process corner analysis aims to simulate the minimum and maximum values of each parameter to account for extreme scenarios. Process variation leads to significant uncertainties in circuit performance. To verify the effectiveness of the architecture, process corner analysis is carried out for the proposed adder, with results presented in Table II. The proposed adder demonstrated optimal and comparable results for PDP across all process corner variations when operating at the nominal supply voltage.

The layout diagram of the proposed adder is shown in Fig. 13. By using a reduced number of interconnects and transistors, the proposed architecture occupies a smaller chip area than existing designs. The silicon area occupied by the circuit shown in Fig. 13 is measured as $3.73\,\mu\text{m} \times 1.42\,\mu\text{m}$, further emphasising its space-efficient nature.



Fig. 10: The Simulation test bench setup for 1-bit adders with buffers as load

TABLE II: Process Corner Analysis results of the 1-bit FTGHFA-21T Architecture (Supply Voltage = 0.8 V)

| Sr. No | Parameters         | FF     | FS    | SF     | SS     | TT     |
|--------|--------------------|--------|-------|--------|--------|--------|
| 1      | Avg. Power<br>(µW) | 2.12   | 1.8   | 1.63   | 1.51   | 1.68   |
| 2      | Max. Delay<br>(ps) | 61     | 76    | 77     | 95     | 75     |
| 3      | PDP<br>(aJ)        | 129.32 | 136.8 | 125.51 | 143.45 | 126.08 |



Fig. 11: Transient response for the 1-bit adder (FTGHFA-21T) at 0.8V power supply



Fig. 12: PDP V/s. Supply Voltage variation for different 1-bit adder architectures

The post-layout simulation of the FTGHFA-21T architecture is carried out at the nominal supply voltage, and its comparison with the pre-layout simulation is presented in Table III.

### B. Proposed compressor technique based 16-bit RCA

To capture the maximum propagation delay, reasonably estimate power dissipation, and verify correct functionality, the proposed 16-bit compressor-based RCA is simulated using a 0.8 V power supply. The input test pattern from Ref. [24], which consists of two groups of inputs (primary and secondary), is used in the simulation. The input signals have a 1 ps rise time, 1 ps fall time, 10 ns period, and 580 ns stop time. Figure 14 depicts the transient response of the 16-bit compressor-based RCA.

Table IV compares the results of various 16-bit RCA architectures based on compressor techniques, all operating at a 0.8 V power supply. The results show that the compressor technique-based 16-bit RCA composed of FTGHFA-21T has the lowest average power dissipation (6.16 µW) among the compared designs. Although its maximum delay (1192 ps) is marginally higher than that of the other designs, it achieves the lowest PDP (7.34 fJ).

Fig. 15 presents the simulation results for the effect of supply voltage variation on the performance of the compressor technique-based 16-bit RCAs. The results show that, within the range of supply voltage variation from 0.72 V to 0.88 V, the 16-bit compressor-based RCA composed of FTGHFA-21T exhibits the lowest FoM, which is comparable to other 16-bit compressor technique-based RCAs composed of state-of-the-art adders.

The proposed adder is utilised in compressor technique-based 16-bit RCA architectures, and process corner analysis is performed. The results, presented in Table V, show that the proposed adder achieves optimal and comparable results for PDP across all process corner variations when operated at the nominal supply voltage (0.8 V).

### C. The compressor technique-based proposed 4-bit Vedic Signed Multiplier

The classical Vedic multiplier, designed for unsigned multiplication, is limited in its application. To address this limitation, a compressor technique-based Vedic Signed Multiplier is proposed and simulated using a 0.8 V supply voltage. Correct functionality is verified using a random input pattern with 1000 samples, a 1 ps rise time, a 1 ps fall time, a 10 ns period, and a 10000 ns stop time. The transient response of the proposed 4-bit Vedic Signed Multiplier, based on the compressor technique, is illustrated in Fig. 16.

To showcase the efficiency of the FTGHFA-21T and the compressor technique in Vedic Signed Multiplication, Vedic signed multipliers employing the conventional adders and the proposed hybrid adder are simulated at a 0.8 V power supply voltage, and their results are presented in Table VI. The simulation results demonstrate that the 4-bit compressor-based VSM composed of FTGHFA-21T achieves the lowest power-delay product (PDP) among designs using state-of-the-art adders.



Fig. 13: The Layout of the 1-bit FTGHFA-21T Architecture

TABLE III: Pre-Layout V/s. Post-Layout Results of Performance Parameter for 1-bit FTGHFA-21T Architecture (Supply voltage = 0.8 V)



Fig. 14: Transient simulation of compressor technique-based 16-bit RCA

TABLE IV: Performance comparison of compressor technique-based 16-bit RCA architectures (Supply voltage = 0.8 V)

| Sr. No | Compressor<br>technique-based<br>16-bit RCA | Transistor<br>Count | Avg.<br>Power<br>(µW) | Max.<br>Delay<br>(ps) | PDP<br>(fJ) |
|--------|---------------------------------------------|---------------------|------------------------|----------------------|--------------|
| 1      | LPHS22T [4]                                 | 352                 | 7.01                   | 1178                 | 8.25         |
| 2      | TGA [6]                                     | 320                 | 6.6                    | 1175                 | 7.76         |
| 3      | Kamsani [7]                                 | 352                 | 7.04                   | 1169                 | 8.22         |
| 4      | FTGHFA-21T                                  | 336                 | 6.16                   | 1192                 | 7.34         |

TABLE V: Process Corner Analysis results for the compressor technique-based 16-bit RCA composed of FTGHFA-21T Architecture (Supply voltage = 0.8 V)

| Sr. No | Parameters         | FF    | FS   | SF   | SS   | TT   |
|--------|--------------------|-------|------|------|------|------|
| 1      | Avg. Power<br>(µW) | 12.38 | 8    | 5.32 | 4.03 | 6.16 |
| 2      | Max. Delay<br>(ps) | 964   | 1198 | 1194 | 1520 | 1192 |
| 3      | PDP<br>(fJ)        | 11.93 | 9.58 | 6.53 | 6.13 | 7.34 |

The impact of supply voltage variation on the Figure of Merit (PDP) for Vedic Signed Multipliers using the state-of-the-art hybrid adders and the proposed hybrid adder (i.e., the proposed Vedic Signed Multiplier architecture) is analysed and presented in Fig. 17. The results reveal that the 4-bit compressor-based VSM incorporating the proposed adder achieves the lowest FoM, comparable to state-of-the-art VSMs under supply voltage variations ranging from 0.72 V to 0.88 V.

Process corner analysis is performed on the proposed adder-based 4-bit VSM employing the compressor technique. The results, presented in Table VII, demonstrate that the VSM achieves optimal and consistent PDP performance across all process corner variations when operated at the nominal supply voltage.

Fig. 15: PDP V/s. Supply voltage variation for compressor technique-based different 16-bit RCAs

TABLE VI: Performance comparison of Compressor technique-based 4-bit Vedic Signed Multiplier using state-of-the-art hybrid adders and proposed hybrid adder (Supply voltage = 0.8 V)

| Sr. No | Compressor<br>technique-based<br>4-bit VSM | Transistor<br>Count | Avg.<br>Power<br>(µW) | Max.<br>Delay<br>(ps) | PDP<br>(fJ) |
|--------|--------------------------------------------|---------------------|-----------------------|-----------------------|-------------|
| 1      | LPHS22T [4]                                | 370                 | 39.39                 | 653                   | 25.7        |
| 2      | TGA [6]                                    | 344                 | 38.89                 | 659                   | 25.63       |
| 3      | Kamsani [7]                                | 370                 | 39.36                 | 625                   | 24.6        |
| 4      | FTGHFA-21T                                 | 357                 | 38.75                 | 591                   | 22.9        |

TABLE VII: Process Corner Analysis results for the compressor technique-based 4-bit VSM composed of FTGHFA-21T Architecture (Supply voltage = 0.8 V)

| Sr. No | Parameters         | FF   | FS   | SF   | SS   | TT   |
|--------|--------------------|------|------|------|------|------|
| 1      | Avg. Power<br>(µW) | 51.2 | 37.4 | 41.1 | 30.5 | 38.8 |
| 2      | Max. Delay<br>(ps) | 480  | 592  | 596  | 748  | 591  |
| 3      | PDP<br>(fJ)        | 24.6 | 22.2 | 24.5 | 22.8 | 22.9 |

### D. Proposed 4-bit 2-stage pipelined Signed Multiply and Accumulate (MAC) Unit

The 2-stage pipeline 4-bit Signed Multiply and Accumulate Unit (2s4BSMAC), designed using the proposed compressor technique, is simulated with a 0.8 V supply voltage. Its functionality is verified using a random input pattern comprising 1000 samples, with a rise time of 1 ps, a fall time of 1 ps, a period of 2 ns, and a stop time of 2000 ns.

Fig. 18 illustrates the transient response of the proposed 2-stage pipeline 4-bit Signed Multiply and Accumulate (MAC) Unit based on the compressor technique. Table VIII presents the simulation results for various proposed 2-stage pipeline 4-bit Signed Multiply and Accumulate (MAC) Units employing the compressor technique at a clock frequency of 500 MHz. The findings indicate that the MAC Unit incorporating the FTGHFA-21T architecture achieves the lowest PDP and the highest throughput compared to all state-of-the-art adder-based MAC Units.

The Figure of Merit (FoM), represented by the Power-Delay Product (PDP), for MAC units with various architectural combinations of RCA and VSM is evaluated through simulations at different supply voltages. The results, shown in Fig. 19, demonstrate that the compressor-based 2s4BSMAC unit incorporating the FTGHFA-21T architecture achieves a lower FoM than the 2s4BSMAC units based on the state-of-the-art adders under supply voltage variations ranging from 0.72 V to 0.88 V.

Table IX presents the results of the process corner analysis performed on the 2s4BSMAC. The PDP of the proposed MAC unit stays between 65.33 fJ and 68.51 fJ at the FS, SF, SS, and TT corners and rises to 87.29 fJ at the FF corner, where the average power is highest.
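The corner-to-corner spread can be quantified directly from the Table IX entries. The following sketch recomputes the PDP at each corner and its deviation from the typical (TT) corner; the helper names are hypothetical, and small differences from the tabulated PDP values come from rounding.

```python
# Average power (uW) and maximum delay (ps) per process corner, copied from Table IX.
CORNERS = {
    "FF": (147.7, 591),
    "FS": (112.5, 609),
    "SF": (111.5, 586),
    "SS": (89.66, 738),
    "TT": (110.0, 601),
}


def corner_pdp_fj():
    """PDP in fJ per corner: uW * ps = 1e-18 J = 1e-3 fJ."""
    return {corner: power * delay * 1e-3 for corner, (power, delay) in CORNERS.items()}


def deviation_from_tt_percent():
    """Relative PDP deviation of each corner with respect to the TT corner, in percent."""
    pdp = corner_pdp_fj()
    tt = pdp["TT"]
    return {corner: 100.0 * (value - tt) / tt for corner, value in pdp.items()}


if __name__ == "__main__":
    pdp = corner_pdp_fj()
    for corner, dev in deviation_from_tt_percent().items():
        print(f"{corner}: PDP = {pdp[corner]:.2f} fJ, deviation from TT = {dev:+.1f}%")
```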

#### VI. CONCLUSION

The proposed 2-stage pipelined 4-bit Signed Multiply and Accumulate (MAC) unit employs a compressor-based Vedic Signed Multiplier (VSM) and a compressor-based RCA. The hybrid 1-bit full adder (FTGHFA-21T) is utilised in the RCA and VSM units, offering full swing, low power, and high-speed performance. The proposed adder features Type 0-1 and Type 1-0 structures, which use transmission gate-based multiplexers to generate the Sum and Cout signals. Simulation results demonstrate a significant performance improvement in the proposed architecture. Integrating the FTGHFA-21T adder into the proposed 4-bit VSM structure results in a notable enhancement in PDP. The 2s4BSMAC unit that incorporates the FTGHFA-21T adder achieves the best FoM (a PDP of 66.11 fJ) and the highest throughput among the MAC units built with the compared adders. The performance of all adders, the proposed 4-bit VSMs, 16-bit RCAs, and 2s4BSMACs is evaluated under ±10% variation of the nominal supply voltage. The proposed compressor-based 16-bit RCA and 4-bit VSM structures effectively address the driving limitations of transmission gates without requiring buffers at intermediate stages.

#### DECLARATIONS

- Funding: No funding was received to assist with preparing this manuscript.
- Authors' contributions: Parthiv Bhau: Methodology, simulations, interpretation of results, original draft writing, review and editing. Vijay Savani: Interpretation of results, review, and editing.
- Acknowledgements: Nirma University provided the infrastructure facilities.



Fig. 16: Transient Simulation of Compressor technique-based 4-bit VSM architecture

TABLE VIII: Performance parameters of MAC units with various architectural combinations of RCA and VSM (Supply voltage = 0.8 V)

| Sr. No. | Compressor technique based 4-bit Signed MAC Unit | Transistor Count | Avg. Power (µW) | Delay (VSM) (ps) | Delay (RCA) (ps) | Delay (Sat. Unit) (ps) | Max. Delay (ps) | PDP (fJ) | Throughput (bits/sec)        |
|---------|--------------------------------------------------|------------------|-----------------|------------------|------------------|------------------------|-----------------|----------|------------------------------|
| 1       | LPHS22T VSM + LPHS22T RCA                        | 1813             | 112.7           | 643              | 590              | 433                    | 643             | 72.47    | $3.88 \times 10^{17}$        |
| 2       | TGA VSM + TGA RCA                                | 1755             | 109.7           | 649              | 609              | 353                    | 649             | 71.2     | $3.85 \times 10^{17}$        |
| 3       | Kamsani VSM + Kamsani RCA                        | 1813             | 111.8           | 608              | 593              | 389                    | 608             | 67.97    | $4.11 \times 10^{17}$        |
| 4       | FTGHFA-21T VSM + FTGHFA-21T RCA                  | 1784             | 110             | 582              | 601              | 396                    | 601             | 66.11    | **$4.15 \times 10^{17}$**    |

Throughput (bits/sec) = $F_{clk}$ / (Max. Delay × Number of pipeline stages)
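The PDP and throughput columns of Table VIII can be recomputed from the tabulated power and delay values using the expression above. The sketch below does so under one assumption that is inferred from the table rather than stated in the text: the Max. Delay entry is taken as the larger of the VSM and RCA stage delays. The helper names are hypothetical.

```python
F_CLK_HZ = 500e6      # clock frequency used for Table VIII
PIPELINE_STAGES = 2   # 2-stage pipeline

# Adder name: (average power in uW, VSM delay in ps, RCA delay in ps), from Table VIII.
MAC_UNITS = {
    "LPHS22T": (112.7, 643, 590),
    "TGA": (109.7, 649, 609),
    "Kamsani": (111.8, 608, 593),
    "FTGHFA-21T": (110.0, 582, 601),
}


def max_delay_ps(vsm_ps, rca_ps):
    """Assumption: the reported Max. Delay is the slower of the VSM and RCA stages."""
    return max(vsm_ps, rca_ps)


def pdp_fj(power_uw, delay_ps):
    """Power-delay product in fJ: uW * ps = 1e-18 J = 1e-3 fJ."""
    return power_uw * delay_ps * 1e-3


def throughput(delay_ps):
    """Throughput as defined under Table VIII: Fclk / (max delay * pipeline stages)."""
    return F_CLK_HZ / (delay_ps * 1e-12 * PIPELINE_STAGES)


def summarize():
    """Return {adder: (max delay in ps, PDP in fJ, throughput)} for every MAC unit."""
    rows = {}
    for name, (power, vsm, rca) in MAC_UNITS.items():
        delay = max_delay_ps(vsm, rca)
        rows[name] = (delay, pdp_fj(power, delay), throughput(delay))
    return rows


if __name__ == "__main__":
    for name, (delay, pdp, thr) in summarize().items():
        print(f"{name:11s} delay = {delay} ps, PDP = {pdp:.2f} fJ, throughput = {thr:.3e}")
```

Applied to the TGA-based and FTGHFA-21T-based rows, this gives PDP values of about 71.2 fJ and 66.11 fJ, i.e. a reduction of roughly 7.1% relative to the TGA-based design.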



Fig. 17: PDP vs. Supply Voltage Variation for 4-Bit VSMs Based on Compressor Techniques



Fig. 18: Simulation Results of proposed compressor-technique based 2s4BSMAC unit

#### REFERENCES

- [1] J. Kuppili, M. Abhiram, and N. A. Manga, "Design of vedic mathematics based 16 bit mac unit for power and delay optimization," in 2021 4th Biennial International Conference on Nascent Technologies in Engineering (ICNTE), pp. 1–4, IEEE, 2021.



Fig. 19: FoM (PDP) vs. Supply Voltage Variation of MAC units with various architectural combinations (RCA + VSM)

TABLE IX: Process Corner Analysis results of the 2s4BSMAC unit composed of FTGHFA-21T Architecture (Supply voltage = 0.8 V)

| Sr. No. | Parameters      | Process Corner |       |       |       |       |
|---------|-----------------|----------------|-------|-------|-------|-------|
|         |                 | FF             | FS    | SF    | SS    | TT    |
| 1       | Avg. Power (µW) | 147.7          | 112.5 | 111.5 | 89.66 | 110   |
| 2       | Max. Delay (ps) | 591            | 609   | 586   | 738   | 601   |
| 3       | PDP (fJ)        | 87.29          | 68.51 | 65.33 | 66.17 | 66.11 |

- [2] P. Bhattacharyya, B. Kundu, S. Ghosh, V. Kumar, and A. Dandapat, "Performance analysis of a low-power high-speed hybrid 1-bit full adder circuit," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 23, no. 10, pp. 2001–2008, 2014.
- [3] S. Goel, A. Kumar, and M. A. Bayoumi, "Design of robust, energy-efficient full adders for deep-submicrometer design using hybrid-cmos logic style," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 14, no. 12, pp. 1309–1321, 2006.

- [4] M. Mewada, M. Zaveri, and R. Thakker, "Improving the performance of transmission gate and hybrid cmos full adders in chain and tree structure architectures," *Integration*, vol. 69, pp. 381–392, 2019.
- [5] C. H. Chang, J. M. Gu, and M. Zhang, "A review of 0.18-μm full adder performances for tree," *IEEE*, 2005.
- [6] A. M. Shams, T. K. Darwish, and M. A. Bayoumi, "Performance analysis of low-power 1-bit cmos full adder cells," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 10, no. 1, pp. 20–29, 2002.
- [7] N. A. Kamsani, V. Thangasamy, S. J. Hashim, Z. Yusoff, M. F. Bukhori, and M. N. Hamidon, "A low power multiplexer based pass transistor logic full adder," in 2015 IEEE Regional Symposium on Micro and Nanoelectronics (RSM), pp. 1–4, IEEE, 2015.
- [8] N. H. Weste and K. Eshraghian, Principles of CMOS VLSI design: a systems perspective. Addison-Wesley Longman Publishing Co., Inc., 1985.
- [9] A. Morgenshtein, A. Fish, and I. A. Wagner, "Gate-diffusion input (gdi): a power-efficient method for digital combinatorial circuits," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 10, no. 5, pp. 566–581, 2002.
- [10] M. Shoba and R. Nakkeeran, "Gdi based full adders for energy efficient arithmetic applications," *Engineering Science and Technology, an International Journal*, vol. 19, no. 1, pp. 485–496, 2016.
- [11] M. Hasan, H. U. Zaman, M. Hossain, P. Biswas, and S. Islam, "Gate diffusion input technique based full swing and scalable 1-bit hybrid full adder for high performance applications," *Engineering Science and Technology, an International Journal*, vol. 23, no. 6, pp. 1364–1373, 2020.
- [12] K. Sanapala and R. Sakthivel, "Ultra-low-voltage gdi-based hybrid full adder design for area and energy-efficient computing systems," *IET Circuits, Devices & Systems*, vol. 13, no. 4, pp. 465–470, 2019.
- [13] T. Rakshith and R. Saligram, "Design of high speed low power multiplier using reversible logic: A vedic mathematical approach," in 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT), pp. 775–781, IEEE, 2013.
- [14] S. B. K. Tirtha and V. S. Agrawala, *Vedic mathematics*, vol. 10. Motilal Banarsidass Publ., 1992.
- [15] M. Hanumantharaju, H. Jayalaxmi, R. Renuka, and M. Ravishankar, "A high speed block convolution using ancient indian vedic mathematics," in *International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007)*, vol. 2, pp. 169–173, IEEE, 2007.
- [16] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient indian vedic mathematics," in 2008 International SoC Design Conference, vol. 2, pp. II–65, IEEE, 2008.
- [17] V. Kunchigi, L. Kulkarni, and S. Kulkarni, "High speed and area efficient vedic multiplier," in 2012 International Conference on Devices, Circuits and Systems (ICDCS), pp. 360–364, IEEE, 2012.
- [18] D. Jaina, K. Sethi, and R. Panda, "Vedic mathematics based multiply accumulate unit," in 2011 International Conference on Computational Intelligence and Communication Networks, pp. 754–757, IEEE, 2011.
- [19] S. S. Saokar, R. Banakar, and S. Siddamal, "High speed signed multiplier for digital signal processing applications," in 2012 IEEE International Conference on Signal Processing, Computing and Control, pp. 1–6, IEEE, 2012.
- [20] A. R. Prakash and S. Kirubaveni, "Performance evaluation of fft processor using conventional and vedic algorithm," in 2013 IEEE International Conference ON Emerging Trends in Computing, Communication and Nanotechnology (ICECCN), pp. 89–94, IEEE, 2013.
- [21] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolic, "Digital integrated circuits (vol. 2, pp. 1-761)," 2002.
- [22] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "High-speed, energy-efficient 2-cycle multiply-accumulate architecture," in 2009 IEEE International SOC Conference (SOCC), pp. 119–122, IEEE, 2009.
- [23] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "A high-speed, energy-efficient two-cycle multiply-accumulate (mac) architecture and its application to a double-throughput mac unit," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 57, no. 12, pp. 3073–3081, 2010.
- [24] M. Mewada and M. Zaveri, "An input test pattern for characterization of a full-adder and n-bit ripple carry adder," in 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 250–255, IEEE, 2016.



**Parthiv Bhau** has been pursuing his PhD at the Institute of Technology, Nirma University, since October 2020. He received his BE degree in Electronics and Communication Engineering from Gandhinagar Institute of Technology, Gujarat University, in 2010. He completed his MTech (EC-CSE) at Charusat University, Changa, in 2013. From 2013 to 2016, he worked as an Assistant Professor at Amiraj College of Engineering and Technology in the Department of Electronics and Communication Engineering. He served as an Assistant Professor at the LJ Institute of Engineering and Technology from 2016 to 2018 and as a Design Engineer at Sankalp Semiconductor in physical design (ASIC) from 2018 to 2019. From 2019 to 2020, he worked as an Assistant Professor at Parul University. His research interests include VLSI design and physical design.



**Vijay Savani** has been working as an Assistant Professor in the Electronics and Communication Engineering Department since July 2005. He has more than 20 years of experience in teaching, research, and industry. He obtained his B.E. in Electronics and Communication Engineering in 2000 from Shantilal Shah Engineering College, Maharaja Krishnakumarsinhji Bhavnagar University. He completed his M.Tech. in VLSI Design in 2011 and his Ph.D. in 2018 at the Institute of Technology, Nirma University. He has published more than thirty research papers on embedded systems and VLSI design in international refereed journals and has presented or published more than 20 papers in international conferences and proceedings. He has completed four research projects funded by GUJCOST and Nirma University. Dr Savani is a reviewer for international refereed journals in Analog and Mixed Signal Circuits, VLSI Design, and Embedded Systems. His areas of interest include embedded system design, digital system design, IoT, reconfigurable hardware architecture, and VLSI design. He is a Life Member of ISTE and an Associate Member of CSI.