EFFICIENT FPGA BASED MATRIX MULTIPLICATION USING MUX AND VEDIC MULTIPLIER

Most algorithms used in DSP, image and video processing, computer graphics, vision and high-performance supercomputing applications require multiplication and matrix operations as the kernel operation. In this paper, we propose an efficient FPGA-based matrix multiplication using MUX and Vedic multipliers. The 2x2, 3x2 and 3x3 MUX-based multipliers are designed. These basic lower-order MUX-based multipliers are used to design higher-order MxN multipliers using the Urdhva Tiryakbhyam Vedic approach. The proposed multiplier is used for image processing applications. It is observed that the device utilization and combinational delay are lower in the proposed architecture than in existing architectures.


INTRODUCTION
Matrix multiplication plays an important role in image processing applications. Computation algorithms for image processing, video processing, numerical analysis and computer graphics involve matrix operations as the kernel operation. The performance of matrix multiplication is evaluated with respect to the speed and area of the hardware system. Matrix-vector multiplication generally requires several Multiply Accumulate (MAC) units. Traditional DSP processors are limited in how many multiplications and additions they can perform in parallel, and therefore take several clock cycles to complete all the necessary MAC operations. Field Programmable Gate Arrays (FPGAs) support parallel processing and are equipped with large embedded resources. Modern FPGAs can provide higher processing rates with efficient resource utilization. Multipliers, which form the basic computational unit in FPGAs, play a major role in increasing the speed of intensive matrix computations for image processing applications. Vedic algorithms [1] used in multiplication reduce the delay in generating partial products, because the architecture is based on the vertical and crosswise structure of ancient Vedic mathematics, which generates the partial products and their sums in a minimum number of clock cycles. This approach also has the advantage of a modular design: low-order multipliers are used to build higher-order multipliers of any size, which reduces design complexity when dealing with a larger number of bits.
Contribution: In this paper, an efficient FPGA-based matrix multiplication using MUX and Vedic multipliers is proposed. The MxN multiplier is designed using low-order MUX-based multipliers and the Vedic multiplier.
Organization: The paper is organized as follows: Section II reviews the related work. Section III gives details of the methodology. Section IV provides the performance analysis and results. Section V provides the conclusion followed by future work.

RELATED WORK
Purushottam D. Chidgupkar and Mangesh T. Karad [2] proposed a multiplication process based on Vedic mathematics and its implementation on 8085 and 8086 microprocessors. A comparative study of the processing time of conventional multipliers on the 8085 and 8086 showed an appreciable saving in processing time for the Vedic multiplier compared with a conventional multiplier. Implementing Vedic algorithms on a specially designed BCD architecture would also help to enhance processor throughput. Honey Durga Tiwari et al., [3] proposed multiplier and square architectures based on algorithms of ancient Indian Vedic mathematics for low-power and high-speed applications. The approach is based on generating all partial products and their sums in one step. The multiplier using the Urdhva Tiryakbhyam sutra and the Nikhilam sutra was compared with array and Booth multipliers, and it was concluded that the Vedic multiplier is faster than both the array multiplier and the Booth multiplier. Sumit Vaidya and Deepak Dandekar [4] proposed an expanded Urdhva Tiryakbhyam sutra for a 16-bit multiplied output. A comparative study of different multipliers was done for low power requirement and high speed. The expanded Urdhva Tiryakbhyam sutra is not an efficient algorithm for the multiplication of large numbers, as considerable propagation delay is involved in such cases; to overcome this problem, the Nikhilam sutra was suggested.
Paramasivam and Sabeenian [5] proposed a method for decomposing a perfect binary multiplication into smaller sizes using the Nikhilam sutra, thereby reducing the computation time and power consumption. The algorithm was broadly divided into three parts: initialization, pre-processing and processing. The algorithm reduces a given 4-bit multiplication to a 2-bit multiplication by making use of basic shifting and addition operations, as a result of which the carry propagation of a standard 4x4-bit multiplier is reduced to a great extent. Jayaprakasan et al., [6] discussed the use of an ancient (Vedic) mathematical approach for building an ALU. The low-power operation of the circuit was validated by designing a conventional CMOS counterpart whose power was compared with that of the ancient arithmetic design. A binary 4x4 array multiplier and an Urdhva Tiryakbhyam Vedic multiplier were taken for the comparison. It was concluded that Urdhva Tiryakbhyam is best for multiplication with respect to the number of adders and power consumption. Shamim Akhter [7] proposed a VHDL implementation of an NxN multiplier based on Vedic mathematics. This gives less computation time for calculating the NxN-bit multiplication result, and the design of the Urdhva sutra based multiplier used a bottom-up design methodology. With increased modularity, the design complexity is reduced for inputs with a large number of bits.
Himanshu Thapliyal et al., [8] proposed a design for square and cube architectures. Parallel architectures for computing the square and cube of a given number based on ancient Indian Vedic mathematics were discussed, and the Vedic square and cube architectures were found to be faster than conventional square and cube calculations. Vaithiyanathan et al., [9] presented a comparative study of different multipliers for low power requirement and high speed, and also described the Urdhva Tiryakbhyam algorithm of ancient Indian Vedic mathematics, which is utilized for multiplication to improve the speed and area parameters of multipliers. Sriraman and Prabakar [10] proposed a multiplier architecture based on a ROM approach using Vedic mathematics. The proposed architecture is similar to that of a Constant Coefficient Multiplier (KCM). However, for the KCM one input has to be fixed, while their proposed multiplier can multiply two variables. Harpreet Singh Dhillon and Abhijit Mitra [11] proposed a Nikhilam sutra algorithm which was further optimized by the use of general arithmetic operations such as expansion and bit-shifting to take advantage of bit-reduction in multiplication. The algorithm was implemented by reducing a general 4x4-bit multiplication to a single 2x2-bit multiplication operation.
Naresh Naik et al., [12] proposed a multiplication technique using the Vedic mathematics formula Urdhva Tiryakbhyam, which means vertically and crosswise. All the operations in the Vedic multiplier were executed concurrently. Speed comparisons of this multiplier with a normal Booth multiplier were also presented. The results showed that the Urdhva Tiryakbhyam multiplier has a great impact on DSP applications, improving the execution speed of DSP processors when compared to other multipliers. Jang et al., [13] proposed matrix multiplication as the benchmark to compare the performance of FPGAs, DSPs and embedded processors. The results show that FPGAs can multiply two matrices with both lower latency and lower energy consumption than the other two types of devices, making the FPGA an ideal choice for matrix multiplication in signal processing applications. Belkacemi et al., [14] presented the design and implementation of a high-performance, fully parallel matrix multiplication core. The core was parameterized and scalable in terms of the matrix dimensions (i.e., number of rows and columns) and the input data word length. Fully floor-planned FPGA configurations were generated automatically, from high-level descriptions of the matrix multiplication operation, in the form of an Electronic Design Interchange Format (EDIF) netlist in less than one second.
January 24, 2014
Jianwen and Chuen [15] exploited the partial reconfigurability feature for the first time to compute matrix multiplication. Partially reconfigurable devices offer the possibility of changing the design implementation without stopping the whole execution process. The design was evaluated in terms of latency and area. Mahendra Vucha and Arvind Rajawat [16] presented an effective design for matrix multiplication using a systolic architecture on reconfigurable systems such as FPGAs. Here, the systolic architecture increases the computing speed by combining parallel processing and pipelining into a single concept. Syed M. Qasim et al., [17] presented a preliminary design and FPGA implementation of dense matrix-vector multiplication for use in an image processing application. The architecture was designed to multiply a large matrix and a vector. Nivedita A. Pande et al., [18] proposed a design methodology for high-speed multiplication, where two integers of n-bit size each are multiplied to produce a 2n-bit product.

METHODOLOGY
In this section, we introduce a new concept of MxN-bit multiplication based on multiplexers and the Urdhva Tiryakbhyam (vertically and crosswise) Vedic sutra. The disadvantage of direct multiplication using the Urdhva Tiryakbhyam concept for higher-order bits is that it requires more carry propagation, which results in more delay. The multiplexer-based multiplier combined with the Urdhva Tiryakbhyam sutra reduces this delay and minimizes the IC package count.
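To make the carry-propagation issue concrete, the following is a minimal Python sketch of direct bit-level Urdhva Tiryakbhyam multiplication. It is a software model written for illustration, not the hardware design proposed in this paper; the function name `urdhva_multiply` and the parameter `n_bits` are assumptions introduced here.

```python
def urdhva_multiply(a, b, n_bits):
    """Multiply two unsigned n_bits-wide integers column by column.

    Software illustration of the vertical-and-crosswise idea; the name and
    signature are hypothetical, not taken from the paper.
    """
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> i) & 1 for i in range(n_bits)]
    result = 0
    carry = 0
    for col in range(2 * n_bits - 1):
        # Crosswise step: add every bit product a_i * b_j with i + j == col.
        column_sum = carry
        for i in range(n_bits):
            j = col - i
            if 0 <= j < n_bits:
                column_sum += a_bits[i] & b_bits[j]
        # The low bit of the column sum is the result bit for this column.
        result |= (column_sum & 1) << col
        # Everything above the low bit must propagate to the next column.
        carry = column_sum >> 1
    # The final carry becomes the most significant result bit.
    result |= carry << (2 * n_bits - 1)
    return result


if __name__ == "__main__":
    print(urdhva_multiply(13, 11, 4))
```

In this model the carry handed from one column to the next grows with the operand width, which is the propagation cost that motivates building larger multipliers from small MUX-based blocks.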
i) Design of MUX based 2x2 and 3x2 multiplier: The 2x2 and 3x2 MUX-based multipliers use a 4:1 multiplexer, as shown in Fig. 1, with A as the multiplicand and B as the multiplier. The two bits of multiplicand A drive the two control lines S0 and S1. The four input lines are derived from multiplier B, which has two bits for the 2x2 multiplier and three bits for the 3x2 multiplier. The first input line is always 00 or 000. The second, third and fourth input lines receive B, which may be any combination of two or three bits depending on whether the multiplier is 2x2 or 3x2. The third input line is connected to a shift-left-by-one shifter, and the fourth input line is connected to a shift-left-by-one shifter along with an adder, so that the correct multiplication result appears at the output.

Figure 1: Hardware architecture of 2x2 and 3x2 multipliers
The truth table of the 2x2 MUX-based multiplier is given in Table 1. The first input line is always 00 or 000 and is selected by control bits 00. For control bits 01, the second line input is passed to the output unchanged, which is the product of A and B. For control bits 10, the third line input is shifted left by one and passed to the output, which is the product of A and B. For control bits 11, the fourth line input is shifted left by one, added to the unshifted input, and passed to the output, which is again the product of A and B.
Table 1: Truth table of
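The selection logic described above can be modelled in a few lines of Python. This is a behavioural sketch under the assumption that the four lines carry 0, B, B shifted left by one, and the shifted B plus B; the function names `mux_4to1` and `mux_multiply_4to1` are hypothetical and the printed rows are generated by the model, not copied from the paper's Table 1.

```python
def mux_4to1(select, lines):
    """Return the input line chosen by the 2-bit select value (S1 S0)."""
    return lines[select & 0b11]


def mux_multiply_4to1(a, b):
    """Multiply a 2-bit multiplicand A by a 2-bit or 3-bit multiplier B."""
    line0 = 0              # first line: always 00 or 000
    line1 = b              # second line: B passed through (1 * B)
    line2 = b << 1         # third line: B shifted left by one (2 * B)
    line3 = (b << 1) + b   # fourth line: shifted B plus B (3 * B)
    return mux_4to1(a, (line0, line1, line2, line3))


if __name__ == "__main__":
    # Enumerate the 2x2 case: one row per (A, B) combination.
    for a in range(4):
        for b in range(4):
            print(f"A={a:02b} B={b:02b} -> P={mux_multiply_4to1(a, b):04b}")
```

The same function covers the 3x2 case, since only the width of B changes while the select lines remain the two bits of A.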

DESIGN OF MUX BASED 3X3 MULTIPLIER:
The 3x3 multiplier using an 8:1 multiplexer is shown in Fig. 2. The three bits of multiplicand A drive the control lines S0, S1 and S2.
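A behavioural sketch of the 3x3 case follows. The text does not list how the eight input lines are formed, so this model assumes that line k carries k times B, built from shifted copies of B and additions in the same spirit as the 4:1 design; the names `mux_8to1` and `mux_multiply_3x3` are hypothetical.

```python
def mux_8to1(select, lines):
    """Return the input line chosen by the 3-bit select value (S2 S1 S0)."""
    return lines[select & 0b111]


def mux_multiply_3x3(a, b):
    """Multiply a 3-bit multiplicand A by a 3-bit multiplier B."""
    b2 = b << 1  # B shifted left by one (2 * B)
    b4 = b << 2  # B shifted left by two (4 * B)
    # Assumed line contents: line k holds k * B using only shifts and adds.
    lines = (0, b, b2, b2 + b, b4, b4 + b, b4 + b2, b4 + b2 + b)
    return mux_8to1(a, lines)


if __name__ == "__main__":
    print(mux_multiply_3x3(0b101, 0b110))
```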
The following steps are used for multiplication. Step 1: The groups C and F are multiplied using the 3x3 MUX-based multiplier. One 2x2 block, four 3x2 blocks and four 3x3 blocks of MUX-based multipliers, along with four adders, are used in the 8x8 multiplier using the Vedic concept, as shown in Fig. 4.
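The block count stated above is consistent with splitting each 8-bit operand into two 3-bit groups and one 2-bit group. The sketch below illustrates that decomposition in Python. The grouping order (low 3 bits, middle 3 bits, high 2 bits), the constant `GROUPS` and the function names are assumptions made for illustration, and the four-adder arrangement of Fig. 4 is not modelled; partial products are simply summed.

```python
from collections import Counter

# Assumed split of an 8-bit operand: (bit offset, width) for each group.
GROUPS = ((0, 3), (3, 3), (6, 2))


def mux_multiply(select, data, select_bits):
    """Model a MUX whose line k carries k * data, built from shifts and adds."""
    lines = []
    for k in range(1 << select_bits):
        lines.append(sum(data << i for i in range(select_bits) if (k >> i) & 1))
    return lines[select]


def split(x):
    """Return (value, offset, width) for each group of an 8-bit operand."""
    return [((x >> off) & ((1 << w) - 1), off, w) for off, w in GROUPS]


def multiply_8x8(a, b):
    """Return the 16-bit product and a count of the block types used."""
    product = 0
    used = Counter()
    for ga, off_a, wa in split(a):
        for gb, off_b, wb in split(b):
            # The narrower group drives the select lines of the MUX.
            if wa <= wb:
                partial = mux_multiply(ga, gb, wa)
            else:
                partial = mux_multiply(gb, ga, wb)
            used[f"{max(wa, wb)}x{min(wa, wb)}"] += 1
            # Weight the partial product by the positions of both groups.
            product += partial << (off_a + off_b)
    return product, dict(used)


if __name__ == "__main__":
    print(multiply_8x8(200, 100))
```

Under this grouping the nine group pairs break down into four 3x3, four 3x2 and one 2x2 multiplication, matching the block count given in the text.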

PROPOSED 16X16 MULTIPLIER USING 8X8 MULTIPLIER
The block diagram of the 16x16 multiplier using 8x8 multipliers is shown in Fig. 5. Four 8x8 multipliers along with two adders are used to implement the 16x16 multiplier. The two 16-bit numbers, a0 to a15 and b0 to b15, are considered. The proposed multiplier can be extended to any size MxN.
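One arrangement that matches the stated count of four 8x8 multipliers and two adders is sketched below in Python. The exact wiring of Fig. 5 is not reproduced here, so the adder placement is an assumption; `multiply_8x8` is a stand-in that uses Python's built-in multiplication in place of the MUX-based 8x8 block, and both function names are hypothetical.

```python
def multiply_8x8(a, b):
    """Stand-in for the 8x8 block (assumption: uses built-in multiplication)."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def multiply_16x16(a, b):
    """Build a 16x16 product from four 8x8 products and two additions."""
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF

    p_ll = multiply_8x8(a_lo, b_lo)  # weight 2^0
    p_lh = multiply_8x8(a_lo, b_hi)  # weight 2^8
    p_hl = multiply_8x8(a_hi, b_lo)  # weight 2^8
    p_hh = multiply_8x8(a_hi, b_hi)  # weight 2^16

    # First adder: sum of the two cross products.
    cross = p_lh + p_hl
    # p_ll fits in 16 bits, so placing p_hh above it is pure concatenation.
    outer = (p_hh << 16) | p_ll
    # Second adder: add the cross term at its 2^8 weight.
    return outer + (cross << 8)


if __name__ == "__main__":
    print(multiply_16x16(0xFFFF, 0xFFFF) == 0xFFFF * 0xFFFF)
```

Because the low and high products never overlap in bit position, combining them needs no adder, which is why two additions suffice in this sketch.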

PROPOSED MATRIX ARCHITECTURE FOR IMAGE PROCESSING APPLICATIONS
The proposed multiplier of any size, used in the multiplication of two matrices of any size for image processing applications, is shown in Fig. 6.

Figure 6: Architecture for matrix vector multiplication
The matrices A and C are considered for multiplication. The column vector elements of C and the corresponding row elements of A, shifted in serially, are multiplied by the proposed MUX-based multiplier. The result of the multiplier is stored in RAM. For image processing applications, the image is considered as matrix A and filter coefficients of any size can be considered as matrix C. The multiplication of any image with filter coefficients is performed using the proposed MUX-based multiplier.
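The data flow described above can be sketched as a simple software loop. This Python model assumes that the products for one row are accumulated before the result is written to RAM, which the text does not state explicitly; the names `mat_vec_multiply`, `reference_multiply` and `ram` are hypothetical, and the multiplier is passed in as a function so that any model of the proposed MUX-based multiplier could be substituted.

```python
def reference_multiply(x, y):
    """Stand-in for the proposed multiplier (assumption: built-in multiply)."""
    return x * y


def mat_vec_multiply(matrix_a, vector_c, multiply=reference_multiply):
    """Multiply matrix A by column vector C, one row of A at a time."""
    ram = []  # models the RAM that holds the output elements
    for row in matrix_a:
        accumulator = 0
        # Row elements of A are fed serially against the elements of C.
        for a_elem, c_elem in zip(row, vector_c):
            accumulator += multiply(a_elem, c_elem)
        ram.append(accumulator)
    return ram


if __name__ == "__main__":
    image_rows = [[1, 2], [3, 4]]   # toy stand-in for image matrix A
    coefficients = [5, 6]           # toy stand-in for filter vector C
    print(mat_vec_multiply(image_rows, coefficients))
```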

SYNTHESIS RESULTS
The performance parameters, namely the number of slices, the number of 4-input LUTs and the maximum combinational path delay, are considered to test the proposed multiplier using the Xilinx Spartan 3 FPGA family [19]. The performance parameters are measured for the proposed 8x8 and 16x16 multipliers and are tabulated in Table 1. It is observed that as the order of multiplication increases, the delay does not vary significantly. The performance parameters of the proposed multiplier are compared with the existing 8x8 multiplier presented by Pushpalatha and Mehta [20] and are tabulated in Table II. It is observed that the number of slices and the combinational delay are reduced in the proposed method compared to the existing method. The performance parameters of the proposed multiplier are compared with the existing 16x16 multiplier presented by Gurumurthy and Prahalad [21] and are tabulated in Table III. It is observed that the number of slices and the combinational path delay are reduced in the proposed method compared to the existing method. The matrix multiplication of 1028x28 and 28x1 is synthesized using the Xilinx virtex4 200ff1513 board [22]. The performance comparison of the proposed method with the existing method presented by Syed M. Qasim [23] is given in Table IV. It is observed that the number of slices and the combinational delay are reduced in the proposed method compared to the existing method.

CONCLUSION
Higher-order multipliers are required in image processing applications. In this paper, an efficient FPGA-based matrix multiplication using MUX and Vedic multipliers is proposed. The lower-order MUX-based multipliers are used with Vedic multipliers to design a novel higher-order multiplier of any dimensions. The proposed multiplier is used in image processing applications. It is observed that performance parameters such as area and delay are reduced compared to existing algorithms. In future, the multiplier can be designed using higher-order MUX-based multipliers.