An Efficient Discrete Wavelet Transform Architecture with Low Power and Multiplier-Less Structure for Pervasive Biomedical Image Processing Application

INTRODUCTION: Over the past several years, image analysis has moved from large systems to pervasive portable devices. For example, in pervasive biomedical systems such as a PACS (Picture Archiving and Communication System), computing is the main element. Image processing applications for biomedical diagnosis need efficient and fast algorithms and architectures. Future pervasive systems designed for biomedical applications should provide computational efficiency and portability. The discrete wavelet transform (DWT) implemented on-chip has been used in several applications such as data and audio signal processing and machine learning. OBJECTIVES: The conventional convolution-based scheme is easy to implement but requires more memory, power and delay. The conventional lifting-based architecture has multiplier blocks, which increase the critical path delay. Designing the wavelet transform without multipliers is an effective approach, especially for 2-D image analysis. A multiplier-less implementation of the Daubechies wavelet in the forward and inverse transforms may prove efficient. The objective of this work is to obtain a low-power, low-delay architecture. METHODS: The proposed lifting scheme for the two-dimensional architecture reduces the critical path through a multiplier-less design and provides low power, low area and high throughput. The proposed multiplier is delay efficient. RESULTS: The architecture is multiplier-less in the predict and update stages. It was implemented on an FPGA using Quartus II 9.1, and power consumption was found to be reduced by approximately 56%. The delay is also reduced owing to the multiplier-less architecture. CONCLUSION: The multiplier-less architecture provides lower delay and low power. The observed power is in the milliwatt range, and the design is suitable for high-speed applications owing to its low critical path delay.


Introduction
In recent years, image analysis has moved from large systems to portable devices. For example, in biomedical systems such as a PACS (Picture Archiving and Communication System), computing is the main element. The speed and density of electronic components have increased exponentially to meet these requirements. The implementation of various methods for analysing biomedical signals and images has grown strongly in recent years. Among them, the DWT plays a crucial part in image and signal processing. Compared to other transforms, wavelet methods are flexible in design and can be implemented easily in programmable arrays. The convolution-based and lifting-based architectures each have their own advantages and disadvantages. Multipliers are the basic building blocks; they occupy more area, consume more power and lead to longer latency. This limits the density and computing power of integrated circuits.
Also, grid integration may not be necessary or required all the time, so there has to be a controller for the wind energy source to operate the system effectively either in grid-connected mode or in islanded mode, as per the requirement (Yong zheng Zhang et al., 2013).
Nearby planetary group has been incorporated at the mark of regular coupling of the framework as an option sustainable power source (Mohammad Saleh Marhaba et al., 2018; A. Barkia et al., 2016; Raúl Sarrias-Mena et al., 2014). The framework is likewise coordinated with other RL load, enlistment engine load which can likewise be all the while worked addressing equal burdens activity. The framework is additionally made to work in relationship with the fixed speed wind energy frameworks. There might be unusual force quality issues inside the network association and to guarantee the appropriate activity FACTS gadget is likewise coordinated at the mark of regular coupling.
This paper includes the plan of variable speed wind turbine with a multimode control methodology (E.  , 2007) and breaking down the activity of the proposed framework with various experiments to guarantee the effective activity of the suggested framework under ordinary and blamed conditions. The paper is organized as follows: Section I presents the prologue to the proposed framework. Section II manages the square chart clarification of the proposed framework with the equal units and their numerical displaying. Section III manages the conversation of reproduction work and results. Section IV manages the ends drawn from the proposed work.

Literature Survey
In the literature, several authors have presented algorithms and implementations of wavelets for image processing applications (Chaitali, 1995). These architectures were used for 1-D and 2-D analysis (Lewis et al., 1991). Generic architectures based on the biorthogonal wavelet transform were popular in the past. These systems were implemented using generic structures and scalable architectures (Shahidmasud and John, 2001), in which the word lengths are parameterized. The multiplier-less implementation can be extended to a systolic array architecture that includes multipliers (Nayak S, 2005). The array architecture uses a single clock cycle for all filter coefficients, and it is advantageous because it achieves higher utilization with fewer registers. Designing the wavelet transform without multipliers is an effective approach, especially for the 2-D Daubechies wavelet transform used in image analysis. A multiplier-less implementation of the Daubechies wavelet in the forward and inverse transforms may prove efficient. Various filters such as the 9/7 and 5/3 filters have been used, with bit shifts employed to reduce the adder count (Pramod Kumar Mehar et al., 2015). Pipelining reduces the power-delay product.

Problem Statement and Objective
The objective is to design a multiplier-free DWT architecture that is efficient in power, area or computation. Since power in large-scale integration depends directly on area, speed and supply voltage, the optimization can be carried out in only one parameter at a time. As discussed, several methods have been implemented in the literature. These methods are effective, but certain aspects can be improved, such as the critical path delay or the use of new device approaches compared to CMOS. The critical delay in particular is an important parameter; it arises from the multipliers and should be addressed. The processing elements in the signal processing block, consisting of adders, multipliers and so on, lead to the critical delay, and the multiplier unit contributes the major part of it. This work addresses the problems caused by the multiplier. The paper presents a detailed investigation of the power and delay attributable to the multiplier, and a multiplier-less architecture using CMOS is presented. The main aim is to design a multiplier-less lifting-based DWT architecture using CMOS circuits on an FPGA.

Background Methodology
The literature survey above presented the different methodologies adopted for the design of DWT architectures.

Discrete Wavelet Transform (DWT)
Unlike the Fast Fourier Transform (FFT), which is extensively used to analyze stationary quantities, the DWT is suited to non-stationary signal analysis (Chakrabarti, Viswanath, 1996). The analyzed signal is band-pass filtered into the DWT decomposition frequency bands (C. Chakrabarti et al., 1993). The time information of the signal is preserved by the wavelet representation. The filtering is carried out by the transform itself. Using various levels of decomposition, the bands associated with the signal are determined (P. P. Vaidyanathan, 1987).

Lifting scheme
The lifting scheme is an alternative to the convolution-based architecture when high throughput is required (

Architecture
The lifting-based DWT architecture with two predict and two update stages is shown in Figure 1. After the initial split into odd and even samples, the even data is applied to the predict step. Next, the even samples are updated using the newly calculated odd samples so that the necessary properties are maintained. All samples are transformed by applying these steps repeatedly to the samples of the input signal. When designing the processor core, the controller should choose the proper lifting coefficient for each clock cycle. The even and odd clocks should drive the registers in each half of the structure. For example, the pixels in an image processing application are split into even and odd samples, since they arrive serially at a rate of one pixel per clock cycle.
The DWT architecture gives a low-pass and a high-pass coefficient at the output for each pair of pixel values. The lifting coefficients corresponding to the filter coefficients are supplied. The update and predict blocks have multipliers, so the execution time of the DSP algorithm can be minimized using a high-speed multiplier (
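The two predict and two update stages described above can be sketched in software as follows. This is only an illustrative floating-point model, not the hardware design of this paper: the constants are the widely published lifting coefficients of the 9/7 filter, and the scaling convention, the symmetric boundary handling and the function names `forward_97` and `inverse_97` are assumptions made for the example.

```python
# Widely published 9/7 lifting constants (assumed; this paper does not list them).
ALPHA = -1.586134342    # predict 1
BETA = -0.05298011854   # update 1
GAMMA = 0.8829110762    # predict 2
DELTA = 0.4435068522    # update 2
K = 1.149604398         # scaling factor (one common convention)


def forward_97(x):
    """One level of 9/7 lifting on an even-length sequence; returns (low, high)."""
    s = [float(v) for v in x[0::2]]  # even samples
    d = [float(v) for v in x[1::2]]  # odd samples
    n = len(d)
    for i in range(n):  # predict 1: odd samples from neighbouring even samples
        d[i] += ALPHA * (s[i] + s[min(i + 1, n - 1)])
    for i in range(n):  # update 1: even samples from neighbouring predicted samples
        s[i] += BETA * (d[max(i - 1, 0)] + d[i])
    for i in range(n):  # predict 2
        d[i] += GAMMA * (s[i] + s[min(i + 1, n - 1)])
    for i in range(n):  # update 2
        s[i] += DELTA * (d[max(i - 1, 0)] + d[i])
    low = [v * K for v in s]
    high = [v / K for v in d]
    return low, high


def inverse_97(low, high):
    """Undo forward_97 by running the lifting steps in reverse order."""
    s = [v / K for v in low]
    d = [v * K for v in high]
    n = len(d)
    for i in range(n):
        s[i] -= DELTA * (d[max(i - 1, 0)] + d[i])
    for i in range(n):
        d[i] -= GAMMA * (s[i] + s[min(i + 1, n - 1)])
    for i in range(n):
        s[i] -= BETA * (d[max(i - 1, 0)] + d[i])
    for i in range(n):
        d[i] -= ALPHA * (s[i] + s[min(i + 1, n - 1)])
    x = []
    for even, odd in zip(s, d):
        x += [even, odd]  # interleave even and odd samples again
    return x
```

Each pair of input samples yields one low-pass and one high-pass coefficient, and every lifting step costs one addition of two neighbours followed by one multiplication by a constant, which is the multiplication that a multiplier-less design has to replace.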

Architecture for 2D Analysis
For image processing, the DWT structure should have a 1-D row processor (RP) and a column processor (CP) operating on the image matrix. An internal memory such as an SRAM stores the intermediate 1-D row-processed coefficients. The Z-scanning method, commonly known as the dual line scan based structure, reduces the latency. This scanning is independent of the image size. In addition, fewer registers are used than in conventional line-based architectures (Huang et al., 2000).
Hardware architectures can be optimized to accommodate different scaling factors, which reduces the area, power and latency. Pipelining reduces the critical path due to the multiplier (Dillen et al., 2003), at the cost of a minor increase in latency.
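The row-then-column organisation can be sketched as below. To keep the example short, an integer Haar lifting step stands in for the 9/7 processing element, so this illustrates only the data flow between RP, intermediate storage and CP. The function names and the LL/LH/HL/HH naming order are assumptions for the example.

```python
def haar_lift_1d(x):
    """Integer Haar lifting, used here only as a simple stand-in 1-D processor."""
    even, odd = x[0::2], x[1::2]
    high = [o - e for e, o in zip(even, odd)]         # predict step
    low = [e + (h >> 1) for e, h in zip(even, high)]  # update step
    return low, high


def column_pass(block):
    """Apply the 1-D transform to every column of a list-of-rows block."""
    lows, highs = [], []
    for col in zip(*block):
        lo, hi = haar_lift_1d(list(col))
        lows.append(lo)
        highs.append(hi)
    # Transpose back so the results are stored row by row again.
    return [list(r) for r in zip(*lows)], [list(r) for r in zip(*highs)]


def dwt2d_one_level(image):
    """One-level separable 2-D DWT: row processor first, then column processor."""
    row_low, row_high = [], []  # plays the role of the intermediate memory
    for row in image:
        lo, hi = haar_lift_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = column_pass(row_low)
    hl, hh = column_pass(row_high)
    return ll, lh, hl, hh
```

For an N × N image with even N, each of the four sub-bands has size N/2 × N/2. In this software model the whole intermediate result is stored before the column pass starts, which is the storage cost that scan-order optimisations aim to cut down.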

Z-scanning method
The 2-D DWT architecture is implemented in direct mode when the direct scan methodology is used. In the direct scan method, a row-wise 1-D DWT is performed first, followed by a column-wise 1-D DWT, and the results are stored in a memory of size N². Performing a one-level 2-D DWT this way requires a complete external memory of size 2N², which may increase energy consumption (Acharyya, A., et al., 2009). The line-based method uses internal line buffers to store the intermediate coefficients; the line buffer size depends on the input frame size. In the block-based approach, the inputs are divided into individual blocks and supplied to a parallel architecture. This is advantageous with regard to its internal cache, but the addressing makes it inefficient for streaming applications. During the scanning process, the pixels in the first row of the input frame are read first, the next two pixels are read during the next clock cycle, and this process is repeated for every row in the input frame. Here the RP performs the row transform of adjacent rows, and the 1-D column processing of the boundary-treated data then starts. For a frame of size N × N, the read operation takes N²/2 clock cycles in the scanning process. Z-scanning generally reduces the memory required for transposition. Registers D1, D2, D3 and D4, loaded with the appropriate values, handle the row processing and the boundary treatment at the frame boundary; for boundary treatment, the registers are initialized with zeros. Figure 2 illustrates the Z-scanning process.
EAI Endorsed Transactions on Pervasive Health and Technology 01 2023 -02 2023 | Volume 9 | Issue 1 | e1. Maram Anantha Guptha, Surampudi Srinivasa Rao, Ravindrakumar Selvaraj
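One plausible reading of the Z-scan read order is sketched below. It assumes that two horizontally adjacent pixels are read per clock cycle and that the scan alternates between two adjacent rows before moving to the next column pair. The exact order used in the proposed hardware is not spelled out in the text, so the generator `z_scan` is a hypothetical illustration only.

```python
def z_scan(n):
    """Yield, for each clock cycle, the two (row, col) pixels read from an n x n frame.

    Assumed order: for each pair of adjacent rows, walk across the columns two
    at a time, reading the upper-row pair and then the lower-row pair.
    """
    for r in range(0, n, 2):
        for c in range(0, n, 2):
            yield ((r, c), (r, c + 1))          # two pixels from the upper row
            yield ((r + 1, c), (r + 1, c + 1))  # two pixels from the row below


def read_cycles(n):
    """Number of clock cycles needed to read the whole frame."""
    return sum(1 for _ in z_scan(n))
```

With two pixels per cycle, `read_cycles(n)` equals N²/2 for an even frame size, which agrees with the cycle count stated above. Because the two rows of a pair arrive interleaved, the column processor can start on a 2 × 2 neighbourhood without waiting for a full row.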

DWT Architecture with Multiplier
A pipelined lifting-based DWT architecture with a reduced multiplier structure is illustrated in Figure 3. The critical path to the adder is reduced by the use of shift-and-add multipliers. The DWT is used in a variety of applications such as signal processing, machine learning, signal coding, data compression, data hiding, data interpretation, geophysics, motion tracking and meteorology. It is used in situations where scalability and acceptable degradation are necessary. Like the discrete cosine transform, the DWT offers a good compression ratio; in addition, it has no blocking artefacts, excellent localisation in both the frequency and time domains, inbuilt scaling and greater flexibility. The implementation is done in CMOS technology.

Proposed Method: DWT without Multiplier
This section describes the data flow of the 9/7 lifting steps in the proposed two-input/two-output 1-D row processing element. In the first lifting stage, the predict module receives the even and odd inputs in its first clock cycle. The previous even input and the present even input (s^0_i, s^0_{i+1}) are added together. Next, the multiplication is performed using shift-and-add arithmetic. Then, during the fourth clock cycle, the result of the multiplication is added to the odd input sample (d^0_i) to calculate the first predict coefficient (d^1_i). The first update value is computed from the past and present predict values and the input during the fifth clock cycle. Because only adders are used for computation in the data flow graph, the critical path is reduced to one adder delay. Pipelining increases the speed of operation. The RP and CP stages consist of 2D/N delay registers, which are used for the row and column processors. In the proposed method, the first lifting stage is constructed using seven adders/subtractors and four shifters; similarly, the second lifting stage uses ten adders and eight shifters. The predict and update stages of the 1-D row and column processing elements in the four-stage pipelined design are fully pipelined, with the pipeline registers denoted by vertical dashed lines. The input of the RP needs five delay registers: two for retaining the even inputs and three for the predict module. To compensate for the three- and two-clock-cycle delays that occur in the pipeline stages of the predict 1 and update 1 modules, the predict 1 stage a/e of RP and CP are delayed by six clock cycles. Likewise, delays of six and seven clock cycles are introduced in the update 1 output and predict 2, respectively. The outputs of the row predict 2 and update 2 processes are connected to the column processor for column processing.
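The shift-and-add idea behind the first predict step can be sketched as follows. The decomposition of the predict constant into the powers of two in `ALPHA_TERMS` is an assumption chosen for illustration (it approximates the commonly published value -1.586134342 by -1.5859375). The actual term selection, word lengths and adder count of the proposed design are not given in the text, and the function names are hypothetical.

```python
# Hypothetical approximation: -(1 + 1/2 + 1/16 + 1/64 + 1/128) = -1.5859375
ALPHA_TERMS = [(-1, 0), (-1, 1), (-1, 4), (-1, 6), (-1, 7)]  # (sign, right shift)


def shift_add(x, terms):
    """Multiply integer x by sum(sign * 2**-shift) using only shifts and adds."""
    acc = 0
    for sign, shift in terms:
        part = x >> shift  # arithmetic right shift replaces the multiplication
        acc = acc + part if sign > 0 else acc - part
    return acc


def predict_first(d0, s_cur, s_next):
    """First predict step: d1 = d0 + alpha * (s_cur + s_next), without a multiplier."""
    return d0 + shift_add(s_cur + s_next, ALPHA_TERMS)


def approx_value(terms):
    """Constant actually realised by a list of (sign, shift) terms."""
    return sum(sign * 2.0 ** -shift for sign, shift in terms)
```

The order of operations mirrors the data flow described above: the two even inputs are added first, the sum is scaled by shifts and adds, and the result is added to the odd sample. Each right shift truncates low-order bits, so in a fixed-point design the input word would normally carry extra fractional bits to keep the rounding error small.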

Transposing unit
The overall 2-D DWT relies on the transposing unit (TU), a switching mechanism that feeds two alternate 1-D row coefficients to the CP. It uses five registers and two 2 × 1 multiplexers, and is therefore considered efficient. Owing to the use of Z-scanning, the proposed method is power and area efficient and does not depend on the size of the image. The column processing is synchronized with the row processing of the Z-scan in consecutive cycles. The method is a line-based architecture, and the CP has to wait for some time for an entire row.

D Flip Flop
In digital circuits, data is generally handled as a group of bits representing numbers and codes. It is therefore simple to take the data on parallel lines and save it in successive flip-flops arranged linearly. Registers are constructed by connecting D flip-flops, since registers are the general multi-bit storage devices that can hold multiple bits of data. The same clock input is given to every flip-flop, each of which is connected to a separate data input. When a positive-edge-triggered clock signal is applied, each flip-flop stores the data present at its own D input.

Data Transfer
D flip-flops are connected in cascade with a common clock signal to construct shift registers, which transfer data and are widely used in data transfer applications. On each clock pulse the data is shifted, or transferred, by one stage. Shift registers are used to store data temporarily and are widely applied in serial-to-parallel and parallel-to-serial data conversion. They are also used in pulse extenders, delay circuits and similar applications.
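The behaviour of a register chain built from D flip-flops can be modelled in a few lines. This is a behavioural sketch only, with hypothetical class names, intended to show how a cascade of flip-flops on a common clock acts as a delay line and as a serial-to-parallel converter.

```python
class DFlipFlop:
    """Positive-edge-triggered D flip-flop holding one value."""

    def __init__(self):
        self.q = 0

    def clock(self, d):
        """Capture d on the clock edge and return the value held before the edge."""
        previous = self.q
        self.q = d
        return previous


class ShiftRegister:
    """Cascade of D flip-flops driven by the same clock."""

    def __init__(self, depth):
        self.stages = [DFlipFlop() for _ in range(depth)]

    def clock(self, d):
        """Shift d in; each stage takes the old output of the stage before it."""
        for ff in self.stages:
            d = ff.clock(d)
        return d  # value shifted out of the last stage

    def parallel_out(self):
        """Read all stages at once (serial-to-parallel conversion)."""
        return [ff.q for ff in self.stages]
```

A register of depth 3 delays its input by three clock cycles, which is how the delay registers in the row and column processors can align samples that pass through pipeline stages of different lengths.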

Simulation and Results
The RTL views of the existing lifting-based DWT architecture, with and without multipliers, are shown in Figures 7 and 9. The power analysis output of the lifting-based DWT without multipliers is shown in Figure 8.

Conclusion
The area, power and critical path are reduced by the new multiplier-less predict and update model for the lifting-based DWT architecture. The traditional method suffers from its use of multipliers; the proposed architecture overcomes this by using adders and shift registers in place of multipliers. The proposed multiplier-less architecture is found to be effective. As a future enhancement, further pipelining can be applied to the proposed design to increase the speed of the overall device. The power consumed is reduced by 56%. In future, the DWT architecture can be applied to image processing areas using the proposed methodology.