Deep Learning Application Pros and Cons Over Algorithm

Deep learning is a relatively new area of machine learning research. Deep learning applies nonlinear transformations and high-level model abstractions to large databases. Recent developments show that deep learning has contributed greatly to artificial intelligence across various fields. This article reviews the contributions and new applications of deep learning. The main aim of this review is to give scholars a summary from which to analyse applications and algorithms. The review then investigates the main applications and the algorithms they use. In addition, the advantages of deep learning and of its hierarchical, nonlinear operation are introduced and compared with traditional algorithms in common applications. Three criteria were taken into consideration when choosing the areas of application: (1) the expertise or knowledge of the author; (2) whether the successful application of deep learning has changed the field, as in voice recognition, chatbots, search technology and vision; and (3) whether deep learning can have a significant impact on the application domain and benefit from recent research, as in natural language and text processing, information retrieval and multimodal information processing enabled by multitask deep learning. This review provides a general overview of the concept and of the growing benefits and popularity of deep learning, which can help researchers and students interested in deep learning methods.


Introduction
Since 2006, deep structured learning, more commonly referred to as Deep Learning (DL) or hierarchical learning, has become a new area of Machine Learning (ML) research. Over the past several years, the techniques developed from DL research have affected a wide range of signal and data processing tasks, within both the traditional and the new, wider scopes, including key aspects of ML and Artificial Intelligence (AI), and the media have reported on this progress. DL allows computational models composed of multiple processing layers to learn data representations with multiple levels of abstraction. These methods have greatly improved the state of the art in speech recognition, visual object recognition, object detection, and many other fields such as drug discovery and genomics. DL uses the backpropagation algorithm to discover complex structure in large data sets by indicating how a machine should change the internal parameters used to compute the representation in each layer from the representation in the previous layer. Deep Convolutional Networks (DCNs) have brought breakthroughs in processing images, video, speech and audio, while recurrent networks have shed light on sequential data such as text and speech. In recent years, a series of workshops, tutorials, special issues and special conference sessions have been dedicated to DL and its applications in many signal and data processing fields.

EAI Endorsed Transactions on AI and Robotics
Research Article
The different research application fields of DL are explored, and the chronological advancement of research is shown for each field. This review paper presents the significance of research in DL and shows how it is likely to shape the future. Its advantages and disadvantages are also discussed. In the next section, some notable studies are described. The article discusses facial recognition, speech recognition, image and handwriting recognition, virtual assistants, chatbots, healthcare, entertainment and music and, finally, robotics. Facial recognition, speech recognition and healthcare, where DL is used more than in other applications, are discussed in detail. Finally, the popular techniques used in DL are summarized.
Liu et al. (2017) [1] proposed a novel approach to face recognition using Convolutional Neural Networks (CNNs). They used the angular softmax (A-Softmax) loss to enable CNNs to learn discriminative face features with an angular margin. They showed that the A-Softmax loss has a clear geometric interpretation: it constrains the learned features to be discriminative on a hypersphere manifold, which inherently matches the prior that faces also lie on a nonlinear manifold. This connection makes A-Softmax very effective for learning face representations.
Tran et al. (2017) [2] proposed the Disentangled Representation learning Generative Adversarial Network (DR-GAN), with three distinct novelties. First, in addition to image synthesis, the encoder-decoder structure of the generator allows DR-GAN to learn a representation that is both generative and discriminative. Second, this representation is explicitly disentangled from other face variations such as pose, through the pose code provided to the decoder and the pose estimation in the discriminator. Third, DR-GAN can take one or more images as input and generate a unified representation along with any number of synthetic images.
Quantitative and qualitative evaluations on both controlled and in-the-wild databases show that DR-GAN is superior to the state of the art.
Yang et al. (2017) [3] proposed a Neural Aggregation Network (NAN) for video face recognition. The network takes a face video or a face image set (with a variable number of face images) as its input and generates a compact, fixed-size feature representation for recognition. The entire network consists of two modules. The feature embedding module is a deep CNN, which maps each face image to a feature vector. The aggregation module is composed of two attention blocks, which adaptively aggregate the feature vectors to form a single feature inside the convex hull that they span. Owing to the attention mechanism, the aggregation is invariant to the image order. NAN is trained with a standard classification or verification loss, without any additional supervision signal. The authors found that it automatically learns to favor high-quality face images while rejecting low-quality ones, such as blurred, occluded and improperly exposed faces.
In [4], the authors present FaceNet2ExpNet, a novel idea for training an expression recognition network on static images. They first proposed a new distribution function to model the high-level neurons of the expression network. On this basis, a two-stage training algorithm is designed. In the pre-training stage, they train the convolutional layers of the expression network, regularized by the face network; in the refinement stage, they append fully connected layers to the pre-trained convolutional layers and train the entire network jointly. Visualization results show that the model trained with their method captures improved high-level expression semantics. Evaluation on four public expression databases, CK+, Oulu-CASIA, TFD and SFEW, shows that their method achieves better results than other methods.
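The order invariance of the attention-based aggregation described above for NAN can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `aggregate`, the fixed query vector and the toy feature vectors are hypothetical, and a single attention block with a fixed query stands in for the two learned blocks. Because softmax weights are non-negative and sum to one, the pooled vector is a convex combination of the inputs, and reordering the inputs does not change it.

```python
import math

def aggregate(features, query):
    """Attention-weighted average of feature vectors (one block, fixed query)."""
    # Score each feature vector by its dot product with the query (assumed scoring rule).
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    # Softmax: non-negative weights that sum to one.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum yields one fixed-size vector regardless of the number of inputs.
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Toy per-frame feature vectors (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
pooled, weights = aggregate(frames, query)
pooled_reversed, _ = aggregate(list(reversed(frames)), query)
```

In this sketch `pooled` and `pooled_reversed` agree up to floating-point rounding, which mirrors the order invariance claimed for the attention mechanism.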
In [5], the authors proposed a comprehensive CNN-based framework to overcome the challenges of Video-based Face Recognition (VFR). First, to learn blur-robust face representations, they artificially blurred training data composed of clear still images to make up for the lack of real-world video training data. Trained on both the still images and the artificially blurred data, the CNN is encouraged to learn blur-insensitive features automatically. Second, to enhance the robustness of CNN features to pose changes and occlusions, they proposed a trunk-branch ensemble CNN model (TBE-CNN), which extracts complementary information from the holistic face image and from patches cropped around facial components. TBE-CNN is an end-to-end model that extracts features efficiently by sharing the low-level and mid-level convolutional layers between the trunk and branch networks. Third, to further improve the discriminative power of the representations learned by TBE-CNN, they proposed an improved triplet loss function. Systematic experiments demonstrate the effectiveness of the proposed techniques. Most impressively, TBE-CNN achieved state-of-the-art performance on three popular video face databases (PaSC, COX Face and YouTube Faces). Using the proposed techniques, this research won first place in the BTAS 2016 video person recognition evaluation.
Nech et al. (2017) [6] posed several questions. Are face recognition algorithms really very different from one another? Are large datasets the real key to more accurate facial recognition? Where does facial recognition need to improve? To answer such questions, the researchers created a benchmark, MF2, which requires all algorithms to be trained on the same data and tested at the million scale. MF2 is a large public collection with 672K identities and 4.7 million photos, intended to create a level playing field for large-scale face recognition.
They compare their results with those of two other large benchmarks, MegaFace Challenge and MS-Celebs-1M, in which groups may train on any private or public, large or small collection. Some key findings are: (1) algorithms trained on MF2 can reach the state of the art, with results comparable to those of algorithms trained on large-scale private sets; (2) some algorithms perform well after training on MF2; and (3) as with MegaFace, invariance to aging remains weak, with lower accuracy, which suggests that future tests may need larger age variations within identities or adjustments to the algorithms.

DL on Facial Recognition
Wang et al. (2018) [7] stated that the traditional softmax loss of deep CNNs usually lacks discriminative power. Hence, they proposed a new loss function, the Large Margin Cosine Loss (LMCL), which pursues the goal of maximizing inter-class variance and minimizing intra-class variance from a different perspective. They L2-normalize both the features and the weight vectors to remove radial variations, reformulate the softmax loss as a cosine loss, and introduce a cosine margin term to further maximize the decision margin in angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved through normalization and maximization of the cosine decision margin. They call this approach CosFace. On the MegaFace Challenge, CosFace shows its superiority for both the identification and the verification tasks under both protocols.
Mittal et al. (2018) [8] designed a mobile application that performs real-time recognition of multiple faces using CNNs. Their model implemented a ResNet network. It tries to learn a mapping from face images to a compact 128-D Euclidean space in which the distance between two vectors reflects face similarity. The method optimizes the Inception model using a triplet loss function and curriculum learning. They used the Labeled Faces in the Wild dataset to train their model. After training, they extracted features and classified the faces. Finally, data augmentation was used to apply transformations of varying degrees to the photos before identification. They found that recognition precision dropped as the number of people in the photos or videos increased.
Cheng et al. (2018) [9] proposed two face recognition methods based on multiple CNN classifiers. In both methods, the CNNs share the same structure but are trained on different training sets.
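The large margin cosine loss of Wang et al. described above can be made concrete with a small sketch for a single training example. It follows the steps named in the text (L2-normalize features and class weights, work with cosines, subtract a margin from the target-class cosine), but it is only an illustration under stated assumptions: the function names are hypothetical, and the scale `s=30.0` and margin `m=0.35` are illustrative values rather than settings taken from this review.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, which removes radial variation."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def lmcl(feature, class_weights, label, s=30.0, m=0.35):
    """Large margin cosine loss for one example (illustrative s and m)."""
    x = l2_normalize(feature)
    # Cosine similarity between the feature and every normalized class weight vector.
    cosines = [sum(a * b for a, b in zip(x, l2_normalize(w))) for w in class_weights]
    # Subtract the margin only from the target-class cosine, then scale all logits.
    logits = [s * (c - m) if j == label else s * c for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp, then the usual cross-entropy form.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]

# Toy two-class example with hypothetical values.
weights = [[1.0, 0.0], [0.0, 1.0]]
feature = [1.0, 1.0]  # equally close to both class directions
loss_with_margin = lmcl(feature, weights, label=0)
loss_without_margin = lmcl(feature, weights, label=0, m=0.0)
```

With the margin set to zero the function reduces to a softmax loss on scaled cosines; a positive margin makes the same example costlier, which is what pushes the target-class cosine to exceed the others by at least `m`.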
The first method used random sampling with replacement to generate different training sets from the original data set and trained an individual CNN on each. The second method generated different training sets by enlarging and reducing the image size of the original data set. Both methods then used voting to merge the results of the individual CNNs. Experiments showed that the ensemble of CNN classifiers outperforms a single CNN classifier, with face recognition accuracy as high as 99.5%.
Deng et al. (2019) [10] introduced ArcFace, which uses an additive angular margin loss with deep CNNs for facial recognition. Their experiments showed that ArcFace surpasses CosFace, SphereFace, center loss, VGGFace and other open-source face recognition methods by a fair margin. They also presented comprehensive comparisons with other types of losses used in facial recognition, including the code of each study.
Rabiner (1997) [11] described the capabilities of speech recognition systems at the time, how they were used in the services and applications of the day, and how they would evolve into the next generation of voice recognition services. He pointed out that there are different types of speech recognition, including spontaneous, conversational and hybrid recognition. He also showed that vocabulary sizes reach 64,000 words but come with error rates as high as 50%. He stated that, although the main application of speech recognition in the future would be telecommunications, it could be extended to services such as voice dialling, voice banking, voice assistance, call completion, agent technology, customer care, computer-telephony integration, voice dictation and so on.
Li et al. (2004) [12] discussed several challenges in speech recognition. There is a large gap between human and machine speech recognition. Without an underlying recognition technology that offers sufficient robustness and low error rates, the mainstream adoption of speech recognition is impossible.
One of the research programs they describe is a computational model based on pronunciation, linguistics and some fundamental aspects of human speech communication. Based on the general principles of computational phonetics and phonology, a quantitative statistical model is established, and advanced algorithms can be developed to optimize its parameters automatically. These parameters have physical significance. The other program builds on aspects of human speech perception that run counter to the frame-by-frame spectral-envelope information commonly used as acoustic input in all major speech recognition systems. It aims to develop a new acoustic processing front end that does not depend on the spectral envelope and instead uses multiple front ends across the whole time-frequency plane. The researchers state that the ultimate challenges are making speech recognition systems robust in noisy acoustic environments and creating workable recognition systems for natural, free-style speech. They also mentioned a few existing approaches to these challenges and showed the architecture of Microsoft's speech recognition system.
Quoc et al. (2012) [13] proposed a robust speech recognition approach that uses a binaural speech enhancement system as a pre-processing step. The system uses existing de-redundancy techniques, followed by a noise removal algorithm based on spatial masking, in which only signals from the desired direction are retained by means of a threshold angle. They propose an adaptive threshold calculation, in which the threshold is first learned over a few noise frames and then updated frame by frame. Their robust speech recognition algorithm is based on binaural sound source separation and data loss.
The simulation results show that, because binaural sound source separation removes noise and interference from the target speech, the algorithm significantly improves speech recognition performance in complex acoustic environments. Speech recognition results in a real environment confirm the effectiveness of this method.
In [14], the authors studied the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. They experimented on the challenging Babel corpus. They concluded that, compared with DNNs, DMNs improve the performance of both hybrid and bottleneck feature (BNF) systems under limited LP conditions. They further concluded that pre-training based on stacked denoising autoencoders (SDAs) is effective for DMN initialization and brings benefits when the DMN becomes truly deep, and that a DMN can be used as a sparse feature extractor to generate hierarchical high-level representations. As future work, they are interested in Restricted Boltzmann Machines (RBMs) for DMN initialization, which requires a probabilistic pooling strategy to achieve a fully generative model, and they hope to extend the idea of sparse feature extraction to BNFs and generate sparse bottleneck features for tandem systems.
Xiao Xi et al. (2015) [15] proposed an improved lattice-based speech keyword recognition system built on a Gaussian mixture model and an improved N-best speech recognition algorithm. First, they ran tests to evaluate different simplified structures of the Gaussian mixture model. Then, they proposed an N-best token passing algorithm based on the classic token passing algorithm and the distinctive pronunciation rules of Chinese. These two modifications improve the performance of 1-best and N-best candidate speech recognition. Finally, a keyword recognition system based on the N-best lattice was developed to verify the effectiveness of these improvements.
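The deep maxout networks discussed above in connection with [14] are built from maxout units, which the text names but does not spell out. As a minimal sketch under the standard definition of maxout (each unit outputs the maximum over a small group of linear pieces), one layer can be written as follows. The function name `maxout_layer`, the group size and the toy weights are hypothetical, and nothing here reproduces the configuration used in [14].

```python
def maxout_layer(x, weights, biases, pieces):
    """One maxout layer: each output is the max over `pieces` linear functions of x.

    `weights` holds one row per linear piece; consecutive groups of `pieces`
    rows belong to the same output unit.
    """
    # Evaluate every linear piece w . x + b.
    z = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b for w, b in zip(weights, biases)]
    # Pool each group of pieces with a max, so only the winning piece passes through.
    return [max(z[i:i + pieces]) for i in range(0, len(z), pieces)]

# Toy layer with two output units and two pieces per unit (hypothetical values).
x = [1.0, -2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
biases = [0.0, 0.0, 0.0, 0.0]
outputs = maxout_layer(x, weights, biases, pieces=2)
```

Because only the winning piece in each group determines the output, a maxout unit behaves like a learned piecewise-linear activation rather than a fixed nonlinearity.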
Noda et al. (2015) [16] introduced a connectionist Hidden Markov Model (HMM) system for noise-robust audio-visual speech recognition (AVSR). First, a deep denoising autoencoder is used to obtain audio features that are robust to noise. The network is trained on pairs of noise-degraded audio features, taken over several consecutive steps, and the corresponding clean features, so that it learns to output denoised features from the degraded ones. Second, a CNN is used to extract visual features from raw mouth-region images. The CNN is trained on pairs of raw images and the corresponding phoneme labels, so that it learns to predict phoneme labels from mouth-region input images. Finally, a multistream HMM (MSHMM) is used to integrate the audio and visual HMMs, which are trained independently on their respective features. Comparing normal and denoised mel-frequency cepstral coefficients (MFCCs) as HMM audio features, their unimodal isolated word recognition results show that, at a signal-to-noise ratio (SNR) of 10 dB for the audio input, denoised MFCCs yield a word recognition rate gain of about 65%.
Attention-based recurrent networks have been successfully applied to many tasks such as handwriting synthesis, machine translation, image caption generation, and visual object classification. Researchers have extended attention-based recurrent networks to make them suitable for speech recognition. Learning to recognize speech can be seen as learning to generate one sequence (the transcription) given another sequence (the speech). Chorowski et al. (2015) [17] proposed and evaluated an end-to-end trainable speech recognition architecture based on a hybrid attention mechanism that combines content and location information to select the next position in the input sequence for decoding.
A desirable property of the model is that it can recognize utterances much longer than those it was trained on. In conclusion, their work contributes two ideas for attention mechanisms: a better normalization approach that yields smoother alignments, and a generic principle for extracting and using features from previous alignments. They state that both may be applied to fields other than speech recognition.
However, despite the great achievements of the past few decades, natural and robust human-machine voice interaction still seems out of reach, especially in challenging environments with significant noise and reverberation. To improve robustness, modern speech recognizers usually use acoustic models based on recurrent neural networks (RNNs), which can naturally exploit large temporal contexts and long-term speech modulations. It is therefore worthwhile to continue studying techniques that improve the effectiveness of RNNs in processing speech signals. Ravanelli et al. (2018) [18] revised one of the most popular RNN models, the gated recurrent unit (GRU), and proposed a simplified architecture that is very effective for automatic speech recognition (ASR). The work contributes on two fronts. First, they analyzed the role played by the reset gate and showed that it is significantly redundant with the update gate. They therefore suggested removing the reset gate from the GRU design, resulting in a more efficient and compact single-gate model. Second, they suggested replacing the hyperbolic tangent with rectified linear unit (ReLU) activations. This change combines well with batch normalization, which helps the model learn long-term dependencies without numerical problems.
The results show that the proposed architecture, Light GRU, reduces the per-epoch training time by more than 30% compared with a standard GRU and consistently improves recognition accuracy across different tasks, input features, noise conditions and ASR paradigms, ranging from standard DNN-HMM speech recognizers to end-to-end connectionist temporal classification (CTC) models.
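The two changes attributed to Light GRU above (no reset gate, ReLU in place of the hyperbolic tangent) can be sketched as a single recurrent step. This is a simplified illustration, not the authors' code: batch normalization, which the text says accompanies the ReLU, is deliberately omitted, and the function names, matrix shapes and toy weights are assumptions made for the example.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def light_gru_step(x, h_prev, Wz, Uz, Wh, Uh):
    """One step of a single-gate GRU with a ReLU candidate (batch norm omitted)."""
    # Update gate: the only gate left once the reset gate is removed.
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h_prev))]
    # Candidate state: ReLU replaces tanh, and h_prev enters without a reset gate.
    candidate = [max(0.0, a + b) for a, b in zip(matvec(Wh, x), matvec(Uh, h_prev))]
    # Interpolate between the previous state and the candidate.
    return [zi * hp + (1.0 - zi) * c for zi, hp, c in zip(z, h_prev, candidate)]

# Toy dimensions: 1 input feature, 2 hidden units (hypothetical weights).
Wz = [[0.5], [-0.5]]
Uz = [[0.1, 0.0], [0.0, 0.1]]
Wh = [[1.0], [1.0]]
Uh = [[0.2, 0.0], [0.0, 0.2]]
h = [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h = light_gru_step(x_t, h, Wz, Uz, Wh, Uh)
```

Compared with a standard GRU step, this version has one fewer gate to compute per time step, which is consistent with the training-time reduction reported above.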

DL in Image Recognition
Image recognition is a subfield of computer vision and AI. It comprises a set of image detection and analysis methods for automating specific tasks: identifying places, people, objects and many other types of items in a picture and drawing conclusions from them through analysis [19][20].
Today, more and more people use pictures to represent and convey information, and much information can conveniently be learned from pictures. Image recognition is therefore an important and widely used area of research. For image recognition problems such as handwritten character classification, it is necessary to know how to use data to represent an image. The data here are not raw pixels but image features with a higher-level representation. The quality of feature extraction is very important for the results [21][22][23].
Wu et al. (2015) [24] applied DL to handwritten character recognition and explored two common DL algorithms: the CNN and the deep belief network (DBN). They evaluated both on the MNIST database and on a database of real-world handwritten characters. On MNIST, the CNN and the DBN reached accuracies of 99.28% and 98.12%, respectively; on the real-world handwritten character database, the classification accuracies were 92.91% and 91.66%. The experimental results show that DL learns features well. There is no need to extract features manually, and DL helps to capture the essential characteristics of the data.
Elleuch et al. (2016) proposed a method that combines two classifiers, a CNN and an SVM, to identify Arabic characters in handwritten text. CNNs with dropout regularization are used not only in that paper but also in other projects such as [25][26][27][28][29]. Pham et al. [30] used an RNN with dropout in one of the first projects to recognize handwritten text and tested it on a range of handwriting databases. Sun et al. [31] used a deep LSTM to recognize Chinese handwriting. Their method was evaluated on the CASIA-OLHWDB data set. Compared with the state-of-the-art method, it reduces the error on the test set by more than 30% in relative terms [32].
Yang et al. (2016) [33] proposed combining a neural network classifier with Style Transfer Mapping (STM) for unsupervised writer adaptation, which requires only writer-specific unlabeled data and is more general and efficient than supervised adaptation. To increase the performance of the neural network classifier, they used techniques such as dropout, reread, momentum and deep supervision strategies. Experiments on CASIA-OLHWDB, an online Chinese handwriting database, show that this method effectively improves classification accuracy.
Islam et al. (2018) [34] presented an image-based food recognition method that uses CNNs to classify food images. On a food dataset of 16,643 images covering different food categories, they achieved 92.86% accuracy.

Virtual Assistants
A virtual assistant, also known as an AI assistant or digital assistant, is an application that understands natural language voice instructions and performs tasks for the user. These are tasks usually performed by a personal assistant or secretary, and they include taking dictation, reading texts or emails aloud, finding phone numbers, scheduling, making phone calls, and reminding the end user of appointments. Currently popular virtual assistants include Amazon Alexa, Apple's Siri, Google Assistant, and Microsoft's Cortana, the digital assistant built into Windows Phone 8.1 and Windows 10 [35][36]. Although this definition focuses on the digital form of virtual assistants, the term virtual assistant or virtual personal assistant is also often used to describe contract workers who work from home and perform the administrative tasks usually handled by administrative assistants or secretaries [37][38][39]. Virtual assistants can also be contrasted with another type of consumer-oriented AI program, the smart advisor. Smart advisors are subject-oriented, while virtual assistants are task-oriented.
Virtual assistants are typically cloud-based applications that are nowadays integrated into almost all devices by default. Three examples are Siri on Apple devices, Cortana on Microsoft devices, and Google Assistant on Android. There are also devices dedicated to virtual assistance; those built around Amazon's Alexa, Google Assistant, Microsoft's Cortana, and Apple's Siri on Apple Home are the most popular. These devices usually listen for a wake phrase: for Google Assistant it is "Hey Google" or "Okay Google", and for Cortana it is "Hey Cortana". LEDs on the device indicate to the user that it is ready to receive commands. Often these are simple spoken requests such as "tell me a joke", "play some music" or "buy a shirt". With Alexa, for example, these requests are processed and stored in the Amazon cloud.
The technology that powers virtual assistants requires large amounts of data, which feed AI platforms, including ML, DL, natural language processing and speech recognition platforms. When the user interacts with the virtual assistant, the AI uses complex algorithms to learn from the input data and better predict the end user's needs. Virtual assistants usually perform simple tasks for end users, such as adding tasks to a calendar, providing information that would otherwise be searched for in a web browser, or controlling and checking the status of smart home devices, including lights, air conditioners and ovens. Users also ask virtual assistants to make and receive calls, create text messages, get directions, listen to the news and weather forecasts, find hotels or restaurants, check flight reservations, listen to music, or play games.
Page et al. (2017) [40] showed how an artificially intelligent virtual assistant helps students navigate the road to college. The paper presents a practical application of an AI-powered virtual assistant to reduce workload in real life. The authors used conversational AI to provide personalized, SMS-based outreach and guidance to thousands of incoming college freshmen, helping them complete each task for which they needed support. The system, called Pounce, was tested in a field trial at Georgia State University (GSU). The study reported that students who received outreach from the virtual assistant had a greater success rate in completing pre-entry requirements. At the same time, the pressure on university teachers was greatly reduced.
Kepuska et al. (2018) [41] designed a prototype system that showcases the next generation of Virtual Personal Assistants (VPAs).
Their application used a multi-modal dialogue system that processes two or more combined user input modes, such as voice, image, video, touch, manual gestures, gaze, and body movement, to design the next-generation VPA model. The new VPA model is intended to increase human-machine interaction by using different technologies, such as gesture recognition, image and video recognition, voice recognition, a large conversation and dialogue knowledge base, and a general knowledge base. In addition, the new VPA system can be used in various other application areas, including educational assistance, medical assistance, robots and vehicles, systems for people with disabilities, home automation and secure access control.
Iannizzotto et al. (2018) [42] combined some of the most advanced technologies in computer vision, DL, speech generation and recognition, and AI into a virtual assistant architecture for smart home automation systems. The proposed assistant is efficient and customizable, and the implemented prototype runs on a low-cost, small-sized Raspberry Pi 3 device. For testing purposes, the system was integrated with an open-source home automation environment and ran for a few days, during which people were encouraged to interact with it; it proved accurate, reliable and appealing.
Dipanshu et al. (2020) [43] designed a system that uses the integrated webcam to capture gestures, processes them, converts them into text, displays the text in the input frame, and converts it into audio when it receives a call command. The audio becomes the query to the virtual assistant, and the assistant's audio output is in turn converted to text and displayed on the screen. The system was built to showcase a gesture-based virtual assistant aimed at people with speech or hearing disabilities. They used DL and TensorFlow along with OpenCV to accomplish their goal.
Vishnu et al. (2021) [44] proposed a CNN-based solution that uses transfer learning to develop a scene recognition Android application. They implemented the CNN model with MobileNet in the TensorFlow ML framework. To reduce inference time, the application uses GPU hardware acceleration. To improve robustness, they trained the system on images with occlusion, varying illumination and background clutter. They trained the system on the MIT Indoor Recognition dataset and then used it for testing. The main goal is to show how DL can accelerate the advancement of current-generation virtual assistants and even extend them into robotics with AI.
Carnier et al. (2021) [45] presented ToraxIA, a virtual assistant for radiologists based on DL applied to chest X-rays. The system was trained on more than 240,000 chest X-ray images and can detect several pathological signs in them with an accuracy of about 97%. It was even used to detect signs of COVID-19, by training it to differentiate between normal X-ray images and images from a COVID-19 dataset. Ultimately, the paper presents a virtual assistant with DL capabilities for the automatic detection of pathological signs suggestive of respiratory disease, which reduces diagnostic error, increases efficiency and quality of care in the medical field, and reduces the workload of radiologists.

Chatbots
Chatbots are programs that use natural language to interact with users. The technology has been gaining popularity since the 1960s [46]. Chatbots are now used in almost every field of business, e-commerce and organizational activity. Prominent uses include learning and practicing a language, retrieving information, conducting business and providing customer support. Where no voice-supported virtual assistant (VA) is available, or where users cannot use a voice-based VA, chatbots are the natural choice. One recent example is Microsoft's Bing chatbot on its search engine [47], which uses AI. Chatbots have been on the rise at an unprecedented speed since 2016, with the introduction of Microsoft's Cortana, Amazon's Alexa and others [48]. Chatbots built with traditional methods such as conditional programming are not as efficient as those built with AI and DL. Current DL methods have made training models for Natural Language Processing (NLP) far easier than before, which is exactly why chatbots and similar fields such as VAs can take advantage of them [49].
ELIZA was one of the first chatbots built as a program to attempt the Turing test [50], but it was simple and entirely rule-based. Other chatbots such as A.L.I.C.E. and PARRY were built using the same technology [36][49]. Jabberwacky was one of the first chatbots built with AI, designed to hold humorous conversations [51]. Hybrid chatbots with Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) prove more efficient in many cases [49].
Wu et al. (2019) [52] described the different ways chatbots are built in their presentation. They showed how retrieval-based chatbots are built with single-turn response selection and multi-turn response selection. They drew various conclusions from comparing the different ways DL is used in building chatbots.
For example, in the case of single turn response selection, the neural tensor is a powerful matching function and that combining information from different sources can be very beneficial for matching sequence etc. Research has been on a high horse since 2016 due to the sudden uprising for using human interaction programs, and various methods and frameworks have been in development since then to make the job even smoother. For example, some notable mentions are the DL to Respond (D2LR) model [53] with MLP matching, Coupled LSTM [54] with MAP matching etc. Conversational AI for Chit-Chat Programs or chatbots is catching on fire as it is more and more involved in mass business domains because big data technology is also on the rise [55].
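The idea of single-turn response selection in a retrieval-based chatbot can be illustrated with a minimal sketch. It is not the method of Wu et al. [52] or of any cited model: the learned neural matching function is replaced here by a plain bag-of-words cosine similarity, and the candidate responses and the names `CANDIDATES`, `bag_of_words`, `cosine`, and `select_response` are hypothetical, chosen only for illustration. The overall shape (score every candidate against the user message, return the best match) is what a DL-based matcher would share.

```python
import math
import re
from collections import Counter

# Hypothetical candidate pool; a real system would retrieve candidates
# from a large index of past conversations.
CANDIDATES = [
    "You can reset your password from the account settings page.",
    "Our support team is available from 9 am to 5 pm on weekdays.",
    "Your order usually ships within two business days.",
]


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_response(message, candidates=CANDIDATES):
    """Single-turn selection: return the candidate that best matches the message.

    The matching function is a stand-in (assumption); a DL-based chatbot
    would score each (message, candidate) pair with a trained network.
    """
    query = bag_of_words(message)
    return max(candidates, key=lambda c: cosine(query, bag_of_words(c)))


if __name__ == "__main__":
    print(select_response("How do I reset my password?"))
```

In a multi-turn setting the same loop would score candidates against the whole conversation history rather than only the last message.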

DL in Healthcare
Health care is the maintenance or improvement of health through the prevention, diagnosis, treatment, and recovery of illnesses, injuries, and other physical and mental impairments. It is a textbook definition familiar to everyone because it concerns our well-being. In this area, the convergence of technologies makes it easier to achieve this goal for people from all walks of life. The use of DL in healthcare has been increasing over the past 6 years [56][57][58], mostly because of the growth of Big Data analytics and research, IoT advances, and ML advancements [59 and 60]. CNNs can be trained for medical imaging thanks to the massive advances of Computer Vision with DL. Both CNNs and RNNs can be used to predict health problems [56]. DNNs can be used in genomics for research in genetics, gene splicing, generating new gene sequences, and so on [61]. Table 1 shows various applications of DL in health care.

Table 1. Applications of DL in health care
Application | Model | Reference
Automatically assign diagnoses to patients based on the patient's clinical status | RBM | [75]
Predict the suicide risk of mental health patients through the low-dimensional representation of EHR Chinese medical concepts | RBM | [76]
Discovery and detection of characteristic physiological patterns in clinical time series | Stacked AE | [77]
Model longitudinal sequences of serum uric acid measurements to indicate multiple population subtypes and distinguish the uric acid characteristics of gout and acute leukemia | Stacked AE | [78]
Doctor AI: use the patient's medical history to predict the diagnosis and medication for follow-up visits | GRU RNN | [79]
Deepr: an end-to-end system for predicting unplanned readmissions after discharge | CNN | [80]
Predicting disease onset from longitudinal laboratory tests | LSTM RNN | [81]
De-identification of patient clinical records | LSTM RNN | [82]
Genetics
Predicting chromatin markers from DNA sequences | CNN | [83]
Basset: an open-source platform for predicting DNase I hypersensitivity in multiple cell types and quantifying the effect of SNVs on chromatin accessibility | CNN | [84]
DeepBind: predict the specificity of DNA- and RNA-binding proteins | CNN | [85]
Predicting

DL in Entertainment
Content consumption behaviour is becoming more complex and is growing faster than ever. Data science, ML, and AI in media and entertainment offer huge possibilities for companies that are ready to invest in viable long-term solutions. Companies such as Google, Amazon, and Netflix have reshaped the media and entertainment sectors with their data-driven business models. Advanced data science and AI can help a company understand its customers and create value for them by showing content that they may like and find relevant. But this change does not happen overnight; a company can realize the full potential only by gradually developing into a truly data-powered organization [95]. Chou et al. (2020) [96] designed a four-dimensional (4D) audio-visual entertainment system called Sense. The system includes a scene recognition system and hardware modules that provide users with tactile sensations while they watch movies and animations at home. The recognition module uses Google Cloud Vision to detect common scene elements in the video, such as fire, explosion, wind, and rain. Based on the detection result from the scene recognition system, the system generates the corresponding tactile or haptic sensation. The system integrates DL, auditory signals, and touch to provide an enhanced viewing experience. This is one of the more recent studies showing how DL can be used in media consumption, but DL has been in use for longer in digital games. Digital game-based learning (DGBL) is an established method for learning about player goals and playing patterns, and it is used to analyze the in-game approaches of tutorial bots and feedback systems [97]. DL has been applied to popular games such as Atari titles, racing games, DOOM, Minecraft, and StarCraft. The implementations are not perfect and can cause problems such as unwanted NPC behaviour [98], but they can also enhance the user experience and adapt the game's difficulty to the player.
Prediction is the future of DL [99]. Player data is used to predict future game content, which is then delivered as DLC, the current business model of the game industry. Gudmundsson et al. (2018) [100] showed that it is possible to predict the next most "human" action and to use that prediction to update future game content. They used player data from Candy Crush Saga (CCS) and Candy Crush Soda Saga (CCSS) to learn human playtesting behaviour and generate key metrics for future content creation. Another major DL-driven development in the entertainment industry is Deepfake, a technology in which an algorithm learns the facial characteristics of a person and copies them onto a video or image of another individual. In simpler words, it swaps someone's face. Deepfake technology can make movie production more efficient and let viewers watch movies in their native language without subtitles [101]. DL is also being used in the music and audio sector. Creating realistic original music is time-consuming and demanding, and technology has made it easier: DL models trained on MIDI files can learn a music genre and generate realistic music [102]. Online music streaming platforms such as Spotify, YouTube Music, and Apple Music rely on AI and DL to suggest music to their users, and novel approaches to the same task, such as T-RECSYS [103], continue to appear. Services such as Google and Shazam use DL to extract music information from audio [104].

DL in Robotics
In the past 5 years, DL technology has completely changed many aspects of computer vision and has been rapidly adopted in robotics [105][106][107][108][109][110]. Punjani et al. (2015) [111] showed that a DNN model of the dynamics of a remote-controlled helicopter performed about 60% better than other methods. DL is now actively used in industrial robots for purposes such as object identification and picking [112]. Le et al. (2019) [113] demonstrated this using a Kinect sensor and a Denso robot, teaching the robot to identify and pick up USB packs with both offline and online methods. In the era of online purchasing and e-commerce, where goods are delivered right to the doorstep, object picking (or bin picking) is very important; Amazon held the first competition on this exact challenge using DL in 2015 [114].

Conclusions
DL is booming today, as major breakthroughs in artificial neural networks have prompted companies in all industries to adopt DL solutions as part of their strategy over the past few years. From chatbots in customer service to image and object recognition in retail and more, DL has opened up many new and complex AI applications. In recent years, DL has become particularly appealing to many organizations as more and more publicly accessible data and pre-trained models have been developed. However, this does not mean that DL answers every question related to ML. One of the chief advantages of DL is its ability to solve complex problems that require discovering hidden patterns in the data and understanding the complex relationships among a large number of interrelated variables. DL algorithms can learn hidden patterns from the data on their own, combine them, and build more efficient decision rules. DL is excellent for complex tasks that often require processing large amounts of unstructured data, such as image classification, natural language processing, and speech recognition. However, classical ML may be better for simple tasks that involve more direct feature engineering and do not require unstructured data processing. This paper has described several fields in which DL excels: image recognition, object detection, healthcare, education, entertainment, music, robotics, and others. Overall, DL comprises four main families of techniques: CNNs, recurrent neural networks (RNNs), generative adversarial networks (GANs), and reinforcement learning (RL). CNNs are mainly used in image and video applications such as face recognition and object detection. RNNs are applicable to text classification, commercial applications such as exchange, and image captioning. In computer vision and Natural Language Processing (NLP), GANs are mostly used to produce synthetic content, as in deepfake videos, text-to-image conversion, and photo editing.
Finally, RL, which is also used as a basis for modelling decision-making in computational neuroscience, is applied to decision-making tasks such as self-driving cars, traffic light control, and robotics. Figure 1 summarizes the techniques used in each application. As Figure 1 shows, the most popular DL technique is the CNN. Research in every field begins with CNNs, and other techniques are then incorporated on top of them. The second most used technique is the RNN. CNNs and RNNs seem to go hand in hand in all fields. The use of DL has been spreading over the world steadily, but not slowly. As seen above, AI and DL are touching all aspects of everyday life. From education to business to industry to entertainment, all fields will very soon be using ML to complete their fundamental tasks, and this looks like a blessing. The main advantages of DL are as follows. Features are automatically derived and optimized to achieve the desired results, so they do not need to be extracted in advance, which avoids time-consuming manual feature engineering. DL is robust to natural variations in the data. The same neural network-based approach can be applied to many applications and different types of data. GPUs can be used to perform large-scale parallel computation, so DL scales to large amounts of data and delivers better performance as the data grows. The DL architecture is flexible and can be adapted to new problems in the future.

Figure 1. DL Techniques used in different fields
But nothing is without drawbacks. Some of the downsides of DL are as follows. DL needs a large amount of data to outperform other techniques. The cost of training a model is very high because of the complexity of the models. In addition, DL requires expensive GPUs and hundreds of machines, which increases the cost for users. There is no standard theory to guide the choice of an appropriate DL tool, since this requires knowledge of topology, training methods, and other parameters, so DL is not easy for less skilled people to adopt. It is also not easy to interpret outputs based on learning alone; classifiers are required to do so, and CNN-based algorithms perform such tasks.
DL is currently a hot topic in the field of ML and social cognition. This article discusses face recognition, speech recognition, Chinese character recognition, neural networks, and chat robots. We also discussed the future impact of these applications on society. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it rather than by being told the name of each object. Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific manner, using a small, high-resolution fovea and a large, low-resolution surround. We expect that, in the long run, many advances in vision will come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems that combine DL and reinforcement learning are still in their infancy. Natural language understanding is another area in which DL will have a huge impact in the next few years. Ultimately, major advances in artificial intelligence can be achieved through systems that combine representation learning with complex reasoning, although speech and handwriting recognition have so far relied on DL with simple reasoning. DL is extensively used in medical image recognition, face recognition, character recognition, and many other tasks.
DL trains a model on a given data set so that the model can perform a specific task on new data, whereas traditional medical image recognition is inefficient at feature extraction and data processing and generalizes poorly. These application domains are changing with each passing day, and a large number of new fields are also adopting these technologies.