Human Activity Recognition System For Moderate Performance Microcontroller Using Accelerometer Data And Random Forest Algorithm

There has been increasing interest in the application of artificial intelligence technologies to improve the quality of support services in healthcare. Some constraints, such as space, infrastructure, and environmental conditions, present challenges with assistive devices for humans. This paper proposed a wearable-based real-time human activity recognition system to monitor daily activities. The classification was done directly on the device, and the results could be checked over the internet. The accelerometer data collection application was developed on the device with a sampling frequency of 20Hz, and the random forest algorithm was embedded in the hardware. To improve the accuracy of the recognition system, a feature vector of 31 dimensions was calculated and used as an input per time window. Besides, the dynamic window method applied by the proposed model allowed us to change the data sampling time (1-3 seconds) and increase the performance of activity classification. The experiment results showed that the proposed system could classify 13 activities with a high accuracy of 99.4%. The rate of correctly classified activities was 96.1%. This work is promising for healthcare because of the convenience and simplicity of wearables.


Introduction
In recent years, digital transformation in healthcare has advanced rapidly, and four prominent approaches have emerged: i) using robots to interact with patients, provide personalized services, and give advice [1]; ii) optimizing the use of personal wearable electronic devices to monitor health status and prevent diseases [2,3]; iii) using computer vision in healthcare [4]; iv) actively using artificial intelligence and big data in the treatment of serious diseases [5]. These approaches require complex data warehouses and digital platforms, with high construction and implementation costs, so they are difficult to deploy widely. Many studies have instead used data obtained from inertial sensors to monitor daily activities [7][8][9][10][11][12].
Internet of things (IoT) applications [13][14][15] for predicting human activities were growing strongly. Human activity recognition (HAR) [16,17] was of increasing interest because of its practical applications, such as sports tracking systems [18] and systems for monitoring and preventing sedentary conditions in office work [19]. Moreover, sensors integrated into wearable devices have many advantages, such as cost savings, fast response time, and low energy consumption, making them suitable for wearable applications [7,12,20,21,22].
Sensor placement (Fig 1) greatly affected the performance of the recognition model [6,11,12,15,19,20,23,24]. Sensors must be attached at an appropriate location on the body, which is determined by the activities to be monitored. Smart watches are usually worn on the left wrist, so their data are often weakly related to activities that do not use the hand, such as sitting, standing, and lying down. Sensors attached to the waist help monitor overall and major body movements, but data collected at the waist are often less sensitive to activities involving the hands, such as writing, holding, and grasping. Therefore, recent works [25][26][27][28] have combined many sensors in different locations to increase the amount of information collected about activities. However, carrying many wearable devices on the body could cause discomfort. In addition, wearable devices needed to be compact and light to be comfortable for users, especially for the elderly or those recovering from an accident. These devices were worn on the body during treatment to monitor the progress of daily activities and detect abnormal activities, for example a sudden change in posture such as lying down while walking. Healthy people could also use wearable devices to monitor sports cycles and living habits and to determine the energy consumed by activities.
The performance of the recognition model improved substantially when machine learning was used to identify human activities [3, 6, 7, 10-12, 15, 19, 24, 29]. However, the processors integrated into wearable devices were low-power microcontrollers with limited memory and processing speed. Besides, ML recognition models must be optimised and compiled into C/C++ before being embedded in the microcontroller. Thus, choosing a machine learning algorithm that could be embedded on microcontrollers was a big challenge. Embedded devices typically have limited memory, processing power, and energy capacity. Since embedded systems were designed for specific uses, few resources were left for machine learning models. To solve this problem, we built a human activity recognition model based on a machine learning algorithm with low complexity and small size that could be embedded in microcontrollers.
Machine learning algorithms (Fig 2) rely on past experience, observations, and data to predict future outcomes instead of just following pre-programmed rules. In general, there were four main approaches in the field of machine learning: 1) supervised learning [30,31]; 2) unsupervised learning [32]; 3) reinforcement learning [33]; 4) semi-supervised learning [34]. With HAR, supervised learning algorithms would predict the label (activity) of new descriptive data based on the correlation between labels and previously known descriptive data. Mathematically, the set of input data had the form $\chi = \{x_1, x_2, \ldots, x_N\}$ and the set of labels had the form $\gamma = \{y_1, y_2, \ldots, y_N\}$, where $x_i$ was the data descriptor vector for the label $y_i$, with $i = 1, 2, \ldots, N$. Data pairs of the form $(x_i, y_i) \in \chi \times \gamma$ were divided into training and testing datasets. On the training dataset, a function $f$ was computed to map each element of the set $\chi$ to a corresponding element of the set $\gamma$. From there, when new data $x$ was available, this function would predict the corresponding label as $y = f(x)$. However, an exact function of this kind did not exist in practice, so $f$ needed to be built well enough for the best classification performance, so that $y \approx f(x)$. Building a good HAR model was influenced by many factors, including the quality of the data describing the action, the size of the data window, the features used, the machine learning algorithm applied, and the complexity of the activities.
In addition, limited memory on wearables made it difficult to process large amounts of data and a growing number of activities. Using the dynamic windowing approach, our study concentrated on enhancing data quality by capturing the timeframe in which state changes occur. In this study, a belt-like device was designed to be worn on the waist in order to classify many activities with less discomfort for users. First, we collected accelerometer data for each activity and extracted a set of 31 time-domain features per data window. Next, the feature sets were divided into training and testing sets at a ratio of 75/25. The random forest algorithm was applied to classify 13 routine activities. Finally, the most suitable model was installed on a microcontroller with moderate performance (ESP32). Volunteers, each wearing a belt-mounted device, helped us assess the usefulness of the dynamic window approach. Each device was able to transmit activity data to the data server via Wi-Fi or the internet. Through an application called "Human Activity Classification", which we published on the Google Play store, users could track the activities of volunteers. The activity information was logged on the data server and backed up on the integrated memory card of each device. In this way, an internet-based remote activity monitoring system was developed, with which we could evaluate the classification accuracy of wearable devices in a real-time environment. The experimental results were evaluated and compared with a number of related works on real-time HAR classification [12,22,35].

Related works
Various machine learning algorithms have been applied to build activity recognition models in many recent studies and have achieved impressive results. Mannini et al. [25] suggested using support vector machine (SVM) classifiers and activity data collected from sensors located at the ankle and wrist. Their results showed that the data collected at the ankle performed better (by 10%) than the data collected at the wrist. Unlike them, Bali and co-authors [29] combined information including footsteps, gyroscope and accelerometer data, and heart rate to recognize human activities. The features were extracted using the principal component analysis (PCA) method. The classification algorithms used were C4.5, random forest (RF), k-nearest neighbours (KNN), and SVM. Aiguo et al. [36] compared the KNN and Naive Bayes classifiers, using accelerometers and gyroscopes both separately and together. Their experiments showed that using gyroscopes and accelerometers together improved classification accuracy. Naive Bayes achieved a better overall accuracy (90.1% and 87.8%) than KNN. A large number of sensors could improve accuracy, but it would be impractical to wear them all the time, and adding more sensors would also make the system more expensive. Another study by Biagetti et al. [8] proposed a human activity recognition system consisting of wireless sensor network nodes (biological and accelerometer) whose data were transmitted to a computer for analysis. Applying the KNN classifier achieved an overall accuracy of 85.7%. In the study [35], Yang and Zhang proposed a wearable activity classification system that resembled a wristwatch and was worn on the hand. Time and frequency domain features of the accelerometer data were extracted and then classified with the decision tree method. Their model could be executed in real time on the STM32L low-power microcontroller.
Despite the small number of activities (walking, sitting, jumping, bicycling, and jogging), the accuracy was below 90%. In the study by Bao et al. [37], participants wore five biaxial accelerometers at various positions, including the hip, wrist, arm, ankle, and thigh. The decision tree classifier categorized 20 activities with an accuracy of 84%. The limitations, though, were the large number of sensors, the need to transfer data, and the requirement that the accelerometers be set to a designated orientation. Many scientists in the last few years have looked into the possibility of employing a single accelerometer to collect the signal necessary for activity recognition [38]. Piyush Gupta et al. [38] used a belt-worn 3-axis accelerometer to construct an activity identification and feature selection system. Using Naive Bayes and KNN, they observed that wrapper-based feature selection was superior to filter-based selection. Data collection was limited to seven volunteers, all of whom were young (22-28). Thu and co-authors [12] built a real-time recognition system for six activities on low-performance microcontrollers. Their system used two features (mean and standard deviation) combined with a decision tree algorithm. The real-time result, above 92%, was quite good, but it was lower than the result achieved on their collected data (99%). The difference came from the data itself and the limited number of activities. For example, when switching from a static state (lying) to a dynamic state (sitting up) or from a dynamic state (sitting down) to a stationary state (sitting), these activities could be confused with activities such as walking or jogging.
EAI Endorsed Transactions on Energy Web 06 2022 -12 2022 | Volume 9 | Issue 40 | e4

Wearable device
The posture and velocity of human movements change frequently. Different activities produced different 3-axis acceleration values depending on where the accelerometer was placed on the body [12,23,35]. In this research, accelerometer signals from the MPU6050 sensor were collected. Fig 3 shows the block diagram of the proposed system. The inertial sensor used was the MPU6050, which measures 6 axes: 3 axes of acceleration and 3 axes of angular velocity (gyroscope). The ESP32 was a power-saving microcontroller that integrated Wi-Fi and Bluetooth and was widely used in IoT applications. It was equipped with 16 MB of flash memory. Besides, the ESP32 supported embedding small machine learning models through tools such as TensorFlow and MicroPython. Therefore, it was a suitable choice for integrating machine learning algorithms such as random forest and support vector machine. In this work, the central processor (ESP32) used I2C (inter-integrated circuit) communication with the MPU6050 and the DS1307 (real-time clock integrated circuit) to collect timestamped acceleration data at a sampling rate of 20 Hz (20 samples per second).
A human activity usually lasts from 2 seconds to 3 seconds, and longer for the elderly or people who have just recovered from an accident. Besides, a sampling frequency of 50 Hz or more did not give better results [36], and the performance of the device depended on the accuracy achieved when combining the classification algorithm with data features. Acceleration values were calculated according to formula (1). The wearable device was powered by a 3.7 V, 2000 mAh lithium battery, so the energy it could provide was 3.7 V × 2000 mAh = 7400 mWh.

Energy consumption
The average power consumption of the device was about 60 mW, that is, about 60 mWh for each working hour. The device could therefore work for up to 7400 / 60 ≈ 123.33 hours, which is equivalent to about 5.14 days.
In formula (1), $A_i$ was the acceleration value in the direction i; $Sam_i$ was the sampled value on the i-axis; R was the reference suspension resistor; $O_i$ was the offset compensation and $S_i$ was the sensitivity. To avoid running out of power during operation, the device needed to be charged after a period of continuous operation. A low-battery alarm was sent to users via the mobile application (Fig 5 left), and an alarm sounded on the device. Because the battery voltage (VBAT) was between 3.7 V and 4.2 V but the maximum input voltage tolerated by the microcontroller was 3.3 V, the battery voltage signal was passed through a voltage divider circuit (Fig 4). Next, the capacity of the battery was calculated by formula (2) as a percentage. As a result, the battery was at 100% when the battery voltage was 4.2 V. The device stopped working when the battery voltage dropped to 3.7 V, or 0%. The phone application displayed a warning icon indicating that the device needed to be charged when the battery capacity was below 10%.
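The battery arithmetic in this section can be checked with a short sketch. The energy and runtime calculations follow the figures stated in the text; the percentage mapping is only an assumption, since formula (2) is not reproduced here: a linear interpolation between 3.7 V (0%) and 4.2 V (100%) is used because it matches the two end points given above. All function names are illustrative.

```python
def battery_energy_mwh(voltage_v: float, capacity_mah: float) -> float:
    """Nominal stored energy in mWh (V x mAh)."""
    return voltage_v * capacity_mah


def runtime_hours(energy_mwh: float, avg_power_mw: float) -> float:
    """Operating time in hours for a given average power draw."""
    return energy_mwh / avg_power_mw


def battery_percent(vbat_v: float, v_empty: float = 3.7, v_full: float = 4.2) -> float:
    """Remaining capacity in percent.

    Assumption: a linear mapping between v_empty (0%) and v_full (100%),
    clamped to the 0-100 range. The paper's formula (2) may differ.
    """
    percent = (vbat_v - v_empty) / (v_full - v_empty) * 100.0
    return max(0.0, min(100.0, percent))


if __name__ == "__main__":
    energy = battery_energy_mwh(3.7, 2000)
    hours = runtime_hours(energy, 60)
    print(f"energy = {energy:.0f} mWh, runtime = {hours:.2f} h = {hours / 24:.2f} days")
    print(f"capacity at 3.75 V = {battery_percent(3.75):.0f}%")
```

Under this linear assumption, the 10% warning threshold corresponds to a battery voltage of about 3.75 V.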

Mobile monitoring application
To collect activity data, the wearable device was configured as a mini server, and users could connect to the device via Wi-Fi using an interactive app called "Accelerometer Gyrometer Logger", which is publicly shared on the Google Play store. The data was stored on 2 GB of expandable local memory, which communicated with the ESP32 via a standard serial peripheral interface (SPI). Each collecting and monitoring device was a node with its own address code and communicated with the data server using HTTP GET requests. Activity data was saved in a MySQL database.
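As an illustration of how a node might report an activity through an HTTP GET request, the sketch below only builds the request URL. The server address, script name, and query parameter names (`node`, `activity`, `time`) are hypothetical; the paper does not specify them.

```python
from urllib.parse import urlencode


def build_report_url(base_url: str, node_id: str, activity: str, timestamp: str) -> str:
    """Build a GET URL carrying one classified activity.

    All parameter names are hypothetical placeholders, not the
    authors' actual server interface.
    """
    query = urlencode({"node": node_id, "activity": activity, "time": timestamp})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # example.com and log.php are placeholders for the real data server.
    print(build_report_url("http://example.com/log.php", "07", "walking",
                           "2022-06-01 10:00:00"))
```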
The application interface (Fig 5 right) provided users with the following functions: setting the sampling frequency; choosing the label corresponding to the activity to be collected; choosing the sensors (3-axis accelerometer and 3-axis gyroscope); deleting the collected data when users made a mistake in operation or the data contained errors; battery capacity notifications; and starting or stopping the logging of collected data with the START/STOP button. Besides interacting with the software, users could start or stop data logging by using a physical button on the device. In addition, pressing and holding the button for 5 seconds deleted the recorded data file, after which the device rebooted.

Sampling and pre-processing
Public dataset. The activities of daily living (ADLs) dataset introduced in work [37] described 36 movements, including 20 falls and 16 daily living activities. Activity data were collected from 17 volunteers (10 males and 7 females). Each volunteer performed each activity in turn, and each action consisted of 5 or 6 tests lasting between 30 seconds and 60 seconds. When performing the activities, each volunteer wore 6 devices at 6 corresponding positions: head, chest, waist, right wrist, right thigh, and right ankle. Each collection device included three sensors, an accelerometer, a gyroscope, and a magnetometer, with a sampling frequency of 25 Hz.
Researchers [23,38] investigated the fall state based on this dataset and showed that the sensor data collected at the waist position gave the best result. However, the data included a number of similar activities that did not need to be distinguished in detail, for example sitting down on a bed, on a chair, in the air, or on a sofa. Thus, we combined such activities into a common action: walking forward and backward into walking; sitting down on different surfaces (sit-air, sit-bed, sit-sofa, sit-chair) into sit-down; tripping and sliding but controlled states into tripover; and coughing-sneezing.
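The label merging described above can be sketched as a simple lookup table. The sit-* names come from the text; the raw labels `walk-forward` and `walk-backward` are hypothetical stand-ins for however the public dataset names those two movements.

```python
# Map fine-grained labels of the public dataset to the common actions used here.
MERGE_MAP = {
    "walk-forward": "walking",    # hypothetical raw label
    "walk-backward": "walking",   # hypothetical raw label
    "sit-air": "sit-down",
    "sit-bed": "sit-down",
    "sit-sofa": "sit-down",
    "sit-chair": "sit-down",
}


def merge_label(raw_label: str) -> str:
    """Return the merged activity label; unknown labels pass through unchanged."""
    return MERGE_MAP.get(raw_label, raw_label)


if __name__ == "__main__":
    for raw in ["sit-bed", "walk-backward", "jogging"]:
        print(raw, "->", merge_label(raw))
```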
The data collected in the public dataset needed to be processed to improve classification accuracy. Fig 6 shows the rising activity after 6 tests over a period of more than 250 seconds, with acc_x, acc_y, and acc_z corresponding to acceleration on the x, y, and z axes. With a sampling frequency of 25Hz, the data of the tests was approximately 430 samples. In the first 150 data samples, the volunteer was horizontal, barely moving (static state), then he sat up for the next 2-3 seconds, and finally sat motionless (static state).
Signal statistics in these tests showed that the activity information was concentrated in signal regions where the range on the x, y, or z axis was greater than 0.5 m/s². Therefore, we determined whether the activity state was static or dynamic on each 1-second signal segment (time window). After the static-state data were reduced, the descriptive information of the actions was more concentrated (Fig 7) than before processing.
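A minimal sketch of this reduction step is given below, assuming that a 1-second window counts as dynamic when the range on at least one axis exceeds 0.5 m/s², and that static windows are simply discarded. The function names and the handling of a trailing partial window are illustrative choices, not the authors' implementation.

```python
def axis_range(values):
    """Range (max - min) of one axis within a window."""
    return max(values) - min(values)


def is_dynamic_window(window, threshold=0.5):
    """True if the range on any of the three axes exceeds the threshold (m/s^2)."""
    return any(axis_range([sample[axis] for sample in window]) > threshold
               for axis in range(3))


def drop_static_windows(samples, fs=25, threshold=0.5):
    """Keep only the samples belonging to dynamic 1-second windows.

    samples: list of (x, y, z) tuples; fs: samples per second.
    A trailing window shorter than one second is dropped (assumption).
    """
    kept = []
    for start in range(0, len(samples) - fs + 1, fs):
        window = samples[start:start + fs]
        if is_dynamic_window(window, threshold):
            kept.extend(window)
    return kept


if __name__ == "__main__":
    still = [(0.0, 0.0, 9.8)] * 25
    moving = [(0.0, 0.0, 9.8 + (i % 2)) for i in range(25)]
    reduced = drop_static_windows(still + moving + still)
    print(len(reduced), "of", 75, "samples kept")
```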
Besides, abnormal values were replaced by interpolated values, which were determined by averaging the adjacent signal magnitudes before and after those values. The same process was applied to the remaining activities. The public dataset contained data for activities that followed the same static-dynamic-static pattern but included both repetitive and non-repetitive activities. For example, lying down (lie-down) was a non-repetitive activity: volunteers had to rise before doing it again. The same held for activities like sit-down, rising, tripover, and squatting. In contrast, activities such as walking, jogging, and limp could be performed continuously. Therefore, in order to improve the classification performance, we also classified three activities in a static state: sitting, standing, and lying. The data for these activities were the static regions separated out while processing the non-repetitive activities. For example, the descriptive data for sitting and lying were the static regions before and after the lie-down activity was performed. Similarly, the data for standing was the signal region before the sit-down activity took place. However, the durations of the activities were not exactly the same. For example, the squatting activity lasted up to 5 seconds, while transitions such as lie-down, rising, and sit-down lasted 3 seconds. In contrast, activities that take place while a person moves, such as jogging, walking, tripover, and limp, only need 2 seconds to be classified. Therefore, we proposed using different window sizes for each activity in order to provide better activity information. The 13 activities are presented in Tab 1.
Private dataset. The private dataset was built from a group of volunteers wearing data collection devices on the waist (Fig 8) and performing the following activities: walking, jogging, squatting, bending, bend-p, limp, tripover, sit-down, lie-down, rising, lying, standing, and sitting. This group consisted of 20 students, including 12 males and 8 females. The recorded data had the following format: activity, timestamp, sensor, values of x, y, and z, frequency. Activity logger files of type TXT were named with the activity and the time. The rising activity covered the sequence lying-rising-standing; lie-down covered the sequence from sitting, through lying down onto the bed, to lying.
Data collection used different collection times for different activities. We set the collection time to 3 seconds for non-repetitive activities (squatting, bend-p, tripover, lie-down, sit-down, rising). Each activity took place within a period of 3 seconds and was repeated in subsequent discrete intervals. This kept the data for each activity together and avoided loss when the activity data was divided into time windows. The remaining activities, such as bending, limp, sitting, standing, lying, walking, and jogging, were recorded over continuous intervals (30 seconds to 50 seconds). Collected data was segmented into fixed-size windows of 3 seconds because this dataset was sampled over limited periods of time.

Data transformation
Data segmentation was one of the stages of the recognition process. In this phase, the activity data was divided into windows after preprocessing, which allowed us to understand the impact of the signals. There were two approaches to windowing the data: static and dynamic windows.
Static window size. With this windowing approach, the activity data was segmented into equally sized data windows, as shown in Fig 9. The window size was an important parameter because it determined how much information each window captured: as the window size changed, the amount of information changed accordingly. If the window was too large, the processor would take too long to process the information, and information from other activities might appear in the same window. Conversely, choosing a window that was too short caused information loss or misinformation for transitions occurring in a short time. Therefore, this approach was suitable for classifying activities in a stationary state (sitting, standing, lying, bending) and activities repeated over a long time (jogging, walking, limp).
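Fixed-size segmentation as described above can be sketched in a few lines. The sketch assumes non-overlapping windows and drops a trailing partial window; whether the original pipeline used overlap is not stated, so this is an illustrative choice.

```python
def segment_fixed(samples, fs, window_s):
    """Split a sample sequence into non-overlapping windows of window_s seconds.

    samples: sequence of accelerometer samples; fs: sampling frequency in Hz.
    A trailing window shorter than window_s seconds is discarded (assumption).
    """
    size = int(fs * window_s)
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, size)]


if __name__ == "__main__":
    # 6.5 seconds of dummy data at 20 Hz, segmented into 3-second windows.
    data = list(range(130))
    windows = segment_fixed(data, fs=20, window_s=3)
    print(len(windows), "windows of", len(windows[0]), "samples each")
```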
After preprocessing and noise removal, the datasets were segmented and labelled. The data segmentation was done on the computer, so the static windowing method was applied at this stage. However, the window size depended on how long each activity takes. The window size applied to each activity in the public dataset is shown in Tab 1. For example, lie-down and sit-down used 3 seconds, squatting 5 seconds, and rising 4 seconds. Unlike the public data, activity data in the private dataset was segmented into fixed-size windows of 3 seconds. The number of observations after data segmentation is presented in Tab 2.
Dynamic window size. With this approach, the window size was not the same for every activity. In Fig 10, the data of squatting (squat down, then stand up) was split into windows of different sizes to detect when this activity took place. In practice, activities took place over different time intervals, so applying dynamic windowing was necessary to improve activity prediction. Therefore, we applied the dynamic window method to the proposed model when conducting experiments.
When conducting experiments, we used windows of 1 second to determine whether the state was static or dynamic. Static and dynamic windows were identified by calculating the range (H) with formula (9) and the average resultant acceleration (ARA) with formula (12), where ARA was the average of the square root of the sum of the squares of the signal values on the three axes. The research in [39] collected breathing data when a person was in a stationary state. X. Sun determined this state by calculating the average resultant acceleration of the 3-axis acceleration measured on a smartwatch. Accordingly, when a person was not moving, the calculated value was approximately 10 m/s². Therefore, a time window was defined as static if two conditions were satisfied: H was between 0 and 1 m/s², and ARA was between 9.6 and 10 m/s².
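The two static-state conditions can be sketched as follows. Formulas (9) and (12) are not reproduced in this text, so the sketch assumes that H is evaluated per axis (the window is static only if every axis stays within the limit) and that ARA follows the verbal definition given above. Function and parameter names are illustrative.

```python
import math


def average_resultant_acceleration(window):
    """Mean of sqrt(x^2 + y^2 + z^2) over all samples in the window."""
    return sum(math.sqrt(x * x + y * y + z * z) for x, y, z in window) / len(window)


def is_static(window, h_max=1.0, ara_min=9.6, ara_max=10.0):
    """Static if the range H on every axis is at most h_max (m/s^2)
    and the ARA lies between ara_min and ara_max (m/s^2).

    Assumption: the range condition is applied to each axis separately.
    """
    ranges = [max(s[axis] for s in window) - min(s[axis] for s in window)
              for axis in range(3)]
    ara = average_resultant_acceleration(window)
    return all(r <= h_max for r in ranges) and ara_min <= ara <= ara_max


if __name__ == "__main__":
    still = [(0.0, 0.0, 9.8)] * 20                            # 1 s at 20 Hz, at rest
    moving = [(0.0, 0.0, 9.8 + 3 * (i % 2)) for i in range(20)]
    print(is_static(still), is_static(moving))
```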
The study [40] showed that changing the window size between 1 and 3 seconds had a great influence on the classification results. Therefore, in this study, we limited the data used to identify an activity to a maximum of 3 seconds. In other words, the data window size was changed automatically for each activity: the window kept growing until the next 1-second window was in a static state or the window reached 3 seconds. This algorithm is depicted in Fig 11.
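The dynamic windowing rule can be sketched as below. This is one reading of the description, not the authors' code: static 1-second chunks are emitted immediately, dynamic chunks are buffered, and the buffer is emitted when it reaches 3 seconds or when the next chunk is static (that static chunk is then emitted separately). The static test is passed in as a function so the sketch stays self-contained.

```python
def dynamic_windows(chunks, is_static, max_seconds=3):
    """Group consecutive 1-second chunks into variable-size windows.

    chunks: list of 1-second sample lists; is_static: predicate on a chunk.
    Returns a list of (state, samples) tuples, state being "static" or "dynamic".
    """
    windows = []
    buffer, buffered_seconds = [], 0
    for chunk in chunks:
        if is_static(chunk):
            if buffer:  # a static chunk closes any pending dynamic window
                windows.append(("dynamic", buffer))
                buffer, buffered_seconds = [], 0
            windows.append(("static", list(chunk)))
        else:
            buffer = buffer + list(chunk)
            buffered_seconds += 1
            if buffered_seconds == max_seconds:  # cap the window at 3 seconds
                windows.append(("dynamic", buffer))
                buffer, buffered_seconds = [], 0
    if buffer:  # flush whatever remains at the end of the stream
        windows.append(("dynamic", buffer))
    return windows


if __name__ == "__main__":
    S, D = [0], [1]  # toy chunks: 0 marks a static second, 1 a dynamic second
    result = dynamic_windows([S, D, D, S, D, D, D, D, S], lambda c: max(c) == 0)
    print([(state, len(samples)) for state, samples in result])
```

With real data, each emitted window would be passed on to feature extraction and classification.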

Feature engineering
After conversion to time windows, the data was still difficult to apply directly to the machine learning algorithm, and not every sample of the activity data was needed for the classification process. Therefore, the activity data was transformed into a new dataset consisting of selected time-domain features compatible with the classification algorithms.
These features reflected the correlation between the original data and each activity and helped to improve the accuracy of the classification model. This work considered 12 statistical features including: mean (µ), standard deviation (σ), median (Med), average absolute difference (AAD), maximum value (Max), minimum value (Min), range, root mean square (RMS), correlation coefficient (Corr), average resultant acceleration (ARA), and the correlation between the average of the initial signal series (µ_start) and the average of the final signal series (µ_stop) over a time window (CAOSS). These features were extracted on the three axes (x, y, z) based on the respective formulas (3)-(13).
In these formulas, $X_j$ was the j-th data segment; $x_i$, $y_i$, $z_i$ were the i-th acceleration values on the x, y, and z axes in $X_j$; $N_w$ was the number of samples in the window $X_j$. The correlation coefficient Corr($A_j$, $B_j$) was a function that calculated the correlation between pairs of axes: (x, y), (y, z), and (x, z). In CAOSS, µ_start and µ_stop were the average values of the signal on each of the x, y, and z axes during the first and last second, respectively. The t-distributed Stochastic Neighbour Embedding (t-SNE) [3,7,12] tool made it simple to see how these features affected the system (Fig 12). t-SNE is a technique for visualizing high-dimensional data. By converting similarities between data points into joint probabilities, it attempts to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. Although there were up to 13 activities to be classified, the number of activities that were confused was minimal. The confusion was most prevalent among the activities in the moving state.
Fig 12. Activities can be easily recognized from the features, although there is some confusion between activities in the moving state.
A feature vector with 31 dimensions was calculated from 13 measurements on the x, y, and z axes of the MPU6050 sensor (acceleration). These features were extracted from each data segment (time window) of each activity. Then, the features were merged and created into a feature vector representing the corresponding data segment. These feature vectors would be divided into training and testing sets at the rate of 75/25.
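To make the feature vector concrete, the sketch below computes the features listed in this section for one window using only the Python standard library. Eight per-axis statistics (mean, standard deviation, median, AAD, Max, Min, range, RMS) give 24 values, the three pairwise correlations give 3, ARA gives 1, and CAOSS gives 3, for 31 in total. Several details are assumptions because formulas (3)-(13) are not reproduced here: the population standard deviation is used, AAD is taken as the mean absolute deviation from the mean, and CAOSS is taken as µ_stop minus µ_start per axis.

```python
import math
from statistics import fmean, median, pstdev


def axis_features(v):
    """Eight per-axis statistics: mean, std, median, AAD, max, min, range, RMS."""
    mu = fmean(v)
    return [mu, pstdev(v), median(v),
            fmean([abs(s - mu) for s in v]),       # AAD (assumed definition)
            max(v), min(v), max(v) - min(v),
            math.sqrt(fmean([s * s for s in v]))]  # RMS


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either axis is constant."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def extract_features(window, fs=20):
    """Return a 31-dimensional feature vector for a window of (x, y, z) samples."""
    x, y, z = ([s[k] for s in window] for k in range(3))
    features = []
    for axis in (x, y, z):
        features.extend(axis_features(axis))                        # 24 values
    features += [pearson(x, y), pearson(y, z), pearson(x, z)]       # 3 values
    features.append(fmean([math.sqrt(a * a + b * b + c * c)
                           for a, b, c in window]))                 # ARA
    for axis in (x, y, z):                                          # CAOSS (assumed
        features.append(fmean(axis[-fs:]) - fmean(axis[:fs]))       # as a difference)
    return features


if __name__ == "__main__":
    window = [(0.1 * i, 1.0, 9.8 - 0.05 * i) for i in range(60)]  # 3 s at 20 Hz
    print(len(extract_features(window)))
```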

Recognition model
Random forest. While most machine learning data processing workloads run in the cloud, there is a growing trend towards on-device machine learning (on-device ML). On-device ML means deploying machine learning models on embedded devices for direct inference at the real-time data source. Because the prediction and classification of activities took place directly on the device, these devices did not need an external network, and sensitive data was not exposed on the public internet. Cloud-based ML requires embedded devices to send data to the cloud for inference. The computational capacity of the cloud is virtually unlimited, but transferring data to and from embedded devices always introduces latency. Embedded systems are also often deployed in locations where connectivity is limited and wireless transmission is unreliable, so relying on the cloud is not practical for human activity recognition systems. Not all machine learning algorithms could be applied to microcontrollers; they needed to be miniaturized and optimized to run on low-power devices without too much loss of accuracy. Machine learning algorithms of suitable complexity and size that could be embedded on microcontrollers included decision tree (DT) and random forest (RF). A decision tree [41] was an algorithm that could imitate human reasoning. It relied on the correlation between data to learn the logic linking input and output data (Fig 13). Each tree node (rhombus) represented a threshold on a data feature, each branch represented a rule, and each leaf (ellipse) represented an activity or prediction label.
Decision tree models often suffer from overfitting, which leads to false predictions. To fit the data, the tree kept creating new nodes and eventually became too complex. Hence, it worked very well on the training data but made many errors on the testing data. These algorithms are often unstable when new data are added and are not suitable for large datasets. Thus, a single decision tree was difficult to apply to microcontrollers when the number of output predictions was large. Furthermore, not all of the chosen features are appropriate for every activity: a feature might be effective for one activity but not for another.
Fig 14. Many single decision trees are built on random data drawn from the dataset; the prediction results are aggregated from these trees by majority rule.
The random forest technique combined multiple decision trees to create a more robust model (Fig 14). This algorithm uses the bootstrap technique [42] to randomly select data from the dataset (random sampling with replacement). As a result, each new dataset might contain duplicated samples, and a new decision tree was built on each such dataset. Because the construction of each decision tree had a random element, the decision trees in the RF algorithm might differ from one another. Aggregating the predictions of many independent decision trees gave more reliable results than a single tree.
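The two ingredients described above, bootstrap sampling and majority voting, can be sketched as follows. The three toy trees and their thresholds are hypothetical and hand-written purely for illustration; in the actual system the trees are learned from the training data.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement, so duplicates may occur."""
    return [rng.choice(dataset) for _ in dataset]


def majority_vote(labels):
    """Return the most frequent label among the individual tree predictions."""
    return Counter(labels).most_common(1)[0][0]


def forest_predict(trees, features):
    """Aggregate the predictions of all trees by majority rule."""
    return majority_vote([tree(features) for tree in trees])


# Toy decision trees with made-up thresholds on a 2-element feature vector.
def tree_a(f):
    return "walking" if f[0] > 1.0 else "sitting"


def tree_b(f):
    return "walking" if f[1] > 0.5 else "sitting"


def tree_c(f):
    return "jogging" if f[0] > 3.0 else "sitting"


if __name__ == "__main__":
    rng = random.Random(0)
    print(bootstrap_sample(list(range(10)), rng))
    print(forest_predict([tree_a, tree_b, tree_c], [2.0, 0.8]))
```

Each tree reduces to a chain of threshold comparisons, which is what makes a trained forest straightforward to express as C/C++ code for a microcontroller.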
With RF, the features are involved in and have an impact on the outcome of activity prediction. In this research, the recognition model using the random forest algorithm was built on the training data. The model that performed best on the testing data was then converted into C/C++ and integrated into the ESP32.
Human activity recognition using a random forest algorithm for wearable devices. The feature set obtained after feature extraction was randomly divided into a training set and a test set at a ratio of 75/25. The recognition model was trained on the labeled training dataset, in which each feature vector had a corresponding label. Then, the classification result on the test data was used to evaluate the built model. For the random forest algorithm, we used the publicly shared sklearn library to build the model. We used the parameter n_estimators to set the number of single decision trees, and each split of a decision tree tested one feature. In this work, n_estimators was set to 50. Fig 15 shows the process of building a classification algorithm on a wearable device.
To build the best recognition model and achieve high accuracy, the model was trained on a computer, and the best model was transformed to be embedded in the ESP32 microcontroller. The computer used to train the model and run the algorithm had the following configuration: Dell G515, 16 GB RAM, Intel(R) Core i7-9750H processor, 256 GB SSD, 1 TB HDD. The random forest model applied to the ESP32 is illustrated in Fig 16, where activity 1 or activity 2 represents 1 of the 13 daily activities. The final prediction for an activity was based on the predictions of the majority of the individual decision trees. The recognition model that gave the best results on the test dataset was compiled into a library named "RF_acc.h", which was included in the embedded executable on the ESP32 processor. Accordingly, the ESP32 integrated into the wearable would read the acceleration signal from the MPU6050. After collecting enough data samples, the ESP32 would extract a vector of 31 features from each window and perform classification based on this feature vector. This process was performed every 1 second, corresponding to 20 data samples. If a 1-second window was static, the system would classify the activity immediately and continue to classify every 1 second following the same rule. Conversely, if the current window was in a dynamic state, the processor would keep that 1 second of data and go on to consider the next 1-second window. The system would output a result when the data window reached the maximum length of 3 seconds or when the next 1-second window was static. Thanks to this flexible data sample size (dynamic window size), activities were detected with less delay. At the same time, our system could determine the start time and end time of a transition activity (lie-down, sit-down, rising, bendp, tripover).
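The dynamic window rule can be sketched as follows. This is an illustration under stated assumptions, not the firmware logic: the static/dynamic test (a threshold on the standard deviation of the acceleration magnitude), the threshold value, and the choice to classify a closing static window separately are all hypothetical. Only the 20 samples per second and the 3-second limit come from the text.

```python
import math
import statistics

SAMPLE_RATE_HZ = 20            # from the paper: 20 samples per second
MAX_WINDOW_SECONDS = 3         # from the paper: dynamic windows last up to 3 s
STATIC_STD_THRESHOLD = 0.05    # hypothetical threshold, in g


def is_static(chunk):
    """Assumed test: little variation in the acceleration magnitude."""
    magnitudes = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in chunk]
    return statistics.pstdev(magnitudes) < STATIC_STD_THRESHOLD


def dynamic_windows(samples):
    """Yield the windows that would be passed to feature extraction."""
    buffer = []
    for start in range(0, len(samples) - SAMPLE_RATE_HZ + 1, SAMPLE_RATE_HZ):
        chunk = samples[start:start + SAMPLE_RATE_HZ]
        if is_static(chunk):
            if buffer:               # a static second closes a dynamic window
                yield buffer
                buffer = []
            yield chunk              # static seconds are classified at once
        else:
            buffer = buffer + chunk  # keep dynamic data and look at the next second
            if len(buffer) >= MAX_WINDOW_SECONDS * SAMPLE_RATE_HZ:
                yield buffer
                buffer = []
    if buffer:                       # flush at the end of this offline demo
        yield buffer


# Synthetic stream: S = still second, M = moving second -> S M M S M M M M
still = [(0.0, 0.0, 1.0)] * SAMPLE_RATE_HZ
moving = [(0.0, 0.0, 1.0 + 0.5 * (-1) ** i) for i in range(SAMPLE_RATE_HZ)]
stream = still + moving * 2 + still + moving * 4

lengths = [len(w) for w in dynamic_windows(stream)]
print(lengths)  # [20, 40, 20, 60, 20]
```

In the demo, the two moving seconds are emitted as one 2-second window as soon as a still second follows, while a longer movement is cut at the 3-second limit.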

Evaluation indicators
Confusion matrix.
The recognition models in this study were evaluated based on the following indicators: accuracy (ACC), sensitivity (SEN), negative predictive value (NPV), and positive predictive value (PPV). These indexes were calculated from the parameters true positive (TP), true negative (TN), false positive (FP), and false negative (FN) according to the corresponding formulas (14)-(17).
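Formulas (14)-(17) are not reproduced in this section. The sketch below assumes the standard definitions of the four indicators and evaluates them with the counts that the paper gives for Sitting in its Tab 3 example (TP = 100, FN = 3, FP = 7, TN = 120). The function name is hypothetical.

```python
def indicators(tp, tn, fp, fn):
    """Standard definitions (assumed to match formulas (14)-(17))."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # share of all correct decisions
        "SEN": tp / (tp + fn),                   # share of actual positives found
        "PPV": tp / (tp + fp),                   # share of positive calls that are right
        "NPV": tn / (tn + fn),                   # share of negative calls that are right
    }


# Counts for Sitting as given in the Tab 3 example.
sitting = indicators(tp=100, tn=120, fp=7, fn=3)
for name, value in sitting.items():
    print(f"{name} = {value:.3f}")
```

For these counts the sketch gives ACC = 220/230, SEN = 100/103, PPV = 100/107, and NPV = 120/123.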
Considering the activity of sitting, TP was the number of sitting activities that were correctly classified with respect to the actual label, FN was the number of sitting activities that were misclassified, FP was the number of other activities that were mistakenly classified as sitting, and TN was the number of activities other than sitting that were correctly classified as not sitting. A specific example of the definition of the parameters TP, TN, FP, and FN is presented in Tab 3. In this example, Sitting has TP = 100, FN = 3, FP = 7, and TN = 120; Standing has TP = 50, FN = 6, FP = 2, and TN = 170.
Cross validation. A large amount of input data is needed to obtain a good recognition model. However, it was not possible to determine in advance how much data was needed. K-fold cross validation was therefore applied so that all of the data could be used for training while the model was still validated through multiple tests. This method was applied to the proposed model and to other machine learning models such as decision tree (DT), support vector machine (SVM), K nearest neighbours (KNN), and gradient boosted decision tree (GBDT). With this method, the data was divided into k equal parts, of which (k − 1) parts were used as training data and the remaining part as testing data. The model output was evaluated based on the mean (µ) and standard deviation (σ) of the results over the k data divisions. This research applied cross validation with k = 5 in the evaluation process.
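The k-fold procedure can be sketched without any machine learning library. The helper names are hypothetical, the fold scores are made-up placeholders rather than results from the paper, and the use of the population standard deviation is an assumption, since the text does not say which form of σ was used. Shuffling before splitting is omitted for brevity.

```python
import statistics


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each part is the test set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def summarize(scores):
    """Mean and (population) standard deviation over the k folds."""
    return statistics.mean(scores), statistics.pstdev(scores)


folds = list(k_fold_indices(10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Placeholder per-fold scores, NOT values reported in the paper.
placeholder_scores = [0.95, 0.96, 0.97, 0.96, 0.96]
mu, sigma = summarize(placeholder_scores)
print(f"mu = {mu:.3f}, sigma = {sigma:.4f}")
```

In practice each fold would train a fresh model on `train_idx` and score it on `test_idx`; the k scores are then summarized as µ and σ.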

Model performance
Performance evaluation on the public dataset. The classification result of the proposed model is presented in the form of a confusion matrix (Fig 17). Activities were most often misclassified because they were executed too quickly or for too long, or because they resembled other activities. For example, when a person with an injured leg was walking on crutches, the movement was so slow that it could be mistaken for standing. Squatting was a combination of sitting down, sitting, standing up, and standing for a period of time, which usually occurs when a person exercises (gym). However, if it took a long time, squatting could be mistaken for sit-down. With the public dataset, sit-down activity (6/20) was mistaken for squatting. In addition, limp and walking were confused with each other due to the similarity in posture, or in the case of people with minor leg injuries that hinder movement only slightly. Some other activities were mistakenly classified as walking, but their number was insignificant. For example, 2/80 jogging, 3/47 tripover, 1/91 sit-down, and 1/28 rising were mistaken for walking.
Details of the performance evaluation of the proposed recognition model on the public dataset are described in Tab 4. The two indexes, ACC and NPV, of all activities were 98%. In particular, sitting, bending, bendp, and lie-down had ACC, SPE, PPV, and NPV indexes of 100%. These were followed by squatting, limp, and tripover, with ACC, PPV, and NPV over 90%. The difference lay in the SPE index: limp and tripover had an SPE of over 80%, while squatting reached only 65%. This was consistent with our analysis of how long these activities take. Overall, the ACC, SPE, PPV, and NPV indexes of the proposed model were all over 93%, which was good.
The result of classifying activities when applying 5-fold cross validation on the public dataset was good (Tab 5). The proposed model had the best result, at µ = 96%. This was greater than the results for the GBDT, KNN, and SVM algorithms, at 95.3%, 94.9%, and 94.2%, respectively. The result for DT was the lowest, at 92.6%. The recognition model using the GBDT algorithm gave the lowest standard deviation, σ = 0.4%. Overall, the classification results had a low standard deviation of 0.4%-0.7%, which showed that the data from the activities was not widely scattered and had high reliability.
Performance evaluation on the private dataset. The private dataset gave a result of 99.7%. Consistent with our previous analysis, limiting the collection time of activity data (3 seconds) helped make our data highly reliable, and the negative impacts on the classification results were minimised. These negative impacts had three causes: 1) another activity interfered when an activity took place over a long period of time; 2) the duration of an activity differed between volunteers; 3) the data sampling process was not well controlled, leading to an excess or lack of activity data. These issues especially affected transitions such as rising, lie-down, sit-down, tripover, bending pick up (bendp), and squatting. If the timing of these activities could not be determined, the data describing them threatened to degrade the performance of recognition models. Fig 18 shows that most of the actions were distinguished with a 100% accuracy rate, except for squatting (24/25). Even so, the classification results obtained were impressive, and the problems encountered when classifying on the public dataset had almost been solved.
The classification results on the private dataset showed the strong performance of the proposed model. The evaluation indexes ACC, SPE, PPV, and NPV all reached over 96% (Tab 6). Except for squatting and lying, all activities had indicators reaching 100%. The exceptions were squatting, with ACC = 99.7%, SPE = 96%, PPV = 100%, NPV = 99.7%, and lying, with ACC = 99.7%, SPE = 100%, PPV = 96.6%, NPV = 100%. Nevertheless, the results of the recognition model evaluation on this dataset were impressive, with ACC = 100% and NPV = 100%.
The results of applying 5-fold cross validation on this dataset also differed from those obtained on the public dataset. The proposed model got the best results with 99.1%, but the standard

Experimental evaluation
The experimental process was conducted on volunteers for a period of 30 to 60 seconds, and the sampling frequency was 20Hz. Volunteers wore the proposed wearable and performed all 13 activities according to a predefined scenario. Transitions such as sit-down, lie-down, rising, tripover, and bendp were limited to a maximum execution time of 3 seconds. The remaining activities took place over 10-15 seconds. A portion of the experimental procedure with mixed activity sequences is shown in Fig 19. Since we concentrated on recognizing when activities started in a dynamic state, activities could be discriminated more quickly. When tested with the real sequence of activities, the device was still able to detect the activity with high accuracy even though the sampling method for the activities was uniform and no additional activities were present.
Overall, the proposed model reached a correct classification rate of 96.1%. The results of classification versus actual observation are presented as a confusion matrix in Fig 20. Transition activities were mistaken for each other when they were performed very slowly. For example, 2/52 lie-down was mistaken for lying, and 1/49 rising was mistaken for lying or sitting. This was similar to squatting, where 1/61 squatting was mistaken for rising or lie-down. However, the rate of occurrence of these errors was not significant. In general, the proposed model gave good classification results. The experimental result helped to evaluate the proposed model's performance when the input was real-time data (Tab 7). The classification accuracy for activities (ACC) and the correct prediction rate of non-occurrence actions (NPV) were both above 99%. The lowest correct prediction rate for actions (PPV) was 89.6% for the walking activity, while it reached 100% for the squatting and lie-down activities. A sensitivity (SPE) of over 90% showed that it was feasible to apply the dynamic windowing method to detect activities, especially state transition activities. For example, the bendp and lie-down activities had sensitivity indexes of SPE = 93% and SPE = 96.2%, respectively. Moreover, the rising and sit-down activities both had SPE = 100%. In general, the proposed model had good evaluation indicators in practice, with ACC = 99.4%, SPE = 96.5%, PPV = 96.2%, and NPV = 99.7%. These overall indicators differed only negligibly from those calculated when evaluating our model on the public and private datasets. In addition, in the experiments the proposed model achieved relatively uniform indexes of over 90%.
The classification performance of our model improved significantly between the public dataset and real-time data. For example, the SPE of squatting and tripover increased from 65% and 80.9% to 93.4% and 90.6%, respectively. This result was possible thanks to the improved sampling process, which increased the quality of the activity data, and to the dynamic window method, which captures the course of an activity.

Discussion
This work attempted to develop a recognition model for a real-time application that would recognize human activity. The dynamic window technique was combined with the random forest algorithm and optimized for it. As a result, the human activity recognition system became much more accurate. The results showed that the ability to classify real-time activities on wearables was good, at 96.1%, although this was slightly lower than the evaluation results on the public and private data. The first reason was the dispersion among feature vectors in real-time data. Meanwhile, with the public and private datasets, activities were closely monitored and feature quality was enhanced. Additionally, because activities occur in sequential order in daily life, the acquired data may be more homogeneous than real-time data. In reality, training and testing datasets derived from real-world scenarios may differ significantly. The existence of so many emergent scenarios made it hard to gather an exhaustive set of training data for all types of activity. As a result, the samples of test data would differ from those of training data.
The recognition model would not perform well if the training dataset was insufficient. Therefore, we collected activity data within a limited time and from many volunteers. In addition, classifying 13 activities was a significant task. Previous studies investigated repetitive activities such as sitting, lying, standing, walking, walking upstairs, and walking downstairs [43], or walking, jogging, upstairs, downstairs, sitting, and standing [20,44]. With these activities, the application of static windows gave good results [41,45,46]. However, with state transition activities, the static window had a major drawback: it could not determine the activity time, because the duration of these activities was not the same.
Many related works in the field of HAR have been interested in real-time classification capabilities and the application of different classification algorithms (Tab 8).
Thu et al. [12] applied 2 features (mean and standard deviation) of 3-axis accelerometer data on each time window of 6 seconds and combined them with a decision tree algorithm. The result when applied experimentally was 92%, and the accuracy was 95.2%. Their device classified 6 activities, including sitting, standing, lying, walking, and jogging, in real time. A three-level decision tree (DT) algorithm was built as a recognition model suitable for low-performance microcontrollers, but the classified activities were repetitive over time and had low complexity. Besides, the time of 6 seconds for each classification was too long, leading to a large delay when activities changed. As the number of activities increased and more complex ones were added, it would be difficult for their model to achieve high accuracy because of limitations on memory and on the features used. Similarly, in Yang's study [35], an STM32 microcontroller was used to embed a real-time recognition model based on a decision tree algorithm (C4.5). That study used up to 16 features from 6 measures, including mean, magnitude of the acceleration of the three axes, variance, cumulative, skewness, and coefficients. The classification accuracy for five activities, including sitting, walking, jumping, jogging, and cycling, was 90% on average. This suggested that the DT algorithm could not provide perfect accuracy, as their idea was that data acquired from several people would be trained and analyzed jointly. Consequently, it is possible that their algorithm is not optimal for the individual.
Embedding a recognition model on low-performance microcontrollers required optimization of the machine learning algorithm, the number of features, and the time window size, so the results were not good [12,35]. In particular, the static window method in those works showed limitations on the quality of features as well as on the number of activities that could be classified.
Another direction of real-time human activity recognition was presented in [22]. Suto et al. compared offline and online classification results with 15 features extracted from accelerometer and gyroscope sensors. The wearable used was a phone running the Android operating system. The device was attached to the right ankle, and the test was performed by 3 volunteers (2 men and 1 woman). The study investigated 3 methods: artificial neural networks (ANN), convolutional neural networks (CNN), and 1NN (similar to KNN). Their results showed that the ANN model was highly accurate. CNN was not a good choice for the real-time classification problem because its training time was too long. Applying a time window of 3.88 seconds with 50% overlap achieved 88.8% when performing real-time classification with 7 activities, including sitting, standing, walking, jogging, running, and lying. The accuracy was low for three activities, jogging (8.1%), cycling (13.5%), and sitting (16.2%), when the first volunteer performed the test. This showed that data collected from the ankle position was not the best choice. Results improved for the next 2 volunteers, reaching over 80% for these actions. However, the recognition model using the ANN algorithm was difficult to apply to a real-time classifier running directly on a microcontroller platform because of its long training time and high complexity. The usability of this model on phones was high, but the large size of a phone would cause discomfort when placed at the ankle. Besides, the battery capacity of a phone was not suitable for applications that operate continuously for a long time.
The application of real-time human activity recognition on wearable devices was a problem with many challenges in terms of microcontroller memory usage, algorithm complexity, sampling rate, and the time needed to survey an activity. We used up to 31 features in the time domain and a highly complex random forest algorithm to solve the classification problem of 13 routine actions. The results reached a correct classification rate of 96.1% with 99.4% accuracy, showing high applicability in practice. In addition, our model was aimed at complex problems such as analyzing the movements and activities of many different subjects, such as firefighters, searchers and rescuers, and people recovering from accidents. The state of activity was closely related to their health status. Especially for firefighters, the environment often has influencing factors such as smoke, fire, and high temperatures. Activities are often intense and constantly changing. Monitoring the activities of firefighters will help the commander to provide flexible and timely support. In addition, the fire environment is often unpredictable, so wearable-based monitoring is a viable option.

Conclusion
This work presented a real-time activity recognition system integrated into wearable devices. The dynamic window method and the random forest algorithm were combined to create the system using a novel methodology. In addition to the public dataset, we created a private dataset comprising 13 activities, including walking, jogging, squatting, bending, bendp, limp, tripover, sit-down, lie-down, rising, standing, and sitting. With the assistance of volunteers, this data was gathered via the ESP32 integrated wearable and the MPU6050 sensor (accelerometer). The proposed recognition model was evaluated on these two datasets using both the confusion matrix and 5-fold cross-validation.
The recognition model building process consisted of two steps. First, we extracted feature vectors with 31 dimensions per time window of variable size. These vectors were labeled and combined with the random forest algorithm to build a recognition model for 13 activities. Next, this model was embedded on the ESP32 (a high-speed microcontroller) and tested with real-time data. Experimental results showed that, by applying the dynamic window method, the features were highly correlated with the activities to be classified. Static activities (sitting, standing, bending, lying) were quickly detectable after just 1 second of data collection. Transition activities (rising, sit-down, lie-down, squatting, tripover) were classified with over 99% accuracy. The classification results with real-time data reached 96.1%, and the accuracy of 99.4% showed that the proposed model had high performance. Research results were shared at "https://github.com/daohieuictu/HAR-realtimerandom-forest". To increase the recognition model's accuracy, we will combine different algorithms when developing it in the future. Besides, we intend to study a support system for firefighters with complex behaviors (rolling, crawling) and survival states (falling, unconscious).

Ethical Approval
The research volunteers were students and lecturers from our university. All agreed to participate in the experiment, and their information was kept confidential.

Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.