Random and systematic errors in pairwise computer programming: A systematic review

This article presents a systematic review that identifies random and systematic errors in studies on pair programming among higher education students. Methodologically, we applied the fundamentals of the PRISMA statement. One thousand one hundred eighty articles were selected from the Scopus, Web of Science


Introduction
The rapid changes in the labor market have generated a high worldwide demand for professionals with computer skills. Many higher education graduates are unaware of the skills the labor market requires. At the same time, some universities are still not prepared to offer teaching that matches the needs of society. 1 The proliferation of big data and artificial intelligence (AI) technologies is transforming various aspects of our society, such as the economy, organizations, social relations, commercial transactions, education, and politics. 2,3 AI has become an essential tool for solving problems and making decisions. Programming is a skill that makes it possible to exploit the potential of AI and is, therefore, crucial for development and innovation. 4,5

Studies show that computer science and information systems students struggle to complete computer programming courses successfully. Learning to code is complex, and failure and dropout rates in college-level programming classes remain high. 6 There is a latent problem of low performance, student dropout, and demotivation toward computer programming in degree programs that include related subjects, and it is especially frequent in engineering programs. 7,8 The overall probability of passing a first introductory programming course at the first attempt has been 40% across all majors, with an initial failure rate of 19.5% and a dropout rate of 40.5%. 9 These results are worrisome for higher education managers, as well as for programming teachers.

Many authors have proposed methodologies for teaching programming languages. They affirm that there is no perfect or unique methodology, but that any methodology should be able to cope with all learning styles. Each student has different knowledge, motivations, conditions, and abilities, so each has a different learning process. 10,11 Learning to program is a multifaceted process, combining theory and practice. This process implies acquiring skills in analyzing, designing, and implementing computer programs and developing efficient algorithms. 12-14 For many years, and even today, most introductory programming courses have focused on developing skills with educational models and assessment systems based on the individual paradigm. These use exams and programming tasks so that students develop individual skills and can eventually program. 15-17

However, software development is rarely a solitary activity. A software project comprises different phases, from analysis to evaluation. Each phase involves a work team that shares methodologies, programming language paradigms, databases, a dictionary of terms used in the software, and methods for evaluating the software's quality. Therefore, working individually would be difficult in a real application project. Software development resembles a team sport in which working alone affects quality, team morale, and the ability to overcome the complexity and risks associated with the project. 18

Numerous scientific investigations have been carried out on pair programming. 19-23 Some researchers argue that pair programming is neither as economical nor as productive as individual programming, 24 whereas others point out the need for more studies in this regard. 25-28 The divergence in the results may be due to the validity of the investigations. In explanatory and experimental research, validity is a determining factor. Therefore, special care must be taken with the instruments used to measure the phenomenon. In all research, errors must be minimized, and they can be of two types: random and systematic. Regarding random error, the size and selection of the sample significantly influence validity. 29 Furthermore, this error is related to the concept of precision, since an estimate or measurement is more precise the smaller its random error component is. In contrast, systematic error is due to factors such as the lack of control of extrinsic variables or the poor calibration of the measuring instruments. It is essential to identify and minimize random and systematic errors in an experimental study to guarantee the validity and reliability of the results obtained. 30,31

Systematic reviews on pair programming have analyzed different factors that influence student effectiveness. Such is the case of the study by Salleh (2008), 32 who identified the factors that affect the effectiveness of students. This research focused on psychosocial factors such as compatibility, personality, and gender issues. The results showed that personality type is the most investigated factor in these studies. Subsequently, Salleh et al. (2011) conducted a meta-analysis of 74 studies to identify the factors that make pair programming highly effective. They reviewed articles published from 1999 to 2007. Their study revealed that a student's skill level greatly affected the effectiveness of pair programming. 33,34 In a recent study, Satratzemi, Stelios, and Tsompanoudi (2022) conducted a systematic literature review (SLR) that included 57 studies on distributed pair programming (DPP) in higher education. 35 The objective of this review was to identify studies that investigated factors related to the effectiveness of DPP as a method for learning to program, as well as factors associated with mediating and stimulating interactions between students. In addition, the review analyzed the measures and instruments used to explore these factors, as well as the tools and their characteristics. Xu and Correia (2023) conducted a systematic review of DPP studies published after 2010 to understand the issues and factors that impact the effectiveness of DPP teams. 36 The results showed that individual characteristics such as previous programming experience, actual and perceived ability, gender, personality, time management, confidence, and self-esteem had been the subject of significant research.

Despite these investigations, no studies have been found that focus specifically on analyzing random and systematic errors in this line of research. This article aims to systematically review pair programming studies in order to identify and analyze the random and systematic errors they contain.
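The distinction between random and systematic error described above can be illustrated with a minimal Python sketch. It simulates repeated studies that estimate a mean test score. All values (a true mean of 70, a standard deviation of 10, a constant measurement offset of 5 points, and the sample sizes) are illustrative assumptions and are not taken from any reviewed study.

```python
import random
import statistics

TRUE_MEAN = 70.0   # assumed true mean score (illustrative)
SCORE_SD = 10.0    # assumed spread of individual scores (illustrative)


def simulate_estimates(sample_size, bias, repeats=500, seed=1):
    """Return the sample means of `repeats` simulated studies.

    `bias` is a constant offset added to every measurement, standing in
    for a poorly calibrated instrument (systematic error).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        sample = [rng.gauss(TRUE_MEAN, SCORE_SD) + bias
                  for _ in range(sample_size)]
        estimates.append(statistics.fmean(sample))
    return estimates


def summarize(sample_size, bias):
    """Return (spread, offset) of the simulated estimates.

    spread: standard deviation of the estimates (random error)
    offset: mean estimate minus the true mean (systematic error)
    """
    estimates = simulate_estimates(sample_size, bias)
    spread = statistics.stdev(estimates)
    offset = statistics.fmean(estimates) - TRUE_MEAN
    return spread, offset


for n in (16, 64, 256):
    spread, offset = summarize(n, bias=5.0)
    print(f"n={n:>3}  random error (spread)={spread:.2f}  "
          f"systematic error (offset)={offset:.2f}")
```

Under these assumptions, the spread of the estimates shrinks as the sample grows, while the offset stays close to the assumed 5 points. A larger sample improves precision, but it cannot correct a poorly calibrated instrument.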

Methodology
This study was developed using the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) methodology for systematic reviews, 37 based on the guidelines of Serrano, Navarro, & González (2022). 38 The following phases were carried out:
• First phase: formulating a clear and specific question to be answered through the systematic review.
• Second phase: conducting a comprehensive literature search to identify all relevant studies that address the research question.
• Third phase: applying inclusion and exclusion criteria to the identified studies to select those that meet the eligibility criteria for the systematic review.
• Fourth phase: evaluating the methodological quality of the studies included in the systematic review using standardized tools.
• Fifth phase: extracting relevant data from each study included in the systematic review and performing statistical analysis and data synthesis.
• Sixth phase: interpreting the systematic review results based on the research question and identifying the practical implications and possible future research directions.
• Seventh phase: writing a detailed report of the systematic review. This report includes a complete description of the methodology used, the results obtained, and the review's conclusions. 39-41

Search strategy
An exhaustive search was carried out in specialized databases to find relevant information. Table 1

Inclusion and exclusion criteria
The inclusion and exclusion criteria refer to the standards and rules we previously established to determine which studies will be considered in the systematic review. These criteria are based on the research objectives and the research question addressed.

Characteristic | Inclusion | Exclusion
Participants | Higher education programming students |

Sample selection
The inclusion and exclusion criteria were applied to restrict the sample to articles related to the proposed objective. The flow diagram in Figure 1 details the 1180 articles identified. After eliminating duplicates and applying the inclusion and exclusion criteria, 93 articles remained. Further articles were then excluded for different reasons, leaving a final sample of 23.

Results
The most relevant characteristics are found in Table 3. The following attributes were considered: author, year, country, research design, sample, sample selection, type of test, control group, and test.

The results show the distribution of the selected articles according to their country of origin. Most articles (47.83%) were published in the US, followed by Greece (21.74%). The other contributing countries were Turkey, Ireland, the United Arab Emirates, Germany, the Philippines, and Australia, each representing 4.35% of the selected articles. These findings suggest that the subject is studied internationally, although it is essential to note that the database selection may have influenced this distribution.

The studies generally use a quantitative approach and an experimental design to address the research problem. In addition, 60.87% involve an intervention and include a control group, which allows a more accurate and reliable comparison of the results obtained. The remaining 39.13% of the studies did not consider a control group in their experimental designs. Non-parametric tests were used in 34.78% of the studies and parametric tests in 65.22%. Regarding the specific tests, 39.13% of the studies used the Student t-test, followed by Mann-Whitney U (17.39%), Wilcoxon (8.69%), and Factor Analysis, Kruskal-Wallis, and Chi-square with 4.35% each. In 39.13% of the studies, the participants were recruited voluntarily, and in 60.87% they were students in programming courses.
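The percentages reported above can be read back as study counts. The following minimal Python sketch does this under the assumption that every percentage is computed over the 23 selected articles. That denominator comes from the sample selection section, but applying it to each figure is an assumption of this sketch, and the helper name `implied_count` is hypothetical.

```python
TOTAL_STUDIES = 23  # final sample size reported in the sample selection

# Percentages as reported in the text for two characteristics.
reported = {
    "with control group": 60.87,
    "without control group": 39.13,
    "parametric tests": 65.22,
    "non-parametric tests": 34.78,
}


def implied_count(percent, total=TOTAL_STUDIES):
    """Return the whole number of studies implied by a percentage."""
    return round(percent * total / 100)


counts = {label: implied_count(p) for label, p in reported.items()}

for label, n in counts.items():
    # Recompute the percentage from the count as a consistency check.
    print(f"{label}: {n} of {TOTAL_STUDIES} "
          f"({100 * n / TOTAL_STUDIES:.2f}%)")
```

Under this assumption, the control-group split corresponds to 14 and 9 studies and the test-type split to 15 and 8 studies, and each pair adds up to the 23 articles in the sample.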

Figure 1. Flow diagram of the selection process, with the stages Identification, Screening, Eligibility, and Inclusion. Number of excluded records (n=1020).

In one investigation, 19 two exploratory analyses were conducted to investigate the variables affecting pair programming. In the first study, 300 first- and second-year students participated, but only 16 enrolled. This is common in part-time distance learning contexts. In the second study, 1769 students were invited, but only 24 completed the final examination. In addition, the sample was not selected randomly, which implies a bias in the results obtained.

Saltz and Shamshurin (2017) 62 evaluated pair programming in a graduate course on data science. One hundred ten students participated, divided into sections of 20 to 24 students. However, it is not specified whether the sample was selected in a representative manner, which implies a possible limitation in the results. In the research by Omer and Suleyman (2021), 61 the study group consisted of 64 volunteer junior and senior students from a university's Department of Information Technology Education and Educational Technology. This indicates a selection bias and affects the generalizability of the results. In Gehringer's (2003) study, 63 all students were surveyed at the end of the semester, but responses were received from only 59 of the 96 students who participated in the projects. This limits the sample's representativeness because of the high rate of non-respondents.

In the study by Tsompanoudi, Satratzemi, and Xinogalos (2015), 49 the participants were Computer Science students with basic knowledge of the Java programming language who were recruited voluntarily. The sample consisted of 48 students, of which 10% were first year, 23% second year, 27% third year, 11% fourth year, and 29% fifth year. Although most participants had taken a Java course in the past, an obvious selection bias was observed due to the lack of uniformity in the sample. This also limits the generalizability of the results. In the study by Sfetsos et al. (2009), 53 70 of 90 students were voluntarily selected to participate in the experiment. This shows a selection bias that affects the representativeness of the results.

It is important to note that in many studies on pair programming, a completely random assignment of students is impossible because students must be paired deliberately to maximize the chances that they will work together. 64 In addition, it is essential to specify the validity and reliability of the instruments used in these investigations. Random error is an important factor in research due to the variability inherent in measuring the variables and in the characteristics of the voluntary participants, such as motivation, abilities, and competencies. 30 In many investigations, an error is made in the selection of the subjects, since the nature of the study requires adequate samples. Samples are often chosen for convenience, which is compounded by the voluntary participation of the students. An acceptable procedure is not always followed to determine a representative sample of the students who will take part in the investigation.

Regarding the design of the analytical studies, the aim is to estimate the effect of a study factor on a response variable. To guarantee an adequate inference, in addition to the group exposed to the study factor, it is necessary to use a control group as a reference for what happens in subjects not exposed to that factor. In this way, the results obtained in both groups can be compared, and more precise and reliable conclusions can be drawn. However, 39.13% of the studies did not include a control group.
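The role of the control group described above can be sketched with a small Python simulation. It compares a design that only measures the exposed group before and after the intervention with a design that also measures an unexposed group. The data are entirely simulated. The assumed true effect of 3 points, the general improvement over the semester of 5 points, and the group size of 200 are illustrative values, not results from the reviewed studies.

```python
import random
import statistics

TRUE_EFFECT = 3.0  # assumed effect of the intervention (illustrative)
MATURATION = 5.0   # assumed improvement that every student shows anyway


def simulate_group(n_students, treatment_effect, seed):
    """Return simulated (pre, post) score lists for one group."""
    rng = random.Random(seed)
    pre = [rng.gauss(60.0, 8.0) for _ in range(n_students)]
    post = [score + MATURATION + treatment_effect + rng.gauss(0.0, 4.0)
            for score in pre]
    return pre, post


def mean_gain(pre, post):
    """Return the average change from the pre-test to the post-test."""
    return statistics.fmean(post) - statistics.fmean(pre)


# Exposed group (receives the intervention) and control group (does not).
pair_pre, pair_post = simulate_group(200, TRUE_EFFECT, seed=11)
solo_pre, solo_post = simulate_group(200, 0.0, seed=12)

# Without a control group, the whole gain is attributed to the intervention.
naive_estimate = mean_gain(pair_pre, pair_post)

# With a control group, the gain seen without exposure is subtracted.
controlled_estimate = naive_estimate - mean_gain(solo_pre, solo_post)

print(f"estimate without control group: {naive_estimate:.2f}")
print(f"estimate with control group:    {controlled_estimate:.2f}")
```

In this simulation, the estimate without a control group mixes the assumed effect with the improvement that all students show over time. Subtracting the gain of the control group brings the estimate close to the assumed effect.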

Discussion and conclusions
A systematic review of the research on pair programming in higher education was carried out, focusing on evaluating the validity of the experimental research. 65 The analysis of the reviewed articles showed that quantitative investigations with an experimental design predominate, and only one investigation followed a mixed approach (qualitative and quantitative).

Pair programming can have a positive impact on student learning. Bernadé and Liebenberg (2017) 21 found that most students enjoy working in a team in programming, which is beneficial for them. Additionally, Saltz and Shamshurin (2017) 62 reported that pair programming effectively improves communication and produces higher-quality code in less time for data scientists. 66 These findings suggest that pair programming may be a valuable strategy to improve student learning and performance in programming. A study by Isong et al. (2016) 23 examined this technique in depth. It concluded that paired work is superior to individual programming regarding task completion time, correctness, and code quality. The null hypothesis was rejected, indicating that teamwork is significantly more effective than working alone. Beasley and Johnson (2022) 44 argue that pair programming offers a significant advantage in overall classroom performance compared to individual coding, even when pairs collaborate remotely. On the other hand, Salleh et al. (2011) 32 have highlighted the influence of students' academic performance on pair programming, noting that paired students with high openness perform better than their counterparts. In addition, Gehringer (2003) has shown that through pair programming and tools to track progress, students can avoid problems that often occur in team programming. 63 Overall, these findings support pair programming as a valuable strategy to improve student learning and performance in programming. 67,68

The divergence in the results underlines the importance of analyzing the validity of the investigations. For the results to be valid, they must be generalizable to a broader population, such as all programming students worldwide. Bennedsen and Caspersen (2007) 69 give an idea of the size of this population, reporting that in 1999 more than one million students were enrolled in computing in 72 countries. However, most of the research has used convenience samples and students who participated voluntarily, which are not representative of the general population. This creates bias and causes both random and systematic errors. Therefore, more research using random samples that are more representative of the general population is needed to improve the external validity of the results. 70,71

Of the studies included in our review, 39.13% are pre-experimental. This type of research involves implementing an intervention or treatment, but without a control group to compare the results. Instead, a measurement is made before and after the intervention to determine whether there was any change in the measured variable. Because it does not allow a causal relationship to be established conclusively, this type of study is considered the weakest in terms of scientific evidence. 72 On the other hand, 60.87% of the studies included here have a quasi-experimental design. In this type of research, variables are manipulated, but the participants are not randomly assigned to the groups. Instead, groups can be formed by convenience, specific characteristics of the participants, or geographic location, among others. The results are then compared between the groups to determine whether the intervention had any effect. Although it is not as strong as a pure experimental study, this type of research is considered more robust than a pre-experimental study. 73

The results indicate that 46.66% of the investigations did not use a control group, meaning that there was no reference group with which to compare what happens in groups not exposed to pair programming. 74 Without a control group, it is difficult to determine whether the results obtained are due to pair programming. The investigations have not adequately addressed the procedures required in all experimental research, such as the reliability and validity criteria of the measurement instruments and the conditions for carrying out an empirical investigation.

In all scientific research, it is essential to design the study properly to avoid errors that may compromise the proposed objectives. However, random errors have been found in most of the investigations, which can be attributed to the voluntary selection of participants and the investigator's convenience. This results in an unrepresentative sample. In addition, the inherent variability in the attribute measurement process can affect the validity and reliability of the assessment instruments. These factors can also influence the determination of the sample, the formation of the pairs, and the selection of the control group. It has been shown that the lack of sensitivity in the application of the tests is one of the primary sources of error. Therefore, it is recommended that investigations apply more rigorous criteria in selecting the sample and applying the measurement instruments to guarantee the validity and reliability of the results obtained.