Scalable Source Code Similarity Detection in Large Code Repositories

Firas Alomari; Muhammed Harbi

doi:10.4108/eai.13-7-2018.159353

Authors

Firas Alomari Saudi Aramco (Saudi Arabia)
Muhammed Harbi Saudi Aramco (Saudi Arabia)

DOI:

https://doi.org/10.4108/eai.13-7-2018.159353

Keywords:

clones, software similarity, Control Flow Graphs, Fingerprints

Abstract

Source code similarity detection is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments are problematic because an error in the original code must be fixed in every copy, and other maintenance changes, such as extensions or patches, must likewise be applied multiple times. Furthermore, the diversity of coding styles and the flexibility of modern languages make manual inspection of large code repositories difficult and not cost-effective. Detection is therefore feasible only with automatic techniques. We present an efficient and scalable approach to identifying similar code fragments based on fingerprinting source code control flow graphs. The source code is processed to generate control flow graphs, which are then hashed to create a unique fingerprint of the code that captures semantic as well as syntactic similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation support the validity of our approach and show its effectiveness and efficiency in comparison with other solutions.
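The pipeline outlined in the abstract (derive a control-flow representation, hash it into a fingerprint, then store and look up fingerprints) can be illustrated with a small sketch. This is not the authors' implementation: it assumes Python input, replaces a full control flow graph with a simplified nested "control-flow skeleton" taken from the syntax tree, and uses SHA-256 with an in-memory dictionary as the fingerprint store. The names `skeleton`, `fingerprint`, `build_index`, and `FRAGMENTS` are hypothetical and chosen only for illustration.

```python
import ast
import hashlib
from collections import defaultdict


def skeleton(stmts):
    """Summarize the control structure of a statement list as nested tuples.

    Assumption: a simplified stand-in for a control flow graph. Identifiers,
    literals, and expressions are ignored; only branching shape is kept.
    """
    out = []
    for node in stmts:
        if isinstance(node, ast.If):
            out.append(("if", skeleton(node.body), skeleton(node.orelse)))
        elif isinstance(node, (ast.For, ast.While)):
            # Both loop forms are treated alike in this sketch.
            out.append(("loop", skeleton(node.body), skeleton(node.orelse)))
        elif isinstance(node, ast.Return):
            out.append(("return",))
        elif not out or out[-1] != ("stmt",):
            # Collapse runs of plain statements into one straight-line block.
            out.append(("stmt",))
    return tuple(out)


def fingerprint(source):
    """Hash the skeleton of the first function defined in `source`."""
    func = ast.parse(source).body[0]  # assumes one function per fragment
    shape = repr(skeleton(func.body)).encode("utf-8")
    return hashlib.sha256(shape).hexdigest()


def build_index(fragments):
    """Map each fingerprint to the names of the fragments that produce it."""
    index = defaultdict(list)
    for name, source in fragments.items():
        index[fingerprint(source)].append(name)
    return index


# Illustrative fragments: "total" and "acc" differ in naming and in the
# number of straight-line statements, but share the same control structure.
FRAGMENTS = {
    "total": (
        "def total(xs):\n"
        "    s = 0\n"
        "    for x in xs:\n"
        "        if x > 0:\n"
        "            s += x\n"
        "    return s\n"
    ),
    "acc": (
        "def acc(items):\n"
        "    result = 0\n"
        "    count = 0\n"
        "    for item in items:\n"
        "        if item.ok:\n"
        "            result = result + item.v\n"
        "    return result\n"
    ),
    "first": (
        "def first(xs):\n"
        "    if xs:\n"
        "        return xs[0]\n"
        "    return None\n"
    ),
}

index = build_index(FRAGMENTS)
# Fragments that share a fingerprint are candidate clones.
clone_groups = [names for names in index.values() if len(names) > 1]
```

In this sketch, lookup is an exact match on the hash, so only fragments with identical skeletons are grouped. Any tolerance for near-matches would need a different fingerprinting scheme, which the abstract does not specify.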

Published

04-07-2019

Issue

Vol. 6 No. 22 (2019): EAI Endorsed Transactions on Scalable Information Systems

Section

Research articles

License

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.

How to Cite

Alomari F, Harbi M. Scalable Source Code Similarity Detection in Large Code Repositories. EAI Endorsed Scal Inf Syst [Internet]. 2019 Jul. 4 [cited 2026 Jul. 24];6(22):e3. Available from: https://publications.eai.eu/index.php/sis/article/view/2161
