Scalable Image Clustering to screen for self-produced CSAM

Authors

S. Klier, H. Baier

DOI:

https://doi.org/10.4108/eetiot.6631

Keywords:

CSAM, clustering, metadata, EXIF, digital image forensics, data science, anti-forensic, source camera identification

Abstract

The number of cases involving Child Sexual Abuse Material (CSAM) has increased dramatically in recent years, resulting in significant backlogs. To protect children in the suspect’s sphere of influence, immediate identification of self-produced CSAM among acquired CSAM is paramount. Currently, investigators often rely on a simple metadata search. However, this approach scales poorly to large cases and is ineffective against anti-forensic measures. To address these problems, we bridge the gap between digital forensics and state-of-the-art data science clustering approaches. Our approach clusters more than 130,000 images, eight times more than previous work achieved, on commodity hardware within an hour, and it can scale even further. In addition, we evaluate the effectiveness of our approach on seven publicly available forensic image databases, taking into account factors such as anti-forensic measures and social media post-processing. Our results show an excellent median clustering precision (Homogeneity) of 0.92 on native images and a median clustering recall (Completeness) of over 0.92 for each test set. Importantly, we provide full reproducibility using only publicly available algorithms, implementations, and image databases.
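
To illustrate the two evaluation scores named in the abstract, the following minimal sketch computes Homogeneity (clustering precision) and Completeness (clustering recall) from their standard conditional-entropy definitions. It is not the authors' evaluation code. The function names, the device labels, and the toy cluster assignment are assumptions made up for this example.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given) in nats for two equal-length label sequences."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum(
        count / n * log(count / given_counts[g])
        for (g, _), count in joint.items()
    )


def homogeneity(true_devices, clusters):
    """1.0 when every cluster contains images of a single device only."""
    h_true = entropy(true_devices)
    if h_true == 0.0:
        return 1.0
    return 1.0 - conditional_entropy(true_devices, clusters) / h_true


def completeness(true_devices, clusters):
    """1.0 when all images of a device land in the same cluster."""
    # Completeness is homogeneity with the roles of the labelings swapped.
    return homogeneity(clusters, true_devices)


# Hypothetical toy data: two source cameras, three predicted clusters.
true_devices = ["camA"] * 4 + ["camB"] * 4
clusters = [0, 0, 1, 1, 2, 2, 2, 2]

print(f"homogeneity:  {homogeneity(true_devices, clusters):.3f}")
print(f"completeness: {completeness(true_devices, clusters):.3f}")
```

In this toy case every cluster is pure, so Homogeneity is 1.0, while the images of the hypothetical `camA` are split across two clusters, which lowers Completeness to about 0.667.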

Published

15-07-2024

How to Cite

[1]
S. Klier and H. Baier, “Scalable Image Clustering to screen for self-produced CSAM”, EAI Endorsed Trans IoT, vol. 10, Jul. 2024.