An approach for fast compressed text matching and to avoid false matching using WBTC and wavelet tree

Shashank Srivastav; P. K. Singh

doi:10.4108/eai.23-10-2020.166717

An approach for fast compressed text matching and to avoid false matching using WBTC and wavelet tree

Authors

Shashank Srivastav Madan Mohan Malaviya University of Technology
P. K. Singh Madan Mohan Malaviya University of Technology

DOI:

https://doi.org/10.4108/eai.23-10-2020.166717

Keywords:

Modern Information Retrieval, Wavelet Tree, Word-Based Tagged Code, Compressed Pattern Matching

Abstract

Text matching is a process of finding the frequency of occurrences of text pattern in a corpus. It's very costly to store, process, and retrieve a vast volume of text data. In this paper, we present a method to keep the massive text corpus in lesser memory space by using text compression and to retrieve the results by matching directly on this compressed corpus without decompression using compressed pattern matching (CPM). The proposed approach also helps to minimize the time taken to perform matching without compromising the false matching results. We used word-based tagged coding to perform text compression and Wavelet Trees for representing the compressed text in memory. The proposed Text Matching in Compressed text using Parallel Wavelet Tree (TMC_PWT) method is quite fast in comparison to other existing text matching algorithms that support CPM. In the context of CPM, the proposed method provides a good compression ratio and does not suffer from the problem of false matching.

References

Downloads

Published

23-10-2020

Issue

Vol. 8 No. 30 (2021): EAI Endorsed Transactions on Scalable Information Systems

Section

Research articles

License

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.

How to Cite

Srivastav S, Singh PK. An approach for fast compressed text matching and to avoid false matching using WBTC and wavelet tree. EAI Endorsed Scal Inf Syst [Internet]. 2020 Oct. 23 [cited 2026 Jul. 24];8(30):e6. Available from: https://publications.eai.eu/index.php/sis/article/view/2084

Download Citation

An approach for fast compressed text matching and to avoid false matching using WBTC and wavelet tree

Authors

DOI:

Keywords:

Abstract

References

Downloads

Published

Issue

Section

License

How to Cite

Make a Submission