 | The ACM Journal of Experimental Algorithmics |
Volume 5, Article 4, 2000
Fast and Flexible String Matching by Combining Bit-Parallelism and Suffix Automata
by
and
http://www.jea.acm.org/2000/NavarroString/
Abstract:
The most important features of a string matching algorithm are its efficiency
and its flexibility. Efficiency has traditionally received more attention,
while flexibility in the search pattern is becoming a more and more important
issue. Most classical string matching algorithms are aimed at quickly finding
an exact pattern in a text, being Knuth-Morris-Pratt (KMP) and the Boyer-Moore
(BM) family the most famous ones. A recent development uses deterministic
"suffix automata" to design new optimal string matching algorithms, e.g. BDM
and TurboBDM. Flexibility has been addressed quite separately by the use of
"bit-parallelism", which simulates automata in their nondeterministic form
by using bits and exploiting the intrinsic parallelism inside the computer word,
e.g., the Shift-Or algorithm. Those algorithms are extended to handle classes of
characters and errors in the pattern and/or in the text, their drawback being
their inability to skip text characters.
In this paper we merge bit-parallelism and suffix automata, so that a
nondeterministic suffix automaton is simulated using bit-parallelism. The
resulting algorithm, called BNDM, obtains the best from both worlds. It is
much simpler to implement than BDM and nearly as simple as Shift-Or. It
inherits from Shift-Or the ability to handle flexible patterns and from BDM
the ability to skip characters. BNDM is 30%-40% faster than BDM and up to
7 times faster than Shift-Or. When compared to the fastest existing algorithms
(which belong to the BM family) on exact patterns, BNDM is from 20% slower to
3 times faster, depending on the alphabet size. With respect to flexible
patterns, BNDM is by far the fastest technique to deal with classes of
characters and is competitive to search allowing errors. In particular, BNDM
seems very adequate for computational biology applications, since it is the
fastest algorithm to search on DNA sequences and flexible searching is an
important problem in that area. As a theoretical development related with
flexible pattern matching, we introduce a new automaton to recognize suffixes
of patterns with classes of characters.
- CR Categories:
F.2.2 Nonnumerical algorithms and problems [Pattern matching, Computations on discrete structures],
H.3.3 Information search and retrieval [Search process]
- Sources:
- The LaTeX
version of the article;
this is a Unix file that contains the LaTeX source;
note that you must be a subscriber (institutional or individual) to access
this file.
- The Postscript
version or PDF
version of the article;
note that you must be a subscriber (institutional or individual) to access
these files.
- The HTML version
of the article;
note that you must be a subscriber (institutional or individual) to access
this file.
- The software suite accompanying the artic
le;
this is a small Unix tar file, which includes the source code, a Makefile,
and the test files used in the article.
- The bibliography given in the article.
Received
|
Accepted
|
Final Revision
|
Published
|
August 12, 1999
|
September 9, 2000
|
January 8, 2001
|
January 11, 2001
|
Last updated and validated January 11, 2001, by editor@jea.acm.org