Alvaro VidelaRead my Thoughts. Follow my Leads.

Shift-And Visualization

January 05 2014

In this article I would like to describe the Shift-And algorithm for pattern matching, based on the book Flexible Pattern Matching in Strings. Apart from the actual description, you can scroll down for a demonstration of how the algorithm works.

Shift-And is categorized as a prefix-based algorithm, similar to KMP, but what Shift-And does is to keep a set of all prefixes p that matches a suffix of the text read so far. Keep in mind that is the suffix of the text read so far, not a suffix of the whole text.

The set of prefixes is updated using bit-parallelism, that is, the set of prefixes is kept in a bit mask called D. D is defined as: dm … d1, where m is the canonical way of referring to the pattern length. For example, if we have a six character pattern, and so far our first character matches against the text, then D would be 000001.

With every character read from the text, D will be updated, if the bit dm is active, then we have a match. Is it possible to update this set in constant time? Yes, using bit-parallel operations.

This algorithm will first build a table B, holding a bit mask bm … b1 for every character in the pattern. So, if we are searching for the word announce, the entry for the character n will look like this: 00100110, that is, the word announce has an n in the 2nd, 3rd and 6th positions, which are marked in the bit mask from right to left. So to recap: the algorithm will scan the pattern we are searching, and then it will build this table, updating the bit mask as it goes. Here’s the code in javascript:

// init the table
for (var i = 0; i < l; i++) {
    b[p.charAt(i)] = 0;
}

// build bit mask table;
for (var i = 0; i < l; i++) {
    b[p.charAt(i)] = b[p.charAt(i)] | (1 << i);
}

So the bit mask table for the word announce will end up looking like this:

var B = {
  'a': 00000001,
  'n': 00100110,
  'o': 00001000,
  'u': 00010000,
  'c': 01000000,
  'e': 10000000,
  '*': 00000000
}

Of course we will have the actual numbers, not those bit masks, with the padding and so on. We also added an * entry in the table. That’s because whenever a read character from the text doesn’t match, then the bit mask used will be 0.

Once we have the table built, we start searching for the pattern in the text. First we set D to 0, and then we start reading from the text, character by character. Let’s say we read an a, so we search for the a bit mask in our table, it will return 00000001. To calculate the new D we first shift it 1 to the left, and we OR a 1 to it : (D << 1) | 1. That value gets AND'ed to the mask we retrieved from the table, so this would be the complete line: D <- ((D << 1) | 1) & B[current-char].

We keep iterating over the text until we find a match. How do we find a match? We need to check if the bit dm is active. To do that we have to run the following operation: D & (1 << m-1). We create the following bit mask 10000000 and we AND it to D. If this is not 0, then we have a match.

Here’s the complete algorithm in a vanilla implementation using Javascript:

/**
  * string p: pattern to search (needle);
  * string text: haystack
**/
function shift_and(p, text) {
    var b = {};
    var l = p.length;
    var tl = text.length;
 
    // init the table
    for (var i = 0; i < l; i++) {
        b[p.charAt(i)] = 0;
    }

    //build bit mask table;
    for (var i = 0; i < l; i++) {
        b[p.charAt(i)] = b[p.charAt(i)] | (1 << i);
    }
 
    var d = 0;
    var matchMask = 1 << l-1;
    for (var i = 0; i < tl; i++) {
        d = ((d << 1) | 1) & (b[text.charAt(i)] | 0);
        var matched = (d & matchMask);
        if (matched != 0) {
            return i - l + 1;
        }
    }
    return -1;
}

And now to the interesting part, the visual simulation:

Shift-And visual simulation

Let me briefly explain what’s going on here. First we have a form, where you can enter the pattern to search, and the text where to search it for. That’s easy.

Once hit start, the visualization will display the text and bellow it the pattern you are searching for. Beneath them, there will be the b table and next to it, the results for each step will be reported.

For example, we will see the initial value of D, then it will be (D << 1) | 1. Then we read a character from the text, and we get the value from the table, AND'ing it to the value of D and getting the result below, and so on.

I hope the colors are pretty self explanatory (I’m colorblind so don’t trust me on this one).

So, click start and enjoy the show.

Shift-And

I hope this post has made justice to what I think is one of the most elegant algorithms that I know of. If there was a book called “Algorithms from the Book” similar to this, I’m sure this algorithm would make it.

If you are interested in learning more about Pattern Matching in Strings, then I would recommend you get a copy of the book Flexible Pattern Matching in Strings since it’s really worth every penny.

I would like to mention also Mr. @sicher who basically single handedly fixed the mess of HTML and CSS that I had created. So if you enjoyed the visual styles, then send him some props on twitter.

blog comments powered by Disqus