heuristic for pattern searching. Just like bad character heuristic, a preprocessing table is generated for good suffix heuristic.
Good Suffix Heuristic
t
be substring of text
T
which is matched with substring of pattern
P
. Now we shift pattern until : 1) Another occurrence of t in P matched with t in T. 2) A prefix of P, which matches with suffix of t 3) P moves past t
Case 1: Another occurrence of t in P matched with t in T
Pattern P might contain few more occurrences of
t
. In such case, we will try to shift the pattern to align that occurrence with t in text T. For example-
Explanation:
In the above example, we have got a substring t of text T matched with pattern P (in green) before mismatch at index 2. Now we will search for occurrence of t (“AB”) in P. We have found an occurrence starting at position 1 (in yellow background) so we will right shift the pattern 2 times to align t in P with t in T. This is weak rule of original Boyer Moore and not much effective, we will discuss a
Strong Good Suffix rule
Case 2: A prefix of P, which matches with suffix of t in T
It is not always likely that we will find the occurrence of t in P. Sometimes there is no occurrence at all, in such cases sometimes we can search for some
suffix of t
matching with some
prefix of P
and try to align them by shifting P. For example –
Explanation:
In above example, we have got t (“BAB”) matched with P (in green) at index 2-4 before mismatch . But because there exists no occurrence of t in P we will search for some prefix of P which matches with some suffix of t. We have found prefix “AB” (in the yellow background) starting at index 0 which matches not with whole t but the suffix of t “AB” starting at index 3. So now we will shift pattern 3 times to align prefix with the suffix.
Case 3: P moves past t
If the above two cases are not satisfied, we will shift the pattern past the t. For example –
Explanation:
If above example, there exist no occurrence of t (“AB”) in P and also there is no prefix in P which matches with the suffix of t. So, in that case, we can never find any perfect match before index 4, so we will shift the P past the t ie. to index 5.
Strong Good suffix Heuristic
q = P[i to n]
got matched with
t
c = P[i-1]
is the mismatching character. Now unlike case 1 we will search for t in P which is not preceded by character
c
. The closest such occurrence is then aligned with t in T by shifting pattern P. For example –
Explanation:
In above example,
q = P[7 to 8]
got matched with t in T. The mismatching character
c
is “C” at position P[6]. Now if we start searching t in P we will get the first occurrence of t starting at position 4. But this occurrence is preceded by “C” which is equal to c, so we will skip this and carry on searching. At position 1 we got another occurrence of t (in the yellow background). This occurrence is preceded by “A” (in blue) which is not equivalent to c. So we will shift pattern P 6 times to align this occurrence with t in T.We are doing this because we already know that character
c = “C”
causes the mismatch. So any occurrence of t preceded by c will again cause mismatch when aligned with t, so that’s why it is better to skip this.
Preprocessing for Good suffix heuristic
As a part of preprocessing, an array
shift
is created. Each entry
shift[i]
contain the distance pattern will shift if mismatch occur at position
i-1
. That is, the suffix of pattern starting at position
i
is matched and a mismatch occur at position
i-1
. Preprocessing is done separately for strong good suffix and case 2 discussed above.
1) Preprocessing for Strong Good Suffix
Before discussing preprocessing, let us first discuss the idea of border. A
border
is a substring which is both proper suffix and proper prefix. For example, in string
“ccacc”
“c”
“cc”
is a border because it appears in both end of string but
“cca”
is not a border. As a part of preprocessing an array
bpos
(border position) is calculated. Each entry
bpos[i]
contains the starting index of border for suffix starting at index i in given pattern P. The suffix
?
beginning at position m has no border, so
bpos[m]
m+1
m
is the length of the pattern. The shift position is obtained by the borders which cannot be extended to the left. Following is the code for preprocessing –
void preprocess_strong_suffix(int *shift, int *bpos,Explanation:
P = “ABBABAB”, m = 7