String Algorithms in C: Naive, KMP, Rabin-Karp, Trie & Suffix Array/Tree with Code Examples
1. What is pattern matching in strings?
Pattern matching involves finding all occurrences (or any occurrence) of a pattern string within a text string. It’s widely used in text processing, search engines, and bioinformatics.
2. What is the Naive Pattern Matching algorithm?
The Naive algorithm checks for the pattern at each position in the text by comparing characters one by one, sliding the pattern over the text.
3. Can you give an example of Naive Pattern Matching in C?
Example: Finding all occurrences of pattern "AABA" in text "AABAACAADAABAAABAA".
#include <stdio.h>
#include <string.h>
void naivePatternMatch(char* text, char* pattern) {
int n = strlen(text), m = strlen(pattern);
if (m == 0 || n < m) {
printf("Invalid input\n");
return;
}
for (int i = 0; i <= n - m; i++) {
int j;
for (j = 0; j < m; j++) {
if (text[i + j] != pattern[j]) break;
}
if (j == m) printf("Pattern found at index %d\n", i);
}
}
int main() {
char text[] = "AABAACAADAABAAABAA";
char pattern[] = "AABA";
naivePatternMatch(text, pattern); // Output: Pattern found at index 0, 9, 13
return 0;
}
4. What is the time complexity of Naive Pattern Matching?
Time Complexity: O(n * m) (n = text length, m = pattern length).
Space Complexity: O(1).
5. What are the advantages and disadvantages of Naive Pattern Matching?
- Advantages: Simple to implement, no preprocessing required, works for any text/pattern.
- Disadvantages: Inefficient for large texts/patterns, no optimization for repeated patterns.
6. What is the KMP Algorithm?
The Knuth-Morris-Pratt (KMP) algorithm is an efficient pattern matching algorithm that uses a longest prefix-suffix (LPS) array to skip redundant comparisons, avoiding backtracking in the text.
7. How does KMP work?
- Precompute the LPS array, where
LPS[i]is the length of the longest proper prefix that is also a suffix forpattern[0..i]. - Use LPS to skip unnecessary comparisons when a mismatch occurs.
8. Can you give an example of KMP in C?
Example: Finding all occurrences of pattern "AABA" in text "AABAACAADAABAAABAA".
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
void computeLPS(char* pattern, int m, int* lps) {
int len = 0, i = 1;
lps[0] = 0; // First element is always 0
while (i < m) {
if (pattern[i] == pattern[len]) {
len++;
lps[i] = len;
i++;
} else {
if (len != 0) len = lps[len - 1]; // Fall back to previous prefix
else lps[i] = 0, i++;
}
}
}
void kmpPatternMatch(char* text, char* pattern) {
int n = strlen(text), m = strlen(pattern);
if (m == 0 || n < m) {
printf("Invalid input\n");
return;
}
int* lps = (int*)malloc(m * sizeof(int));
if (!lps) {
printf("Memory allocation failed\n");
return;
}
computeLPS(pattern, m, lps);
int i = 0, j = 0;
while (i < n) {
if (pattern[j] == text[i]) {
i++;
j++;
}
if (j == m) {
printf("Pattern found at index %d\n", i - j);
j = lps[j - 1];
} else if (i < n && pattern[j] != text[i]) {
if (j != 0) j = lps[j - 1];
else i++;
}
}
free(lps);
}
int main() {
char text[] = "AABAACAADAABAAABAA";
char pattern[] = "AABA";
kmpPatternMatch(text, pattern); // Output: Pattern found at index 0, 9, 13
return 0;
}
9. What is the time complexity of KMP?
Time Complexity: O(n + m) (n = text length, m = pattern length).
Space Complexity: O(m) (LPS array).
10. What are the advantages and disadvantages of KMP?
- Advantages: Efficient (
O(n + m)), no backtracking in text. - Disadvantages: Requires preprocessing, more complex than Naive.
11. What is the Rabin-Karp Algorithm?
Rabin-Karp is a pattern matching algorithm that uses hashing to compare the pattern with substrings of the text, reducing character comparisons.
12. How does Rabin-Karp work?
- Compute the hash of the pattern and the first substring of the text.
- Slide the pattern, updating the hash using a rolling hash function.
- If hashes match, verify characters to confirm the match.
13. Can you give an example of Rabin-Karp in C?
Example: Finding all occurrences of pattern "AABA" in text "AABAACAADAABAAABAA".
#include <stdio.h>
#include <string.h>
#define d 256 // Number of characters in input alphabet
#define q 101 // A prime number for modulo
void rabinKarp(char* text, char* pattern) {
int n = strlen(text), m = strlen(pattern);
if (m == 0 || n < m) {
printf("Invalid input\n");
return;
}
int i, j, p = 0, t = 0, h = 1;
// Calculate h = pow(d, m-1) % q
for (i = 0; i < m - 1; i++) h = (h * d) % q;
// Calculate hash for pattern and first window
for (i = 0; i < m; i++) {
p = (d * p + pattern[i]) % q;
t = (d * t + text[i]) % q;
}
// Slide pattern over text
for (i = 0; i <= n - m; i++) {
if (p == t) { // Check characters if hashes match
for (j = 0; j < m; j++) {
if (text[i + j] != pattern[j]) break;
}
if (j == m) printf("Pattern found at index %d\n", i);
}
if (i < n - m) { // Update hash for next window
t = (d * (t - text[i] * h) + text[i + m]) % q;
if (t < 0) t += q;
}
}
}
int main() {
char text[] = "AABAACAADAABAAABAA";
char pattern[] = "AABA";
rabinKarp(text, pattern); // Output: Pattern found at index 0, 9, 13
return 0;
}
14. What is the time complexity of Rabin-Karp?
Time Complexity:
- Average:
O(n + m) - Worst:
O(n * m)(many hash collisions)
Space Complexity: O(1).
15. What are the advantages and disadvantages of Rabin-Karp?
- Advantages: Efficient average case, adaptable for multiple patterns.
- Disadvantages: Hash collisions can degrade to
O(n * m), requires a good hash function.
16. How do Naive, KMP, and Rabin-Karp compare?
- Naive: Simple,
O(n * m), no preprocessing. - KMP:
O(n + m), preprocessing with LPS, deterministic. - Rabin-Karp:
O(n + m)average, hashing-based, probabilistic but versatile.
17. What are Trie-based problems?
Trie-based problems involve using a Trie data structure to solve string-related tasks, leveraging its prefix-based structure for efficient operations like searching, autocomplete, or pattern matching.
18. What are common Trie-based problems?
- Word Search: Find if a word exists in a dictionary.
- Autocomplete: Suggest words with a given prefix.
- Longest Common Prefix: Find the longest prefix shared by a set of strings.
- Spell Checker: Check if a word is valid.
19. Can you give an example of a Trie-based problem (Longest Common Prefix) in C?
Example: Finding the longest common prefix for words {"flower", "flow", "flight"}.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ALPHABET_SIZE 26
struct TrieNode {
struct TrieNode* children[ALPHABET_SIZE];
int count; // Number of words passing through node
bool isEndOfWord;
};
struct TrieNode* createNode() {
struct TrieNode* node = (struct TrieNode*)malloc(sizeof(struct TrieNode));
if (!node) {
printf("Memory allocation failed\n");
exit(1);
}
node->count = 0;
node->isEndOfWord = false;
for (int i = 0; i < ALPHABET_SIZE; i++) node->children[i] = NULL;
return node;
}
void insert(struct TrieNode* root, char* word) {
struct TrieNode* curr = root;
for (int i = 0; word[i]; i++) {
int index = word[i] - 'a';
if (!curr->children[index]) curr->children[index] = createNode();
curr = curr->children[index];
curr->count++;
}
curr->isEndOfWord = true;
}
void freeTrie(struct TrieNode* root) {
if (!root) return;
for (int i = 0; i < ALPHABET_SIZE; i++) {
freeTrie(root->children[i]);
}
free(root);
}
void longestCommonPrefix(struct TrieNode* root, char* result, int n) {
struct TrieNode* curr = root;
int index = 0;
while (curr) {
int singleChild = -1;
if (curr->count != n) break; // Stop if not all words share this prefix
for (int i = 0; i < ALPHABET_SIZE; i++) {
if (curr->children[i]) {
if (singleChild == -1) singleChild = i;
else break; // More than one child
}
}
if (singleChild == -1) break;
result[index++] = 'a' + singleChild;
curr = curr->children[singleChild];
}
result[index] = '\0';
}
int main() {
char* words[] = {"flower", "flow", "flight"};
int n = 3;
struct TrieNode* root = createNode();
for (int i = 0; i < n; i++) insert(root, words[i]);
char result[100];
longestCommonPrefix(root, result, n);
printf("Longest Common Prefix: %s\n", result); // Output: fl
freeTrie(root); // Clean up memory
return 0;
}
20. What is the time complexity of Trie-based operations?
- Insert/Search:
O(m)(m = word length). - Longest Common Prefix:
O(min(m, k))(m = shortest word length, k = prefix length). - Space Complexity:
O(ALPHABET_SIZE * N * m)(N = number of words).
21. What are the advantages of Tries for string problems?
- Advantages: Fast prefix-based queries, efficient for dictionary-like operations.
- Disadvantages: High memory usage, limited to specific character sets.
22. What is a Suffix Array?
A Suffix Array is an array of integers representing the starting indices of all suffixes of a string, sorted in lexicographical order. It’s used for efficient string matching and substring queries.
23. How is a Suffix Array constructed?
- Generate all suffixes of the string.
- Sort the suffixes lexicographically.
- Store the starting indices of the sorted suffixes.
- Example: For "banana", suffixes are {banana, anana, nana, ana, na, a}, and the suffix array is [5, 3, 1, 0, 4, 2].
24. What are the applications of Suffix Arrays?
- Pattern matching (faster than Naive but simpler than KMP/Rabin-Karp for some cases).
- Longest Common Prefix (LCP) array for substring problems.
- String searching in text editors or bioinformatics.
25. What is the time and space complexity of Suffix Arrays?
- Construction:
O(n log n)with efficient sorting (e.g., DC3/Skew algorithm can achieveO(n)). - Pattern Matching:
O(m + log n)(m = pattern length). - Space Complexity:
O(n)(array storage).
26. What is a Suffix Tree?
A Suffix Tree is a compressed trie containing all suffixes of a string, where each path from the root to a leaf represents a suffix. Edges store substrings, reducing space usage.
27. How is a Suffix Tree constructed?
- Naive Approach: Insert each suffix into a trie (
O(n²)). - Ukkonen’s Algorithm: Builds the tree in
O(n)by incrementally adding suffixes and using suffix links for efficiency. - Example: For "banana", the tree has paths for {banana$, anana$, nana$, ana$, na$, a$}, with $ as a terminator.
28. What are the applications of Suffix Trees?
- Exact string matching.
- Longest repeated substring.
- Longest common substring of multiple strings.
- Pattern searching in DNA sequences.
29. What is the time and space complexity of Suffix Trees?
- Construction:
O(n)(Ukkonen’s algorithm). - Pattern Matching:
O(m)(m = pattern length). - Space Complexity:
O(n)(compressed structure, but high constant factor).
30. What are the advantages and disadvantages of Suffix Arrays vs. Suffix Trees?
- Suffix Arrays:
- Advantages: Simpler, less memory (
O(n)), practical for large strings. - Disadvantages: Slower queries (
O(m + log n)vs.O(m)), less flexible.
- Advantages: Simpler, less memory (
- Suffix Trees:
- Advantages: Faster queries (
O(m)), supports complex string operations. - Disadvantages: High memory usage, complex construction.
- Advantages: Faster queries (
31. Why are Suffix Arrays/Trees considered advanced?
They handle complex string problems efficiently, but their construction (e.g., Ukkonen’s algorithm) and applications (e.g., LCP array) require a deep understanding of string properties and optimization.