Algorithm for comparing word with randomly distributed substring

asdf23211

Hello, I'm facing a strange problem. I have a word for example "ABCDE". I want to match this word againsta sentence like "ABC EFGH IJKL DE". As you can see, the query word is splited and distributed across the senetence. How can I design algorithm which can give me result like the query word has acombination in the sentence and their position. An example; QUERY word: ABCDEF TARGET sentence: EFGH UIOP ABC GHY JKLU EF I expect the output as the query word has a perfect match by the combining words 3,6. I tried several algorithms like smithwatermann, levenstein distance and others, but I could figure out how I can compare string with randomly distributed sub-strings in a sentence. One way of doing is, breaking sentence into words and make all possible combinations and start matching with query which would take forever if I have 500 words. Any help or suggestion would be highly appreciated.

Sanmayce

Hi srikanth2321, probably there are some better ways to do that, but my suggestion is the most simple: Step1: Do Wildcard search for *ABCDE* Step2: Do Wildcard search for *ABCD*E* Step3: Do Wildcard search for *ABC*DE* Step4: Do Wildcard search for *AB*CDE* Step5: Do Wildcard search for *A*BCDE* My friend, I sense the problem, you should be more precise both in testing and explaining, for instance your second example is buggy. If you want to play you can try my superb search tool Kazahana (downloadable here at CP). Let me know if you find any difficulties. The first exmple done with Step3:

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>copy con test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF
^Z
1 file(s) copied.

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>Kazahana.exe "*ABC*DE*" test.txt 999
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, r. 1-++fix+nowait_critical_nixFIX_Wolfram+fixITER+EX+CS, copyleft Kaze 2014-Mar-25.
Enforcing Case Insensitive wildcard mode ...
Enforcing SLOW wildcard mode ...
Pattern: *abc*de*
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 999KB ... OK

Kazahana: Total/Checked/Dumped xgrams: 2/2/1
Kazahana: Performance: 0 KB/clock
Kazahana: Performance: 0 xgrams/clock
Kazahana: Performance: Total/fread() clocks: 17/0
Kazahana: Performance: I/O time, i.e. fread() time, is 0 percents
Kazahana: Performance: RDTSC I/O time, i.e. fread() time, is 0 ticks
Kazahana: Done.

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type Kazahana.txt
ABC EFGH IJKL DE

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>

Sanmayce

Okay, this is more serious, 31 patterns are needed to exhaust your request. 0 internal '*': Step01: *ABCDE* 1 internal '*': Step02: *ABCD*E* Step03: *ABC*DE* Step04: *AB*CDE* Step05: *A*BCDE* 2 internal '*': Step06: *A*B*CDE* Step07: *A*BC*DE* Step08: *A*BCD*E* Step09: *AB*C*DE* Step10: *AB*CD*E* Step11: *ABC*D*E* 3 internal '*': Step12: *A*B*C*DE* Step13: *A*B*CD*E* Step14: *A*BC*D*E* Step15: *AB*C*D*E* Above 15 patterns are optional, they would yield hits with 'high' rank earlier than all the 31 ones below. Finding all substrings (without elision): Step01+15: *A*B*C*D*E* Finding all substrings (with elision): C(5,4) equals number of wildcard patterns, 5 choose 4 = 5!/(4!*(5-4)!) = 5*4*3*2*1/(4*3*2*1*1) = 5 Step02+15: *B*C*D*E* Step03+15: *A*C*D*E* Step04+15: *A*B*D*E* Step05+15: *A*B*C*E* Step06+15: *A*B*C*D* C(5,3) equals number of wildcard patterns, 5 choose 3 = 5!/(3!*(5-3)!) = 5*4*3*2*1/(3*2*1*2*1) = 10 Step07+15: *A*B*C* Step08+15: *A*B*D* Step09+15: *A*B*E* Step10+15: *A*C*D* Step11+15: *A*C*E* Step12+15: *A*D*E* Step13+15: *B*C*D* Step14+15: *B*C*E* Step15+15: *B*D*E* Step16+15: *C*D*E* C(5,2) equals number of wildcard patterns, 5 choose 2 = 5!/(2!*(5-2)!) = 5*4*3*2*1/(2*1*3*2*1) = 10 Step17+15: *A*B* Step18+15: *A*C* Step19+15: *A*D* Step20+15: *A*E* Step21+15: *B*C* Step22+15: *B*D* Step23+15: *B*E* Step24+15: *C*D* Step25+15: *C*E* Step26+15: *D*E* C(5,1) equals number of wildcard patterns, 5 choose 1 = 5!/(1!*(5-1)!) = 5*4*3*2*1/(1*4*3*2*1) = 5 Step27+15: *A* Step28+15: *B* Step29+15: *C* Step30+15: *D* Step31+15: *E* In essence the number e.g. C(5,4) equals all ways to order any four of five elements, which is 5!/(5-4)! divided by all ways to order any group of 4, which is 4!. Thus, the number of distinct ways to choose 4 out of 5 is 120 divided by 24. By chance I wrote the fastest wildcard search function (iterative) so these 31 patterns won't be much of a problem:

int WildcardMatch_Iterative_Kaze(const char* mask, const char* name) {
// Revision 1:
/*
const char* maskSTACK;
const char* nameSTACK;
int BacktrackFlag = 0;
Backtrack:
for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
if (*maskSTACK == '&') {
mask = maskSTACK+1;
if (!*mask) return 1;
name = nameSTACK;
BacktrackFlag = -1;
goto Backtrack;
}
//else if (*maskSTACK == '+') {
//} else {
else if (*maskSTACK != '+') {
//if (tolower(*nameSTACK) != tolower(*maskSTACK)) {