Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • World
  • Users
  • Groups
Skins
  • Light
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (No Skin)
  • No Skin
Collapse
Code Project
  1. Home
  2. General Programming
  3. Algorithms
  4. Algorithm for comparing word with randomly distributed substring

Algorithm for comparing word with randomly distributed substring

Scheduled Pinned Locked Moved Algorithms
helpdatabasedesignalgorithmsregex
3 Posts 2 Posters 0 Views 1 Watching
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • A Offline
    A Offline
    asdf23211
    wrote on last edited by
    #1

    Hello, I'm facing a strange problem. I have a word for example "ABCDE". I want to match this word againsta sentence like "ABC EFGH IJKL DE". As you can see, the query word is splited and distributed across the senetence. How can I design algorithm which can give me result like the query word has acombination in the sentence and their position. An example; QUERY word: ABCDEF TARGET sentence: EFGH UIOP ABC GHY JKLU EF I expect the output as the query word has a perfect match by the combining words 3,6. I tried several algorithms like smithwatermann, levenstein distance and others, but I could figure out how I can compare string with randomly distributed sub-strings in a sentence. One way of doing is, breaking sentence into words and make all possible combinations and start matching with query which would take forever if I have 500 words. Any help or suggestion would be highly appreciated.

    S 2 Replies Last reply
    0
    • A asdf23211

      Hello, I'm facing a strange problem. I have a word for example "ABCDE". I want to match this word againsta sentence like "ABC EFGH IJKL DE". As you can see, the query word is splited and distributed across the senetence. How can I design algorithm which can give me result like the query word has acombination in the sentence and their position. An example; QUERY word: ABCDEF TARGET sentence: EFGH UIOP ABC GHY JKLU EF I expect the output as the query word has a perfect match by the combining words 3,6. I tried several algorithms like smithwatermann, levenstein distance and others, but I could figure out how I can compare string with randomly distributed sub-strings in a sentence. One way of doing is, breaking sentence into words and make all possible combinations and start matching with query which would take forever if I have 500 words. Any help or suggestion would be highly appreciated.

      S Offline
      S Offline
      Sanmayce
      wrote on last edited by
      #2

      Hi srikanth2321, probably there are some better ways to do that, but my suggestion is the most simple: Step1: Do Wildcard search for *ABCDE* Step2: Do Wildcard search for *ABCD*E* Step3: Do Wildcard search for *ABC*DE* Step4: Do Wildcard search for *AB*CDE* Step5: Do Wildcard search for *A*BCDE* My friend, I sense the problem, you should be more precise both in testing and explaining, for instance your second example is buggy. If you want to play you can try my superb search tool Kazahana (downloadable here at CP). Let me know if you find any difficulties. The first exmple done with Step3:

      D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>copy con test.txt
      ABC EFGH IJKL DE
      EFGH UIOP ABC GHY JKLU EF
      ^Z
      1 file(s) copied.

      D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type test.txt
      ABC EFGH IJKL DE
      EFGH UIOP ABC GHY JKLU EF

      D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>Kazahana.exe "*ABC*DE*" test.txt 999
      Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, r. 1-++fix+nowait_critical_nixFIX_Wolfram+fixITER+EX+CS, copyleft Kaze 2014-Mar-25.
      Enforcing Case Insensitive wildcard mode ...
      Enforcing SLOW wildcard mode ...
      Pattern: *abc*de*
      omp_get_num_procs( ) = 2
      omp_get_max_threads( ) = 2
      Enforcing HEXADECAD i.e. hexadecuple-threads ...
      Allocating Master-Buffer 999KB ... OK

      Kazahana: Total/Checked/Dumped xgrams: 2/2/1
      Kazahana: Performance: 0 KB/clock
      Kazahana: Performance: 0 xgrams/clock
      Kazahana: Performance: Total/fread() clocks: 17/0
      Kazahana: Performance: I/O time, i.e. fread() time, is 0 percents
      Kazahana: Performance: RDTSC I/O time, i.e. fread() time, is 0 ticks
      Kazahana: Done.

      D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type Kazahana.txt
      ABC EFGH IJKL DE

      D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>

      1 Reply Last reply
      0
      • A asdf23211

        Hello, I'm facing a strange problem. I have a word for example "ABCDE". I want to match this word againsta sentence like "ABC EFGH IJKL DE". As you can see, the query word is splited and distributed across the senetence. How can I design algorithm which can give me result like the query word has acombination in the sentence and their position. An example; QUERY word: ABCDEF TARGET sentence: EFGH UIOP ABC GHY JKLU EF I expect the output as the query word has a perfect match by the combining words 3,6. I tried several algorithms like smithwatermann, levenstein distance and others, but I could figure out how I can compare string with randomly distributed sub-strings in a sentence. One way of doing is, breaking sentence into words and make all possible combinations and start matching with query which would take forever if I have 500 words. Any help or suggestion would be highly appreciated.

        S Offline
        S Offline
        Sanmayce
        wrote on last edited by
        #3

        Okay, this is more serious, 31 patterns are needed to exhaust your request. 0 internal '*': Step01: *ABCDE* 1 internal '*': Step02: *ABCD*E* Step03: *ABC*DE* Step04: *AB*CDE* Step05: *A*BCDE* 2 internal '*': Step06: *A*B*CDE* Step07: *A*BC*DE* Step08: *A*BCD*E* Step09: *AB*C*DE* Step10: *AB*CD*E* Step11: *ABC*D*E* 3 internal '*': Step12: *A*B*C*DE* Step13: *A*B*CD*E* Step14: *A*BC*D*E* Step15: *AB*C*D*E* Above 15 patterns are optional, they would yield hits with 'high' rank earlier than all the 31 ones below. Finding all substrings (without elision): Step01+15: *A*B*C*D*E* Finding all substrings (with elision): C(5,4) equals number of wildcard patterns, 5 choose 4 = 5!/(4!*(5-4)!) = 5*4*3*2*1/(4*3*2*1*1) = 5 Step02+15: *B*C*D*E* Step03+15: *A*C*D*E* Step04+15: *A*B*D*E* Step05+15: *A*B*C*E* Step06+15: *A*B*C*D* C(5,3) equals number of wildcard patterns, 5 choose 3 = 5!/(3!*(5-3)!) = 5*4*3*2*1/(3*2*1*2*1) = 10 Step07+15: *A*B*C* Step08+15: *A*B*D* Step09+15: *A*B*E* Step10+15: *A*C*D* Step11+15: *A*C*E* Step12+15: *A*D*E* Step13+15: *B*C*D* Step14+15: *B*C*E* Step15+15: *B*D*E* Step16+15: *C*D*E* C(5,2) equals number of wildcard patterns, 5 choose 2 = 5!/(2!*(5-2)!) = 5*4*3*2*1/(2*1*3*2*1) = 10 Step17+15: *A*B* Step18+15: *A*C* Step19+15: *A*D* Step20+15: *A*E* Step21+15: *B*C* Step22+15: *B*D* Step23+15: *B*E* Step24+15: *C*D* Step25+15: *C*E* Step26+15: *D*E* C(5,1) equals number of wildcard patterns, 5 choose 1 = 5!/(1!*(5-1)!) = 5*4*3*2*1/(1*4*3*2*1) = 5 Step27+15: *A* Step28+15: *B* Step29+15: *C* Step30+15: *D* Step31+15: *E* In essence the number e.g. C(5,4) equals all ways to order any four of five elements, which is 5!/(5-4)! divided by all ways to order any group of 4, which is 4!. Thus, the number of distinct ways to choose 4 out of 5 is 120 divided by 24. By chance I wrote the fastest wildcard search function (iterative) so these 31 patterns won't be much of a problem:

        int WildcardMatch_Iterative_Kaze(const char* mask, const char* name) {
        // Revision 1:
        /*
        const char* maskSTACK;
        const char* nameSTACK;
        int BacktrackFlag = 0;
        Backtrack:
        for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
        if (*maskSTACK == '&') {
        mask = maskSTACK+1;
        if (!*mask) return 1;
        name = nameSTACK;
        BacktrackFlag = -1;
        goto Backtrack;
        }
        //else if (*maskSTACK == '+') {
        //} else {
        else if (*maskSTACK != '+') {
        //if (tolower(*nameSTACK) != tolower(*maskSTACK)) {

        1 Reply Last reply
        0
        Reply
        • Reply as topic
        Log in to reply
        • Oldest to Newest
        • Newest to Oldest
        • Most Votes


        • Login

        • Don't have an account? Register

        • Login or register to search.
        • First post
          Last post
        0
        • Categories
        • Recent
        • Tags
        • Popular
        • World
        • Users
        • Groups