regex help

  uglyeyes wrote:

    Hi, I need to extract text from a big file. The input looks like this:

    <div class="a">apple</div>
    <p>  </p>
    <p>red delicious</p>
    <div class="b">banana</div>
    <p>  </p>
    <p>riped banana</p>
    <div class="c">chives</div>
    <p>  </p>
    <p>fresh green chives</p>

    and I need to turn it into this:

    'apple', 'red delicious'
    'banana', 'riped banana'
    'chives', 'fresh green chives'

    so that I can enter each pair into a database. I would really appreciate a regex that could do this. Thanks for your help! Please note that the text is a concatenation of multiple HTML pages.

    vivasaayi wrote (#3):

    The string you presented is valid XML apart from the missing root element. If performance is not an issue, first add a root element; then you can use XmlDocument or XmlReader to extract the information you need.
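A minimal sketch of the root-element idea, assuming the fragment really is well-formed and contains no HTML-only entities such as `&nbsp;` (an XML parser rejects those, as it does unclosed tags). The `<root>` wrapper name, the `FragmentParser` class, and the `Extract` method are made up for illustration; only `XmlDocument` and its XPath methods come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FragmentParser
{
    // Wraps the fragment in a hypothetical <root> element so it parses as a
    // single XML document, then pairs each <div> with the first following
    // sibling <p> whose text is not blank.
    public static List<KeyValuePair<string, string>> Extract(string fragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + fragment + "</root>");

        var pairs = new List<KeyValuePair<string, string>>();
        foreach (XmlNode div in doc.DocumentElement.SelectNodes("div"))
        {
            // normalize-space() is empty for the blank spacer paragraphs,
            // so the predicate skips them.
            XmlNode description =
                div.SelectSingleNode("following-sibling::p[normalize-space()][1]");
            pairs.Add(new KeyValuePair<string, string>(
                div.InnerText.Trim(),
                description == null ? "" : description.InnerText.Trim()));
        }
        return pairs;
    }

    public static void Main()
    {
        string sample =
            "<div class=\"a\">apple</div> <p>  </p> <p>red delicious</p> " +
            "<div class=\"b\">banana</div> <p>  </p> <p>riped banana</p> " +
            "<div class=\"c\">chives</div> <p>  </p> <p>fresh green chives</p>";

        foreach (KeyValuePair<string, string> pair in Extract(sample))
        {
            Console.WriteLine("'" + pair.Key + "', '" + pair.Value + "'");
        }
    }
}
```

One caveat of this pairing rule: a `<div>` with no description of its own would pick up the description of the next product, so real data may need a stricter check.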

      uglyeyes wrote (#4):

      No, only the text I provided is valid XML. The entire CSV file contains text like ----1---- my contents.. ----2---- my content2.. ... ... So could you please help me extract the data I want? I have this regex, which seems to work for getting the text inside the div tag:

      (?<=\<div class=""middlead""\>).*?(?=\</div\>)

      But I need to get the description of the apple too. I tried the code below:

      string fName = @"data.txt"; // path to the text file
      StreamReader testTxt = new StreamReader(fName);
      string allRead = testTxt.ReadToEnd(); // read the whole text file
      testTxt.Close(); // close the file once it is fully read

      //Regex rx = new Regex(@"(?<=\<div class=""middlead""\>).*?(?=\</div\>)", RegexOptions.Singleline);
      Regex rx1 = new Regex(@"(?<=\<p\>&nbsp;&nbsp;&nbsp;\</p\>).*?(?=\</p\>)", RegexOptions.Singleline);

      //MatchCollection matches = rx.Matches(allRead);
      MatchCollection matches1 = rx1.Matches(allRead);

      StreamWriter sw = new StreamWriter(@"realdata.txt");
      int count = 0;
      foreach (Match match in matches1)
      {
          sw.WriteLine(count.ToString());
          sw.WriteLine(match.ToString());

          // Note: this inner loop writes every match again for each outer match.
          foreach (Match match1 in matches1)
          {
              sw.WriteLine(match1.ToString());
          }
          count++;
      }
      sw.Close();

      But somehow regex rx1 does not return only the text I want; it seems to match greedily and tries to match everything that has

      Could you please help me extract the descriptions of those products?

        realJSOP wrote (#5):

        Once again, html IS xml. Just use Linq-To-XML to parse it - using regex is a WASTE OF TIME. It's easy - really. All you have to do is man-up and do some frakking research. It's just a few lines of code.

        .45 ACP - because shooting twice is just silly
        -----
        "Why don't you tie a kerosene-soaked rag around your ankles so the ants won't climb up and eat your candy ass..." - Dale Earnhardt, 1997
        -----
        "The staggering layers of obscenity in your statement make it a work of art on so many levels." - J. Jystad, 2001

          uglyeyes wrote (#6):

          Please note that the description content doesn't have an ID, so I can't really use the DOM. Any suggestions on how to get the text inside an element with no id?

            basantakumar wrote (#7):

            Hi, please use the regex pattern below to get your result:

            .*?>([a-zA-Z0-9].*?)<

            It returns the following collection:

            apple
            red delicious
            banana
            riped banana
            chives
            fresh green chives

            Please let me know if you have any questions.
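For illustration, here is one way the pattern above could be applied from C#, reading capture group 1 from each match. This is a sketch against the simplified one-line sample from the first post; the `TagTextDemo` class and `TextRuns` method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagTextDemo
{
    // Collects capture group 1 of every match: a run of text that follows a
    // '>' and starts with a letter or digit, up to the next '<'.
    public static List<string> TextRuns(string html)
    {
        var runs = new List<string>();
        foreach (Match match in Regex.Matches(html, @".*?>([a-zA-Z0-9].*?)<"))
        {
            runs.Add(match.Groups[1].Value);
        }
        return runs;
    }

    public static void Main()
    {
        string sample =
            "<div class=\"a\">apple</div> <p>  </p> <p>red delicious</p> " +
            "<div class=\"b\">banana</div> <p>  </p> <p>riped banana</p> " +
            "<div class=\"c\">chives</div> <p>  </p> <p>fresh green chives</p>";

        foreach (string run in TextRuns(sample))
        {
            Console.WriteLine(run);
        }
    }
}
```

The pattern is not tied to any particular tag, so on a concatenation of full HTML pages it would also return unrelated visible text such as titles or link labels.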

              uglyeyes wrote (#8):

              Hi, thanks, but the text I provided was just an example. The real text looks something like the sample below, and I need a regex for it:

              <p>   </p>
              <p>red delicious.</p>
              <div class="central"><p>   </p>
              <FORM action="product.asp" method="post" >
              <INPUT type="hidden" name="ss" value="3">
              <INPUT type="hidden" name="xx" value="xx">
              <INPUT type="hidden" name="yy" value="yy">
              <input type="image" src="./main/apple.png" value="Click here" />
              </form><p>   </p>
              <p>   </p>
              <p>riped pear.</p>
              <div class="central"><p>   </p>
              <FORM action="product.asp" method="post" >
              <INPUT type="hidden" name="ss" value="3">
              <INPUT type="hidden" name="xx" value="xx">
              <INPUT type="hidden" name="yy" value="yy">
              <input type="image" src="./main/pear.png" value="Click here" />
              </form><p>   </p>

              This one is not working for me, as it is getting more text than I need:

              (?<=\s\s\<p\>&nbsp;&nbsp;&nbsp;\</p\>\n\t\t\s\s\<p\>).*?(?=\</p\>\n\t\t\s\s\<div class=""central""\>)

              modified on Wednesday, January 27, 2010 6:14 PM

                uglyeyes wrote (#9):

                I am not sure why the regex below fails in the Visual Studio editor:

                (?<=\<p\>   \</p\>\n\t+:b+).*(\n\t+:b+\<div class="central"\>)

                Please note that \t+ and :b+ are there because there are exactly 2 tabs and 2 spaces before the text I want to match. If I use only

                \<p\>   \</p\>\n\t+:b+

                it highlights the text preceding the text I want. I am not sure why selecting the text between the groups does not work in Visual Studio. I am running out of ideas, please help!

                  uglyeyes wrote (#10):

                  I tested with RegexBuddy: for the text "apple", my regex (?<=a).*?(?=e) returns "ppl". Now I want to get the text in between using the regex below:

                  (?<=\<p\>   \</p\>).*?(?=\<div class="central"\>)

                  on this text:

                  <p>   </p>
                  <p>red delicious.</p>
                  <div class="central"><p>   </p>
                  <FORM action="x.asp" method="post" >
                  <INPUT type="hidden" name="oradvertiser" value="3">
                  <INPUT type="hidden" name="xx" value="test">
                  <INPUT type="hidden" name="xy" value="test">
                  <input type="image" src="./main/apple.png" value="Click here" onmouseout="this.style.border='5px solid silver';" />
                  </form><p>   </p>
                  <p>   </p>
                  <p>riped pear.</p>
                  <div class="central"><p>   </p>
                  <FORM action="x.asp" method="post" >
                  <INPUT type="hidden" name="or" value="3">
                  <INPUT type="hidden" name="xx" value="test">
                  <INPUT type="hidden" name="xy" value="test">
                  <input type="image" src="./main/pear.png" value="Click here" onmouseout="this.style.border='5px solid silver';" />
                  </form><p>   </p>
                  <p>dummy text</p>

                  But it does not give me "red delicious" and "riped pear". Could you please help?

                    uglyeyes wrote (#11):

                    This works:

                    <p>&nbsp;&nbsp;&nbsp;</p>\s+<p>(?<content>.*?)</p>\s+<div class="central">
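As a sketch of how that pattern might be used from C#, the example below reads the named group `content` from each match. The `DescriptionExtractor` class, the `Extract` method, and the sample string are assumptions made for illustration; the sample assumes the file contains literal `&nbsp;` entities and that each description sits on a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DescriptionExtractor
{
    // The pattern from the post, unchanged. RegexOptions.Singleline is left
    // out on purpose: '.' then stops at a newline, so a match cannot start at
    // one spacer paragraph and run across lines into the next description.
    private static readonly Regex DescriptionPattern = new Regex(
        @"<p>&nbsp;&nbsp;&nbsp;</p>\s+<p>(?<content>.*?)</p>\s+<div class=""central"">");

    public static List<string> Extract(string html)
    {
        var descriptions = new List<string>();
        foreach (Match match in DescriptionPattern.Matches(html))
        {
            descriptions.Add(match.Groups["content"].Value);
        }
        return descriptions;
    }

    public static void Main()
    {
        // Hypothetical input modelled on the sample in the thread.
        string sample =
            "<p>&nbsp;&nbsp;&nbsp;</p>\n\t\t  <p>red delicious.</p>\n" +
            "\t\t  <div class=\"central\"><p>&nbsp;&nbsp;&nbsp;</p>\n" +
            "<FORM action=\"product.asp\" method=\"post\" >\n" +
            "</form><p>&nbsp;&nbsp;&nbsp;</p>\n" +
            "<p>&nbsp;&nbsp;&nbsp;</p>\n\t\t  <p>riped pear.</p>\n" +
            "\t\t  <div class=\"central\"><p>&nbsp;&nbsp;&nbsp;</p>\n";

        foreach (string description in Extract(sample))
        {
            Console.WriteLine(description);
        }
    }
}
```

If the descriptions never contain markup, replacing `.*?` with `[^<]*` would also keep a match from running past a tag when the input has no line breaks.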

                      Ravi Sant wrote (#12):

                      good :thumbsup:

                      ♫ 99 little bugs in the code, 99 bugs in the code We fix a bug, compile it again 101 little bugs in the code ♫

                        ahmed_elshiwy wrote (#13):

                        Try calling labelname.Refresh() after the line where you changed the Text property of the label.
