regex help

    uglyeyes
    wrote on last edited by
    #1

    Hi, I need to extract text from a big file. The input looks like this:

    <div class="a">apple</div>
    <p>  </p>
    <p>red delicious</p>
    <div class="b">banana</div>
    <p>  </p>
    <p>riped banana</p>
    <div class="c">chives</div>
    <p>  </p>
    <p>fresh green chives</p>

    and I need to turn it into this:

    'apple', 'red delicious'
    'banana', 'riped banana'
    'chives', 'fresh green chives'

    so that I can enter each pair into a database. I would really appreciate it if you could provide a regex that does this. Thanks for your help! Please note that the text is a concatenation of multiple HTML pages.

      Luc Pattyn
      wrote on last edited by
      #2

      Repost, and no PRE tags once again ==> no help.


      I only read code that is properly formatted, adding PRE tags is the easiest way to obtain that.
      [The QA section does it automatically now, I hope we soon get it on regular forums as well]


        vivasaayi
        wrote on last edited by
        #3

        The string you posted is almost well-formed XML; only a root element is missing. If performance is not an issue, first add a root element, and then you can use XmlDocument or XmlReader to extract the information you need.
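A minimal sketch of the root-element idea, assuming the fragment really is well-formed once wrapped. It uses LINQ to XML (`XElement.Parse`) instead of `XmlDocument`; the `root` wrapper name and the `FruitExtractor` class are made up for illustration and do not come from the thread. Be aware that `XElement.Parse` throws an `XmlException` on input that is not well-formed XML, for example unclosed tags or an undeclared entity such as `&nbsp;`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FruitExtractor
{
    // Wraps the fragment in a hypothetical <root> element so it parses as XML,
    // then returns the trimmed text of every child element that has non-blank text.
    public static List<string> ExtractTexts(string fragment)
    {
        XElement root = XElement.Parse("<root>" + fragment + "</root>");
        return root.Elements()
                   .Select(e => e.Value.Trim())
                   .Where(text => text.Length > 0)
                   .ToList();
    }
}
```

Called on the sample from the first post, `ExtractTexts` would yield the six strings in document order, so consecutive items can be paired as name and description.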

          uglyeyes
          wrote on last edited by
          #4

          No, only the text I provided is valid XML. The entire CSV file contains text like

          ----1---- my contents.. ----2---- my content2.. ... ...

          so could you please help me extract the data I want? I have this regex, which seems to work for getting the text between the div tags:

          (?<=\<div class=""middlead""\>).*?(?=\</div\>)

          But I need to get the description of apple too. I tried the code below:

          using System.IO;
          using System.Text.RegularExpressions;

          string fName = @"data.txt"; // path to the text file
          string allRead;
          using (StreamReader testTxt = new StreamReader(fName))
          {
              allRead = testTxt.ReadToEnd(); // read the whole file; the reader is closed on exit
          }

          // Regex rx = new Regex(@"(?<=\<div class=""middlead""\>).*?(?=\</div\>)", RegexOptions.Singleline);
          Regex rx1 = new Regex(@"(?<=\<p\>&nbsp;&nbsp;&nbsp;\</p\>).*?(?=\</p\>)", RegexOptions.Singleline);

          // MatchCollection matches = rx.Matches(allRead);
          MatchCollection matches1 = rx1.Matches(allRead);

          using (StreamWriter sw = new StreamWriter(@"realdata.txt"))
          {
              int count = 0;
              foreach (Match match in matches1)
              {
                  // Write the match index, then the matched text, once per match.
                  sw.WriteLine(count.ToString());
                  sw.WriteLine(match.ToString());
                  count++;
              }
          }

          But somehow regex rx1 does not return only the text I want; it matches greedily and tries to match everything that has

          Could you please help me extract the descriptions of those products?

            realJSOP
            wrote on last edited by
            #5

            Once again, HTML IS XML. Just use LINQ to XML to parse it - using regex is a WASTE OF TIME. It's easy - really. All you have to do is man up and do some frakking research. It's just a few lines of code.


              uglyeyes
              wrote on last edited by
              #6

              Please note that the description content doesn't have an ID, so I can't really use the DOM. Any suggestions on how to get the text inside an element with no ID?

                basantakumar
                wrote on last edited by
                #7

                Hi, please use the regex pattern below to get your result:

                .*?>([a-zA-Z0-9].*?)<

                Capture group 1 of each match will return, in order:

                apple
                red delicious
                banana
                riped banana
                chives
                fresh green chives

                Please let me know if you have any questions.
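As a rough illustration of how the pattern above could be applied in C#, here is a sketch using `Regex.Matches` and reading capture group 1 from each match. The `TagTextMatcher` class and `Extract` method are names invented for this example. Note the assumption baked into the pattern: it captures any text that directly follows a `>` and starts with a letter or digit, so on real pages it may also pick up text that is not a product name or description.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagTextMatcher
{
    // Pattern from the post: skip lazily to a '>' that is immediately followed
    // by a letter or digit, then capture lazily up to the next '<'.
    private static readonly Regex Pattern = new Regex(@".*?>([a-zA-Z0-9].*?)<");

    // Returns the contents of capture group 1 for every match, in order.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Because the results come back as a flat list, pairing each name with its description is left to the caller.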

                  uglyeyes
                  wrote on last edited by
                  #8

                  Hi, thanks, but the text I provided was just an example. The real text looks something like the sample below, and I need a regex for it:

                  <p>   </p>
                  <p>red delicious.</p>
                  <div class="central"><p>   </p>
                  <FORM action="product.asp" method="post" >
                  <INPUT type="hidden" name="ss" value="3">
                  <INPUT type="hidden" name="xx" value="xx">
                  <INPUT type="hidden" name="yy" value="yy">
                  <input type="image" src="./main/apple.png" value="Click here" />
                  </form><p>   </p>
                  <p>   </p>
                  <p>riped pear.</p>
                  <div class="central"><p>   </p>
                  <FORM action="product.asp" method="post" >
                  <INPUT type="hidden" name="ss" value="3">
                  <INPUT type="hidden" name="xx" value="xx">
                  <INPUT type="hidden" name="yy" value="yy">
                  <input type="image" src="./main/pear.png" value="Click here" />
                  </form><p>   </p>

                  This one is not working for me, as it gets more text than I need:

                  <pre> (?<=\s\s\<p\>&nbsp;&nbsp;&nbsp;\</p\>\n\t\t\s\s\<p\>).*?(?=\\</p\>\n\t\t\s\s\<div class=""central""\> </pre>

                  modified on Wednesday, January 27, 2010 6:14 PM

                    uglyeyes
                    wrote on last edited by
                    #9

                    I am not sure why the regex below fails in the Visual Studio editor:

                    (?<=\<p\>   \</p\>\n\t+:b+).*(\n\t+:b+\<div class="central"\>)

                    Please note that \t+ and :b+ are there because there are exactly 2 tabs and 2 spaces in front of the text I want to match. If I only use <p\>   \</p\>\n\t+:b+ it highlights the text preceding the text I want. I am not sure why selecting the text between the two groups does not work in Visual Studio. I am running out of ideas, please help.

                      uglyeyes
                      wrote on last edited by
                      #10

                      I tested with RegexBuddy: for the text "apple", my regex

                      (?<=a).*?(?=e)

                      returns "ppl". Now I want to get the text in between using the regex below:

                      (?<=\<p\>   \</p\>).*?(?=\<div class="central"\>)

                      <p>   </p>
                      <p>red delicious.</p>
                      <div class="central"><p>   </p>
                      <FORM action="x.asp" method="post" >
                      <INPUT type="hidden" name="oradvertiser" value="3">
                      <INPUT type="hidden" name="xx" value="test">
                      <INPUT type="hidden" name="xy" value="test">
                      <input type="image" src="./main/apple.png" value="Click here" onmouseout="this.style.border='5px solid silver';" />
                      </form><p>   </p>
                      <p>   </p>
                      <p>riped pear.</p>
                      <div class="central"><p>   </p>
                      <FORM action="x.asp" method="post" >
                      <INPUT type="hidden" name="or" value="3">
                      <INPUT type="hidden" name="xx" value="test">
                      <INPUT type="hidden" name="xy" value="test">
                      <input type="image" src="./main/pear.png" value="Click here" onmouseout="this.style.border='5px solid silver';" />
                      </form><p>   </p>
                      <p>dummy text</p>

                      But it's not giving me "red delicious" and "riped pear". Could you please help?
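The lookbehind/lookahead test on "apple" can be reproduced in C# with the small sketch below. It also shows one possible reason the longer pattern finds nothing, assuming the real text has line breaks between the tags: in .NET, `.` does not match a newline unless `RegexOptions.Singleline` is set. The `LookaroundDemo` class and `TryMatch` helper are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Runs the pattern once and reports whether it matched.
    // On failure, value is the empty string.
    public static bool TryMatch(string input, string pattern, RegexOptions options, out string value)
    {
        Match m = Regex.Match(input, pattern, options);
        value = m.Value;
        return m.Success;
    }
}
```

With the pattern `(?<=a).*?(?=e)`, the input `"apple"` matches `"ppl"`. The input `"a\npp\ne"` does not match with `RegexOptions.None`, but does match (newlines included) with `RegexOptions.Singleline`.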

                        uglyeyes
                        wrote on last edited by
                        #11

                        This works:

                        <p>&nbsp;&nbsp;&nbsp;</p>\s+<p>(?<content>.*?)</p>\s+<div class="central">
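A sketch of how the working pattern might be used from C#, reading the named group `content` from each match. It assumes the file contains the literal entity text `&nbsp;` and that each description sits on a single line; `DescriptionExtractor` and `Extract` are illustrative names, not code from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DescriptionExtractor
{
    // Pattern from the post: an &nbsp; spacer paragraph, whitespace, the
    // description paragraph (named group "content"), whitespace, then the
    // <div class="central"> that follows each description.
    private static readonly Regex Pattern = new Regex(
        @"<p>&nbsp;&nbsp;&nbsp;</p>\s+<p>(?<content>.*?)</p>\s+<div class=""central"">");

    // Returns the text captured by the "content" group for every match, in order.
    public static List<string> Extract(string html)
    {
        var descriptions = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            descriptions.Add(m.Groups["content"].Value);
        }
        return descriptions;
    }
}
```

The pattern is compiled without `RegexOptions.Singleline` on purpose: `.*?` then cannot run across a line break, which keeps the capture from swallowing a neighbouring spacer paragraph, while `\s+` still absorbs the newlines, tabs, and spaces between the tags.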

                          Ravi Sant
                          wrote on last edited by
                          #12

                          good :thumbsup:


                            ahmed_elshiwy
                            wrote on last edited by
                            #13

                            Try calling labelname.Refresh() after the line where you changed the Text property of the label.
