regex help
-
Hi, I need to strip text out of a big file. I need to get from the text below:

<div class="a">apple</div> <p> </p> <p>red delicious</p>
<div class="b">banana</div> <p> </p> <p>riped banana</p>
<div class="c">chives</div> <p> </p> <p>fresh green chives</p>

to the values below:

'apple', 'red delicious'
'banana', 'riped banana'
'chives', 'fresh green chives'

so that I can enter each of them into a database. I would really appreciate it if you could provide me with a regex that does this. Thanks for your help! Please note that the content of the text is a concatenation of multiple HTML pages.
-
The string you presented is valid XML apart from the missing root element. If performance is not an issue, first add a root element, and then you can use XmlDocument or XmlReader to extract the information you need.
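The root-element suggestion can be sketched roughly as follows. This is only an illustration: the class name FragmentParser, the method name ExtractPairs, and the wrapper element name "root" are made up here, and the sketch assumes the input really is well-formed once wrapped. Real-world HTML often is not, in which case LoadXml throws an XmlException.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class FragmentParser
{
    // Wraps the fragment in an invented <root> element so that it becomes a
    // well-formed document, then pairs each <div> with the next non-blank <p>.
    public static List<KeyValuePair<string, string>> ExtractPairs(string fragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + fragment + "</root>");

        var pairs = new List<KeyValuePair<string, string>>();
        string currentName = null;

        foreach (XmlNode node in doc.DocumentElement.ChildNodes)
        {
            if (node.NodeType != XmlNodeType.Element) continue;

            string text = node.InnerText.Trim();
            if (node.Name == "div")
            {
                // Remember the product name until its description shows up.
                currentName = text;
            }
            else if (node.Name == "p" && text.Length > 0 && currentName != null)
            {
                // Blank <p> </p> spacers are skipped by the Length check.
                pairs.Add(new KeyValuePair<string, string>(currentName, text));
                currentName = null;
            }
        }
        return pairs;
    }
}
```

Calling FragmentParser.ExtractPairs on the three-product sample from the question should give one name/description pair per product.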
-
No, only the text I provided is valid XML. The content of the entire CSV file has text like ----1---- my contents.. ----2---- my content2.. ... ... so could you please help me extract the data I want? I have this regex that seems to work for getting the text between the div tags:
(?<=\<div class=""middlead""\>).*?(?=\</div\>)
but I need to get the description of apple too. I tried the code below:
string fName = @"data.txt"; // path to the text file
StreamReader testTxt = new StreamReader(fName);
string allRead = testTxt.ReadToEnd(); // reads the whole text file to the end
testTxt.Close(); // closes the text file after it is fully read
//Regex rx = new Regex(@"(?<=\<div class=""middlead""\>).*?(?=\</div\>)", RegexOptions.Singleline);
Regex rx1 = new Regex(@"(?<=\<p\> \</p\>).*?(?=\</p\>)", RegexOptions.Singleline);
//MatchCollection matches = rx.Matches(allRead);
MatchCollection matches1 = rx1.Matches(allRead);
StreamWriter sw = new StreamWriter(@"realdata.txt");
int count = 0;
foreach (Match match in matches1)
{
    sw.WriteLine(count.ToString());
    sw.WriteLine(match.ToString());
    foreach (Match match1 in matches1)
    {
        sw.WriteLine(match1.ToString());
    }
    count++;
}
sw.Close();
but somehow regex rx1 is not only giving me the text that I want; it is doing greedy matching and tries to match everything that has
Could you please help me work out how to extract the descriptions of those products?
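One possible way around the lookbehind problem, sketched under assumptions: a lookbehind such as (?<=<p> </p>) starts the match immediately after the blank paragraph, so the opening <p> of the description ends up inside the match. Capturing groups avoid that, and [^<]* cannot run across a tag the way .*? with Singleline can. The class name ProductRegex and method name ExtractPairs are invented for this sketch, and the pattern assumes the div / blank p / description p layout of the sample in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ProductRegex
{
    // Group 1: text inside the <div>; group 2: text inside the <p> that
    // follows the blank <p> </p> spacer. [^<]* stops at the next tag.
    private static readonly Regex PairPattern = new Regex(
        @"<div class=""[^""]*"">([^<]*)</div>\s*<p>\s*</p>\s*<p>([^<]*)</p>",
        RegexOptions.IgnoreCase);

    public static List<string[]> ExtractPairs(string text)
    {
        var result = new List<string[]>();
        foreach (Match m in PairPattern.Matches(text))
        {
            // One entry per product: { name, description }.
            result.Add(new[] { m.Groups[1].Value.Trim(), m.Groups[2].Value.Trim() });
        }
        return result;
    }
}
```

Each returned entry holds the name and its description together, so a single loop over the result is enough when writing to the output file; the nested foreach over the same match collection in the code above writes every match again for each match.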
-
Once again, html IS xml. Just use Linq-To-XML to parse it - using regex is a WASTE OF TIME. It's easy - really. All you have to do is man-up and do some frakking research. It's just a few lines of code.
.45 ACP - because shooting twice is just silly
-----
"Why don't you tie a kerosene-soaked rag around your ankles so the ants won't climb up and eat your candy ass..." - Dale Earnhardt, 1997
-----
"The staggering layers of obscenity in your statement make it a work of art on so many levels." - J. Jystad, 2001
-
Hi, please use the regex pattern below to get your result:
.*?>([a-zA-Z0-9].*?)<
It will return the collection: apple, red delicious, banana, riped banana, chives, fresh green chives. Please let me know if you have any doubts.
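A small sketch for trying the suggested pattern against the sample from the question. The class name SuggestedPattern and method name ExtractText are made up for illustration; only Regex.Matches and the pattern itself come from the discussion. The text of interest is in capture group 1, not in the whole match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SuggestedPattern
{
    // Collects group 1 of every match: the text that starts with a letter
    // or digit right after a '>' and runs up to the next '<'.
    public static List<string> ExtractText(string html)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, @".*?>([a-zA-Z0-9].*?)<"))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }
}
```

Note that this returns a flat list of every text run, so names and descriptions are not paired, and any other tag content that starts with a letter or digit would be picked up as well.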
-
Hi, thanks, but the text I provided was just an example. The real text looks something like the sample below, and I need a regex for this one:

<p> </p>
<p>red delicious.</p>
<div class="central"><p> </p>
<FORM action="product.asp" method="post" >
<INPUT type="hidden" name="ss" value="3">
<INPUT type="hidden" name="xx" value="xx">
<INPUT type="hidden" name="yy" value="yy">
<input type="image" src="./main/apple.png" value="Click here" />
</form><p> </p>
<p> </p>
<p>riped pear.</p>
<div class="central"><p> </p>
<FORM action="product.asp" method="post" >
<INPUT type="hidden" name="ss" value="3">
<INPUT type="hidden" name="xx" value="xx">
<INPUT type="hidden" name="yy" value="yy">
<input type="image" src="./main/pear.png" value="Click here" />
</form><p> </p>

This one is not working for me, as it is getting more text than I need:
<pre> (?<=\s\s\<p\> \</p\>\n\t\t\s\s\<p\>).*?(?=\</p\>\n\t\t\s\s\<div class=""central""\>) </pre>
modified on Wednesday, January 27, 2010 6:14 PM
-
Not sure why my regex below fails in the Visual Studio editor:
(?<=\<p\> \</p\>\n\t+:b+).*(\n\t+:b+\<div class="central"\>)
Please note that \t+ and :b+ are there because there are exactly 2 tabs and 2 spaces before the matching text. If I only use <p\> \</p\>\n\t+:b+ it highlights the text preceding the text I want to match. I am not sure why selecting the text between the groups is not working in Visual Studio. I am running out of ideas, please help.
-
I tested in RegexBuddy: for the text "apple", my regex (?<=a).*?(?=e) returns "ppl". Now I want to get the text in between using the regex below:
(?<=\<p\> \</p\>).*?(?=\<div class="central"\>)

<p> </p>
<p>red delicious.</p>
<div class="central"><p> </p>
<FORM action="x.asp" method="post" >
<INPUT type="hidden" name="oradvertiser" value="3">
<INPUT type="hidden" name="xx" value="test">
<INPUT type="hidden" name="xy" value="test">
<input type="image" src="./main/apple.png" value="Click here" onmouseout="this.style.border='5px solid silver';" />
</form><p> </p>
<p> </p>
<p>riped pear.</p>
<div class="central"><p> </p>
<FORM action="x.asp" method="post" >
<INPUT type="hidden" name="or" value="3">
<INPUT type="hidden" name="xx" value="test">
<INPUT type="hidden" name="xy" value="test">
<input type="image" src="./main/pear.png" value="Click here" onmouseout="this.style.border='5px solid silver';" />
</form><p> </p>
<p>dummy text</p>

but it is not giving me "red delicious" and "riped pear". Could you please help?
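A possible approach for the real text, offered as a sketch only. The lookarounds in the regex above leave the <p> and </p> tags around the description inside the match, and "." does not cross line breaks unless RegexOptions.Singleline is set. The sketch below instead matches the tags explicitly, allows any whitespace between them with \s*, and captures just the description. The class name DescriptionRegex and method name ExtractDescriptions are invented here, and the pattern assumes that every description sits between a blank <p> </p> and a <div class="central">, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DescriptionRegex
{
    // Blank paragraph, then the description paragraph (group 1), then the
    // <div class="central"> that follows it. \s* absorbs tabs, spaces and
    // newlines between the elements, so the exact indentation does not matter.
    private static readonly Regex Pattern = new Regex(
        @"<p>\s*</p>\s*<p>([^<]+)</p>\s*<div class=""central"">",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractDescriptions(string text)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            result.Add(m.Groups[1].Value.Trim());
        }
        return result;
    }
}
```

Because the trailing <div class="central"> is part of the pattern, a paragraph such as "dummy text" that is not followed by that div is left out.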
-
Try calling labelname.Refresh() after the line where you changed the Text property of the label.