xml regex (for php)
-
I'd like to extract data from a Blogspot feed. I've only used regex with Rainmeter, so this is what I came up with:
(?siU)</id><published>(.*)</published><updated>.*</updated><title type='text'>(.*)</title>.*<link rel='alternate' type='text/html' href='(.*)'
I assume the "(?siU)" part is wrong. What would be the correct format?
I've also heard about PHP's xml_parser, but I think regex is faster. Still, how would I extract the same data as in the (broken) regex above with xml_parser in PHP?
Thanks in advance!
-
You should be a little clearer about exactly what you are trying to do with that question mark, but here are some things to keep in mind...
If you have well-formed XML, an XML parser is almost certainly the way to go. It might actually be faster than a regular expression. Unfortunately, I'm not familiar with PHP's XML parser, but you should take the time to familiarize yourself with it.
Also, the question mark means "the preceding item is optional". Since the question mark is after an opening paren, there is nothing preceding it, so I'm not exactly sure what you're after there. Depending on the regular expression engine you use, a question mark right after an opening paren can also introduce positive and negative lookaheads and lookbehinds, as well as named groups. Or if you put a backslash to the left of the question mark, you'll escape it so it matches a literal question mark. But I'm not really sure what you're trying to do here. For example, if you were trying to get the query string value out of a URL, you could use a named group to grab it:
http://www\.google\.com\?(?<QUERY_STRING>.*)
Notice I use the question mark twice: first as a literal question mark, and then as part of a named group. Here is another example:
http://www\.google\.com(?=\?)
That is a positive lookahead that ensures the character following the "m" is a question mark. It doesn't actually grab the question mark as part of the match; it only ensures that the URL matches if that question mark exists in the right location. And of course, there is this use of the question mark:
http://www\.google\.com\??
That means the last question mark is optional. And then there is one more use of question marks (lazy matching rather than greedy matching) that goes like this:
\<img\>.*?\</img\>
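I'll leave it up to you to figure out what that does if you are interested. One more thing: the less-than and greater-than signs only have a special meaning inside constructs such as the named group above. Elsewhere in PCRE (the engine behind PHP's preg_ functions) they are ordinary characters, so escaping them with a backslash is harmless but not required. In some other engines, such as GNU grep, "\<" and "\>" are word boundaries, so check your engine's documentation before escaping them.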
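To make the question-mark uses above concrete, here is a minimal PHP sketch that runs each pattern through preg_match. The sample URL and HTML strings are made up for illustration, and "~" is just one possible pattern delimiter. The last case shows one more use that is relevant to the question: in PCRE, "(?siU)" at the start of a pattern sets inline options (s = dot matches newlines, i = case-insensitive, U = quantifiers are lazy by default). In PHP the same options can also be written as modifiers after the closing delimiter, as in '~...~siU'.

```php
<?php
// Hypothetical sample inputs, chosen only to exercise each pattern.
$url  = 'http://www.google.com?q=regex';
$html = '<img>a</img><img>b</img>';

// 1. Escaped literal "?" followed by a named group.
preg_match('~http://www\.google\.com\?(?<QUERY_STRING>.*)~', $url, $m);
$query = $m['QUERY_STRING'];

// 2. Positive lookahead: a "?" must follow, but it is not part of the match.
preg_match('~http://www\.google\.com(?=\?)~', $url, $m);
$lookahead = $m[0];

// 3. Optional "?": the pattern matches with or without a trailing question mark.
preg_match('~http://www\.google\.com\??~', 'http://www.google.com?', $m);
$withMark = $m[0];
preg_match('~http://www\.google\.com\??~', 'http://www.google.com', $m);
$withoutMark = $m[0];

// 4. Lazy ".*?" stops at the first "</img>"; greedy ".*" runs to the last one.
preg_match('~<img>.*?</img>~', $html, $m);
$lazy = $m[0];
preg_match('~<img>.*</img>~', $html, $m);
$greedy = $m[0];

// 5. Inline options: "i" lets "<IMG>" match "<img>", "U" makes ".*" lazy.
preg_match('~(?siU)<IMG>(.*)</IMG>~', $html, $m);
$inline = $m[1];

print_r(compact('query', 'lookahead', 'withMark', 'withoutMark', 'lazy', 'greedy', 'inline'));
```

With these inputs, the lazy pattern returns only the first "<img>a</img>" while the greedy one returns the whole string, which is the difference the last pattern above is hinting at.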
-
Thanks for the detailed explanation. After more extensive searching, I found out how to use xml_parser for the Blogspot feed. It certainly seems easier than regex.
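For anyone landing here with the same question, this is a rough sketch of one way it might be done with PHP's xml_parser functions, not necessarily the approach used above. It pulls the same three values the regex was after (published date, title, and the rel='alternate' link) out of each entry. The sample feed is a made-up, trimmed-down stand-in for a real Blogger Atom feed, and extractEntries is a hypothetical helper name. It assumes the feed uses unprefixed Atom tags.

```php
<?php
// Hypothetical sample shaped like a Blogger Atom feed (not real data).
$sampleFeed = <<<'XML'
<feed xmlns='http://www.w3.org/2005/Atom'>
<title type='text'>Example blog</title>
<entry><id>tag:example,1</id><published>2012-01-01T10:00:00Z</published><updated>2012-01-02T10:00:00Z</updated><title type='text'>First post</title><link rel='self' type='application/atom+xml' href='http://example.blogspot.com/feeds/1'/><link rel='alternate' type='text/html' href='http://example.blogspot.com/2012/01/first-post.html'/></entry>
<entry><id>tag:example,2</id><published>2012-02-01T10:00:00Z</published><updated>2012-02-01T10:00:00Z</updated><title type='text'>Second post</title><link rel='alternate' type='text/html' href='http://example.blogspot.com/2012/02/second-post.html'/></entry>
</feed>
XML;

// Hypothetical helper: returns one array per <entry> with published, title, link.
function extractEntries(string $xml): array
{
    $parser = xml_parser_create();
    // Keep tag and attribute names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    // Skip nodes that contain only whitespace.
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);

    if (!xml_parse_into_struct($parser, $xml, $nodes)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    $entries = [];
    $current = null; // non-null only while we are inside an <entry>
    foreach ($nodes as $node) {
        $tag  = $node['tag'];
        $type = $node['type'];

        if ($tag === 'entry' && $type === 'open') {
            $current = ['published' => '', 'title' => '', 'link' => ''];
        } elseif ($tag === 'entry' && $type === 'close') {
            $entries[] = $current;
            $current = null;
        } elseif ($current !== null && $type === 'complete') {
            if ($tag === 'published' || $tag === 'title') {
                $current[$tag] = $node['value'] ?? '';
            } elseif ($tag === 'link' && ($node['attributes']['rel'] ?? '') === 'alternate') {
                $current['link'] = $node['attributes']['href'] ?? '';
            }
        }
    }
    return $entries;
}

$entries = extractEntries($sampleFeed);
print_r($entries);
```

The feed-level title is skipped because values are only collected while inside an entry, and the rel='self' link is ignored in favour of the rel='alternate' one, which mirrors what the original regex was trying to capture.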