Sometimes it's a bullet item, sometimes it is not, sometimes it is multiple bullet items
-
I am trying to write a regex to pull fields out of a set of web pages. The information in them varies: a page can have all or only some of the fields (I think I have identified all the possibilities). I think I can deal with this by including every potential field and returning data only when the field is present, as long as I can figure out how to make them absolute references. The other challenge is that some of these fields contain bullet lists with one or more items, which I don't know how to handle. An example is below. I am trying to extract the details associated with "Type of surveyor", "Works for", "Business type", "Surveying services", "Partners and directors", "Accreditations" and "Registered valuer". If anyone can help, that would be greatly appreciated.
Patterson Surveying
Patterson Surveying is an independent surveying firm run by Paul Patterson
Type of surveyor
* Chartered Valuation Surveyor
Works for
* Residential customers
* Commercial contracts
Business type
Private Practice
Surveying services
* Building surveying
* RICS Home Survey – Level 2
Partners and Directors
* Mr P M Patterson MRICS <
-
Basically, don't use a regex: HTML is notoriously difficult to process effectively if you treat it as text. There are so many different ways to mark up the same thing that it pretty much needs a browser engine to mostly render the page before it can be parsed reliably. Instead, I'd suggest you use an HTML parser (I use HTMLAgilityPack in C#, but your language may need a different one) and scrape the sites that way - it's a lot easier to work with, and a whole load easier to change when the site admin alters the format, which happens a lot as features are added, removed or modified, or bugs are fixed. Doing it with a regex means it might work for a week and then fail - and then the whole regex has to be rewritten, retested, fixed, and released.
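To give a rough idea of what the parser route looks like, here is a minimal sketch in Python using the standard library's `html.parser` rather than HTMLAgilityPack. The markup is hypothetical: it assumes each field label is an `<h3>` and its values are `<li>` or `<p>` elements, which may well not match the real pages. The `FieldParser` class and the `page` sample are made up for illustration.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collects the text found under each heading.

    Assumption: labels are <h3> elements and values are <li> or <p>
    elements. The real site's markup may differ.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # label of the field currently being filled
        self._tag = None      # tag whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "li", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "h3":
            # A heading starts a new field with an empty list of values.
            self._current = text
            self.fields[text] = []
        elif self._current is not None:
            # A bullet item or paragraph is appended to the current field.
            self.fields[self._current].append(text)


# Hypothetical markup, not taken from the real pages.
page = """
<h3>Works for</h3>
<ul><li>Residential customers</li><li>Commercial contracts</li></ul>
<h3>Business type</h3>
<p>Private Practice</p>
"""

parser = FieldParser()
parser.feed(page)
parsed = parser.fields
print(parsed)
```

A single value and a multi-item bullet list both end up as a list under their label, so the "one bullet, many bullets, or no bullets" cases need no special handling.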
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony "Common sense is so rare these days, it should be classified as a super power" - Random T-shirt AntiTwitter: @DalekDave is now a follower!
-
Thanks Original Griff. That is beyond my technical know-how at this point, but I am looking to learn. I am using this within Octoparse which, from what I have learnt to date, can only use regex to make the fields absolute / more accurate, so I think I am stuck with trying to make it work using regex. Unless anyone knows differently, can anyone help with the regex please?
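If regex is the only option, one possible approach is to capture each known label and then lazily take everything up to the next known label (or the end of the text), splitting the captured body on the bullet marker afterwards. The sketch below tests that idea in Python against the text from the question, with the truncated trailing `<` left out. It assumes the page text arrives flattened like that example, with `*` as the bullet marker. Whether Octoparse's regex tool accepts lookaheads and the case-insensitive / dot-matches-newline flags is also an assumption that would need checking. `extract_fields` is a made-up helper name.

```python
import re

# Field labels from the question. Matching is case-insensitive because the
# page shows "Partners and Directors" while the question writes "directors".
LABELS = [
    "Type of surveyor",
    "Works for",
    "Business type",
    "Surveying services",
    "Partners and directors",
    "Accreditations",
    "Registered valuer",
]


def extract_fields(text):
    """Return {lower-cased label: [values]} for every label present in text."""
    alternation = "|".join(re.escape(label) for label in LABELS)
    # Group 1: a label. Group 2: everything after it, taken lazily, up to
    # the next label or the end of the text (checked by the lookahead).
    pattern = re.compile(
        rf"({alternation})\s*(.*?)(?=\s*(?:{alternation})|\Z)",
        re.IGNORECASE | re.DOTALL,
    )
    fields = {}
    for match in pattern.finditer(text):
        label, body = match.group(1), match.group(2)
        # Split on the bullet marker; a plain value becomes a one-item list.
        items = [part.strip() for part in re.split(r"\s*\*\s*", body) if part.strip()]
        fields[label.lower()] = items
    return fields


sample = """Patterson Surveying
Patterson Surveying is an independent surveying firm run by Paul Patterson
Type of surveyor
* Chartered Valuation Surveyor
Works for
* Residential customers * Commercial contracts
Business type
Private Practice
Surveying services
* Building surveying * RICS Home Survey – Level 2
Partners and Directors
* Mr P M Patterson MRICS"""

fields = extract_fields(sample)
for label, values in fields.items():
    print(label, "->", values)
```

Fields missing from a page (here "Accreditations" and "Registered valuer") are simply absent from the result. A known weakness of this sketch is that a value which happens to contain one of the label phrases would be cut short, which is part of why a parser is the more robust route.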