Sometimes it's a bullet item, sometimes it is not, sometimes it is multiple bullet items

    Member_15883893 wrote (#1):

    I am trying to write a regex to pull fields out of a set of web pages. The pages vary: each one can have all or only some of the fields (I think I have identified all the possibilities). I think I can deal with that by including every potential field and returning data only when the field is present, as long as I can work out how to make them absolute references. The other challenge is that a field sometimes holds a bullet list of one or more items, and I don't know how to handle that. An example is below. I am trying to extract the details associated with "Type of surveyor", "Works for", "Business type", "Surveying services", "Partners and directors", "Accreditations" and "Registered valuer". If anyone can help, that would be greatly appreciated.

    Patterson Surveying

    Patterson Surveying is an independent surveying firm run by Paul Patterson

    Type of surveyor

    *   Chartered Valuation Surveyor
    

    Works for

    *   Residential customers
    
    *   Commercial contracts
    

    Business type

    Private Practice 
    

    Surveying services

    *   Building surveying
    
    *   RICS Home Survey – Level 2
    

    Partners and Directors

    *   Mr P M Patterson MRICS
    
    
    
    
      OriginalGriff wrote (#2):

      Basically, don't use a regex: HTML is notoriously difficult to process effectively if you treat it as text, because it offers so many different ways to do the same thing; it pretty much needs a browser engine to mostly render the page before it can be parsed reliably. Instead, I'd suggest you use an HTML parser (I use HtmlAgilityPack in C#, but your language may need a different one) and scrape the sites that way. It's a lot easier to work with, and a whole load easier to change when the site admin alters the format, which happens a lot as features are added, removed, or modified, or bugs are fixed. Doing it with a regex means it might work for a week and then fail, and then the whole regex has to be rewritten, retested, fixed, and released.

      "I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
      "Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
      AntiTwitter: @DalekDave is now a follower!
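To make the parser suggestion above concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not the HtmlAgilityPack approach itself. The markup is an assumption: the thread never shows the pages' real HTML, so the sketch pretends each field label sits in an `<h3>` and its values follow as `<li>` or `<p>` elements. The names `FieldParser` and `parse_fields` are made up for illustration.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Group <li>/<p> text under the most recent <h3> label.

    ASSUMPTION: labels are <h3> elements and values are <li> or <p>
    elements. The real pages may use different tags or classes.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}    # label -> list of values, in page order
        self._tag = None    # tag whose text is currently being collected
        self._label = None  # most recent label seen
        self._buf = []      # text chunks for the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "li", "p"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse runs of whitespace and trim the ends.
        text = " ".join("".join(self._buf).split())
        self._tag = None
        if not text:
            return
        if tag == "h3":
            self._label = text
            self.fields[text] = []
        elif self._label is not None:
            # Text before the first label (e.g. the intro) is skipped.
            self.fields[self._label].append(text)


def parse_fields(html):
    """Return {label: [values]} for one page of the assumed markup."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

A single-value field and a bullet list both come back as a list, so the caller does not need to know in advance which form a page uses. A field that is absent from a page is simply absent from the result.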

        Member_15883893 wrote (#3):

        Thanks OriginalGriff. That is beyond my technical know-how at this point, but I am looking to learn. I am using this within Octoparse which, from what I have learnt to date, can only use regex to make the fields absolute / more accurate. So I think I am stuck with trying to make it work using regex, unless anyone knows differently or can help with the regex, please?
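If regex really is the only option, one possible shape for it is sketched below in Python. It assumes the page text arrives as plain text laid out like the sample in the first post, with each label alone on its own line. The pattern captures everything between one known label and the next (or the end of the text), then splits that block into items, so a plain value and a bullet list of any length are handled the same way. The label list comes from the question; the names `LABELS`, `FIELD_RE` and `extract_fields` are made up for illustration. Whether Octoparse's regex tool accepts lookahead and multi-line matching has not been checked here.

```python
import re

# Labels listed in the question. Matching is case-insensitive because the
# sample page writes "Partners and Directors".
LABELS = [
    "Type of surveyor",
    "Works for",
    "Business type",
    "Surveying services",
    "Partners and directors",
    "Accreditations",
    "Registered valuer",
]

_ALT = "|".join(re.escape(label) for label in LABELS)

# A label alone on a line, then (lazily) everything up to the next label
# line or the end of the text.
FIELD_RE = re.compile(
    rf"^[ \t]*(?P<label>{_ALT})[ \t]*$\n?"
    rf"(?P<body>.*?)"
    rf"(?=^[ \t]*(?:{_ALT})[ \t]*$|\Z)",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

# Leading bullet marker ("*", "•" or "-") followed by whitespace.
_BULLET_RE = re.compile(r"^[*•-]\s+")


def extract_fields(text):
    """Return {lower-cased label: [items]} for the labels present in text."""
    fields = {}
    for match in FIELD_RE.finditer(text):
        items = []
        for line in match.group("body").splitlines():
            item = _BULLET_RE.sub("", line.strip()).strip()
            if item:  # skip blank and whitespace-only lines
                items.append(item)
        fields[match.group("label").lower()] = items
    return fields


SAMPLE = """Patterson Surveying

Works for

*   Residential customers

*   Commercial contracts

Business type

Private Practice
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Labels that do not appear on a page are simply missing from the result, which covers the "all or some of the fields" case without a separate pattern per combination.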
