Code Project > General Programming > C#
How to extract part of HTML source?

    #1 Member 569739:

    I have a website that I'm trying to extract some data from so that I can run some processing on it. They don't seem to have any APIs for this. So take this site: abc. If I view the source and look down to line 71 (at the time of writing), this is basically the piece I want:

                [{
                    "@context": "http://schema.org/",
                    "@type": "ItemList",
                    "itemlistElement":
                        [ [{
                                "@type": "ListItem",
                                "position" : 1,
                                "name" : "November 2020",
                                "item":  [{
                                    "@type": "Thing",
                                    "name" : "General Hospital 11/03/20",
                                    "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-11/03-general-hospital-110320"
                                },{
                                    "@type": "Thing",
                                    "name" : "General Hospital 11/02/20",
                                    "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-11/02-general-hospital-110220"
                                }]
                            }],[{
                                "@type": "ListItem",
                                "position" : 2,
                                "name" : "October 2020",
                                "item":  [{
                                    "@type": "Thing",
                                    "name" : "General Hospital 10/30/20",
                                    "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/30-general-hospital-103020"
                                },{
                                    "@type": "Thing",
                                    "name" : "General Hospital 10/29/20",
                                    "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/29-general-hospital-102920"
                                },{
                                    "@type": "Thing",
                                    "name" : "General Hospital 10/28/20",
                                    "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/28-general-hospital-102820"
                                },{
                                    "@t
    
      #2 OriginalGriff:

      Have a look at the HtmlAgilityPack: it makes scraping sites a whole load easier: Html Agility Pack. For example, you can extract all the links from a page with one line of code:

      // Load the page first, then walk every anchor that has an href:
      HtmlWeb web = new HtmlWeb();
      HtmlDocument doc = web.Load("https://www.example.com/");  // any page URL
      foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
      {
          string href = link.GetAttributeValue("href", "");
          // ...
      }

      It's pretty easy to use, and very powerful when you get used to it.

      "I have no idea what I did, but I'm taking full credit for it." - ThisOldTony "Common sense is so rare these days, it should be classified as a super power" - Random T-shirt AntiTwitter: @DalekDave is now a follower!


        #3 Member 569739:

        Thanks for the reply. I believe I actually tried just that, but the problem is that they aren't hrefs; they're JSON inside a script block, so I believe that returned nothing, unless I'm missing something. For example, this line from it:

        "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/26-general-hospital-102620"

        There is no href in it, so the query doesn't retrieve it.


          #4 Richard Andrew x64:

          It looks to me as if that library parses the HTML into a Document Object Model that you can navigate, extracting any component parts you need. Have a look at the documentation; I'm sure there's a way to extract the JSON, you just have to find it. Griff was only giving an example of how useful the library is.
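          For instance, a sketch of reaching that script block with the same library (not tested against the live ABC site; the page URL here is a placeholder): the same SelectNodes call that found anchors can select script elements, and InnerText then hands you the raw JSON.

```csharp
// Sketch, not tested against the live ABC site: HtmlAgilityPack can select
// <script> nodes just as easily as anchors. The URL below is a placeholder.
using System;
using HtmlAgilityPack;

class ScrapeJsonLd
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://www.example.com/episode-guide");

        // Grab every <script type="application/ld+json"> block in the page.
        var scripts = doc.DocumentNode.SelectNodes(
            "//script[@type='application/ld+json']");
        if (scripts == null) return;  // SelectNodes returns null when nothing matches

        foreach (HtmlNode script in scripts)
        {
            // InnerText is the raw JSON, ready to hand to a JSON parser.
            Console.WriteLine(script.InnerText.Trim());
        }
    }
}
```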

          The difficult we do right away... ...the impossible takes slightly longer.

            #5 DerekT P:

            Bear in mind that a large proportion of websites have T+Cs that explicitly prohibit doing that. I used to do a lot of small web-scraping jobs for a variety of clients, and the first task was always to scour the site's T+Cs to check it was permitted. When I was doing it a lot there weren't any third-party tools available, so it was just a case of pulling out the parts of the string you wanted. In your case I'd search for

            Member 569739 wrote:

            type="application/ld+json"

            (either using a simple ".indexOf()" call or using RegEx), check how many such script blocks there were, and continue likewise to pull out the bits I needed. Pretty basic string manipulation. In this case, though, since the whole of what you want is JSON, just load it up using Newtonsoft.Json and then access the objects directly:

            JObject myJsonObject = JObject.Parse(myJsonText);
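            Putting those two steps together on a cut-down stand-in for the snippet quoted earlier (the literal below is illustrative, not the site's real data); note the outer value of that snippet is an array, so JArray.Parse is the fit rather than JObject.Parse:

```csharp
// Sketch: walking the JSON-LD structure with Newtonsoft.Json.Linq.
// The literal is a trimmed stand-in for the block quoted in the question.
using System;
using Newtonsoft.Json.Linq;

class WalkItemList
{
    static void Main()
    {
        string json = @"[{
            ""@type"": ""ItemList"",
            ""itemlistElement"": [ [{
                ""@type"": ""ListItem"",
                ""name"": ""November 2020"",
                ""item"": [{ ""name"": ""General Hospital 11/03/20"",
                             ""url"": ""www.abc.com/shows/general-hospital/..."" }]
            }] ]
        }]";

        JArray root = JArray.Parse(json);                  // the outer [ ... ]
        var groups = (JArray)root[0]["itemlistElement"];   // one sub-array per month
        foreach (JToken group in groups)
            foreach (JToken episode in group[0]["item"])   // each group wraps one ListItem
                Console.WriteLine($"{episode["name"]} -> {episode["url"]}");
    }
}
```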

              #6 Richard Deeming:

              The application/ld+json block is meant to be scraped by other sites - at the very least by search engines: JSON-LD - JSON for Linking Data. I think it's unlikely that the site would include that block if they didn't want anyone to be able to use it. :)


              "These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer


                #7 DerekT P:

                "meant to be scraped" - yes. By whom? Always check the T+Cs. :-)


                  #8 Member 569739:

                  Thanks folks. Yeah, it gets more complicated: there are multiple of those same blocks, and right now the one I want is the third. That said, after hours of digging I did find APIs that return JSON for this, so I'm coding against those instead, as that will obviously be a lot cleaner and more reliable. I haven't run the code yet, but fingers crossed this works better. Not sure why ABC hides references to their APIs the way they do.
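                  If it helps anyone later, the shape of that is roughly the following; the endpoint URL is purely hypothetical, since I'm not posting the real one here - substitute whatever endpoint the site's own pages actually call.

```csharp
// Sketch: fetching a JSON API directly instead of scraping HTML.
// The endpoint below is purely hypothetical.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class ApiFetch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        string body = await http.GetStringAsync(
            "https://api.example.com/shows/general-hospital/episodes");  // hypothetical

        // JToken.Parse copes with either an object or an array at the root.
        JToken data = JToken.Parse(body);
        Console.WriteLine(data.ToString());
    }
}
```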
