How to extract part of HTML source?

Member 569739

I have a website that I am trying to extract some data from so that I can run some processing on it. They do not seem to have any APIs for this. Take this site: abc. If I view the source and look down to line 71 (at this time), this is basically the piece I want:

            [{
                "@context": "http://schema.org/",
                "@type": "ItemList" ,
                "itemlistElement":
                    [ [{
                            "@type": "ListItem",
                            "position" : 1,
                            "name" : "November 2020",
                            "item":  [{
                                "@type": "Thing",
                                "name" : "General Hospital 11/03/20",
                                "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-11/03-general-hospital-110320"
                            },{
                                "@type": "Thing",
                                "name" : "General Hospital 11/02/20",
                                "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-11/02-general-hospital-110220"
                            }]
                        }],[{
                            "@type": "ListItem",
                            "position" : 2,
                            "name" : "October 2020",
                            "item":  [{
                                "@type": "Thing",
                                "name" : "General Hospital 10/30/20",
                                "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/30-general-hospital-103020"
                            },{
                                "@type": "Thing",
                                "name" : "General Hospital 10/29/20",
                                "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/29-general-hospital-102920"
                            },{
                                "@type": "Thing",
                                "name" : "General Hospital 10/28/20",
                                "url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/28-general-hospital-102820"
                            },{
                                "@t

OriginalGriff

Have a look at the HtmlAgilityPack (Html Agility Pack): it makes scraping sites a whole load easier. For example, you can extract all the links from a page with just a few lines of code:

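// doc is an HtmlAgilityPack.HtmlDocument that has already been loaded.
// Note: SelectNodes returns null (not an empty collection) when nothing matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    ...
}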

It's pretty easy to use, and very powerful when you get used to it.

Member 569739

Thanks for the reply. I believe I actually tried just that, but the problem is that they aren't hrefs; they are JSON inside a script block, so I believe it returned nothing, unless I'm missing something. For example, this is one of them:

"url" : "www.abc.com/shows/general-hospital/episode-guide/2020-10/26-general-hospital-102620"

There is no href in it so it doesn't retrieve it.

Richard Andrew x64

It looks to me as if that library parses the HTML into a Document Object Model that you can navigate and extract any component parts from. Have a look at the documentation. I'm sure there's a way to extract the JSON; you just have to find it. Griff was only giving an example of how useful the library is.
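As a rough illustration of that idea, the sketch below uses an XPath query on the script elements instead of on the links. It assumes the data sits in script tags marked type="application/ld+json" (as mentioned elsewhere in this thread); the class and method names are made up for the example, and it has not been run against the actual site.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class LdJsonScraper
{
    // Hypothetical helper: returns the raw text of every
    // <script type="application/ld+json"> element in the given HTML.
    public static List<string> ExtractLdJsonBlocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection scripts =
            doc.DocumentNode.SelectNodes("//script[@type='application/ld+json']");

        var blocks = new List<string>();
        if (scripts == null)
        {
            return blocks;
        }

        foreach (HtmlNode script in scripts)
        {
            // The text inside the script element is the JSON itself.
            blocks.Add(script.InnerText.Trim());
        }
        return blocks;
    }
}
```

If you would rather let the library fetch the page as well, new HtmlWeb().Load(url) returns an HtmlDocument you can query in the same way.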

DerekT P

Bear in mind that a large proportion of websites have T+Cs that explicitly prohibit doing that. I used to do a lot of small web-scraping jobs for a variety of clients; the first task was always to scour the site's T+Cs to check it was permitted. When I was doing it a lot there weren't any 3rd-party tools available, so it was just a case of pulling out the parts of the string you want. In your case I'd search for

type="application/ld+json"

(either using a simple IndexOf() call or a regex), check how many such script blocks there are, and continue likewise to pull out the bits I need. Pretty basic string manipulation. In this case, though, since the whole of what you want is JSON, just load it up using Newtonsoft.Json and then access the objects directly:

JObject myJsonObject = JObject.Parse(myJsonText);
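A minimal sketch of that string-manipulation approach follows, assuming each block looks like an opening script tag carrying type="application/ld+json", then the JSON, then a closing script tag. The class and method names are hypothetical. One caveat: JObject.Parse throws when the root of the text is an array, and the excerpt posted at the top of the thread starts with [{, so the sketch uses JToken.Parse, which accepts either an object or an array.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

static class LdJsonByString
{
    const string Marker = "type=\"application/ld+json\"";

    // Hypothetical helper: finds every ld+json script block with IndexOf
    // and parses its contents.
    public static List<JToken> Extract(string html)
    {
        var results = new List<JToken>();
        int pos = 0;

        while (true)
        {
            int marker = html.IndexOf(Marker, pos, StringComparison.OrdinalIgnoreCase);
            if (marker < 0) break;

            // End of the opening <script ...> tag.
            int start = html.IndexOf('>', marker);
            if (start < 0) break;

            int end = html.IndexOf("</script>", start, StringComparison.OrdinalIgnoreCase);
            if (end < 0) break;

            string json = html.Substring(start + 1, end - start - 1);

            // JToken.Parse handles both object and array roots;
            // it throws a JsonReaderException if the text is not valid JSON.
            results.Add(JToken.Parse(json));

            pos = end;
        }
        return results;
    }
}
```

results.Count then tells you how many such blocks the page contains.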

Richard Deeming

The application/ld+json block is meant to be scraped by other sites, at the very least by search engines; see JSON-LD - JSON for Linking Data. I think it's unlikely that the site would include that block if they didn't want anyone to be able to use it. :)

DerekT P

"meant to be scraped" - yes. By whom? Always check the T+Cs. :-)

Member 569739

Thanks folks. Yeah, it gets more complicated because there are several of those same blocks on the page, and right now the one I want is the third one. That said, after hours of digging, I did find APIs that return JSON for this, so I'm coding around those instead, as that is obviously going to be a lot cleaner and more reliable. I haven't run the code yet, but fingers crossed this works better. Not sure why ABC hides references to their APIs like they do.
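For anyone who still needs the scraped blocks: rather than relying on the wanted block being the third one, it may be more robust to pick it by content. The sketch below assumes the blocks have already been parsed into Newtonsoft.Json JToken values (however they were obtained) and that the wanted one contains an object with "@type": "ItemList" whose episodes are objects with "@type": "Thing", as in the excerpt at the top of the thread. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

static class EpisodeList
{
    // All JSON objects at or below root whose "@type" equals the given type.
    static IEnumerable<JObject> ObjectsOfType(JContainer root, string type) =>
        root.DescendantsAndSelf()
            .OfType<JObject>()
            .Where(o => o["@type"]?.ToString() == type);

    // Yields (name, url) for every "Thing" in the first block that
    // contains an "ItemList", wherever that block sits on the page.
    public static IEnumerable<(string Name, string Url)> Episodes(IEnumerable<JToken> blocks)
    {
        JContainer list = blocks
            .OfType<JContainer>()
            .FirstOrDefault(b => ObjectsOfType(b, "ItemList").Any());

        if (list == null)
        {
            yield break;
        }

        foreach (JObject thing in ObjectsOfType(list, "Thing"))
        {
            // Missing properties come back as null rather than throwing.
            yield return (thing["name"]?.ToString(), thing["url"]?.ToString());
        }
    }
}
```

Selecting by "@type" means the code keeps working if the site adds or reorders script blocks, although it would still break if the structure of the list itself changes.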