[C#.NET 2008] Screen-scraping a HTML Page

    Dimitri Backaert
    wrote on last edited by
    #1

    Hi, I'm trying to process the HTML source of a web page that is protected by a username and a password. I'm using the following method to try to access the page:

    using System;
    using System.IO;
    using System.Net;

    public static string GetHtmlPageSource(string url, string username, string password)
    {
        using (WebClient wc = new WebClient())
        {
            // These credentials are only sent in reply to an HTTP authentication challenge (401).
            wc.Credentials = new NetworkCredential(username, password);

            try
            {
                using (Stream stream = wc.OpenRead(new Uri(url)))
                using (StreamReader reader = new StreamReader(stream))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (WebException e)
            {
                // Error handling
                return e.ToString();
            }
        }
    }

    This doesn't work, however: I seem to be stuck at the logon page and can't get past the authentication. Does anyone have an idea?

      Henry Minute
      wrote on last edited by
      #2

      Do you not think that the site owners put the security there for a reason?

      Henry Minute Do not read medical books! You could die of a misprint. - Mark Twain Girl: (staring) "Why do you need an icy cucumber?" “I want to report a fraud. The government is lying to us all.”

        Dimitri Backaert
        wrote on last edited by
        #3

        Yes I do, it's to prevent unauthorised access to the page... :laugh: However, this is a web page on our company intranet. It displays certain measurements that we want to show in a dashboard application. The idea is to have an overview of these values at a glance (the dashboard will be projected on a whiteboard). Normally we would use a read-only account on the database for this, but since the database is managed by the Netherlands and I am a Belgian employee, the Netherlands are refusing to give us one. So this is the only way we can build the dashboard.

          Lost User
          wrote on last edited by
          #4

          Dimitri Backaert wrote:

          Normally, we would use a read account on the database for this, but (since the database is being managed by the Netherlands and I am a Belgian employee) the Netherlands are refusing to give us a read account.

          :confused: Does the chairman of the company think this is a good thing?

            Henry Minute
            wrote on last edited by
            #5

            Doncha just luuuv office politics? :laugh: :laugh:


              Lost User
              wrote on last edited by
              #6

              Henry Minute wrote:

              Doncha just luuuv office politics?

              I just wonder why the Dutch don't trust the Belgians? :laugh:

                WBurgMo
                wrote on last edited by
                #7

                Dimitri Backaert wrote:

                This doesn't work however, I seem to be stuck at the logon page. I'm not able to pass the user security. Anyone has an idea?

                It would seem that WebClient is not recognizing this page as an authentication request. You will most likely have to format the correct response manually and send it to the server. I would use Wireshark to trace a manual session with the server. This should let you see what the server expects as an authentication response.

                James Johnson
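
One way to check the point above before reaching for a packet trace is to request the page without credentials and look at what comes back. The following is a minimal sketch, not a definitive diagnosis: the class name `AuthProbe` and the three labels it returns are made up for illustration. A 401 with a `WWW-Authenticate` header means the server uses HTTP authentication, which is what `WebClient.Credentials` answers; a redirect or a plain 200 suggests an HTML logon form instead.

```csharp
using System;
using System.Net;

public static class AuthProbe
{
    // Classifies a response from its status code and WWW-Authenticate header.
    // Kept as a pure helper so the decision logic can be checked without a network call.
    public static string Classify(int statusCode, string wwwAuthenticate)
    {
        if (statusCode == 401 && !string.IsNullOrEmpty(wwwAuthenticate))
            return "http-auth";    // HTTP challenge seen: Credentials should apply
        if (statusCode >= 300 && statusCode < 400)
            return "redirect";     // possibly a redirect to a logon form
        return "no-challenge";     // no HTTP challenge seen, e.g. a logon form served with 200
    }

    // Requests the URL without credentials and without following redirects.
    public static string Probe(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false;
        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                return Classify((int)response.StatusCode, response.Headers["WWW-Authenticate"]);
            }
        }
        catch (WebException e)
        {
            // Error status codes surface as WebException; the response is still attached.
            HttpWebResponse response = e.Response as HttpWebResponse;
            if (response == null) throw;
            using (response)
            {
                return Classify((int)response.StatusCode, response.Headers["WWW-Authenticate"]);
            }
        }
    }
}
```

If `Probe` reports anything other than "http-auth", the credentials set on `WebClient` are probably never being asked for, which would match the symptom of landing on the logon page.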

                  Dimitri Backaert
                  wrote on last edited by
                  #8

                  This might all seem very funny, I know, but it doesn't solve my problem... :doh: As for the chairman, my manager has sent a request, but that might take a while (an office-politics war between Belgium and the Netherlands). The main reason the Netherlands are unwilling to release information is that they were formerly the only group maintaining the ICT infrastructure for the whole Benelux company group. Since the beginning of 2009, however, Belgium and Luxembourg have split from the Netherlands and created their own ICT division. I think it's some sort of job protection... Politics, money, it's all involved. And it's - excuse my language - a real pain in the ass...

                    Dimitri Backaert
                    wrote on last edited by
                    #9

                    James, thank you for your answer. This could do the trick; I'll try it out. I was already thinking about switching to an HttpWebRequest in a separate method, because as far as I can tell my current request is just a GET. What I need is a POST, so I'm thinking of setting the request's Method to "POST" and reading the reply from the HttpWebResponse...
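
The POST idea could look roughly like the sketch below, assuming the intranet page uses an HTML logon form that sets a session cookie. Everything site-specific here is an assumption: the form field names `username` and `password`, the separate `loginUrl`, and the class and method names are hypothetical, and the real values would have to be read from the form's HTML or from a Wireshark trace of a manual logon.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class FormLogin
{
    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    // Spaces are encoded as %20 rather than '+'; servers decode both.
    public static string BuildFormBody(params string[] namesAndValues)
    {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i + 1 < namesAndValues.Length; i += 2)
        {
            if (body.Length > 0) body.Append('&');
            body.Append(Uri.EscapeDataString(namesAndValues[i]));
            body.Append('=');
            body.Append(Uri.EscapeDataString(namesAndValues[i + 1]));
        }
        return body.ToString();
    }

    // Posts the logon form, then fetches the protected page with the same cookies.
    public static string GetPageAfterLogin(string loginUrl, string pageUrl,
                                           string username, string password)
    {
        CookieContainer cookies = new CookieContainer();

        // "username" and "password" are hypothetical field names (assumption).
        byte[] payload = Encoding.UTF8.GetBytes(
            BuildFormBody("username", username, "password", password));

        HttpWebRequest login = (HttpWebRequest)WebRequest.Create(loginUrl);
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.ContentLength = payload.Length;
        login.CookieContainer = cookies;
        using (Stream requestStream = login.GetRequestStream())
        {
            requestStream.Write(payload, 0, payload.Length);
        }
        // The body of the logon response is not needed; its cookies land in the container.
        using (login.GetResponse()) { }

        HttpWebRequest page = (HttpWebRequest)WebRequest.Create(pageUrl);
        page.CookieContainer = cookies;
        using (WebResponse response = page.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Sharing one `CookieContainer` between the two requests is the important part: without it the second request arrives without the session cookie and the server would most likely serve the logon page again. Some logon forms also expect hidden fields, which this sketch does not handle.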
