Reading HTML File

    t4ure4n wrote (#1):

    I have to read HTML files in a project. I am using StreamReader to do that, but when I read the document I get all the tags etc. along with it. Is there any way to read only the data (what is displayed when you view the page in a web browser) rather than the whole source?

    o O º(`'·.,(`'·., ☆,.·''),.·'')º O o° »·'"`»* *☆ t4ure4n ☆* *«·'"`« °o O º(,.·''(,.·'' ☆`'·.,)`'·.,)º O o°

      Manas Bhardwaj wrote (#2):

      Maybe this will help you :rose:

      public string StripHTML(string source)
      {
          // Flatten line breaks so the patterns below can match across lines.
          string result = source.Replace("\r", " ").Replace("\n", " ");

          // Remove <script> blocks together with their contents.
          result = System.Text.RegularExpressions.Regex.Replace(result,
                      @"<\s*script[^>]*>.*?<\s*/\s*script\s*>", "",
                      System.Text.RegularExpressions.RegexOptions.IgnoreCase);

          // Remove every remaining tag.
          result = System.Text.RegularExpressions.Regex.Replace(result, "<[^>]*>", "");

          // technicalStopWordArrayList is a collection from the poster's own project (not shown here).
          for (int count = 0; count < technicalStopWordArrayList.Count; count++)
          {
              result = result.Replace(technicalStopWordArrayList[count].ToString(), " ");
          }
          result = result.Replace("&", " ");

          return result.Trim();
      }
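
For readers who want something that compiles on its own, here is a minimal sketch of the same idea without the project-specific stop-word list. The class name `HtmlText` and the method `ToPlainText` are made up for illustration and do not come from the thread. It assumes a runtime where `System.Net.WebUtility.HtmlDecode` is available. Regex-based stripping is only an approximation of what a browser displays and may misbehave on malformed markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper (not from the thread): returns an approximation
    // of the visible text of an HTML string.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Drop <script> and <style> blocks together with their contents.
        string text = Regex.Replace(html,
            @"<\s*(script|style)\b[^>]*>.*?<\s*/\s*\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop every remaining tag, leaving a space so adjacent words stay apart.
        text = Regex.Replace(text, @"<[^>]*>", " ");

        // Turn entities such as &amp; into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace (including non-breaking spaces) into single spaces.
        return Regex.Replace(text, @"[\s\u00A0]+", " ").Trim();
    }
}
```

To use it on a file, one option is to read the whole document first, for example with `File.ReadAllText(path)`, and pass the resulting string to `HtmlText.ToPlainText`.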
      
        t4ure4n wrote (#3):

        Thanks, it works fine, but I have one more question: is it possible to preserve hrefs? I would have tried it myself, but I don't know anything about regular expressions, so I have to rely on you. I just commented this out because I don't know what it is:

        for (int count = 0; count < technicalStopWordArrayList.Count; count++)
        {
            result = result.Replace(technicalStopWordArrayList[count].ToString(), " ");
        }
        

          Manas Bhardwaj wrote (#4):

          Oops, sorry! That loop is from my own code, where I used it. You don't need it; comment it out ;)

            t4ure4n wrote (#5):

            Thanks. Just one question: is it possible to preserve hrefs (hyperlinks)? If yes, how?
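
The thread has no answer to this question. One possible approach, sketched here under assumptions, is to rewrite each link as "text (url)" before the remaining tags are stripped. The names `LinkPreservingStripper` and `StripKeepingLinks` are hypothetical, and the pattern only handles `href` values wrapped in single or double quotes. For real-world pages a proper HTML parser is likely a safer choice than regular expressions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkPreservingStripper
{
    // Hypothetical helper: strips tags but keeps each hyperlink as "text (url)".
    public static string StripKeepingLinks(string html)
    {
        // Rewrite <a ... href="url" ...>text</a> as "text (url)"
        // before any tag is removed. Assumes the href value is quoted.
        string result = Regex.Replace(html,
            @"<\s*a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)<\s*/\s*a\s*>",
            "$2 ($1)",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove all remaining tags, including any nested inside the link text.
        result = Regex.Replace(result, @"<[^>]*>", "");

        return result.Trim();
    }
}
```

The link rewrite has to run before the generic tag removal; otherwise the `href` attribute is gone by the time the link pattern is applied.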
