Parsing a web page to get just the <p> inner text.

    David C Hobbyist wrote (#1):

    I have a method that should extract the text between "paragraph" tags, but I am also getting CSS text and JavaScript code. Here is my code (be kind, I am self-taught).

    private static string GetParagraphs(string webPage)
    {
        string subWebPage = webPage;
        int subWebPageStartIndex = 0;
        string paragraph = "";
        string paragraphs = "";
        int startIndex = 0;
        int endIndex = 0;
        while (subWebPageStartIndex < webPage.LastIndexOf("<p>"))
        {
            subWebPage = webPage.Substring(subWebPageStartIndex);
            startIndex = subWebPage.IndexOf("<p>") + 3 + subWebPageStartIndex;
            endIndex = subWebPage.IndexOf("</p>") + subWebPageStartIndex;
            paragraph = webPage.Substring(startIndex, endIndex);
            paragraphs = paragraphs + " " + paragraph; //TODO: Refactor to use StringBuilder class.
            subWebPageStartIndex = endIndex + 4;
            Debug.WriteLine(paragraph);
        }
        return paragraphs;
    }

    Maybe you can see where I have messed up. :confused: Thank you for taking the time to read this.

    Frazzle the name say's it all

    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.

    John F. Woods
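
    A note on why the method above picks up CSS and JavaScript: in .NET, String.Substring(int, int) takes a start index and a length, not an end index, so passing endIndex as the second argument copies far more than one paragraph (and can throw ArgumentOutOfRangeException near the end of the page). The search for "</p>" also starts before the opening tag instead of after it. Below is a minimal sketch of the same IndexOf approach with those two points addressed. It assumes every paragraph opens with a bare "<p>" (no attributes); the class name ParagraphScanner and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphScanner // hypothetical name, not from the thread
{
    // Returns the text between each "<p>" and the following "</p>".
    // Assumes bare "<p>" tags without attributes.
    public static List<string> GetParagraphs(string webPage)
    {
        const string open = "<p>";
        const string close = "</p>";
        var paragraphs = new List<string>();
        int position = 0;

        while (true)
        {
            int openIndex = webPage.IndexOf(open, position, StringComparison.OrdinalIgnoreCase);
            if (openIndex < 0) break; // no more paragraphs

            int start = openIndex + open.Length;
            // Search for the closing tag only after the opening tag.
            int end = webPage.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
            if (end < 0) break; // unclosed paragraph: stop rather than guess

            // Substring takes (startIndex, length), so pass end - start.
            paragraphs.Add(webPage.Substring(start, end - start));
            position = end + close.Length;
        }
        return paragraphs;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample page with CSS and JavaScript around the paragraphs.
        string html = "<style>p { color: red; }</style><p>First</p><script>var x = 1;</script><p>Second</p>";
        Console.WriteLine(string.Join(" | ", ParagraphScanner.GetParagraphs(html)));
        // prints "First | Second"
    }
}
```

    Returning a list instead of one concatenated string also sidesteps the StringBuilder TODO; the caller can join the items with string.Join if a single string is needed.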

      N a v a n e e t h wrote (#2):

      What you have done is BAD. The ideal way to handle this is to use an HTML parser and traverse the DOM to get the text that you need. Look at the HTML Agility Pack project. If you are sure that you will always have well-formed input, you could easily do this with regular expressions. Here is a working example.

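      using System.Collections.Generic;
      using System.Text.RegularExpressions;

      public static List<string> GetAllParagraphValues(string input)
      {
          List<string> values = new List<string>();
          // Lazily capture everything between an opening <p ...> tag and the next </p> in the named group "value".
          Regex r = new Regex("<p[^>]*>(?<value>.*?)</p>", RegexOptions.IgnoreCase);
          foreach (Match match in r.Matches(input))
          {
              values.Add(match.Groups["value"].Value);
          }
          return values;
      }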

      Best wishes, Navaneeth
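
      Two caveats with the pattern above are worth knowing. By default "." does not match a newline, so a paragraph that spans several lines is skipped unless RegexOptions.Singleline is set, and "<p[^>]*>" also matches tags such as <pre> that merely start with "p". The sketch below handles both, and additionally strips nested tags and decodes HTML entities. The names ParagraphRegex and GetParagraphText are made up for illustration, and like any regex approach it still assumes well-formed input.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ParagraphRegex // hypothetical name, not from the thread
{
    // \b stops <pre>, <param> and similar tags from matching;
    // Singleline lets "." match across line breaks.
    private static readonly Regex Paragraph = new Regex(
        @"<p\b[^>]*>(?<value>.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any single tag, used to drop markup nested inside a paragraph.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static List<string> GetParagraphText(string input)
    {
        var values = new List<string>();
        foreach (Match match in Paragraph.Matches(input))
        {
            string inner = match.Groups["value"].Value;
            string text = AnyTag.Replace(inner, "");          // remove nested tags such as <b>
            values.Add(WebUtility.HtmlDecode(text).Trim());   // turn &amp; into & and trim whitespace
        }
        return values;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up input: a <pre> block that must be ignored and one multi-line paragraph.
        string html = "<pre>code</pre><p class=\"x\">Fish &amp;\n<b>chips</b></p>";
        foreach (string paragraph in ParagraphRegex.GetParagraphText(html))
        {
            Console.WriteLine(paragraph);
        }
    }
}
```

      For pages you do not control, an HTML parser remains the safer choice, since real-world markup often omits closing </p> tags.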

        David C Hobbyist wrote (#3):

        N a v a n e e t h wrote:

        What you have done is BAD.

        I knew this; it looks bad and did not work. :( I am learning slowly, and as for style, that will come in time.

        N a v a n e e t h wrote:

        If you are sure that you will always have well-formed input, you could easily do this with regular expressions.

        Where can I learn about regular expressions? Thank you.

        Frazzle the name say's it all

        Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.

        John F. Woods

          Lost User wrote (#4):

          frazzle-me wrote:

          Where can I learn about regular expressions?

          Lots of places; Google would be a good start.
