Parsing Text.. Beginner Question.

    RobJones
    wrote on last edited by
    #1

    Hello, sorry for the lengthy post. I am trying to figure out the best way to parse text. I wrote a program that pulls the source HTML from a given URL and dumps it into a text file. I then parse the text file and put the parsed values into CStrings. Is there an easier, better, or cleaner way to parse text for values? Here is a sample of the code I'm using to parse. Any ideas to make this easier or more efficient would be greatly appreciated.

    //This parses the html for the month
    int nMyIndexMoPDT, nFirstIndexMoPDT, nSecondIndexMoPDT;
    nMyIndexMoPDT = strPDT.Find(_T("the time will be..."));
    nFirstIndexMoPDT = strPDT.Find(_T(","), nMyIndexMoPDT);
    nSecondIndexMoPDT = strPDT.Find(_T(","), nFirstIndexMoPDT+1);
    strPDTMo = strPDT.Mid(nFirstIndexMoPDT+1, nSecondIndexMoPDT-nFirstIndexMoPDT-1);
    strPDTMo.TrimLeft(_T(" "));
    strMonthsPDT = strPDTMo.Left(3);

    // This parses the HTML for the current time (PDT)
    // Value should look like 00:00:00 after this part
    int nMyIndexPDT, nFirstIndexPDT, nSecondIndexPDT;
    nMyIndexPDT = strPDT.Find(_T("tone"));
    nFirstIndexPDT = strPDT.Find(_T(">"), nMyIndexPDT);
    nSecondIndexPDT = strPDT.Find(_T("<"), nFirstIndexPDT+1);
    strPDT1 = strPDT.Mid(nFirstIndexPDT, nSecondIndexPDT-3-nFirstIndexPDT-1);
    strPDT = strPDT1.Right(8);

    // This parses the HTML for the current day (2 parts)
    int nMyIndexPDTDays, nFirstIndexPDTDays, nSecondIndexPDTDays;
    nMyIndexPDTDays = strPDT.Find(_T("At the tone, the time will be..."));
    nFirstIndexPDTDays = strPDT.Find(_T("b"), nMyIndexPDTDays);
    nSecondIndexPDTDays = strPDT.Find(_T("<"), nFirstIndexPDTDays);
    strPDTDays = strPDT.Mid(nFirstIndexPDTDays+3, nSecondIndexPDTDays-nFirstIndexPDTDays-3);

    // At this point the string should be "Wednesday, Jul 10, 2001 00:00:00 PDT"
    // The next bit of code pulls the day out of the string; it should be "10", for example
    int nMyIndexPDTDaysTemp = strPDTDays.Find(_T(","));
    int nFirstIndexPDTDaysTemp = strPDTDays.Find(_T(","));
    int nSecondIndexPDTDaysTemp = strPDTDays.Find(_T(","), nMyIndexPDTDaysTemp+1);
    CString strDaysP1 = strPDTDays.Mid(nFirstIndexPDTDaysTemp+1, nSecondIndexPDTDaysTemp-nFirstIndexPDTDaysTemp-1);
    CString strDaysP2 = strDaysP1.Right(2);
    // _ttoi matches the _T() macros in both ANSI and Unicode builds
    iPDTD = _ttoi(strDaysP2);

    // Are the hours 1 or 2 digits? Hours extraction
    // This part looks at the 00:00:00 and extracts the left 2 digits
    strPDTHours = strPDT.Left(2);
    if (strPDTHours.Right(1) == _T(":"))
    {
        strPDTHours = strPDT.Left(1);
    }
    	in
    
      Erik Thompson
      wrote on last edited by
      #2

      Regular expressions would be better and possibly more efficient: less code, and easier to read. Try www.boost.org for a template-based regular expression library.

      Cheers,
      -Erik
      My thoughts are my own and reflect on no other.
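
To make the suggestion concrete, here is a minimal sketch of pulling every field out of the timestamp in one pass. It uses the standard library's `std::regex` rather than Boost, and plain `std::string` rather than `CString`, so it is an illustration of the idea and not the Boost API. The `TimeStamp` struct, the `parseTimeStamp` name, and the pattern itself are assumptions based on the sample string "Wednesday, Jul 10, 2001 00:00:00 PDT" from the original post.

```cpp
#include <regex>
#include <string>

// Hypothetical container for the fields of a line such as
// "Wednesday, Jul 10, 2001 00:00:00 PDT".
struct TimeStamp {
    std::string weekday;
    std::string month;   // first three letters of the month, e.g. "Jul"
    int day = 0;
    int year = 0;
    int hour = 0;
    int minute = 0;
    int second = 0;
    std::string zone;
};

// Hypothetical helper: searches `text` for the first timestamp of the
// assumed form and fills `out`. Returns false if nothing matches.
bool parseTimeStamp(const std::string& text, TimeStamp& out) {
    // weekday, month day, year hh:mm:ss zone
    static const std::regex pattern(
        R"((\w+),\s*(\w{3})\w*\s+(\d{1,2}),\s*(\d{4})\s+(\d{1,2}):(\d{2}):(\d{2})\s+(\w+))");

    std::smatch m;
    if (!std::regex_search(text, m, pattern)) {
        return false;
    }
    out.weekday = m[1].str();
    out.month   = m[2].str();
    out.day     = std::stoi(m[3].str());
    out.year    = std::stoi(m[4].str());
    out.hour    = std::stoi(m[5].str());   // handles 1- or 2-digit hours
    out.minute  = std::stoi(m[6].str());
    out.second  = std::stoi(m[7].str());
    out.zone    = m[8].str();
    return true;
}
```

Calling `parseTimeStamp(pageText, ts)` on the downloaded page text would replace the separate Find/Mid blocks for month, day, and hours, provided the page really contains a timestamp in this shape.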

        Nemanja Trifunovic
        wrote on last edited by
        #3

        I've worked a lot with boost::RegExp, and it is great as long as you don't work with Unicode (a bug in VC 6.0, not in the Boost library). However, I'm not sure I would recommend Boost to a beginner. I vote pro drink :beer:

          RobJones
          wrote on last edited by
          #4

          Thanks, everyone. I'll take a look at your suggestions. I am pretty new to programming, so I'll have to see how hard this looks. One more question: can I do a .TrimLeft(_T("<")) on a string, then do another trim on the next line, and keep trimming until all the useless values are gone? Thanks, Rob
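
For reference, `CString::TrimLeft` with a string argument treats that string as a set of characters and removes leading characters until it meets one that is not in the set, so several trims can usually be folded into one call with a larger set. The sketch below mimics that behavior with `std::string`; `trimLeftChars` is a hypothetical stand-in written for illustration, not an MFC function.

```cpp
#include <string>

// Hypothetical stand-in for CString::TrimLeft(targets): drops every leading
// character that appears in `targets` and stops at the first one that does not.
std::string trimLeftChars(const std::string& s, const std::string& targets) {
    const std::string::size_type first = s.find_first_not_of(targets);
    // npos means the whole string consisted of target characters.
    return first == std::string::npos ? std::string() : s.substr(first);
}
```

Note the limit of this approach: trimming "<" from "<b>text" leaves "b>text", because the trim stops at the first character outside the set. It cannot remove a whole tag, only a run of known leading characters.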

            A R 0
            wrote on last edited by
            #5

            Your last question relates to what regular expressions do. To get an idea of how they work, open your HTML in any editor that supports regular-expression search and replace (like CodeWright), then run a replace like this one:

            Search For: <[^>]*>
            Replace With: (leave empty)

            Check the "regular expression" check box as well as the "replace all" check box. After that, you will be able to judge the advantages of using regular expression parsers.
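
The same search and replace can be sketched in code. This version assumes the standard library's `std::regex` (not Boost, and not an editor) and a hypothetical function name, `stripTags`. It applies the pattern `<[^>]*>` from the post and replaces every match with nothing.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes everything that looks like an HTML tag,
// i.e. a '<', any run of characters other than '>', then a '>'.
std::string stripTags(const std::string& html) {
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}
```

This is a text-level cleanup rather than a real HTML parser: it does not decode entities such as `&nbsp;`, and a literal '<' inside the page text could confuse it.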
