Parsing Text.. Beginner Question.

RobJones

Hello, sorry for the lengthy post. I am trying to figure out the best way to parse text. I created a program that pulls the source HTML from a given URL and dumps it into a text file. I then parse the text file and put the parsed values into CStrings. Is there an easier, better, or cleaner way to parse text for values? Here is a sample of the code I'm using to parse. Any ideas to make this easier or more efficient would be greatly appreciated.

//This parses the html for the month
int nMyIndexMoPDT, nFirstIndexMoPDT, nSecondIndexMoPDT;
nMyIndexMoPDT = strPDT.Find(_T("the time will be..."));
nFirstIndexMoPDT = strPDT.Find(_T(","), nMyIndexMoPDT);
nSecondIndexMoPDT = strPDT.Find(_T(","), nFirstIndexMoPDT+1);
strPDTMo = strPDT.Mid(nFirstIndexMoPDT+1, nSecondIndexMoPDT-nFirstIndexMoPDT-1);
strPDTMo.TrimLeft(_T(" "));
strMonthsPDT = strPDTMo.Left(3);

	//This parses the html for the Current Time PDT
	//Value should look like 00:00:00 after this part
	int nMyIndexPDT, nFirstIndexPDT, nSecondIndexPDT;
	nMyIndexPDT = strPDT.Find(_T("tone"));
	nFirstIndexPDT = strPDT.Find(_T(">"), nMyIndexPDT);
	nSecondIndexPDT = strPDT.Find(_T("<"), nFirstIndexPDT+1);
	strPDT1 = strPDT.Mid(nFirstIndexPDT, nSecondIndexPDT-3-nFirstIndexPDT-1);
	strPDT = strPDT1.Right(8);
	
	//This parses the html for the current day 2 parts
	int nMyIndexPDTDays, nFirstIndexPDTDays, nSecondIndexPDTDays;
	nMyIndexPDTDays = strPDT.Find(_T("At the tone, the time will be..."));
	nFirstIndexPDTDays = strPDT.Find(_T("b"), nMyIndexPDTDays);
	nSecondIndexPDTDays = strPDT.Find(_T("<"), nFirstIndexPDTDays);
	strPDTDays = strPDT.Mid(nFirstIndexPDTDays+3, nSecondIndexPDTDays-nFirstIndexPDTDays-3);
	
	//At this point the string should be "Wednesday, Jul 10, 2001 00:00:00 PDT"
	//The next bit of code pulls the day out the string should be "10" for example
	int nMyIndexPDTDaysTemp = strPDTDays.Find(_T(","));
	int nFirstIndexPDTDaysTemp = strPDTDays.Find(_T(","));
	int nSecondIndexPDTDaysTemp = strPDTDays.Find(_T(","), nMyIndexPDTDaysTemp+1);
	CString strDaysP1 = strPDTDays.Mid(nFirstIndexPDTDaysTemp+1,
		nSecondIndexPDTDaysTemp-nFirstIndexPDTDaysTemp-1);
	CString strDaysP2 = strDaysP1.Right(2);
	iPDTD = atoi(strDaysP2);

	//Are the hours 1 or 2 digits? Hours Extraction
	//This part looks at the 00:00:00 and extracts the Left 2 digits
	strPDTHours = strPDT.Left(2);
	if(strPDTHours.Right(1) == _T(":"))
	{
		strPDTHours = strPDT.Left(1);
	}
	in

Erik Thompson

Regular expressions would be better and possibly more efficient: less code, and easier to read. Try www.boost.org for a template-based regular expression library. Cheers, -Erik
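
To give a rough idea of what the regular expression approach could look like, here is a minimal sketch. It assumes a modern compiler and uses the standard `<regex>` header rather than the boost library mentioned above, and std::string rather than CString. The `TimeStamp` struct and `ParseTimeStamp` function are made-up names for illustration, and the pattern assumes the text contains something shaped like the "Wednesday, Jul 10, 2001 00:00:00 PDT" string from the comments in the original code.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the fields the original code extracts one at a time.
struct TimeStamp {
    std::string weekday;
    std::string month;   // first three letters, like strPDTMo.Left(3)
    std::string zone;
    int day = 0, year = 0, hour = 0, minute = 0, second = 0;
};

// Returns true and fills `out` if `text` contains a date/time shaped like
// "Wednesday, Jul 10, 2001 00:00:00 PDT". Returns false otherwise.
bool ParseTimeStamp(const std::string& text, TimeStamp& out)
{
    // One capture group per field: weekday, month, day, year, h, m, s, zone.
    static const std::regex pattern(
        R"(([A-Za-z]+),\s*([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})\s+(\d{1,2}):(\d{2}):(\d{2})\s+([A-Z]+))");

    std::smatch m;
    if (!std::regex_search(text, m, pattern))
        return false;

    out.weekday = m[1].str();
    out.month   = m[2].str().substr(0, 3);
    out.day     = std::stoi(m[3].str());
    out.year    = std::stoi(m[4].str());
    out.hour    = std::stoi(m[5].str());   // \d{1,2} covers 1- or 2-digit hours
    out.minute  = std::stoi(m[6].str());
    out.second  = std::stoi(m[7].str());
    out.zone    = m[8].str();
    return true;
}
```

A single pattern replaces the separate Find/Mid/Right steps, and the one-or-two-digit hour check falls out of `\d{1,2}`. If the page layout differs from the assumed shape, the pattern would need adjusting.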

Nemanja Trifunovic

I've worked a lot with boost::RegExp, and it is great as long as you don't work with Unicode (a bug in VC 6.0, not in the boost library). However, I'm not sure I would recommend boost to a beginner. I vote pro drink :beer:

RobJones

Thanks, everyone. I'll take a look at your suggestions. I am pretty new at programming, so I'll have to see how hard this looks. One more question: can I do a .TrimLeft(_T("<")) on a string, then do another trim on the next line, and keep trimming until all the useless values are gone? Thanks, Rob
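
On the repeated-trim idea: a trim call removes every leading character that matches, so calling it several times in a row is usually unnecessary. CString::TrimLeft also has an overload that takes a string of target characters, which strips any mix of them in one call. The sketch below shows the same idea in plain standard C++ with std::string (not CString); `TrimLeftAnyOf` and `TrimRightAnyOf` are hypothetical helper names.

```cpp
#include <string>

// Removes every leading character that appears in `unwanted`.
// One call handles any mix of the unwanted characters, so no loop is needed.
std::string TrimLeftAnyOf(const std::string& s, const std::string& unwanted)
{
    const std::string::size_type first = s.find_first_not_of(unwanted);
    if (first == std::string::npos)
        return std::string();          // the whole string was unwanted
    return s.substr(first);
}

// Removes every trailing character that appears in `unwanted`.
std::string TrimRightAnyOf(const std::string& s, const std::string& unwanted)
{
    const std::string::size_type last = s.find_last_not_of(unwanted);
    if (last == std::string::npos)
        return std::string();
    return s.substr(0, last + 1);
}
```

Note that trimming only touches the ends of a string. Unwanted text in the middle, such as HTML tags between values, needs a different tool.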

A R 0

Your last question touches on what regular expressions are good at. To get an idea of how they work, open your HTML in any editor that supports regular-expression search and replace (CodeWright, for example). Then run a replace like this one:

Search For: <[^>]*>
Replace With: (leave empty)

Check the "regular expression" check box as well as the "replace all" check box. After that, you will be able to judge the advantages of using regular expression parsers.
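
The same search/replace can be tried in code. This is a minimal sketch using the standard `<regex>` header (an assumption; the thread itself suggests boost), with `StripTags` as a made-up function name. It applies the `<[^>]*>` pattern with an empty replacement.

```cpp
#include <regex>
#include <string>

// Replaces every "<...>" run with nothing, leaving only the text between tags.
std::string StripTags(const std::string& html)
{
    // "<", then any characters that are not ">", then ">".
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}
```

This is fine for pulling a value out of simple markup, but it is not a real HTML parser: a ">" inside an attribute value, comments, and script blocks would all trip it up.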