Email Address Extraction
-
Sorry, I posted this a couple of days ago then disappeared. I didn't post much info either (the problem then was me being totally dippy) so I thought I'd risk the wrath of you lads and post it again with more info, save the flogging for later please ;P
Anyway, short story is that I have to extract email addresses from a lot of files which have been exported from a database. There is no general format to these files, so it's a hack & slash search to find them. The only things I have guaranteed are that there will be a space before & after the address and that the @ character will only be in the email addresses, not in any other fields.
A sample of the file would be something like this, but a lot bigger (roughly 2000 characters each), and they can contain more than one email address:
asd98a7098a70d98as abc-def@hotmail.com as8709-898 Dundee Geffen oiu7098
What I'm trying to do just now is write a ParseEmail(CString wholeFile) function that gets the file passed to it and works through it, picking out the email addresses. My problem is that CString isn't behaving how I would expect:
while(whoIs.Find("@",0) != -1)
{
    // Found a @ character, only usually in email addresses.
    // Get the start and end of the character.
    int index = whoIs.Find("@",0);
    TRACE("@ character at position %d\n",index);

    // Find the trailing space.
    int end = whoIs.Find(' ',index);
    TRACE("end = %d\n",end);

    int pos = index-1;
    CString ch;
    ch = whoIs.GetAt(pos);
    while(ch.GetAt(pos) != ' ')
    {
        pos--;
        TRACE("ch = %s\n",ch);
    }

    CString email = whoIs.Mid(pos,end);
    TRACE("email address = %s\n",email);
    whoIs.Delete(0,end);
}
The problem, after all this long-windedness, is that finding the space before and after the @ doesn't work as I expect. It usually ignores spaces and other characters and returns me 50 letter email addresses :) Does anyone have an idea of how I could improve this function so that it actually works? Cheers and sorry for the 10,000 word essay :)
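Two things in the posted loop look like likely causes: ch only ever holds a single character, so ch.GetAt(pos) never walks back through the file text, and CString::Mid takes a start index and a character count rather than a start and an end index. A minimal sketch of the same scan, assuming the guarantees stated in the question (a space on either side of each address, @ only inside addresses). It uses std::string in place of CString so it compiles without MFC, and the name ExtractEmails is made up for illustration:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns every space-delimited run of characters that
// contains '@'. std::string stands in for CString here (an assumption).
std::vector<std::string> ExtractEmails(const std::string& text)
{
    std::vector<std::string> emails;
    std::string::size_type at = text.find('@');
    while (at != std::string::npos)
    {
        // Start: one past the nearest space before '@', or 0 if there is none.
        std::string::size_type before = text.rfind(' ', at);
        std::string::size_type start =
            (before == std::string::npos) ? 0 : before + 1;

        // End: the nearest space after '@', or the end of the text.
        std::string::size_type end = text.find(' ', at);
        if (end == std::string::npos)
            end = text.size();

        // substr takes (start, length), not (start, end).
        emails.push_back(text.substr(start, end - start));

        // Carry on searching after this address.
        at = text.find('@', end);
    }
    return emails;
}
```

Called on the sample line from the post, this returns a single entry, abc-def@hotmail.com. It does no validation at all, so it is only as good as the two guarantees it relies on.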
-
If you want to get it right you'd better start off without CString. I've done RFC compliant mail parsing and you can believe me when I tell you that a working parser takes more than 200 lines of C++ code. Just as an example: <"Duh:my=mail-home"@[100.99.98.1]> is a perfectly valid mail address and yummy: "Jonny\"s Dumb" (who's he anyway) jd@dumb.com; too
Holy Sh*t! I'm speechless. (hey, that's a first) Marc Clifton, The Lounge
-
Why not?
-
looks like a spammer's tool to me ;) feed it some files (webpages, usenet dump) and extract away ; bryce
ack! i hope not. :omg:
-
Thanks for the answers Larry and the advice Andreas. The reason for using C++ is that this is a small part of a much larger project that I've been working on. To be honest I only need basic email parsing just now, because this is mainly for a demonstration next week for the company buying it, so that after 20 revisions of the interface and other features they can get their hands on a demo, we can do a final usability test, and we can start the intended users off with a good run through of the program. Once this is done I can get down to the nitty gritty of trying to get proper RFC compliance rather than the quick bodge I'm trying to do just now. As I said, the email parsing is a really small part of the project that just now is getting pushed aside in favour of getting other things up to scratch. I've got a full week pencilled in for working on this and another thing next month, but until then it's bodgey bodgey for me. blehhh, too many big posts in one day make my head hurt ;P Thanks for the hints & tips guys
-
What is that? Perl? Perhaps you are correct. I'm not a big Perl fan. :~
-
eh, don't mean to be rude but f**k no I wouldn't work on something like that. Anyone who works on them deserves to have all of the spammers in the world thrown into a room with them and nuked alongside them. It's all part of a contract I've been working on with an accountancy firm. It's a whole bigass project but the thing with the email address relates to them needing a central database with all the email addresses of their clients/partners/co-workers in the Far East and USA. The email part of it is pretty insignificant in the scheme of the whole project, hence the bodge job before the demo next week :-D
-
Don't shoot me, shoot bryce. :-D
-
/me gets out the gun :D Nah, I just think that helping people spam more crap about getting a bigger ehhhhhh member :-O or sending your cash to a king fleeing his country so he can pay you back double is just asking to be shot. Spammers should have their testicle hairs pulled out by a lion, the people who write the mass-senders just need a good slap and to have their internet connections taken away.
-
Strange, this had a 1.0 rating. Anyway... Perhaps try CTokenizer: http://www.codeproject.com/string/tokenizer.asp The code would be something like this:
CTokenizer tok( YOUR_INPUT_STRING, " " );
CString str;
while( tok.Next( str ) )
{
    if( str.Find("@") != -1 )
    {
        // assuming fields are delimited by spaces, the e-mail
        // address is simple, and no other fields contain @, then
        // the e-mail address is now in str
    }
}
Larry Antram wrote: Strange, this had a 1.0 rating.
One of the great things about having a high post count is that my vote on a post counts for more than most people's (possibly everyone except Nish). I gave it a 5 to bring it back up. I don't get why some people vote posts the way they do. Christian
No offense, but I don't really want to encourage the creation of another VB developer. - Larry Antram 22 Oct 2002
C# will attract all comers, where VB is for IT Journalists and managers - Michael P Butler 05-12-2002
Again, you can screw up a C/C++ program just as easily as a VB program. OK, maybe not as easily, but it's certainly doable. - Jamie Nordmeyer - 15-Nov-2002
-
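For anyone without CTokenizer to hand, the tokenizing idea suggested above can be sketched with the standard library alone. This is only an illustration under the same assumptions as that snippet (fields separated by whitespace, @ appearing only in addresses); the function name TokensWithAt is hypothetical and std::string is used in place of CString:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the CTokenizer loop. operator>> splits the input
// on any whitespace (spaces, tabs, newlines), which is slightly broader than
// splitting on a single space.
std::vector<std::string> TokensWithAt(const std::string& input)
{
    std::vector<std::string> result;
    std::istringstream stream(input);
    std::string token;
    while (stream >> token)
    {
        // Keep only the tokens that contain '@'.
        if (token.find('@') != std::string::npos)
            result.push_back(token);
    }
    return result;
}
```

Compared with searching for @ and working outwards by index, this avoids the offset arithmetic entirely, at the cost of visiting every token in the file.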
So what? C# would also do it quickly with a regular expression. That does not help much if you're doing work on data from a data source in C++, now does it? Christian
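For what it's worth, standard C++ has shipped a regular expression library in the <regex> header since C++11, which was not an option when this thread was written. A rough sketch of the regular expression route on that basis follows. The pattern only looks for a run of non-space characters around an @, so it is nowhere near RFC compliant, and the name FindAddressLikeTokens is made up for illustration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of "non-space characters, '@',
// non-space characters". This mirrors the thread's assumptions and does not
// validate the addresses in any way.
std::vector<std::string> FindAddressLikeTokens(const std::string& text)
{
    static const std::regex pattern(R"(\S+@\S+)");
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), last;
         it != last; ++it)
    {
        matches.push_back(it->str());
    }
    return matches;
}
```

Unlike a plain search for @, the pattern requires at least one character on each side of it, so a stray @ standing on its own is skipped.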