Email Address Extraction

  carrie wrote:

    Sorry, I posted this a couple of days ago then disappeared. I didn't post much info either (the problem then was me being totally dippy), so I thought I'd risk the wrath of you lads and post it again with more info. Save the flogging for later please ;P

    Anyway, the short story is that I have to extract email addresses from a lot of files which have been exported from a database. There is no general format to these files, so it's a hack & slash search to find them. The only things I have guaranteed are that there will be a space before and after the address, and that the @ character will only be in the email addresses, not in any other fields. A sample of the file would be something like this, but a lot bigger (roughly 2000 characters each), and they can contain more than one email address:

    asd98a7098a70d98as abc-def@hotmail.com as8709-898 Dundee Geffen oiu7098

    What I'm trying to do just now is write a ParseEmail(CString wholeFile) function that gets the file passed to it and works through it, picking out the email addresses. My problem is that CString isn't behaving how I would expect:

    while(whoIs.Find("@",0) != -1)
    {
    	// Found a @ character, only usually in email addresses.
    	// Get the start and end of the character.
    	int index = whoIs.Find("@",0);
    	TRACE("@ character at position %d\n",index);
    
    	// Find the trailing space.
    	int end = whoIs.Find(' ',index);
    	TRACE("end = %d\n",end);
    
    	int pos = index-1;
    	CString ch;
    	ch = whoIs.GetAt(pos);
    	while(ch.GetAt(pos) != ' ')
    	{
    		pos--;
    		TRACE("ch = %s\n",ch);
    	}
    
    	CString email = whoIs.Mid(pos,end);
    	TRACE("email address = %s\n",email);
    	whoIs.Delete(0,end);
    }
    

    The problem, after all this long-windedness, is that finding the space before and after the @ doesn't work as I expect. It usually ignores spaces and other characters and returns me 50-letter email addresses :) Does anyone have an idea of how I could improve this function so that it actually works? Cheers, and sorry for the 10,000 word essay :)
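
Editor's sketch: two things in the loop above look like likely culprits. `ch` is a one-character string, so `ch.GetAt(pos)` never looks at `whoIs` itself, and `CString::Mid` takes a start index and a character count, not a start and an end index. The version below is only a minimal illustration of the intended scan. It assumes `std::string` instead of `CString` so that it builds without MFC, and it relies on the guarantees stated in the post (addresses are delimited by spaces, and `@` appears nowhere else). The return type is an assumption, since the post does not say what `ParseEmail` should return.

```cpp
#include <string>
#include <vector>

// Collect every space-delimited token that contains '@'.
// Assumes, as in the post, that '@' only occurs inside e-mail addresses
// and that addresses are separated from other fields by spaces.
std::vector<std::string> ParseEmail(const std::string& wholeFile)
{
    std::vector<std::string> emails;
    std::string::size_type at = wholeFile.find('@');

    while (at != std::string::npos)
    {
        // Search backwards for the space before the address
        // (or fall back to the start of the text).
        std::string::size_type start = wholeFile.rfind(' ', at);
        start = (start == std::string::npos) ? 0 : start + 1;

        // Search forwards for the space after the address
        // (or fall back to the end of the text).
        std::string::size_type end = wholeFile.find(' ', at);
        if (end == std::string::npos)
            end = wholeFile.size();

        // substr takes a start and a LENGTH, not a start and an end index.
        emails.push_back(wholeFile.substr(start, end - start));

        // Carry on after this address instead of deleting text.
        at = wholeFile.find('@', end);
    }
    return emails;
}
```

Calling `ParseEmail` on the sample line from the post should yield a single entry, `abc-def@hotmail.com`. Tabs and line breaks are not treated as delimiters here, so real exports may need a wider notion of "space".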

    Anonymous wrote (#4, in reply to carrie):

    Why use C++ for this? :confused:

      Andreas Saurwein wrote (#5, in reply to carrie):

      If you want to get it right, you'd better start off without CString. I've done RFC-compliant mail parsing, and you can believe me when I tell you that a working parser takes more than 200 lines of C++ code. Just as an example: <"Duh:my=mail-home"@[100.99.98.1]> is a perfectly valid mail address, and yummy: "Jonny\"s Dumb" (who's he anyway) jd@dumb.com; too.


      Holy Sh*t! I'm speechless. (hey, that's a first) Marc Clifton, The Lounge

      Larry Antram wrote:

        Why not?

        bryce wrote (#6, in reply to Larry Antram):

        Looks like a spammer's tool to me ;) Feed it some files (webpages, Usenet dump) and extract away. bryce

          Larry Antram wrote (#7, in reply to bryce):

          ack! i hope not. :omg:

            carrie wrote (#8, in reply to Andreas Saurwein):

            Thanks for the answers, Larry, and the advice, Andreas. The reason for using C++ is that this is a small part of a much larger project that I've been working on. To be honest, I only need basic email parsing just now, because this is mainly for a demonstration next week for the company buying it. After 20 revisions of the interface and other features, they can get their hands on a demo so we can do a final usability test and start the intended users off with a good run-through of the program. Once this is done I can get down to the nitty-gritty of trying to get proper RFC compliance rather than the quick bodge I'm trying to do just now.

            As I said, the email parsing is a really small part of the project that just now is getting pushed aside in favour of getting other things up to scratch. I've got a full week pencilled in for working on this and another thing next month, but until then it's bodgey bodgey for me. Blehhh, too many big posts in one day make my head hurt ;P Thanks for the hints & tips, guys.

              Anonymous wrote (#9, in reply to Larry Antram):

              because ... print "$1\n" if /(\S+\@\S+)/;
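
Editor's sketch: for comparison, roughly the same pattern can be written in C++ with `std::regex`. That facility was standardised in C++11, so it was not an option when this thread was written; a third-party regex library would have been needed then. `FindEmails` is a hypothetical helper name, not something from the thread. Unlike the Perl one-liner, which prints only the first match per input line, this collects every match in the text.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return every run of non-whitespace characters that contains an '@'.
// FindEmails is a hypothetical name used for illustration only.
std::vector<std::string> FindEmails(const std::string& text)
{
    // Same idea as the Perl pattern: non-space run, '@', non-space run.
    static const std::regex pattern(R"(\S+@\S+)");

    std::vector<std::string> emails;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), last;
         it != last; ++it)
    {
        emails.push_back(it->str());
    }
    return emails;
}
```

Like the Perl version, this is nowhere near RFC-compliant; it simply grabs whatever non-space text surrounds an `@`.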

                Larry Antram wrote (#10, in reply to Anonymous):

                What is that? Perl? Perhaps you are correct. I'm not a big Perl fan. :~

                  carrie wrote (#11, in reply to Larry Antram):

                  Eh, don't mean to be rude, but f**k no, I wouldn't work on something like that. Anyone who works on them deserves to have all of the spammers in the world thrown into a room with them and nuked alongside them. It's all part of a contract I've been working on with an accountancy firm. It's a whole bigass project, but the thing with the email addresses relates to them needing a central database with all the email addresses of their clients/partners/co-workers in the Far East and USA. The email part of it is pretty insignificant in the scheme of the whole project, hence the bodge job before the demo next week :-D

                    Larry Antram wrote (#12, in reply to carrie):

                    Don't shoot me, shoot bryce. :-D

                      carrie wrote (#13, in reply to Larry Antram):

                      /me gets out the gun :D Nah, I just think that helping people spam more crap about getting a bigger ehhhhhh member :-O or sending your cash to a king fleeing his country so he can pay you back double is just asking to be shot. Spammers should have their testicle hairs pulled out by a lion; the people who write the mass-senders just need a good slap and to have their internet connections taken away.

                      Larry Antram wrote:

                        Strange, this had a 1.0 rating. Anyway... Perhaps try CTokenizer: http://www.codeproject.com/string/tokenizer.asp The code would be something like this:

                        CTokenizer tok( YOUR_INPUT_STRING, " " );
                        CString str;

                        while( tok.Next( str ) )
                        {
                            if( str.Find("@") != -1 )
                            {
                                // assuming fields are delimited by spaces, the e-mail
                                // address is simple, and no other fields contain @, then
                                // the e-mail address is now in str
                            }
                        }
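
Editor's sketch: the same tokenise-and-filter idea can be tried without the CTokenizer class by using `std::istringstream` from the standard library as a stand-in. This swaps in a different tokenizer from the one linked above and uses `std::string` rather than `CString`. `ExtractByTokens` is a hypothetical name chosen for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split the input on whitespace and keep the tokens that contain '@'.
// ExtractByTokens is a hypothetical name used for illustration only.
std::vector<std::string> ExtractByTokens(const std::string& input)
{
    std::vector<std::string> emails;
    std::istringstream stream(input);
    std::string token;

    // operator>> skips any whitespace (spaces, tabs, newlines) between tokens.
    while (stream >> token)
    {
        // Under the thread's assumptions, '@' only appears in e-mail addresses.
        if (token.find('@') != std::string::npos)
            emails.push_back(token);
    }
    return emails;
}
```

One difference from splitting on `" "` alone is that tabs and line breaks also count as separators here, which may or may not suit the exported files.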

                        Christian Graus wrote (#14, in reply to Larry Antram):

                        Larry Antram wrote: Strange, this had a 1.0 rating.

                        One of the great things about having a high post count is that my vote on a post counts for more than most people's (possibly everyone's except Nish). I gave it a 5 to bring it back up. I don't get why some people vote posts the way they do.

                        Christian

                        No offense, but I don't really want to encourage the creation of another VB developer. - Larry Antram 22 Oct 2002
                        C# will attract all comers, where VB is for IT Journalists and managers - Michael P Butler 05-12-2002
                        Again, you can screw up a C/C++ program just as easily as a VB program. OK, maybe not as easily, but it's certainly doable. - Jamie Nordmeyer - 15-Nov-2002


                          Christian Graus wrote (#15, in reply to Anonymous):

                          So what? C# would also do it quickly with a regular expression. That does not help much if you're doing work on data from a data source in C++, now does it?

                          Christian
