learning regex isn't easy :-)

    Kardock, post #1

    Hi all,

    I'm new to regex, trying to understand it, and I admit I'm lost. Here's what I need right now: I have a list of strings from which I want to extract the users' email addresses. Each line looks like this:

        DisplayName;Surname;Givenname;Mail;Company

    which gives me something like:

        $line = 'jsmith;john;smith;john.smith@someemail.com;acme'

    Since I'm new and not sure how this works, I ran these tests to learn, with the results shown. Now I'm trying to understand why the last two are failing.

        $line -match '\w+'               = True
        $line -match '\w+;'              = True
        $line -match '\w+;\w+;'          = True
        $line -match '\w+;\w+;\w+'       = True
        $line -match '\w+;\w+;\w+;'      = False
        $line -match '\w+;\w+;\w+;\.*'   = False

    At first I thought that this regex would give me the email, but it fails:

        $regex = '\w+;\w+;\w+;(\w+@\w+);\w+'

    Thanks for helping me.

      Richard Deeming, post #2

      Based on your description, you want to extract the fourth field from each line:

      ^([^;]*;){3}([^;]+);

      Demo[^]

      However, depending on the source of the data, you may need to consider how it would "escape" a semicolon embedded in one of the field values. For example, given a display name of j;smith, would that end up as j\;smith? j;;smith? Something else? Or would it just corrupt the entire line?

      Once you start having to account for "escaped" separators, parsing the line becomes much harder.
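A minimal sketch of the pattern above, written in Python rather than the PowerShell used in the question. The helper name extract_fourth_field is made up for illustration, and the sketch assumes no field contains an escaped semicolon.

```python
import re

# The pattern from this post: skip three ';'-terminated fields,
# then capture the fourth one in group 2.
FOURTH_FIELD = re.compile(r'^([^;]*;){3}([^;]+);')

def extract_fourth_field(line):
    """Return the fourth ';'-separated field, or None if the line does not fit."""
    m = FOURTH_FIELD.match(line)
    return m.group(2) if m else None

line = 'jsmith;john;smith;john.smith@someemail.com;acme'
print(extract_fourth_field(line))  # john.smith@someemail.com
```

With a display name of j;smith and no escaping, the same call on 'j;smith;john;smith;john.smith@someemail.com;acme' returns 'smith', because every field after the embedded semicolon shifts one position to the right.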


      "These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

        k5054, post #3

        If you know that you do not have any embedded semicolons in your input text, then maybe a simple split would work for you instead of a regex, e.g. fields[] = split(line, ';') (or however your base language does that). This is far simpler, and should be much quicker than applying a regex and extracting a match.

        However, as Richard points out, if you do have embedded semicolons you'll need to know how they're escaped in the string. In that case it is probably still faster to write a parser that extracts the fields to an array, or to a struct or class of some sort.

        On a related note, you might be tempted to apply a regex to validate the email address, but that is not as simple and straightforward as it might seem. See this discussion from Stack Overflow: https://stackoverflow.com/a/201378

        The next response on that SO page, which refers to the MailAddress class, may also be useful if you're using C#.
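The split approach could look roughly like this in Python. The function name mail_field and the five-field check are assumptions taken from the DisplayName;Surname;Givenname;Mail;Company layout in the question, and the sketch only holds when no field contains a semicolon.

```python
def mail_field(line, sep=';'):
    """Split the line on the separator and return the fourth field (Mail)."""
    fields = line.split(sep)
    # Assumed layout: exactly five fields per line, as in the question.
    if len(fields) != 5:
        raise ValueError(f'expected 5 fields, got {len(fields)}: {line!r}')
    return fields[3]

line = 'jsmith;john;smith;john.smith@someemail.com;acme'
print(mail_field(line))  # john.smith@someemail.com
```

The field-count check is there so that a line with an embedded semicolon fails loudly instead of silently returning the wrong column.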

        Keep Calm and Carry On

          jschell, post #4

          Kardock wrote:

          each line looks like this:

          Which suggests that it is CSV data. Although 'CSV' stands for 'comma-separated values', in general usage the separator can be another character, including a semicolon. So the best solution is to find a CSV library and use that rather than attempting to roll your own. You should also check how the library handles bad data (ill-formed CSV).
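As one example of the library route, here is a sketch using Python's standard csv module with a semicolon delimiter. The column names come from the question; the function name read_mails is made up, and the second sample row assumes the data quotes fields with double quotes, which the thread does not confirm.

```python
import csv
import io

# Column names taken from the line layout given in the question.
COLUMNS = ['DisplayName', 'Surname', 'Givenname', 'Mail', 'Company']

def read_mails(text):
    """Parse ';'-separated text and return the Mail column of every row."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS, delimiter=';')
    return [row['Mail'] for row in reader]

sample = (
    'jsmith;john;smith;john.smith@someemail.com;acme\n'
    # Assumption: an embedded semicolon is protected by double quotes.
    '"j;smith";john;smith;j.smith@someemail.com;acme\n'
)
print(read_mails(sample))  # ['john.smith@someemail.com', 'j.smith@someemail.com']
```

On the bad-data point: with these settings a row that has fewer than five fields yields None for the missing columns rather than raising an error, so short rows need an explicit check.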

            trønderen, post #5

            Make sure to learn one lesson about regex: don't overuse it. I've seen numerous regex problems where solving the task in an algorithmic language (such as C#) would be straightforward and simple, and flexible enough to handle with ease all the exceptions and special cases that can really give you a headache when you try to handle them in a regex.

            And there is Geek&Poke: Yesterdays regex[^]

            Disclaimer: The only pattern matching language I liked was SNOBOL, but I haven't seen it in use for a few decades now. SNOBOL is (or was) a sort of crossover between predicates and algorithmic programming; you could see it as a different kind of bool expression evaluation in an otherwise algorithmic programming language. The predicates especially were written in a far more readable format than in traditional regex. (I am not holding my breath waiting for SNOBOL to rise to new stardom, though!)

              Kardock, post #6 (in reply to jschell, post #4)

              You're right, but that gave me a chance to try to understand regex.

                jschell, post #7

                So given that you just want to mess around with regex.

                Kardock wrote:

                but it fails.

                Presumably you mean it runs but does not successfully match. The problem is that '\w' is not an expression that could ever match an email, so you need to look up what it does match.

                The other problem you will find is that actually matching a valid email is very difficult. The regex to do it is about 1000 characters long. You can google that, both to see what a long regex looks like and to educate yourself about what a 'valid' email actually is. (I do it every couple of years to remind myself, especially when someone says they want to 'validate' an email.)

                However, you don't need to match an email. What you need to match is the fourth value in the list. The way to match that is the following:

                [^;]+

                You should probably match all of the columns that way, in fact. So study that expression to figure out what it does, and then answer for yourself why the other posters commented on embedded semicolons being a problem.
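The two points above can be checked with a short sketch, written here in Python rather than PowerShell. It uses the sample line from the question; the variable names are only for illustration.

```python
import re

line = 'jsmith;john;smith;john.smith@someemail.com;acme'

# \w covers letters, digits and the underscore, so a match stops at the dot.
print(re.match(r'\w+', 'john.smith@someemail.com').group())  # john

# The original attempt therefore finds nothing in the sample line.
print(re.search(r'\w+;\w+;\w+;(\w+@\w+);\w+', line))  # None

# [^;]+ means "one or more characters that are not a semicolon".
fields = re.findall(r'[^;]+', line)
print(fields[3])  # john.smith@someemail.com
```

Because [^;]+ needs at least one character, an empty field is skipped by findall and every later index shifts by one. The same shift happens with an embedded semicolon, which splits one value into two.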
