learning regex isn't easy :-)
-
hi all,

I'm new to regex, trying to understand it, and I admit I'm lost. Here's what I need right now: I have a list of strings from which I want to extract the users' email addresses. Each line looks like this:

DisplayName;Surname;Givenname;Mail;Company

which gives me something like:

$line = 'jsmith;john;smith;john.smith@someemail.com;acme'

Since I'm new and not sure how this works, I ran these tests to learn, with the results shown. Now I'm trying to understand why the last two are failing.

$line -match '\w+' = True
$line -match '\w+;' = True
$line -match '\w+;\w+;' = True
$line -match '\w+;\w+;\w+' = True
$line -match '\w+;\w+;\w+;' = False
$line -match '\w+;\w+;\w+;\.*' = False

At first I thought that this regex would give me the email, but it fails:

$regex = '\w+;\w+;\w+;(\w+@\w+);\w+'

Thanks for helping me.
-
Based on your description, you want to extract the fourth field from each line:
^([^;]*;){3}([^;]+);
The first group skips three semicolon-terminated fields; the second group captures the fourth.

However, depending on the source of the data, you may need to consider how it would "escape" a semicolon embedded in one of the field values. For example, given a display name of j;smith, would that end up as j\;smith? j;;smith? Something else? Or would it just corrupt the entire line? Once you start having to account for "escaped" separators, parsing the line becomes much harder.
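As a rough illustration, here is how that pattern could be applied in PowerShell, which is what the question appears to use. This is only a sketch against the sample line from the question; the variable names $pattern and $mail are made up for the example.

```powershell
# Sample line taken from the question.
$line = 'jsmith;john;smith;john.smith@someemail.com;acme'

# Skip three semicolon-terminated fields, then capture the fourth.
$pattern = '^([^;]*;){3}([^;]+);'

if ($line -match $pattern) {
    # -match fills the automatic $Matches table on success.
    # Group 1 is the repeated "skip" group; group 2 holds the fourth field.
    $mail = $Matches[2]
    $mail   # john.smith@someemail.com
}
```

Note that the email ends up in group 2, not group 1, because the repeated skip group also counts as a capture group.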
-
If you know that you do not have any embedded semicolons in your input text, then maybe a simple split would work for you instead of a regex, e.g.
fields[] = split(line, ';')
(or however your base language does that). This is far simpler, and should be much quicker than applying a regex and extracting a match. However, as Richard points out, if you do have embedded semicolons you'll need to know how they're escaped in the string. In that case it is probably still faster to write a parser that extracts the fields into an array, struct, or class of some sort.

On a related note, you might be tempted to apply a regex to validate the email address, but that is not as simple and straightforward as it might seem. See this discussion on Stack Overflow: https://stackoverflow.com/a/201378 The next response on that SO page, which refers to the MailAddress class, may also be useful if you're using C#.
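In PowerShell, the split approach might look something like the sketch below. It assumes no field contains an embedded semicolon; the variable names are illustrative, and the MailAddress step is just one optional way to sanity-check the value without a regex.

```powershell
# Sample line taken from the question.
$line = 'jsmith;john;smith;john.smith@someemail.com;acme'

# Split on the separator; assumes no field contains an embedded ';'.
$fields = $line -split ';'

# Arrays are zero-indexed, so Mail (the fourth column) is index 3.
$mail = $fields[3]

# Optional sanity check using .NET's MailAddress class instead of a regex.
try {
    $address = [System.Net.Mail.MailAddress]::new($mail)
    $address.Address
}
catch {
    Write-Warning "Not a parseable address: $mail"
}
```

MailAddress only checks that the text parses as an address; it says nothing about whether the mailbox exists.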
-
Kardock wrote:
each line looks like this:
Which suggests that it is CSV data. Although 'CSV' stands for 'comma-separated values', in general usage the separator can be another character, including a semicolon. So the best solution is to find a CSV library and use that rather than attempting to roll your own. You should also check how the library handles bad data (ill-formed CSV).
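If the data is being handled in PowerShell, the built-in CSV cmdlets may be enough. The sketch below assumes the lines have no header row and uses the column names given in the question; $lines, $header and $users are names invented for the example.

```powershell
# Sample data in the format described in the question (no header row).
$lines = @(
    'jsmith;john;smith;john.smith@someemail.com;acme'
)

# Column names as listed in the question.
$header = 'DisplayName', 'Surname', 'Givenname', 'Mail', 'Company'

# Parse each line as semicolon-delimited CSV into objects with named properties.
$users = $lines | ConvertFrom-Csv -Delimiter ';' -Header $header

# Read the email column by name.
$users.Mail   # john.smith@someemail.com
```

When reading straight from a file, Import-Csv accepts the same -Delimiter and -Header parameters.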
-
Make sure to learn one lesson about regex: don't overuse it. I've seen numerous regex problems where solving the task in an algorithmic language (such as C#) would be straightforward and simple, and flexible enough to handle with ease all the exceptions and special cases that can really give you a headache when you try to handle them in a regex. And there is Geek&Poke: Yesterdays regex.

Disclaimer: The only pattern matching language I liked was SNOBOL, but I haven't seen it in use for a few decades now. SNOBOL is (or was) a sort of crossover between predicates and algorithmic programming: you could see it as a different kind of bool expression evaluation in an otherwise algorithmic programming language. In particular, the predicates were written in a far more readable format than traditional regex. (I am not holding my breath waiting for SNOBOL to rise to new stardom, though!)
-
So given that you just want to mess around with regex.
Kardock wrote:
but it fails.
Presumably you mean it runs but does not successfully match. The problem is that '\w+' can never match an email address like this one: the address contains a '.', and '\w' does not match that character. So you need to look up what '\w' does match.

The other problem you will find is that actually matching a valid email is very difficult. The regex to do it is very long, on the order of a thousand characters or more. You can google that, both to see what a long regex looks like and to educate yourself about what a 'valid' email actually is. (I do it every couple of years to remind myself, especially when someone says they want to 'validate' an email.)

However, you don't need to match an email. What you need to match is the fourth value in the list. The way to match that is the following:
[^;]+
In fact, you should probably match all of the columns that way. So study that expression to figure out what it does, and then answer for yourself why the other posters' comments about embedded semicolons point to a real problem.
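A small PowerShell sketch of that idea, using the sample line from the question. The anchors (^ and $) and the five-group layout are assumptions added for the example, as are the variable names.

```powershell
# Sample line taken from the question.
$line = 'jsmith;john;smith;john.smith@someemail.com;acme'

# The pattern from the question: \w matches letters, digits and underscore,
# so it cannot get past the '.' in the address.
$line -match '\w+;\w+;\w+;(\w+@\w+);\w+'   # False

# One [^;]+ ("one or more characters that are not a semicolon") per column.
$pattern = '^([^;]+);([^;]+);([^;]+);([^;]+);([^;]+)$'

if ($line -match $pattern) {
    $mail = $Matches[4]   # fourth column
    $mail                 # john.smith@someemail.com
}

# An unescaped semicolon inside a field adds a sixth "column",
# so the same pattern no longer matches.
$bad = 'j;smith;john;smith;john.smith@someemail.com;acme'
$bad -match $pattern   # False
```

The last test hints at why embedded semicolons are trouble: the pattern has no way to tell a separator from a semicolon that belongs to the data.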