MySQL Select query that chooses something dot something between certain underscores

Lost User

I have a MySQL column with underscores and between those underscores a domain name (which is anything with a dot, except number.number), and I like to use REGEXP_SUBSTR to extract that domain, but only if there are at least 2 preceding underscores before it. Examples: 1. Something_Else_Type_XY_Z_26.04.23_website.com => website.com 2. Something_Else_Type_XY_Z_website.com_26.04.23 => website.com 3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => website.com 4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first) 5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having something dot something that isn't number.number) 6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU => website.com Which regex could do so instead of ... in SELECT REGEXP_SUBSTR(full_name, ...) as domain_name?

RedDk

One of these then: \b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b ^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$ \b((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b \b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b \b((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b Footnote: these make no sense of leading BLANK as your requirement states but I throw them out there anyway for your testing pleasure.

jschell

You should keep in mind that this is going to be really slow when compared to other types of searches. So if you are using large volumes of data and/or many searches you should keep that in mind. You should also keep in mind that there are valid domain names that will have more than one dot (period).

Lost User

Unfortunately none of them worked at all. REGEXP_SUBSTR(field_name, '^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$') immediately timed out (#3699 - Timeout exceeded in regular expression match), and the rest just returned NULL for everything. Any idea why?

RedDk

Well, first of all those regex were std::regx not MySQL. What exactly have timing out and error #3699 to do with this "SELECT" query? I'm not sure why none of them work because I'm not sure how you're leveraging this database. REGEX_SUBSTR? Is that a function? Probably Oracle, he said. I know nothing about Oracle, sorry.

Lost User

There's nothing about Oracle here other than the company name. It's just a standard MySQL command, see https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-substr[^] and MySQL regexp_substr() function - javatpoint[^]. All I asked was which input to type there to get the domain name from a column that puts the domain between underscores (or just end of line) and after at least 2 underscores (examples in the OP).

Lost User

I just tried a quick test at RegExr: Learn, Build, & Test RegEx[^] and the following regex finds all the web addresses in the set below (shown in bold). The only thing I cannot correct is that it includes the leading underscore. However it may be a starting point to help you (I have not actually tried it with MySQL).

_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z].

Something_Else_Type_XY_Z_26.04.23**_website.com**
Something_Else_Type_XY_Z**_website.co.uk**_26.04.23
Something_Else_Type_XY_Z_26.04.23**_website.com**_Comment
Something**_website.com**_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_dfd6869_3_21.12.22**_website.com**_ZU

Lost User

Unfortunately REGEXP_SUBSTR(field_name, '_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]') just gives me the first 2 letters of the field_name ("So" in this case).

Lost User

Yes, I got similar results, which does not make a lot of sense. However, on reading the documentation (https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-substr[^]) closely I discovered why. Hidden down near the bottom of the page is the following:

Quote:

To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:

So I tried with that, and a slight modification to my original pattern, and successfully extracted the fields as shown in the following results:

The NULL entries are records that do not contain a web address. [EDIT] I just tried without the leading underscore in the regular expression and added some numbers into the website names and got the following:

mysql> SELECT REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') from Books;
+-------------------------------------------------------------+
| REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') |
+-----------------------------------------------

RedDk

Ok, (didn't read the doc link but thanks for that). Using your new regex:

_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]

on the sample of your OP:

1. Something_Else_Type_XY_Z_26.04.23_website.com => _website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => _website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => _website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU = _website.com

Which isn't quite what you wanted (because of that leading "_"). And as for RM's take on it, as-written (without any _[a-zA-Z]*\.) THAT string returns no matches here either (I use a tool called RegexBuddy and after yesterday's experience, with MySql selected as the input language not std::regex). So although I can confirm today's discovery is close it'll take me some more sleuthing to run down that leading character ascii 95 ... {EDIT] I'm seeing a warning when I try to tackle that underscore using "shorthand character classes" \w to the tune of "MySql doesn't support blah blah blah" so ... this could be a while. [END EDIT]

Lost User

See my update to yesterdays answer above.

Lost User

Did you test according to my examples? What about:

Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)

Your regex returns website.com there instead of NULL.

Lost User

No, I am afraid I did not work that one out; I leave that as an exercise for you.

RedDk

Concur here, using RegexBuddy and this string: [a-zA-Z][[:alnum:]]*\.[a-zA-Z]*|\.[a-zA-Z] Nothing but net (colloquial expression eliding formal disdain without turning away in disgust from self-realized talentlessness). Thanks for this.

Lost User

MySQL's RE engine has one or two quirks of its own(after all, why make it easy?). The only part I could not figure out is how to ensure at least two underscores in the earlier part of the string. I bet @OriginalGriff or @RichardDeeming could do it with their eyes closed.

RedDk

Using RegexBuddy (yeah, I guess I am falling into a hole here) the POSIX notation is perfectly acceptable, hence your [:alnum:]. I can substitute [:alpha:] there as well and I suspect others in the list are good actors. My brain infarct occurs though when I use Herr Gevertz' table of translations and try to replace the POSIX with his ASCII ([a-zA-Z0-9]) or for that matter UNICODE ([\p{L}\p{Nl}\p{Nd}]), neither of which work in his interface. Grand tool, RegexBuddy ... it's one of those feature-rich interfaces which swim with example/sample in the help volume. My only complaint is that searching through documentation for words has no highlight of the found set so I've got to print out the page and use my pdf search to locate all instances!