I need help with a complex regex
-
Challenge: I have a file with genealogy information which I would like to extract (in Google Sheets) using regex. Data: One cell contains text information. Basically it is four main parts, two of which are optional and can have slightly different formats and contents First comes always a number followed by a period. (This is the generation number.) Second comes the name. It consists of one or more first and last names These two are always there They can be followed by birth and/or death information If there is birth information, it always comes directly after the name and starts with "b. ". It can have a date, and or a location The date can be preceded by "circa", "before" and "before circa". It is then followed by either a 4 digit year, or more commonly by the month name, date, and year. Example: "March 4, 1888" After the year might follow a location (free text) If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location. My best shot is close, but not handling the special cases of "before" etc too well:
=ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))
So the regex part of it is:
^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$
It works well for entries like this one:
2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik
But not for entries like:
7. Kierstin Danielsdotter b. before circa 1706
9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729 -
Challenge: I have a file with genealogy information which I would like to extract (in Google Sheets) using regex. Data: One cell contains text information. Basically it is four main parts, two of which are optional and can have slightly different formats and contents First comes always a number followed by a period. (This is the generation number.) Second comes the name. It consists of one or more first and last names These two are always there They can be followed by birth and/or death information If there is birth information, it always comes directly after the name and starts with "b. ". It can have a date, and or a location The date can be preceded by "circa", "before" and "before circa". It is then followed by either a 4 digit year, or more commonly by the month name, date, and year. Example: "March 4, 1888" After the year might follow a location (free text) If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location. My best shot is close, but not handling the special cases of "before" etc too well:
=ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))
So the regex part of it is:
^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$
It works well for entries like this one:
2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik
But not for entries like:
7. Kierstin Danielsdotter b. before circa 1706
9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729I have tried to decipher what your intent is. I can see you hope to get 9 fields by dividing the original information, but I fail to see where the different parts of the "born" and "death" fields occur. What I have done thus far is to create a regex which gets the "record" number, the "name", the "birth" info if it exists and the "death" info if it exists. These last 2 fields can be further defined (and divided) if only I knew what your intent was. Perhaps you can explain what should be in each of the 9 fields (if they exist). Perhaps show a "fully filled" out record as an example, then show what the result should look like. But here is what I have thus far (this has been formulated on Notepad++):
^(\d+\.\s*)(.+?)(?=(?:b|d)\.)(b\.\s*.+?(?=(?:d\.|$)))?(d\.\s*.+?(?=$))?
To explain it we have:^(\d+\.\s*)
- start of line followed by number(s), a period and possible spaces(.+?)
- gather characters (as few as possible) until...(?=(?:b|d)\.)
- next character should be either a "b" or a "d" followed by a period. The(?:
refers to a non-capturing group.(b\.\s*.+?(?=(?:d\.|$)))?
- gather characters until either a "d." follows or end of line.(d\.\s*.+?(?=$))?
- similar to previous line but for the "d." field. This assumes the "d." field will always be last. Maybe it can give you some more inspiration. At the very least you can see how splitting the problem into smaller chunks may be beneficial. Even if you then have to further divide the "b." and "d." fields in a later step it may still be easier to define them. Terry -
Challenge: I have a file with genealogy information which I would like to extract (in Google Sheets) using regex. Data: One cell contains text information. Basically it is four main parts, two of which are optional and can have slightly different formats and contents First comes always a number followed by a period. (This is the generation number.) Second comes the name. It consists of one or more first and last names These two are always there They can be followed by birth and/or death information If there is birth information, it always comes directly after the name and starts with "b. ". It can have a date, and or a location The date can be preceded by "circa", "before" and "before circa". It is then followed by either a 4 digit year, or more commonly by the month name, date, and year. Example: "March 4, 1888" After the year might follow a location (free text) If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location. My best shot is close, but not handling the special cases of "before" etc too well:
=ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))
So the regex part of it is:
^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$
It works well for entries like this one:
2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik
But not for entries like:
7. Kierstin Danielsdotter b. before circa 1706
9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729 -
Richard MacCutchan wrote:
It would be easier to write a string parsing routine of your own.
I strongly agree with this. It is going to be easier to understand, easier to debug and quite possibly faster to run. And just is case you think I have a bias I have been using regexes for 40 years extensively (via perl). Which is why I understand both their advantages and disadvantages.
-
Richard MacCutchan wrote:
It would be easier to write a string parsing routine of your own.
I strongly agree with this. It is going to be easier to understand, easier to debug and quite possibly faster to run. And just is case you think I have a bias I have been using regexes for 40 years extensively (via perl). Which is why I understand both their advantages and disadvantages.
-
jschell wrote:
is case you think I have a bias
Nothing would be further from my mind, even if you advocated a Regex. I respect everyone's opinions here; after all most people know lots of things that I do not.
-
Richard MacCutchan wrote:
It would be easier to write a string parsing routine of your own.
Hi Richard, Thanks for the tip. Although I don't fully understand what you mean with "string parsing routine", I solved the issue by writing a regex for each column needed instead of a "catch-all" regex. Perhaps that is what you meant.
-
Richard MacCutchan wrote:
It would be easier to write a string parsing routine of your own.
Hi Richard, Thanks for the tip. Although I don't fully understand what you mean with "string parsing routine", I solved the issue by writing a regex for each column needed instead of a "catch-all" regex. Perhaps that is what you meant.
No, my suggestion was to abandon the use of Regex patterns. You can easily split the string into an array of strings separated by spaces. All words before an entry of
"b."
are parts of the name. All words after the"b."
and before"d."
or the end of the text, relate to the birth date. All items after"d."
relate to the date of death. And apart from anything else it makes your code much clearer.