Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • World
  • Users
  • Groups
Skins
  • Light
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (No Skin)
  • No Skin
Collapse
Code Project
  1. Home
  2. General Programming
  3. Regular Expressions
  4. I need help with a complex regex

I need help with a complex regex

Scheduled Pinned Locked Moved Regular Expressions
regexhelptutorialquestion
8 Posts 4 Posters 33 Views 1 Watching
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • U Offline
    U Offline
    User 12712953
    wrote on last edited by
    #1

    Challenge: I have a file with genealogy information which I would like to extract (in Google Sheets) using regex. Data: One cell contains text information. Basically it is four main parts, two of which are optional and can have slightly different formats and contents First comes always a number followed by a period. (This is the generation number.) Second comes the name. It consists of one or more first and last names These two are always there They can be followed by birth and/or death information If there is birth information, it always comes directly after the name and starts with "b. ". It can have a date, and or a location The date can be preceded by "circa", "before" and "before circa". It is then followed by either a 4 digit year, or more commonly by the month name, date, and year. Example: "March 4, 1888" After the year might follow a location (free text) If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location. My best shot is close, but not handling the special cases of "before" etc too well:

    =ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))

    So the regex part of it is:

    ^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$

    It works well for entries like this one:

    2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik

    But not for entries like:

    7. Kierstin Danielsdotter b. before circa 1706
    9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
    7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
    9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729

    T L 2 Replies Last reply
    0
    • U User 12712953

      Challenge: I have a file with genealogy information which I would like to extract (in Google Sheets) using regex. Data: One cell contains text information. Basically it is four main parts, two of which are optional and can have slightly different formats and contents First comes always a number followed by a period. (This is the generation number.) Second comes the name. It consists of one or more first and last names These two are always there They can be followed by birth and/or death information If there is birth information, it always comes directly after the name and starts with "b. ". It can have a date, and or a location The date can be preceded by "circa", "before" and "before circa". It is then followed by either a 4 digit year, or more commonly by the month name, date, and year. Example: "March 4, 1888" After the year might follow a location (free text) If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location. My best shot is close, but not handling the special cases of "before" etc too well:

      =ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))

      So the regex part of it is:

      ^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$

      It works well for entries like this one:

      2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik

      But not for entries like:

      7. Kierstin Danielsdotter b. before circa 1706
      9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
      7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
      9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729

      T Offline
      T Offline
      Terry R 2023
      wrote on last edited by
      #2

      I have tried to decipher what your intent is. I can see you hope to get 9 fields by dividing the original information, but I fail to see where the different parts of the "born" and "death" fields occur. What I have done thus far is to create a regex which gets the "record" number, the "name", the "birth" info if it exists and the "death" info if it exists. These last 2 fields can be further defined (and divided) if only I knew what your intent was. Perhaps you can explain what should be in each of the 9 fields (if they exist). Perhaps show a "fully filled" out record as an example, then show what the result should look like. But here is what I have thus far (this has been formulated on Notepad++): ^(\d+\.\s*)(.+?)(?=(?:b|d)\.)(b\.\s*.+?(?=(?:d\.|$)))?(d\.\s*.+?(?=$))? To explain it we have: ^(\d+\.\s*) - start of line followed by number(s), a period and possible spaces (.+?) - gather characters (as few as possible) until... (?=(?:b|d)\.) - next character should be either a "b" or a "d" followed by a period. The (?: refers to a non-capturing group. (b\.\s*.+?(?=(?:d\.|$)))? - gather characters until either a "d." follows or end of line. (d\.\s*.+?(?=$))? - similar to previous line but for the "d." field. This assumes the "d." field will always be last. Maybe it can give you some more inspiration. At the very least you can see how splitting the problem into smaller chunks may be beneficial. Even if you then have to further divide the "b." and "d." fields in a later step it may still be easier to define them. Terry

      1 Reply Last reply
      0
      • U User 12712953

        Challenge: I have a file with genealogy information which I would like to extract (in Google Sheets) using regex. Data: One cell contains text information. Basically it is four main parts, two of which are optional and can have slightly different formats and contents First comes always a number followed by a period. (This is the generation number.) Second comes the name. It consists of one or more first and last names These two are always there They can be followed by birth and/or death information If there is birth information, it always comes directly after the name and starts with "b. ". It can have a date, and or a location The date can be preceded by "circa", "before" and "before circa". It is then followed by either a 4 digit year, or more commonly by the month name, date, and year. Example: "March 4, 1888" After the year might follow a location (free text) If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location. My best shot is close, but not handling the special cases of "before" etc too well:

        =ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))

        So the regex part of it is:

        ^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$

        It works well for entries like this one:

        2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik

        But not for entries like:

        7. Kierstin Danielsdotter b. before circa 1706
        9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
        7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
        9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729

        L Offline
        L Offline
        Lost User
        wrote on last edited by
        #3

        It would be easier to write a string parsing routine of your own.

        J U 2 Replies Last reply
        0
        • L Lost User

          It would be easier to write a string parsing routine of your own.

          J Offline
          J Offline
          jschell
          wrote on last edited by
          #4

          Richard MacCutchan wrote:

          It would be easier to write a string parsing routine of your own.

          I strongly agree with this. It is going to be easier to understand, easier to debug and quite possibly faster to run. And just is case you think I have a bias I have been using regexes for 40 years extensively (via perl). Which is why I understand both their advantages and disadvantages.

          L 1 Reply Last reply
          0
          • J jschell

            Richard MacCutchan wrote:

            It would be easier to write a string parsing routine of your own.

            I strongly agree with this. It is going to be easier to understand, easier to debug and quite possibly faster to run. And just is case you think I have a bias I have been using regexes for 40 years extensively (via perl). Which is why I understand both their advantages and disadvantages.

            L Offline
            L Offline
            Lost User
            wrote on last edited by
            #5

            jschell wrote:

            is case you think I have a bias

            Nothing would be further from my mind, even if you advocated a Regex. I respect everyone's opinions here; after all most people know lots of things that I do not.

            J 1 Reply Last reply
            0
            • L Lost User

              jschell wrote:

              is case you think I have a bias

              Nothing would be further from my mind, even if you advocated a Regex. I respect everyone's opinions here; after all most people know lots of things that I do not.

              J Offline
              J Offline
              jschell
              wrote on last edited by
              #6

              Richard MacCutchan wrote:

              Nothing would be further from my mind

              My post was phrased poorly since that part was not actually intended for you. It was directed at the OP and/or other readers who might come across my comment.

              1 Reply Last reply
              0
              • L Lost User

                It would be easier to write a string parsing routine of your own.

                U Offline
                U Offline
                User 12712953
                wrote on last edited by
                #7

                Richard MacCutchan wrote:

                It would be easier to write a string parsing routine of your own.

                Hi Richard, Thanks for the tip. Although I don't fully understand what you mean with "string parsing routine", I solved the issue by writing a regex for each column needed instead of a "catch-all" regex. Perhaps that is what you meant.

                L 1 Reply Last reply
                0
                • U User 12712953

                  Richard MacCutchan wrote:

                  It would be easier to write a string parsing routine of your own.

                  Hi Richard, Thanks for the tip. Although I don't fully understand what you mean with "string parsing routine", I solved the issue by writing a regex for each column needed instead of a "catch-all" regex. Perhaps that is what you meant.

                  L Offline
                  L Offline
                  Lost User
                  wrote on last edited by
                  #8

                  No, my suggestion was to abandon the use of Regex patterns. You can easily split the string into an array of strings separated by spaces. All words before an entry of "b." are parts of the name. All words after the "b." and before "d." or the end of the text, relate to the birth date. All items after "d." relate to the date of death. And apart from anything else it makes your code much clearer.

                  1 Reply Last reply
                  0
                  Reply
                  • Reply as topic
                  Log in to reply
                  • Oldest to Newest
                  • Newest to Oldest
                  • Most Votes


                  • Login

                  • Don't have an account? Register

                  • Login or register to search.
                  • First post
                    Last post
                  0
                  • Categories
                  • Recent
                  • Tags
                  • Popular
                  • World
                  • Users
                  • Groups