Code Project

regular expression to find tax file numbers

8 Posts 4 Posters 11 Views 1 Watching
    Copalol wrote (#1):

    Hi all, I'm new to regular expressions, and what I want to do seems a bit advanced for me. I'd like to create a regular expression to locate valid Australian tax file numbers (TFNs). Here's what I've come up with so far:

    (\d{8,9})|(\d\d\d[ ]\d\d\d[ ]\d\d\d)|(\d\d\d[-]\d\d\d[-]\d\d\d)

    Tax file numbers can be either 8 or 9 digits long, and this expression finds them successfully. However, it also picks up other numbers, such as mobile phone numbers. I've also tried to cover a few of the ways people commonly type out tax file numbers, which is why I've allowed for a hyphen or a space between the groups.

    There is a formula for checking whether a tax file number is valid, and this is what I'd like to add to the expression to remove the false positives. From Wikipedia (Tax file number):

    As is the case with many identification numbers, the TFN includes a check digit for detecting erroneous numbers. The algorithm is based on simple modulo 11 arithmetic per many other digit checksum schemes.

    Example: The validity of the example TFN '123456782' can be checked by the following process. Each digit is multiplied by its positional weight (1, 4, 3, 7, 5, 8, 6, 9, 10) and the products are summed: 1 + 8 + 9 + 28 + 25 + 48 + 42 + 72 + 20 = 253. 253 is a multiple of 11 (11 × 23 = 253). Therefore, the number is valid.

    Can it be done? Can someone assist?
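
For reference, the check quoted from Wikipedia can be sketched in a few lines of Python. This is only an illustration, not something the gateway can run. The weights 1, 4, 3, 7, 5, 8, 6, 9, 10 are those given in the Wikipedia article and can be read off the example's terms (8 = 2 × 4, 9 = 3 × 3, and so on). The function names are made up for this sketch, and it assumes a 9-digit TFN only.

```python
# Positional weights for a 9-digit TFN, as given in the Wikipedia article.
WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def weighted_terms(tfn: str) -> list:
    """Multiply each digit by its positional weight (non-digits are ignored)."""
    digits = [int(ch) for ch in tfn if ch.isdigit()]
    if len(digits) != len(WEIGHTS):
        raise ValueError("this sketch only handles 9-digit TFNs")
    return [d * w for d, w in zip(digits, WEIGHTS)]


def is_valid_tfn(tfn: str) -> bool:
    """True when the weighted sum is a multiple of 11."""
    return sum(weighted_terms(tfn)) % 11 == 0


terms = weighted_terms("123456782")
print(terms, sum(terms), is_valid_tfn("123456782"))
# [1, 8, 9, 28, 25, 48, 42, 72, 20] 253 True
```

Because the result depends on arithmetic over all nine digits, it is not something an ordinary regular expression can express compactly.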

      Peter_in_2780 wrote (#2):

      You'd be pushing it uphill with a sharp stick to write a regex to validate the check digit. Best to use a regex to get the basic format right, then feed it into a bit of code to do the checksum.
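
A minimal sketch of that two-step approach in Python, assuming the text can be processed by code at all: a regex picks out candidates in the three formats from the original post, then a small function applies the modulo 11 check. The digit lookarounds, the function names, and the sample text are assumptions added for illustration. Only 9-digit numbers are handled, since the weights used here are for the 9-digit form.

```python
import re

# Candidate formats from the original post: 123456782, 123 456 782, 123-456-782.
# The lookarounds (not in the original regex) reject runs inside longer numbers.
CANDIDATE = re.compile(
    r"(?<!\d)(?:\d{9}|\d{3} \d{3} \d{3}|\d{3}-\d{3}-\d{3})(?!\d)"
)
WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def checksum_ok(digits: str) -> bool:
    """Weighted sum of the nine digits must be a multiple of 11."""
    return sum(int(d) * w for d, w in zip(digits, WEIGHTS)) % 11 == 0


def find_tfns(text: str) -> list:
    """Step 1: regex finds candidates. Step 2: code validates the check digit."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())  # drop spaces and hyphens
        if checksum_ok(digits):
            found.append(digits)
    return found


sample = "TFN 123 456 782, mobile 0412 345 678, ref 987-654-321"
print(find_tfns(sample))  # ['123456782']
```

In the sample, the mobile number never matches the format, and 987-654-321 matches the format but fails the checksum (its weighted sum is 207, which is not a multiple of 11).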

      Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012

        George Jonsson wrote (#3):

        I agree with Peter_in_2780 that you should separate matching the number from the validation calculation. Just create two methods: one that checks the format and another that calculates and validates the check digit. You didn't specify any variants of the text you want to match, so I just guessed what it could look like. For the actual regular expression, you could do something like this:

        Input: TFN '123456782'
        Regex: ^TFN\s*(')?(?\d{3}\s*\d{3}\s*\d{3})(')?\s*$

        It will get these variants:

        TFN '123456782'
        TFN'123 456 782'
        TFN123456782
        TFN 123 456 782

        Explanation:

        ^ Start of the string
        $ End of the string
        \s* Consumes 0 or more white space characters. It will make sure you match TFN123 and TFN 123
        (')? Optional quotation mark
        (? ...) Named group, makes it easier to extract the actual number

        If necessary, you will have to remove the spaces in a second step. Hope it helps.
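
The pattern above can be tried out with a short Python sketch. The group name did not survive in the post as shown, so `number` below is a hypothetical stand-in. Python also spells named groups `(?P<name>...)`, whereas Java and Perl use `(?<name>...)`. The helper name `extract_number` is made up for this example.

```python
import re

# "number" is a hypothetical group name; the original name is not recoverable.
TFN_LINE = re.compile(r"^TFN\s*(')?(?P<number>\d{3}\s*\d{3}\s*\d{3})(')?\s*$")


def extract_number(line: str):
    """Return the nine digits without whitespace, or None if the line does not match."""
    match = TFN_LINE.match(line)
    if match is None:
        return None
    # Second step mentioned above: remove the spaces from the captured group.
    return re.sub(r"\s+", "", match.group("number"))


variants = ["TFN '123456782'", "TFN'123 456 782'", "TFN123456782", "TFN 123 456 782"]
for line in variants:
    print(repr(line), "->", extract_number(line))
```

All four variants yield `123456782`. A hyphenated form such as `TFN 123-456-782` is not matched, because the pattern only allows whitespace between the groups.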

          jschell wrote (#4):

          Peter_in_2780 wrote:

          You'd be pushing it uphill with a sharp stick to write a regex to validate the check digit

          Very true

            jschell wrote (#5):

            Member 13555386 wrote:

            TFN includes a check digit for detecting erroneous numbers. ... Can it be done?

            No. If I were using Perl, I could create a "regex" that calls a method as part of the regular expression check itself, but I still wouldn't suggest doing that. There is no real standard for "regular expressions", so first you would need to define exactly which regular expression engine you are using. If you are using Perl or Java, then there is a boundary match, which might or might not be appropriate for your actual content.

            Member 13555386 wrote:

            (\d{8,9})|(\d\d\d[ ]\d\d\d[ ]\d\d\d)|(\d\d\d[-]\d\d\d[-]\d\d\d)

            The following matches either a single run of 8-9 digits or nine digits in groups of three separated by spaces or dashes.

            (\d{8,9})|(\d\d\d[- ]\d\d\d[- ]\d\d\d)

            You could make one of those digits optional by adding a '?' after it. I do not know which one that should be. Other suggestions really require knowing what regex engine you are using (rather than going through every possibility.)
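
One thing to be aware of with the consolidated pattern: `[- ]` is chosen independently at each position, so it also accepts mixed separators such as 123-456 782, which the original three-alternative regex did not. If that matters, a backreference can force the second separator to equal the first. The sketch below uses Python, and the `STRICT` pattern is an assumption added for illustration. Java and Perl both support `\1` backreferences, but check the syntax against your actual engine.

```python
import re

# The consolidated pattern from the post above.
LOOSE = re.compile(r"(\d{8,9})|(\d\d\d[- ]\d\d\d[- ]\d\d\d)")
# Hypothetical stricter variant: \1 must repeat whatever separator group 1 captured.
STRICT = re.compile(r"\d{8,9}|\d{3}([- ])\d{3}\1\d{3}")

for text in ["123456782", "123 456 782", "123-456-782", "123-456 782"]:
    print(text, bool(LOOSE.fullmatch(text)), bool(STRICT.fullmatch(text)))
```

Both patterns accept the first three inputs. Only `LOOSE` accepts the mixed-separator input on the last line.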

              Copalol wrote (#6):

              Thanks everyone, I really appreciate the responses. Unfortunately, I think the only way I can do this is with a regular expression, as I am applying it to a pre-defined field within a cloud-based email security gateway. Sorry, I should have been more detailed in my post.

              In answer to one of your questions, the email gateway supports two types of regex syntax: Java and Perl. The regex doesn't need to look for the words "TFN" or "Tax File Number", as I can do this via the word/phrase match list on the email gateway.

              https://community.mimecast.com/docs/DOC-1613#jive_content_id_Regular_Expressions_Text_Matches

              In summary, the gateway will match the regex defined, match the words "TFN" or "Tax File Number", and then flag the message for the user to review. The rule has a trigger value, known as an activation score, which is currently set to 2. The regex is worth 1 activation point, and the phrases "TFN" and "Tax File Number" are each worth 1 point, so the rule triggers when the regex matches and either phrase is present.

              From the email gateway:

              # search for TFNs
              1 regex (\d{8,9})|(\d\d\d[ ]\d\d\d[ ]\d\d\d)|(\d\d\d[-]\d\d\d[-]\d\d\d)
              # search for words "TFN" and/or "Tax File Number"
              1 "Tax File Number"
              1 "TFN"

              The three formats I've configured to look for TFNs are:

              123 456 782
              123456782
              123-456-782
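
To reason about the scoring, here is a rough Python model of the rule as described in this post. It is not the gateway's real behaviour. It assumes that each entry scores at most once per message, that phrase matching is case-insensitive, and that points simply add up. The names `score` and `is_flagged` are made up for the sketch.

```python
import re

# Entries as configured in the post; the scoring logic itself is an assumption.
TFN_PATTERN = re.compile(
    r"(\d{8,9})|(\d\d\d[ ]\d\d\d[ ]\d\d\d)|(\d\d\d[-]\d\d\d[-]\d\d\d)"
)
REGEX_POINTS = 1
PHRASE_POINTS = {"Tax File Number": 1, "TFN": 1}
ACTIVATION_SCORE = 2


def score(message: str) -> int:
    """Add the regex points (if it matches anywhere) to the points of each phrase present."""
    total = REGEX_POINTS if TFN_PATTERN.search(message) else 0
    lowered = message.lower()
    for phrase, points in PHRASE_POINTS.items():
        if phrase.lower() in lowered:
            total += points
    return total


def is_flagged(message: str) -> bool:
    return score(message) >= ACTIVATION_SCORE


for msg in [
    "My TFN is 123 456 782",
    "Call me on 0412345678",
    "Please return your Tax File Number (TFN) declaration",
]:
    print(score(msg), is_flagged(msg), msg)
```

Under these assumptions the first message is flagged and the second is not, even though the regex matches the mobile number. The third message reaches a score of 2 from the two phrases alone, without any number present, so it is worth checking how the gateway actually combines the phrase entries.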

                Copalol wrote (#7):

                See my response to George Jonsson above :)

                  jschell wrote (#8):

                  Java and Perl both provide the following:

                  \b

                  That represents a word 'boundary'; however, you should read up on it to ensure it is what you really want. Also keep in mind that the Java and Perl engines backtrack: when one way of matching fails, they go back and try the alternatives until they either find a match or have ruled out every possibility. That can result in a lot of processing, and a badly constructed pattern can take an impractically long time to finish, although your current formats should not do that. Anchoring the pattern to something fixed helps the engine reject non-matches quickly.
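
A small Python sketch of what the boundary changes, using a 10-digit mobile number as the false positive. Python's `\b` behaves like the Java and Perl one for these ASCII inputs. The `DIGIT_EDGES` variant with lookarounds is an assumption added for comparison, not something suggested in the thread.

```python
import re

BARE = re.compile(r"\d{8,9}")                    # the 8-9 digit alternative on its own
BOUNDED = re.compile(r"\b\d{8,9}\b")             # same, with word boundaries
DIGIT_EDGES = re.compile(r"(?<!\d)\d{8,9}(?!\d)")  # only forbids adjacent digits

mobile = "0412345678"  # 10 digits, not a TFN
print(BARE.search(mobile).group())   # 041234567
print(BOUNDED.search(mobile))        # None

# \b treats letters and underscores as word characters too, so a number
# glued to a label is rejected by BOUNDED but accepted by DIGIT_EDGES.
glued = "ID123456782"
print(BOUNDED.search(glued))               # None
print(DIGIT_EDGES.search(glued).group())   # 123456782
```

Which of the two behaviours is right depends on how the numbers appear in the actual emails, which is why it pays to read up on `\b` first.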
