This is an example of why I always insist on tabs as a separator (not commas or semi-colons).
Guus2005 wrote:
Remove all double quotes not directly preceded or directly followed by a semicolon
You are trying to solve this incorrectly.
Guus2005 wrote:
1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"
You do not want to "remove" the double quotes because they are part of the value. The correct value of the fourth field in the line above is:
Apotheker "Blue tongue"
The pattern for the CSV is as follows:
1. A semi-colon separates values.
2. Some values are quoted (with double quotes).

For the second case, the following applies to the value (not the line, just one value from the line):
1. The double quotes MUST be at both the start and the end of the value. If both are not present, the rule does not apply.
2. In that case the enclosing double quotes are removed. Internal double quotes are not affected.

Additionally, you need to deal with the possibility that there is a semi-colon in the middle of a value. If that can happen, then I doubt you should be using a regex to parse the lines. Certainly if I were doing it I would not use a regex. Rather, I would build a parser/tokenizer, since the rules would be easier to see (and debug). It would probably be faster as well. The tokenizer also makes the semi-colon case much easier to deal with.

The tokenizer rule would be, in general:
1. Find a semi-colon (start at the semi-colon).
2. If the next character is a double quote, flag that the next break is a quote followed by a semi-colon.
3. If the next character is not a double quote, flag that the next break is a semi-colon.
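
As a rough sketch of those tokenizer rules, here is one way it could look in Python. The language is my choice for illustration, and the function name split_line and its parameters are made up. It assumes a quoted value ends at a double quote directly followed by a semi-colon, or at a double quote at the very end of the line, and that an opening quote with no such close is just ordinary text.

```python
def split_line(line, sep=";", quote='"'):
    """Split one line into values using the quote-then-separator rule.

    Hypothetical helper for illustration only; not a general CSV parser.
    """
    values = []
    pos = 0
    n = len(line)
    while pos <= n:
        if line.startswith(quote, pos):
            # Quoted value: the next break is a quote directly followed
            # by the separator ...
            end = line.find(quote + sep, pos + 1)
            # ... or a closing quote at the very end of the line.
            if end == -1 and n > pos + 1 and line.endswith(quote):
                end = n - 1
            if end != -1:
                # Drop only the enclosing quotes; internal quotes stay.
                values.append(line[pos + 1:end])
                pos = end + 2  # skip the closing quote and the separator
                continue
            # No closing quote found: fall through, treat as unquoted.
        # Unquoted value: the next break is simply the next separator.
        end = line.find(sep, pos)
        if end == -1:
            end = n
        values.append(line[pos:end])
        pos = end + 1
    return values


if __name__ == "__main__":
    sample = '1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"'
    for value in split_line(sample):
        print(value)
```

With the sample line from the question, the fourth value comes out as Apotheker "Blue tongue", with the internal quotes intact, and a quoted value such as "a;b" keeps its semi-colon. One thing this sketch cannot handle is a quoted value that itself contains a double quote directly followed by a semi-colon; the data would need an escaping rule for that, and none is given here.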