RegEx remove duplicate need help

manoo88

Hi, I'm not good at RegEx and I'm trying to remove some duplicate values. The data is in a CSV file; part of it looks like this:

"/content/7/66345/images/590009.jpg , "/content/7/66345/images/590009.jpg , "/attachments/fe519c1e91c5e4983a70a2512fd5788b.jpg , "/content/7/66345/images/590009.jpg , "/content/7/66345/images/590009.jpg , "/attachments/4956e4fe56b59135c086605c9gyye.png
"/content/1/3968663/images/856609.jpg , "/attachments/086605c7c6e4fe56b59135c11b.jpg , "/content/1/3968663/images/856609.jpg , "/attachments/086605c7c6e4fe56b59135c11b.jpg
"/content/1/1458767/images/856657.jpg
"/content/1/1448511/images/856373.jpg

I am trying the following in Notepad++:

\w+\.+jpg|\w+\.+png(?:^|\G)(\b\w+\b),?(?=.*\1)

It does select the images one at a time when I click Find, but I am not sure how to delete the duplicates. When I replace with an empty string, it removes everything. I want the first image to remain and the duplicates to be deleted. Can anyone help me with the code, please? I don't want to remove the line, because that could mess up the CSV file; removing the extension of the duplicate image is OK. Thanks for your help.

Member 10601191

Hi, requesting some clarification: given the sample input, please post the expected output, to avoid any misunderstanding on my part. I could guess but prefer not to. Also, must this be done in Notepad++ (if yes, why?), and on which OS? Thanks.

Andre Oosthuizen

The regex you provided is close, but a few modifications are needed to achieve your desired result. Here's the corrected regex and how you can use it in Notepad++:

("\/.*?\.(?:jpg|png))\s*,\s*(?=.*\1)

I have tested this regex in Notepad++ using the following steps:
1) Open your CSV file in Notepad++.
2) Press Ctrl + H to open the "Replace" dialog.
3) In the "Find what" field, enter the regex: ("\/.*?\.(?:jpg|png))\s*,\s*(?=.*\1)
4) Leave the "Replace with" field empty.
5) In the "Search Mode" section, select "Regular expression".
6) Click on "Replace All".

Make sure to have a backup of your data before performing any find and replace operations, just in case...
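To see what the pattern above does outside the editor, here is a minimal Python sketch that applies the same expression with re.sub to a made-up sample line (the short paths are placeholders, not data from the post). One thing it makes visible: because the lookahead (?=.*\1) asks whether the same path occurs again later on the line, the earlier copies are the ones that get removed and the last copy of each path survives. Python's re module keeps "." within a single line by default; Notepad++ is assumed to behave the same way as long as ". matches newline" is left unchecked.

```python
import re

# Same pattern as in the steps above. "." does not cross a newline here,
# so the lookahead only compares paths that sit on the same line.
PATTERN = re.compile(r'("\/.*?\.(?:jpg|png))\s*,\s*(?=.*\1)')

def drop_earlier_duplicates(text: str) -> str:
    """Delete every path (plus its trailing comma) that occurs again later on the line."""
    return PATTERN.sub("", text)

# Hypothetical sample line, shortened for readability.
line = '"/a/1.jpg , "/a/1.jpg , "/b/2.jpg , "/a/1.jpg , "/c/3.png'
print(drop_earlier_duplicates(line))
# "/b/2.jpg , "/a/1.jpg , "/c/3.png
```

Each path ends up on the line exactly once, but "/a/1.jpg" now comes after "/b/2.jpg", since its last copy is the one that was kept. If the position of the first occurrence matters, that is worth checking on a copy of the file before running "Replace All".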

jschell

Presumably your expectation is the following:
1. The entire row is duplicated.
2. The duplicated row immediately follows the first row.

Otherwise I doubt regex is the way to go.

trønderen

If the ordering of the rows is insignificant, you can simply sort the lines to collect the duplicate rows together. (And if necessary, sort again on a key field once you are done.) But if you want to remove entire duplicated rows (lines), are you serious about using a regex to compare entire text lines for being identical? That can't be! But from the OP's first post, I cannot see what he intends to compare, and what he intends to remove.
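For the whole-row reading of the problem, the sorting idea can be sketched in a few lines of Python. This assumes that identical text lines are what should be collapsed, which the thread has not confirmed; the function names and sample rows are made up for illustration.

```python
from itertools import groupby

def unique_rows_sorted(rows):
    """Sort so identical rows become adjacent, then keep one row per group."""
    return [row for row, _ in groupby(sorted(rows))]

def unique_rows_in_order(rows):
    """Alternative without sorting: keep the first occurrence, preserve order."""
    return list(dict.fromkeys(rows))

# Hypothetical rows standing in for CSV lines.
rows = ["b,2", "a,1", "b,2", "c,3", "a,1"]
print(unique_rows_sorted(rows))    # ['a,1', 'b,2', 'c,3']
print(unique_rows_in_order(rows))  # ['b,2', 'a,1', 'c,3']
```

The second function avoids the need to re-sort on a key field afterwards, at the cost of holding every distinct row in memory.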

jschell

trønderen wrote:

are you serious about using a regex to compare entire text lines for being identical

Myself? No, I would not have attempted it with regex at all. I probably would have created a one-shot Perl script, not for the regex capabilities, but rather because reading files is easier to set up. Running it repeatedly while testing is easier as well. I would also note that the editor I use does have fairly decent regex support, so it is not a lack of that which would have driven my decision.
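As an illustration of the one-shot script approach, here is a minimal sketch in Python rather than Perl (a swapped-in language, not what was proposed above). It assumes fields are separated by commas with optional spaces and that no field contains a comma itself; the function names, the placeholder parameter, and the sample line are all hypothetical. It keeps the first occurrence of each field on a line, which is what the original question asked for.

```python
import re
from typing import Optional

# Assumption: a comma with optional surrounding whitespace separates fields.
SEPARATOR = re.compile(r"\s*,\s*")

def dedupe_line(line: str, placeholder: Optional[str] = None) -> str:
    """Keep the first occurrence of each field on one line, preserving order.

    With placeholder=None, later duplicates are dropped entirely.
    With a string (for example ""), each duplicate is replaced by it,
    so the number of fields on the line stays the same.
    """
    seen = set()
    kept = []
    for field in SEPARATOR.split(line.strip()):
        if field not in seen:
            seen.add(field)
            kept.append(field)
        elif placeholder is not None:
            kept.append(placeholder)
    return " , ".join(kept)

def dedupe_text(text: str, placeholder: Optional[str] = None) -> str:
    """Apply dedupe_line to every line independently."""
    return "\n".join(dedupe_line(l, placeholder) for l in text.splitlines())

# Hypothetical sample line, shortened for readability.
sample = '"/a/1.jpg , "/a/1.jpg , "/b/2.jpg , "/a/1.jpg , "/c/3.png'
print(dedupe_line(sample))
# "/a/1.jpg , "/b/2.jpg , "/c/3.png
```

Dropping duplicates changes the number of fields per line. If that would upset the CSV layout, passing placeholder="" blanks the duplicates instead and leaves the comma count untouched.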