Remove duplicate nodes from XML using C#
-
I am having some trouble with removing duplicate entries from an XML file. I am mostly using LINQ to XML and C# to build the list, so I would like a LINQ to XML approach here too. Example (before):
<pre>
<SYNSET>
 <ID>new_id</ID>
 <SYNONYM>
  <LITERAL>word1<SENSE>II.♦</SENSE></LITERAL>
 </SYNONYM>
 <DEF> definition1 </DEF>
</SYNSET>
<SYNSET>
 <ID>new_id</ID>
 <SYNONYM>
  <LITERAL>word2<SENSE>I</SENSE></LITERAL>
 </SYNONYM>
 <DEF> definition2 </DEF>
</SYNSET>
<SYNSET>
 <ID>new_id</ID>
 <SYNONYM>
  <LITERAL>word1<SENSE>II.♦</SENSE></LITERAL>
 </SYNONYM>
 <DEF> definition1 </DEF>
</SYNSET>
</pre>
After it should be:
<pre>
<SYNSET>
 <ID>new_id</ID>
 <SYNONYM>
  <LITERAL>word1<SENSE>II.♦</SENSE></LITERAL>
 </SYNONYM>
 <DEF> definition1 </DEF>
</SYNSET>
<SYNSET>
 <ID>new_id</ID>
 <SYNONYM>
  <LITERAL>word2<SENSE>I</SENSE></LITERAL>
 </SYNONYM>
 <DEF> definition2 </DEF>
</SYNSET>
</pre>
The XML database will have around 100,000 entries like this, so I need a fast method to remove the duplicates. Thanks in advance.
-
It would probably be best to use a hash table; have a look at the Dictionary class. Then read your XML file and loop over all the entries. For each entry, build a string of all the entry values and create a hash of that string. Check whether the hash is already in the dictionary (with its ContainsKey function). If it is not there, add it to the dictionary and export the entry to the new XML file. End of loop. If you skip the hash step and simply store the built string itself, it will use more memory, but the process may be faster. You will have to test and decide for yourself.
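A minimal sketch of that loop, assuming LINQ to XML and using a HashSet&lt;string&gt; in place of the dictionary (the set hashes the built string internally, so the explicit hash step is folded in; the class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class DedupSketch
{
    // keep the first occurrence of each entry, drop the rest
    public static XElement RemoveDuplicates(XElement root, string entryName)
    {
        var seen = new HashSet<string>();           // stands in for the dictionary
        var unique = new XElement(root.Name);
        foreach (XElement entry in root.Elements(entryName))
        {
            // the serialized element is the "built string of all entry values"
            string key = entry.ToString(SaveOptions.DisableFormatting);
            if (seen.Add(key))                      // Add returns false for a duplicate
                unique.Add(entry);
        }
        return unique;
    }

    static void Main()
    {
        // inline sample standing in for the real file
        XElement root = XElement.Parse(
            "<ROOT>"
            + "<SYNSET><ID>new_id</ID><DEF>definition1</DEF></SYNSET>"
            + "<SYNSET><ID>new_id</ID><DEF>definition2</DEF></SYNSET>"
            + "<SYNSET><ID>new_id</ID><DEF>definition1</DEF></SYNSET>"
            + "</ROOT>");

        XElement unique = RemoveDuplicates(root, "SYNSET");
        Console.WriteLine(unique.Elements("SYNSET").Count());  // prints 2
    }
}
```

On a real file you would replace the inline sample with XElement.Load and finish with unique.Save.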
Life goes very fast. Tomorrow, today is already yesterday.
-
I don't think that will really work. As I said, the XML database has around 100,000 entries. What I want to do is check whether an entry appears twice in that database and remove the duplicates. I read up on the Dictionary class a bit; it is not really what I am looking for.
-
Well, I have used the same process to remove duplicate lines from a text file (100 MB+), so I know it will work. How do you expect to check whether a line is a duplicate? You have to check each line against all the other lines in the file, right? You could skip the dictionary part and check all the lines in the new file each time you read one in from the original file. That would certainly improve memory consumption, but your processing time would increase considerably.
-
So the method works for not adding duplicates to a file? And how do you remove a duplicate that is already in the file?
Yeah, but that is how you will need to do it. You can't just remove duplicates from a file; unfortunately there is no function such as File.RemoveDuplicates(). Of course, you could read the whole XML file into a DataTable, remove the duplicates, and re-write it back to the file, but you don't want to have to load all the data into memory at once. So the best way is to create a reader for your XML file. Then read it one entry at a time and concatenate all the values into one string. Then you either use some sort of collection to store unique entries, which you can use to check for duplicates, or you write straight to a new file and check that file for duplicates each time. If a duplicate is found, with whichever method you choose, you simply ignore that entry and read the next one. Then you have either a file with unique entries, or a collection of unique entries which you can write to a new file. Fairly straightforward concept, really.
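A sketch of that streaming variant, assuming XmlReader/XmlWriter with a HashSet as the unique-entry collection (the manual loop around XNode.ReadFrom is deliberate: ReadFrom already advances the reader, so calling Read afterwards would skip adjacent entries; the method name and root element name are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

class StreamingDedup
{
    // copy only the first occurrence of each SYNSET; returns how many were kept
    public static int Dedup(TextReader input, TextWriter output)
    {
        var seen = new HashSet<string>();
        int kept = 0;
        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output))
        {
            writer.WriteStartElement("ROOT");
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "SYNSET")
                {
                    // materialise just this one entry; ReadFrom advances the reader
                    XElement entry = (XElement)XNode.ReadFrom(reader);
                    if (seen.Add(entry.ToString(SaveOptions.DisableFormatting)))
                    {
                        entry.WriteTo(writer);
                        kept++;
                    }
                }
                else
                {
                    reader.Read();
                }
            }
            writer.WriteEndElement();
        }
        return kept;
    }

    static void Main()
    {
        // inline sample standing in for the real input and output files
        string xml = "<ROOT><SYNSET><ID>a</ID></SYNSET>"
                   + "<SYNSET><ID>b</ID></SYNSET>"
                   + "<SYNSET><ID>a</ID></SYNSET></ROOT>";
        var result = new StringWriter();
        Console.WriteLine(Dedup(new StringReader(xml), result));  // prints 2
    }
}
```

With real files you would pass a StreamReader and StreamWriter instead; only the seen-set lives in memory, not the document.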
-
That worked out well for me, thank you. The code:<pre>
// load the existing entries and remember each SYNSET's serialized form
XElement existing = XElement.Load("wnrom2.xml");
HashSet<string> seen = new HashSet<string>();
foreach (XElement dup in existing.Descendants("SYNSET"))
{
    seen.Add(dup.ToString());
}

// only save when the newest entry is not already in the file
if (!seen.Contains(synsets.Descendants("SYNSET").Last().ToString()))
{
    synsets.Save("wnrom.xml");
    synsets.Save("wnrom2.xml");
}
</pre>
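For a one-shot cleanup of an already-written file, the same idea can also be expressed directly in LINQ to XML, as originally asked. This is a sketch assuming the file fits in memory; it groups entries by their serialized form and keeps the first of each group (ReplaceNodes has snapshot semantics, so replacing the children with a filtered list of those same children is safe):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class WholeFileDedup
{
    // drop every SYNSET whose serialized form has already appeared
    public static void Deduplicate(XElement root, string entryName)
    {
        var unique = root.Elements(entryName)
            .GroupBy(e => e.ToString(SaveOptions.DisableFormatting))
            .Select(g => g.First())
            .ToList();
        root.ReplaceNodes(unique);
    }

    static void Main()
    {
        // inline sample standing in for wnrom.xml
        XElement root = XElement.Parse(
            "<ROOT><SYNSET><ID>a</ID></SYNSET>"
            + "<SYNSET><ID>b</ID></SYNSET>"
            + "<SYNSET><ID>a</ID></SYNSET></ROOT>");

        Deduplicate(root, "SYNSET");
        Console.WriteLine(root.Elements("SYNSET").Count());  // prints 2
        // root.Save("wnrom.xml"); would then write the cleaned file back
    }
}
```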