Remove duplicate nodes from xml using c#

7 Posts 2 Posters 0 Views 1 Watching
ipstefan
#1
    I am having some trouble with removing duplicate entries from an XML file. I am using mostly LINQ to XML and C# to build the list, so I would like a LINQ to XML approach here too. Example (before):

<pre>
<SYNSET>
   <ID>new_id</ID>
   <SYNONYM>
      <LITERAL>word1<SENSE>II.♦</SENSE></LITERAL>
   </SYNONYM>
   <DEF> definition1 </DEF>
</SYNSET>
<SYNSET>
   <ID>new_id</ID>
   <SYNONYM>
      <LITERAL>word2<SENSE>I</SENSE></LITERAL>
   </SYNONYM>
   <DEF> definition2 </DEF>
</SYNSET>
<SYNSET>
   <ID>new_id</ID>
   <SYNONYM>
      <LITERAL>word1<SENSE>II.♦</SENSE></LITERAL>
   </SYNONYM>
   <DEF> definition1 </DEF>
</SYNSET>
</pre>

After, it should be:

<pre>
<SYNSET>
   <ID>new_id</ID>
   <SYNONYM>
      <LITERAL>word1<SENSE>II.♦</SENSE></LITERAL>
   </SYNONYM>
   <DEF> definition1 </DEF>
</SYNSET>
<SYNSET>
   <ID>new_id</ID>
   <SYNONYM>
      <LITERAL>word2<SENSE>I</SENSE></LITERAL>
   </SYNONYM>
   <DEF> definition2 </DEF>
</SYNSET>
</pre>

The XML database has around 100,000 entries like this, so I need a fast method to remove duplicates. Thanks in advance.

musefan
#2

      It would probably be best to use a hash table; have a look at the Dictionary class. Read your XML file, loop over all the entries, and for each one build a string of all the entry's values. Create a hash of that string and check whether the hash value is already in the dictionary (with its Contains method). If it is not there, add it to the dictionary and export the entry to the new XML file. End of loop. If you skip the hash step and simply add the built string itself, it will use more memory but may be faster. You will have to test and decide for yourself.
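      The steps above could be sketched like this. The file names are placeholders, and a HashSet&lt;string&gt; is used in place of a Dictionary since only the keys matter here:

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

class DedupeSketch
{
    static void Main()
    {
        XElement root = XElement.Load("input.xml");      // hypothetical input file
        var seen = new HashSet<string>();                // one key per unique entry
        var unique = new XElement(root.Name);

        foreach (XElement synset in root.Elements("SYNSET"))
        {
            // The serialized entry acts as the "built string" of all its values;
            // a hash of it could be stored instead to trade memory for speed.
            string key = synset.ToString(SaveOptions.DisableFormatting);
            if (seen.Add(key))                           // Add returns false for duplicates
                unique.Add(synset);
        }

        unique.Save("output.xml");                       // hypothetical output file
    }
}
```

      HashSet&lt;T&gt;.Add returns false when the item is already present, which replaces the explicit Contains check described above.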

      Life goes very fast. Tomorrow, today is already yesterday.

ipstefan
#3

        I don't think it will really work. As I said, the XML database has around 100,000 entries. What I want to do is check whether an entry appears twice in that database and remove the duplicates. I read up on the Dictionary class a bit; it is not really what I am looking for.

musefan
#4

          Well, I have used the same process to remove duplicate lines from a text file (100 MB+), so I know it will work. How do you expect to check whether a line is a duplicate? You have to check each line against all the other lines in the file, right? You could skip the dictionary part and instead check every line already written to the new file each time you read one in from the original file. That would certainly reduce memory consumption, but your processing time would increase greatly.


ipstefan
#5

            So the method works for not adding duplicates to a file? And how do you remove a duplicate that is already in the file?

musefan
#6

              Yes, but that is how you will need to do it. You can't just remove duplicates from a file in place; unfortunately there is no function such as File.RemoveDuplicates(). Of course, you could read the whole XML file into a DataTable, remove the duplicates, and write it back to the file, but you don't want to load all the data into memory at once.

              So the best way is to create a reader for your XML file and read it one entry at a time, concatenating all the values into one string. Then you either use some sort of collection to store unique entries, which you can check for duplicates, or you write straight to a new file and check that file for duplicates each time. Whichever method you choose, if a duplicate is found you simply ignore that entry and read the next one. At the end you have either a file with unique entries, or a collection of unique entries which you can then write to a new file. A fairly straightforward concept, really.
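              A streaming version of that idea might look like this sketch, assuming placeholder file names and an assumed root element name, using XmlReader so the whole document is never parsed into memory at once:

```csharp
using System.Collections.Generic;
using System.Xml;

class StreamingDedupe
{
    static void Main()
    {
        var seen = new HashSet<string>();                  // unique entries seen so far
        using (var reader = XmlReader.Create("input.xml")) // hypothetical file names
        using (var writer = XmlWriter.Create("output.xml"))
        {
            writer.WriteStartElement("SYNSETS");           // assumed root element name
            while (reader.ReadToFollowing("SYNSET"))
            {
                string entry = reader.ReadOuterXml();      // one entry as a string
                if (seen.Add(entry))                       // false means duplicate: skip it
                    writer.WriteRaw(entry);
            }
            writer.WriteEndElement();
        }
    }
}
```

              Only the set of entry strings is held in memory, not the parsed document, so memory use is bounded by the number of unique entries.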



ipstefan
#7

                That worked out well for me, thank you. The code:

<pre>
XElement duplicate = XElement.Load("wnrom2.xml");
StringBuilder build = new StringBuilder();
Dictionary<string, string> dict = new Dictionary<string, string>();

foreach (XElement dup in duplicate.Descendants("SYNSET"))
{
    build.Append(dup.ToString());
}
dict.Add("a", build.ToString());

if (dict.ContainsValue(synsets.Descendants("SYNSET").Last().ToString()) == false)
{
    synsets.Save("wnrom.xml");
    synsets.Save("wnrom2.xml");
}
</pre>
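                For reference, the LINQ to XML approach the question originally asked for could also be sketched with GroupBy, keeping the first of each group of identical entries (the output file name is a placeholder):

```csharp
using System.Linq;
using System.Xml.Linq;

class LinqDedupe
{
    static void Main()
    {
        XElement root = XElement.Load("wnrom2.xml");
        var unique = root.Elements("SYNSET")
                         .GroupBy(s => s.ToString(SaveOptions.DisableFormatting))
                         .Select(g => g.First())
                         .ToList();                        // materialize before replacing
        root.ReplaceNodes(unique);                         // snapshots content, so this is safe
        root.Save("wnrom_unique.xml");                     // placeholder output name
    }
}
```

                Note this loads the whole document into memory, which is usually fine for ~100k entries but loses the streaming advantage of the reader-based approach.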
