Code Project
Removing stopwords

C#
Tags: database, data-structures, help
8 Posts 2 Posters 0 Views 1 Watching
#1 Rizwan Rathore:

Hi all, I need help removing stop words. I have an ArrayList containing about 10,000 words, one word per index, and I have to remove all occurrences of some 300 words which are stored in another ArrayList. I am trying to do it like this:

for (int i = 0; i < stopWords.Count; i++)
{
    while (totalWords.Contains(stopWords[i]))
        totalWords.Remove(stopWords[i]);
}

I have also tried to do it this way:

for (int i = 0; i < totalWords.Count; i++)
{
    for (int j = 0; j < stopWords.Count; j++)
    {
        if (totalWords[i].Equals(stopWords[j]))
        {
            totalWords.Remove(totalWords[i]);
            i--;
        }
    }
}

but both of these methods are taking ages to complete :( ... so please, can anyone suggest a better, more efficient approach? In the code above, totalWords is the ArrayList containing all the words, and stopWords is the ArrayList containing the words to be removed. Looking forward to your help. Regards, -- modified at 8:09 Thursday 18th May, 2006

#2 Robert Rohde:

It's probably a combination of two things: you're doing a linear search through a large list (which can mean up to 300 * 10,000 comparisons), and the ArrayList gets reorganized each time you remove an element. Two things to improve:

1. Use BinarySearch (log n instead of n comparisons):
   a) Sort the list with totalWords.Sort();
   b) Call BinarySearch(stopWords[i]) to get the index of a found item (you'll get a negative value if it is not found).
2. Instead of removing from the existing list, create a new one and add only the elements which are not in the stopword list. This avoids the reorganization overhead of the ArrayList.

#3 Rizwan Rathore:

Thanks, sir, but the problem is that I can't sort the ArrayList, because it contains the words from different documents, and if I sort it I can no longer keep track of which word occurs in which document. Since I can't sort the list, I can't apply binary search :( so please suggest another solution. Looking forward to your help. Regards,

#4 Robert Rohde:

          Then at least follow the other point:

ArrayList newTotalWords = new ArrayList(totalWords.Count);
for (int i = 0; i < totalWords.Count; i++)
{
    if (!stopWords.Contains(totalWords[i]))
    {
        newTotalWords.Add(totalWords[i]);
    }
}

          If you can sort the stopwords list then you can even use BinarySearch here:

stopWords.Sort();
ArrayList newTotalWords = new ArrayList(totalWords.Count);
for (int i = 0; i < totalWords.Count; i++)
{
    if (stopWords.BinarySearch(totalWords[i]) < 0)
    {
        newTotalWords.Add(totalWords[i]);
    }
}
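A further option, not suggested in the thread but in the same spirit: a Hashtable gives average O(1) membership tests, so the per-word cost does not depend on the stopword count at all, and no sorting is needed. This is a sketch only; the method name `StopWordFilter.Filter` is made up here, while `totalWords` and `stopWords` mirror the thread's variables.

```csharp
using System;
using System.Collections;

class StopWordFilter
{
    // Builds a Hashtable from the stopword list so each lookup is O(1)
    // on average, instead of O(log n) for BinarySearch.
    public static ArrayList Filter(ArrayList totalWords, ArrayList stopWords)
    {
        Hashtable stopSet = new Hashtable(stopWords.Count);
        foreach (object w in stopWords)
            stopSet[w] = null; // only the key matters

        // Copy the survivors into a fresh list; no Remove calls,
        // so no internal element shifting.
        ArrayList result = new ArrayList(totalWords.Count);
        foreach (object w in totalWords)
            if (!stopSet.ContainsKey(w))
                result.Add(w);
        return result;
    }
}
```

On a newer runtime a HashSet<string> expresses the same idea more directly; the shape of the loop stays identical.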

#5 Rizwan Rathore:

Thanks a lot, sir, it's working really well, even better than my expectations. I used the second option of sorting the stopwords list and then applying binary search over it. Now I have another, similar problem: after removing the stopwords I have to build an inverted index of the remaining words, i.e. keep a record of how many documents contain a certain word and how many times the word occurs in each particular file. I have done that, but again time is the major problem; it takes a very long time. Below is the code I am using. temp keeps track of the current document number; wIndex indexes the objects of the Terminology class; each object of this class keeps track of all the info about a certain term.

for (int i = 0; i < wordList.Count; i++)
{
    // control comes here when all the words of a doc have been compared
    if (wordList[i].ToString().Equals(EOF))
    {
        temp++;
        continue;
    }
    word[wIndex] = new Terminology();
    word[wIndex].term = wordList[i].ToString();
    termCount = 1;
    cDocNo = temp;
    jtemp = 0;
    for (int j = i + 1; j < wordList.Count; j++)
    {
        // control comes here when a term has been compared with all
        // the words of a document
        if (i == j)
            continue;

        if (cDocNo >= 1 && jtemp == 0 && temp >= 1)
        {
            jtemp++;
            for (int k = 0; k < cDocNo; k++)
            {
                // saving the term and document frequency of the word
                // (i.e. in how many documents it occurs and how many
                // times in each doc)
                word[wIndex].tf.Add(0);
                word[wIndex].docID.Add(k + 1);
            }
        }

        // EOF is the end-of-file marker which tells when the words
        // of a certain doc have ended
        if (wordList[j].ToString().Equals(EOF))
        {
            // a document has ended, so increment the current doc number
            cDocNo++;
            word[wIndex].tf.Add(termCount);
            word[wIndex].docID.Add(cDocNo);

            // increment the document frequency if the term occurred
            if (termCount >= 1)
            {
                // storing word frequency
                word[wIndex].df++;
                docCount++;
                termCount = 0;
            }
            continue;
        }

        // checking for repetition of terms
        if (wordList[i].Equals(wordList[j]))
        {
            wordList.RemoveAt(j);
            termCount++;
            j--;
        }
    }
    wIndex++;
}

Any suggestions to improve the efficiency of this code are welcome. Looking forward to your help. Regards, -- modified at 10:34 Thursday 18th May, 2006

#6 Robert Rohde:

Could you please correct your post? There seem to be some errors in it, like:

for(int j = i+1;j

I could guess what should be there, but it's easier if you repost.

#7 Rizwan Rathore:

I have modified the code; please check it.

#8 Robert Rohde:

To be honest, I have not fully understood your code (and I currently do not have the time to invest in this). First of all, you should check whether it really does what you expect, ideally with some small sample data. The one piece of advice I can give is to avoid using RemoveAt on the wordList. For your understanding: if you remove the first element of an ArrayList, it copies all the other elements internally (which in this case means copying 9,999 words). You also again have the problem of not being able to use BinarySearch. You could try to reorganize your data: instead of having one big list, keep a separate list for each document. You could then sort each of those without losing the reference to its document and do some binary searches. I will probably have some time later on. As performance tuning is fun for me, you could send me the complete code by mail (along with some sample data).
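The restructuring idea can also be pushed one step further, though this goes beyond what the thread itself spells out: a single pass over the word stream with a Hashtable avoids both the nested loops and the RemoveAt calls. This is only a sketch: the class name `InvertedIndexSketch` is invented here, the thread's `Terminology` object with its `tf`, `df` and `docID` fields is replaced by a nested Hashtable, and the EOF separator is passed in as a parameter.

```csharp
using System;
using System.Collections;

class InvertedIndexSketch
{
    // One pass over the word stream: for each term, a Hashtable maps
    // docId -> occurrence count. The eofMarker plays the same role as
    // EOF in the thread's code: it separates one document's words
    // from the next.
    public static Hashtable Build(ArrayList wordList, string eofMarker)
    {
        Hashtable index = new Hashtable(); // term -> (docId -> count)
        int docId = 1;
        foreach (object o in wordList)
        {
            string wordStr = o.ToString();
            if (wordStr.Equals(eofMarker))
            {
                docId++;
                continue;
            }

            Hashtable postings = (Hashtable)index[wordStr];
            if (postings == null)
            {
                postings = new Hashtable();
                index[wordStr] = postings;
            }
            object count = postings[docId];
            postings[docId] = (count == null) ? 1 : (int)count + 1;
        }
        return index;
    }
}
```

From the result, the document frequency of a term is the inner table's Count, and the per-document term frequency is the value stored under each docId: the same information the thread's tf, df and docID fields carry, but built in O(total words) instead of comparing every word against every other.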
