Removing stopwords
-
Hi all, I need help removing stop words. I have an ArrayList which contains about 10,000 words, one word stored at each index, and I have to remove all occurrences of some 300 words which are stored in another ArrayList. I am trying to do it like this:
for (int i = 0; i < stopWords.Count; i++)
{
    while (totalWords.Contains(stopWords[i]))
        totalWords.Remove(stopWords[i]);
}
I have also tried to do it this way:
for (int i = 0; i < totalWords.Count; i++)
{
    for (int j = 0; j < stopWords.Count; j++)
    {
        if (totalWords[i].Equals(stopWords[j]))
        {
            totalWords.Remove(totalWords[i]);
            i--;
        }
    }
}
But both of these methods are taking ages to complete, so could anyone please suggest a better, more efficient approach? In the above code, totalWords is the ArrayList which contains all the words, and stopWords is the ArrayList containing the words which are to be removed. Looking forward to your help. Regards,
-
It's probably a combination of the fact that you're doing a linear search through a large list (which might mean 300 * 10,000 comparisons) and that the ArrayList gets reorganized each time you remove an element. Two things to improve:
1. Use BinarySearch (log n instead of n comparisons):
a) Sort the list with:
totalWords.Sort();
b) Call BinarySearch(stopWords[i]) to get the index of a matching item (you get a negative number if it is not found).
2. Instead of removing from the existing list, create a new one and add only the elements which are not in the stopword list. This will reduce the reorganization overhead of the ArrayList.
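For point 1, here is a rough sketch (untested). One caveat: BinarySearch returns the index of some matching element, not necessarily the first one, so the sketch widens the hit to the whole run of duplicates and removes the run with a single RemoveRange call:
totalWords.Sort();
for (int i = 0; i < stopWords.Count; i++)
{
    int hit = totalWords.BinarySearch(stopWords[i]);
    if (hit < 0)
        continue;   // negative result: this stopword is not present
    int first = hit;
    int last = hit;
    // widen to the full run of equal words on both sides of the hit
    while (first > 0 && totalWords[first - 1].Equals(stopWords[i]))
        first--;
    while (last < totalWords.Count - 1 && totalWords[last + 1].Equals(stopWords[i]))
        last++;
    // one RemoveRange instead of many single Remove calls
    totalWords.RemoveRange(first, last - first + 1);
}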
-
Thanks, sir, but the problem is that I can't sort the ArrayList, because it contains the words from different documents; if I sort it, I can no longer keep track of which word occurs in which document. Since I can't sort the list, I can't apply binary search. Please suggest another solution. Looking forward to your help. Regards,
-
Then at least follow the other point:
ArrayList newTotalWords = new ArrayList(totalWords.Count);
for (int i = 0; i < totalWords.Count; i++)
{
    if (!stopWords.Contains(totalWords[i]))
    {
        newTotalWords.Add(totalWords[i]);
    }
}
If you can sort the stopwords list, then you can even use BinarySearch here:
stopWords.Sort();
ArrayList newTotalWords = new ArrayList(totalWords.Count);
for (int i = 0; i < totalWords.Count; i++)
{
    if (stopWords.BinarySearch(totalWords[i]) < 0)
    {
        newTotalWords.Add(totalWords[i]);
    }
}
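If you do not mind a little extra memory, another option (just a sketch, untested) is to put the stopwords into a Hashtable (System.Collections), so each lookup is roughly constant time and no sorting is needed at all:
Hashtable stopSet = new Hashtable(stopWords.Count);
for (int i = 0; i < stopWords.Count; i++)
{
    stopSet[stopWords[i]] = null;   // only the keys matter
}
ArrayList newTotalWords = new ArrayList(totalWords.Count);
for (int i = 0; i < totalWords.Count; i++)
{
    if (!stopSet.ContainsKey(totalWords[i]))   // roughly O(1) per lookup
    {
        newTotalWords.Add(totalWords[i]);
    }
}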
-
Thanks a lot sir, it's working really well, even better than my expectations. I used the second option of sorting the stopwords list and then applying binary search over it. Now I have another, similar problem: after removing the stopwords I have to build an inverted index of the remaining words, i.e. keep a record of how many documents contain a certain word and how many times this word occurs in each particular file. I have done that, but again time is the major problem; it takes a lot of time. I am writing the code I am using below. temp keeps track of the current document number, and wIndex indexes the objects of the Terminology class; each object of this class keeps track of all the info about a certain term.
for (int i = 0; i < wordList.Count; i++)
{
    // control comes here when all the words of a doc have been compared
    if (wordList[i].ToString().Equals(EOF))
    {
        temp++;
        continue;
    }
    word[wIndex] = new Terminology();
    word[wIndex].term = wordList[i].ToString();
    termCount = 1;
    cDocNo = temp;
    jtemp = 0;
    for (int j = i + 1; j < wordList.Count; j++)
    {
        // control comes here when a term has been compared with all the words of a document
        if (i == j)
            continue;
        if (cDocNo >= 1 && jtemp == 0 && temp >= 1)
        {
            jtemp++;
            for (int k = 0; k < cDocNo; k++)
            {
                // saving the term and document frequency of the word (how many docs it occurs in and how many times in each)
                word[wIndex].tf.Add(0);
                word[wIndex].docID.Add(k + 1);
            }
        }
        // EOF is the end-of-file marker which tells when the words of a certain doc have ended
        if (wordList[j].ToString().Equals(EOF))
        {
            cDocNo++;   // a document has ended, so increment the current doc number
            word[wIndex].tf.Add(termCount);
            word[wIndex].docID.Add(cDocNo);
            if (termCount >= 1)
            {
                word[wIndex].df++;   // storing the word's document frequency
                docCount++;
                termCount = 0;
            }
            continue;
        }
        // checking the repetition of terms
        if (wordList[i].Equals(wordList[j]))
        {
            wordList.RemoveAt(j);
            termCount++;
            j--;
        }
    }
    wIndex++;
}
Any suggestions to improve the efficiency of this code will be welcome. Looking forward to your help. Regards,
-
Could you please correct your post? There seem to be some errors in it, like:
for(int j = i+1;j
I could guess what should be there, but it's easier if you repost.
-
I have modified the code, please check it.
-
To be honest, I have not fully understood your code (and I currently do not have the time to invest in this). First of all, you should check whether it is really doing what you expect (probably with some small sample data). The only thing I can advise is to avoid using RemoveAt on the wordList. For your understanding: if you remove the first element of an ArrayList, it will copy all the other elements internally (which in this case means copying 9999 words). You also again have the problem of not being able to use BinarySearch. You could try to reorganize your data: instead of having one big list, you could have separate lists for each document. You could then sort each of those without losing the reference to their respective documents and do some binary searches; a sketch of the idea follows. I will probably have some time later on. As performance tuning is fun for me, you could send me the complete code by mail (along with some sample data).
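To make the reorganization concrete, here is a rough sketch (untested; docs is a placeholder name for an ArrayList holding one ArrayList of words per document, instead of one big list with EOF markers). Once a document's list is sorted, duplicates sit next to each other, so the term frequency of every distinct word falls out of a single linear scan, with no repeated searching and no RemoveAt:
for (int d = 0; d < docs.Count; d++)
{
    ArrayList words = (ArrayList)docs[d];
    words.Sort();   // duplicates are now adjacent
    int i = 0;
    while (i < words.Count)
    {
        string term = (string)words[i];
        int tf = 1;
        while (i + tf < words.Count && words[i + tf].Equals(term))
            tf++;   // count the run of duplicates
        // term occurs tf times in document d+1:
        // update the matching Terminology object (df, tf, docID) here
        i += tf;
    }
}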