Converting HTML files into text files
-
Hi all, I want to convert HTML files into text files, and I need an efficient method because I have to convert about 500 documents each time. Currently I am using this code to convert them:
string temp = null;
StreamReader sr = new StreamReader(path);
temp = sr.ReadToEnd();
temp = Regex.Replace(temp, "<[^>]*>", " ");
sr.Close();
StreamWriter sw = new StreamWriter(newpath, false);
sw.WriteLine(temp);
sw.Flush();
sw.Close();
This method is efficient and fast, but the problem I am having is that it removes symbols such as [, ], <, > and others of this sort, while it does not remove the actual tag names such as html, class, table, etc. Please tell me what I should do to remove these. Looking forward to your help. Regards,
-
Hi, you can follow a simple logic in this case. When you encounter a "<" symbol, note the index of this first occurrence. Then scan the string until you reach the matching ">" symbol, and remove the substring from the index you noted through the index of the ">". This logic may be inefficient, but it is the simplest one. To improve performance, use StringBuilder instead of string. One reason this logic works is that a browser also parses the text in this manner to render the output, so if the HTML file displays fine in a browser, the above logic should work. Tell me if it works; I would like to know. Anant Y. Kulkarni
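A minimal sketch of the scanning idea described above, written as a single pass that copies everything outside "<...>" into a StringBuilder instead of removing substrings one at a time. The class name TagStripper and the method name StripTags are hypothetical, chosen only for illustration. It assumes simple markup: it does not skip the contents of script or style elements, does not decode entities such as &amp;, and would be confused by a ">" inside an attribute value or comment.

```csharp
using System.Text;

public static class TagStripper
{
    // Hypothetical helper: copies every character outside <...> into a
    // StringBuilder and emits one space in place of each tag.
    public static string StripTags(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;       // start of a tag: stop copying
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;      // end of the tag: resume copying
                result.Append(' ');     // same separator the Regex.Replace call inserts
            }
            else if (!insideTag)
            {
                result.Append(c);       // ordinary text, including a stray '>'
            }
        }

        return result.ToString();
    }
}
```

In the original code, the Regex.Replace line could then be swapped for something like temp = TagStripper.StripTags(temp); so the file reading and writing stay the same.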
-
Hi Sir, I will surely try this logic, but could you tell me what the difference is between the string class and the StringBuilder class, and how it affects the efficiency of the program? Looking forward to your help. Regards,
-
Hi, string is immutable, which means unchangeable. Every operation that modifies a string creates a new string object, and the result of the operation is stored in that new object. That is why the StringBuilder class is recommended in .NET when string operations need to be performed frequently. To learn more about it, try searching MSDN. "A good programmer is someone who looks both ways before crossing a one-way street." -- Doug Linder Anant Y. Kulkarni
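To make the difference concrete, here is a small sketch comparing the two approaches. The class name ConcatDemo and its method names are hypothetical, used only for illustration. Both methods return the same text; the difference is in how many intermediate objects are created along the way.

```csharp
using System.Text;

public static class ConcatDemo
{
    // Each += builds a brand new string and copies the old contents into it,
    // so a long loop creates many short-lived string objects.
    public static string WithString(int count)
    {
        string s = "";
        for (int i = 0; i < count; i++)
        {
            s += "x";
        }
        return s;
    }

    // StringBuilder appends into an internal buffer that grows as needed;
    // a string is created only once, when ToString() is called.
    public static string WithBuilder(int count)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            sb.Append('x');
        }
        return sb.ToString();
    }
}
```

For a handful of operations the difference is unlikely to matter, but when a large document is modified many times in a loop, the repeated copying in the string version can add up.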