How to get text from source code files?

jenni2008

Hi all: I am a project manager for a large (very) project. I have been assigned the task of creating plain-text versions of web pages by removing the <> tags and sending them to the legal dept for review. Is there an automated way to do this? The number of documents is very large, in excess of 1000. Thank you. Jenni

led mike

jenni2008 wrote:

I am a project manager for a large (very) project.

Then I'm not sure you are allowed to use these forums.

jenni2008 wrote:

I am assigned the task of creating plain text version of web pages

Don't you have a staff of developers that know how to do this? I mean if you don't, why does the company need a project manager?

jenni2008 wrote:

Is there an automated way to do this ?

Yes. I highly recommend using computers and software as a means of automating the task.

led mike

Christian Graus

You could do this with a regex and with C# code ( this is not an ASP.NET question ) that requests each page and then writes the text. One would hope there's an easy way to discover the 1000 pages, perhaps by adding code that finds and pursues links ?
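The regex idea can be sketched roughly as follows. This is in Python rather than C#, and it is only an approximation: the function name strip_tags and the three patterns are assumptions for illustration, not anything from the posts above. Regexes cannot handle every malformed page, and fetching the pages and following links is left out entirely.

```python
import html
import re

# Hypothetical helper patterns (not from the thread).
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Return markup with tags removed and entities decoded (approximate)."""
    text = _COMMENT.sub("", markup)      # drop HTML comments
    text = _SCRIPT_STYLE.sub("", text)   # drop script/style blocks with their contents
    # Replace each remaining tag with a space so adjacent paragraphs do not
    # run together; the cost is a stray space where a tag sits inside a word.
    text = _TAG.sub(" ", text)
    text = html.unescape(text)           # &amp; -> &, &lt; -> <, and so on
    # Collapse the whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "<html><body><p>Terms &amp; <b>Conditions</b></p></body></html>"
    print(strip_tags(sample))
```

Note that the output is a single line per page; whether that is acceptable for a legal review is a separate question.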

Christian Graus Driven to the arms of OSX by Vista.

led mike

A tie, but you get the first position. Teacher's pet ;P

led mike

Christian Graus

*grin* I was just reading through the forums and thinking you seem especially bitter this morning. Most of the questions are ridiculous, but still, have you had a bad day, or are you just worn down by the flood of homework questions ?

Christian Graus Driven to the arms of OSX by Vista.

led mike

Christian Graus wrote:

you seem especially bitter this morning

Christian Graus wrote:

but still, have you had a bad day, or

I'm so misunderstood. I let my creative flair govern my replies, not my mood. :laugh::laugh: Ok, I admit it, I have no creative flair. :-O However I also have almost no emotion so I don't think it has anything to do with my mood. I can't really explain, maybe it has mostly to do with how I interpret, or read between the lines of, the loser posts. :) Interpreting text messages is fairly inaccurate due to the lossiness. No expressions, body language, tone.

led mike

ptrckmc249

I use the following script. Put it in a file called RemoveTags.txt and execute it from the MS-DOS prompt:

    C:/biterScripting/biterScripting.exe RemoveTags.txt dir("") files("*.html")

If you really have 1000 documents, it may take a while. Hope this helps. (If you don't have biterScripting, go to biterScripting.com -> download.)

Patrick

    # START OF SCRIPT

    var str files    # patterns for file names
    var str dir      # dir where entire project is

    # Collect a list of files
    var str fileList
    find -rn $files $dir > $fileList

    # Process files one by one
    while ( $fileList <> "")
    do
        # Get the next file
        var str file
        lex "1" $fileList > $file

        # Read the file contents into a variable.
        var str content
        cat $file > $content

        # Remove all <> tags
        while ( { sen -r "^<&>^" $content } > 0 )
            sal -r "^<&>^" "" $content > null

        # All <> are now removed in this one file. $content has the modified content.
        # sen = string enumerator, sal = string alterer, & = regular expression that matches any number of
        # any characters. <&> means, heck, find out from the help pages.

        # If you want to remove empty lines, do in a loop like above, sal "^\n\n^" "\n" $content > null

        # Get the file name without the ending .html, etc.
        stex "[^.^l" $file > null
        # stex means string extractor. l means last instance. [ means, ... heck, find out from the help pages.

        # Add .txt extension to file name.
        set $file = $file + ".txt"

        # Write the modified content to the .txt file.
        echo -e "DEBUG: Writing file " $file
        echo $content > { echo $file }
    done # end of do after while ( $fileList <> "")

    # All text versions are now available in corresponding .txt files in the same directories for
    # the 1000 of your files.
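For anyone without biterScripting, the same steps (collect the files, read each one, drop the tags, write a .txt next to the original) might look roughly like this in Python. It is a sketch under assumptions, not a port of the script: it uses the standard library's html.parser instead of a tag pattern, and the names TextExtractor, html_to_text, convert_tree, BLOCK_TAGS and SKIP_TAGS are made up for illustration. It also assumes the pages are UTF-8 and skips script and style contents, which the script above does not do.

```python
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical tag sets: where to break lines, and whose contents to skip.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Normalize whitespace within each line and drop empty lines.
        lines = (" ".join(line.split())
                 for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


def convert_tree(root, pattern="*.html"):
    """Write a .txt file next to every file under root matching pattern."""
    written = []
    for source in sorted(Path(root).rglob(pattern)):
        markup = source.read_text(encoding="utf-8", errors="replace")
        target = source.with_suffix(".txt")  # page.html -> page.txt
        target.write_text(html_to_text(markup), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    import sys
    for path in convert_tree(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("Wrote", path)
```

Run it with the project directory as the only argument. Like the script above, it overwrites any existing .txt file that has the same base name as a page.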