Replacing Bytes in an Office 2003 Word Document
-
I'm working on a utility to do a find-and-replace operation on Word documents (2003 version, not docx). We have a version of this that uses automation; however, it's too slow, and it runs into problems with documents that are linked to Excel or contain macros (and anything else that causes message boxes to show). I've got code sorted out that does this by finding and replacing the bytes in a byte array of the file (see below). This works perfectly as long as the bytes I write are the same length as the bytes I find. If I change the length of the file, the document will no longer open in Word. When I make the required change manually (i.e. through MS Word), the byte length of the file doesn't change, so I'm assuming there must be a buffer somewhere in the file that is getting used. Please give me feedback on how to update the file correctly.
Code: (please excuse the rough nature of this code - I'm prototyping! Also, DoReplace is based on code downloaded from T'Interweb. Can't remember where, but if it's yours, thanks!)
Imports System.IO

Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim find As String = "String To Find"
        Dim replace As String = "String To Replace"
        Dim path As String = "c:\InputPath.doc"
        Dim updatedpath As String = "c:\OutputPath.doc"
        ' Windows-1252 encoding is used for both the search and the replacement string
        Dim encoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(1252)
        Dim fi As New FileInfo(path)
        Dim fs As New FileStream(path, FileMode.Open)
        ' VB array bounds are inclusive, so the upper bound is Length - 1
        Dim bytes(CInt(fs.Length) - 1) As Byte
        fs.Read(bytes, 0, CInt(fs.Length))
        fs.Close()
        Dim newBytes() As Byte = DoReplace(bytes, encoding.GetBytes(find), encoding.GetBytes(replace))
        fs = New FileStream(updatedpath, FileMode.Create)
        fs.Write(newBytes, 0, newBytes.Length)
        fs.Close()
    End Sub

    Public Function DoReplace(ByVal bytes As Byte(), ByVal findBytes As Byte(), ByVal replaceBytes() As Byte) As Byte()
        Dim newBytes As New Generic.List(Of Byte)
        Dim ndx As Integer = 0
        For x As Integer = 0 To bytes.Length - 1 ' bytes is the original file's bytes
            If bytes(x) = findBytes(ndx) Then ' findBytes is a byte[] from the "find" string
                If ndx = (findBytes.Length - 1) Then
                    For y As Integer = 0 To replaceBytes.Length - 1 'replaceBytes
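The equal-length case that the post describes as working can be sketched as follows. This is a minimal illustration in Python rather than the thread's VB.NET, and the function name `replace_same_length` is hypothetical. It assumes, based on the binary format specification, that Word may store text either as 8-bit (Windows-1252) or as 16-bit (UTF-16LE) characters, so it searches for both encodings. It is a blind byte search, so it can also hit bytes that are not document text.

```python
def replace_same_length(data: bytes, find: str, replace: str) -> bytes:
    """Replace `find` with `replace` in raw file bytes without changing the length.

    Hypothetical helper: tries the two text encodings a binary Word file is
    assumed to use, and refuses any replacement that would change byte length.
    """
    out = data
    for codec in ("cp1252", "utf-16-le"):
        try:
            old = find.encode(codec)
            new = replace.encode(codec)
        except UnicodeEncodeError:
            # The strings cannot be represented in this encoding, so skip it.
            continue
        if len(old) != len(new):
            raise ValueError("replacement would change the file length")
        out = out.replace(old, new)
    return out
```

Because the length never changes, none of the offsets stored elsewhere in the file are disturbed, which is presumably why this case keeps working while longer or shorter replacements do not.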
-
The '95 versions didn't rewrite the entire document on every save; they appended changes to the end of the file, which is a bit faster than writing the whole document anew. You can find the documentation through the link below. Be warned though, there's dragons there :)
http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf
I are Troll :suss:
-
Eddy Vluggen wrote:
there's dragons there
They are everywhere. :)
Luc Pattyn
I only read code that is properly indented, and rendered in a non-proportional font; hint: use PRE tags in forum messages
-
You need to understand the structure and format of a Word file before you can reliably change anything. Changing a number of bytes in the file without knowing exactly what those bytes may be used for is a recipe for disaster.
Yeah - got it to work after a fashion, by adjusting the empty bytes that are found after the main content of the doc. However, handling headers, footers and everything else is (as you say) horrible. I think if I had another couple of years I might take this further; however, I'm off Xmas shopping instead! :)
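On the point about understanding the file structure: a Word 97-2003 .doc is an OLE compound file, a container that stores its streams in fixed-size sectors. That is one likely reason the size stays the same after a small edit in Word, and why shifting bytes breaks the file. The "empty bytes" after the main content may be sector padding, though that is an assumption. The sketch below, in Python rather than the thread's VB.NET, only reads the container header to report the sector sizes. The name `describe_container` is hypothetical, and it does not parse the Word stream itself.

```python
import struct

# Signature found in the first 8 bytes of every OLE compound file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def describe_container(header: bytes) -> dict:
    """Report the sector sizes declared in a compound file header.

    Hypothetical helper: `header` is at least the first 34 bytes of the file.
    """
    if len(header) < 34 or header[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE compound file")
    # Offsets 26..33: major version, byte order mark, sector shift,
    # mini sector shift (all little-endian unsigned 16-bit values).
    major, _byte_order, sector_shift, mini_shift = struct.unpack_from("<4H", header, 26)
    return {
        "major_version": major,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }
```

A typical version 3 file declares a sector shift of 9, i.e. 512-byte sectors. Inserting even one byte moves every later sector off its boundary, so the allocation tables no longer point at the right data.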
-
Thanks for the info. There are indeed dragons! Unfortunately this has to work on thousands of documents (corporate rebrand!), and there's no way I can guarantee that every doc will work within my timescales. We'll just have to get a temp to click on the (unstoppable!) message boxes that pop up.
-
I know it will run somewhat slower; however, how about using Office Automation to open each doc, do the string substitution, and save the resulting doc? We did some work recently that, thankfully, involved docs saved in WordML format. Re-saving as WordML in the middle of your process may make the substitution easier.
If you have knowledge, let others light their candles at it. Margaret Fuller (1810 - 1850)
-
jonegerton wrote:
Unfortunately this has to work on thousands of documents (corporate rebrand!)
So I guess you have been tasked with replacing all occurrences of "Tiger Woods" with "Tom Watson", or something along those lines?

I tried to make something like that work years ago, i.e. something that tried to re-create the Word "save" logic at a low level. My recollection is that somewhere in the file there is a field holding the length of the data or (less likely) a checksum. I thought I was adjusting that properly, but I never did manage to create "valid" Word documents. Apparently there was some other checksum somewhere that I did not know about.

Eventually I ended up doing the job with automation. I managed to work through the message-box-related issues... I think there are ways to detect the error condition and kill Winword.exe. In the worst case, you could just assume an error occurred after a certain length of time. None of this is beautiful, but in the end it proved more workable than manually messing around with the file. And I did try mightily to make that work... I was just out of college and had been immersed in a thesis that used Intel assembly, so the low-level approach was definitely the one I preferred.

One more thought: the "DOCX" format of Office 2007 is much more regular and better documented than the old melange of DOC formats. I think a DOCX is basically a zipped-up collection of XML documents and embedded image files. Have you considered converting to DOCX as the first step of the process? It might make your life easier.
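The DOCX suggestion can be sketched roughly like this, assuming the documents have already been converted to DOCX. The example is in Python rather than the thread's VB.NET, and `replace_in_docx` is a hypothetical name. It treats the file as a ZIP archive and does a plain string replacement in the XML parts under `word/`, assuming they are UTF-8 encoded. One known limitation: Word can split a phrase across several runs in the XML, and in that case a simple search like this will not find it.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def replace_in_docx(docx_bytes: bytes, find: str, replace: str) -> bytes:
    """Return a copy of a DOCX archive with `find` replaced in its XML parts.

    Hypothetical helper: a plain text substitution, not a full document model.
    """
    # Text is stored XML-escaped, so escape both strings before matching.
    find_xml, replace_xml = escape(find), escape(replace)
    out_buf = io.BytesIO()
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    with src, zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            payload = src.read(item.filename)
            # Only the XML parts under word/ (body, headers, footers, ...).
            if item.filename.startswith("word/") and item.filename.endswith(".xml"):
                text = payload.decode("utf-8")
                payload = text.replace(find_xml, replace_xml).encode("utf-8")
            dst.writestr(item.filename, payload)
    return out_buf.getvalue()
```

Since the archive is rebuilt entry by entry, the replacement can be longer or shorter than the original text without any of the offset bookkeeping that the binary format needs.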