Replacing Bytes in an Office 2003 Word Document

    jonegerton wrote (#1):

    I'm working on a utility to do a find-and-replace operation on Word documents (2003 version, not docx). We have a version of this that uses automation; however, it's too slow, and it runs into problems with documents that are linked to Excel or contain macros (and anything else that causes message boxes to show).

    I've got code that does this by finding and replacing bytes in a byte array of the file (see below). This works perfectly as long as the replacement bytes are the same length as the bytes I need to find. If I change the length of the file, the document will no longer open in Word.

    When I make the required change manually (i.e. through MS Word), the byte length of the file doesn't change, so I'm assuming there must be a buffer somewhere in the file that is being used. Please give me feedback on how to update the file correctly.

    Code: (please excuse the rough nature of this code - I'm prototyping! Also, DoReplace is based on code downloaded from T'Interweb. I can't remember where, but if it's yours, thanks!)

    Imports System.IO

    Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

        Dim find As String = "String To Find"
        Dim replace As String = "String To Replace"
        Dim path As String = "c:\InputPath.doc"
        Dim updatedpath As String = "c:\OutputPath.doc"
        ' Windows-1252 (Western European) code page
        Dim encoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(1252)

        Dim fi As New FileInfo(path)

        Dim fs As New FileStream(path, FileMode.Open)

        ' VB array bounds are inclusive, so the upper bound must be Length - 1;
        ' otherwise an extra zero byte ends up in the output file
        Dim bytes(CInt(fs.Length) - 1) As Byte

        fs.Read(bytes, 0, bytes.Length)
        fs.Close()

        Dim newBytes() As Byte = DoReplace(bytes, encoding.GetBytes(find), encoding.GetBytes(replace))

        fs = New FileStream(updatedpath, FileMode.Create)
        fs.Write(newBytes, 0, newBytes.Length)
        fs.Close()

    End Sub

    Public Function DoReplace(ByVal bytes As Byte(), ByVal findBytes As Byte(), ByVal replaceBytes() As Byte) As Byte()

        Dim newBytes As New Generic.List(Of Byte)
        Dim ndx As Integer = 0

        For x As Integer = 0 To bytes.Length - 1
            ' bytes holds the original file's bytes
            If bytes(x) = findBytes(ndx) Then
                ' findBytes is a Byte() built from the "find" string
                If ndx = (findBytes.Length - 1) Then
                    For y As Integer = 0 To replaceBytes.Length - 1
                        'replaceBytes
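
The listing is cut off in the thread before DoReplace finishes, and its remainder is not reconstructed here. As a rough, separate sketch of the same idea (not the original DoReplace), a byte-level find/replace could look something like the following. The names `ByteReplaceSketch`, `ReplaceBytes` and `MatchesAt` are made up for illustration. It only swaps bytes: it does nothing about any lengths or offsets the .doc format keeps elsewhere in the file, which is presumably why a replacement of a different length stops the document opening in Word.

```vb
Imports System.Collections.Generic

Public Module ByteReplaceSketch

    ' Hypothetical helper: returns a copy of source with every occurrence
    ' of find replaced by replacement. Not the DoReplace from the post.
    Public Function ReplaceBytes(ByVal source As Byte(), ByVal find As Byte(), ByVal replacement As Byte()) As Byte()
        ' Nothing to search for: hand back an unchanged copy
        If find.Length = 0 Then Return DirectCast(source.Clone(), Byte())

        Dim result As New List(Of Byte)(source.Length)
        Dim i As Integer = 0

        While i < source.Length
            If MatchesAt(source, find, i) Then
                ' Full match at position i: emit the replacement and skip the match
                result.AddRange(replacement)
                i += find.Length
            Else
                ' No match here: copy one byte and move on
                result.Add(source(i))
                i += 1
            End If
        End While

        Return result.ToArray()
    End Function

    ' True when find occurs in source starting exactly at index start
    Private Function MatchesAt(ByVal source As Byte(), ByVal find As Byte(), ByVal start As Integer) As Boolean
        If start + find.Length > source.Length Then Return False
        For j As Integer = 0 To find.Length - 1
            If source(start + j) <> find(j) Then Return False
        Next
        Return True
    End Function

End Module
```

With the variables from the post, a call might read `Dim newBytes() As Byte = ReplaceBytes(bytes, encoding.GetBytes(find), encoding.GetBytes(replace))`. Checking for a full match at each position avoids the missed matches that a single running index can produce when a partial match has to restart.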
    
      Lost User wrote (#2):

      You need to understand the structure and format of a Word file before you can reliably change anything. Changing a number of bytes in the file without knowing exactly what those bytes may be used for is a recipe for disaster.
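
As a small first step in that direction: a Word 97-2003 .doc is stored as an OLE compound file (structured storage), and such files begin with a fixed 8-byte signature. The sketch below only checks for that signature before a file is touched. The module and function names are illustrative, and passing the check says nothing about where the text actually lives inside the file.

```vb
Imports System.IO

Public Module CompoundFileCheck

    ' Signature found at the start of every OLE compound file
    Private ReadOnly Signature As Byte() = {&HD0, &HCF, &H11, &HE0, &HA1, &HB1, &H1A, &HE1}

    ' True when the supplied header bytes start with the compound file signature
    Public Function HasCompoundFileSignature(ByVal header As Byte()) As Boolean
        If header Is Nothing OrElse header.Length < Signature.Length Then Return False
        For i As Integer = 0 To Signature.Length - 1
            If header(i) <> Signature(i) Then Return False
        Next
        Return True
    End Function

    ' Reads the first 8 bytes of the file and checks them
    Public Function IsCompoundFile(ByVal path As String) As Boolean
        Dim header(7) As Byte
        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
            Dim count As Integer = fs.Read(header, 0, header.Length)
            If count < header.Length Then Return False
        End Using
        Return HasCompoundFileSignature(header)
    End Function

End Module
```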

        Lost User wrote (#3):

        The '95 versions didn't save an entire document, but they appended changes to the last part of the document. That's a bit faster than writing the entire document anew. You can find the documentation through the link below. Be warned though, there's dragons there :) http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf

        I are Troll :suss:


          Luc Pattyn wrote (#4):

          Eddy Vluggen wrote:

          there's dragons there

          They are everywhere. :)


            jonegerton wrote (#5):

            Yeah - got it to work after a fashion, by adjusting the empty bytes that are found after the main content of the doc. However, handling headers, footers and everything else is (as you say) horrible. I think if I had another couple of years I might take this further; however, I'm off Xmas shopping instead! :)


              jonegerton wrote (#6):

              Thanks for the info. There are indeed Dragons! - Unfortunately this has to work on thousands of documents (corporate rebrand!) and there's no way I can guarantee that every doc will work within my timescales. We'll just have to get a temp to click on the message boxes that pop up (that are unstoppable!)



                Lost User wrote (#7):

                Now there's a gem that I still need to revisit. ..I'll plan it right after the course on time-management :laugh:

                I are Troll :suss:


                  The Man from U N C L E wrote (#8):

                  I know it will run somewhat slower, however ... How about using Office Automation to open each doc, do the string substitution and save the resulting doc. We did some work recently that thankfully involved docs saved in WordML format. Resaving in the middle of your process as WordML may make the substitution easier.
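
A minimal late-bound sketch of that automation route, assuming Word is installed on the machine running it. `WordAutomationSketch` and `ReplaceInDocument` are illustrative names, and the numeric constants mirror the Word and Office enumeration values so that no interop reference is needed. Turning off alerts and disabling macros may reduce the prompts mentioned earlier in the thread, but it is not guaranteed to suppress all of them (link-update prompts, for instance).

```vb
Option Strict Off

Public Module WordAutomationSketch

    ' Values of the corresponding Word/Office enumeration members
    Private Const wdAlertsNone As Integer = 0
    Private Const msoAutomationSecurityForceDisable As Integer = 3
    Private Const wdFindContinue As Integer = 1
    Private Const wdReplaceAll As Integer = 2
    Private Const wdDoNotSaveChanges As Integer = 0

    ' Opens inputPath in Word, replaces findText in the main body text,
    ' and saves the result to outputPath.
    Public Sub ReplaceInDocument(ByVal inputPath As String, ByVal outputPath As String,
                                 ByVal findText As String, ByVal replaceText As String)
        Dim app As Object = CreateObject("Word.Application")
        Try
            app.Visible = False
            app.DisplayAlerts = wdAlertsNone
            ' Stop macros in opened documents from running
            app.AutomationSecurity = msoAutomationSecurityForceDisable

            Dim doc As Object = app.Documents.Open(inputPath)

            ' Content is the main text story only; headers and footers are not covered
            doc.Content.Find.Execute(FindText:=findText, ReplaceWith:=replaceText,
                                     Replace:=wdReplaceAll, Wrap:=wdFindContinue)

            doc.SaveAs(outputPath)
            doc.Close(wdDoNotSaveChanges)
        Finally
            ' Always shut Word down, even if something above failed
            app.Quit(wdDoNotSaveChanges)
        End Try
    End Sub

End Module
```

Passing a `FileFormat` value to `SaveAs` (for example `wdFormatXML`, value 11, for WordML) would be the place to try the re-save idea; that step is left out here to keep the sketch short.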


                    User 3677987 wrote (#9):

                    jonegerton wrote:

                    Unfortunately this has to work on thousands of documents (corporate rebrand!)

                    So I guess you have been tasked with replacing all occurrences of "Tiger Woods" with "Tom Watson" or something along those lines?

                    I tried to make something like that work years ago, i.e. something that tried to re-create the Word "save" logic at a low level. My recollection is that somewhere in the file there is a field holding the length of the data or (less likely) a checksum. I thought I was adjusting that properly, but I never did manage to create "valid" Word documents. Apparently there was some other checksum somewhere that I did not know about.

                    Eventually I ended up doing the job with automation. I managed to work through the message-box-related issues... I think there are ways to detect the error condition and kill Winword.exe. In the worst case, you could just assume an error occurred after a certain length of time. None of this is beautiful, but in the end it proved more workable than manually messing around with the file. And I did try mightily to make that work... I was just out of college, had been immersed in a thesis that used Intel assembly, and the low-level approach was definitely the one I preferred.

                    One more thought: the "DOCX" format of Office 2007 is much more regular and better documented than the old melange of DOC formats. I think a DOCX is basically a zipped-up collection of XML documents and embedded image files. Have you considered converting to DOCX as the first step of the process? It might make your life easier.
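
Assuming the documents have already been converted to DOCX, the main body text normally sits in the `word/document.xml` part of the zip package. The sketch below uses `System.IO.Compression` (so .NET Framework 4.5 or later, with references to System.IO.Compression and System.IO.Compression.FileSystem); `DocxReplaceSketch` and `ReplaceInDocx` are illustrative names. It is a plain string replace on the XML, so it carries real limits: Word can split a phrase across several runs, in which case it will not match; headers and footers live in separate parts; and the find and replace strings must already be XML-escaped.

```vb
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Public Module DocxReplaceSketch

    ' Replaces findText with replaceText in the main document part of a DOCX file.
    Public Sub ReplaceInDocx(ByVal docxPath As String, ByVal findText As String, ByVal replaceText As String)
        Const partName As String = "word/document.xml"

        Using archive As ZipArchive = ZipFile.Open(docxPath, ZipArchiveMode.Update)
            Dim entry As ZipArchiveEntry = archive.GetEntry(partName)
            If entry Is Nothing Then Throw New InvalidDataException(partName & " not found")

            ' Read the whole part as text
            Dim xml As String
            Using reader As New StreamReader(entry.Open(), Encoding.UTF8)
                xml = reader.ReadToEnd()
            End Using

            xml = xml.Replace(findText, replaceText)

            ' Replace the part: drop the old entry and write a fresh one
            entry.Delete()
            Dim newEntry As ZipArchiveEntry = archive.CreateEntry(partName)
            Using writer As New StreamWriter(newEntry.Open(), New UTF8Encoding(False))
                writer.Write(xml)
            End Using
        End Using
    End Sub

End Module
```

Because the package is rewritten by the zip library, no length fields or offsets need patching by hand, which is the main attraction over editing the binary .doc directly.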
