OOXML parsing problem

    don_Pardon wrote (#1):

    Hello. I am working on a docx parser. All I need is to extract the text and some layout information (font name, font size, alignment and so on). I have run into a problem while parsing the document's main part (document.xml): word division. For some reason Word 2007 sometimes splits a word across different runs. Example: "Identifier information about the certificate authority that issued the certificate." In document.xml it looks like this:

    <w:p w:rsidR="00000000" w:rsidRDefault="00B91456">
      <w:pPr>
        <w:pStyle w:val="BulletedList" />
        <w:numPr>
          <w:ilvl w:val="0" />
          <w:numId w:val="72" />
        </w:numPr>
      </w:pPr>
      <w:r>
        <w:t>Iden</w:t>
      </w:r>
      <w:r>
        <w:t>tifier information about the certificate authority that issued the certificate.</w:t>
      </w:r>
    </w:p>

    Another example: "The server certificate name specified should match the fully qualified domain name." In document.xml it looks like this:

    <w:p w:rsidR="00000000" w:rsidRDefault="00B91456">
      <w:pPr>
        <w:pStyle w:val="Listalerttext" />
        <w:framePr w:wrap="notBeside" />
      </w:pPr>
      <w:r>
        <w:t>The server c</w:t>
      </w:r>
      <w:r>
        <w:t>ertificate name specified should match the fully qualified domain name.</w:t>
      </w:r>
    </w:p>

    What is the reason behind this? And, more importantly, how can I handle it? This situation occurs many times and I really don't know what to do. Word 2007 handles it somehow; can anyone suggest how? Thanks.
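
For illustration only, here is a minimal Python sketch of one common way to cope with a word split across runs: ignore run boundaries when reading text and concatenate every w:t in the paragraph in document order. The helper name paragraph_text and the trimmed-down SAMPLE paragraph are assumptions made for this example; the sketch only looks at w:t elements that are direct children of w:r directly under w:p, so runs nested in other containers (hyperlinks, for instance) would need extra handling.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Trimmed-down copy of the first paragraph from the post (sample data).
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:pPr><w:pStyle w:val="BulletedList"/></w:pPr>
  <w:r><w:t>Iden</w:t></w:r>
  <w:r><w:t>tifier information about the certificate authority that issued the certificate.</w:t></w:r>
</w:p>"""


def paragraph_text(p):
    """Join the text of every w:t found in the paragraph's runs, in order.

    Hypothetical helper: run boundaries are ignored on purpose, so a word
    that was split across two runs comes out whole again.
    """
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


paragraph = ET.fromstring(SAMPLE)
print(paragraph_text(paragraph))
```

Nothing is inserted between the pieces: any space that belongs to the sentence is already stored inside a w:t, so adding separators at run boundaries is what would break words apart.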

      led mike wrote (#2):

      Member 4083157 wrote:

      What is the reason behind this? And, what is more important, how can I handle this?

      I don't know, but if I needed to know I might start by reading the documentation rather than typing messages into internet forums. :rolleyes:

      led mike

        don_Pardon wrote (#3):

        Well, that's very good advice, but the problem is that I have already read the WordprocessingML part five times and didn't find anything about my problem. That's why I'm asking here.

          led mike wrote (#4):

          Member 4083157 wrote:

          I already read the WordprocessingML part 5 times

          You should state in your posts what documentation you read and how/why it seems not to provide your solution. Otherwise someone might waste their time providing you with a link, because most people here on Code Project never read any documentation. ;) Also, if you have read the documentation, it is confusing that you do not understand the nature of the run element. Can you please link to the documentation you cited and explain why it does not (for you) indicate the nature of the run element?

          led mike

            don_Pardon wrote (#5):

            OK, sorry, I'll cite the documentation in future =) The OOXML documentation, part 4, says that a run defines a region of text with a common set of properties, represented by the r element. An r element allows the producer to specify a single set of formatting properties, applying the same information to all the contents of the run. As I understand it, this means that a region of text (within one paragraph) with a common set of properties can be placed in one run. Is that not correct?

            The question is why there are so many runs with the same properties in one paragraph. In most documents I looked at, there were lots of runs within a paragraph and all of them had the same properties. In my case, even a word was divided. I came to the conclusion that after some change of properties (and then a change back to the original), text can end up divided into different runs, but how should I handle this?

            What really confuses me is the xml:space="preserve" attribute. The OOXML specification refers to http://www.w3.org/XML/1998/namespace, which says:

            The value "default" signals that applications' default white-space processing modes are acceptable for this element;
            the value "preserve" indicates the intent that applications preserve all the white space.
            This declared intent is considered to apply to all elements within the content of the element where it is specified,
            unless overridden with another instance of the xml:space attribute.

            But I really couldn't find any logic in how it works in OOXML.
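
As a hedged sketch of how a parser might deal with many same-property runs, the Python below merges neighbouring runs whose w:rPr looks identical and keeps the text of each w:t exactly as stored, including leading and trailing spaces marked with xml:space="preserve". The helper names (run_properties_key, merged_spans) and the sample paragraph are invented for this example. The comparison is deliberately shallow: it only looks at the direct children of w:rPr and their attributes, and it ignores formatting inherited from styles.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Made-up paragraph: a split word, one bold run, and significant spaces
# at run edges marked with xml:space="preserve".
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:t>The server c</w:t></w:r>
  <w:r><w:t xml:space="preserve">ertificate name </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>specified</w:t></w:r>
  <w:r><w:t xml:space="preserve"> should match.</w:t></w:r>
</w:p>"""


def run_properties_key(run):
    """Build a comparable key from the run's w:rPr (empty tuple if absent).

    Hypothetical helper; shallow on purpose: only the direct children of
    w:rPr and their attributes are compared.
    """
    rpr = run.find("w:rPr", NS)
    if rpr is None:
        return ()
    return tuple((child.tag, tuple(sorted(child.attrib.items()))) for child in rpr)


def merged_spans(p):
    """Return [(properties_key, text), ...], merging neighbours with equal keys."""
    spans = []
    for run in p.findall("w:r", NS):
        key = run_properties_key(run)
        # Text is taken verbatim; spaces at the edges of a w:t are kept.
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if spans and spans[-1][0] == key:
            spans[-1] = (key, spans[-1][1] + text)
        else:
            spans.append((key, text))
    return spans


for key, text in merged_spans(ET.fromstring(SAMPLE)):
    print(repr(text), "bold" if key else "plain")
```

With this approach the first two runs collapse into one span because neither has a w:rPr, while the bold run stays separate. Trimming or normalising the text of each w:t before joining would lose the spaces that separate words at run boundaries.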

              don_Pardon wrote (#6):

              The problem is solved. Everything said above about xml:space="preserve" is correct. The problem was in my viewer, which displayed the documents incorrectly.

                led mike wrote (#7):

                Member 4083157 wrote:

                The problem was in my viewer, which displayed the documents incorrectly.

                I am glad you resolved the issue, and thanks for posting back your findings. Which viewer were you using?

                led mike
