OOXML parsing problem

don_Pardon

Hello. I am working on a docx parser. All I need is to extract the text and some layout info (such as font name, font size, alignment and so on). I ran into a problem while parsing the document's main part (document.xml). The problem is word division. For some reason Word 2007 sometimes splits a word across different runs. Example: "Identifier information about the certificate authority that issued the certificate." In document.xml it looks like

<w:p w:rsidR="00000000" w:rsidRDefault="00B91456">
<w:pPr>
<w:pStyle w:val="BulletedList" />
<w:numPr>
<w:ilvl w:val="0" />
<w:numId w:val="72" />
</w:numPr>
</w:pPr>
<w:r>
<w:t>Iden</w:t>
</w:r>
<w:r>
<w:t>tifier information about the certificate authority that issued the certificate.</w:t>
</w:r>
</w:p>

Another example: "The server certificate name specified should match the fully qualified domain name." In document.xml it looks like

<w:p w:rsidR="00000000" w:rsidRDefault="00B91456">
<w:pPr>
<w:pStyle w:val="Listalerttext" />
<w:framePr w:wrap="notBeside" />
</w:pPr>
<w:r>
<w:t>The server c</w:t>
</w:r>
<w:r>
<w:t>ertificate name specified should match the fully qualified domain name.</w:t>
</w:r>
</w:p>

What is the reason behind this? And, more importantly, how can I handle it? This situation occurs many times and I have no idea what to do. Word 2007 somehow handles it, so can anyone suggest how? Thanks.

led mike

Member 4083157 wrote:

What is the reason behind this? And, what is more important, how can I handle this?

I don't know, but if I needed to know I might start by reading the documentation rather than typing messages into internet forums. :rolleyes:

led mike

don_Pardon

Well, that's very good advice, but the problem is that I have already read the WordprocessingML part 5 times and didn't find anything regarding my problem. That's why I'm asking here.

led mike

Member 4083157 wrote:

I already read the WordprocessingML part 5 times

You should state in your posts what documentation you read and how/why it seems not to provide your solution. Otherwise someone might waste their time providing you with a link, because most people here on Code Project never read any documentation. ;) Also, if you did read the documentation, it is confusing that you do not understand the nature of the run element. Can you please link to the documentation you cited and explain why it does not (for you) indicate the nature of the run element?

led mike

don_Pardon

Ok, sorry, I'll cite the documentation in future =) The OOXML documentation, part 4, says that a run defines a region of text with a common set of properties, represented by the r element. An r element allows the producer to specify a single set of formatting properties, applying the same information to all the contents of the run. As I understand it, this means that a region of text (from the same paragraph) with a common set of properties can be placed in one run. Is that not correct?

The question is why there are so many runs with the same properties in one paragraph. In most documents I looked at, there were lots of runs within a paragraph and all of them had the same properties. In my case, even a word was divided. I came to the conclusion that after some change of properties (and then a change back to the original), text can end up split across different runs, but how should I handle this?

What really confuses me is the xml:space="preserve" attribute. The OOXML specification refers to http://www.w3.org/XML/1998/namespace, which says

The value "default" signals that applications' default white-space processing modes are acceptable for this element;
the value "preserve" indicates the intent that applications preserve all the white space.
This declared intent is considered to apply to all elements within the content of the element where it is specified,
unless overridden with another instance of the xml:space attribute.

But I really couldn't find any logic of how it works in OOXML..
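One way to handle the split runs, as far as I can tell, is to treat run boundaries as carrying no meaning of their own and to merge neighbouring runs whose properties are equal. Below is only a minimal sketch of that idea in Python, not anything taken from the specification: the names `props_key` and `merged_runs` and the sample paragraph are made up for illustration, and it assumes that comparing the `w:rPr` subtrees structurally is a good enough test for "same properties".

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hypothetical sample: one word split over two plain runs, then a bold run.
SAMPLE = (
    '<w:p xmlns:w="' + W_NS + '">'
    '<w:r><w:t>Iden</w:t></w:r>'
    '<w:r><w:t>tifier information</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve"> bold</w:t></w:r>'
    '</w:p>'
)

def props_key(elem):
    """Build a hashable, comparable description of an element subtree."""
    if elem is None:
        return None
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        tuple(props_key(child) for child in elem),
    )

def merged_runs(paragraph):
    """Return the text of each group of adjacent runs with equal w:rPr."""
    merged = []  # list of (properties key, accumulated text)
    for run in paragraph.iter(W + "r"):
        key = props_key(run.find(W + "rPr"))
        # Concatenate the w:t pieces unchanged; whitespace is significant.
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        if merged and merged[-1][0] == key:
            merged[-1] = (key, merged[-1][1] + text)
        else:
            merged.append((key, text))
    return [text for _, text in merged]

print(merged_runs(ET.fromstring(SAMPLE)))
```

The sketch ignores properties inherited from paragraph or character styles, so two runs it keeps apart might still render identically.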

don_Pardon

The problem is solved. Everything said about xml:space="preserve" above is correct. The problem was in my viewer, which displayed the documents incorrectly.
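For anyone who finds this thread later, here is a rough sketch of what "correct" means for plain text extraction: concatenate the `w:t` contents of a paragraph in document order and do not trim them. The function names and the sample paragraph are my own (hypothetical) and only mimic the second example from my first post; the stripped variant is there just to show the kind of mistake that glues words together.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hypothetical sample: the last run starts with a space that must be kept.
SAMPLE = (
    '<w:p xmlns:w="' + W_NS + '">'
    '<w:r><w:t>The server c</w:t></w:r>'
    '<w:r><w:t>ertificate name</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> specified</w:t></w:r>'
    '</w:p>'
)

def paragraph_text(paragraph):
    """Join all w:t text in document order, keeping whitespace as stored."""
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))

def paragraph_text_stripped(paragraph):
    """Deliberately wrong variant: trimming each piece loses the space."""
    return "".join((t.text or "").strip() for t in paragraph.iter(W + "t"))

p = ET.fromstring(SAMPLE)
print(paragraph_text(p))           # The server certificate name specified
print(paragraph_text_stripped(p))  # The server certificate namespecified
```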

led mike

Member 4083157 wrote:

The problem was in my viewer, which displayed the documents incorrectly.

I am glad you resolved the issue, and thanks for posting back your findings. Which viewer were you using?

led mike