OOXML parsing problem
-
Hello. I am working on a docx parser. All I need is to extract the text and some layout info (such as font name, font size, alignment and so on). I have run into a problem while parsing the document's main part (document.xml). The problem is word division: for some reason Word 2007 sometimes splits a word across different runs. Example: "Identifier information about the certificate authority that issued the certificate." In document.xml it looks like
<w:p w:rsidR="00000000" w:rsidRDefault="00B91456">
<w:pPr>
<w:pStyle w:val="BulletedList" />
<w:numPr>
<w:ilvl w:val="0" />
<w:numId w:val="72" />
</w:numPr>
</w:pPr>
<w:r>
<w:t>Iden</w:t>
</w:r>
<w:r>
<w:t>tifier information about the certificate authority that issued the certificate.</w:t>
</w:r>
</w:p>
The other example: "The server certificate name specified should match the fully qualified domain name." In document.xml it looks like
<w:p w:rsidR="00000000" w:rsidRDefault="00B91456">
<w:pPr>
<w:pStyle w:val="Listalerttext" />
<w:framePr w:wrap="notBeside" />
</w:pPr>
<w:r>
<w:t>The server c</w:t>
</w:r>
<w:r>
<w:t>ertificate name specified should match the fully qualified domain name.</w:t>
</w:r>
</w:p>
What is the reason behind this? And, what is more important, how can I handle this? This situation occurs many times and I have no idea what to do. Word 2007 somehow handles it; can anyone suggest how? Thanks.
-
Member 4083157 wrote:
What is the reason behind this? And, what is more important, how can I handle this?
I don't know, but if I needed to know I might start by reading the documentation rather than typing messages into internet forums. :rolleyes:
led mike
-
Well, that's very good advice, but the problem is that I already read the WordprocessingML part 5 times and I didn't find anything regarding my problem. That's why I'm asking here.
-
Member 4083157 wrote:
I already read the WordprocessingML part 5 times
You should state in your posts what documentation you read and how/why it seems not to provide your solution. Otherwise someone might waste their time providing you with a link, because most people here on CodeProject never read any documentation. ;) Also, if you read any documentation, it is confusing that you do not understand the nature of the run element. Can you please link to the documentation you cited and explain why it does not (for you) indicate the nature of the run element?
led mike
-
OK, sorry, I'll cite the documentation in future =) The OOXML documentation, part 4, says that the
run defines a region of text with a common set of properties, represented by the r element. An r element allows the producer to specify a single set of formatting properties, applying the same information to all the contents of the run.
As I understand it, this means that a region of text (within the same paragraph) with a common set of properties can be placed in one run. Am I not correct? The question is why there are so many runs with the same properties in one paragraph. In most documents I looked at, there were lots of runs within a paragraph and all of them had the same properties. In my case, even a word was divided. I came to the conclusion that after some change of properties (and then back to the original), text can be divided into different runs, but how should I handle this? What really confuses me is the xml:space="preserve" attribute. The OOXML specification refers to http://www.w3.org/XML/1998/namespace, which says
The value "default" signals that applications' default white-space processing modes are acceptable for this element;
the value "preserve" indicates the intent that applications preserve all the white space.
This declared intent is considered to apply to all elements within the content of the element where it is specified,
unless overridden with another instance of the xml:space attribute.
But I really couldn't find any logic to how it works in OOXML.
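One possible way to handle runs that share the same properties is to merge adjacent ones while reading the paragraph. The following is only a minimal sketch in Python using the standard `xml.etree.ElementTree` module; the helper names `props_key` and `merged_runs` are hypothetical and not part of any OOXML library, and the sketch assumes that two runs count as "the same formatting" when their `w:rPr` elements are structurally identical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def props_key(elem):
    """Hypothetical helper: turn an element (e.g. w:rPr) into a comparable tuple."""
    if elem is None:
        return None
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        tuple(props_key(child) for child in elem),
    )

def merged_runs(paragraph):
    """Return [(props_key, text), ...], merging adjacent runs with equal w:rPr."""
    merged = []
    for run in paragraph.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        key = props_key(run.find("w:rPr", NS))
        if merged and merged[-1][0] == key:
            # Same formatting as the previous run: glue the text together.
            merged[-1] = (key, merged[-1][1] + text)
        else:
            merged.append((key, text))
    return merged

# The first paragraph from the question, reduced to its runs.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>Iden</w:t></w:r>"
    "<w:r><w:t>tifier information about the certificate authority "
    "that issued the certificate.</w:t></w:r>"
    "</w:p>"
)

for _, text in merged_runs(ET.fromstring(SAMPLE)):
    print(text)
# prints: Identifier information about the certificate authority that issued the certificate.
```

Comparing the `w:rPr` subtree structurally rather than as serialized text keeps the comparison independent of whitespace between elements. Properties inherited from paragraph or character styles are not resolved here.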
-
The problem is solved. Everything said about xml:space="preserve" above is correct. The problem was in my viewer, which displayed the documents incorrectly.
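For readers hitting the same issue, here is a rough sketch of how a parser might rebuild paragraph text from split runs while respecting xml:space. It uses Python's standard `xml.etree.ElementTree`; the function names `t_text` and `paragraph_text` are hypothetical. It assumes that run text is concatenated with no separator, and that leading and trailing whitespace in a `w:t` without xml:space="preserve" may be trimmed. That trimming rule is an assumption of this sketch, not something confirmed in this thread. The sample paragraph is made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# ElementTree exposes xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def t_text(t):
    """Text of one w:t element, keeping edge whitespace only when preserved."""
    text = t.text or ""
    if t.get(XML_SPACE) == "preserve":
        return text
    # Assumption: without xml:space="preserve", edge whitespace is not significant.
    return text.strip()

def paragraph_text(paragraph):
    """Concatenate all w:t text in document order, adding no separators."""
    return "".join(t_text(t) for t in paragraph.iter(f"{{{W}}}t"))

# Made-up paragraph: a word split across runs, spaces carried by preserved w:t.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">The server </w:t></w:r>'
    "<w:r><w:t>c</w:t></w:r>"
    '<w:r><w:t xml:space="preserve">ertificate name </w:t></w:r>'
    "<w:r><w:t>specified</w:t></w:r>"
    "</w:p>"
)

print(paragraph_text(ET.fromstring(SAMPLE)))
# prints: The server certificate name specified
```

The key point is that run boundaries carry no meaning for the text itself: a viewer that inserts a space or line break between runs will show words as divided even though the document is fine.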