RegEx to match formula groups [modified]
-
Forgive me if my question is unclear; I'll do my best to clarify it. What I'm trying to do is tokenize nested formulas using RegEx. Consider the following formula in LaTeX:
LaTeX Example 1: text before \frac{ \frac{ SubPart 1 }{ SubPart 2 } }{ Part 2 } text after
What I need is a RegEx that can tokenize this LaTeX formula into the following token list:
Token[0] = text before
Token[1] = \frac{ \frac{ SubPart 1 }{ SubPart 2 } }{ Part 2 }
Token[2] = text after
Looking at Token[1], it has a nested fraction (\frac{...}) in its top part. That's the way I need it to be in order to build an object tree. In a sense, the fraction in the top part is a child of its parent fraction. Consider the following formula in LaTeX:
LaTeX Example 2: text before \frac{ \left ( SubPart 1 + SubPart 2 \right ) }{ Part 2 } text after
This should result in the following token list:
Token[0] = text before
Token[1] = \frac{ \left ( SubPart 1 + SubPart 2 \right ) }{ Part 2 }
Token[2] = text after
Again, looking at Token[1], it has a nested subformula in it. In this case, the subformula is a child of the fraction object, associated with the fraction's top part. Final example (and then I assume you catch my drift):
LaTeX Example 3: text before \left ( \frac{ SubPart 1 }{ SubPart 2 } \right ) + X text after
This should result in the following token list:
Token[0] = text before
Token[1] = \left ( \frac{ SubPart 1 }{ SubPart 2 } \right )
Token[2] = + X text after
The nested formulas (sub-formulas) will be processed when the parent formula is being constructed, so we need not worry about that part here. What I need is a RegEx that can handle formulas nested to the nth degree. Right now I do this by stepping through the string and checking whether the substring at the current position matches a particular item. When one is found, I add 1 to a level counter and run an inner loop to find the final closing "}" that brings the level back to 0. This works fine, but I've been criticized for using this method instead of tokens obtained with RegEx. If anyone has a suggestion for how I should construct my regex, please let me know. I've gotten as far as:
Fracture : (\\frac{[\W].*?((}\s)|(}$)))*
Subformula : (\\left \([\W].*?\\right \))*
But as you would guess, this fails miserably when it has to deal with nested or multidimensional formulas. Any help and/or insight on the matter would be greatly appreciated.
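For reference, the level-counter approach described above can be sketched roughly as follows. This is only an illustration in Python, not the poster's actual code: the function names (`tokenize`, `_skip_braces`, `_skip_frac`, `_skip_left`) are made up for the example, it assumes `\right` is written without a space after the backslash, and it only recognises `\frac{...}{...}` and `\left x ... \right x` at the top level.

```python
import re

# "\left" or "\right" (not "\leftarrow" etc.), optional space, one delimiter character.
_LEFT_RIGHT = re.compile(r"\\(left|right)(?![A-Za-z])\s*\S")

def _skip_ws(text, pos):
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos

def _skip_braces(text, pos):
    """Return the index just past the '}' that closes the '{' at text[pos]."""
    if pos >= len(text) or text[pos] != "{":
        raise ValueError("expected '{' at position %d" % pos)
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "{":
            depth += 1          # one level deeper
        elif text[i] == "}":
            depth -= 1          # one level back up
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced braces")

def _skip_frac(text, pos):
    """text[pos:] starts with '\\frac'; skip its top and bottom brace groups."""
    pos = _skip_braces(text, _skip_ws(text, pos + len("\\frac")))  # top part
    return _skip_braces(text, _skip_ws(text, pos))                 # bottom part

def _skip_left(text, pos):
    """text[pos:] starts with '\\left'; return the index past the matching '\\right x'."""
    depth = 0
    for m in _LEFT_RIGHT.finditer(text, pos):
        depth += 1 if m.group(1) == "left" else -1
        if depth == 0:
            return m.end()
    raise ValueError("unbalanced \\left/\\right")

def tokenize(text):
    """Split text into top-level tokens: plain text, \\frac groups, \\left...\\right groups."""
    tokens, pos, plain_start = [], 0, 0
    while pos < len(text):
        m = _LEFT_RIGHT.match(text, pos)
        if text.startswith("\\frac", pos):
            end = _skip_frac(text, pos)
        elif m and m.group(1) == "left":
            end = _skip_left(text, pos)
        else:
            pos += 1
            continue
        plain = text[plain_start:pos].strip()
        if plain:
            tokens.append(plain)      # text preceding the formula item
        tokens.append(text[pos:end])  # the item itself, nested content included
        pos = plain_start = end
    tail = text[plain_start:].strip()
    if tail:
        tokens.append(tail)
    return tokens

example = r"text before \left ( \frac{ SubPart 1 }{ SubPart 2 } \right ) + X text after"
for index, token in enumerate(tokenize(example)):
    print("Token[%d] = %s" % (index, token))
```

Each nested item stays inside its parent token, so the same function could be called again on the inner text when the parent object is constructed. Escaped braces such as `\{` are not handled in this sketch.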
-
For what purpose? Do you have example input and output?
-
The purpose is to create an object tree of the formula. The examples may be found in my previous post. These are simplified examples, mind you, but they are representative of the challenge. Simply put:
The parent object is Formula. Formula has a collection of child objects of the base type FormulaItem.
FormulaItem[0] could be of type SubFormula (\left (...\right )). FormulaItem[0], being of type SubFormula, has one collection of child objects of the base type FormulaItem.
FormulaItem[0].FormulaItem[0] could be of type Fracture (\frac{...}). FormulaItem[0].FormulaItem[0], being of type Fracture, has two collections (top part and bottom part) of the base type FormulaItem.
Etc...
I hope to have clarified the "why" in this.
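The object tree described above might look something like this as a minimal Python sketch. The names `Formula`, `FormulaItem`, `SubFormula` and `Fracture` come from the post; the `Text` leaf type, the field names and the `to_latex` helper are assumptions added only to make the example complete.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FormulaItem:
    """Base type for everything that can appear inside a formula."""

@dataclass
class Text(FormulaItem):
    # Hypothetical leaf type for plain text; not named in the post.
    value: str

@dataclass
class SubFormula(FormulaItem):
    # \left ( ... \right ): one collection of children.
    items: List[FormulaItem] = field(default_factory=list)

@dataclass
class Fracture(FormulaItem):
    # \frac{...}{...}: two collections of children, top part and bottom part.
    top: List[FormulaItem] = field(default_factory=list)
    bottom: List[FormulaItem] = field(default_factory=list)

@dataclass
class Formula:
    # Root object holding the top-level items.
    items: List[FormulaItem] = field(default_factory=list)

def to_latex(items):
    """Render a list of items back to LaTeX, recursing into child collections."""
    parts = []
    for item in items:
        if isinstance(item, Text):
            parts.append(item.value)
        elif isinstance(item, SubFormula):
            parts.append(r"\left ( " + to_latex(item.items) + r" \right )")
        elif isinstance(item, Fracture):
            parts.append(r"\frac{ " + to_latex(item.top) + " }{ " + to_latex(item.bottom) + " }")
    return " ".join(parts)

# FormulaItem[0] is a SubFormula whose FormulaItem[0] is a Fracture.
formula = Formula([SubFormula([Fracture([Text("I")], [Text("II")])])])
print(to_latex(formula.items))
```

Because every collection holds the same base type, nesting to any depth needs no extra classes; only the tokenizer has to find where each item starts and ends.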
-
I'm not going to try and work it out myself, but you may find this handy: Expresso - examines and generates regular expressions. Best bit is it breaks them down and explains them in English!
No trees were harmed in the sending of this message; however, a significant number of electrons were slightly inconvenienced. This message is made of fully recyclable Zeros and Ones
-
Björn T.J.M. Spruit wrote:
in my previous post.
I didn't see it and I'm not going to go look for it.
Björn T.J.M. Spruit wrote:
I hope to have clarified the "why" in this.
Nope.
modified on Tuesday, November 3, 2009 6:21 PM
-
Hmmm, not very friendly then. :wtf: No worries, I'm always optimistic and don't mind a 'challenge' when I come across one. I'll give you a more extended LaTeX formula example:
\left ( Availability~ \right ) \times \left ( Performance~ \right ) \times \left ( Quality~ \right ) \times 100 ~ =~ \left (\frac{ I}{II} \right ) \times \left (\frac{ III}{ \left ( IV \times I \right ) } \right ) \times \left (\frac{ V}{III} \right ) \times 100
This is an example of how it's actually being used at this very moment. I'm not sure what more information you need to clarify the "why". "Nope" isn't a very articulate way of asking me for the information you require. So please let me know what you need from me, beyond what I've already told you, to clarify the "why" more accurately. Just to recap, I need to objectify a LaTeX formula. All I need is a regex that is able to work with nested and multi-dimensional formulas as explained in the previous posts. The reason I'm exploring this is that I've been criticized for not using regex tokenization to get the formula elements. If there's anybody who knows how to tokenize a string that is a formula with nested and/or multi-dimensional elements, please let me know; otherwise I'll set this aside as "Not a viable option, can't be done within a reasonable amount of time."
modified on Wednesday, November 4, 2009 12:25 PM
-
Thank you for your advice. The application is a good one and I would certainly recommend it to anyone who's going to work with regular expressions. Limited, controlled recursion of a finite depth can be tackled with regular expressions, though it makes the expressions cumbersome. Unbounded, intuitive recursion can't be done with regular expressions in .NET as of yet. http://badassery.blogspot.com/2006/03/regex-recursion-without-balancing.html I'm still investigating this and will post my findings if there's anything interesting to report.
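To illustrate the finite-depth idea, here is a rough Python sketch (Python's `re` module has no recursive patterns, so the same workaround applies there). The helper names `balanced_braces` and `frac_pattern` are invented for this example, and the pattern only covers `\frac{...}{...}` with plain, unescaped braces.

```python
import re

def balanced_braces(max_depth):
    """Build a pattern for one {...} group with at most max_depth levels of nesting."""
    pattern = r"\{[^{}]*\}"  # depth 1: no braces allowed inside
    for _ in range(max_depth - 1):
        # One more level: text, then any number of (inner group followed by text).
        pattern = r"\{[^{}]*(?:" + pattern + r"[^{}]*)*\}"
    return pattern

def frac_pattern(max_depth):
    """Compile a pattern for \\frac{top}{bottom} whose parts nest up to max_depth."""
    group = balanced_braces(max_depth)
    return re.compile(r"\\frac\s*" + group + r"\s*" + group)

text = r"text before \frac{ \frac{ SubPart 1 }{ SubPart 2 } }{ Part 2 } text after"

# Depth 2 is enough for this input, so the whole outer \frac is matched.
print(frac_pattern(2).search(text).group(0))

# Depth 1 cannot span the nested braces, so only the inner \frac is found.
print(frac_pattern(1).search(text).group(0))
```

The pattern grows with every extra level and silently falls back to an inner match once the input nests deeper than the chosen limit, which is why a counter-based scan tends to be easier to maintain.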