I found a terrible bug (rubber duck session)

honey the codewitch

Update: I have since fixed the terrible bug! :-D

I'm posting this because when I rant about these things to you folks I tend to come up with a solution, and I've been at this since last night. Skip it if you'd rather not be used like that. :) It's not a programming question, though I will describe the problem. There's not really code as such.

[\r\n]* (zero or more carriage returns or line feeds) yields a proper set with two transitions.

[^\r\n]* (zero or more of anything but carriage returns or line feeds) matches any character (incorrect). The set has one range with all Unicode code points in it, and when you invert the set and then minimize the result, it actually crashes.

[^\n\r]* (functionally the same as above) works properly, yielding a set of everything except carriage return and line feed.

This is despite the sets ostensibly being sorted. I thought I had narrowed it down to a normalization routine I have that takes overlapping ranges and merges them. That still might be part of the problem. However, I removed the call to the normalization routine and it still fails my test, so something else is at fault further downstream.

One of the issues is that this is in live code, with deployed NuGet packages and CodeProject articles, and I only just discovered it. So there's some pressure on me to fix it, albeit self-imposed. :~
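For readers following along, here is a minimal Python sketch of the two operations described above: merging overlapping ranges and inverting a range set over all Unicode code points. It is not the engine's actual code. The names `normalize`, `invert`, and `MAX_CP` are made up for illustration, and ranges are assumed to be inclusive `(first, last)` pairs. The sketch sorts before it walks the ranges, because a complement pass that assumes sorted input can misbehave when `\r` (13) is listed before `\n` (10).

```python
MAX_CP = 0x10FFFF  # highest Unicode code point

def normalize(ranges):
    """Sort inclusive (first, last) pairs and merge overlapping or adjacent ones."""
    # Canonicalize each pair so first <= last, then sort by start point.
    fixed = sorted((min(a, b), max(a, b)) for a, b in ranges)
    merged = []
    for first, last in fixed:
        if merged and first <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged

def invert(ranges):
    """Return the complement of a range set over 0..MAX_CP."""
    result = []
    next_free = 0  # lowest code point not yet covered by the input
    for first, last in normalize(ranges):
        if first > next_free:
            result.append((next_free, first - 1))
        next_free = last + 1
    if next_free <= MAX_CP:
        result.append((next_free, MAX_CP))
    return result

cr, lf = ord("\r"), ord("\n")
# Both orderings should give the same complement.
print(invert([(cr, cr), (lf, lf)]))  # [(0, 9), (11, 12), (14, 1114111)]
print(invert([(lf, lf), (cr, cr)]))  # [(0, 9), (11, 12), (14, 1114111)]
```

In this sketch the order in which the class members were written does not matter, because `invert` always works on the normalized list.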

Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

0x01AA

Sometimes over-engineered code breaks your neck. And then I can't help but laugh maliciously. ;P :-D

k5054

Does the issue only apply to control characters, or do you have issues with other match groups not in order? e.g. if [^\r\n] fails and/or crashes, does [^rn] fail and/or crash also? If the latter fails also, then maybe you have an issue in your normalization routine. If only the former fails, then I'd have to suspect that it has something to do with handling "special" (i.e. control) characters. Does the group [^\v\n] also fail? What about [^\n\r\a\v\t] or other combinations of control chars? What about [^abc\n]?
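One way to run through the patterns suggested above is to compare the engine under test against an independent reference. The sketch below uses Python's standard `re` module as that reference. It says nothing about the engine in question, and the probe characters and the helper name `excluded` are arbitrary choices for illustration.

```python
import re

# Patterns taken from the questions above.
patterns = [r"[^\r\n]", r"[^\n\r]", r"[^rn]", r"[^\v\n]",
            r"[^\n\r\a\v\t]", r"[^abc\n]"]

# A handful of probe characters: control characters, letters, one non-ASCII.
probes = ["\r", "\n", "\t", "\v", "\a", "a", "r", "n", "z", "\u00e9"]

def excluded(pattern):
    """Return the probe characters that the negated class refuses to match."""
    rx = re.compile(pattern)
    return sorted(ch for ch in probes if rx.fullmatch(ch) is None)

for p in patterns:
    print(p, [hex(ord(ch)) for ch in excluded(p)])
```

With a reference engine, `[^\r\n]` and `[^\n\r]` exclude exactly the same characters, so any difference between the two in the engine under test points at member ordering rather than at control characters as such.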

"A little song, a little dance, a little seltzer down your pants" Chuckles the clown

honey the codewitch

I'll check it out. :) thanks


jschell

I would say you should never do any of those. So do not attempt to solve anything. All of those are unbounded and optional. Something like the following is correct in that it provides a bound and is not entirely optional.

^\d[^\r\n]*[\r\n]+
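To illustrate the difference, here is a short sketch using Python's standard `re` module (not the engine discussed in this thread). The sample strings are made up. The bounded pattern must consume a digit and at least one line terminator, while the unbounded one can succeed with a zero-length match.

```python
import re

bounded = re.compile(r"^\d[^\r\n]*[\r\n]+")
unbounded = re.compile(r"[^\r\n]*")

# The bounded pattern consumes the digit, the rest of the line, and the line ending.
m = bounded.match("1 first line\r\nsecond")
print(repr(m.group(0)))  # '1 first line\r\n'

# Without a leading digit it does not match at all.
print(bounded.match("no digit\r\n"))  # None

# The unbounded pattern "succeeds" on input that starts with a line ending,
# but the match is empty.
print(repr(unbounded.match("\r\nrest").group(0)))  # ''
```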

honey the codewitch

Well, I caught this as part of a larger regular expression; I'm simply taking out a portion in order to simplify. In my engine, it's perfectly fine to have a zero-length match because every subexpression is an expression. It's expressions all the way down. :) (Oh, and I get the same results with +)


honey the codewitch

It's actually based on some simple mathematical concepts. I know enough about the present bug to say that it has to do with the way I'm sorting and categorizing ranges of characters. I wouldn't even use ranges except that I need them to make Unicode practical, but they do cause the algorithm to deviate significantly from what you'd find in a textbook. It all works, but basically here's what's going on:

Range 1: 0-12
Range 2: 10-0x1ffff

It works fine if range two comes before range one, but how do I sort this? Normally it needs to sort such that Z-A becomes A-Z, and therein lies the issue, or at least an issue. Maybe I can sidestep it somehow. Still stewing on this.

Edit: I just realized it's groups of ranges I'm trying to sort. Maybe I don't need to at all?
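One common way to handle the sorting question is to treat it as two separate steps: first fix the endpoints inside each range (so Z-A becomes A-Z), then order whole ranges by start point, with the end point as a tie-breaker. The Python sketch below shows that idea. It is an illustration, not the engine's code, and `CharRange`, `of`, and `overlaps` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class CharRange:
    # order=True compares by (first, last), in field order.
    first: int
    last: int

    @staticmethod
    def of(a, b):
        """Build a range with its endpoints in ascending order."""
        return CharRange(min(a, b), max(a, b))

    def overlaps(self, other):
        """True if the two inclusive ranges share at least one code point."""
        return self.first <= other.last and other.first <= self.last

# Step 1: endpoint order inside a single range.
print(CharRange.of(ord("Z"), ord("A")))  # CharRange(first=65, last=90)

# Step 2: order of ranges within a set, using the two ranges from the post.
ranges = sorted([CharRange.of(10, 0x1FFFF), CharRange.of(0, 12)])
print(ranges)  # [CharRange(first=0, last=12), CharRange(first=10, last=131071)]
print(ranges[0].overlaps(ranges[1]))  # True
```

Under this ordering the two example ranges sort the same way no matter which one was listed first, and they overlap, so a merge pass would be expected to combine them.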


Mircea Neacsu

Might be an ordering issue: \r (0x0d) > \n (0x0a).

Mircea

honey the codewitch

It is, but I just can't find where it's creating the problem. I've been kind of avoiding it at the moment.


jschell

honey the codewitch wrote:

it's perfectly fine to have a zero length match

I didn't say the engine can't do it. I am saying a programmer should never, not for any reason, write expressions like that for an engine.

honey the codewitch

Sure, but in this case, it was sufficient for running down this bug, because of the way the engine works. I did change it to + just to dot my "i"s and cross my "t"s, but I got the same result.
