Code Project
The 24 day bug

18 Posts 11 Posters 2 Views 1 Watching
    Tim Smith
    wrote on last edited by
    #1

    In a time long ago we had a problem with an embedded system using a 68000 family processor. The computer would run great for 24 days without issue. However, late on the 24th day, the computer would "wig out" (technical term), causing itself to restart (a requirement for embedded systems). The customer was withholding the final $90,000 payment until this issue was fixed.

    The problem was found in the following code.

        /* return the number of milliseconds since the computer was started. */
        /* overflow from the low DWORD goes into the high DWORD giving us 64 */
        /* bits of total size for the counter */
        void GetUpTime (DWORD *phigh, DWORD *plow)
        {
            *phigh = xyz; // get the value from the hardware
            *plow = zyx;  // get the value from the hardware
            return;
        }

        /* return the number of elapsed seconds since startup */
        double GetElapsedTime ()
        {
            DWORD high, low;
            GetUpTime (&high, &low);
            return ((double) high) * 4294967.296 + ((double) low) / 1000.0;
        }

    By looking at the code, I doubt anyone will see the problem. We didn't see the issue. Theorize what the problem might be and how you would test it. After a while, I'll tell you how GetElapsedTime was behaving.

    Tim Smith I'm going to patent thought. I have yet to see any prior art.

      QuiJohn
      wrote on last edited by
      #2

      Hmm, was the hardware timer rolling into the high bit when it hit 2^31 rather than 2^32 (i.e. treating it as signed)? Twenty-four days in milliseconds is about 2 billion...

      My other guess, because I once saw a similar thing happen in an embedded system, is that once in a while the hardware timer would get incremented between the *phigh and *plow assignments. For example:

          *phigh = xyz; // 64 bit timer = 0x00000000 0xFFFFFFFF, so *phigh = 0
          // HW timer ticks
          *plow = zyx;  // 64 bit timer = 0x00000001 0x00000000, so *plow = 0

      Resulting in a total time of zero.


      Faith is a fine invention For gentlemen who see; But microscopes are prudent In an emergency! -Emily Dickinson

        PIEBALDconsult
        wrote on last edited by
        #3

        All I can think of is that 4294967.296 may be single rather than double, but as that may be compiler-specific I don't know. I will say, though, that I'd rather you either include the answer or at least post the answer as a reply, so your audience can choose to read or not read it.

          Michael Dunn
          wrote on last edited by
          #4

          0xFFFFFFFF milliseconds is 24.85 days, so my psychic powers tell me that something's not right when the high DWORD in GetUpTime() is non-zero. Could be a bug in GetUpTime() not returning the right thing in *phigh, which then causes the calculation in GetElapsedTime() to overflow a double.

          I messed up somewhere while typing into Calc; it's actually 0x7FFFFFFF milliseconds. I still think it's an overflow somewhere.

          --Mike-- Visual C++ MVP :cool: LINKS~! Ericahist | PimpFish | CP SearchBar v3.0 | C++ Forum FAQ

            PIEBALDconsult
            wrote on last edited by
            #5

            But "After a while, I'll tell you how GetElapsedTime was behaving" hints that the problem is in GetElapsedTime, not GetUpTime.

              QuiJohn
              wrote on last edited by
              #6

              Michael Dunn wrote:

              0xFFFFFFFF milliseconds is 24.85 days

              I got 0x7FFFFFFF as being 24.86 days which led to my speculation about it being a signed/unsigned issue.
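The arithmetic behind the two figures being compared here can be checked with a few lines of C. This is only a sketch; `ms_to_days` is a hypothetical helper name that does not appear in the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Milliseconds in one day: 1000 ms * 60 s * 60 min * 24 h. */
#define MS_PER_DAY (1000.0 * 60.0 * 60.0 * 24.0)

/* Convert a millisecond count to days (hypothetical helper). */
double ms_to_days(double ms)
{
    assert(ms >= 0.0); /* an uptime is never negative */
    return ms / MS_PER_DAY;
}
```

Passing 2147483647.0 (0x7FFFFFFF) gives a little under 24.86 days, while 4294967295.0 (0xFFFFFFFF) gives about 49.7 days, which is why a failure late on day 24 points at the sign bit of a 32-bit value rather than at a full 32-bit rollover.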


                PIEBALDconsult
                wrote on last edited by
                #7

                Nah, it's got to be that the hamsters were union.

                  Michael Dunn
                  wrote on last edited by
                  #8

                  Oops, you're right. I had an off-by-a-power-of-2 bug somewhere when I was typing into Calc. :doh:

                    Dave Kreskowiak
                    wrote on last edited by
                    #9

                    I agree with Mike Dunn. I think it has to do with an overflow. Please, don't make us wait 24 days to find out.

                    Dave Kreskowiak Microsoft MVP - Visual Basic

                      Marc Clifton
                      wrote on last edited by
                      #10

                      IMO, high and low are DWORDs (unsigned), but converting them to doubles ended up treating them as signed values. As soon as you hit 0x80000000 in high, you'd get a negative number. You wouldn't notice it in low if it went negative except for some jitter in the time elapsed.
                      Marc

                      Thyme In The Country

                      People are just notoriously impossible. --DavidCrow
                      There's NO excuse for not commenting your code. -- John Simmons / outlaw programmer
                      People who say that they will refactor their code later to make it "good" don't understand refactoring, nor the art and craft of programming. -- Josh Smith

                        CPallini
                        wrote on last edited by
                        #11

                        Marc Clifton wrote:

                        IMO, high and low are DWORDS (unsigned), converting them to doubles ended up treating them as signed values.

                        That was also my idea, at least for a while. However, on my system, DWORD is converted to double without losing its unsigned nature; hence, unless the Smith embedded system makes a mistake on the cast, our idea is wrong. :)

                        If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.

                        In testa che avete, signor di Ceprano?

                          Tim Smith
                          wrote on last edited by
                          #12

                          Marc is correct. The code is 100% correct in theory but wrong in practice, due to the 68000 not supporting unsigned integer to floating point conversion. Thus once the unsigned value hit 0x80000000, the resulting double value went negative. The fix was simple and just avoided unsigned conversions. The only detail off in Marc's answer is that the overflow was in the low DWORD and not the high. The ticks were in milliseconds, so an overflow in the high DWORD won't happen for a very, very long time.
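The thread does not show the actual fix, so the following is only a sketch of one way to avoid an unsigned-to-double conversion. The names `convert_as_signed`, `convert_by_halves`, `convert_fn` and `elapsed_seconds` are hypothetical; `convert_as_signed` merely imitates the faulty behaviour described above on a modern compiler, and `uint32_t` stands in for `DWORD`.

```c
#include <assert.h>
#include <stdint.h>

/* Imitates the faulty behaviour described in the thread: the 32-bit
   pattern is treated as a signed value before it becomes a double. */
double convert_as_signed(uint32_t v)
{
    int32_t s = (v < 0x80000000u)
        ? (int32_t)v
        : (int32_t)(v - 0x80000000u) - INT32_MAX - 1;
    return (double)s;
}

/* One possible workaround (an assumption, not the author's code):
   convert two 16-bit halves, each of which fits in a signed 32-bit
   integer, and recombine them as doubles. */
double convert_by_halves(uint32_t v)
{
    int32_t hi = (int32_t)(v >> 16);
    int32_t lo = (int32_t)(v & 0xFFFFu);
    assert(hi >= 0 && lo >= 0); /* both halves are at most 0xFFFF */
    return (double)hi * 65536.0 + (double)lo;
}

typedef double (*convert_fn)(uint32_t);

/* Same formula as GetElapsedTime, with the conversion made pluggable
   so the two behaviours can be compared side by side. */
double elapsed_seconds(uint32_t high, uint32_t low, convert_fn to_double)
{
    return to_double(high) * 4294967.296 + to_double(low) / 1000.0;
}
```

With `high = 0` and `low = 0x80000000`, `elapsed_seconds` returns a negative number of seconds when given `convert_as_signed` and roughly 2147483.648 when given `convert_by_halves`; for any `low` below 0x80000000 the two agree, which matches the system running fine for the first 24 days.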

                            Marc Clifton
                            wrote on last edited by
                            #13

                            Tim Smith wrote:

                            Marc is correct.

                            Woot! :jig:

                            Tim Smith wrote:

                            The ticks were in milliseconds so an overflow in the high DWORD won't happen for a very very long time.

                            :doh: Of course. :sigh: Marc

                              Mike_V
                              wrote on last edited by
                              #14

                              Division by zero. Operator precedence. [edit]Just read your reply to Marc - still think I found something, though.[/edit] [edit2] Oops! :-O I guess it's called Division by zero for a reason! [/edit2] Mike

                                Pete OHanlon
                                wrote on last edited by
                                #15

                                Mike_V wrote:

                                Oops! I guess it's called Division by zero for a reason

                                It's called Nullity.

                                the last thing I want to see is some pasty-faced geek with skin so pale that it's almost translucent trying to bump parts with a partner - John Simmons / outlaw programmer
                                Deja View - the feeling that you've seen this post before.

                                  peterchen
                                  wrote on last edited by
                                  #16

                                  you just should have stopped reading earlier ;)


                                  Developers, Developers, Developers, Developers, Developers, Developers, Velopers, Develprs, Developers!
                                  We are a big screwed up dysfunctional psychotic happy family - some more screwed up, others more happy, but everybody's psychotic joint venture definition of CP
                                  Linkify!|Fold With Us!

                                    Marc Clifton
                                    wrote on last edited by
                                    #17

                                    peterchen wrote:

                                    you just should have stopped reading earlier

                                    I should have stopped writing sooner! :) Marc

                                      PICguy
                                      wrote on last edited by
                                      #18

                                      The 24 day hint was the clue for me. 3 to 4 minutes to the correct solution. And I found another VERY rare bug.

                                      Bug found by finding the number of milliseconds in 24 days. This was about half of 2^32. Thus DWORD had to be treated as signed, which produced a negative number after 0x7FFFFFFF milliseconds.

                                      The second bug is in your GetUpTime() function. About every 49.7 days (2^32 milliseconds) the low counter overflows into the high word. If you catch the high word before the overflow and the low word just after the overflow you will have a problem.

                                      Fix: read the high word a second time. If unchanged you are done, otherwise loop back and read both again. The second time the high word will match. (If counters take much processing to read then range check the low word. If (unsigned) less than 1000 – one second – then check the high word again. Otherwise it should be safe to skip the high word check.)

                                      The method given above works well in 8-bit microcontrollers for reading 16-bit counters on the fly without requiring h/w interlocks.
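The re-read fix described above can be sketched as follows. Everything here is an assumption made for illustration: the real hardware registers (`xyz` and `zyx` in the original) are replaced by a simulated counter that ticks once on every register read, and the names `read_reg_fn`, `sim_counter`, `sim_read_high`, `sim_read_low`, `read_uptime_naive` and `read_uptime_ms` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*read_reg_fn)(void);

/* Simulated 64-bit millisecond counter; it advances by one on every
   register read to model time passing between the two reads. */
uint64_t sim_counter;

uint32_t sim_read_high(void)
{
    uint32_t v = (uint32_t)(sim_counter >> 32);
    sim_counter++;
    return v;
}

uint32_t sim_read_low(void)
{
    uint32_t v = (uint32_t)sim_counter;
    sim_counter++;
    return v;
}

/* Naive read, in the same order as GetUpTime: high word, then low word. */
uint64_t read_uptime_naive(read_reg_fn read_high, read_reg_fn read_low)
{
    assert(read_high && read_low);
    uint32_t high = read_high();
    uint32_t low = read_low();
    return ((uint64_t)high << 32) | low;
}

/* Re-read the high word; if it changed, the low word may belong to
   either side of the rollover, so read both again. */
uint64_t read_uptime_ms(read_reg_fn read_high, read_reg_fn read_low)
{
    assert(read_high && read_low);
    uint32_t high, low, check;
    do {
        high = read_high();
        low = read_low();
        check = read_high();
    } while (high != check);
    return ((uint64_t)high << 32) | low;
}
```

Starting the simulated counter at 0x00000000FFFFFFFF, the naive read returns zero, which is the torn value QuiJohn described earlier in the thread, while the re-reading version returns a value just past 2^32. On real hardware the register reads would normally go through `volatile` accesses rather than function pointers.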
