Code Project · The Lounge
SW Engineering, NASA & how things go wrong

25 posts · 13 posters
• raddevus wrote:

    Since we receive so many files per day (at times X per second, and I know other apps stress things much further), my system has pushed the network, the file-storage hardware, and the Windows file system to their limits (other apps are running on our medium-sized company's file system as well). My point is that there are times we are receiving files but the network cannot reach the SSD, or the network node where the SSD lives, or whatever, so I get low-level errors back in my system when I'm just trying to save a file. Meanwhile, the infrastructure hardware and OS folks are like, "You shouldn't ever get write errors. It's just not possible." So I had to make sure my app doesn't crash and somehow handles the situation without losing everything. It's a challenge, and those kinds of errors almost always occur at 3am-4am local time. X| I really don't want to wake up (or wake anyone else up) in the middle of the night. My final point: the people who may be wakened in the middle of the night are the closest to being/becoming engineers. Cuz it's on them. :laugh:
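The failure mode described above ("impossible" write errors that still happen) is usually handled with a retry-then-spool routine. This is a minimal sketch, not the poster's actual system; all names and parameters are hypothetical:

```python
import os
import time

def save_with_retries(path, data, attempts=5, base_delay=0.01):
    """Try to persist a file; back off between transient failures.

    Returns True on success. If every attempt fails, the payload is
    spooled to a sidecar file instead of being lost.
    """
    for attempt in range(attempts):
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())   # push the bytes to the device
            os.replace(tmp, path)      # atomic rename: no half-written file
            return True
        except OSError:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    # Fail safely: keep the bytes somewhere rather than crash at 3am.
    with open(path + ".spool", "wb") as f:
        f.write(data)
    return False
```

The write-to-temp-then-rename step matters as much as the retry loop: a reader never sees a half-written file, even if the process dies mid-write.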

    charlieg (#11)

    Curious: how are these files arriving? FTP? Email? rad, check with your employer, but I would love to see an article on the engineering description of your system. I consulted for a few years with a firm that processed motor vehicle records for insurance companies. The amount of data and file processing was incredible.

    Charlie Gilley “They who can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.” BF, 1759 Has never been more appropriate.

  • Greg Utas wrote:

      I worked on large systems (tens of millions of lines of code) that were "five nines" and later "six nines", which translates to 5 minutes or 30 seconds of downtime per year. Probably 90% of our code dealt with failure scenarios.

      Robust Services Core | Software Techniques for Lemmings | Articles
      The fox knows many things, but the hedgehog knows one big thing.
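The nines above translate directly into a downtime budget; a quick back-of-the-envelope check (plain Python, illustrative only) confirms the "5 minutes or 30 seconds" figures:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def downtime_per_year(nines):
    """Allowed downtime, in seconds per year, for n nines of availability."""
    unavailability = 10 ** -nines      # five nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability

five_nines = downtime_per_year(5)  # ~315 s, a bit over 5 minutes
six_nines = downtime_per_year(6)   # ~31.5 s
```

At six nines there is no room for a human in the loop: half a minute per year means recovery has to be automatic, which is why so much of such a codebase deals with failure scenarios.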

      trønderen (#12)

      Somewhat related: I have several times had to explain to youngsters why .NET JIT code generation from MSIL is not a performance problem. Part of the answer is that lexical analysis and the like have been done earlier in the process, but the single biggest factor is that the JIT compiler does not have to do error checking. It assumes error-free input, and saves a lot of time on that, compared to a full from-source compiler, which cannot take error-free input for granted.

    • raddevus wrote:

        I really can't stop with this book, Modern Software Engineering[^], because so much of it resonates with me after working in IT/Dev for over 30 years. I came to Dev thru QA so I've always focused on "repeatable processes, errors & failing safely".

        Quote:

        One of the driving forces behind [Margaret] Hamilton’s[^] approach was the focus on how things fail—the ways in which we get things wrong. "There was a fascination on my part with errors, a never ending pass-time of mine was what made a particular error, or class of errors, happen and how to prevent it in the future." This focus was grounded in a scientifically rational approach to problem-solving. The assumption was not that you could plan and get it right the first time, rather that you treated all ideas, solutions, and designs with skepticism until you ran out of ideas about how things could go wrong. Occasionally, reality is still going to surprise you, but this is engineering empiricism at work. The other engineering principle that is embodied in Hamilton’s early work is the idea of “failing safely.” The assumption is that we can never code for every scenario, so how do we code in ways that allow our systems to cope with the unexpected and still make progress? Famously it was Hamilton’s unasked-for implementation of this idea that saved the Apollo 11 mission and allowed the Lunar Module Eagle to successfully land on the moon, despite the computer becoming overloaded during the descent. As Neil Armstrong and Buzz Aldrin descended in the Lunar Excursion Module (LEM) toward the moon, there was an exchange between the a

        trønderen (#13)

        Quote:

        The other engineering principle that is embodied in Hamilton’s early work is the idea of “failing safely.”

        As I have said before: Fail safe systems fail by failing to fail safely.

      • raddevus (quoted above)

          charlieg (#14)

          Margaret is legend. If you are serious about engineering software, read everything, and I mean EVERYTHING, she's written. She took her job a hell of a lot more seriously than most of us software weenies do - no offense. She knew three people were going to sit on top of some serious energy, go to the moon, and hopefully come back without going splat. Dammit, rad, I'm ordering the book now. On a side note, if anyone has good references on the material behind "Hidden Figures", please post.


          • charlieg (quoted above)

            raddevus (#15)

            Most (80%) of the files are transferred via AS2[^] (a relatively little-known protocol for transferring EDI) over HTTPS, including a separate security certificate to encrypt and sign the data. The other 20% arrive via FTP (sFTP).

            • charlieg (quoted above)

              raddevus (#16)

              charlieg wrote:

              Margaret is legend.

              I agree. She is an amazing person. :thumbsup: Every time I read an article or excerpt about her work, I am inspired. :thumbsup: And you won't regret getting this book; so far (chapter 2) it is a fantastic read. So much great stuff in there.

              • Calin Negru wrote:

                In my early days of programming I had the mindset that keeping the program running at all costs makes the source code hard to read and maintain because of all the extra safety checks. Then I learned about all the safety features and warnings in a plane's cockpit, and I came to realize that getting notifications while still keeping things under control has a point.

                jmaida (#17)

                I agree. I didn't do flight software, but I wrote quite a lot of software (mostly for engineering and business applications). I try to handle every error condition possible with a message saying that an error occurred, what the error was, and what created it, if possible. That is followed by a recommendation of how to correct or avoid it, and/or what one should document for later review. This is just a summary; it gets more complicated in a real-time system, where the software has to correct and control without user intervention. BTW, all this back and forth on software engineering is very good info. In my grad-school days, software engineering was in its formative stages. I was lucky to be a part of that and used what I learned with success. Even a simple thing like naming-convention considerations in the early stages made a big difference: it forced one to think of the architecture as a whole. Do it wrong and woe to you and others.
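The discipline described above (what happened, what created it, what to do next) can be captured in a tiny structure. A hedged sketch; the names and the example job are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ErrorReport:
    """One operator-facing record per failure."""
    what: str             # the error that occurred
    source: str           # what created it
    recommendation: str   # how to correct, avoid, or follow up

def handle(op, source, recommendation):
    """Run op; on failure, return an ErrorReport instead of crashing."""
    try:
        op()
        return None
    except Exception as exc:
        return ErrorReport(what=repr(exc), source=source,
                           recommendation=recommendation)

report = handle(lambda: 1 / 0,
                source="monthly rollup job",
                recommendation="check the divisor table for zero rows")
```

The point of the structure is that every failure carries its remedy with it, so whoever reads the log at 3am does not have to reconstruct intent from a bare stack trace.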

                "A little time, a little trouble, your better day" Badfinger

                • raddevus (quoted above)

                  Kate X257 (#18)

                  Nice. We built a new internal service this year for inbound files of the same magnitude. Because it's very important data, I designed the system to automatically recover from a broad set of unknown failures, with redundancy in place that gives us a two-month leeway to get the automated system back on track. Extra attention was paid so we could easily re-create scenarios in a disconnected environment and audit both the production and simulation processes with very basic tools. So far we've had 7 failures, including a critical versioning failure last week. No impact at all, because of our redundancy. When dealing with large volumes or critical data, you really need a sensible and simple failsafe.
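The recover-and-replay idea above can be sketched as a primary store backed by a spool directory: failed items wait on disk (the "leeway") until a replay pushes them through again. This is an illustrative toy, not the poster's service; the in-memory dict and the `fail_primary` toggle stand in for real infrastructure:

```python
import json
import os

class Ingest:
    """Primary store plus a spool directory that buys time when the
    primary fails; replay() pushes spooled items through again."""

    def __init__(self, spool_dir):
        self.spool_dir = spool_dir
        self.store = {}            # stand-in for the real primary store
        self.fail_primary = False  # toggle to simulate an outage

    def _write_primary(self, key, payload):
        if self.fail_primary:
            raise OSError("primary store unavailable")
        self.store[key] = payload

    def accept(self, key, payload):
        try:
            self._write_primary(key, payload)
        except OSError:
            # Keep the data instead of dropping it; retry later.
            path = os.path.join(self.spool_dir, key + ".json")
            with open(path, "w") as f:
                json.dump(payload, f)

    def replay(self):
        for name in os.listdir(self.spool_dir):
            path = os.path.join(self.spool_dir, name)
            with open(path) as f:
                self._write_primary(name[:-len(".json")], json.load(f))
            os.remove(path)
```

Because the spool is plain files on disk, it can be audited and replayed with very basic tools, which is exactly the property described above.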

                  • raddevus (quoted above)

                    Lost User (#19)

                    I found that (user) logging prevents a lot of "errors". Another term is graceful degradation, but that requires understanding when a try-catch block should continue or not. And yes, it may require asking the user whether to proceed (e.g. when a file is not available). It then comes down to transparency (of the software): what Boeing failed to do with their software changes was tell the user what they did and how it might impact them.
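The "when should a try-catch continue" question comes down to sorting exceptions into expected (degrade and carry on) versus unexpected (propagate). A minimal sketch, with a hypothetical preferences file as the example:

```python
def load_preferences(path):
    """Degrade gracefully: a missing prefs file is survivable, so catch
    it and fall back to defaults. Anything else (permissions, disk
    errors) is unexpected and propagates, failing loudly instead of
    silently running with the wrong configuration."""
    try:
        with open(path) as f:
            return dict(line.strip().split("=", 1)
                        for line in f if "=" in line)
    except FileNotFoundError:
        print(f"note: {path} not found, using defaults")
        return {"theme": "light"}
```

The `print` is the transparency part: the program continues, but the user is told what was degraded and why, rather than discovering it later.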

                    "Before entering on an understanding, I have meditated for a long time, and have foreseen what might happen. It is not genius which reveals to me suddenly, secretly, what I have to say or to do in a circumstance unexpected by other people; it is reflection, it is meditation." - Napoleon I

                    • Lost User (quoted above)

                      PhilipOakley (#20)

                      The Boeing thing shows that big engineering can be really hard, and that in many cases software is just another unreliable component: somewhat mis-designed and underappreciated (in its failure modes!). The world is eating software, with the usual outcomes.

                      • raddevus (quoted above)

                        agolddog (#21)

                        It's just shocking to me how many "professional" developers don't embrace defensive programming.
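Defensive programming in its simplest form is guard clauses: validate every input before touching state. A small illustrative sketch (the account-transfer example is hypothetical):

```python
def transfer(accounts, src, dst, amount):
    """Guard clauses first: validate everything before touching state,
    so a bad call can never leave the accounts half-updated."""
    if src not in accounts or dst not in accounts:
        raise KeyError("unknown account")
    if amount <= 0:
        raise ValueError("amount must be positive")
    if accounts[src] < amount:
        raise ValueError("insufficient funds")
    accounts[src] -= amount
    accounts[dst] += amount
```

The ordering is the point: all checks happen before the first mutation, so every failure leaves the data exactly as it was.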

                        • raddevus (quoted above)

                          rjmoses (#22)

                          One of the hardest, or at least most memorable, software problems I ever had to chase was a programming error in a seldom-used error-recovery routine. It seems my programmer, who was highly experienced in other programming languages such as COBOL, coded a "=" instead of "==" inside an if statement in a C program, kinda like "if (A = B)...". This assigns the value of B to A and then tests the assigned value, so the condition is true whenever B is nonzero, regardless of whether A equaled B. Unfortunately, this little error caused the system to crash. It took about six months to find the cause of the crash and understand what was happening. Looking at the code under the pressure of a "down system", we always asked why the condition was true, because our minds were reading "if A equals B" and not considering that A was not equal to B before the if statement. I cursed (and still curse to this day) whoever decided that allowing an assignment inside a conditional statement was a good idea. And I have to wonder how many systems, like autonomous cars, have a statement like that buried way down deep in an infrequently used piece of critical code.
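In C, `if (A = B)` is legal: it assigns B to A and tests the result (modern compilers such as gcc and clang do warn about it under `-Wall`). Python's designers took the opposite path: plain assignment in a condition is a syntax error, and when assignment expressions were eventually added (the `:=` walrus operator) they were given a deliberately distinct spelling. A quick demonstration:

```python
def is_syntax_error(src):
    """Return True if Python refuses to even compile the snippet."""
    try:
        compile(src, "<demo>", "exec")
        return False
    except SyntaxError:
        return True

# The classic C typo is a hard error in Python, not a silent assignment:
assert is_syntax_error("if a = b:\n    pass")
# Comparison compiles fine:
assert not is_syntax_error("if a == b:\n    pass")
# Assignment in a condition must be spelled with the distinct := operator:
assert not is_syntax_error("if (a := b):\n    pass")
```

Making the two operations visually and syntactically distinct turns a six-month debugging hunt into a compile-time error.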

                          • Lost User (quoted above)

                            raddevus (#23)

                            Very good points and exactly right on the Boeing problem. Could’ve been solved properly. Such a sad and terrible thing.

                            • agolddog (quoted above)

                              raddevus (#24)

                              I believe it is because they don’t “have to” since they are not the ones who will be wakened in the middle of the night. You learn a lot from losing sleep. :rolleyes:

                              • rjmoses (quoted above)

                                raddevus (#25)

                                That is a fantastic story, thanks for sharing. I've seen that error (in my own code) and fortunately caught it before it went into production. You would definitely think the compiler would catch that.
