Alcoraiden t1_jdio6ir wrote on March 24, 2023 at 5:44 PM

Reply to comment by WillShelbyOBE in ELI5: Why does "turning it off and on again" work so well for troubleshooting? by WillShelbyOBE

Let me give this a shot. I'm an electrical engineer, for context, so I'm more on the hardware side than the software side, but here goes.

Preface: In an ideal world, a computer would work the same way each time. That's what code and circuits are designed to do. However, obviously, that doesn't always happen.

Hardware answer: All your machines are made of circuits, as you know. The software is running on something. When you transmit data from point A to point B, like when you press keys on your keyboard and signals go flying out to the processor, things can go wrong. Here are some common circuit design issues, some of which engineers should catch during design:

- Marginal voltages, where you barely have enough juice to run what you're trying to run and any sag in the power will cause problems

- Transistors overlapping their switching, causing "shoot-through," where power briefly connects to ground (this usually happens in motor drives)

- Overheating, where electronics change behavior, sometimes drastically, when hot, and can even melt/fuse

- Not enough static protection, so you can shock it with your finger when you touch it and change the voltages inside

There are so many more. Sometimes the traces on your printed circuit board are poorly matched, so high-speed components glitch out when the signals on a parallel data bus arrive out of sync and can't be read correctly. Sometimes your power rails don't come up in the right order. I could write a list as long as my arm of issues that can cause intermittent bugs.

I had to restart my smartwatch once when I shocked it just right and it froze up -- presumably the sudden voltage spike, while taken care of by protective diodes, had caused enough chaos inside that the processor didn't know what to do. But once you drop the power rail and everything goes to 0V (roughly), you can start it up again and now it's fresh. It doesn't remember what went wrong unless you physically damaged a part.

The summary of all this is that there are many parts to a machine that are functioning marginally, such that small random events can determine when it works and when it doesn't. Restarting just clears out the negative effects of a bad run, can allow hardware to reset its voltages or cool off, and starts from scratch for another try. Good engineering will minimize the chance of these random events causing issues, but a few will always get through now and then.

Turning it off and back on again is a temporary solution. As an engineer, I am not allowed to just reboot and let a known bug through. It will show up again later. Bugs are almost never single events, even if it takes a while for them to reappear.
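
If it helps to see the idea in code, here's a toy sketch in Python. It is not a model of any real hardware: `ToyDevice`, `glitch_chance`, and `count_reboots` are names I made up for illustration. It assumes a glitch "latches" a bad state that nothing clears while the device is running, and that a power cycle wipes that state but leaves the underlying design flaw in place.

```python
import random


class ToyDevice:
    """Hypothetical device: works until a random glitch latches a bad state."""

    def __init__(self, glitch_chance, seed=0):
        # glitch_chance stands in for a marginal design (an assumption,
        # not a measured quantity). It survives every reboot.
        self.glitch_chance = glitch_chance
        self.rng = random.Random(seed)
        # Volatile state: this is what a power cycle wipes.
        self.stuck = False

    def step(self):
        """Do one unit of work. Returns True if the device responded."""
        if not self.stuck and self.rng.random() < self.glitch_chance:
            # The glitch latches; nothing clears it while power stays on.
            self.stuck = True
        return not self.stuck

    def power_cycle(self):
        """Drop the rail: volatile state resets, the design flaw remains."""
        self.stuck = False


def count_reboots(device, steps):
    """Run the device, rebooting whenever it freezes. Returns the reboot count."""
    reboots = 0
    for _ in range(steps):
        if not device.step():
            device.power_cycle()
            reboots += 1
    return reboots


if __name__ == "__main__":
    device = ToyDevice(glitch_chance=0.01, seed=42)
    print("reboots needed:", count_reboots(device, steps=10_000))
```

Every reboot gets the toy device working again, but the count keeps climbing as long as `glitch_chance` stays above zero. Only lowering that number, which stands in for fixing the design, makes the reboots stop.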