WillShelbyOBE OP t1_jdiat5s wrote on March 24, 2023 at 4:20 PM

Reply to comment by cdtoad in ELI5: Why does "turning it off and on again" work so well for troubleshooting? by WillShelbyOBE

Agreed! I think your comment is a well-articulated version of my understanding. But I want to know the deep, deep rationale behind it. It seems so crazy to me that, in our world of advanced technology, the best troubleshooting is to turn it off and on again.

LARRY_Xilo t1_jdifi7b wrote on March 24, 2023 at 4:50 PM

It's "the best" because it's very fast, everyone understands what to do, and you don't need to figure out what even went wrong or how to fix it. Most things could be fixed without a restart, but that requires knowledge, time, and access to the device. So why use valuable time from someone with knowledge if a restart also works?

Pilchard123 t1_jdimkav wrote on March 24, 2023 at 5:34 PM

> everyone understands what to do

laughs in IT support

jak0b345 t1_jdida1e wrote on March 24, 2023 at 4:36 PM

it is not the best troubleshooting method we have, not by a long shot. but it is the easiest one that is still able to solve some (or even most) problems. because of that it is always the first go-to method before trying something more involved, like deciphering logs or error codes to understand the source of the problem.

Alcoraiden t1_jdio6ir wrote on March 24, 2023 at 5:44 PM

Let me give this a shot. I'm an electrical engineer, for context, so I'm more on the hardware than the software, but here goes.

Preface: In an ideal world, a computer would work the same way each time. That's what code and circuits are designed to do. However, obviously, that doesn't always happen.

Hardware answer: All your machines are made of circuits, as you know. The software is running on something. When you transmit data from point A to point B, like when you press keys on your keyboard and signals go flying out to the processor, things can go wrong. Some common issues in circuit design, some of which should be caught by engineers in design:

- Marginal voltages, where you barely have enough juice to run what you're trying to run and any sagging in the power will cause problems

- Transistors overlapping their switching, which causes "shoot-through," where power briefly connects to ground; this usually happens in motor drives

- Overheating, where electronics change behavior, sometimes drastically, when hot, and can even melt or fuse

- Not enough static protection, so you can shock it with your finger when you touch it and change the voltages inside

There are so many more. Sometimes your traces on the printed circuit board are poorly matched, so high speed components will sometimes glitch out when they don't recognize a parallel data bus that comes in out of sync. Sometimes your power rails don't come up in the right order. I could write a list as long as my arm of issues that can cause intermittent bugs.

I had to restart my smartwatch once after I shocked it just right and it froze up -- presumably the sudden voltage spike, while taken care of by protective diodes, had caused enough chaos inside that the processor didn't know what to do. But once you drop the power rail and everything goes to 0V (roughly), you can start it up again and now it's fresh. It doesn't remember what went wrong unless you physically damaged a part.

The summary of all this is that there are many parts to a machine that are functioning marginally, such that small random events can determine when it works and when it doesn't. Restarting just clears out the negative effects of a bad run, can allow hardware to reset its voltages or cool off, and starts from scratch for another try. Good engineering will minimize the chance of these random events causing issues, but a few will always get through now and then.
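If it helps to see the "starts from scratch" idea in code, here is a minimal toy sketch in Python. Everything in it is made up for illustration (the `Device` class, its `glitch` method, the state fields); it is not how any real device firmware works. The only point is that a restart rebuilds volatile state from known-good defaults, so whatever the glitch scrambled is simply gone.

```python
class Device:
    """Toy device (hypothetical) whose volatile state can be corrupted."""

    VALID_MODES = ("idle", "busy")

    def __init__(self):
        self.power_on()

    def power_on(self):
        # Booting always rebuilds volatile state from known-good defaults.
        self.state = {"mode": "idle", "counter": 0}
        self.frozen = False

    def glitch(self):
        # Stand-in for a voltage spike or similar random event: it leaves
        # the state in a combination the normal logic never expects.
        self.state["mode"] = "???"

    def tick(self):
        # One step of normal operation.
        if self.frozen:
            return  # a frozen device does nothing until restarted
        if self.state["mode"] not in self.VALID_MODES:
            self.frozen = True  # logic has no idea what to do, so it hangs
            return
        self.state["counter"] += 1

    def restart(self):
        # Dropping power wipes the bad state; booting starts from defaults.
        self.power_on()


device = Device()
device.tick()      # works normally
device.glitch()    # random event corrupts the state
device.tick()      # device freezes
device.restart()   # bad state is discarded
device.tick()      # works normally again
```

Note that the restart fixes the symptom without anyone ever learning what `glitch` did, which is exactly why it is so convenient and also why it tells you nothing about the cause.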

Turning it off and back on again is a temporary solution. As an engineer, I am not allowed to just reboot and let a known bug through. It will show up again later. Bugs are almost never single events, even if it takes a while for them to reappear.
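A rough way to see why a rare bug still comes back: assume (purely for illustration) that the bug has the same small chance of showing up on every run and that runs are independent. Real bugs are rarely that tidy, and the function name and numbers below are made up, but the arithmetic shows the trend.

```python
def chance_bug_reappears(p_per_run: float, runs: int) -> float:
    """Probability the bug shows up at least once in `runs` runs.

    Assumes each run is independent and hits the bug with the same
    probability `p_per_run` (an illustrative simplification).
    """
    if not 0.0 <= p_per_run <= 1.0:
        raise ValueError("p_per_run must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # P(at least once) = 1 - P(never) = 1 - (1 - p)^runs
    return 1.0 - (1.0 - p_per_run) ** runs


# A hypothetical 1-in-100 bug over a growing number of runs.
for runs in (1, 10, 100, 1000):
    print(runs, round(chance_bug_reappears(0.01, runs), 3))
```

Under those assumptions, a 1-in-100 bug is more likely than not to show up again within 100 runs, so rebooting past it only delays the next occurrence.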