Your web server crashes at 3 AM. Kubernetes restarts it. Users see a brief error page. Nobody gets fired. Now imagine your server is orbiting Mars, the restart takes 45 minutes (during which you have no attitude control, no thermal regulation, and no communication), the 'user' is a $2.5 billion spacecraft, and there's no Kubernetes — just your code and the radiation-hardened CPU it's running on.
Space software operates under constraints that make normal software engineering look casual. You can't deploy a hotfix. You can't SSH in and check the logs. You can't add more servers when load increases. Every line of code must be correct before launch, because after launch, the software is on its own for years or decades, running on hardware that's being slowly damaged by radiation, with a communication link that has multi-minute delays and kilobits-per-second bandwidth.
The Constraints That Shape Everything
Radiation. In space, high-energy particles constantly bombard electronics. A single particle can flip a bit in memory (a single-event upset, or SEU), corrupt a register, or cause a processor to lock up. This isn't rare: depending on orbit, shielding, and memory size, a satellite can log anywhere from a handful to thousands of bit flips per day, so upsets are routine events, not anomalies. Space-grade hardware uses radiation-hardened components that are slower, more expensive, and generations behind consumer hardware. The processor on the Perseverance Mars rover is a RAD750 — roughly equivalent to a late-1990s PowerPC 750 running at up to 200 MHz.
Communication delays. One-way light time to Mars ranges from about 3 to 22 minutes depending on orbital position; to Jupiter, roughly 33 to 53 minutes. This isn't just latency — it means the spacecraft must handle any problem autonomously for at least the round-trip time before ground control can even see the problem, let alone send a command. When the Voyager probes encounter an anomaly at more than 22 light-hours from Earth, the software makes its own decisions for nearly two days.
No physical access. You can't replace a failed component, add RAM, or swap a hard drive. If the primary computer dies and the backup doesn't work, the mission is over. Every failure mode must be anticipated in software before launch.
How Space Code Is Different
Space software uses techniques that would be considered absurdly over-engineered in normal software development.
Triple modular redundancy (TMR). Critical computations run three times on three independent processors. A voter compares the three results and uses the majority answer. If one processor gets a radiation-induced wrong answer, the other two outvote it. Some systems run five copies (quintuple redundancy) for extra assurance.
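A minimal voter can be sketched in a few lines of C. The function name and the decision to fall back to the first result when all three disagree are illustrative choices, not a real flight implementation:

```c
#include <stdint.h>

/* Illustrative 3-way majority voter: given the same computation's result
 * from three independent processors, return the value at least two agree on.
 * Real flight code would also raise an anomaly when all three differ. */
static int32_t vote3(int32_t a, int32_t b, int32_t c) {
    if (a == b || a == c) return a;  /* a agrees with at least one other */
    if (b == c) return b;            /* a is the outlier */
    return a;                        /* all three differ: best guess */
}
```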
Memory scrubbing. A background process continuously walks through memory, checks each word against its error-correcting code (ECC), and fixes single-bit errors before they accumulate into uncorrectable multi-bit errors. This runs constantly — the scrubber sweeps the full memory on a cycle fast enough that a single flipped bit is almost always corrected before a second flip can land in the same word.
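The idea can be sketched in software using triple storage instead of hardware ECC. The type and function names here are invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int32_t a, b, c;   /* three copies of one value */
} TripleWord;

/* Illustrative scrub pass: majority-vote each triple and rewrite all three
 * copies, so one flipped copy is repaired before a second flip in the same
 * word could make the value unrecoverable. */
static void scrub(TripleWord *mem, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t v = mem[i].a;                   /* trust a unless outvoted */
        if (mem[i].b == mem[i].c) v = mem[i].b; /* b and c outvote a */
        mem[i].a = mem[i].b = mem[i].c = v;     /* rewrite all copies */
    }
}
```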
Watchdog timers. Hardware timers that must be regularly reset by software. If the software hangs (perhaps due to a radiation-induced lockup), the timer expires and triggers a hardware reset. The software must be designed to survive unexpected reboots at any point in its execution — any computation might be interrupted and restarted.
// Simplified space software patterns

// Watchdog — must be kicked regularly or hardware resets
void main_loop(void) {
    while (1) {
        kick_watchdog();  // Reset the timer — we're alive
        // All operations must complete within the watchdog timeout
        read_sensors();
        kick_watchdog();
        run_attitude_control();
        kick_watchdog();
        check_thermal_limits();
        kick_watchdog();
        process_ground_commands();
        kick_watchdog();
    }
}

// Critical value stored with redundancy
typedef struct {
    int32_t value_a;  // Primary copy
    int32_t value_b;  // Redundant copy
    int32_t value_c;  // Third copy for voting
} RedundantInt;

int32_t read_redundant(RedundantInt *r) {
    // Majority vote — tolerates one corrupted copy
    if (r->value_a == r->value_b) return r->value_a;
    if (r->value_a == r->value_c) return r->value_a;
    if (r->value_b == r->value_c) return r->value_b;
    // All three differ — flag anomaly
    raise_anomaly(MEMORY_CORRUPTION);
    return r->value_a;  // Best guess
}
The Testing Is the Product
NASA's Jet Propulsion Laboratory estimates that testing and verification consume 60-80% of the total development effort for flight software — not 60% of the schedule, but 60% or more of the total cost. The testing is more expensive than writing the code because it has to establish correctness to a degree that normal software testing doesn't approach.
Every code path must be tested. Not 'high code coverage' in the way web developers mean it — literally every path through every function, including error paths, timeout paths, and paths that handle hardware failures. Branch coverage of 100% is the starting point, not the goal.
Beyond unit tests, space software undergoes hardware-in-the-loop testing (running the real software on the real flight hardware, simulating the space environment), long-duration stress testing (running for months to find timing-dependent bugs), and fault injection testing (deliberately corrupting memory, killing processors, and severing communication links to verify the software recovers).
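Fault injection can be as simple as flipping one bit in each redundant copy of a value and confirming the recovery logic still produces the right answer. A toy version in C — all names are invented, and the voter mirrors the redundant-storage pattern shown earlier:

```c
#include <stdint.h>

typedef struct { int32_t a, b, c; } Triple;

static int32_t vote(const Triple *t) {
    if (t->a == t->b || t->a == t->c) return t->a;
    return t->b;   /* b == c, or all differ (real code would flag an anomaly) */
}

/* Fault-injection style test (illustrative): corrupt each copy in turn by
 * flipping one bit, and check that the vote still recovers the original. */
static int survives_single_fault(int32_t value, int bit) {
    Triple t;
    int32_t *copies[3] = { &t.a, &t.b, &t.c };
    for (int i = 0; i < 3; i++) {
        t.a = t.b = t.c = value;
        *copies[i] ^= (int32_t)1 << bit;   /* inject a single-event upset */
        if (vote(&t) != value) return 0;   /* recovery failed */
    }
    return 1;                              /* survived all three injections */
}
```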
Formal verification is increasingly used for the most critical components. Instead of testing that code works for specific inputs, formal verification mathematically proves that the code works for all possible inputs. It's expensive and slow, but for code that controls a spacecraft's attitude (orientation) or manages its propulsion, the cost is justified.
What Goes Wrong Anyway
Despite this rigor, space software fails. The failures are instructive because they reveal the limits of even the most careful engineering.
- Mars Climate Orbiter (1999) was lost because one team's software reported thruster impulse in imperial units (pound-force seconds) while another's expected metric (newton-seconds). Each piece of software was correct in isolation — the interface specification was wrong. No amount of testing catches a wrong specification.
- Ariane 5 Flight 501 (1996) was destroyed because a 64-bit floating-point value was converted to a 16-bit signed integer and overflowed. The code was reused from Ariane 4, where the value could never exceed the 16-bit range. The code was correct for Ariane 4 and catastrophically wrong for Ariane 5.
- Mars Polar Lander (1999) probably crashed because a sensor vibration during leg deployment was misinterpreted as ground contact, causing the engines to shut off at 40 meters altitude. A timing-dependent sensor interpretation issue that testing didn't catch because the test didn't perfectly replicate the vibration characteristics.
- Hubble's initial mirror flaw (1990) wasn't a software bug — the mirror was ground to the wrong shape because of a miscalibrated testing instrument. The testing tool itself had a bug.
The common thread: the code was correct according to its specification, but the specification didn't match reality. This is the hardest kind of bug to prevent because it exists in the gap between the model and the world.
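The Ariane 5 failure mode above, an unchecked narrowing conversion, is cheap to guard against in code once you know to look for it. A sketch with an invented function name, not the actual flight software:

```c
#include <stdint.h>

/* Illustrative guard: convert a wide value to a 16-bit integer only after
 * checking that it fits, instead of letting the narrowing silently overflow.
 * The caller must handle the out-of-range case explicitly. */
static int narrow_to_i16(double value, int16_t *out) {
    if (value < INT16_MIN || value > INT16_MAX)
        return -1;              /* out of range: report, don't crash */
    *out = (int16_t)value;
    return 0;
}
```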
What Earth Software Can Learn
Most of us don't write space software. But some of these practices translate directly to building reliable systems on Earth.
Design for restart. Space software assumes it can be rebooted at any time and must recover to a known-good state. Web services should have this property too — if your server crashes and restarts, does it recover without manual intervention? Does it resume processing from where it left off, or does it lose work? Defensive engineering patterns that space engineers consider mandatory are often optional in web development, but they shouldn't be.
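One way to get the restart property is to checkpoint progress after each completed unit of work, so a reboot resumes rather than starts over. A minimal sketch in C, with invented names and a plain file standing in for persistent storage (real code would also fsync and version the checkpoint):

```c
#include <stdio.h>
#include <stdint.h>

/* Read the index of the next unprocessed item; 0 if no checkpoint exists. */
static uint32_t load_checkpoint(FILE *f) {
    uint32_t next = 0;
    rewind(f);
    if (fread(&next, sizeof next, 1, f) != 1) next = 0;  /* fresh start */
    return next;
}

static void save_checkpoint(FILE *f, uint32_t next) {
    rewind(f);
    fwrite(&next, sizeof next, 1, f);
    fflush(f);
}

/* Process items [checkpoint, total); returns how many were done this run.
 * A crash between items loses at most the item in flight. */
static uint32_t run(FILE *ckpt, uint32_t total) {
    uint32_t done = 0;
    for (uint32_t i = load_checkpoint(ckpt); i < total; i++) {
        /* process_item(i) would go here */
        done++;
        save_checkpoint(ckpt, i + 1);   /* checkpoint AFTER the work commits */
    }
    return done;
}
```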
Test failure modes, not just happy paths. Space software testing specifically injects failures: kill a process, corrupt memory, drop network connections, return error codes from every system call. Most web application testing focuses on whether features work, not on whether the system handles failures gracefully.
Redundancy for critical data. If your application stores data that's expensive to recreate — financial records, user content, configuration state — store it redundantly and verify it regularly. Database replication is the obvious example, but in-application redundancy (checksums, validation, periodic integrity checks) catches corruption that replication propagates.
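An in-application integrity check can be as small as a checksum stored next to each record and verified on every read. A sketch using the FNV-1a hash (a real, public-domain algorithm) over an invented record layout:

```c
#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a hash: standard offset basis and prime. */
static uint32_t fnv1a(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

typedef struct {
    uint8_t  payload[64];
    uint32_t checksum;      /* fnv1a(payload), set at write time */
} Record;

static void record_seal(Record *r) {
    r->checksum = fnv1a(r->payload, sizeof r->payload);
}

/* Verify on read: detects silent corruption instead of propagating it. */
static int record_ok(const Record *r) {
    return r->checksum == fnv1a(r->payload, sizeof r->payload);
}
```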
Watchdogs for critical processes. If a process must not hang, monitor it. Not just 'is the process running' but 'is the process making progress.' Health checks that verify the system is actually functioning — processing requests, updating state, meeting deadlines — catch failures that process-level monitoring misses.
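A progress-based health check needs only a counter the worker increments and a monitor that compares snapshots; a process that is alive but stuck then fails the check exactly like a dead one. An illustrative sketch (names are invented, and a real system would make the counter atomic and shared between processes):

```c
#include <stdint.h>

typedef struct {
    uint64_t progress;       /* incremented by the worker per unit of work */
    uint64_t last_seen;      /* snapshot taken by the monitor */
} Heartbeat;

static void worker_step(Heartbeat *hb) {
    hb->progress++;          /* called after each completed work item */
}

/* Returns 1 if the worker made progress since the last check, else 0.
 * "Running but hung" and "dead" look identical here — which is the point. */
static int monitor_check(Heartbeat *hb) {
    int alive = hb->progress != hb->last_seen;
    hb->last_seen = hb->progress;
    return alive;
}
```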
The New Space Software
The commercial space industry (SpaceX, Rocket Lab, Planet) is challenging some of these traditions. SpaceX uses Linux and C++ on its Falcon 9 flight computers — heresy by traditional aerospace standards. Planet Labs operates hundreds of small satellites and treats them more like a distributed system than bespoke hardware: if a satellite fails, the constellation compensates.
This shift reflects a broader question: how much reliability do you actually need? A $2.5 billion Mars rover justifies five years of testing. A $500,000 communications satellite that's one of hundreds in a constellation justifies much less. The engineering practice should match the risk profile, not blindly follow traditions developed for a different era.
But the core lesson persists regardless of budget: software that can't be physically accessed after deployment must be more reliable than software that can. Whether you're launching a spacecraft or deploying to an IoT fleet or running an edge computing network, the principle is the same — if you can't SSH in and fix it, the software has to handle it. That mindset, applied judiciously, makes all software better.