Resilient Frontend

A fault-tolerant system does not prevent faults. It prevents faults from becoming failures.

— Source: The Art of Fault-Tolerant System Design

A Fresh Start

I got a new job and a new project to work on. It looked quite simple. The business team needed a three-step workflow automated:

  1. Read Excel Sheet
  2. Make a GET request
  3. Save new data in the Excel Sheet

Along with the core logic, I needed to create a GUI for the team to use the tool. I completed the task in no time. I used Python for the business logic, PySide6 for the GUI, Pandas for Excel processing, and the Requests library for API calls. I even used PyInstaller to ship it as a standalone Windows application.

I thoroughly reviewed the UI and UX. I used the highly readable Inter font, smooth icons, and a tiny local database for preferences. The end product was user-friendly, simple, and exactly what the team needed.

Hole In The Pot

On my machine, it worked flawlessly. The process completed in seconds. But then... the business team called. The tool wasn't working.

I had assumed that if it worked on my machine, it would work on theirs. I was wrong. The failures were:

The Core Issue

After reading some insights from AWS developers, I identified the bottlenecks:

  1. Huge Payloads: Sending 80+ rows in a single request.
  2. No Retries: The system just gave up on the first failure.
  3. No Manual Failsafe: A human had no way to recover without restarting everything.
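
To make the first bottleneck concrete, here is a minimal sketch of splitting a large set of rows into smaller batches, so that one failed request costs only a few rows instead of all 80+. The helper name `chunked` and the batch size of 20 are assumptions for illustration, not values from the original tool.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# 85 hypothetical row IDs split into batches of 20 (assumed batch size).
batches = list(chunked(list(range(85)), 20))
print([len(batch) for batch in batches])  # prints [20, 20, 20, 20, 5]
```

A smaller batch also makes the other two problems easier to handle, because a retry or a manual recovery only has to redo the batch that failed.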

The Real Engineering

To fix this, I moved away from "perfect world" assumptions and designed for failure:
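
One way to address the remaining two bottlenecks is sketched below: retries with exponential backoff, plus a small checkpoint file so a rerun skips batches that already succeeded. This is an illustration under stated assumptions, not the tool's actual code. The names `call_with_retries`, `load_done`, and `process_batches`, the JSON checkpoint format, and the default of three attempts are all hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterable, Set


def call_with_retries(fn: Callable[[], object], attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Call fn(); after a failure wait base_delay * 2**n seconds, then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # assumption: a real tool would catch only transient errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)


def load_done(checkpoint: Path) -> Set[int]:
    """Return the batch indexes recorded as finished, or an empty set."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def process_batches(batches: Iterable[object],
                    handle: Callable[[object], object],
                    checkpoint: Path,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Run handle(batch) for each batch, skipping ones finished in an earlier run."""
    done = load_done(checkpoint)
    for index, batch in enumerate(batches):
        if index in done:
            continue  # already completed before the last crash or restart
        call_with_retries(lambda b=batch: handle(b), sleep=sleep)
        done.add(index)
        # Persist progress after every batch so a rerun resumes from here.
        checkpoint.write_text(json.dumps(sorted(done)))
```

Here `handle` would wrap the GET request for one batch. With the Requests library, narrowing the `except` clause to `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout` would avoid retrying errors that can never succeed. If a batch still fails after the last attempt, the exception stops the run, and the checkpoint lets a person rerun the tool without redoing the finished batches.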

The Lesson

Stability comes from designing for failure, not avoiding it.

— Source: Release It!, Michael T. Nygard

Designing for failure saved me from debugging third-party API failures that were out of my control. The work was "simple" in theory, but implementing these real-world failsafes turned a fragile script into a stable engineering solution.