Introduction

Being based in the Bay Area and managing a complex home infrastructure – including Kubernetes clusters, VMs, and Proxmox nodes – I’ve faced the significant challenge of rolling blackouts. Here’s my journey, including the lessons learned, the solutions implemented, and my plans for the future.

The 2 AM Wake-Up Call

My UPS alarm went off at 2 AM, and with only 15 minutes of power left, I failed to turn off my nodes in time, leading to an unsafe shutdown. I knew I needed to develop a more strategic solution.

Manual Solution: The First Success

I wrote an Ansible playbook that drained the Kubernetes nodes and safely shut them down. During the next blackout, this manual method worked seamlessly, shutting down everything in a controlled manner. But I wanted something more robust.

Transition to Low-Power Mode: Essentials Stay On

Building on that initial success, I added a low-power mode that kept essential services, such as Home Assistant and the Omada Controller, running on a low-powered Celeron node. A Proxmox VM running OPNsense served as my backup firewall/router and provided critical connectivity. This allowed quick WiFi restoration and light control, making the setup more resilient.

Custom Draining Explained

Here’s why I used specific options in the draining script:

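kubectl drain "$node" --force --delete-emptydir-data --ignore-daemonsets --pod-selector='essential!=true'

The --force option lets the drain proceed even when a pod is not managed by a controller, and --delete-emptydir-data (called --delete-local-data before kubectl 1.20) allows pods that use emptyDir volumes to be evicted, at the cost of deleting that data. Together they keep the drain from getting stuck. The --ignore-daemonsets option skips DaemonSet-managed pods, which cannot be evicted anyway. The pod filter has to be --pod-selector: on kubectl drain, --selector selects nodes and cannot be combined with a node name. With 'essential!=true', every pod is evicted except those labeled essential=true.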
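For illustration, here is a minimal sketch of what a script like custom-drain.sh could look like. It is not the actual script: the KUBECTL and DRAIN_TIMEOUT variables and the 120-second default are assumptions made for this example. It uses --delete-emptydir-data (the kubectl 1.20+ name for --delete-local-data) and --pod-selector (the drain flag that filters pods rather than nodes), and adds --timeout so a stuck eviction cannot eat the whole UPS runtime.

```shell
#!/bin/sh
# Hypothetical sketch of custom-drain.sh; names and defaults are assumptions.
set -eu

# Allow overriding the binary (e.g. KUBECTL=echo for a dry run).
KUBECTL="${KUBECTL:-kubectl}"
# Give up on the drain after this long instead of waiting forever.
DRAIN_TIMEOUT="${DRAIN_TIMEOUT:-120s}"

# Evict every pod on the given node except those labeled essential=true.
drain_node() {
  "$KUBECTL" drain "$1" \
    --force \
    --delete-emptydir-data \
    --ignore-daemonsets \
    --pod-selector='essential!=true' \
    --timeout="$DRAIN_TIMEOUT"
}

# Run only when a node name is passed: ./custom-drain.sh <node>
if [ "$#" -gt 0 ]; then
  drain_node "$1"
fi
```

Running it with KUBECTL=echo prints the command instead of executing it, which is a cheap way to check the flags before the next outage. Note that drain also cordons the node, so pods labeled essential=true keep running there but nothing new gets scheduled.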

Low-Power Playbook Example

Here’s an excerpt from the playbook designed for low-power mode:

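- hosts: workers
  tasks:
    - name: Custom drain worker nodes
      # Assumes kubectl and a kubeconfig are available on the host running this task
      command: /path/to/custom-drain.sh {{ inventory_hostname }}
    - name: Shut down the drained worker nodes
      # Delay by one minute so the task returns before the SSH connection drops
      command: shutdown -h +1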

Future Plans: Full Automation

  1. Monitor Power Status: Create a script that queries the NUT (Network UPS Tools) server and watches for power state changes.
  2. Trigger Playbooks: Run the low-power playbook automatically when an outage starts.
  3. Reverse the Process: Develop a playbook that restores the system once power is back.
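The three steps above could be sketched roughly as follows. This is only an assumption about how the automation might look, not a finished implementation: the UPS name myups@localhost, both playbook file names, and the 30-second poll interval are hypothetical. It relies on NUT's upsc client and the ups.status variable, whose value is a space-separated list of flags such as OL (on line power) and OB (on battery).

```shell
#!/bin/sh
# Hypothetical power-watch.sh sketch; UPS name and playbook paths are assumptions.
set -eu

UPS="${UPS:-myups@localhost}"
POLL_SECONDS="${POLL_SECONDS:-30}"
LOW_POWER_PLAYBOOK="${LOW_POWER_PLAYBOOK:-low-power.yml}"
RESTORE_PLAYBOOK="${RESTORE_PLAYBOOK:-restore.yml}"

# has_flag FLAG STATUS: true when STATUS contains FLAG as a whole word.
has_flag() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# next_action MODE STATUS: print enter-low-power, restore, or none.
# MODE is "normal" or "low-power"; STATUS is the ups.status string.
next_action() {
  if [ "$1" = normal ] && has_flag OB "$2"; then
    echo enter-low-power
  elif [ "$1" = low-power ] && has_flag OL "$2"; then
    echo restore
  else
    echo none
  fi
}

# Poll the UPS and run a playbook on each power state change.
watch_power() {
  mode=normal
  while :; do
    # An unreachable NUT server yields UNKNOWN, which triggers nothing.
    status=$(upsc "$UPS" ups.status 2>/dev/null || echo UNKNOWN)
    case $(next_action "$mode" "$status") in
      enter-low-power) ansible-playbook "$LOW_POWER_PLAYBOOK" && mode=low-power ;;
      restore)         ansible-playbook "$RESTORE_PLAYBOOK" && mode=normal ;;
    esac
    sleep "$POLL_SECONDS"
  done
}

# Start the loop only when asked to: ./power-watch.sh run
if [ "${1:-}" = run ]; then
  watch_power
fi
```

Keeping the decision in next_action, separate from the polling loop, makes it possible to check the logic without a UPS attached. The mode only changes when a playbook succeeds, so a failed run is retried on the next poll. A real version would probably also wait a minute or two on battery before acting, so that brief flickers do not trigger a shutdown.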

Conclusion

From an initial struggle to a working manual solution, and now to a plan for full automation, I’ve turned the challenge of the Bay Area’s blackouts into an opportunity to improve my home infrastructure. For those facing similar challenges, this journey shows that with some ingenuity and the right technical approach, you can build a resilient system that keeps its essentials running through an outage.