Power & ISP Outages
While at work, I received a notification from my power company at 14:35 local time that there was an outage in my area, followed by another at 15:50 saying the power was back up. I also received a notification from my ISP reporting an outage at 15:45 local time due to the power outage. That seemed weird, but after I got home, I could see multiple occurrences of the UDM losing connectivity to my ISP.
Survey the Situation
Upon returning home, I find that my network is up, but my Grafana dashboards are not reporting current data. The data pipeline is clearly broken somewhere between the weather station and Timescale.
I begin by checking that my Pi, my NAS, and all of my containers are up. They all are. From the uptime reported by the NAS, Pi, and gateway, it looks like the UPS held through the outage and nothing ever shut down.
I’m seeing some weird stuff with Pi-hole where I didn’t have many queries throughout last night into today, then a huge spike when I got home. That’s likely unrelated and something else I’ll need to address later.
I had set up Network UPS Tools (NUT) on a Protectli Vault to orchestrate a graceful shutdown in the event of a power outage, but it didn’t fire. That’s a story for another day. The NUT script on the Vault doesn’t show a log entry where it detected being on battery, but I can see in PeaNUT on the NAS that the battery did discharge and is now charging back up. I’ll need to address that issue later.
So why is my dashboard not displaying current weather data? I’m able to connect to Timescale running in the Portainer container using pgAdmin. I see that the latest data from my weather station was inserted right before the power outage was reported. I can see on my Pi that my API services are running and that they are attempting to write to the database. I can also see in my NWS scraper logs that I’ve got successful scrapes of forecast data and severe weather alerts.
Interestingly, forecast data is being inserted whenever the log shows a successful run, so writes to the database itself are working. I can connect to the admin GUI of the weather station console and see that it is online and still configured properly.
I check my weather station console and can see that the time is accurate, and that it is showing logs of observations reported from the weather station.
I look at my flows in the UDM and see that the weather station console is pushing data to the API that the Pi hosts.
I can use my manual snow logging API that lives on the Pi to record a test event to the manual.snow_events table on Timescale. This works and the Snow Events dashboard in Grafana shows the test entry.
Taking Action
I’ll restart the Ecowitt ingest service on the Pi, which is an API that catches data from my weather station and inserts it into my database tables, to see if that will do the trick. Instead of waiting for the usual five-minute interval for the console to push, I’ll tell the console to push every sixty seconds so I can troubleshoot faster.
I can see in the UDM that the weather station pushed the data to the Pi and that the Pi caught it and logged a 200 response, but nothing is appearing either in the appropriate Timescale tables or in Grafana. I’ll stop the Ecowitt ingest service on the Pi, then start it instead of restarting. Then, I’ll modify my api.py file to log the raw requests being caught to a JSON file. This code already existed from testing and just needed to be uncommented.
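For anyone wondering what that kind of raw request logging can look like, here is a minimal sketch. It is not the actual code from api.py. The function name `log_raw_post` and the one-JSON-object-per-line layout are assumptions for illustration, and it only relies on the Python standard library, so it should drop into whatever framework the handler uses.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Default log file. The one-record-per-line layout is an assumption.
RAW_LOG = Path("raw_posts.json")


def log_raw_post(body: bytes, headers: dict, path: Path = RAW_LOG) -> dict:
    """Append one received request to the log file as a single JSON line.

    Hypothetical helper: call it from the request handler before any
    parsing, so that empty or truncated bodies are still recorded.
    """
    record = {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "content_length": len(body),
        "headers": dict(headers),
        # Decode leniently so a garbled body is logged instead of raising.
        "body": body.decode("utf-8", errors="replace"),
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Recording `content_length` next to the body makes an empty or suspiciously short post easy to spot when scrolling through the file.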
In the UDM, when I look at the weather station console, I see it has Poor connectivity. I can also see when it went offline. Because the console isn’t connected to a UPS, it lost power, and the UDM shows it was not connected to Wi-Fi from 14:41 to 15:20.
I’ll take a look at the service logs using the command
journalctl -u ecowitt_ingest.service
When I get to today’s entries, I see no interruptions aside from my own restarts of the service. I’m also seeing that the raw posts from the weather station don’t appear to contain any data. So it seems the weather station is sending packets, but there’s no data in them. Because the data isn’t encrypted, I’m going to install tcpdump on the Pi and capture some packets to confirm whether they really are empty. I’ll install with the command
sudo apt install tcpdump
then listen with
sudo tcpdump -n port 44380 -v
because I’m sending to and listening on port 44380 and -v gives me verbose output.
I’m not able to see the actual contents of the packets, but I can see packets coming in. They don’t appear to be empty, though judging by their size they may not be complete. I’m still not seeing any data in my JSON file, and nothing in the database or Grafana.
I’m going to reboot the Pi as well as the weather station console and see if that will do the trick. An issue that I didn’t consider was that I have a Pi-hole running on this Pi, but didn’t set a backup DNS server. That means that when I bring the Pi down, I lose DNS resolution and my whole network breaks. For now, I’m going to add a backup DNS server outside of my network to get the network back up.
Once the Pi is back up, I can see some partial data in my raw_posts.json from the Ecowitt ingest service, but I’m still not seeing anything in Timescale or Grafana. It looks like I’m only catching partial data from the weather station console, which could cause issues with the script. Since I know my ingest service is up and should be functioning properly, I’m going to reboot the weather station console one more time, now that it will be able to connect and resolve DNS.
I built in a /health endpoint on the Ecowitt ingest API, which returns the current timestamp and “OK” when it’s up, and I can get that response. This definitely seems to be an issue of the weather station console sending incomplete data which is causing an error. I’ll go reboot it one more time and hopefully it’ll come back up properly. And it did. I also noticed that the console had switched to UTC and had reset the date to 2024-01-01, so I changed it back to my local time zone and the proper date.
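As a rough illustration of a /health endpoint like that, here is a minimal sketch using only Python’s built-in http.server module. The actual ingest API may well use a different framework. The names `health_payload`, `HealthHandler`, and `make_server`, along with the JSON response shape, are assumptions and not the code running on the Pi.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_payload() -> dict:
    """Build the health response: a status string plus the current time."""
    return {"status": "OK", "timestamp": datetime.now(timezone.utc).isoformat()}


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health and returns 404 for every other path."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(health_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str, port: int) -> HTTPServer:
    """Create the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), HealthHandler)
```

Running `make_server("0.0.0.0", 44380).serve_forever()` would answer on the same port the post uses for the ingest service. Keep in mind that a check like this only proves the process is up and responding. It says nothing about whether the payloads arriving are complete, which is exactly the gap that showed up here.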
Resolution
There wound up being a few issues here. When the weather station console lost power, its date and time were thrown off when it came back online. Then, after it initially came back up, it was sending incomplete data. So I had incomplete data being inserted into the database with the wrong timestamps attached, which would explain why nothing new showed up at the current time in Timescale or Grafana.
Instead of trying to edit the timestamps in the table, I simply deleted the data with the bad timestamps with the following command:
DELETE FROM ws_observations WHERE time::date = '2024-01-01';
At the end of the day, the rows with the bad timestamps will be a drop in the bucket compared to my total amount of data.
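Rows like these could in principle be caught before they ever reach the database. The following is a minimal sketch of such a guard, not something the ingest service currently does. The field names (`dateutc`, `tempf`, `humidity`), the timestamp format, and the ten-minute tolerance are all assumptions for illustration and would need to match what the console really sends.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Assumed field names and tolerance; adjust to the real payload.
REQUIRED_FIELDS = ("dateutc", "tempf", "humidity")
MAX_CLOCK_DRIFT = timedelta(minutes=10)


def check_observation(fields: dict, now: Optional[datetime] = None) -> List[str]:
    """Return a list of problems; an empty list means the payload looks sane."""
    now = now or datetime.now(timezone.utc)
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if fields.get(name) in (None, "")
    ]
    raw_time = fields.get("dateutc")
    if raw_time:
        try:
            # Assumes a UTC timestamp formatted like "2024-01-01 00:03:00".
            reported = datetime.strptime(raw_time, "%Y-%m-%d %H:%M:%S")
            reported = reported.replace(tzinfo=timezone.utc)
        except ValueError:
            problems.append(f"unparseable timestamp: {raw_time!r}")
        else:
            # A console that reset its clock will be far from server time.
            if abs(now - reported) > MAX_CLOCK_DRIFT:
                problems.append(f"timestamp {raw_time} is too far from server time")
    return problems
```

With a check like this, a post stamped 2024-01-01 or missing expected fields could be logged and skipped instead of inserted, which would avoid the cleanup DELETE afterwards.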
In total, this caused roughly six hours of downtime with no data flowing into the database, and it took me about three hours to troubleshoot once I got home and had access.
Going Forward…
- I’m probably going to get a little UPS for the Ecowitt console now that I know it’s sensitive to losing power.
- I need to finish my NUT orchestration so that all of my devices shut themselves down gracefully. Once I can accomplish that, I can work on bringing everything back up after power restores.
- I need to learn how to use tcpdump properly at a time when I’m not mid-outage and trying to move quickly.
- I need to set up my second Pi and have both Pis running Pi-hole so I can fail over if one goes down.