TX1 on 28 Dec 2025 18:52:27 (UTC-05:00)
Resolved on 29 Dec 2025 21:22:50 (UTC-05:00)

Incident Summary

On December 28, 2025, the Dallas, TX data center experienced an outage affecting TX1. The initial outage began at approximately 6:06 PM Eastern Time due to a brief utility interruption that caused HVAC systems to fail. This led to elevated temperatures and automatic shutdowns of critical equipment.

After the facility was stabilized, TX1 remained offline because of a failed SFP on the switch, for a total of 16 hours and 48 minutes. The data center took an extended period to acknowledge the issue and replace the faulty component, resulting in prolonged downtime for this server.

We have opened an investigation regarding the data center’s delay in restoring network connectivity for TX1 after the HVAC was restored. Additional updates will be provided as we work through our internal processes for evaluating incidents such as this.

Timeline of Events


- 6:06 PM ET – NOC observed multiple servers going offline at the Dallas location. Investigation began immediately.
- 6:13 PM ET – High temperatures detected on core network devices. Monitoring continued.
- 6:43 PM ET – Elevated temperatures persisted. HVAC failure suspected. Facility contacted; on-call technicians dispatched.
- 6:52 PM ET – Facilities personnel arrived and began restoring CRAC and rooftop HVAC units. Cooling estimated to take ~30 minutes.
- 7:25 PM ET – Facility still warm (~89°F). Estimated 15 more minutes until safe for equipment power-up.
- 7:43 PM ET – Facility reached ~80°F. Network equipment began powering up gradually.
- 8:21 PM ET – All network devices restored. Servers being powered up in controlled sequence.
- 9:27 PM ET – Most servers restored. However, TX1 remained offline due to a faulty SFP on the switch. Extended downtime continued until the SFP was replaced the following morning.
- 11:32 AM ET (Dec 29) – TX1 fully restored after the SFP replacement and verification.
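
The facility's own updates are stamped in Pacific time, while the timeline above is in Eastern time. A minimal sketch of that conversion follows. It assumes fixed standard-time offsets (UTC-08:00 and UTC-05:00, which hold in late December), and `pacific_to_eastern` is a hypothetical helper name used only for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed standard-time offsets, valid for 28 Dec 2025 (no DST in effect).
PST = timezone(timedelta(hours=-8), "PST")
EST = timezone(timedelta(hours=-5), "EST")


def pacific_to_eastern(hhmm: str, day: str = "2025-12-28") -> str:
    """Convert a 24-hour Pacific timestamp (e.g. '15:06') to a 12-hour Eastern label."""
    local = datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M").replace(tzinfo=PST)
    # Shift to Eastern, format as 12-hour time, and drop any leading zero on the hour.
    return local.astimezone(EST).strftime("%I:%M %p ET").lstrip("0")


# Timestamps taken from the facility's updates.
for stamp in ("15:06", "15:13", "15:43", "15:52"):
    print(stamp, "PT ->", pacific_to_eastern(stamp))
```

The first facility update (15:06 Pacific) maps to the 6:06 PM ET entry at the top of the timeline.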

Impact


- Affected Server: TX1
- Downtime: ~17 hours (6:37 PM ET Dec 28 – 11:32 AM ET Dec 29)
- Customer Impact: Extended loss of service for TX1
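
The downtime figure can be checked directly from the window quoted above. This is a small sketch, not part of any monitoring tooling. It assumes both timestamps are in the same time zone (Eastern), and `downtime` is a hypothetical helper name:

```python
from datetime import datetime


def downtime(start: str, end: str) -> tuple[int, int]:
    """Return (hours, minutes) between two 'YYYY-MM-DD HH:MM' timestamps in the same zone."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


# Window from the Impact section: 6:37 PM ET Dec 28 to 11:32 AM ET Dec 29.
hours, minutes = downtime("2025-12-28 18:37", "2025-12-29 11:32")
print(f"{hours} h {minutes} min")  # 16 h 55 min
```

The window works out to 16 hours 55 minutes, which rounds to the ~17 hours stated.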

Root Cause


The outage had two main contributing factors:

1. Data Center HVAC Failure: A utility interruption caused the HVAC systems to fail, which led to elevated temperatures and automatic shutdowns of critical equipment.
2. Faulty SFP on Switch for TX1: After temperatures normalized, TX1 remained offline because the SFP module on the switch had failed. We asked the data center to investigate the downed uplinks at 10:07 PM ET on Dec 28, and the faulty hardware was not replaced until the following morning, more than 13 hours later, resulting in prolonged downtime.

Additional notes from the facility:

- Core networking and servers shut down automatically due to high temperatures
- Some servers experienced temporary connectivity issues during recovery

A full Root Cause Analysis (RCA) will be provided by the facility.

Resolution


- CRAC and rooftop HVAC units restored and operating normally
- Facility temperature returned to safe levels (~80°F)
- Network equipment powered on first, followed by servers
- TX1 restored after SFP replacement (~17 hours total downtime)
Monitoring on 29 Dec 2025 11:31:09 (UTC-05:00)
TX1 is now online. We expect VMs to be online momentarily.
Update on 29 Dec 2025 09:23:18 (UTC-05:00)
The data center replied that our OOB is online, which is not what we asked them to look into. We have clarified what we are experiencing.

We are sorry for the inconvenience.
Update on 28 Dec 2025 22:07:55 (UTC-05:00)
After a reboot, the network uplinks to this server are showing as down in the OS and OOB (out-of-band) management. We have asked the data center to look into this.
In progress on 28 Dec 2025 21:55:20 (UTC-05:00)
TX1 seems to be locked up after the switches came back online. We are rebooting TX1 to hopefully resolve this issue.
Identified on 28 Dec 2025 19:08:12 (UTC-05:00)
From the datacenter:


At approximately 15:06 Pacific time our NOC began to see machines dropping offline in our Dallas, TX location. We are investigating.

UPDATE 15:13 Pacific time. We are seeing high temperatures on core network devices and are continuing to investigate.

UPDATE 15:43 Pacific time. We are seeing high temperatures across many devices that are still online. We suspect that there is an HVAC issue with the facility, and we have reached out to the facility. We are also ensuring our own on-call technicians are en route. Further updates will be posted when information is available. All appropriate resources have been engaged.

UPDATE 15:52 Pacific time. Facilities personnel have arrived at the datacenter and CRAC units are being re-energized, as well as additional rooftop units. It will take approximately 30 minutes to cool down the facility to levels that are acceptable for servers to power up without damage. Further updates will be posted when information is available.
Update on 28 Dec 2025 19:01:28 (UTC-05:00)
The node's out-of-band management is not responding and is timing out. We have asked the datacenter to investigate whether there are any issues with power.
Investigating on 28 Dec 2025 18:52:27 (UTC-05:00)
We are aware of an outage pertaining to the TX1 node. We are investigating and will update once we know more.