APC UPS issue or..?

We had a storm here, electricity went down, generator kicked in…
In theory none of that should matter, because the NAS is connected to an APC UPS.

I recently tested this by disconnecting the rack from the power outlet, and all went fine…

But why did the NAS freeze this time? I had to force a power-down via IPMI, and then got the following alarm:
" New alerts:

  • Sensor: ‘Vcpu’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.09 Volts ; Threshold = 0.45 Volts)
  • Sensor: ‘Vcpu’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.09 Volts ; Threshold = 0.44 Volts)
  • Sensor: ‘VDIMM’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.12 Volts ; Threshold = 0.97 Volts)
  • Sensor: ‘VDIMM’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.12 Volts ; Threshold = 0.95 Volts)
  • Sensor: ‘PVCCSRAM’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.10 Volts ; Threshold = 0.66 Volts)
  • Sensor: ‘PVCCSRAM’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.10 Volts ; Threshold = 0.65 Volts)
  • Sensor: ‘Sensor #255’ had an ‘Assertion Event’ (IERR)

Current alerts:

  • Sensor: ‘Vcpu’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.09 Volts ; Threshold = 0.45 Volts)
  • Sensor: ‘Vcpu’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.09 Volts ; Threshold = 0.44 Volts)
  • Sensor: ‘VDIMM’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.12 Volts ; Threshold = 0.97 Volts)
  • Sensor: ‘VDIMM’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.12 Volts ; Threshold = 0.95 Volts)
  • Sensor: ‘PVCCSRAM’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.10 Volts ; Threshold = 0.66 Volts)
  • Sensor: ‘PVCCSRAM’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.10 Volts ; Threshold = 0.65 Volts)
  • Sensor: ‘Sensor #255’ had an ‘Assertion Event’ (IERR)"

We may never know the real reason, but let me put this possibility in your mind: all the sensor data looks to be pointing towards a sudden power loss.
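To put numbers on that, here is a rough Python sketch that parses the voltage alerts from the alarm above and shows how far each reading sits below its threshold. The regex and the function name `parse_voltage_alerts` are my own, written only for this alert format; nothing here comes from the BMC vendor. Every rail reading a small fraction of its threshold is what you would expect if the rails were simply unpowered, though it cannot tell you why.

```python
import re

# Alert lines copied from the IPMI alarm quoted in the first post.
ALERTS = """\
Sensor: ‘Vcpu’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.09 Volts ; Threshold = 0.45 Volts)
Sensor: ‘Vcpu’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.09 Volts ; Threshold = 0.44 Volts)
Sensor: ‘VDIMM’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.12 Volts ; Threshold = 0.97 Volts)
Sensor: ‘VDIMM’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.12 Volts ; Threshold = 0.95 Volts)
Sensor: ‘PVCCSRAM’ had an ‘Assertion Event’ (Lower Critical - going low ; Sensor Reading = 0.10 Volts ; Threshold = 0.66 Volts)
Sensor: ‘PVCCSRAM’ had an ‘Assertion Event’ (Lower Non-recoverable - going low ; Sensor Reading = 0.10 Volts ; Threshold = 0.65 Volts)
Sensor: ‘Sensor #255’ had an ‘Assertion Event’ (IERR)
"""

# Hypothetical pattern, written for this specific alert text only.
ALERT_RE = re.compile(
    r"Sensor: ‘(?P<name>[^’]+)’ had .*?\((?P<level>[^;]+?) - going low ; "
    r"Sensor Reading = (?P<reading>[\d.]+) Volts ; "
    r"Threshold = (?P<threshold>[\d.]+) Volts\)"
)


def parse_voltage_alerts(text):
    """Return one dict per voltage alert, with reading/threshold as a fraction."""
    rows = []
    for line in text.splitlines():
        match = ALERT_RE.search(line)
        if match is None:
            continue  # lines such as the IERR event carry no voltage reading
        reading = float(match["reading"])
        threshold = float(match["threshold"])
        rows.append({
            "name": match["name"],
            "level": match["level"],
            "reading": reading,
            "threshold": threshold,
            "fraction": reading / threshold,
        })
    return rows


rows = parse_voltage_alerts(ALERTS)
for row in rows:
    print(f"{row['name']:9s} {row['level']:22s} {row['reading']:.2f} V "
          f"({row['fraction']:.0%} of the {row['threshold']:.2f} V threshold)")
```

If only one rail had sagged while the others stayed in range, I would look at a single regulator instead; here all three drop together.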

You tested the UPS by unplugging it from the wall, and that is exactly what a careful admin should do to ensure the system performs a controlled shutdown.

A few theories:

  1. The storm caused the power outage, and when the generator powered on, a sudden spike or dip in power caused the system to crash.
  2. The storm created a power spike which got through the UPS and caused the crash.
  3. Was mercury in retrograde? Ha Ha, just keeping it light.

It could have been many things, all power-related of course. I’d test your UPS again and verify that the battery still has enough capacity to let the system shut down properly.

Is your UPS overloaded? And are your batteries in decent shape?

Those are two things I’ve seen cause shutdowns even when the UPS otherwise appears to work and passes self-tests, runtime calibrations, etc.

Keep in mind also that an overloaded UPS of the style APC uses in their SmartUPS line (low-frequency inverter) has a longer transfer time, because it has to build a magnetic field in the transformer under an increased load. That takes longer, and possibly too long for the power supply to sustain the load. This can of course vary, because you don’t control at which point in the sine wave power is removed.
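A back-of-the-envelope way to check both points is sketched below in Python. Everything numeric in it is a placeholder I made up for illustration (the device wattages, the UPS watt rating, the transfer time and the PSU hold-up time); take the real figures from your UPS and power supply datasheets. The function names are hypothetical too.

```python
def load_fraction(device_watts, ups_rated_watts):
    """Return the total connected load as a fraction of the UPS watt rating."""
    if ups_rated_watts <= 0:
        raise ValueError("ups_rated_watts must be positive")
    return sum(device_watts) / ups_rated_watts


def rides_through(transfer_ms, holdup_ms):
    """True if the PSU hold-up time is longer than the UPS transfer time."""
    return holdup_ms > transfer_ms


# Placeholder values only, NOT taken from any datasheet.
devices = [350, 120, 60]   # watts drawn by each device on the UPS
ups_rating = 900           # UPS output rating in watts
transfer_ms = 10           # time the UPS needs to switch to battery
holdup_ms = 16             # time the PSU can bridge without input power

print(f"UPS load: {load_fraction(devices, ups_rating):.0%} of rating")
print("PSU should ride through the transfer:",
      rides_through(transfer_ms, holdup_ms))
```

Hold-up time generally shrinks as the PSU load goes up, so a margin that looks fine on paper can disappear on a heavily loaded supply.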

I don’t think this should ever happen, and as you confirmed with a clean shut-off, it does not: neither the UPS switching to battery or back, nor yanking server power, should result in a brownout on the internal power rails. I’m almost certain it’s a PSU issue.

Besides a dying power supply (usually dried-out tank capacitors that cannot provide enough power to survive a short power interruption), it may also be indicative of dying UPS batteries, where instead of sustaining power during the surge, the UPS somehow browned out itself (but the server PSU should have cut the power cleanly).

Either way, I would start with the server power supply.

False alarm! It’s a new server. I had tested the APC with the old one just before, but when I connected the new server I “temporarily” plugged it into a wall outlet. So all of this happened on grid power, not on the UPS.


I see that you found the root cause, however …

While testing actual outage conditions for correct behavior is important and should be lauded, if at all possible try to avoid unplugging the UPS as a way to trigger the test; unplugging it breaks your connection to ground (unless you have independently grounded the chassis, as is possible with some models). APC describes the issue in their FAQ, but it applies to UPSes in general.

Instead, try to perform such tests by interrupting the circuit without unplugging anything, for example either by tripping the breaker (if nothing else critical is on that circuit) or by temporarily putting a switched power bar between the UPS and the upstream socket, and then switching off power at the power bar. The power bar should be removed once your tests are complete.
