TrueNAS Scale suddenly can't read more than 9 SAS Disks

Hello everyone,
I’m new here. I’ve been using TrueNAS for some time now as a Plex media center and general storage, but on Monday something strange happened.
Here’s my config:
Case:
CPU: intel i7 6700k
RAM: 4x4 GB DDR4
Mobo: Asus ROG Z170
HDD: 12x HGST HUS72403CLAR3000 SAS
PSU: Corsair RM1000x SHIFT
HBA Cables: 3x Mini SAS SFF-8087 36 Pin TO 4xSFF-8482 29+15 Pin
HBA: 2x DELL PERC H200 LSI SAS2008 - flashed to IT mode

truenas_admin@truenas[~]$ sudo sas2flash -list -c 1
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 1
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:04:00:00
        SAS Address                    : 54dae52-0-ac07-d555
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : N/A
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : 6Gbps SAS HBA
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
truenas_admin@truenas[~]$ sudo sas2flash -list -c 0
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 0
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 5d4ae52-0-7642-2200
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : N/A
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : 6Gbps SAS HBA
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

I’ve been having some problems with my main pool degrading, always on one disk (recently replaced, so I thought it was the PSU), which I had to fix each time with zpool clear HDD, but on Monday the pool suddenly lost 3 of my 12 disks. I tried reconnecting everything and checked all the SAS cables. I suspected the PSU had lost a rail: it was old, and I had to use adapters to power the SATA connectors from a MOLEX cable.
I decided to upgrade the PSU and bought a Corsair with 16 SATA connectors.
Today I redid all the connections with the new PSU, and still only 9 out of 12 disks show up.
I thought the 3 disks that don’t get listed were the problem, but when I swap them to a connection that previously worked, they show up.
So I thought it was the cable, but I swapped it for another one that I know works, and they still don’t show up.
I also swapped the cable to a different HBA port, nothing changed.
It seems that TrueNAS has decided that 3 disks cannot be read or seen by any means, and I don’t know why.

Thank you for the detailed information.

Just to clarify and consolidate:

  1. You have 3 drives that refuse to be connected to TrueNAS.
  2. These 3 drives do not appear to be failed/bad; they work when connected to a different data connector where a drive does get recognized.
  3. You replaced the data cable that would be suspect, no change.
  4. You swapped HBA data ports, no change.
  5. You replaced the PSU, no change.

Here is where I need clarification:

  1. When you swapped the HBA data ports, did you swap it with a currently working HBA port? And if yes, did the drives connected to that port originally now fail when connected to the HBA port that is related to the failure? I’m not saying the HBA port is faulty, I’m just gathering details.
  2. Post the output of zpool status -v
  3. Post what version of TrueNAS you are running.
  4. You seem to be very good at troubleshooting, taking matters into your own hands to figure out what might be going on before asking for help. I love it. I wish more people were like that.
  5. Now think back, did you move the computer? Update any software? Sudden power loss to the entire system? Reboot or Shutdown?

I need you to think back several days before the incident, not just the day it occurred.

Here is a good link to read, might be similar.

You might also post some of the data that the linked thread asks people to provide.
@Protopia is apt to jump in and save the day.

I would have already jumped in (and possibly saved the day) if I had any good ideas to add - but unfortunately I don’t.

As Joe has already said, you have already tried all the swap-things-around options that there seem to be and they haven’t helped, and I haven’t got a single further action for you to try.

The only thing I would ask is exactly how do you know that the drives are not showing up (and do show up when connected to another port)? Are they missing from lsblk?

Also, I am not sure whether you can see any of the drives on your HBAs from the BIOS, but if you can, do these missing drives appear there?

Hello!
Thank you for your reply, I’ll try to answer everything:

  1. I have 2 HBAs with 2 ports each. I tried pretty much every combination of cable, drive, and HBA port, and I can’t seem to find a pattern. Sometimes after a reboot some drives show up and sometimes they are missing, depending on the combo. It’s going to need more trial and error; I’m thinking of making a spreadsheet to track it.
  2. The “HDD” pool is now exported because it failed and dropped all the disks once the 3 disks failed; I didn’t want to risk losing data.
  3. I was using 24.10.0 (Electric Eel) and tried updating today to 25.04.1 (Fangtooth); nothing changed.
  4. Thank you! I’m a sysadmin by profession, and the troubleshooter for all my friends and family.
  5. Not moved; I only updated Plex when it asked to be updated. No power loss, it’s on an APC UPS. I shut it down whenever I’m not using it, since my wife doesn’t want an additional heater in the house.
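
The spreadsheet idea from point 1 can also be sketched as a small script. This is only a rough illustration: the cable labels, port names, and drive labels below are invented placeholders, not values from this system. It logs one row per cable/port/drive combination tried and computes the failure rate per factor, so a factor that fails every time stands out.

```python
from collections import defaultdict

# One record per combination tried: which cable, which HBA port, which drive,
# and whether the drive showed up after boot.
# All labels here are made-up placeholders for illustration.
observations = [
    {"cable": "A", "hba_port": "0:0", "serial": "DRIVE1", "seen": True},
    {"cable": "A", "hba_port": "0:1", "serial": "DRIVE2", "seen": True},
    {"cable": "C", "hba_port": "0:1", "serial": "DRIVE1", "seen": False},
    {"cable": "C", "hba_port": "1:0", "serial": "DRIVE2", "seen": False},
    {"cable": "A", "hba_port": "1:0", "serial": "DRIVE3", "seen": True},
    {"cable": "C", "hba_port": "0:0", "serial": "DRIVE3", "seen": False},
]


def failure_rate_by(field, records):
    """Return {value: fraction of attempts where the drive was NOT seen}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [failures, attempts]
    for rec in records:
        counts[rec[field]][1] += 1
        if not rec["seen"]:
            counts[rec[field]][0] += 1
    return {value: fails / attempts for value, (fails, attempts) in counts.items()}


if __name__ == "__main__":
    for field in ("cable", "hba_port", "serial"):
        print(field, failure_rate_by(field, observations))
```

With this invented data, every port and every drive fails half the time, while one cable fails every time, which is the kind of pattern the spreadsheet is meant to expose.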

I can’t recall anything unusual in the days before the incident… I’ll ask my wife if she did anything.

I’ll read that topic asap!

As for the lsblk question: yes, they are missing completely. 12 drives connected, only 9 sdX devices show up. I know which ones show up by reading the serial numbers from the Storage → Disks list in the TrueNAS GUI.
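
The same check can be done from the shell instead of the GUI. Below is a minimal sketch, assuming `lsblk` is run with `-d -J -o NAME,SERIAL,MODEL` (whole disks only, JSON output) and that you keep your own list of the serial numbers you expect; the serials in the usage note are placeholders, not real ones.

```python
import json
import subprocess


def read_lsblk():
    """Run lsblk for whole disks only (-d) and return its JSON output (-J)."""
    result = subprocess.run(
        ["lsblk", "-d", "-J", "-o", "NAME,SERIAL,MODEL"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_serials(lsblk_json):
    """Map serial number -> kernel device name (sda, sdb, ...)."""
    devices = json.loads(lsblk_json)["blockdevices"]
    # Some devices report no serial (null in the JSON); skip those.
    return {dev["serial"]: dev["name"] for dev in devices if dev.get("serial")}


def missing_serials(expected, lsblk_json):
    """Return the expected serials that lsblk did not report, sorted."""
    return sorted(set(expected) - set(parse_serials(lsblk_json)))
```

Usage would look something like `missing_serials(["SERIAL_A", "SERIAL_B"], read_lsblk())`, where the list is the hypothetical set of serials written down while all 12 drives were still visible.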

The BIOS doesn’t show any drives, only the SSD I boot TrueNAS from. I’ll double-check once I’m back home.

Thank you both for your time :slight_smile:

@alexaldin It shows that you are a Sys Admin.

As for the troubleshooting, make sure to also track the drives by serial number. I’m sure you already know that Device IDs can and will change and are not tied to a specific drive. I would track the Device IDs as well, just in case something strange is going on there.

If you made a change 2 weeks ago, that could affect it, but if you are powering down every day/night, then I would expect the issue to show up right away.

If you haven’t done so already, you might try this:

  1. Boot from an Ubuntu Live CD. Can you see all the drives?
  2. Do a Clean Install of TrueNAS SCALE, 25.04.1 is fine, but do not restore the config file. Leave it untouched. Can you see all the drives now?

If you can see the drives under one of these conditions, then the hardware (HBA) “should” be good.

Something else: can you move your HBA to another slot? I didn’t look at your hardware to see what it is capable of, but if you can and haven’t already done so, it is worth a try.

Best of luck.

OK, so I did some more days of troubleshooting and I think I found out what happened.
As you suggested, I thought back a few days, and I remembered that, to try to fix the random HDD faults degrading my pool, I swapped the cable used for 4 of the HDDs with the one I was using for the SSDs (I didn’t mention them because the SSD pool always worked as intended without any issue), and a few days later the problem occurred.
While troubleshooting, I also tried connecting each SAS drive individually and waiting for it to appear on screen, and after connecting everything I noticed that I had gone from 9 disks down to 6. So, panic. But!
One other thing I didn’t mention is that my HDDs are two different types of SAS enterprise drives: the model shown in TrueNAS is HUS72403CLAR3000, but in reality they are HUS724030ALS640 (3 TB) and HUS726040ALS210 (4 TB).
Dell (these drives come from an EOL enterprise storage system) flashed both with the HUS72403CLAR3000 firmware so they would work in its storage, so the software reads them as if they were the same, but in fact they are not.
The HUS724030ALS640 draws 0.8A at 5V and 0.8A at 12V; the HUS726040ALS210 (the majority of the drives in my system) draws 0.9A at 5V and 0.8A at 12V.
So that 0.1A is REALLY important, because the next thing I remembered is that I have 4 cables: 2 from one brand and 2 from another.
The first 2 are the oldest I have, and they came like this:

The newest 2 I had to order with a MOLEX adapter, because the old PSU didn’t have another 8 SATA connectors:

In my attempt to resolve the HDD failures with the old PSU, I removed the MOLEX adapter and connected directly to SATA, and apparently these cables CAN’T supply 0.9A without MOLEX.
Also, apparently, by removing the adapter and connecting directly to the PSU’s SATA connectors, I burned them, because now they don’t work anymore even with the MOLEX adapter connected.
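
To put the per-drive figures quoted above into perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that all four drives on one 1-to-4 breakout cable are fed through a single power lead; how these particular cables are actually wired is not stated in the thread. The currents are the ones given in the post, and the script says nothing about what any cable or connector is rated for.

```python
# Per-drive currents as quoted in the post (amps on each rail).
DRIVES = {
    "HUS724030ALS640": {"amps_5v": 0.8, "amps_12v": 0.8},
    "HUS726040ALS210": {"amps_5v": 0.9, "amps_12v": 0.8},
}


def cable_load(models):
    """Sum the current on each rail and the total power for drives sharing a lead.

    ASSUMPTION: every drive in `models` draws its power through the same lead.
    Returns (amps on 5V, amps on 12V, total watts), rounded to 2 decimals.
    """
    amps_5v = sum(DRIVES[m]["amps_5v"] for m in models)
    amps_12v = sum(DRIVES[m]["amps_12v"] for m in models)
    watts = 5 * amps_5v + 12 * amps_12v
    return round(amps_5v, 2), round(amps_12v, 2), round(watts, 2)


if __name__ == "__main__":
    print(cable_load(["HUS726040ALS210"] * 4))  # four 4 TB drives on one lead
    print(cable_load(["HUS724030ALS640"] * 4))  # four 3 TB drives on one lead
```

Under that assumption, four of the 4 TB drives add up to 3.6A on the 5V rail versus 3.2A for four of the 3 TB drives, so the 0.1A per-drive difference becomes 0.4A on a shared lead.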

I ordered 2 more cables of the same brand as the 2 working ones I have; they arrive tomorrow!
Hope this was entertaining and useful for someone. I think I’m gonna go and burn everything now :melting_face:

Well that sounds like fun :clown_face:

I hope the two new cables solve the problem completely.

You would be surprised by how many people truly forget that they did something a few days earlier that caused an issue. I am guilty as well. Some problems only show up during some specific evolution which makes it even more difficult to isolate.

Be careful with those SATA power connectors; it is easy to put them on a little crooked and short out the power. Done it myself, scared the crap out of me. Now I only connect/disconnect power lines with the unit unplugged. Lesson learned.

The two new cables solved the problem, both pools are online with all disks!

Thank you everyone for everything :slight_smile:

For future reference (I don’t know if those two brands are only on Amazon in Italy):
YIWENTEC :+1:
CY :-1:

Glad to hear it is working again. Yes, some manufacturers produce not-so-good products. I just ordered a set of USB-C to USB-A (female) Gen 3.2 (10Gbps) adapters. I did what I could to research them, found no bad reviews, and placed the order. I wasn’t so concerned about the 10Gbps; my worry was whether I would have to flip the connector because only half the wiring was inside. I have one like that now, and it sucks to not have full speed.