Is my TrueNAS Scale system limited to 6 SATA ports?

Hi everyone,

I am very new to TrueNAS, and tech is only a hobby. … I am a noob, I know. :innocent: Anyway, for my homelab I built myself a small system running TrueNAS Scale Dragonfish-24.04.1.1. I used my old hardware as follows:

ASRock B250M Pro with i5-7400
DDR4 2x 8GB Kingston Fury
PSU be quiet! Pure Power 11 CM 400W

Crucial MX500 > Boot
Crucial MX300 > Apps
2x WD60EFZX 6TB (pool1)
2x Toshiba MG09ACA 18TB (pool2)

Everything was running smoothly. Then I decided I would like to add a couple of additional hard drives, but my mainboard is limited to 6 SATA ports.

I got a Fujitsu PSAS D3327-A12 from a trustworthy shop on eBay and installed it on my mainboard. I then attached two old WD Blue 2TB disks. … And this is when the trouble started.

The two disks showed up. No issues creating a pool. But my Toshiba pool showed “degraded” after a reboot. One of my Toshiba drives, connected directly to the mainboard, wasn’t recognized by the system anymore. Is it possible that during boot the additional drives “kicked out” one of the Toshiba drives due to loading order? I was wondering whether my system is only able to load 6 SATA drives. (I forgot to cable one of my WD Red drives at that time - luckily I was able to resilver later.)

Over the following days I kept busy troubleshooting. I will keep it short: I suspected the PSU, the mainboard, cables, bad drives… nothing turned up that made sense. Then I read that the mainboard chipset can only run 6 SATA ports.

So I was wondering: Is it possible that adding the controller messed up the SATA ports on my mainboard or the Toshiba drives? Am I chipset-wise limited to 6 SATA ports? Is there any BIOS configuration I should consider before attaching the HBA again?

I probably should also update the controller’s firmware… I found a flashing guide. I thought it might work plug and play since it isn’t a RAID controller.

TrueNAS Scale itself is running fine, by the way. Only the Toshiba drives were causing trouble. I was able to back up the data, though. In the end neither of them showed up in the BIOS; before that, sometimes one did, sometimes none. I formatted one of them, and now both show up in the BIOS again and even seem to run stable. After doing a “replace” I am running a “resilver” now.

Hope someone is reading this. Looking forward to possible explanations of what might have caused the trouble. …It doesn’t make sense to me; maybe it does to you?

Also, my goal is to attach more disks to the system through the HBA. Is this a bad idea? :slightly_smiling_face:

I’m 95% sure that it IS a RAID controller & that you should look into flashing it to IT mode with the latest firmware.

I’m fairly certain that it uses the LSI SAS3008 chipset & that the following command will provide some good insight:

sas3flash -list

The motherboard itself only being able to handle 6 SATA ports makes sense, since that is the total number of physical plugs available on it & likely the max bandwidth dedicated to them on its chipset; this is normal & fine & why HBAs are used. You picked the right solution, but might have jumped the gun on thinking it is plug & play.
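
If you want to double-check which controller each disk is actually hanging off (chipset SATA vs. the HBA), something like this Python sketch could help. It assumes a Linux sysfs layout where the resolved /sys/block/<disk> path contains the controller's PCI address; the example paths, addresses and function names are made up for illustration.

```python
import re

# PCI address in domain:bus:device.function form, e.g. 0000:00:17.0
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_of(sysfs_path):
    """Return the last PCI address found in a resolved /sys/block path, or None."""
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def group_by_controller(paths):
    """Map each controller's PCI address to the disks attached to it."""
    groups = {}
    for path in paths:
        disk = path.rsplit("/", 1)[-1]
        groups.setdefault(controller_of(path), []).append(disk)
    return groups


# Hypothetical resolved paths. On a live system they could be collected with
# os.path.realpath("/sys/block/" + name) for each name in os.listdir("/sys/block").
paths = [
    "/sys/devices/pci0000:00/0000:00:17.0/ata1/host0/target0:0:0/0:0:0:0/block/sda",
    "/sys/devices/pci0000:00/0000:00:17.0/ata3/host2/target2:0:0/2:0:0:0/block/sdb",
    "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host6/port-6:0/"
    "end_device-6:0/target6:0:0/6:0:0:0/block/sdc",
]
print(group_by_controller(paths))
```

Disks that share an address sit on the same controller, so a drive that drops out can be tied to either the onboard ports or the HBA.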

I’m more concerned that the Toshibas are connected directly to the motherboard & not being detected; I would recommend standard troubleshooting: checking power & data connections, trying different ports on the motherboard, confirming in the manual whether putting x into PCIe slot y disables SATA port z (see page 4: “* If M2_1 is occupied by a SATA-type M.2 device, SATA3_0 will be disabled”), checking if the drives work on another system if available, etc.

Generally it is also a good idea to burn in drives prior to deployment… I’ve had amazing luck & managed at times to get 3 HDDs DOA (or dead soon after arrival) from different stores, vendors, and models.

When/if you do get your Toshiba drives to show up, post the output of:

smartctl -a /dev/whateveriscurrentlylabeledfortoshiba

LSI 3008 in this Fujitsu part is a genuine HBA, so any mention of RAID must, at worst, be IR mode, which is not that bad for ZFS. It should still be flashed to the firmware version which TrueNAS drivers expect, i.e. P16.00.12.00.


Hey, thank you! First I want to figure out whether my Toshiba drives are OK. After that I will flash the HBA and carry on with my TrueNAS system.

So the resilvering finished… and suddenly the newly formatted drive went offline again. Funny … after rebooting my TrueNAS system, the 2nd Toshiba was suddenly offline too. This on and off drives me crazy. :crazy_face:

Here is the SMART info of the one which went offline first. On my other system the drive shows up, by the way. I will test it.

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MG09ACA18TE
Serial Number:    XXX
LU WWN Device Id: XXX
Firmware Version: 4306
User Capacity:    18.000.207.937.536 bytes [18,0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jun 16 14:29:24 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (1491) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       8310
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       316
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
 23 Unknown_Attribute       0x0023   100   100   075    Pre-fail  Always       -       0
 24 Unknown_Attribute       0x0023   100   100   075    Pre-fail  Always       -       0
 27 Unknown_Attribute       0x0023   100   100   030    Pre-fail  Always       -       264542
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       36
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       31 (Min/Max 21/45)
196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       1703957
222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       222
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       613
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       35001925840
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       26011874404

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       316         -
# 2  Short offline       Completed without error       00%       315         -
# 3  Short offline       Completed without error       00%       314         -
# 4  Short offline       Completed without error       00%       312         -
# 5  Short offline       Completed without error       00%       118         -
# 6  Extended offline    Completed without error       00%        22         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I changed SATA ports, power supply, cable strands… nothing was consistent enough to prove any hardware faulty. I only remember that the issues started after adding the HBA. Of course this also involved a reboot of the system, after the Toshiba drives had been running for a few days the first time.

I found this thread: Hard Drive Burn-in Testing. I will do an extended SMART test now and then proceed with the guide. Is it still up to date?

Nothing has really changed. The general idea is still to make the drive work until one is reasonably confident that it is not DOA.
Nothing suspicious in SMART output either. The mystery remains.

Check page 4 of the manual. It details how the M.2 slot and the SATA ports are shared. I think that may be your issue.

Now I use the 6 SATA3 ports only.

Crucial MX500 > sata3_0
Crucial MX300 > sata3_1
WD Red > sata3_4
WD Red > sata3_5

The Toshibas are/were on sata3_2 and sata3_3. I can’t find any information in the manual saying that this causes any issues related to the HBA connected to PCIE1. The Crucial drives work fine. Maybe you could explain? Maybe there is something I don’t understand.

Manual only says things like:

* If M2_1 is occupied by a SATA-type M.2 device, SATA3_0 will
be disabled.

But sata3_0 always worked. And now that the HBA is disconnected, sata3_2 and _3 should work, right?

Sometimes the drives get initialized on boot (BIOS), sometimes they don’t. And now the connection was lost even though the drive had shown up in TrueNAS Scale… :crazy_face:

Check that the chipset SATA controller is in AHCI mode.
Check the cables, and possibly swap cables/ports and see whether errors follow a particular drive, a particular cable or a particular port.
But I’m as puzzled as you are.
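
A minimal sketch of that swap logic in Python, assuming you note each combination you try and whether the drive stayed online. The trial data, labels and function name below are made up for illustration and are not taken from this thread:

```python
def factors_following_failures(trials):
    """Return (factor, value) pairs present in every failed trial and in no passing one."""
    failed = [t for t in trials if not t["ok"]]
    passed = [t for t in trials if t["ok"]]
    suspects = []
    for factor in ("drive", "cable", "port"):
        failing_values = {t[factor] for t in failed}
        passing_values = {t[factor] for t in passed}
        # Suspicious: all failures share one value that never shows up in a passing trial.
        if len(failing_values) == 1 and not (failing_values & passing_values):
            suspects.append((factor, next(iter(failing_values))))
    return suspects


# Hypothetical observations from swapping cables and ports between two drives
trials = [
    {"drive": "toshiba-A", "cable": "c1", "port": "sata3_2", "ok": False},
    {"drive": "toshiba-A", "cable": "c2", "port": "sata3_4", "ok": False},
    {"drive": "wd-red-1", "cable": "c1", "port": "sata3_2", "ok": True},
    {"drive": "wd-red-1", "cable": "c2", "port": "sata3_4", "ok": True},
]
print(factors_following_failures(trials))
```

With these made-up trials the failures follow the drive, not the cable or the port. An empty result would mean the swaps so far do not single out one component.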


I’ll be honest, considering I don’t see an M.2 drive listed anywhere in your specs - I think this might be a red herring & can be ignored.

I’d follow what etorix has suggested & would add (if possible) try to see if disks are stable on a different system. Who knows, maybe the HBA is 100% unrelated & it is a wiring/motherboard/HDD/PSU issue. The easiest troubleshooting step is replacing the SATA wires, I’d explore that first.


Doing my research I found this: How to Fix the 3.3V Pin Issue. Could this be related? I found at least one post on Reddit involving a Toshiba drive. Though it is weird that the drives show up sometimes and sometimes not. :exploding_head:

Also waiting for the extended smart tests to complete. Will come back later. Thx everyone. :smiley:

At worst, if you block the 3.3V pin then the drive won’t (shouldn’t) spin up at all. I don’t imagine there would be any harm in using some electrical tape to try it out.

If these are shucked drives then this could indeed be of use. My experience with shucked drives has been awful & I’ve never been lucky enough to get stability regardless of whether 3.3V was taped off or not (with my amazing sample size of two), so I just grab drives either used or on sale.

…OK, here it is. The first drive shows a failure… let’s hope the RMA goes smoothly.

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MG09ACA18TE
Serial Number:    XXX
LU WWN Device Id: XXX
Firmware Version: 4306
User Capacity:    18.000.207.937.536 bytes [18,0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jun 17 18:34:27 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (1491) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       8310
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       344
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
 23 Unknown_Attribute       0x0023   100   100   075    Pre-fail  Always       -       0
 24 Unknown_Attribute       0x0023   100   100   075    Pre-fail  Always       -       0
 27 Unknown_Attribute       0x0023   100   100   030    Pre-fail  Always       -       264542
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       37
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       34 (Min/Max 21/47)
196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       1703957
222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       244
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       622
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       35001925840
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       26011875029

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%       338         40393384
# 2  Short offline       Completed without error       00%       316         -
# 3  Short offline       Completed without error       00%       316         -
# 4  Short offline       Completed without error       00%       315         -
# 5  Short offline       Completed without error       00%       314         -
# 6  Short offline       Completed without error       00%       312         -
# 7  Short offline       Completed without error       00%       118         -
# 8  Extended offline    Completed without error       00%        22         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I will do an extended SMART test on the second drive. It will probably be the same. So … two DOA. :crazy_face:
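
For anyone comparing reports like this, here is a rough Python sketch that picks the non-zero sector counters and the failing LBA out of `smartctl -a` text. The watched attribute IDs (5, 197, 198) and the function names are my own choices for illustration, and the sample is trimmed from the report above.

```python
import re

# Attribute IDs watched here: reallocated, pending and offline-uncorrectable sectors
WATCHED_IDS = {5, 197, 198}

ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"  # ID, name, flag
    r"\s+\d+\s+\d+\s+\d+"                     # value, worst, thresh
    r"\s+\S+\s+\S+\s+\S+"                     # type, updated, when_failed
    r"\s+(\d+)"                               # leading integer of the raw value
)


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with a raw value > 0."""
    found = {}
    for line in smart_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(1)) in WATCHED_IDS and int(m.group(3)) > 0:
            found[m.group(2)] = int(m.group(3))
    return found


def failed_lbas(smart_text):
    """Return the LBA_of_first_error for each self-test log row with a read failure."""
    lbas = []
    for line in smart_text.splitlines():
        if line.startswith("#") and "read failure" in line:
            lbas.append(int(line.split()[-1]))
    return lbas


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
# 1  Extended offline    Completed: read failure       10%       338         40393384
# 2  Short offline       Completed without error       00%       316         -
"""

print(nonzero_watched(SAMPLE))
for lba in failed_lbas(SAMPLE):
    # The report lists 512-byte logical sectors, so byte offset = LBA * 512
    print(lba, lba * 512)
```

With 512-byte logical sectors, LBA 40393384 sits at byte offset 20681412608 on the disk.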


At least, you’ve found the issue.
