Reused HDD Crashes Whole System Even After Sanitizing etc

You could try booting a Linux Live CD and see whether the drives are recognized properly.

@PhilD13 Thanks for those suggestions. I keep forgetting about the sector size being a possible issue.

I don’t understand, what password are you talking about? I do not enter a password. Please be very specific.

When I installed TrueNAS I was asked for a Username and Password to be able to Login to the TrueNAS UI.

To run these various commands I’ve been Logging-in to TrueNAS then selecting System>Shell, which gives me a prompt. (I think that’s what’s called a CLI. Am I wrong?)

When I type in a command and press Enter TrueNAS asks me for my Password. I have to type in the password used to Login to TrueNAS and press Enter again before the command will run. Sometimes I have to type sudo before the rest of the command and sometimes I don’t. But I’m always asked for the Password.

Sounds like you are logging in as truenas-admin.

Try to use the word sudo in front of your commands. Or you could enter su alone, then you should be root. Then try the commands again normally.

I honestly don’t know if truenas-admin alone can run all of the privileged commands.

If I don’t enter sudo nothing happens. If I enter su alone I get asked for my Password. When I enter that I get ‘Authentication Failure’. And if I enter sudo followed by the commands things appear to run, but the TrueNAS still won’t accept any of the suspect disks to make up 4 including the 3 new disks.

I usually use sudo -s which gives me root after entering the password. You could try that. Use exit to exit the root session back to the admin session when done with the commands. If you are doing a series of commands that require sudo it’s a way to not have to remember to type it all the time in front of the command.

The user [root] has to have Allowed Sudo Commands: ALL when the user Credentials are viewed. If it does not, then you need to edit the user and check Allow all sudo commands. The user also has to have a shell selected (Select the shell to use for local and SSH logins); I usually set it to bash.

I also find an SSH session from my laptop works better and is easier to use overall than the shell in TrueNAS, and it can be done from a Windows command line. admin@truenas.local is an example of how to ssh into the server.

If a command requires elevated permissions and it does not have them it will return nothing as it does not run. That is one way to know to try sudo in front of the command.
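A quick way to see which case you are in is to check the numeric user ID, since `id -u` prints 0 for root. This is only a minimal sketch; needs_sudo is a made-up helper name, not a TrueNAS command.

```shell
# needs_sudo UID: succeeds (exit status 0) when the given numeric user ID
# is NOT root, i.e. when privileged commands will need sudo.
# (needs_sudo is a hypothetical helper name used for illustration.)
needs_sudo() {
    [ "$1" -ne 0 ]
}

# Check the account this shell is currently running as.
if needs_sudo "$(id -u)"; then
    echo "not root: prefix privileged commands with sudo, or use sudo -s"
else
    echo "running as root"
fi
```

Run it once in the TrueNAS shell and once after sudo -s to see the difference.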

I can confirm (now that I am home) that you are not using a privileged account when you log in. You now know sudo itself works, and can confirm this by entering sudo blkdiscard and if you do not get command not found, then you are good to run a privileged instruction.

Be very descriptive with what you did and what happened. We can’t help if we are guessing what you did. All the instructions above were to wipe one disk, I’m not sure if you had any error messages, or whatever.

The first thing I’d like to see is the complete output of the command smartctl -x /dev/sda (if sda is the correct drive). Click on the </> icon above, then Cut and Paste the results. This will retain the text format, which is very helpful to someone having to read it.

I want this data so I can see if something is odd with this drive, before you try to use dd.
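If the full report is too much to read at once, a small filter can pull out a few attributes that are often checked first on a reused disk. This is a sketch only: smart_summary is a made-up name, the choice of attributes (IDs 5, 197, 198, 199) is my assumption about what matters most, and it expects the attribute table layout that smartctl -x prints, with the raw value in the eighth column.

```shell
# smart_summary: read `smartctl -x` output on stdin and print NAME=RAW_VALUE
# for attribute IDs 5, 197, 198 and 199.
# (smart_summary is a hypothetical helper, not part of smartmontools.)
smart_summary() {
    # Attribute rows have at least 8 fields and a name in field 2; this
    # skips other lines that also start with "5", such as the selective
    # self-test spans or the temperature history.
    awk '$1 ~ /^(5|197|198|199)$/ && $2 ~ /^[A-Za-z]/ && NF >= 8 { print $2 "=" $8 }'
}
```

Usage would be something like sudo smartctl -x /dev/sda | smart_summary. Non-zero pending or uncorrectable sectors would be a reason to look harder at the drive before wiping it.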

There are many ways to wipe a drive, however using dd is probably the best way.

  1. Ensure the only drives in the system are your boot-pool and the drive you want to wipe.
  2. Enter tmux new -s wipe and a new CLI window will appear. This is needed because the shell in the GUI will time out and your session will be gone.
  3. Identify the 10TB drive using lsblk (sda, sdb, sdc)
  4. Unmount the drive umount /dev/sda* (CHANGE sda* to the 10TB drive, if it is sdb then your command would be umount /dev/sdb*)
  5. Enter sudo dd if=/dev/zero of=/dev/sd? bs=1M status=progress (Replace the ? with the proper drive ID)
  6. Wait.
  7. If the GUI CLI session goes away, you close your web browser, whatever, you can go back to it by entering tmux attach -t wipe and your SSH CLI window will reappear.
  8. Once the dd has completed, reboot.
  9. Can you see the drive in the GUI?
  10. Power Down, add your other three new drives, power on.
  11. Did this fix the problem?
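Since a typo in the drive ID would wipe the wrong disk, the command-line part of the steps above can be rehearsed first as a dry run. The sketch below only prints the commands for a given disk and runs nothing; wipe_plan is a made-up helper name, and the rule that the name must look like sda or sdb is my assumption about how the disks are named on this system.

```shell
# wipe_plan DISK: print (do not run) the commands from steps 2-5 for one disk.
# (wipe_plan is a hypothetical helper name used for illustration.)
wipe_plan() {
    disk="$1"
    # Accept only whole-disk names such as sda or sdab; reject partitions
    # (sda1) and full paths (/dev/sda) to avoid ambiguity.
    case "$disk" in
        sd[a-z]|sd[a-z][a-z]) ;;
        *)
            echo "refusing: '$disk' is not a whole-disk name like sda" >&2
            return 1
            ;;
    esac
    echo "tmux new -s wipe"
    echo "lsblk"
    echo "umount /dev/${disk}*"
    echo "sudo dd if=/dev/zero of=/dev/${disk} bs=1M status=progress"
}
```

For example, wipe_plan sdb prints the four commands with sdb filled in, ready to check against the lsblk output before typing them. When dd reaches the end of the disk it normally stops with a "No space left on device" message, which is expected for a full-disk wipe.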

You can try booting a Live CD as I mentioned before. This will remove TrueNAS from the equation. You could also use a different boot drive and install Debian Bookworm, then boot it up, add those drives and see what the system does.

Hope I’ve got this right. Result below.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.33-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST10000VN0004-1ZD101
Serial Number:    ZA28TP6A
LU WWN Device Id: 5 000c50 0b2ee84b8
Firmware Version: SC60
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar 23 16:38:53 2026 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 845) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   064   064   044    -    2696092
  3 Spin_Up_Time            PO----   094   086   000    -    0
  4 Start_Stop_Count        -O--CK   093   093   020    -    7908
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    3
  7 Seek_Error_Rate         POSR--   089   060   045    -    718972403
  9 Power_On_Hours          -O--CK   031   031   000    -    60735 (217 217 0)
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   097   097   020    -    3113
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    1
189 High_Fly_Writes         -O-RCK   084   084   000    -    16
190 Airflow_Temperature_Cel -O---K   073   038   040    Past 27 (0 24 27 27 0)
191 G-Sense_Error_Rate      -O--CK   099   099   000    -    2357
192 Power-Off_Retract_Count -O--CK   099   099   000    -    2991
193 Load_Cycle_Count        -O--CK   078   078   000    -    44250
194 Temperature_Celsius     -O---K   027   062   000    -    27 (0 13 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   005   001   000    -    2696092
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
200 Pressure_Limit          PO---K   100   100   001    -    0
240 Head_Flying_Hours       ------   100   253   000    -    2966h+12m+37.770s
241 Total_LBAs_Written      ------   100   253   000    -    18051866146
242 Total_LBAs_Read         ------   100   253   000    -    48579000311
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    768  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      24  Device vendor specific log
0xa2       GPL     VS   16320  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xad       GPL     VS      16  Device vendor specific log
0xb0       GPL     VS   17208  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS      16  Device vendor specific log
0xc3       GPL,SL  VS       8  Device vendor specific log
0xc9       GPL,SL  VS       8  Device vendor specific log
0xca       GPL,SL  VS      16  Device vendor specific log
0xd1       GPL     VS     304  Device vendor specific log
0xd2       GPL     VS   10000  Device vendor specific log
0xd4       GPL     VS    2048  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     60077         -
# 2  Short offline       Completed without error       00%     60063         -
# 3  Short offline       Completed without error       00%     60061         -
# 4  Extended offline    Interrupted (host reset)      00%     58451         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    27 Celsius
Power Cycle Min/Max Temperature:     24/28 Celsius
Lifetime    Min/Max Temperature:     13/52 Celsius
Under/Over Temperature Limit Count:   0/3

SCT Temperature History Version:     2
Temperature Sampling Period:         3 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:      0/ 0 Celsius
Min/Max Temperature Limit:            0/ 0 Celsius
Temperature History Size (Index):    128 (107)

Index    Estimated Time   Temperature Celsius
 108    2026-03-18 11:37     ?  -
 109    2026-03-18 12:36    22  ***
 110    2026-03-18 13:35     ?  -
 111    2026-03-18 14:34    22  ***
 112    2026-03-18 15:33     ?  -
 113    2026-03-18 16:32    22  ***
 114    2026-03-18 17:31     ?  -
 115    2026-03-18 18:30    22  ***
 116    2026-03-18 19:29     ?  -
 117    2026-03-18 20:28    22  ***
 118    2026-03-18 21:27     ?  -
 119    2026-03-18 22:26    22  ***
 120    2026-03-18 23:25     ?  -
 121    2026-03-19 00:24    22  ***
 122    2026-03-19 01:23     ?  -
 123    2026-03-19 02:22    23  ****
 124    2026-03-19 03:21     ?  -
 125    2026-03-19 04:20    23  ****
 126    2026-03-19 05:19     ?  -
 127    2026-03-19 06:18    23  ****
   0    2026-03-19 07:17     ?  -
   1    2026-03-19 08:16    20  *
   2    2026-03-19 09:15     ?  -
   3    2026-03-19 10:14    20  *
   4    2026-03-19 11:13     ?  -
   5    2026-03-19 12:12    22  ***
   6    2026-03-19 13:11     ?  -
   7    2026-03-19 14:10    24  *****
   8    2026-03-19 15:09     ?  -
   9    2026-03-19 16:08    24  *****
  10    2026-03-19 17:07     ?  -
  11    2026-03-19 18:06    25  ******
  12    2026-03-19 19:05     ?  -
  13    2026-03-19 20:04    25  ******
  14    2026-03-19 21:03     ?  -
  15    2026-03-19 22:02    25  ******
  16    2026-03-19 23:01     ?  -
  17    2026-03-20 00:00    25  ******
  18    2026-03-20 00:59     ?  -
  19    2026-03-20 01:58    26  *******
  20    2026-03-20 02:57     ?  -
  21    2026-03-20 03:56    26  *******
  22    2026-03-20 04:55     ?  -
  23    2026-03-20 05:54    26  *******
  24    2026-03-20 06:53     ?  -
  25    2026-03-20 07:52    28  *********
  26    2026-03-20 08:51     ?  -
  27    2026-03-20 09:50    28  *********
  28    2026-03-20 10:49     ?  -
  29    2026-03-20 11:48    29  **********
  30    2026-03-20 12:47     ?  -
  31    2026-03-20 13:46    29  **********
  32    2026-03-20 14:45     ?  -
  33    2026-03-20 15:44    29  **********
  34    2026-03-20 16:43     ?  -
  35    2026-03-20 17:42    27  ********
  36    2026-03-20 18:41     ?  -
  37    2026-03-20 19:40    27  ********
  38    2026-03-20 20:39     ?  -
  39    2026-03-20 21:38    28  *********
  40    2026-03-20 22:37     ?  -
  41    2026-03-20 23:36    28  *********
  42    2026-03-21 00:35     ?  -
  43    2026-03-21 01:34    28  *********
  44    2026-03-21 02:33     ?  -
  45    2026-03-21 03:32    28  *********
  46    2026-03-21 04:31     ?  -
  47    2026-03-21 05:30    28  *********
  48    2026-03-21 06:29     ?  -
  49    2026-03-21 07:28    29  **********
  50    2026-03-21 08:27     ?  -
  51    2026-03-21 09:26    29  **********
  52    2026-03-21 10:25     ?  -
  53    2026-03-21 11:24    29  **********
  54    2026-03-21 12:23     ?  -
  55    2026-03-21 13:22    29  **********
  56    2026-03-21 14:21     ?  -
  57    2026-03-21 15:20    30  ***********
  58    2026-03-21 16:19     ?  -
  59    2026-03-21 17:18    30  ***********
  60    2026-03-21 18:17     ?  -
  61    2026-03-21 19:16    30  ***********
  62    2026-03-21 20:15     ?  -
  63    2026-03-21 21:14    21  **
  64    2026-03-21 22:13     ?  -
  65    2026-03-21 23:12    27  ********
  66    2026-03-22 00:11     ?  -
  67    2026-03-22 01:10    28  *********
  68    2026-03-22 02:09     ?  -
  69    2026-03-22 03:08    28  *********
  70    2026-03-22 04:07     ?  -
  71    2026-03-22 05:06    28  *********
  72    2026-03-22 06:05     ?  -
  73    2026-03-22 07:04    28  *********
  74    2026-03-22 08:03     ?  -
  75    2026-03-22 09:02    28  *********
  76    2026-03-22 10:01     ?  -
  77    2026-03-22 11:00    28  *********
  78    2026-03-22 11:59     ?  -
  79    2026-03-22 12:58    28  *********
  80    2026-03-22 13:57     ?  -
  81    2026-03-22 14:56    29  **********
  82    2026-03-22 15:55     ?  -
  83    2026-03-22 16:54    29  **********
  84    2026-03-22 17:53     ?  -
  85    2026-03-22 18:52    29  **********
  86    2026-03-22 19:51     ?  -
  87    2026-03-22 20:50    29  **********
  88    2026-03-22 21:49     ?  -
  89    2026-03-22 22:48    29  **********
  90    2026-03-22 23:47     ?  -
  91    2026-03-23 00:46    29  **********
  92    2026-03-23 01:45     ?  -
  93    2026-03-23 02:44    30  ***********
  94    2026-03-23 03:43     ?  -
  95    2026-03-23 04:42    31  ************
  96    2026-03-23 05:41     ?  -
  97    2026-03-23 06:40    27  ********
  98    2026-03-23 07:39     ?  -
  99    2026-03-23 08:38    27  ********
 100    2026-03-23 09:37     ?  -
 101    2026-03-23 10:36    28  *********
 102    2026-03-23 11:35     ?  -
 103    2026-03-23 12:34    28  *********
 104    2026-03-23 13:33     ?  -
 105    2026-03-23 14:32    24  *****
 106    2026-03-23 15:31    27  ********
 107    2026-03-23 16:30    27  ********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4            3113  ---  Lifetime Power-On Resets
0x01  0x010  4           60735  ---  Power-on Hours
0x01  0x018  6     18051538548  ---  Logical Sectors Written
0x01  0x020  6        33492971  ---  Number of Write Commands
0x01  0x028  6     48286821301  ---  Logical Sectors Read
0x01  0x030  6      1730483685  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4       466902427  N--  Spindle Motor Power-on Hours
0x03  0x010  4       466897682  N--  Head Flying Hours
0x03  0x018  4           44250  ---  Head Load Events
0x03  0x020  4               3  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               1  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4            2991  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              27  ---  Current Temperature
0x05  0x010  1              27  ---  Average Short Term Temperature
0x05  0x018  1              32  ---  Average Long Term Temperature
0x05  0x020  1              62  ---  Highest Temperature
0x05  0x028  1               0  ---  Lowest Temperature
0x05  0x030  1              56  ---  Highest Average Short Term Temperature
0x05  0x038  1              19  ---  Lowest Average Short Term Temperature
0x05  0x040  1              44  ---  Highest Average Long Term Temperature
0x05  0x048  1              24  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            3642  ---  Number of Hardware Resets
0x06  0x010  4             160  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]

Hi Again,

Did the smartctl log reveal anything? Should I proceed to dd now?

Thanks.

Sorry I didn’t see this yesterday. For some reason I don’t always get notified when someone responds to a thread I posted in.

This is the specific thing I was looking for, and this is a good sign. The rest of your drive SMART data looks very good, extremely good for a drive with this many hours. It did get a bit warm at least once, but that would not cause anything like what you are seeing.

If you have not already done so, I would perform the procedure I laid out above to use tmux and dd to wipe the drive completely. In step 4, the drive may not be mounted, so you might see an error message there stating it was not mounted.

If a complete dd wipe does not resolve the error, I’d think (not 100% positive) that your drive has something wrong with it, but for the life of me, I wouldn’t know what that could be.

OUT OF THE BOX THINKING: Maybe try this before dd; it is definitely faster.
Did you say that if you added this one single drive, that the drive stayed recognized? If yes, then create a single stripe pool. See if that does create a new pool. Then add in your other drives one at a time, meaning just power off, install one drive, power on, and check whether the stripe pool is still there. Can you get all the drives installed and the stripe remains? Let’s assume you get all the drives installed, now remove the stripe pool and ensure you select WIPE. What happens now? Are all the drives still there?

You could also do this a different way, just create the new stripe pool, make sure you can create a dataset, then remove the pool and select WIPE. Then see if things work properly. Doing this “should” remove any doubt that the drive does not have any ZFS pool markers.

Hi, I went straight to your Out of the Box Suggestion in the hope of saving time. Here’s what happened:-

First there was no problem creating a single disk Pool using just the rogue disk. I named it test_pool and it was displayed as expected like any healthy Pool. Then I inserted the 3 brand new disks and waited, and waited, and waited, listening to the sound of HDDs initialising over and over again. The 3 new HDDs never appeared in Storage as Unallocated Disks and never got listed in Storage>Disks either. But the test_pool disk disappeared from Storage>Disks and test_pool is now showing the Disk healthy (Green Tick) but the Storage Health has a red cross and the text box says “Undefined, 3 errors”

Then I removed the 3 new disks and, lo and behold, the single rogue disk is now listed again in Storage>Disks, but its health hasn’t been restored.

So it would appear the rogue disk will happily run on its own, but if you add any further disks they all fail.

I haven’t yet tried running a scrub on the single rogue disk pool to see if its health comes back and nor have I tried adding a dataset to it and then deleting the pool to see what happens. Do you still think it’s worth trying those things or should we just give up and dd?

On a slightly more positive note, I was able to buy a 4th new HDD this morning, which I should hopefully have by the weekend. So I’ll be able to recreate the old pool that was corrupted - but without using any of the original 4 disks. That should at least get the TrueNAS up and running again.

But that still leaves me with 4 problem children to try to sort out, so I don’t want to give up trying to find out what on earth the trouble is.

That is very strange. You were able to create a stripe with the suspect drive and it worked, then you insert the 3 new drives and it went away.

Did you insert the 3 new drives one at a time? Just to find out if any single drive installed causes this issue? Or do all 3 drives need to be installed to cause this issue?

You never said if you deleted and wiped the stripe yet. This is a key part of the Out of the Box thinking.

This “should” ensure the drive is no longer a ZFS member. And you could also do this with the other three drives as well, if they all are recognized at once, create a 3 drive Raidz1, then delete and wipe. After that you should have four drives that were once a pool and wiped.

If you still have the issue after all of that, I absolutely would run the dd command on the older drive. You said it was from a crashed pool that was working. The dd command will clean that out completely, no doubt left in your mind at all.

This is so frustrating…

After yesterday, when I successfully created a single disk stripe using one of the old disks, adding the 3 new ones crashed things. They briefly appeared, one at a time, as Unallocated Disks and then all 3 disappeared from Storage>Disks. The old disk was still listed in Storage>Disks, but showing 3 unspecified errors in Storage.

I tried to destroy the single disk stripe and TrueNAS hung with the action at 60%. The only way out of that was to restart TrueNAS, and I had to do that by pressing and holding the power button on the case to force it to shut down. It would not respond to either Restart or Shut Down from within TrueNAS.

After ejecting the 3 new disks, leaving just the single old disk in the TrueNAS, it restarted, saw the disk, let me create a Pool, add a Dataset, populate it with a few files, and those files would stream perfectly to my PC. Then I deleted the dataset and wiped the old disk.

I restarted the TrueNAS, inserted the wiped, old disk, which it saw, then added the 3 new disks one at a time. As each one was added it would show, briefly, as an Unallocated Disk and then disappear.

If I eject the old disk the 3 new ones show up as 3 Unallocated Disks and are listed in Storage>Disks. But as soon as I try to add the old disk all 4 of them disappear. None are listed in Storage>Disks.

Does the dd command do anything different to Diskpart>Clean (with the disk in a USB housing connected to my Windows PC)? Or to Seagate SeaTools Erase Sanitize, Full with zeroes?

I’ve nothing to lose by running dd, but it’ll take a good few hours with a 10 TB disk, I expect.

How long have you already spent on this problem? The only thing this action would prove is the data on the physical drive either was the issue or was not the issue. You are in some deep troubleshooting right now.

But, you have a very non-typical problem. Do you still have your other pool installed and operating? If yes, I would simply power down and remove them. You could Export the pool if you desire. Then install those four drives into the same locations, see how the system acts. Again, you want to have ONLY the four drives installed in the system plus your boot-pool of course.

Why this course of action? Maybe you have a hardware issue. One other thing you could do: grab a USB Flash Drive and install TrueNAS on that, then use it as your test boot-pool, and disconnect your current boot-pool drive.

For all intents and purposes, and unless selling or returning the drive, using the dd command already posted that erases the first 10MB of the drive will clear enough to get rid of any and all partitions and superblocks. If dd is failing to run and you are root (sudo) when attempting to run it, then there is likely a hardware failure.
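One caveat to the quick wipe: as far as I know, both GPT and ZFS keep backup copies of their metadata near the end of the device, so it may be worth zeroing the tail as well as the head. The sketch below only prints the two dd commands; the helper names are made up, the 10 MiB figure simply mirrors the amount mentioned above, and the disk size would come from something like sudo blockdev --getsize64 /dev/sdX.

```shell
# tail_seek_mib SIZE_BYTES MIB: print the dd seek offset (in 1 MiB blocks)
# where the last MIB mebibytes of a device of SIZE_BYTES begin.
# (Both helpers here are hypothetical names used for illustration.)
tail_seek_mib() {
    echo $(( $1 / 1048576 - $2 ))
}

# edge_wipe_plan DEVICE SIZE_BYTES: print (do not run) commands that zero
# the first 10 MiB and everything from 10 MiB before the end onwards.
edge_wipe_plan() {
    dev="$1"
    seek=$(tail_seek_mib "$2" 10)
    echo "sudo dd if=/dev/zero of=$dev bs=1M count=10"
    # No count here: dd writes until the end of the device and then stops
    # with "No space left on device", which covers any partial last block.
    echo "sudo dd if=/dev/zero of=$dev bs=1M seek=$seek"
}
```

For the 10 TB IronWolf in this thread (10,000,831,348,736 bytes), the tail offset works out to 9537526 blocks of 1 MiB. This takes seconds instead of the hours a full pass needs, at the cost of leaving the middle of the disk untouched.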

I did post other more in depth commands in a different post with the same or similar issue the other day. It won’t hurt to try these. HGST SAS drives showing 0B - Oracle OEM firmware locked? H7210A520SUN010T and HUH721010AL5201 - #3 by joeschmuck

Putting the old drives back in their original positions, with nothing else but the boot drive connected gave an unexpected result. In Storage 2 Unallocated Disks and 2 Disks with Exported Pools were listed. In Storage>Disks I was allowed to Wipe the 2 disks with exported pools. Then Storage showed 4 Unallocated Disks. But when I tried to Create a Pool using them it failed, again. I always got the same error message, Failed (19) No Such Device. That happens irrespective of whether I Wipe using Quick Wipe or Full, with Zeroes.

If I restart the TrueNAS the same 2 drives are again shown as Unallocated Disks and the 2 Disks with Exported Pool return - even after wiping them and having them show as Unallocated Disks.

Moving the 4 original disks to the positions previously occupied by my good, working Pool gives exactly the same results. And putting my good, working Pool disks into the places previously occupied by the ‘problem’ disks leaves me with my good pool in perfect working order.

So it doesn’t matter where I put the original disks. I always get a failure to Create a New Pool using them, no matter how or how many times I wipe or format them.

As a last-ditch I’m now in the process of Sanitizing the 2 disks that keep being reported as having Exported Pools using SeaTools, Erase, Sanitize etc. That’ll take another couple of days I expect.

I can only think the problem lies within TrueNAS itself.

I understand that RAIDZ1 Pools are supposedly able to be moved from TrueNAS to TrueNAS. So the only thing I can think of now is to delete TrueNAS from the SSD it’s installed upon, re-format the SSD, download and do a complete new installation of TrueNAS, then see if it will accept the Sanitized Disks and allow me to Create a Pool using them. If that works I’ll see if it will then recognize my good, working Pool too.

I’m at my wits end with this.

Here is another out of the box idea…

  1. Save your TrueNAS configuration.
  2. Export your good pool.
  3. Power down and remove the good pool drives.
  4. Install all four of the problematic drives.
  5. Install TrueNAS CORE to a boot drive; this could be a USB flash drive.
  6. Once you are configured enough to get to the GUI, try to create your pool.

Why might this work? This version has been around for a while. I’d actually use version 13.0 and not the most current version.

If you are able to create a pool without any issues, then you know the previous version of the software was the issue.

Now shut down, put in your original boot-pool drive and start it up. See if you can import the new pool. If you can, do not update the ZFS pool, it is not needed. It shouldn’t hurt, but the ZFS version you have would allow you to roll back to an earlier SCALE version if you needed to.

If it all seems functional, then power down, install your original pool, power up, and import the pool.

Good luck. And that is about all I can think of.

Hi Again, Finally it’s fixed!!!

I tried a fresh install of TrueNAS Scale and then TrueNAS Core, both to no avail. Then spent a couple of days swapping disks in and out. Eventually I realised that I could have 7 disks connected without problem, but each time I added the eighth disk the system crashed. That was irrespective of which disk was the eighth. Then I realised it didn’t matter which drive bay I put the eighth disk into. It was the number of disks that was causing the crashes, not their content or location.
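For anyone repeating this kind of test, counting the disks the kernel can see after each one is added makes the "works with 7, fails with 8" pattern easier to spot than watching the GUI. A minimal sketch, assuming lsblk is available (it is on Linux-based TrueNAS); count_disks is a made-up helper name.

```shell
# count_disks: read `lsblk -dn -o NAME,TYPE` output on stdin and print how
# many whole disks are listed (rows whose TYPE column is "disk").
# (count_disks is a hypothetical helper name used for illustration.)
count_disks() {
    awk '$2 == "disk" { n++ } END { print n + 0 }'
}
```

Usage would be lsblk -dn -o NAME,TYPE | count_disks, run after each disk is inserted. Note that the count includes the boot drive.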

The problem was a Molex to 4 x SATA power supply adapter cable. The Power supply itself is good and the same adapter cable had worked perfectly for at least 6 months before the problems arose.

Even after the problem arose the adapter cable has been passing enough power to spin up all the connected disks, but seemingly not enough to allow proper data transfer to and from them. Hence all the I/O error reports and crashes.
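Marginal power like this tends to show up in the kernel log as link resets and I/O errors rather than as anything in SMART. The filter below is a rough sketch for pulling such lines out of dmesg output. The three patterns are ones I have commonly seen in Linux kernel logs, but the exact wording varies between kernel versions, so treat them as assumptions; disk_log_filter is a made-up name.

```shell
# disk_log_filter: read kernel log text on stdin and keep only lines that
# look like disk link resets or I/O errors.
# (disk_log_filter is a hypothetical helper; the patterns are assumptions
# about typical kernel message wording and may need adjusting.)
disk_log_filter() {
    grep -iE 'I/O error|hard resetting link|exception Emask'
}
```

Usage would be sudo dmesg | disk_log_filter. If the same messages appear for several disks that share one power adapter, the cable or adapter is a reasonable suspect.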

I tried a meter across all the conductors and connectors and that revealed no loss of continuity or intermittent connection. But when I replaced the adapter cable normality was restored and all my disks are now working properly again. I can only assume I’ve had some sort of cross-conductor leakage or a high resistance spot somewhere.

Anyway I just want to thank everyone who’s tried to help resolve this elusive problem. All your help has really been appreciated.

Head on nail.
Bravo!

Lucky. I’ve seen issues like that before and we normally tend to think of cables as working forever. Only when you’ve seen many different ones fail do you start to suspect them as well.
