I had a mild panic this morning when my TrueNAS Scale server issued critical SMART alerts for four disks (SAS SSDs), all on the same VDEV, within minutes of each other. The logging in TrueNAS wasn’t very descriptive, simply stating that the “Self-Test Log error count increased from 0 to 1” for each device.
These four disks were all given a SMART Long Test today as part of the regular testing scheduled by @joeschmuck’s Multi-Report script.
Further investigation showed TrueNAS reporting “Power On Hours Ago” values of 65536 for each of the flagged disks. This immediately made me suspicious: is this some sort of rollover bug?
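As a quick sanity check on the rollover theory (this assumes the self-test log’s power-on-hours counter is a 16-bit field that wraps at 65536, which I haven’t confirmed against the spec), the numbers in the smartctl output further down do seem to line up:

POH=78382                  # accumulated power-on hours reported in the smartctl -x output below
echo $(( POH % 65536 ))    # 12846 - within an hour of the 12845 "LifeTime (hours)" logged for the latest self-test
echo $(( POH - 12845 ))    # 65537 - almost exactly one 16-bit wrap, which would line up with the 65536 "hours ago"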
If I issue a “smartctl -t long” for any of these four devices it fails immediately with a status of “Failed in segment --> 8”. I’ve issued the same command for other devices in the system and they’re still running with no problem.
The disks are all repurposed/rebadged 1.6TB Samsung SAS SSDs (model MZ-IWS1T9B) from a NetApp SAN that have been running without problems in this server for more than 2 years.
I think I’ll make the informed decision to ignore the SMART errors until I actually start seeing read/write errors appear on them. The data’s all backed up nightly, so it’s not the end of the world to restore if the pool dies.
Multi-Report always shows the Long Tests as in progress on my system when they’re scheduled, as there isn’t a long enough delay before the report is generated; if a test completes successfully, or takes a long time to fail, you only find out on the next run. So, knowing that I had four “failed” tests this morning, I re-ran Multi-Report with the “-dev” switch to avoid starting new tests. Multi-Report shows all four disks as having passed the last test, but the time/hours counters don’t match.
# 1 Background long Failed in segment --> 8 12845 - [- - -]
Here’s the full output of smartctl -x for that specific device:
‘smartctl -x /dev/sdw’ output
root@eurybia[/mnt/Apps/scripts]# smartctl -x /dev/sdw
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X439_S16331T6AMD
Revision: NA04
Compliance: SPC-4
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5002538a758029f0
Serial number: S20JNWAG800671
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Mon Jan 5 13:19:15 2026 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Percentage used endurance indicator: 0%
Current Drive Temperature: 28 C
Drive Trip Temperature: 60 C
Manufactured in week 31 of year 2015
Accumulated start-stop cycles: 273
Specified load-unload count over device lifetime: 0
Accumulated load-unload cycles: 0
Elements in grown defect list: 0
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 239726.212 0
write: 0 0 0 0 0 192266.352 0
verify: 0 0 0 0 0 498749.808 0
Non-medium error count: 10
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Failed in segment --> 8 12845 - [- - -]
# 2 Background long Failed in segment --> 8 12842 - [- - -]
# 3 Background long Failed in segment --> 8 12841 - [- - -]
# 4 Background short Completed - 12839 - [- - -]
# 5 Background short Completed - 12815 - [- - -]
# 6 Background short Completed - 12791 - [- - -]
# 7 Background short Completed - 12767 - [- - -]
# 8 Background short Completed - 12743 - [- - -]
# 9 Background short Completed - 12719 - [- - -]
#10 Background short Completed - 12695 - [- - -]
#11 Background long Completed - 12673 - [- - -]
#12 Background short Completed - 12671 - [- - -]
#13 Background short Completed - 12647 - [- - -]
#14 Background short Completed - 12623 - [- - -]
#15 Background short Completed - 12599 - [- - -]
#16 Background short Completed - 12575 - [- - -]
#17 Background short Completed - 12551 - [- - -]
#18 Background short Completed - 12527 - [- - -]
#19 Background long Completed - 12505 - [- - -]
#20 Background short Completed - 12503 - [- - -]
Long (extended) Self-test duration: 3600 seconds [60.0 minutes]
Background scan results log
Status: waiting until BMS interval timer expires
Accumulated power on time, hours:minutes 78382:00 [4702920 minutes]
Number of background scans performed: 96, scan progress: 84.01%
Number of background medium scans performed: 96
Device does not support General statistics and performance logging
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 5
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: power on
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5002538a758029f1
attached SAS address = 0x500143803227a7bf
attached phy identifier = 6
Invalid DWORD count = 14796
Running disparity error count = 14698
Loss of DWORD synchronization count = 1
Phy reset problem count = 1
Phy event descriptors:
Received ERROR count: 7300
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 552505
Received SSP frame error count: 0
relative target port id = 2
generation code = 5
number of phys = 1
phy identifier = 1
attached device type: expander device
attached reason: power on
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5002538a758029f2
attached SAS address = 0x500143803227a7bd
attached phy identifier = 6
Invalid DWORD count = 2484
Running disparity error count = 2336
Loss of DWORD synchronization count = 1
Phy reset problem count = 1
Phy event descriptors:
Received ERROR count: 1514
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 0
Received SSP frame error count: 0
N.B. I managed to confuse myself a little by rebooting the server earlier, which re-enumerated the disks, but I’ve checked through and confirmed I’m working with the correct disks.
All four disks (sdk, sdn, sdw and sdx, as they are now) fail within about 30 seconds when asked to run a long test.
You can change that timeout for the Long tests if you desire.
If you edit the multi_report_config.txt file, near the bottom look for Short_Drives_Test_Delay=130 and change it to the longest amount of time your drives need to complete a Long test. For example, if the recommended polling time is 825 minutes, I’d add another 10 minutes for possible pool activity, so it becomes (825 + 10) * 60 = 50100, i.e. Short_Drives_Test_Delay=50100 seconds (see the sketch after this explanation).
The Short Drive Delay is at the end of the script just to allow the drives running a Short test time to complete.
The side effect: The script will continue to run in the background for that extended amount of time before returning control to the main script and generating the report. If you run the CRONJOB at 2AM, it will be almost 14 hours before the report is generated.
So, this is an option you can use if you desire.
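If you want a rough sketch of that edit, assuming you run it from the directory containing multi_report_config.txt (adjust the path otherwise):

POLL_MINUTES=825       # recommended long-test polling time for your slowest drive
MARGIN_MINUTES=10      # headroom for possible pool activity
DELAY=$(( (POLL_MINUTES + MARGIN_MINUTES) * 60 ))    # 50100 seconds
sed -i "s/^Short_Drives_Test_Delay=.*/Short_Drives_Test_Delay=${DELAY}/" multi_report_config.txt
grep ^Short_Drives_Test_Delay multi_report_config.txt    # confirm the new value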
In the next version of drive_selftest (the portion of the script that runs these tests), I plan to have the option to monitor for when each drive is no longer running the SMART test, then pass control back to generate the report. I used to have that feature, but it was just for myself at the time, and I then determined I didn’t want it anymore. The complication is recognizing all the different drive reporting possibilities, which can be a real pain. I’m over that now.
If all the drives are in the same vdev and all drives actually have the same “Failed in segment --> 8”, then what is the chance of all the drives failing on the exact same block? Maybe you actually just have one bad drive?
Make sure your backups actually work. I recently had a recovery issue on a database (not TN related) where the backup was not working even though everyone thought it was.
I believe “segment 8” refers to the portion of the Long Test that has failed, rather than a particular block on the device that has failed.
I’ve searched high and low for documentation explaining what the “segments” are in SMART testing and never managed to find anything, so it’s a rather opaque error message on the part of the drive firmware/SMART system.
I did a search on the specific smartctl message before I posted yesterday, and this is the general take I got on the error. From what I gathered, the “segment” is an SSD block or sector; because of the architecture differences between spinning rust (pie-shaped platters) and memory cell layout (a grid of cells), it’s called a segment, but it means much the same thing overall.
A smartctl error of “Failed in segment xx” indicates a hardware issue, likely a bad sector or an otherwise failing or failed area on the drive, which suggests a possible drive failure soon. Since an SSD has no moving parts, I take that to mean it can no longer read a section of memory cells. It could be for a number of reasons, but that area can no longer be read. If there are no spare sectors left because all the replacements have been used up, the drive is worn out and dies.
Now, I did find a lot more discussion about this error on the TrueNAS forums, including the old forums, possibly (IMO) indicating that this is more likely to occur on a CoW (copy-on-write) system. The general recommendation seemed to be to replace the SSD, so maybe CoW is just hard on storage drives in general.
Industrial robots are famous for this: using an SSD without running smartctl to provide any pre-warning, until the SSD just dies, usually in the middle of production.
While it is nice to try to understand exactly what the problem is with the drive, the SMART test failed to complete and pass; that is all anyone “needs” to know. But like you, I like to understand more than the basics.
Questions:
Are all 4 drives sharing something such as the same backplane or the same power connector? I ask because four drives failing the SMART test so close together is kind of crazy; there must be something in common.
Next question, how do you know that the drives failed within minutes of each other?
Do you know how many drives were being tested, and whether anything like a SCRUB was occurring at the same time? I’m looking at a weak power supply as a possible cause for this scenario.
Can you send me a dump using -dump email so I can examine the data and test run it against Multi-Report to see what is happening?
I’m not doubting you, I am trying to put this into writing so I can digest it and try to understand what is going on.
1 - no, two of the disks (sdk and sdn) are in the server’s internal drive cage, sdw and sdx are in an external disk shelf
2 - those four disks were all commanded to run long tests by Multi-Report that morning, and TrueNAS emailed me as each disk alerted “Self-Test Log error count increased from 0 to 1”. The email time stamps and the entries in the TrueNAS event log were all pretty much at the same time
3 - those were the only disks being tested that day. That pool was last scrubbed on the 14th of December, so it wasn’t scrubbing then. The server’s an old HP DL380 in good condition, and its PSUs typically run at around 25% capacity. Likewise, the external disk shelf is an HP DP2700 with dual PSUs in good condition. This kit was designed for spinning disks with a much higher power draw than these SSDs.
4 - sure thing, I’ll DM you that once it’s complete. Edit: ah, no need, I see it automatically emails you.
Based on @PhilD13’s post above, I’m now a little more worried that these disks are all actually failing. I could start replacing and resilvering, but I’m worried that the added stress of resilvering four times will provoke an outright failure.
Thinking about this last night, I know why Multi-Report always tells me the last test passed successfully, even with the failed long tests.
I have short tests of all disks scheduled to run every morning, before Multi-Report runs
The staggered long tests don’t complete before Multi-Report runs
Therefore, with these disks passing short tests but failing long tests, whenever Multi-Report runs there will always have been a passed short test. The long tests fail after Multi-Report runs, then there is a short test the following day which passes, just before Multi-Report runs again.
Well, I’m glad this topic came up now, though not glad that you have drives which are failing.
It has been a long time since I wrote the script for checking the SMART test results. I need to somehow run a check on the last Long test and verify whether it passed or failed, then report that result before running another test (short or long) so the user knows the test had failed. I may do that now, but I don’t think so.
I would recommend that you run multi-report once a day to test all your drives. Do not use anything else. The default values would run a short test every day unless a long test is scheduled to happen.
I think this would work (I have not tested it in years):
Next, you can run multi-report using the `-m` switch to “monitor” the results of the last tests; if there is a problem you get a new report, and if there is no problem, no report. You could set up a crontab entry for this to run once an hour, just make sure it does not run at the same time as your daily multi-report run. I’d run it at least 30 minutes later given the number of drives you have (see the example schedule below).
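As a hedged example of that schedule, the two crontab entries could look something like this (the script path is taken from the shell prompts earlier in this thread; adjust it to wherever your copy of multi_report.sh actually lives):

# Daily run at 02:00 - kicks off the scheduled SMART tests and emails the report
0 2 * * *    /mnt/Apps/scripts/multi_report.sh
# Monitor-only run every hour at half past - runs no SMART tests, reports only if it sees a problem
30 * * * *   /mnt/Apps/scripts/multi_report.sh -m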
DELETE your drive_selftest_tracking.csv file. It “should” have been replaced with a new version. A new file will be generated and will track the date/time each drive was ‘commanded’ to run a test. In the next version I will verify the drive is really In Test. This is normally not an issue, however with drives that are on USB or some strange interface that smartmontools does not recognize automatically, it is best to verify. I still don’t know why this file is replaced on some systems and others it is not. With the next version, we will start with a blank slate and migrate only the minimum data we need to keep.
You will note that you have SMART data for your NETAPP drives under the NON-SMART section. This is because the script did not see the drive as SMART capable, but tried anyway.
TROUBLESHOOTING:
This is my recommended troubleshooting, unless you have already done this:
Have you powered down the system and powered back on, then run a Long test to see if it passes? If not, do this first. Power everything off.
Run smartctl -t long /dev/sdk (S/N: S20JNWAG800456) to start a Long test. We expect it to fail. Also run a long test on a known passing drive, just to verify it still passes.
Periodically check whether the test is still running, has failed, or has maybe even passed, using smartctl -a /dev/sdk (a small polling sketch follows after these steps).
Assuming the drive still fails: once the “good” drive has completed its testing and you’ve verified it passed, power down and swap the failing drive with the known passing drive.
Rerun the long tests on both drives.
What are the results?
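For the “periodically check” step, a minimal polling sketch along these lines would do (it assumes the device stays enumerated as /dev/sdk, and the 5-minute interval is arbitrary):

while sleep 300; do
    date
    # the most recent self-test entries are at the top of the log
    smartctl -l selftest /dev/sdk | head -n 8
done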
I find it odd that 4 drives exhibit the exact same failure at the same time. I do not think you have 4 drives that died at the exact same time, and they do not even have the same number of POH (power-on hours).
Thanks for continuing to put time and effort into this, @joeschmuck; it’s really appreciated.
I’ve deleted the cron job that runs a short test every day and will let multi-report handle all the test scheduling (I’m not sure why I did it this way in the first place if Multi-Report will automatically run a short test when no long test is scheduled; maybe I misunderstood the documentation).
I shut down the server and powered it up again. NB. I have no remote control of power to the disk shelf, so only the server chassis was powered down. Some of the disks have changed drive letter, so we now have sdm and sdn in the server chassis, and sdv and sdab in the disk shelf.
I’ve run multi_report.sh -m and that’s sat monitoring now.
I’ve deleted drive_selftest_tracking.csv
I’ve issued commands to run long tests on all four problem devices.
sdm and sdn (the internal devices that had a proper power cycle) are still running their long tests after about 15 mins. sdx and sdab, the drives in the external shelf which have not been power cycled, failed within less than 60s. I think this is the issue: I’m confident that if I power cycle the disk shelf when I’m back in the datacentre on Monday, it will solve the problem and they’ll pass a long test again.
Re. changes to the system: nothing significant. It was running in an air conditioned datacentre over the Christmas break with no interruptions. It was running short and long tests as per my usual schedule throughout that time. January the 5th, when the errors occurred, was my first day back in the office. There have been no hardware changes for months, the OS was last updated in October 2025. The only thing I can think of is that shortly before the disk alerts were raised the pool that they’re part of went over 80% usage so I deleted a load of files. But this server has a busy storage subsystem all the time.
I’ll report back on Monday when I’ve been able to power cycle the disk shelf.
sdm and sdn (the devices in the server chassis, S20JNWAG800456 and S20JNWAG800828) eventually failed their long tests with the same result as before, “Failed in segment 8”; they just took a lot longer to fail this time.
It’s also now flagging sds, sdaf and sdah as failing long tests (which it triggered to run when I ran multi-report with ‘-m’ switch earlier).
NB. these devices are all on the external disk shelf. This is still looking like the commonality, here. The external shelf is also on a different HBA to the internal disks. More to investigate next week when I’m back on site.
It’s also telling me that Pool1 is in a degraded state as sdaj (which is still running a long test) has a read error!
Time passes…
It’s now telling me the pool is online (it appears to have resilvered, but very quickly considering it’s a 33-disk, 38TB pool; it only resilvered 172K of data, not the whole pool, which is why it was so fast). But I still have an exclamation mark against “Topology” in the UI (“Pool is not healthy”), and an X against ZFS Health. Yet the front page dashboard says everything is online with no errors. The UI is giving me conflicting information.
The command line seems to think everything’s okay (bar one read error):
root@eurybia[/mnt/Apps/scripts]# zpool list Pool1
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
Pool1 48T 38.0T 10.0T - - 4% 79% 1.00x ONLINE /mnt
root@eurybia[/mnt/Apps/scripts]# zpool status -v Pool1
pool: Pool1
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: resilvered 172K in 00:00:00 with 0 errors on Fri Jan 9 16:14:50 2026
config:
NAME STATE READ WRITE CKSUM
Pool1 ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ca05d5aa-54a6-40db-b4a3-4cc5dbc54077 ONLINE 0 0 0
6b398a80-7701-489a-89c7-32ef46778a63 ONLINE 0 0 0
1d77d204-4e2e-4a96-93c7-6d5eb890a22d ONLINE 0 0 0
1f63f681-444e-4d1e-b1c0-3cd5868118b9 ONLINE 0 0 0
a077f95b-e2d7-49df-b4f5-9b8213a68bd9 ONLINE 0 0 0
f8f2fc2c-7030-463c-80f9-aec818154b21 ONLINE 0 0 0
5d735252-cf4b-4804-b7a6-441f9503070a ONLINE 0 0 0
c86a5bfa-16c9-4384-9147-91e8f28a0c12 ONLINE 1 0 0
0fcac123-7533-4777-ac20-3c27f8fc242c ONLINE 0 0 0
d97f4c42-0663-454b-81bc-79310c433431 ONLINE 0 0 0
3b0c4262-5606-4156-b41c-03d504a7fe09 ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
ec7d3874-3c13-43fe-ae6a-0e56ace5cdcb ONLINE 0 0 0
ec91f07e-a21a-43cd-8283-adc2bcc04c0a ONLINE 0 0 0
afc7b971-e3e2-4f99-b40f-a059648d61c9 ONLINE 0 0 0
660934ad-af7e-4f43-911f-7181a5a3ad58 ONLINE 0 0 0
0b2bff21-acdb-446a-8036-c5cd1c5b5d2a ONLINE 0 0 0
91312505-40cb-45d6-9068-d73aea270852 ONLINE 0 0 0
067d7be6-d336-42de-8b29-b8a129c882fd ONLINE 0 0 0
94d0d685-0ffa-4828-a6c2-2b725044db59 ONLINE 0 0 0
fec62dd4-f536-4ade-97d1-11f334684e6a ONLINE 0 0 0
633fd2d6-2e43-4098-8b3d-8c2297a9d656 ONLINE 0 0 0
295074ff-0941-409a-84fe-2607a800e24e ONLINE 0 0 0
raidz2-2 ONLINE 0 0 0
da683169-268b-42a3-89f1-4250dbdc910f ONLINE 0 0 0
2df44b77-2c11-41cc-802c-6e21f524f82e ONLINE 0 0 0
8273cbcd-97c7-4452-8c39-74eb032b7874 ONLINE 0 0 0
eb678a85-09c4-4e6f-a106-9cae17781890 ONLINE 0 0 0
65d90bdf-f7a2-4701-9d99-e1012fc0a9b6 ONLINE 0 0 0
5db54936-ac13-4343-8c0a-df702858f42c ONLINE 0 0 0
a3ee9eab-f49b-484d-a78e-b46119f8caf1 ONLINE 0 0 0
0c5a32c6-5e5c-4345-9516-9d02f2b0c791 ONLINE 0 0 0
94c36df2-afd3-4aa6-96f4-01b8c1e890ce ONLINE 0 0 0
d4429931-8911-4784-947a-887c28f1d23e ONLINE 0 0 0
046bd705-222a-4201-9368-52dd00390cfb ONLINE 0 0 0
errors: No known data errors
The cascading failures are puzzling and can also point to hardware failure issues.
It’s unfortunate, but I don’t find the GUI to be all that accurate in a lot of the places where it really needs to be. I have never found it particularly accurate on drive issues especially. The view of a specific widget or page may show everything is good when it is not.
Where I usually go in the GUI that is more accurate for drive issues is
Storage > Pool > Manage Devices. This gets you a view of the vdevs that make up each pool and will likely show a vdev there with errors. Whether it does or not, open the drop-down for each vdev and it will list the drives by sd?. Click on each drive and it will open a side panel showing some drive info. You get the idea. This is the most likely spot to find drive issues and get them fixed.
How clean is the server room? There may be dust buildup internally in the disk shelves or server. I would also, if possible, power down and reseat the drives, cables, and cards that connect the disk shelves to the server. It’s also a good time to check that all fans are operational; maybe one is dead and not reporting as dead. If you have redundant power supplies in the server, make sure both are working. If the server is backed by a UPS, make sure it is not overloaded.
As a suggestion, use only one method of setting or triggering tests such as SMART tests. I personally use multi_report because it makes things easy for me and my methods of tracking and working, and it provides loads of drive info and statistics. But whatever method you use, use it exclusively; otherwise it is just confusing, and you can wind up with drives always performing, or seemingly stuck in, long tests, which is bad overall.
I really dislike writing documentation. I’m actually pretty sure it could be done much better.
I just want to make sure I was clear:
Run multi_report.sh normally, no switches, once a day. This will run the SMART tests.
You can run multi_report.sh -m (with this switch) as often as you desire (maybe once an hour or every 2 hours? If you are tracking heat-related issues, then once every 15 minutes is reasonable) and it should ONLY send you a report if it sees a problem. The -m will not run a SMART test; it is only a monitor.
The normal multi_report.sh run cannot be executed at the same time as the -m version; the script checks to ensure there are not two instances running and, if there are, it will abort. I ran into this scenario early on when some people were running the test too often. Crazy days.
Brother, you have some problems to figure out. I really hope it is resolved with a simple power down and power up sequence. But the HBA or the enclosure, as you said, are common. Those make more sense.
If there is a rollover error at 65k hrs of operating time, some of my drives will be past that sometime this summer. I’m posting here so I’ll know how to find and reanimate this thread.