Suspicious Disk Self-Test Log Errors

I had a mild panic this morning when my TrueNAS Scale server issued critical SMART alerts for four disks (SAS SSDs), all on the same VDEV, within minutes of each other. The logging in TrueNAS wasn’t very descriptive, saying simply that the “Self-Test Log error count increased from 0 to 1” for each device.

These four disks were all given a SMART Long Test today as part of the regular testing scheduled by @joeschmuck’s Multi-Report script.

Further investigation showed TrueNAS reporting “Power On Hours Ago” values of 65536 for each of the flagged disks. This immediately made me suspicious: is this some sort of rollover bug?

If I issue a “smartctl -t long” for any of these four devices it fails immediately with a status of “Failed in segment --> 8”. I’ve issued the same command for other devices in the system and they’re still running with no problems.

The disks are all repurposed/rebadged 1.6TB Samsung SAS SSDs (model MZ-IWS1T9B) from a NetApp SAN that have been running without problems in this server for more than 2 years.

Sounds very much like it.

2^16 = 65536

I presume these drives are fairly old and probably not designed to last that long?

65536 / 24 / 365 = 7.48 years

Yeah, manufacture date on them is 2015 :grinning_face:

I think I’ll make the informed decision to ignore the SMART errors until I actually start seeing read/write errors appear on them. The data’s all backed up nightly, so it’s not the end of the world to restore if the pool dies.

1 Like

@WiteWulf

I see two issues:

  1. The 2^16 does happen for some drives, in various locations. I suspect you would have several tests using the same POH value in the SMART Test Log.
  2. The “Failed in segment --> 8” sounds like a real test failure.

Did Multi-Report flag the SMART Test failure? Did it also make a comment in the text section about the Test POH value?

If you can either send me a dump or post the output of smartctl -x /dev/sdX that would help.

The plot thickens…

Multi-Report always shows the Long Tests as still in progress on my system when they’re scheduled, as there isn’t a long enough delay configured, so if a test completes successfully, or takes a long time to fail, you only find out on the next run. So, knowing that I had four “failed” tests this morning, I re-ran Multi-Report with the “-dev” switch to avoid starting new tests. Multi-Report shows all four disks as having passed the last test, but the time/hours counters don’t match.

E.g. Multi-Report output:

/dev/sdw	S20JNWAG800671	NETAPP X439_S16331T6AMD (SCSI)	1.60T	Enabled PASSED	28*C	---	---	78381	100	---	0	---	0	---	0	Background long (78381 hrs)

Vs. smartctl output:

# 1  Background long   Failed in segment -->       8   12845                 - [-   -    -]
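Incidentally, the mismatch between those two hours figures is exactly what you’d expect if, as suspected above, the hours field in the self-test log wraps at 2^16 (that field width is an assumption on my part, but the arithmetic lines up):

echo $((78381 - 65536))   # prints 12845, matching the LifeTime (hours) value in the self-test log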

Here’s the full output of smartctl -x for that specific device:

‘smartctl -x /dev/sdw’ output
root@eurybia[/mnt/Apps/scripts]# smartctl -x /dev/sdw
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               NETAPP
Product:              X439_S16331T6AMD
Revision:             NA04
Compliance:           SPC-4
User Capacity:        1,600,321,314,816 bytes [1.60 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5002538a758029f0
Serial number:        S20JNWAG800671
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Mon Jan  5 13:19:15 2026 GMT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C

Manufactured in week 31 of year 2015
Accumulated start-stop cycles:  273
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     239726.212           0
write:         0        0         0         0          0     192266.352           0
verify:        0        0         0         0          0     498749.808           0

Non-medium error count:       10

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       8   12845                 - [-   -    -]
# 2  Background long   Failed in segment -->       8   12842                 - [-   -    -]
# 3  Background long   Failed in segment -->       8   12841                 - [-   -    -]
# 4  Background short  Completed                   -   12839                 - [-   -    -]
# 5  Background short  Completed                   -   12815                 - [-   -    -]
# 6  Background short  Completed                   -   12791                 - [-   -    -]
# 7  Background short  Completed                   -   12767                 - [-   -    -]
# 8  Background short  Completed                   -   12743                 - [-   -    -]
# 9  Background short  Completed                   -   12719                 - [-   -    -]
#10  Background short  Completed                   -   12695                 - [-   -    -]
#11  Background long   Completed                   -   12673                 - [-   -    -]
#12  Background short  Completed                   -   12671                 - [-   -    -]
#13  Background short  Completed                   -   12647                 - [-   -    -]
#14  Background short  Completed                   -   12623                 - [-   -    -]
#15  Background short  Completed                   -   12599                 - [-   -    -]
#16  Background short  Completed                   -   12575                 - [-   -    -]
#17  Background short  Completed                   -   12551                 - [-   -    -]
#18  Background short  Completed                   -   12527                 - [-   -    -]
#19  Background long   Completed                   -   12505                 - [-   -    -]
#20  Background short  Completed                   -   12503                 - [-   -    -]

Long (extended) Self-test duration: 3600 seconds [60.0 minutes]

Background scan results log
Status: waiting until BMS interval timer expires
Accumulated power on time, hours:minutes 78382:00 [4702920 minutes]
Number of background scans performed: 96,  scan progress: 84.01%
Number of background medium scans performed: 96
Device does not support General statistics and performance logging

Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 5
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: power on
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5002538a758029f1
attached SAS address = 0x500143803227a7bf
attached phy identifier = 6
Invalid DWORD count = 14796
Running disparity error count = 14698
Loss of DWORD synchronization count = 1
Phy reset problem count = 1
Phy event descriptors:
Received ERROR count: 7300
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 552505
Received SSP frame error count: 0
relative target port id = 2
generation code = 5
number of phys = 1
phy identifier = 1
attached device type: expander device
attached reason: power on
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5002538a758029f2
attached SAS address = 0x500143803227a7bd
attached phy identifier = 6
Invalid DWORD count = 2484
Running disparity error count = 2336
Loss of DWORD synchronization count = 1
Phy reset problem count = 1
Phy event descriptors:
Received ERROR count: 1514
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 0
Received SSP frame error count: 0

N.b. I managed to confuse myself a little by rebooting the server earlier, which re-enumerated the disks. But I’ve checked through and confirmed I’m working with the correct disks.

All four disks (sdk, sdn, sdw and sdx, as they are now) fail within about 30 seconds when asked to run a long test.

You can change that timeout for the Long tests if you desire.

If you edit the multi_report_config.txt file, near the bottom look for Short_Drives_Test_Delay=130 and change it to the longest time (in seconds) any of your drives needs to complete a Long test. For example, if the recommended polling time is 825 minutes, I’d add another 10 minutes due to possible pool activity, so it becomes (825 + 10) * 60 = 50100, i.e. Short_Drives_Test_Delay=50100 seconds.
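In config-file form, that worked example (an assumed drive with an 825-minute recommended Long test polling time, as above) would look something like this:

# (recommended Long test polling time + 10 minute margin) * 60 seconds per minute
# (825 + 10) * 60 = 50100 seconds, roughly 13.9 hours (hence the “almost 14 hours” note below)
Short_Drives_Test_Delay=50100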

The Short Drive Delay is at the end of the script just to allow the drives running a Short test time to complete.

The side effect: The script will continue to run in the background for that extended amount of time before returning control to the main script and generating the report. If you run the CRONJOB at 2AM, it will be almost 14 hours before the report is generated.

So, this is an option you can use if you desire.

In the next version of drive_selftest (the portion of the script that runs these tests), I plan to have the option to monitor for when each drive is no longer running the SMART test, then pass control back to generate the report. I used to have that feature but it was just for myself at the time, then I decided I didn’t want it anymore. The complication is recognizing all the different drive reporting possibilities, which can be a real pain. I’m over that now.

1 Like

If all the drives are in the same vdev and all of them actually show the same “Failed in segment --> 8”, what are the chances of all the drives failing on the exact same block? Maybe you actually just have one bad drive?

Make sure your backups actually work. I recently had a recovery issue on a database (not TN related) where the backup was not working even though everyone thought it was.

1 Like

I believe “segment 8” refers to the portion of the Long Test that has failed, rather than a particular block on the device that has failed.

I’ve searched high and low for documentation explaining what the “segments” are in SMART testing and never managed to find anything, so it’s a rather opaque error message on the part of the drive firmware/SMART system.

I did a search on that specific smartctl message before I posted yesterday, and this is the general take I got on the error. From what I gathered, the “segment” is an SSD block or sector; because of the architecture differences between spinning rust (pie-shaped platters) and memory cells (a grid layout), it’s called a segment, but it means much the same thing overall.

A smartctl error “Failed in segment xx” indicates a hardware issue, likely a bad sector or an otherwise failing (or failed) area on the drive, suggesting a possible drive failure soon. Since an SSD has no moving parts, I take it to mean that a section, or segment, of memory cells can no longer be read, and that the drive is failing. It could be for a number of reasons, but that area can no longer be read. If there are no replacement sectors left, because all the spares have been used up, the drive is worn out and dies.

Now, I did find a lot more discussion about this error on the TrueNAS forums, including the old forums, possibly indicating (IMO) that this is more likely to occur on a CoW system. The general recommendation seemed to be to replace the SSD, so maybe CoW is just hard on storage drives in general.

Industrial robots are famous for this: using an SSD without running smartctl to provide any pre-warning, until the SSD just dies, usually in the middle of production.

1 Like

While it is nice to try to understand exactly what the problem is with the drive, the SMART test failed to complete and pass the test. That is all anyone “needs” to know. But like you, I like to understand more than the basics.

Questions:

  1. Are all 4 drives sharing something, such as the same backplane or the same power connector? I ask because four drives failing the SMART test so close together is kind of crazy. There must be something in common.

  2. Next question, how do you know that the drives failed within minutes of each other?

  3. Do you know how many drives were being tested, and whether anything like a SCRUB was occurring at the same time? I’m looking at a weak power supply as a possible cause for this scenario.

  4. Can you send me a dump using -dump email so I can examine the data and test run it against Multi-Report to see what is happening?

I’m not doubting you, I am trying to put this into writing so I can digest it and try to understand what is going on.

1 - no, two of the disks (sdk and sdn) are in the server’s internal drive cage, sdw and sdx are in an external disk shelf

2 - those four disks were all commanded to run long tests by Multi-Report that morning, and TrueNAS emailed me as each disk alerted “Self-Test Log error count increased from 0 to 1”. The email time stamps and the entries in the TrueNAS event log were all pretty much at the same time

3 - those were the only disks being tested that day. That pool was last scrubbed on the 14th of December, so it wasn’t scrubbing then. The server’s an old HP DL380 in good condition, the PSUs in it typically run around 25% capacity. Likewise, the external disk shelf is an HP DP2700 with dual PSUs in good condition. This kit was designed for spinning disks with a much higher power draw than these SSDs

4 - sure thing, I’ll DM you that once it’s complete. Edit: ah, no need, I see it automatically emails you :+1:

Based on @PhilD13’s post above, I’m now a little more worried that these disks are all actually failing. I could start replacing and resilvering, but I’m worried that the added stress of resilvering four times will provoke an outright failure.

Does “eurybia” sound familiar? Just making sure, found this in my SPAM filter. It happens.

Yeah, that’s me :grinning_face:

Thinking about this last night, I know why Multi-Report always tells me the last test passed successfully, even with the failed long tests.

  • I have short tests of all disks scheduled to run every morning, before Multi-Report runs
  • The staggered long tests don’t complete before Multi-Report runs

Therefore, with these disks passing short tests but failing long tests, whenever Multi-Report runs there will always have been a passed short test. The long tests fail after Multi-Report runs, then there is a short test the following day which passes, just before Multi-Report runs again.

Well, I’m glad this topic came up now (not that I’m glad you have drives which are failing).

It has been a long time since I wrote the part of the script that checks the SMART test results. I need to somehow run a check on the last Long test and verify whether it passed or failed, then report that result before running another test (short or long), so the user knows the test had failed. I may already do that, but I don’t think so.

I would recommend that you run multi-report once a day to test all your drives. Do not use anything else. The default values would run a short test every day unless a long test is scheduled to happen.

I think this would work (I have not tested it in years):
Next, you can run multi-report using the -m switch to “monitor” the results of the last tests; if there is a problem you get a new report, if there is no problem, no report. You could set up a CRONTAB entry for this to run once an hour, just make sure it does not run at the same time as your daily multi-report run. I’d run it at least 30 minutes later given the number of drives you have.
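A rough sketch of what that could look like as crontab entries (the path is an assumption taken from the shell prompt shown earlier; adjust it to wherever your copy of the script lives):

# daily full run at 2AM: kicks off the scheduled SMART tests and emails the report
0 2 * * * /mnt/Apps/scripts/multi_report.sh
# hourly monitor-only run, offset by 30 minutes so it never overlaps the daily run
30 * * * * /mnt/Apps/scripts/multi_report.sh -m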

DELETE your drive_selftest_tracking.csv file. It “should” have been replaced with a new version. A new file will be generated and will track the date/time each drive was ‘commanded’ to run a test. In the next version I will verify the drive is really In Test. This is normally not an issue; however, with drives that are on USB or some strange interface that smartmontools does not recognize automatically, it is best to verify. I still don’t know why this file is replaced on some systems and not on others. With the next version, we will start with a blank slate and migrate only the minimum data we need to keep.

You will note that you have SMART data for your NETAPP drives under the NON-SMART section. This is because the script did not see the drive as SMART capable, but tried anyway.

TROUBLESHOOTING:
This is my recommended troubleshooting, unless you have already done this:

  1. Have you powered down the system and powered back on, then run a Long test to see if it passes? If not, do this first. Power everything off.
  2. Run smartctl -t long /dev/sdk (S/N: S20JNWAG800456) to start a Long test. We expect it to fail. Also run a long test on a known-good drive, just to verify it still passes.
  3. Periodically check whether the test is still running, has failed, or has maybe even passed, with smartctl -a /dev/sdk (see the one-liner after this list).
  4. Assuming the suspect drive still fails, and once the “good” drive has completed its test and you’ve verified it passed, power down and physically swap the two drives.
  5. Rerun the long tests on both drives.
  6. What are the results?
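For step 3, if you don’t want to keep re-running the check by hand, something like this is a minimal way to watch it (it simply refreshes the standard smartmontools self-test log every five minutes):

watch -n 300 'smartctl -l selftest /dev/sdk'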

I find it odd that 4 drives exhibit the exact same failure at the same time. I do not think you have 4 drives that all died at the exact same time, let alone all at the same number of power-on hours.

Think back, what changed in your system?

Thanks for continuing to put time and effort into this @joeschmuck, it’s really appreciated :smiling_face_with_three_hearts:

I’ve deleted the cron job that runs a short test every day and will let multi-report handle all the test scheduling (I’m not sure why I did it this way in the first place if Multi-Report will automatically run a short test when no long test is scheduled; maybe I misunderstood the documentation).

I shut down the server and powered it up again. NB. I have no remote control of power to the disk shelf, so only the server chassis was powered down. Some of the disks have changed drive letter, so we now have sdm and sdn in the server chassis, and sdv and sdab in the disk shelf.

I’ve run multi_report.sh -m and that’s sat monitoring now.

I’ve deleted drive_selftest_tracking.csv

I’ve issued commands to run long tests on all four problem devices.

sdm and sdn (the internal devices that had a proper power cycle) are still running their long tests after about 15 mins. sdx and sdab, the drives in the external shelf, which have not been power cycled, failed in less than 60 seconds. I think this is the issue: I’m confident that if I power cycle the disk shelf when I’m back in the datacentre on Monday, it will solve the problem and they’ll pass a long test again.

Re. changes to the system: nothing significant. It was running in an air conditioned datacentre over the Christmas break with no interruptions. It was running short and long tests as per my usual schedule throughout that time. January the 5th, when the errors occurred, was my first day back in the office. There have been no hardware changes for months, the OS was last updated in October 2025. The only thing I can think of is that shortly before the disk alerts were raised the pool that they’re part of went over 80% usage so I deleted a load of files. But this server has a busy storage subsystem all the time.

I’ll report back on Monday when I’ve been able to power cycle the disk shelf.

Okay, it’s really not happy now :grimacing:

sdm and sdn (the devices in the server chassis, S20JNWAG800456 and S20JNWAG800828) eventually failed their long tests with the same result as before, “Failed in segment 8”; they just took a lot longer to fail this time.

It’s also now flagging sds, sdaf and sdah as failing long tests (which it triggered to run when I ran multi-report with the ‘-m’ switch earlier).

NB. these devices are all on the external disk shelf. This is still looking like the common factor here. The external shelf is also on a different HBA to the internal disks. More to investigate next week when I’m back on site.

It’s also telling me that Pool1 is in a degraded state as sdaj (which is still running a long test) has a read error!

Time passes…

It’s now telling me the pool is online (it appears to have resilvered, but very quickly considering it’s a 33-disk, 38TB pool; it only resilvered 182k of data, not the whole pool, which is why it was so fast). But I still have an exclamation mark against “Topology” in the UI (“Pool is not healthy”), and an X against ZFS Health. Yet the front-page dashboard says everything is online with no errors. The UI is giving me conflicting information :man_shrugging:

The command line seems to think everything’s okay (bar one read error):

root@eurybia[/mnt/Apps/scripts]# zpool list Pool1
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Pool1    48T  38.0T  10.0T        -         -     4%    79%  1.00x    ONLINE  /mnt
root@eurybia[/mnt/Apps/scripts]# zpool status -v Pool1
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 172K in 00:00:00 with 0 errors on Fri Jan  9 16:14:50 2026
config:

	NAME                                      STATE     READ WRITE CKSUM
	Pool1                                     ONLINE       0     0     0
	  raidz2-0                                ONLINE       0     0     0
	    ca05d5aa-54a6-40db-b4a3-4cc5dbc54077  ONLINE       0     0     0
	    6b398a80-7701-489a-89c7-32ef46778a63  ONLINE       0     0     0
	    1d77d204-4e2e-4a96-93c7-6d5eb890a22d  ONLINE       0     0     0
	    1f63f681-444e-4d1e-b1c0-3cd5868118b9  ONLINE       0     0     0
	    a077f95b-e2d7-49df-b4f5-9b8213a68bd9  ONLINE       0     0     0
	    f8f2fc2c-7030-463c-80f9-aec818154b21  ONLINE       0     0     0
	    5d735252-cf4b-4804-b7a6-441f9503070a  ONLINE       0     0     0
	    c86a5bfa-16c9-4384-9147-91e8f28a0c12  ONLINE       1     0     0
	    0fcac123-7533-4777-ac20-3c27f8fc242c  ONLINE       0     0     0
	    d97f4c42-0663-454b-81bc-79310c433431  ONLINE       0     0     0
	    3b0c4262-5606-4156-b41c-03d504a7fe09  ONLINE       0     0     0
	  raidz2-1                                ONLINE       0     0     0
	    ec7d3874-3c13-43fe-ae6a-0e56ace5cdcb  ONLINE       0     0     0
	    ec91f07e-a21a-43cd-8283-adc2bcc04c0a  ONLINE       0     0     0
	    afc7b971-e3e2-4f99-b40f-a059648d61c9  ONLINE       0     0     0
	    660934ad-af7e-4f43-911f-7181a5a3ad58  ONLINE       0     0     0
	    0b2bff21-acdb-446a-8036-c5cd1c5b5d2a  ONLINE       0     0     0
	    91312505-40cb-45d6-9068-d73aea270852  ONLINE       0     0     0
	    067d7be6-d336-42de-8b29-b8a129c882fd  ONLINE       0     0     0
	    94d0d685-0ffa-4828-a6c2-2b725044db59  ONLINE       0     0     0
	    fec62dd4-f536-4ade-97d1-11f334684e6a  ONLINE       0     0     0
	    633fd2d6-2e43-4098-8b3d-8c2297a9d656  ONLINE       0     0     0
	    295074ff-0941-409a-84fe-2607a800e24e  ONLINE       0     0     0
	  raidz2-2                                ONLINE       0     0     0
	    da683169-268b-42a3-89f1-4250dbdc910f  ONLINE       0     0     0
	    2df44b77-2c11-41cc-802c-6e21f524f82e  ONLINE       0     0     0
	    8273cbcd-97c7-4452-8c39-74eb032b7874  ONLINE       0     0     0
	    eb678a85-09c4-4e6f-a106-9cae17781890  ONLINE       0     0     0
	    65d90bdf-f7a2-4701-9d99-e1012fc0a9b6  ONLINE       0     0     0
	    5db54936-ac13-4343-8c0a-df702858f42c  ONLINE       0     0     0
	    a3ee9eab-f49b-484d-a78e-b46119f8caf1  ONLINE       0     0     0
	    0c5a32c6-5e5c-4345-9516-9d02f2b0c791  ONLINE       0     0     0
	    94c36df2-afd3-4aa6-96f4-01b8c1e890ce  ONLINE       0     0     0
	    d4429931-8911-4784-947a-887c28f1d23e  ONLINE       0     0     0
	    046bd705-222a-4201-9368-52dd00390cfb  ONLINE       0     0     0

errors: No known data errors
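If that single read error turns out to be transient (e.g. down to the shelf or HBA rather than the disk itself), the counter can be reset as the action text in the status output suggests. A sketch, not something I’ve run yet:

zpool clear Pool1        # clears the READ/WRITE/CKSUM counters, per the ZFS-8000-9P action text
zpool status -v Pool1    # re-check; the counters should read zero again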

The cascading failures are puzzling and can also point to hardware failure issues.

It’s unfortunate, but I don’t find the GUI to be all that accurate in a lot of the places where it really needs to be. I have never found it particularly accurate for drive issues especially. The view of a specific widget or page may show everything is good when it is not.

Where I usually go in the GUI that is more accurate for drive issues is
Storage > Pool > Manage Devices. This gives you a view of the vdevs that make up each pool and will likely show a vdev with errors. Whether it does or not, open the drop-down for each vdev and it will list the drives by sd?. Click on each drive and it opens a side panel showing some drive info. You get the idea. This is the most likely spot to find drive issues and get them fixed.

How clean is the server room? There may be dust buildup inside the disk shelves or server. I would also, if possible, power down and reseat the drives, cables, and cards that connect the disk shelves to the server. It’s also a good time to check that all fans are operational; maybe one is dead and not reporting as dead. If you have redundant power supplies in the server, make sure both are working. If the server is backed by a UPS, make sure it is not overloaded.

As a suggestion: use only one method of setting or triggering tests such as SMART tests. I personally use multi_report because it makes things easy for me and my methods of tracking and working, and it provides loads of drive info and statistics. But whatever method you use, use it exclusively; otherwise it is just confusing, and you can wind up with drives constantly running, or seemingly stuck in, long tests, which is bad overall.

1 Like

I really dislike writing documentation. I’m actually pretty sure it could be done much better.

I just want to make sure I was clear:

  1. Run multi_report.sh normally, no switches, once a day. This will run the SMART tests.

  2. You can run multi_report.sh -m (with this switch) as often as you desire (maybe once an hour or every 2 hours? If you are tracking heat-related issues, then once every 15 minutes is reasonable) and it should ONLY send you a report if it sees a problem. The -m will not run a SMART test. It is only a Monitor.

  3. The normal multi_report.sh run cannot be run at the same time as the -m version; the script will check to ensure there are not two instances running and, if there are, it will abort. I ran into this scenario early on when some people were running the test too often. Crazy days.

Brother, you have some problems to figure out. I really hope it is resolved with a simple power down and power up sequence. But the HBA or the enclosure, as you said, are common to the affected drives. Those make more sense.

Good luck.

1 Like

If there is a rollover error at 65khrs of operating time, some of my drives will be past that sometime this summer. I’m posting here so I’ll know how to find and re-animate this thread.

1 Like