Can't trim pool of SMR drives?

I have a pool that I’m just dumping data to (packet captures from my Security Onion VM), and given the way data churns here I’d like to be able to use TRIM. Trouble is, TrueNAS is telling me nothing in the pool supports TRIM even though the drives say they do. Any idea why this might be?

Specifics

Version: Scale 24.10.2
Drives: Seagate BarraCuda ST2000DM008-2FR102
Pool: 1 mirror vdev

george@truenas42:~$ sudo zpool trim nsm
cannot trim: no devices in pool support trim operations
george@truenas42:~$ sudo smartctl -x /dev/sdg
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5 (SMR)
Device Model:     ST2000DM008-2FR102
Serial Number:    ZFL0BKH8
LU WWN Device Id: 5 000c50 0b5ad5bda
Firmware Version: 0001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5660
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb 18 16:01:24 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

Please don’t use SMR drives with ZFS / TrueNAS. There are plenty of posts about this on the forums as well.

4 Likes

Spinning disks don’t support TRIM

2 Likes

Maybe you’re confusing it with ZFS Auto TRIM?

SMR spinning disks do support and use TRIM to optimize the SMR part.
See, e.g. Defragmenting SMR Hard Drives – TRIM Enabled – HTWingNut Tech Blog
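
If you want to see what a given drive advertises, something like this should show it (a quick check, not output from this system - /dev/sdX stands in for whatever the drive enumerates as):

# both read the drive's IDENTIFY data; a TRIM-capable drive reports
# something along the lines of "Data Set Management TRIM supported"
sudo hdparm -I /dev/sdX | grep -i trim
sudo smartctl -x /dev/sdX | grep -i trim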

Just to also state… ZFS has two ways to TRIM:

  • Auto TRIM: ZFS will TRIM free space as files are deleted; generally recommended on any device that supports it.
  • Manual TRIM: initiated by running sudo zpool trim poolname, which starts a TRIM of all space that is free at the time the command is run (you can see (trimming) in the zpool status output during this operation). You should run this once if you have an existing pool that you forgot to enable Auto TRIM on. A quick sketch of both follows this list.
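
Roughly like this, using the pool name from the first post (an untested sketch, not output from this system):

# let ZFS issue TRIMs continuously as space is freed
sudo zpool set autotrim=on nsm

# one-off TRIM of everything that is currently free space
sudo zpool trim nsm

# -t shows per-device TRIM state/progress, e.g. "(trimming)"
sudo zpool status -t nsm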

With regards to SMR disks in general… the other posts are right: they really are a poor choice for ZFS, but it should also be said that SMR disks are just horrible for write workloads in general. More specifically, SMR disks that expose TRIM use a sort of CMR cache where incoming writes land first and are later re-written into the SMR zones in the background… and the disk uses this to mask the miserable write speed. This only works if your writes are occasional… that’s how it works on WD SMR disks, anyway…

It’s weird that it reports TRIM support, as I haven’t seen a Seagate SMR disk with TRIM. I’ve only used the ST5000LM000 (and a few others in passing), which just skips all the TRIM and re-writing crap, choosing to only buffer a zone directly on write, so you get slow (but consistent!) write speeds of around 30 MB/s (reads are ~100-130 MB/s).

I do know that Seagate’s Rosewood (7 mm 2.5″) family does employ this re-writing crap but without TRIM (i.e. these drives tend to lock up your system if you write a lot of random I/O, and you’ll never be able to fix it), but not the 15 mm 2.5″ drives. I do not know if your 3.5″ drives have this or not… but if they do have this re-write algorithm, you should just buy something that isn’t defective by design.

The TL;DR is… TRIM won’t help your performance much, if at all.

Not true. Some SMR HDDs do indeed support TRIM; however, IMO it is of limited use for ZFS, and here’s why:

TRIM on SMR HDDs works superficially like TRIM on SSDs: on both technologies the disk is divided into large areas that are written as a unit, and in both cases TRIM helps the firmware make more efficient writes. But once you get into the details they work very differently indeed.

SSDs have cells containing many sectors and can read and write each sector individually; however, you can only write to erased sectors. When you want to overwrite a used sector, the firmware needs to:

  • read the contents of the existing cell;
  • change the sector you want to overwrite;
  • write the modified contents to a completely erased cell;
  • send the original cell to a background queue to be erased for reuse.

This need to read an entire cell just to write one sector is called “write amplification”. A mapping layer maps LBA ranges onto cells to make this work.

When you delete files, the TRIM command tells the SSD firmware that the relevant sectors are no longer in use, and the firmware can then:

  • check if cells have all sectors TRIMmed and if so send the cell for erasure;
  • for partially empty cells, decide whether it is worth copying the non-TRIMmed sectors to an empty cell, ready for the TRIMmed sectors to be written to at a later time.

SMR HDDs are different in that you can read sectors individually, but you can only write an entire cell (zone). So, to write a single sector into a cell that is in use, you need to read the entire cell, change the sector and write the entire cell back again, similar to SSDs; but if the disk knows that the cell is completely empty it can simply write to it without reading the existing contents, which is roughly twice as fast. SMR drives have a CMR cache where writes land first, and in the background the drive destages these writes to the main SMR area. (And if you are doing bulk writes of random blocks, you can fill up the cache far faster than it can be destaged in the background, and then everything slows down by a factor of, say, 100.)

Having now thought about this, I consider it likely that for SMR HDDs:

  1. TRIM is only effective when you have large contiguous empty areas of disk - which happens only when the disk is relatively new or when you have created such areas by running a defrag. If you don’t defrag, it becomes increasingly likely that every cell has at least one sector in use, and then bulk write performance becomes universally poor; if you do defrag to create large areas of contiguous free space, bulk performance will be better. ZFS does not have any ability to defrag, so over time it is likely that every cell will have at least one used sector, at which point TRIM will have zero impact on performance and bulk writes will slow to a crawl.

  2. The performance of SMR drives during resilvering of redundant RAID1/5/6 or ZFS mirrors or RAIDZ will very much depend on whether the resilvering code TRIMs the entire drive before starting to write, and whether the writes are then done sequentially across the drive or randomly. Hardware RAID resilvers sequentially across the disk, so performance can still be reasonable. Now that mirrors are resilvered sequentially by ZFS as well (done to reduce seek times, not for SMR reasons), ZFS mirror resilvering on SMR drives with TRIM might achieve reasonable times if TRIM is invoked first (but I haven’t seen any benchmarks or anecdotal evidence to support this). For resilvering of ZFS RAIDZ the writes are done randomly, so resilver times for SMR drives (with or without TRIM) are usually dramatically larger.

Finally, the zpool autotrim property simply issues a TRIM as and when ZFS returns blocks to the free pool, and only when the drive supports TRIM. So if the zpool trim command doesn’t recognise that the pool supports TRIM, I doubt that autotrim will either.
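
For what it’s worth, the pool property is easy to confirm (same pool name as earlier in the thread):

# autotrim is off by default; even when on, ZFS only issues discards to
# devices that the kernel reports as supporting them
sudo zpool get autotrim nsm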

Since ZFS doesn’t work well with SMR drives, and the only types of HDD that support TRIM are SMR drives, I suspect that no attempt has been made to allow ZFS to detect TRIM on HDDs.

3 Likes

All,

I’m aware of the information regarding ZFS and SMR drives. These are the drives I have lying around and this data is low priority. It’s a 2TB pool storing a 1TiB Zvol that is a rolling window of network packet captures. Just a home lab playing with this data and I don’t want to waste my primary pool storage on it.

The reason I’m wanting to trim is that I’m going to be deleting data as much as I’m writing data, it’s a rolling window. While every track might not get fully erased, the idea with trim is that it would only have to re-write what is actually being stored and the rest can be trashed.

So the theory you have is that, when checking if a device supports TRIM, rather than just checking if TRIM is available, the code first checks whether this is an HDD or an SSD and exits if it’s an HDD. Valid I suppose, but why would they do the extra work?

I feel like these used to trim a few years ago, back when I ran Core still. I wonder if this is a difference of FreeBSD vs Linux, or just an OpenZFS code change. :thinking:

Do you have non-zero results from the output of sudo lsblk -D for these drives?

NAME        DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                0        0B       0B         0
└─sda1             0        0B       0B         0
sde                0      512B       2G         0
├─sde1             0      512B       2G         0

In this example, sda is a spinning HDD without TRIM, and sde is an SSD with TRIM.
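
(As an aside, lsblk -D is just reporting the kernel’s block-queue discard attributes, so you can read them straight from sysfs - sde here being the example SSD above:)

# nonzero values mean the kernel will pass discards (TRIM) down to the device
cat /sys/block/sde/queue/discard_granularity
cat /sys/block/sde/queue/discard_max_bytes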

1 Like

More likely: Linux (lsblk) knows whether the device is spinning or not, so if it isn’t spinning it reports TRIM support, and if it is spinning it doesn’t.

However, if you are dumping packet captures to this disk then that sounds like bulk writes to me, and for the reasons stated I suspect that bulk writes to a non-defragged ZFS pool are not going to be helped by TRIM.
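
If you want to see both pieces of information the kernel exposes for a disk, something like this works (sdg being the drive from the first post - I’m not claiming this is what ZFS itself checks):

# 1 = rotational (spinning), 0 = non-rotational
cat /sys/block/sdg/queue/rotational

# rotational flag and discard columns side by side
lsblk -d -o NAME,ROTA,DISC-GRAN,DISC-MAX /dev/sdg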

1 Like

Ohhhhhh, I don’t have non-zero results. Interesting.

george@truenas42:~$ sudo lsblk -D /dev/sdg
NAME   DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdg           0        0B       0B         0
└─sdg1        0        0B       0B         0

smartctl says trim, lsblk says no. :thinking:

smartctl says what the vendor tells it to.

lsblk says what the drive will actually do. :wink:

IIRC, TRIM support exists on WD SMR drives to some extent - what it actually does as far as internal drive garbage collection goes is a bit of a black box, but those drives have their own SMR-related issues (IDNF errors).

In your case, writing 1T (and never more than that) sequentially to a pair of mirrored 2T drives is likely to align fairly well with the 256M SMR zone size if you’re only writing new data and never doing a logical “overwrite in place” - this isn’t a blanket endorsement of SMR, but rather that you are significantly less likely to get bit by the issues inherent to the recording technology.

Well does it, or is it as @Protopia suggests (IF hdd THEN no trim)? :person_shrugging:

I don’t disagree, and I’m not having a performance problem, even with snapshots. Well, I had to add a SLOG because the SMR drives were not OK with the ZIL being written to them, but I had a 16GB Intel Optane drive, so problem solved. Mostly I’m just trying to figure out why the vendor says this should be supported, but the OS is saying no.

Y’all aren’t wrong with trying to dissuade people from using these, I would not recommend this to anyone. This is very “do as I say, not as I do.”

Oh hey, I put the SMR drive in a Fedora 41 machine and got a different result.

NAME   DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda           0        4K       2G         0

That is more indicative of unnecessary synchronous writes, which have a very detrimental impact on write speeds because the number of physical writes is orders of magnitude higher. This does NOT sound like the kind of data where you would be concerned about losing 5-10 seconds of data in the event of e.g. a power failure - and if not, then you should set sync=disabled on the dataset you are writing this data to.

(IMO - so this is an entirely personal, non-expert view, and @HoneyBadger might disagree - you should set the root dataset to sync=disabled and only set sync=always on datasets or zvols that definitely need it.)
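
If you go that route, it’s just a dataset property (a sketch; poolname/dataset is a placeholder for wherever the captures land):

# see what is currently in effect and where it is inherited from
sudo zfs get -r sync poolname/dataset

# valid values are standard (the default), always and disabled
sudo zfs set sync=disabled poolname/dataset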

Somewhat interesting - however, as I said before, unless you can defrag the disk (and ZFS cannot), eventually all areas of the disk will have at least one used sector, and then the drive will never be able to use TRIM to avoid write amplification and achieve near-native speed. So I wouldn’t worry too much about whether TRIM is working or not.

I think this assumes that ZFS allocates space contiguously and that the 256M zones don’t have any blocks in use, and that either: a) TRIM has told the drive that the zone is completely empty, or b) the CMR cache holds data for the entire zone, in which case the drive can reasonably assume that the zone doesn’t need to be read first.

To put this in perspective, if the zones are 256MB and blocks are 4KB, there are 65,536 blocks in a zone, so you only need 1 block in 65,536 blocks or 0.0015% space utilisation to cause write amplification to happen.

So as you write the packet dumps and later erase them, you may well end up with metadata blocks spread around causing write amplification and slow-down.

So unless ZFS has block allocation strategies for metadata and files that avoid even the smallest free-space fragmentation, I suspect that performance is going to be great only when the disk is new (or for a new partition created after a TRIM of the partition space), or on the rare occasions that a zone happens to be completely unfragmented and the ZFS allocation has been contiguous.

I have sync set to standard and the VM is requesting the sync writes. Ignoring sync writes is a dangerous game. Yes, I can afford to lose 5-10 secs of packet captures, but this is a VM with a dozen or so containers running within it. There’s a lot in play, and if I lose the wrong bit of data I could be left with a corrupted filesystem on the guest. Setting the root dataset to sync=disabled seems even more risky. I’m not sure if that turns off sync for ZFS metadata, but I don’t wanna risk it. :face_with_peeking_eye:

@HoneyBadger I’m curious what your thoughts are about the above. Since the TRIM functionality is available on other Linux distros, is this a bug I should report?

It does indeed, so keeping sync=standard and honoring the VM’s requests for sync writes is the right thing to do.

Possibly. The TRIM command might be blocked at the libata driver level for a specific reason, similar to how other drives are listed in the kernel’s libata quirks table.
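
One way to compare the two environments, since the same drive behaves differently under each (a sketch; it assumes the drive shows up as sdg on both systems and, for the last step, that you have a matching kernel source tree to hand):

# kernel versions differ between TrueNAS SCALE and Fedora 41
uname -r

# what each kernel actually exposes for the drive
cat /sys/block/sdg/queue/discard_granularity /sys/block/sdg/queue/discard_max_bytes

# look for a model-specific quirk entry in the libata source
grep -n "ST2000DM" drivers/ata/libata-core.c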

Well I was able to execute a blkdiscard command against the whole drive while it was in the Fedora machine. It didn’t seem to complain there.
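
For anyone following along, that is roughly (a destructive sketch - blkdiscard throws away the contents of the whole device, so only try it on a drive with nothing you care about; sdX is a placeholder):

# discard every LBA on the device, verbosely
sudo blkdiscard -v /dev/sdX

# then re-check what the kernel reports
sudo lsblk -D /dev/sdX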

Does a regular Debian Bookworm installation report the same as Fedora, or as TrueNAS?