NVMe performance issue?


root@srv-nas[~]# zpool status Rust      
  pool: Rust
 state: ONLINE
  scan: scrub repaired 0B in 3 days 06:56:57 with 0 errors on Fri Apr 26 00:20:36 2024

	NAME                                            STATE     READ WRITE CKSUM
	Rust                                            ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/9dabf660-7f2f-11ee-ae12-839c959460e3  ONLINE       0     0     0
	    gptid/9dc10c14-7f2f-11ee-ae12-839c959460e3  ONLINE       0     0     0
	    gptid/9ddb8504-7f2f-11ee-ae12-839c959460e3  ONLINE       0     0     0
	  raidz1-3                                      ONLINE       0     0     0
	    gptid/f057a601-7f47-11ee-ae12-839c959460e3  ONLINE       0     0     0
	    gptid/f06d221a-7f47-11ee-ae12-839c959460e3  ONLINE       0     0     0
	    gptid/b2211429-c985-11ee-b2ed-ad885b299040  ONLINE       0     0     0
	    gptid/f1d6c78b-7f47-11ee-ae12-839c959460e3  ONLINE       0     0     0
	  raidz1-4                                      ONLINE       0     0     0
	    da18p2                                      ONLINE       0     0     0
	    da20p2                                      ONLINE       0     0     0
	    da33p2                                      ONLINE       0     0     0
	    da37p2                                      ONLINE       0     0     0
	    da32p2                                      ONLINE       0     0     0
	  raidz1-5                                      ONLINE       0     0     0
	    gptid/17788027-83a0-11ee-a730-e7170bfbda7b  ONLINE       0     0     0
	    gptid/1773a2c6-83a0-11ee-a730-e7170bfbda7b  ONLINE       0     0     0
	    gptid/17534344-83a0-11ee-a730-e7170bfbda7b  ONLINE       0     0     0
	    gptid/e3cf0815-c985-11ee-b2ed-ad885b299040  ONLINE       0     0     0
	    gptid/177dae0e-83a0-11ee-a730-e7170bfbda7b  ONLINE       0     0     0
	    gptid/177b6029-83a0-11ee-a730-e7170bfbda7b  ONLINE       0     0     0
	  raidz1-6                                      ONLINE       0     0     0
	    gptid/b52f8ebc-981c-11ee-bc8b-f330015eb421  ONLINE       0     0     0
	    gptid/1e3785fd-c986-11ee-b2ed-ad885b299040  ONLINE       0     0     0
	    gptid/94d59324-ed42-11ee-bb72-7b614cbb9075  ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    gptid/9d0eb61f-7f2f-11ee-ae12-839c959460e3  ONLINE       0     0     0
	    gptid/9d10dbed-7f2f-11ee-ae12-839c959460e3  ONLINE       0     0     0
	  gptid/9d330358-7f2f-11ee-ae12-839c959460e3    ONLINE       0     0     0

errors: No known data errors

I’ve been getting a warning about an NVMe drive in my pool causing slow I/O:

Device /dev/gptid/9d0eb61f-7f2f-11ee-ae12-839c959460e3 is causing slow I/O on pool Rust.
2024-04-29 09:59:02 (Europe/Brussels)

Looking into this, I see this is my NVMe:

| Device | Disk description    | Serial number | GPTID                                      |
| nvd0   | WD Red SN700 2000GB | 23202J800156  | gptid/9d0eb61f-7f2f-11ee-ae12-839c959460e3 |

This is a special device in a mirror with 2 of these NVMe drives.
Is there a way for me to troubleshoot this and find out what is causing it?

Couple of things come to mind.

For one, we do not know what the motherboard is nor how the drives are connected. I presume this is a PCIe device based on the nvd0 moniker, which I also see with my Optane stick in a PCIe 3.0 x4 slot.

If it is in a similar slot I would look into two things:

  1. Is this on one of those weird Atom boards where the PCIe slots are shared with SATA controllers / drives / whatever? (I hope not)
  2. Is your motherboard perhaps configured at the BIOS level to conserve energy? IIRC, some of those settings caused other users all sorts of headaches, as reported on the old forum.

What’s the other drive in the mirror? Another SN700 or a faster drive?
Is this an “NVMe pool”, or the spinning rust the name implies?

No matter what, with raidz1 vdevs of different widths and a mere 2-way mirror as special vdev, you’re living on the edge.

@itractus Your pool appears to be made of many different RAIDZ widths (3/4/5/6/3) as well as some devices using non-GPTIDs. Was this created through the TrueNAS GUI, the CLI, or imported from a non-TrueNAS solution?
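
For anyone who wants to double-check those widths, here is a minimal Python sketch that counts the member disks under each top-level vdev. It assumes the device tree from the zpool status output above has been pasted in with its relative indentation intact (two spaces for a top-level vdev, four for a member disk). The function name vdev_widths is made up for illustration and is not part of any ZFS tooling.

```python
# Device names copied from the `zpool status Rust` output in this thread.
# Only the NAME column is kept; the relative indentation is preserved.
CONFIG = """\
Rust
  raidz1-0
    gptid/9dabf660-7f2f-11ee-ae12-839c959460e3
    gptid/9dc10c14-7f2f-11ee-ae12-839c959460e3
    gptid/9ddb8504-7f2f-11ee-ae12-839c959460e3
  raidz1-3
    gptid/f057a601-7f47-11ee-ae12-839c959460e3
    gptid/f06d221a-7f47-11ee-ae12-839c959460e3
    gptid/b2211429-c985-11ee-b2ed-ad885b299040
    gptid/f1d6c78b-7f47-11ee-ae12-839c959460e3
  raidz1-4
    da18p2
    da20p2
    da33p2
    da37p2
    da32p2
  raidz1-5
    gptid/17788027-83a0-11ee-a730-e7170bfbda7b
    gptid/1773a2c6-83a0-11ee-a730-e7170bfbda7b
    gptid/17534344-83a0-11ee-a730-e7170bfbda7b
    gptid/e3cf0815-c985-11ee-b2ed-ad885b299040
    gptid/177dae0e-83a0-11ee-a730-e7170bfbda7b
    gptid/177b6029-83a0-11ee-a730-e7170bfbda7b
  raidz1-6
    gptid/b52f8ebc-981c-11ee-bc8b-f330015eb421
    gptid/1e3785fd-c986-11ee-b2ed-ad885b299040
    gptid/94d59324-ed42-11ee-bb72-7b614cbb9075
  mirror-2
    gptid/9d0eb61f-7f2f-11ee-ae12-839c959460e3
    gptid/9d10dbed-7f2f-11ee-ae12-839c959460e3
  gptid/9d330358-7f2f-11ee-ae12-839c959460e3
"""


def vdev_widths(config):
    """Return {top-level vdev name: number of member disks}.

    Assumption: indent 2 marks a top-level vdev and indent 4 a member disk.
    A top-level entry with no members is counted as a single-disk vdev.
    """
    counts = {}
    current = None
    for line in config.splitlines():
        text = line.lstrip("\t")  # tolerate a leading tab from raw output
        if not text.strip():
            continue
        indent = len(text) - len(text.lstrip(" "))
        name = text.split()[0]
        if indent == 2:
            current = name
            counts[current] = 0
        elif indent == 4 and current is not None:
            counts[current] += 1
    return {name: max(count, 1) for name, count in counts.items()}


widths = vdev_widths(CONFIG)
for name, width in widths.items():
    print(f"{name}: {width} disk(s)")
```

On the tree in this thread it reports 3, 4, 5, 6 and 3 disks for the five RAIDZ1 vdevs, 2 for mirror-2, and 1 for the last entry, which sits at top-level indentation with no members beneath it.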

Regarding the slow performance, my first guess would be around cooling, and my second around the lane configuration. What motherboard are you using?

Was this pool built manually for like testing/learning reasons? I can kind of see what you might have been doing. Did you do some of this in the UI and some of this in the CLI?

Aside from that, I do suspect the pool topology is severely limiting you.

Please post the output of zpool iostat -vvylq 30 from a terminal while hitting the NAS with load.
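
If it is easier to capture a few samples to a file and paste them afterwards, something along these lines might help. It is only a sketch: the flags are the ones from the command above, while the pool name, the sample count of 4, the log file name, and both function names are assumptions for illustration. It relies on zpool iostat accepting an optional pool name followed by an interval and a count.

```python
import subprocess


def build_iostat_command(pool="Rust", interval=30, count=4):
    """Build the argument list: zpool iostat -vvylq <pool> <interval> <count>."""
    return ["zpool", "iostat", "-vvylq", pool, str(interval), str(count)]


def capture_iostat(path="iostat.log", **kwargs):
    """Run the command (blocks for roughly interval * count seconds) and save stdout.

    Must be run on the NAS itself, with the pool under load, as a user
    allowed to run zpool.
    """
    cmd = build_iostat_command(**kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    with open(path, "w") as fh:
        fh.write(result.stdout)
    return path
```

Calling capture_iostat() with the defaults would take about two minutes and leave four 30-second samples in iostat.log.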

Have to say I haven’t seen a setup like this before. It’s quite diverse: RAIDZ1 vdevs 3, 4, 5 and 6 disks wide. You have it all!!!

Alright, so about the hardware.

First of all, it’s a VM on my HP DL380 gen9. I have assigned it 6 cores, 258GB of RAM.
The NVMe drives are in a carrier card in a bifurcated PCIe 3.0 x8 slot, which is hardware-assigned to the VM. So is the RAID card, which is an HP H220 or H221; I don’t remember exactly.
This RAID card is connected to a NetApp DS4246, filled with some 2TB, some 4TB, a 20TB and 2 18TB drives.
That’s why I have such a weird pool… I want to get some more 18TB and 20TB drives, and make a more sensible pool… But since I just bought a house, my storage will have to wait for now…
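
As a rough sanity check on the lane question, the theoretical ceiling of a PCIe 3.0 link can be worked out from the 8 GT/s per-lane signalling rate and the 128b/130b line encoding. The sketch below assumes the x8 slot is split x4/x4 between the two NVMe drives, which the thread does not state explicitly, and it ignores packet and protocol overhead, so real throughput will be somewhat lower. The function name is made up for illustration.

```python
GIGATRANSFERS_PER_LANE = 8.0   # PCIe 3.0 signalling rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def pcie3_bandwidth_gb_s(lanes):
    """Theoretical one-direction bandwidth of a PCIe 3.0 link in GB/s."""
    gigabits_per_s = GIGATRANSFERS_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gigabits_per_s / 8  # bits to bytes


for lanes in (1, 2, 4, 8):
    print(f"x{lanes}: {pcie3_bandwidth_gb_s(lanes):.2f} GB/s")
```

A drive that has negotiated fewer lanes than expected, for example x1 or x2 instead of x4, would show up as a proportionally lower ceiling, which is one thing worth ruling out.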

The only path to sanity is building a new pool and destroying the current monstrosity.

As for the “Raid card”, let’s hope you really mean “HBA” and properly passed through to the VM.

HBA, indeed…

Well… There is currently no time frame for me to get same sized disks. So this is what we’re working with…

So you confirmed that the BIOS is not turning off power to the PCIe bus occasionally? I’d go in, save the current config, turn off every energy-saving feature in the BIOS, then see if that makes a difference.

As another user noted, cooling may also be an issue; we do not know what temperature that WD Red SSD is operating at. At least it’s not an SMR drive, given the OEM that manufactured it. :slight_smile:

Cooling would not be an issue in a DL360 set to static high performance… Good thing I have a basement, or the wife would not be happy with the noise :smiley:
Though I might have to dig deeper into the BIOS to see if there are any remaining power saving features.

There is the reporting pane for disk drive temperatures … trust your fans but verify.

I’ve designed my NASes around excellent cooling capability, and the reporting pane was invaluable for figuring out what worked and what didn’t.