I have a fairly basic question. I recently deleted a large zvol, around 16TB, along with its snapshots, so `zfs list` no longer shows it. When I run `zpool get freeing` I see there is 15.4TB of data still waiting to be freed. The problem is that the to-be-freed number hasn't gotten any smaller in days.
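For reference, this is how I'm checking it (the pool name `tank` below is just a stand-in for my actual pool):

```
# How much space ZFS still has queued for deletion
zpool get freeing tank

# Compare against what the pool reports as allocated/free
zpool list tank
```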
I updated to 25.04, tried the export/import approach, and ran several scrubs. I think I've tried everything I could find online. Is there a way to force the space to be freed?
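The export/import attempt was basically just this (again, `tank` standing in for my pool name):

```
zpool export tank
zpool import tank
```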
I don't mind losing a few bytes of data if this is tied to some corruption. The disks are in a RAID 10-style layout (striped mirrors) and their SMART status looks normal.
I uploaded the outputs of those commands. I can scrub again if needed. One problem I have is that the errors say the list of errors is unavailable, because I already deleted the affected files and their snapshots. I did the export/import and tried many other things. The NAS itself works perfectly; I just want the free space back.
Those ticks are the forum's 'Preformatted text' backticks. This is my first post here and I made a mistake.
I started adding vdevs like this. I've been using FreeNAS/TrueNAS for a long time, and back in the day I added vdevs to the pool as mirrored pairs. The first two vdevs were recently 'changed': I removed one disk, replaced it with a higher-capacity one, let it resilver, and then did the same with the other low-capacity disk. I expected the replacement HDDs to show up with names like 'sda' or 'sdb'; I was wrong about that, but I don't mind the current state. The data corruption problem existed before the HDD upgrade, and I have since deleted those files.
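The in-place upgrade of each mirror disk was roughly this sequence (device names below are placeholders, not my real ones):

```
# Swap the old disk for the larger one and let it resilver
zpool replace tank old-disk-id /dev/disk/by-id/new-disk-id

# Watch the resilver progress
zpool status tank
```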
I'm willing to do another scrub, but every time I do, something weird happens. It scrubs the populated space (obviously), but when it reaches the last 15.4TB it finishes abruptly. A disk may then show 2 or 4 checksum warnings. I've tried clearing the errors before the scrub, and I can clear them after, but the warnings keep coming back.
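What I do around the scrubs is roughly this (`tank` again standing in for my pool):

```
# Reset the per-device error counters
zpool clear tank

# Start a fresh scrub and watch it
zpool scrub tank
zpool status -v tank
```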
Errors like this, with a pool refusing to free space, can point to metadata corruption. Did you have any hex codes in your "list of errors", in a 0x0c7a-style format?
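If you're not sure, they show up at the bottom of a verbose status, e.g.:

```
zpool status -v yourpool
```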
Yes, exactly. The scrub sometimes clears those errors; I think each one is a reference to a file that no longer exists. I know those errors can appear after deleting a file.
I have routine long self-test jobs, and the reports say the drives are fine, but I know what you're saying. The drives are exposed to high heat in a server room, and I've particularly noticed that I/O errors can occur after copying very large amounts of data. I'm well aware that I need a smarter cooling setup. On the other hand, there's no problem with the current state, so I'm trying to avoid copying all the data elsewhere and then recopying it onto a new pool.
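The routine jobs are essentially long SMART self-tests plus a report check, something like this (the drive path is a placeholder):

```
# Kick off a long self-test on a drive
smartctl -t long /dev/sda

# Later, review the self-test log and health summary
smartctl -a /dev/sda
```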
I worked with an AI assistant and it identified `zfs_free_leak_on_eio` as the tunable to tweak. I set it to 1 and the pool has now started freeing space. After that finishes I'll do a scrub. Do you have any other recommendations for after the scrub?
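For the record, since 25.04 is Linux-based the tunable is a ZFS module parameter; this is roughly what I did (note it won't persist across reboots unless made permanent):

```
# Permanently leak blocks that hit I/O errors during freeing,
# instead of letting the background destroy stall on them
echo 1 > /sys/module/zfs/parameters/zfs_free_leak_on_eio

# Confirm the current value
cat /sys/module/zfs/parameters/zfs_free_leak_on_eio
```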
Can you post full system specs here? I'd like to make sure we're not dealing with failed or failing hardware, especially drive or storage controller/HBA faults. I/O errors under load are not normal or expected.
Can you show a `zpool get leaked`, please, to make sure it didn't just toss that space off into the ether?
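Something like this, with your pool name substituted:

```
zpool get leaked yourpool
```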
Sure, my specs are as follows:
Motherboard: ASUS Rampage V Extreme (X99); it has a lot of SATA ports
CPU: Intel Core i7-5960X, the first consumer 8-core CPU, I believe
RAM: 8 x 16GB = 128GB DDR4
Disks: The mirrored disks were deliberately chosen to differ from each other in brand or product line, but not in size. This is to reduce the chance of both sides of a mirror failing at the same time.
Cache and log: 4 NVMe cache drives and 1 NVMe log drive.
I checked the leaked value as you suggested and it reports 125MB, so it isn't a tiny number, I believe.
Since there are many drives in the system, I first thought the I/O faults could be due to shared SATA power cables, so I spread the drives across more SATA power lanes. Then I suspected bad SATA cables or loose connections, so I replaced some of the cables. I also added small fans on the RAM sticks to cool them further, in case that was the cause.
In short, I believe either the onboard SATA controllers or the disks are causing this under high-volume traffic, or the heat that comes with it. I'm not particularly concerned about a few corrupted files; of course, I regularly back up the most important data to an offline drive. I just want to get rid of the error messages when this sort of thing happens. Then I can try adding more fans somewhere, or try other changes, and see whether it happens again.