Hard Drive Troubleshooting Flowchart

What is this:
This is a simple-to-follow (well, for me) set of flowcharts for diagnosing drive problems and failures. A small amount of ZFS is included because many people jump on a drive failure prematurely when it is actually a ZFS issue. This is not a ZFS troubleshooting guide.

What is included you ask?

  1. ZFS ERRORS
  2. CRITICAL DRIVE ERRORS
  3. NON-CRITICAL DRIVE ERRORS
  4. SUSPECT FOUL PLAY (ALTERED DRIVE DATA) - The Seagate Drive Issue Saga
  5. Specific commands used from the CLI/SSH to support these flowcharts so there “should be” no guesswork required. The nvme, nvmecontrol, and zpool commands can be destructive if you decide to do your own thing or listen to ChatGPT or other AI. If you follow the instructions, they are perfectly safe.
  6. Examples of SMART and FARM outputs.

If you have problems or find an issue (something wrong) with the flowchart, please reach out to me and let me know the issue so I can fix it.

(Unfortunately I cannot upload a PDF or convert it to a proper format, so this is a link to my GitHub site.)
Drive_Troubleshooting_Flowchart_v3.pdf

Good troubleshooting!
-Joe

UPDATED 12 MARCH 2025 with Version 2 of the Flowcharts.
UPDATED 13 MARCH 2025 with Version 3 of the Flowcharts - Inputs by @Alexey

I have a few ideas

– “Critical Drive Errors sheet 1 of 1”

  1. There is no distinction between value and raw.
    For “Reallocated Sector Count”, a low value is bad, but a high raw is bad. The flowchart says “ID 5 < 5”, and this should be amended to “ID 5 raw < 5”. Yes, I now know it is explained in Appendix A, but still. I have seen too many cases of confusion, so I recommend it be spelled out explicitly every time.

  2. “Spin Retry Count”. Either a drive fault, or the PSU fault. Assuming PSU fault is somehow eliminated, raw of 1 is “replace drive soonest practical” and raw>1 is “replace immediately”.
    Well, at least in my experience, nothing good comes of these. Also, spin retries on multiple drives indicate likely PSU fault. (is it in scope of the flowchart though?)
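
To illustrate the value-versus-raw point, here is a minimal Python sketch. It assumes the usual `smartctl -A` column order (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) and that the raw field starts with a plain integer. The sample line and the function names are made up for illustration, and the limit of 5 is simply the number quoted from the flowchart above.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value: lower is worse
    worst: int
    thresh: int
    raw: int     # raw value: for "Count" attributes, higher is worse


def parse_attr_line(line: str) -> SmartAttr:
    """Parse one row of a smartctl -A style attribute table (assumed layout)."""
    parts = line.split()
    # parts: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        raw=int(parts[9]),
    )


def normalized_failed(attr: SmartAttr) -> bool:
    """Normalized check: the attribute has failed once VALUE drops to THRESH."""
    return attr.thresh > 0 and attr.value <= attr.thresh


def raw_below_limit(attr: SmartAttr, limit: int = 5) -> bool:
    """Raw check, i.e. the flowchart's "ID 5 raw < 5" test (limit is from the thread)."""
    return attr.raw < limit


# Hypothetical sample row, not taken from a real drive.
sample = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8"
attr = parse_attr_line(sample)
print(attr.name, "value:", attr.value, "raw:", attr.raw)
print("normalized failed:", normalized_failed(attr))
print("raw below 5:", raw_below_limit(attr))
```

In this made-up row the normalized value still looks perfect (100 against a threshold of 10) while the raw count is already 8, which is exactly the confusion that spelling out “raw” is meant to avoid.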

– “Non-Critical Drive Errors sheet 1 of 2”

  1. “Hardware ECC Recovered” on Seagate HDDs is kind of weird and probably should be ignored unless value falls below thresh.

– “Non-Critical Drive Errors sheet 2 of 2”

  1. “Current Pending Sector Errors” advice seems strange? The reference to Seagate makes me expect it will be about “Hardware ECC Recovered”, but is it?

  2. Anyway, whatever it is, you should add that one should be looking at raw, not value. The same applies to UDMA CRC Errors.

– “Appendix A - SMART”

  1. Should it go before the drive troubleshooting? Should it at least be mentioned on the troubleshooting pages (“on how to read SMART data, refer to Appendix A”)?

  2. Generally it may be worthwhile to mention the convention that if the description ends in “Count”, then raw typically shows the actual count for the corresponding condition. Then again, is it relevant?

@Alexey
Very good comments and I appreciate them. I have incorporated most of your comments. As for the purpose of this set of flowcharts, I would prefer to only address drive issues. I do not want to expand on troubleshooting an HBA, for example, or the PSU. I have no issue at all with informing a person that they should be looking in one of those directions; I just don’t want to tell them step by step how to do it. That would make this set of flowcharts a beast to maintain. I’m looking to solve a small issue where people confuse ZFS and drive errors. I’d like to at least tell them if a drive should be replaced, but not until we can sort of prove the drive is at fault.

Incorporated

Incorporated

Incorporated

Yes, it is the same thing. Seagate has a nice little white paper on this. Unfortunately it does not explain every detail, but it leads you to the conclusion that the value is calculated like the Error Rates are (possibly exactly the same way, but I don’t know that for certain). In fact, these are the only three values using logarithmic calculations. I added the explanation for the Error Rates because I am asked this question more than you know. It is added information so the end user understands what is going on.

I do not concur.

Pending Sector Count is not a logarithmic calculation. It is a count and it can return to zero, unlike the UDMA CRC Errors. It is not technically a prefailure warning; however, we do use it as a gauge, along with additional data, to determine if the drive is suspect.

But your comments had me change Non-Critical Drive Errors Sheet 2 for the Current Pending Sector Errors blocks to add in a SMART Long test. >5 is not really a failure on its own and >5 is subjective, so I always recommend a Long test.

UDMA_CRC_Errors (also called SATA Receive Error Count) are not logarithmic; they are actual counts that increase with every transmission error received. The count never decreases. I have never seen a “VALUE” in this field really drop, as in most cases the failure is not the drive itself; it is a SATA cable or the HBA, but typically the SATA cable. However, I did update the final bubble to state that the value will never decrease and it lives with the drive for life.
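
Because the CRC count lives with the drive for life, a single reading says little; what matters is whether it is still climbing. The following is a rough Python sketch of that idea, comparing two raw readings taken at different times (for example before and after swapping a SATA cable). The function name and the returned labels are made up for illustration and are not part of the flowchart.

```python
def crc_trend(previous_raw: int, current_raw: int) -> str:
    """Classify the change in the UDMA CRC error raw count between two readings."""
    if current_raw < previous_raw:
        # The counter is described above as never decreasing, so a lower
        # reading suggests the two readings are not from the same drive.
        return "unexpected"
    if current_raw > previous_raw:
        # New errors since the last reading: look at the cable or HBA first.
        return "increasing"
    # No new errors: the old count is history, not an active problem.
    return "stable"


print(crc_trend(12, 12))
print(crc_trend(12, 40))
```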

Appendices always follow the documentation that references them. Just something I learned as a Technical Writer. But I do point out what Appendix A and B are on the first page of the document.

Not sure it is needed and I’m not 100% positive this would be true in every situation.

Anyway, I do appreciate the comments and the updated flowcharts will be posted shortly.

Yeah, I agree with that; but you can call it a Prependix to circumvent the limitation. Generally I would suggest repeating the reference to the appendix on every flowchart page which mentions SMART, but if you don’t like it, never mind me then.

Everything else I mostly agree with, maybe for want of a partially-dead, more-or-less modern Seagate drive. The ones I have are either 160 GB or fully dead, so there is no experiment I can do.

In any case, I am happy it was useful.

I sincerely appreciate the comments. While we may not agree on everything, that is okay. However, if you can provide some documentation to back up a recommendation that I may not agree with, please provide it. First of all, I enjoy learning, even if I am wrong. Second, I am by far not all-knowing; if I were, I’d be filthy rich :)

You can also add comments to this resource page to point out comments/concerns to other people. It is a good use of the space.

Take care.

Since newer scams are also faking the FARM numbers, maybe you should add that just because SMART == FARM does not mean it is a new drive.

That is a fair point, and since other drives that are apparently affected do not support FARM, I can add that in somehow.

I plan to also update my Multi-Report script to check for more than just the power on hours. I plan to check for LBAs read & written, and any other common data that should match or be within a reasonable tolerance.
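
As a rough idea of what such a check could look like (this is a hypothetical Python sketch, not the Multi-Report script itself), the example below compares counters reported by SMART and by FARM and lists the ones that differ by more than a tolerance. The field names, the sample numbers, and the 5% tolerance are all assumptions for illustration.

```python
def find_mismatches(smart: dict, farm: dict, tolerance: float = 0.05) -> list:
    """Return the names of counters whose SMART and FARM values disagree.

    Only keys present in both dicts are compared. The difference is taken
    relative to the larger of the two values.
    """
    mismatched = []
    for key, smart_value in smart.items():
        if key not in farm:
            continue
        farm_value = farm[key]
        largest = max(smart_value, farm_value, 1)  # avoid dividing by zero
        if abs(smart_value - farm_value) / largest > tolerance:
            mismatched.append(key)
    return mismatched


# Made-up readings: SMART claims a nearly new drive, FARM disagrees.
smart_data = {"power_on_hours": 12, "lbas_written": 1000, "lbas_read": 2000}
farm_data = {"power_on_hours": 25000, "lbas_written": 1010, "lbas_read": 2000}

print(find_mismatches(smart_data, farm_data))
```

An empty result only means the two sources agree with each other. As noted earlier in the thread, matching numbers do not prove the drive is new if both sets have been altered.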

I will be looking at WDDA as well; however, from what I have read so far, it is not the same as FARM. But if I can examine WDDA and it actually has validity, then I will add that to the flowchart as well.
