Reporting only shows 7 days worth of data

We indeed did, at common enterprise customer request. The netdata backend gave customers per-second reporting for that one week period (at a much higher cost of data on disk). The old system was closer to 5 (10?) min granularity IIRC, pretty worthless for troubleshooting “real-world” issues. I think most folks didn’t even realize that when they looked at the graphs before :slight_smile:

You must not understand how this all actually works if you honestly think swapping back would be trivial. :stuck_out_tongue: Would be far harder than just adjusting the new system to allow longer retention at reduced fidelity (like before). We are looking at that now and I’ll be happy to update the group once we have something tangible to discuss. Considering the number of users running Cobia and how long before somebody even bothered to notice, it’s no wonder this didn’t get any attention until now. Anyways, be patient, we’re looking at it.
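To put rough numbers on the fidelity versus retention trade-off discussed here, this is a back-of-the-envelope sketch. It assumes the old system sampled every 5 minutes (the thread only says "5 (10?) min") and treats "6 months" as 180 days; the function name is made up for illustration and nothing here is taken from TrueNAS or Netdata code.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_metric(retention_days: int, interval_seconds: int) -> int:
    """Number of data points kept for ONE metric over the retention window."""
    return retention_days * SECONDS_PER_DAY // interval_seconds


# New behaviour described in the thread: 7 days at per-second granularity.
per_second_week = samples_per_metric(retention_days=7, interval_seconds=1)

# Old behaviour (assumed figures): about 6 months at 5-minute granularity.
five_minute_half_year = samples_per_metric(retention_days=180, interval_seconds=300)

print(per_second_week)        # 604800
print(five_minute_half_year)  # 51840
```

Under those assumptions, one week of per-second data is more than ten times as many points per metric as six months of 5-minute data, which fits the "much higher cost of data on disk" comment.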

Let me retort simply that the CLI also replied that the only option was to destroy the pool. All I am asking for is a simple list of disks that were found, disks that are missing, and disks that are corrupted / unavailable.

In my limited experience, if a pool import fails due to an sVDEV failure, NONE of that information is available, whether by CLI or GUI. zpool import -F poolname returned zilch, nada, nothing.

I consider that a glaring issue. If I had not intuited that I had an electrical problem on my hands, my only other option would have been to destroy the pool and start over. I don’t consider giving admins known-BAD advice a good default behavior for a problem that appears to be readily replicable.

But management here has other priorities and I respect that, even if I don’t understand why promoting unnecessary data loss for a file system whose entire philosophy seems to be built around preventing data loss is a good idea.

We need to be clear here that the TrueNAS GUI is a layer on top of standard Debian / ZFS functionality. If Debian and / or ZFS cannot tell you from the CLI what is wrong, then neither can TrueNAS.

FreeBSD in the case of CORE but tomatoe tomatoe.

I don’t think letting iXsystems off the hook this easily is wise though. They contribute to ZFS. Thus, if an sVDEV with all the metadata is missing and the pool import fails, it is possible, and likely desirable, for iXsystems to prevent unnecessary data loss by not having the OS / ZFS / TrueNAS give demonstrably bad advice.

For example, also playing devil’s advocate, how hard would it be to change the default ZFS error message of “your pool is dead, destroy and rebuild from backup” to “Here are the drives I expected to find, here are the ones that are missing. Only if these drives are unrecoverable destroy and rebuild the pool.”

Now you’re giving the admin something to work with rather than “I got nuttin”. Whether one is an advanced admin or a newbie, more relevant diagnostic data trumps NO diagnostic data when it comes to troubleshooting a failed pool import. Ditto benefitting tech support at iXsystems if a paying enterprise customer comes calling.

Sorry, but that last comment probably says more about your lack of understanding of the timeline of this than the issue itself or the users.

We install Cobia, and lose all reporting history. Then a couple of weeks later we report zero data - and find that a second reboot is needed to make it work. Another couple of weeks later someone reports missing data. We get a fix in a later minor version - so we upgrade again and due to another bug the reporting history is lost again. So that gets reported and we get another fix. And by this time a couple of months have passed without any opportunity to spot that there is only ever a week’s data.

Small users like us do not have big IT departments and sophisticated monitoring technologies which continually look at the reports and spot something wrong. We do an upgrade, we check things immediately afterwards to see if they are working - perhaps we come back a few days later to double check. After that we leave it until something makes us go back and look at things.

So given all of the above, I don’t think it is surprising at all that this wasn’t spotted until now.

Now to the first comment. As I have said several times in forums, it is your Enterprise customers who largely pay the bills, and it is understandable that you give their needs the greatest weight.

Free users like me (who are enthusiasts and supporters and evangelists for TrueNAS) contribute nothing towards paying the bills, but we do contribute in other ways. I am prepared to bet that most Beta testers are from this group, and that a large proportion of bug reports are from this group. We do contribute in many ways, just not financially. Our own needs are clearly not as important as Enterprise users, but IMO our contributions are sufficient that our needs should be considered in your product plans alongside Enterprise users.

But then there is the other group of paying users - the consumers and small businesses who buy your hardware but don’t buy software support. They are paying customers too. And their needs are broadly the same as the free users because we are all small users - but as paying customers, their needs should definitely be explicitly considered.

What you have said is that (presumably some but not all) Enterprise users said they wanted 7 days of single second granularity, and you immediately went with that regardless of the impact on users who don’t want or need this.

Do you seriously believe (with hindsight) that this was the right way to approach this?

Actually, I had zero illusions that this was that easy. I just wanted to put forward an alternative that sounded easy to give you a prompt to give the solution you want to deliver higher priority.

See above for my comment on why it wasn’t noticed until now. There are pretty good reasons (i.e. several other bugs and upgrades) why this wasn’t spotted until now, and trivialising the impact and blaming users for it taking this long to be highlighted rather than accepting responsibility that a complete lack of communication is the genuine cause is not an appropriate response IMO.

I would say that @Protopia’s point is not wrong, but maybe heightened by our increased sensitivity due to the recent events, which I wrote about in one of my many posts in the forum switch announcement on the XenForo.

I’m not opposed to investigating this. Do we have a ticket opened up somewhere that we can have an Engineer take a look at? This probably should be a different thread as well.

I think folks know I’m a pretty patient guy, but this really is an exercise in futility. I’ve stated several times now how the reporting is being actively investigated right now to see what we can do to give some additional flexibility in local long term retention. We’ll update the aforementioned tickets as soon as we have something ready to review that alters the default behaviors. As for the earlier bugs in reporting, sure I wish we caught those earlier as well. We review the escapes and look at which tests need implementing to catch something like that next time. That has been and is being done.

In the meantime, the pages of back and forth have contributed exactly zero new lines of code towards fixing the underlying issue. So I’ll sit back and let the Eng team do their thing and we’ll get the actual problems addressed soon. Keep an eye on the ticket for updates.

Hi kris,
I’ve raised it on the old forum and in Jira multiple years ago, and while @MorganL said you guys would look into it, I never got any follow-up.

I recently had a similar issue with a bad PSU and my previous experience thankfully pointed me in the right direction, that is to ignore the advice from ZFS / TrueNAS re: destroying my pool and investigating potential electrical / cabling issues first instead. The command line offered ZERO helpful advice when the sVDEV went belly-up due to a cabling issue related to a PSU swap.

Anyhow, here is my ask: I presume that the drive UUIDs / SNs are stored as part of the pool content. If the pool has difficulty importing, it would be AWESOME if the good and the bad / missing drives are listed automatically. I don’t think this is a huge ask and I also contend that it would be useful to you, your customers, and to the rest of us. To my knowledge, this Jira request is still with the Triage Team, suggesting it will never get acted on since it’s been in that status since 12/2021.
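To make the ask concrete, here is a minimal sketch of the kind of report I have in mind. It assumes the pool configuration can supply a list of expected member disks with some stable identifier, and that the import attempt can say which identifiers it actually saw. `ExpectedDisk`, `classify_disks` and the example identifiers are all hypothetical; none of this is an existing ZFS or TrueNAS middleware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedDisk:
    guid: str  # stable identifier recorded for the pool member (assumed)
    role: str  # e.g. "data", "special", "log" (assumed labels)


def classify_disks(expected, found_guids, faulted_guids=frozenset()):
    """Split the expected pool members into found / missing / unavailable."""
    report = {"found": [], "missing": [], "unavailable": []}
    for disk in expected:
        if disk.guid in faulted_guids:
            # The device is present but cannot be used.
            report["unavailable"].append(disk)
        elif disk.guid in found_guids:
            report["found"].append(disk)
        else:
            # The device was expected but never showed up at import time.
            report["missing"].append(disk)
    return report


expected = [
    ExpectedDisk("guid-a", "data"),
    ExpectedDisk("guid-b", "data"),
    ExpectedDisk("guid-c", "special"),
]
report = classify_disks(expected, found_guids={"guid-a", "guid-b"})
for disk in report["missing"]:
    print(f"missing: {disk.guid} (role: {disk.role})")
```

With a failed sVDEV, output along these lines would point straight at the special vdev member, which is a far better starting point for checking cables and PSUs than "destroy the pool".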

Ahh, I see that now. It’s a suggestion ticket and those only get reviewed when they get enough votes to put them on our radar. I got it open now though and will give it a review since you asked :slight_smile:

I’d bet if the info is in ZFS already, perhaps this can be a middleware crafted alert that brings attention to which drive, what role, etc. I’m sure there’s something that could be done better here.

Well - there were a lot of posts from Davvo on the old forum, so I am not quite sure which one he was referring to, but he obviously feels strongly about the switch from XenForo to Discourse.

I don’t share that feeling however - a forum is a forum as far as I am concerned, and things like font colouring are pretty minor IMO. Yes, there is always some hassle and learning curve associated with switching tools, and this can be annoying, but in the big scheme of things it is both relatively unimportant and a normal part of life’s changes. But that is a personal opinion, which has neither more nor less validity than @Davvo’s.

For me, what is important about a Forum is that it is a communications channel whereby users can share opinions and give valuable feedback to the company - but that only works if the company is willing to listen properly and admit when they have made a mistake.

I can understand why @kris is getting tired of me harping on, but the reality is that I am harping on only because ixSystems will not admit that it could have done things better and thus learn from them. So I have no idea whether in private they recognise this and have pledged to do better in the future and are just not willing to say so publicly, or whether they are in denial and are thus likely to repeat the same mistakes again in the future.

Hopefully if I point it out bluntly, @kris will get to realise that this apparent complete disregard for the needs of small users when Enterprise customers want something and of keeping this change secret creates some trust issues, and has the potential of disenchanting or even driving away the small users that (as I pointed out before) do deliver a lot of benefits to ixSystems. All I am seeking is some reassurance that ixSystems understands that they should have developed the NetData functionality differently and communicated more openly, in order that I can be confident that this type of situation will not happen again in the future and that my choice of TrueNAS won’t turn out to have been a poor one. And giving this reassurance is not difficult - it only needs a few words of apology spoken with sincerity.

It’s not the same, but the recent Redis license change fiasco is an example of just how quickly things can go downhill when an open source developer doesn’t listen. Again it’s not the same but Freenode was one of the largest IRC networks until they did something that made them untrustworthy and their user base shrank to zero practically overnight. And some of us still remember Open Office, such memories being all that is left of it after all the open source enthusiasts left. My own personal bête noire is Fabrik, an open source extension of the open source CMS Joomla, which pretty much died when it was bought out. And I am sure that some people here will know much more than me about some of the open source dramas in the Unix community. Personally I just hope, as I think all open source enthusiasts do, that the travel of the free software they select to use is a smooth one rather than a bumpy one.

About the impact iX’s actions might have in how the community is perceiving the recent changes. It’s one of the several posts I made in the Announcement’s thread.

Well. Personally, I upgraded from Core to Cobia a couple of weeks ago. It zapped my logs. Fine.

But I’m only noticing now for obvious reasons :wink:

I do think that improving pool diagnosis is a fine thing, but does it really belong in this thread?

Wonder how I missed this thread :-/

Just put the blame on the new UI!

Just a Postscript to my previous comment about this creating a loss of trust…

Some people might think Reporting is unimportant - and I would certainly agree that it is less important to small users than (say) our ability to run Kubernetes apps or VMs.

Enterprise customers tend to dedicate servers to a single task - they will have a NAS server solely dedicated to holding files, they will have another server for a database server. Or maybe they will virtualise, but if so they will run dedicated server instances and run TrueNAS as a virtual machine and the database server as a virtual machine.

So it is IMO NOT beyond the boundaries of reality to imagine a situation where Enterprise customers say that the Kubernetes and VM functionality represents additional complexity that reduces reliability (even if not used) and increases the security surface area and they would prefer to see it removed.

So the question becomes whether we can trust ixSystems not to automatically do as Enterprise customers ask and remove other functionality that we smaller users rely upon?

And THIS is why we need to understand ixSystems’ perspective on this.

As supporting evidence here is a quote from the TrueCharts Blog:

While there is apprehension about iX-Systems’ stance on Kubernetes, our thorough analysis leads us to believe that a complete removal of Kubernetes-based SCALE Apps from TrueNAS SCALE is unlikely to occur within 2024.

Could a statement from TrueCharts be more uncertain? Their comment is based on analysis rather than a confirmation of continued support from ixSystems, it is only a belief and not certainty, and it is limited only to 2024!!

From a personal perspective I have no evidence to suggest that removal of Kubernetes support is planned or imminent, but equally I have not been able to find a statement that it will continue to be supported.

But the question I find myself asking myself is this: If I was a new potential user considering which NAS software to choose, would this secretive removal of functionality cause me to distrust TrueNAS and go with Open Media Vault or UnRaid instead?

But not in this instance - in this instance, as @kris has explicitly stated, ixSystems replaced an existing feature (reporting) which previously held 6 months data with one that only holds 7 days of data and (at least as far as I am aware) did not deprecate it, or announce it or anything.

We found out only by noticing that data was truncated to 7 days, raising a bug ticket and being told that this was now functioning by design. And then when it started being discussed here, it turns out that some Enterprise users wanted it this way, and ixSystems ignored all the smaller users who might have different requirements and just reduced the reporting period from 6 months (or maybe 60 months) to just 7 days, and didn’t put it in the change log.

So whilst you might like to believe that ixSystems doesn’t operate this way, in this instance according to Kris, this is exactly how ixSystems did things.

And then when ixSystems refuses to acknowledge that this was a mistake and shouldn’t have happened (but instead blames users for not noticing it earlier), we are left wondering whether this is what might happen to other functionality that Enterprise customers don’t want such as Apps (because regardless of previous “too slow” processes, apparently policies have changed).

So, if someone senior could please reassure us that this change from 6 months to 7 days (without any feedback requests or notice or documentation) was absolutely a blip and that it won’t happen again, then I think we can stop being worried.

@MorganL

Sorry to be blunt about this, but let’s talk facts rather than mis-information shall we…

  1. The NAS-122665 ticket is only accessible by ixSystems personnel and is NOT a public document - and even if it was a public document, it doesn’t constitute a communication as part of a change log because you have to go and hunt it out.

  2. “Actually the information is NOT in the release notes for Cobia”!!! Here are the entire statements about netdata in the Cobia release notes quoted verbatim - not a single mention of a reduction in reporting days:

“System reporting has been overhauled and now uses Netdata as the backend to provide system statistics to the Reporting screens.” (Stated twice)

“As part of the netdata implementation and overhaul of the reporting features, Graphite support is no longer built-in with TrueNAS SCALE 23.10 (Cobia) NAS-123862.”

So I stand by my original statement that this was NOT communicated AT ALL.

And it genuinely appears that ixSystems is in denial about this and is not actually listening here - and it is this that I find the most worrying aspect and IMO it is really not a good omen at all.

I have literally just realised that there is an official ixSystems Scale Chart for a Netdata app. I haven’t used this, so can anyone give me any insight into how this relates to the change to using Netdata in the base Cobia release?

The Netdata app was also available in Bluefin according to the 22.12 documentation so this app predates the Cobia implementation and would therefore presumably have run alongside and in parallel with the Bluefin reporting.

I can understand that availability of this App would NOT affect the decision that Netdata was the way to go for base reporting (which still seems to me to have been a good long-term decision), but it does seem to me that for Cobia:

  1. The Netdata app might benefit from being updated to work with the Cobia implementation rather than in parallel - I have no idea whether this has been done, but I suspect that the Netdata chart is based on someone else’s docker / helm image and so unlikely; and

  2. The Netdata app could have been - and still could be - the way to provide users with the ability to adjust how data is recorded and tiered and summarised etc. And if so, then Cobia could have shipped with a default Netdata configuration that did tiering and summarising in order to provide 6 months data logging within the same disk space usage as Bluefin.

  3. Would installing this be a short-term fix for those of us worried about the lack of trend reporting pending a proper solution from ixSystems? Also, would implementation of this also give better retention of reporting stats over version upgrades? And can anyone comment on the memory / CPU impact of installing this? Edit: There are several posts on the Netdata community about Netdata memory usage being at least 500MB. So I think we would need to establish a common starting configuration that was not going to use either too much memory or too much disk space. And we would need to decide whether running two instances in parallel would be a good idea or not.

As I say, I am not knowledgeable about the details of the Netdata app, so if anyone can answer these questions I think that would be helpful.
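For anyone unsure what I mean by "tiering and summarising" in point 2, the idea can be sketched roughly as below: keep the newest data at full resolution and store older data as averages over longer buckets. This is only an illustration of the concept with a made-up `downsample` helper; it is not how Netdata implements its storage, and the factors are arbitrary.

```python
from statistics import mean


def downsample(samples, factor):
    """Average consecutive groups of `factor` samples.

    A trailing partial group is averaged over however many samples it has.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [
        mean(samples[i:i + factor])
        for i in range(0, len(samples), factor)
    ]


# One hour of per-second readings (a synthetic ramp, purely for illustration).
per_second = list(range(3600))

# Hypothetical older tiers: per-minute averages, then 5-minute averages.
per_minute = downsample(per_second, 60)
per_five_minutes = downsample(per_minute, 5)

print(len(per_second), len(per_minute), len(per_five_minutes))  # 3600 60 12
```

Each coarser tier needs far fewer points to cover the same time span, which is why longer retention at reduced fidelity should not need much more disk space.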