
Storage Is SSD reliability going to get better or worse in the future?

Discussion in 'Hardware' started by boiled_elephant, 2 Dec 2020.

  1. boiled_elephant

    boiled_elephant Merom Celeron 4 lyfe

    Joined:
    14 Jul 2004
    Posts:
    6,343
    Likes Received:
    672
    Kind of a big one I know, but anyone got a clue?

    The basic landscape, for those unfamiliar, is that they're making SSDs bigger and faster so quickly by stacking more bits into each cell in NAND chips by increasing the number of voltage states a cell can have, to increase storage density. From SLC (single layer) to MLC (multi, i.e. two layer), TLC (triple layer) and now QLC, quad layer - ambiguously called '3D NAND' and various other euphemisms, and sometimes denoted by a Q in the model nomenclature, or just not mentioned at all in product descriptions. Most vaguely say "MLC" which can mean 2, 3 or 4 bits per cell, and much digging is required to find out more precisely what architecture they're using.

    Which is annoying because for reasons I don't understand, this stacking of more and more bits in a cell has a negative impact on reliability: the most reliable SSDs are still the single layer ones, but they're very expensive (and possibly slower?).

    The question is, is this going to get better or worse? Are there mitigation techniques around the corner that might improve reliability again, or is the ongoing push for more storage density and speed going to mean that SSDs just get less reliable, or stay as unreliable as the current crop of QLC?
     
  2. Anfield

    Anfield Well-Known Member

    Joined:
    15 Jan 2010
    Posts:
    6,564
    Likes Received:
    790
    No big breakthrough is going to happen in endurance (at least in consumer drives).
    They'll work out some teething issues and there will be some marginal reliability improvements from that, then it'll be on to rinse repeat with 5 bits per cell NAND.

    In the enterprise market there is already a workaround:
    https://en.wikipedia.org/wiki/NVDIMM

    The use of non volatile DIMMs allows for a massive reduction in writes to the "proper" storage, Intel has their own thing (Optane Persistent Memory), but neither will be cost competitive in consumer applications for a very very long time.
    But even that is technically just a band aid (as it basically just throws a buffer of fancy pants durable (and expensive) NAND at the problem).

    As for speed, more bits per cell doesn't equal better or worse performance, reality is far more complex as in some ways it gets faster and in others slower and the real world impact varies depending on things like controller channel utilisation, DRAM caching, software etc.
     
    boiled_elephant likes this.
  3. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    13,733
    Likes Received:
    2,162
    You're getting levels and layering confused.

    SLC, MLC etc, is Single- Multi- Triple- Quad-Level Cells, and refers to the number of bits stored per cell.

    V-NAND, 3D-NAND etc, refers to stacking layers of cells on a chip to improve density/increase capacity.


    As for speed/reliability, iirc - the fewer bits per cell, the faster and long-lived it is. It's also the most expensive as to hit a given capacity, you need more cells and more chips [even jamming the layers in].
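
    To put rough numbers on the bits-per-cell point, here is a minimal sketch. Each extra bit doubles the number of charge levels a cell has to distinguish, which is the usual explanation for why margins (and endurance) shrink as you go from SLC to QLC. The function names and the 1 TB figure are just for illustration, and the cell count ignores spare area and ECC.

```python
# Bits per cell for each NAND type discussed in this thread.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def voltage_states(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell must hold to store that many bits."""
    return 2 ** bits_per_cell


def cells_needed(capacity_bits: int, bits_per_cell: int) -> int:
    """Cells required for a given capacity, rounded up (ignores spare area and ECC)."""
    return -(-capacity_bits // bits_per_cell)


if __name__ == "__main__":
    one_terabyte_bits = 8 * 10**12  # 1 TB (decimal) expressed in bits
    for name, bits in CELL_TYPES.items():
        print(f"{name}: {voltage_states(bits):>2} states, "
              f"{cells_needed(one_terabyte_bits, bits):.2e} cells per TB")
```

    So SLC needs four times as many cells as QLC for the same capacity, which is where the cost difference comes from.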

    Plus most drives lie to you, or at least hide their deficiencies with controller trickery [and/or a DRAM cache]. Often, to the best of my knowledge/understanding, by running a portion of the NAND as SLC [or having a dedicated wadge of SLC, though iirc this is less common now] for the speed and then shunting data off to the slower xLC bits as/when needed. This is also why performance tends to tank as the drives get full.

    Then there's longevity-eking stuff like wear levelling and/or over-provisioning [the latter is partly why enterprise SSD are funny capacities like 1.92 TB as they tend to feature more over-provisioning].
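
    As a rough sketch of the over-provisioning arithmetic: if you assume (hypothetically, real drives vary) that both a "1.92 TB" enterprise drive and a "2 TB" consumer drive carry 2 TiB of raw NAND, the spare area works out as below. The function name and the raw-capacity figure are assumptions for illustration only.

```python
def over_provisioning_pct(raw_bytes: int, usable_bytes: int) -> float:
    """Spare area as a percentage of the user-visible capacity."""
    return (raw_bytes - usable_bytes) / usable_bytes * 100


# Hypothetical: 2 TiB of raw NAND behind both drives.
RAW_NAND_BYTES = 2 * 2**40

enterprise = over_provisioning_pct(RAW_NAND_BYTES, 1_920 * 10**9)  # sold as 1.92 TB
consumer = over_provisioning_pct(RAW_NAND_BYTES, 2_000 * 10**9)    # sold as 2 TB

print(f"1.92 TB drive: {enterprise:.1f}% spare")
print(f"2.00 TB drive: {consumer:.1f}% spare")
```

    Under that assumption the 1.92 TB drive has about 14.5% spare area against about 10.0% for the 2 TB one, so the controller has more room for wear levelling and for retiring bad blocks.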


    Basically, as far as I'm aware, any improvement will probably come from the controllers managing/hiding any deficiencies of the NAND, as those aren't going away.
     
    boiled_elephant likes this.
  4. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    13,733
    Likes Received:
    2,162
    IIRC NV-DIMM/PMEM is more about reducing latency than extending the lifespan of the storage. You just load your entire DB [iirc it's mostly DBs that benefit from it atm] into the PMEM and shuffle it between there, any proper RAM [as it's all on the memory bus], and the CPU without having to waste time going back and forth to the storage for anything.
     
  5. wyx087

    wyx087 Homeworld 3 is happening!!

    Joined:
    15 Aug 2007
    Posts:
    11,050
    Likes Received:
    350
    In related news, and speaking of enterprise drives, I see Greenliant are saying their "EnduroSLC" can endure more erase cycles
    https://www.techpowerup.com/275456/...age-solutions-enable-high-reliability-systems

    So there may be 2 directions: consumer drives getting lower and lower endurance ratings, and enterprise drives becoming super expensive but with less worry about drive endurance, with "pro" drives staying in the middle as they are now.
     
  6. Spraduke

    Spraduke Lurker

    Joined:
    23 Sep 2009
    Posts:
    575
    Likes Received:
    104
    At the moment I don't see endurance being a major issue for home users. We simply don't read/write enough data to be a significant issue and the pace of capacity growth means drives are replaced at a frequent enough rate. My impression is that the reliability of SSDs is so much better than HDDs that this far outweighs the read/write endurance argument.
     
  7. Paradigm Shifter

    Paradigm Shifter de nihilo nihil fit

    Joined:
    10 May 2006
    Posts:
    2,242
    Likes Received:
    70
    The primary issue for SSDs for consumers, at least as I see it, is more on sustained write speeds.

    SLC/MLC/TLC can sustain write speeds at levels which clearly exceed HDDs.

    QLC cannot. Sustained writes (read: writes larger than the SLC cache) absolutely axe-murder write performance, where I see ~90MB/s write on 2TB Samsung 860 QVO drives. Sustained combined reads/writes will drop that to ~65MB/s, but that is not exactly a common scenario. Fast internet and game installs exceeding 100GB mean that exceeding the SLC cache in a sustained write is actually fairly feasible.*
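
    A back-of-envelope sketch of what that does to a big install. The ~90MB/s post-cache figure is the one quoted above; the 40GB cache size and 500MB/s cached speed are hypothetical placeholders, not specs for any particular drive. It also ignores the cache draining in the background, so treat it as a simple estimate.

```python
def sustained_write_seconds(total_gb: float, cache_gb: float,
                            cached_mb_s: float, direct_mb_s: float) -> float:
    """Estimate the time for one continuous write that may overflow the SLC cache."""
    cached_part_gb = min(total_gb, cache_gb)    # lands in the fast SLC cache
    direct_part_gb = total_gb - cached_part_gb  # written straight to QLC
    return (cached_part_gb * 1000 / cached_mb_s
            + direct_part_gb * 1000 / direct_mb_s)


if __name__ == "__main__":
    # Hypothetical cache size and cached speed; 90 MB/s is the observed figure above.
    seconds = sustained_write_seconds(total_gb=100, cache_gb=40,
                                      cached_mb_s=500, direct_mb_s=90)
    print(f"100 GB write: about {seconds / 60:.1f} minutes")
```

    With those placeholder numbers, nearly all of the time is spent on the part of the write that overflows the cache.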

    I avoid recommending QLC drives for this reason, although if they got cheap enough (eg: half the cost of equivalent capacity TLC drive) I might entertain using them for specific purposes where the limitations do not give such a severe impact.

    Five-bits-per-cell NAND (PLC, penta-level cell) is likely to be even slower. SLC cache is expensive, so I doubt consumer drives will have more than they currently do.

    Basically, if you're not writing hundreds of GB each day, an SSD will likely last beyond its warranty period.
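
    The arithmetic behind that, as a minimal sketch. The 360 TBW rating is a hypothetical example value, not a quote from any datasheet, and the function name is made up for illustration.

```python
def years_until_rated_endurance(tbw_rating_tb: float, gb_written_per_day: float) -> float:
    """Years of writing at a steady daily rate before the rated TBW figure is reached."""
    days = tbw_rating_tb * 1000 / gb_written_per_day
    return days / 365


if __name__ == "__main__":
    HYPOTHETICAL_TBW = 360  # terabytes written; example value only
    for daily_gb in (30, 100, 300):
        years = years_until_rated_endurance(HYPOTHETICAL_TBW, daily_gb)
        print(f"{daily_gb:>3} GB/day -> {years:.1f} years to reach {HYPOTHETICAL_TBW} TBW")
```

    Only at hundreds of GB per day does the rated figure fall inside a typical warranty period, and the rating is a warranty limit rather than a prediction of when the drive actually stops working.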

    That said, when an SSD dies, it does it suddenly and without any warning whatsoever... at least in my experience. Normally when a HDD dies, I have time to make sure I have backups, and pull anything off if not. With an SSD? So far I've never had the chance. They just die, never to work again. I've not lost anything (backups and a bit of luck) yet, fortunately.

    * Actually, I've just thought of a scenario where a normal consumer might see sustained read/writes: decryption and installation of a Steam/(insert store of choice) game pre-load (if it has been encrypted).
     
    boiled_elephant likes this.
  8. Anfield

    Anfield Well-Known Member

    Joined:
    15 Jan 2010
    Posts:
    6,564
    Likes Received:
    790
    Regardless of intention: It still has the effect of reducing the need to access the storage, so it does extend the lifespan of NAND storage.
     
  9. wyx087

    wyx087 Homeworld 3 is happening!!

    Joined:
    15 Aug 2007
    Posts:
    11,050
    Likes Received:
    350
    How do SSDs die?

    I thought the over-provisioning would slowly shrink and the SSD should revert to read-only once the over-provisioned flash cells run out?
     
  10. Paradigm Shifter

    Paradigm Shifter de nihilo nihil fit

    Joined:
    10 May 2006
    Posts:
    2,242
    Likes Received:
    70
    If the drive "dies" by wearing out, yes, that is theoretically what happens.

    However, I've had three SSDs die on me. Two "died in the night" - turned off PC, came back the next morning to "No OS Found". One died while running the OS - that was fun, first I knew about it was when everything just started complaining it couldn't read or write to the filesystem.

    I tested them in other systems and via a USB/SATA adaptor, they're not even detected any more.

    One was an OCZ drive (I forget the model) two were Sandisk Ultras (which died different ways). I bought three of the Sandisk Ultras - the third has seen daily use for the last four years and is still going happily - it's my Steam games drive so spends a lot of time 80-90% full (not good I know) and gets quite a workout when installing something new.

    Oh, I had a Crucial drive that would randomly vanish from my X99 board as well; but on any other motherboard it's perfectly well behaved so I guess it didn't like the SATA controller or something...
     
    wyx087 likes this.
  11. boiled_elephant

    boiled_elephant Merom Celeron 4 lyfe

    Joined:
    14 Jul 2004
    Posts:
    6,343
    Likes Received:
    672
    This mirrors my experiences. I'm feeling a bit shortchanged on the SSD failure situation. I've had about 20 or so drive deaths out of maybe 500 sold, which is a good failure rate but ALL of the failed drives were low activity, relatively new (less than 3 years in the field) and died extremely suddenly and totally. In a couple of cases I recognised the symptoms quickly enough to get an emergency clone done.

    I'm still not sure what causes this, they have spanned across PNY, Crucial, Drevo, Samsung, SanDisk and Plextor so I don't think lousy QC can take the blame. Something in SSDs is volatile and susceptible to sudden catastrophic failure in a way that is not accounted for by the "gradually deteriorating NAND cells" narrative.

    (In the cases where drive health could be checked, they usually showed no explicit warning signs in the health stats, with Disk info etc. still reporting 100% life remaining.)
     
    Paradigm Shifter and wyx087 like this.
  12. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    13,733
    Likes Received:
    2,162
    IIRC that depends on the controller... most drives, once they hit their rated endurance limit they just keep plodding along til they crap out completely.
     
    Paradigm Shifter and wyx087 like this.
  13. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    13,733
    Likes Received:
    2,162
    Relevant to the conversation -



     
    Last edited: 3 Dec 2020
  14. Paradigm Shifter

    Paradigm Shifter de nihilo nihil fit

    Joined:
    10 May 2006
    Posts:
    2,242
    Likes Received:
    70
    Yes, the lowest "life remaining" I had on an SSD that died was 99%... I checked it (by luck!) the day before its fateful death-in-the-night. Do I believe those "life" ratings? Not really.

    I've never had "symptoms" giving warning for an SSD. I've generally found that all computer hardware will survive until obsolescence if it survives the first 3-6 months. Well, and that magical period of about two weeks after the warranty runs out when washing machines always seem to break. ;)

    I want to blame the controller (as I think it's controller death that causes them to "just vanish") and I think the NAND chips might be OK if the controller could be replaced (as with HDDs sometimes). That said, controller death might take the NAND with it, or each controller might have a different encryption key. But I don't have the skill to even try something like that.

    Makes sense I guess; the modern wear levelling algorithms do help.
     
    boiled_elephant likes this.
  15. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    13,733
    Likes Received:
    2,162
    The pretending-to-be-a-HDD (albeit a fast one) nature of SATA SSDs probably doesn't help. SMART isn't the most reliable method of catching or conveying imminent drive death.

    IIRC NVMe can do more on the error reporting front. Instead of hoping something/someone is paying attention to SMART data iirc the drives can [in theory at least] actively tell the OS 'yo, something isn't right...'
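
    For the curious, a minimal sketch of the sort of thing NVMe exposes: the SMART/Health log has a "critical warning" byte, and to the best of my understanding the low bits mean what the table below says. How you get hold of that byte (vendor tool, OS API, etc.) is outside this sketch, and the function name is made up for illustration.

```python
from typing import List

# Critical warning bits, as I understand the NVMe SMART/Health log page.
CRITICAL_WARNING_BITS = {
    0: "available spare capacity below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value: int) -> List[str]:
    """Return a description for every warning bit set in the byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items()
            if value & (1 << bit)]


if __name__ == "__main__":
    for sample in (0b00000, 0b01001):
        print(f"{sample:#04x}: {decode_critical_warning(sample) or 'no warnings'}")
```

    A drive that dies with this byte still at zero would match the "no warning at all" experiences described in this thread.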
     
  16. Paradigm Shifter

    Paradigm Shifter de nihilo nihil fit

    Joined:
    10 May 2006
    Posts:
    2,242
    Likes Received:
    70
    Really? In my (admittedly limited) experience with NVMe drives, they report even less than SATA ones do. At least in a user-accessible way.

    And I think the point that both I and boiled_elephant were making was that SSDs don't usually give you any indication of trouble. They just die without any warning at all.
     
  17. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    13,733
    Likes Received:
    2,162
    They can, in that the NVMe spec allows for it... whether they do or not is again, usually up to whoever designed the controller, and whether or not the OS knows what to do with the information [IIRC Windows doesn't yet, but the feature is in the previews].

    There also could be the assumption amongst SSD mfrs that consumers don't need to know or don't care that their drive is about to die, as most of the drives won't be around/in use long enough for it to be an issue.
     
    Last edited: 4 Dec 2020
  18. boiled_elephant

    boiled_elephant Merom Celeron 4 lyfe

    Joined:
    14 Jul 2004
    Posts:
    6,343
    Likes Received:
    672
    All very interesting. I suppose the follow-up would be to reframe my question: is the phenomenon of sudden no-warning SSD deaths likely to get better or worse?

    To answer that we need to know what causes controllers to fail, and what measures (if any) are already in the specs to guard against it. This might also inform whether I tell customers to bother with data recovery companies for it or not. I have just recently sent one sudden-death-SSD victim to a data recovery company, if he comes back with anything I'll update here.
     
  19. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    13,733
    Likes Received:
    2,162
    In the consumer space, probably unchanged. If any progress is made you're more likely to see it in enterprise drives first. Especially with stuff like NVMe over Fabrics, as not only will you need to know a drive has died, but you'll probably also need some kind of clue as to where the damn thing is.
     
