
Build Advice SSD without HDD

Discussion in 'Hardware' started by esteban915, 9 Jul 2012.

  1. esteban915

    esteban915 What's a Dremel?

    Joined:
    18 Oct 2009
    Posts:
    34
    Likes Received:
    0
    Hi, this may well be a really daft question but...

    In a new PC, is it ok just to use a SSD, and have no HDD?

    I'm just thinking I could go with something like this http://www.overclockers.co.uk/showproduct.php?prodid=HD-077-OC , a 240GB SSD, to begin with to save on cost and then add a HDD at a later date.


    Are there any issues this would cause?

    Thanks in advance
     
  2. murraynt

    murraynt Modder

    Joined:
    6 Jun 2009
    Posts:
    4,234
    Likes Received:
    128
    esteban915 and PocketDemon like this.
  3. Harlequin

    Harlequin Modder

    Joined:
    4 Jun 2004
    Posts:
    7,131
    Likes Received:
    194
    ssd as main drive and hdd as storage
     
    esteban915 likes this.
  4. PocketDemon

    PocketDemon Modder

    Joined:
    3 Jul 2010
    Posts:
    2,107
    Likes Received:
    139
    lack of any backup - if your SSD were to fail (as x %age of any device will fail early on & you could be unlucky) or there were any issues such as malware, accidental deletion, etc, you'd potentially lose everything.


    Oh, & naturally +1 on the Samsung 830 - the async nand SFs are really not great... Whereas the 830 is a bargain.
     
  5. Posicoln

    Posicoln What's a Dremel?

    Joined:
    10 Jun 2012
    Posts:
    21
    Likes Received:
    1
    Generally from what I have heard SSDs are more reliable than HDDs. Everyone has had a HDD fail, but not a lot of SSDs fail. If you are worried about the fact that flash memory can only be written so many times - how many memory sticks have you used, and how many died from that (ie not in the washing machine XD)? HDDs are mechanical, and for that reason are generally more likely to fail.
    TBH you should be backing up anyway, and if it is very early on you don't have a lot to lose - just return to manufacturer and reinstall.
    Samsung is good with SSDs, less are likely to fail, generally speaking.
    The only reason you would want a HDD now is for large file storage, as they are cheaper per GB of data. I see no other reason for choosing a HDD.
    You can always turn off some temp files, or get them written to your RAM...
     
  6. PocketDemon

    PocketDemon Modder

    Joined:
    3 Jul 2010
    Posts:
    2,107
    Likes Received:
    139
    No, i'm not worried about the r-e-w cycle count at all with regard to what i wrote, & it is also *not* what i described in the slightest...


    The OP talked about only having a single drive - & it's completely immaterial that it's a SSD...

    ...&, whilst i've personally not had a SSD fail, there are more than enough reports of it happening online - i have, however, had faulty memory & a faulty mobo in the past (along with HDDs being doa or failing).


    Firstly, with almost all electronic components/products (inc SSDs) the failure rate follows a 'bathtub curve'; ie -

    [image: bathtub curve diagram]

    - where there are comparatively high failure rates both early on & at the eol; with, hopefully for the manufacturer, very low ones during the warranty period.

    Now, whilst, in this instance, the efr will include doa SSDs, there will also be a proportion which appear fine & then fail relatively quickly & also those that will fail at some point during the ifr - & without a backup then the data's lost.
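
    To make the shape concrete, here's a minimal Python sketch of a bathtub-shaped failure (hazard) rate, built from an early-failure term, a roughly constant term & a wear-out term - the Weibull parameters below are purely illustrative assumptions, not figures from any manufacturer or study.

    ```python
    # Minimal sketch of a bathtub-shaped hazard (failure) rate: a decreasing
    # early-failure term + a constant useful-life term + an increasing
    # wear-out term.  All parameters are illustrative assumptions only.
    import numpy as np

    def weibull_hazard(t, shape, scale):
        # Weibull hazard: h(t) = (shape/scale) * (t/scale)**(shape - 1)
        return (shape / scale) * (t / scale) ** (shape - 1)

    def bathtub_hazard(t):
        early = weibull_hazard(t, shape=0.5, scale=2.0)    # decreasing: 'infant mortality' (efr)
        useful = 0.02                                      # roughly constant random failures (ifr)
        wearout = weibull_hazard(t, shape=5.0, scale=8.0)  # increasing: end-of-life wear-out
        return early + useful + wearout

    for age in np.linspace(0.25, 8.0, 8):
        print(f"age {age:4.2f} yrs -> hazard ~ {bathtub_hazard(age):.3f}")
    ```

    Plot the output against age & you get the high-low-high shape described above.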


    Secondly, even if the drive itself is fine, this doesn't protect it at all from things like malware, user error, etc - & without a decent backup regime then data is far more likely to be lost.


    So, the OP asked "is it ok just to use a SSD, and have no HDD?" & so i believe that i answered it properly.
     
  7. echo three

    echo three What's a Dremel?

    Joined:
    26 Jan 2012
    Posts:
    98
    Likes Received:
    0
    In my current rig I have just an SSD and I have no real need for mass storage on this; everything except games is on my storage tower with backups, and even game saves etc go to backups just in case.
     
  8. tyepye

    tyepye Minimodder

    Joined:
    20 Dec 2010
    Posts:
    982
    Likes Received:
    47
    I just moved to dual SSD usage, to make my system quieter + in preparation for a new case. Have a 120GB for boot and a 240GB for some storage (Pictures/Documents/Downloads/Save Games).

    However, I do have my DroboFS where I keep the majority of my higher GB stuff like videos and music.
     
  9. 3lusive

    3lusive Minimodder

    Joined:
    5 Feb 2011
    Posts:
    1,121
    Likes Received:
    45
    Off-topic a little... but I'm almost certain that the biggest ever study into drive failure rates proved that to be completely false and in fact failure rate is steadily correlated to amount of time used.

    I'll try and find the study and edit this post later, may be worth an interesting read. I was reading it the other day.

    EDIT: Well it seems there have been two serious studies, both of which did not show a typical bathtub curve failure rate. I'm not trying to catch you out or anything; I know the bathtub curve is regularly touted as fact across the net. I'm just saying I'm not sure that's the case and I'm sure someone like you would like to have all the facts presented.

    There was Google's and this Carnegie Mellon University study, both published in 07. As far as I'm aware, they're the largest conducted studies of this question ever undertaken.

    The Carnegie one found that the rate of failure was pretty consistent with the age of the drive, so the older it is the more likely it is to fail:

    [image: Carnegie Mellon study graph of replacement rate vs drive age]

    And thus summarize their findings as follows...

    Google's found slightly different results but still hardly a bathtub curve:

    [image: Google study bar chart of AFR by drive age]

    There's still an early failure rate increase, but there isn't a 'safe' period (or useful life period) of about 5 years, like the bathtub curve predicts, whereby drives are less likely to fail and then enter an end-of-life phase. Instead it seems the chance of failure starts rising at around 2 years and is moderately constant for the years that follow.

    I'm just surprised you would quote it as 'fact' when someone like you is usually on the ball. I don't think there has been any study, at least of drives, which has demonstrated a bathtub curve.
     
    Last edited: 9 Jul 2012
  10. esteban915

    esteban915 What's a Dremel?

    Joined:
    18 Oct 2009
    Posts:
    34
    Likes Received:
    0
    Thanks everyone for your replies, it really is useful to know what other, more up-to-date, people are doing, cheers
     
  11. PocketDemon

    PocketDemon Modder

    Joined:
    3 Jul 2010
    Posts:
    2,107
    Likes Received:
    139
    Yeah, i wasn't suggesting that the curve would literally be flat during the ifr period - it was just the first picture i found that basically showed what a bathtub curve looks like along with naming the parts properly...

    ...& it's what i've always seen used to describe failure rates for electronics (& many other) products.


    it's also not to say that manufacturers may not release inherently faulty products - well, whilst the ideal for them would be that the ifr is at a low level until after the warranty period's passed, it's always possible to find examples where this kicks in far earlier on a large scale...

    i'm not saying this is definitely the case, but something like the 7200.11 'might' be a very good candidate for this...

    ...&, since it was a pretty popular model at the time (as the 7200.10 was good & it appeared to be better), it 'could' have been enough to skew a curve that looks at all HDDs returned to whoever, such that it stopped the curve being as flat.


    Still, i'll look forward to having a look at the study as it's always useful to look at alternative evidence. :)
     
  12. Posicoln

    Posicoln What's a Dremel?

    Joined:
    10 Jun 2012
    Posts:
    21
    Likes Received:
    1
    Sorry man I didn't mean to 'argue' with you. Naturally the bathtub curve makes sense, and if your data doesn't exist in three places, well, it doesn't exist at all, does it?
     
  13. PocketDemon

    PocketDemon Modder

    Joined:
    3 Jul 2010
    Posts:
    2,107
    Likes Received:
    139
    Thanks for posting them. :)

    Right, it's going to take too long to go through them completely atm, but... ...pulling some very immediate observations -

    - Carnegie relate that Seagate report that 43% of all returns are not failed drives in their eyes & that there are different metrics for measuring failure...

    Now, that doesn't mean that we can assume that this 43% is spread equally across the lifespan of the drives - once we get past any early failure rate that might exist, it is more likely that an arbitrary testing metric assigned by an organisation would be triggered without a drive actually having failed...

    ...& they specifically state that "we will report the annual replacement rate (ARR) to reflect the fact that, strictly speaking, disk replacements that are reported in the customer logs do not necessarily equal disk failures."


    indeed, from what's suggested, it is only when something goes wrong with a system that the techs test an array of components, & then (beyond out & out HDD failures) if the arbitrary metric has been exceeded on a HDD then it's marked as failed - rather than testing all of the HDDs regularly to see if they've reached the metric which would naturally give very different results.


    There's also stuff we don't know...

    An easy, off the top of my head, example being that we also do not know how the drives were deployed.

    Esp if they were placed into legacy systems there would be a higher chance of something else failing during the drives' lifespans -> leading to either

    (a) the HDDs being checked despite the fact they were working properly --> an arbitrary metric leading to non-failed drives being returned.

    (b) the failure of something else actually causes the HDDs to fail --> it is not the HDDs' fault.


    So, very quickly from the first couple of pages, we have -

    1. a 43% error rate in the reporting of failures & arbitrary metrics that do not equate to failures,

    2. no pre-emptive testing of drives, so we don't know how many drives were not returned that would have failed the same arbitrary metric at any point in the lifespan.

    3. & stuff that we don't know about the deployment that is reasonably likely to affect things.


    Now, on the basis that the longer things are running, the more likely that something in the system is going to 'fail' -> the testing of drives...

    ...the 43% error rate is far more likely to compound over time rather than remaining constant.

    This then pulls their figures into question as you 'could' reasonably get something approaching a bathtub curve by weighting the mistaken returns more heavily as time increases.
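
    As a toy illustration of that last point (hypothetical numbers only, not figures from either study): if the true failure rate were flat but the mistaken returns were mostly discovered later in the lifespan, the observed replacement rate would still appear to rise with age.

    ```python
    # Toy illustration with hypothetical numbers (not study data): a flat
    # *actual* failure rate can look like a rising one if the mistaken
    # returns are mostly discovered later in the lifespan.
    true_failure_rate = [2.0, 2.0, 2.0, 2.0, 2.0]   # % per year, assumed flat
    false_return_rate = [0.3, 0.8, 1.4, 2.1, 2.9]   # % per year, assumed back-loaded
    # (the false returns above are ~43% of all returns over the 5 years,
    #  matching the vendor figure quoted, but weighted towards later years)

    observed = [t + f for t, f in zip(true_failure_rate, false_return_rate)]
    for year, rate in enumerate(observed, start=1):
        print(f"year {year}: observed 'replacement' rate ~ {rate:.1f}%")
    # -> 2.3, 2.8, 3.4, 4.1, 4.9 : an apparently rising curve from a flat true rate
    ```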


    [NB it's also interesting to note what else apparently fails within a system on a regular basis in table 3...

    Well, from two of the companies providing data, memory (of which there will be fewer components) is ~equally as likely to fail as HDDs... ...with the PSU being almost twice as likely in one of them...]

    * * * * *

    - Then, just to note two things -

    1. Google's results were using consumer grade HDDs with a 3yr warranty that aren't rated for 24/7 usage, & testing them over 5 years with 24/7...

    2. ...& whilst we had 43% from Carnegie, Google refer to a 20-30% figure of drives being returned that had not failed...



    & otherwise, whilst i'll come back to it all when i've got more time (it's almost 5am & i need to sleep), the original use of the bathtub curve was relating to SSDs...

    ...whereas these are all for HDDs - so there's a couple of effects that wouldn't affect the lifespan to the same extent (ie temperature, moisture...) &, similarly, as SSDs have no moving parts, a significant set of components that could fail in a HDD simply isn't there.

    & no worries at all. :)
     
    Last edited: 10 Jul 2012
  14. PocketDemon

    PocketDemon Modder

    Joined:
    3 Jul 2010
    Posts:
    2,107
    Likes Received:
    139
    Double post...

    Meant to add this alt version of a bathtub curve as it better explains things -

    [image: alternative bathtub curve diagram]

    from here.


    Yeah, looking at the two reports still needs more work to analyse whether what they're actually saying holds up (beyond the issues already noted), but as i'd double posted anyway then...
     
    Last edited: 10 Jul 2012
  15. PocketDemon

    PocketDemon Modder

    Joined:
    3 Jul 2010
    Posts:
    2,107
    Likes Received:
    139
    Right, to start with, along with accepting that this is a bit of a read, i've also been out drinking for a couple of hours - okay, not 'a session', but 4 pints of Guinness & the odd typo's bound to have crept in.


    Anyway, that wasn't the best analysis of why the curve might be wrong from the bits i'd looked at last night...

    Before re-explaining what i was trying to describe, 3 quick notes -

    (i) the 3 month, 6 month & 1 year bars in Google's bar chart would need to be added to get the full Y1 afr - this 'appears' to make a drive less likely to fail in Y4 than Y1 if the data was 100% accurate,

    (ii) there's a useful bit in the Carnegie one that states -

    "Intuitively, it is clear that in practice failures of disks in the same system are never completely independent. The failure probability of disks depends for example on many factors, such as environmental factors, like temperature, that are shared by all disks in the system. When the temperature in a machine room is far outside nominal values, all disks in the room experience a higher than normal probability of failure. The goal of this section is to statistically quantify and characterize the correlation between disk replacements."

    - which, i would suggest, reinforces my points regarding both (a) non-fatal disk 'failures' (since all the disks in that box are 'apparently' then being tested at this point under a metric which leads to a high %age of false failures) & (b) that there 'appears' to be a strong link between other things & bunches of disks being 'failed' simultaneously (which, alongside temp & whatnot, could have other components in the system being the cause of the initial failure of the box).

    (iii) &, quite importantly, we need to note that at least the Carnegie one appears to be looking at servers which will have both very different operating conditions from a home user &, generally, a very different reason for the false reporting of failures.

    Well, 'a' reason for a home user/workstation fail rate over time might be excessive power cycling above the spec...

    ...whereas environmental factors are more likely to affect & massive semi-simultaneous losses are more likely to occur within a large scale server environment.

    * * * * *

    Anyway, back to the plot...

    Perhaps the easiest way to re-explain some of last night's post is to set out some 'assumptions' based upon what's given.

    1. We can, imho, reasonably subdivide 'failures' into 3 loose groups -

    (a) actual mechanical/controller/whatever failures which cause the drive to stop operating entirely.

    (b) failures which would be accepted as such by the manufacturer, but have not caused the drive to stop operating - ie i'm sure many of us have had HDDs which have failed smart testing, but still operate perfectly for a no of years.

    (c) failures which are based upon an arbitrary testing metric/protocol that are not actual failures in terms of the manufacturing process - ie i'm not saying that this is a protocol that's used, but all of us will have had data relocated on HDDs due to media verification errors, but to a point that's well within the spec of the HDD... ...it 'could' be the case, for example, that x no of array consistency check errors automatically triggers sending all of the drives in that box back, whether or not the drives are actually faulty.


    Anyway, the only thing we actually know is that somewhere between 20-43% of reported failures fall into category (c)...


    2. We can then reasonably, again imho, look at the point in time when a return will happen -

    (a) for drives in category 1(a), this is likely to occur shortly after the drive fails - providing a reasonably good correlation between time & failure rate.

    (b) for drives in category 1(b), this is somewhat unknown - how often does each company perform a raid dump of some description? how likely is it that the failure of another component will cause multiple HDD failures? etc...?

    So how long actually is it before a drive 'fails' in terms of the manufacturer's specs & it is actually discovered that this is the case?

    (c) for drives in category 1(c), this is even more unknown - well, since it cannot be from an actual drive failure ((a) or (b)), then clearly either it's (i) part of a wholesale return on the basis of x no of drives bought at the same time having failed, (ii) testing that is caused by some other issue using a testing metric/protocol that is far more stringent than the manufacturing spec or (iii) they are unused drives that are returned for some reason.

    [NB this is assuming that drives are returned as soon as they actually fail, are discovered (either (b) or (c)) to have failed, or are returned for other reasons under (c) - 'if' a company chooses to 'save up' returns until there were x number then this could potentially shift 'failures' from one year to the next...

    ...albeit that, it's only reasonable to imagine that it will only lower Y1 (since there's no necessity to get the drives returned asap) & increase Y3 & Y5 (as these are the two warranty periods that would create an incentive to get them returned).]

    So, whilst we have a range for the summation of (a) + (b), these are aggregated %ages over the whole 5 year test rather than being split on age in any meaningful way. They also do not account for any drives in (b) that do not actually stop working but have failed according to the manufacturer's specs but are not picked up.

    Then, we neither have any distribution figures for (b) or (c), nor have any indication as to when the actual failure for (b) happened, nor what actually happened in each case to cause x amount of (b) or all of (c) to occur.


    3. So, ignoring drives in group 1(a) that will clearly cause an immediate error, we can (again imho) only state that -

    (b) depending on either the regularity of a raid dump or something (that may or may not be a HDD - HDDs not being a majority %age according to Carnegie) in the box failing, drives are tested & found to be faulty in terms of manufacturer specs.

    (c) &, dependent upon either something in the box failing or wholesale return or too many drives of a type failing so unused ones being returned, a testing metric/protocol is used that wrongly states that 20-43% have failed when they haven't.


    4. Now, in a multi-component system -

    (i) the chance of failure due to one of the many components increases over time (esp with legacy systems where some components may be close to exceeding or have exceeded their lifespan already),

    (ii) as we're excluding independent HDD failures which are catastrophic (ie 1(a)) in 3, it is only when something like a raid dump is carried out or (i) above occurs that any checks are done on the HDDs - so there is the potential accumulation of years of drives that have actually failed to the manufacturer's spec but are still running, which is not accounted for.

    (iii) & with wrongly diagnosed drives, any of the checks that might give rise to using an alt metric/protocol are only likely to occur following the failure of other components (be they HDDs or anything else).


    5. What 4 then describes, imho, is a situation where, notwithstanding actually failed drives from 1(a) & (b) which are picked up semi-instantly -

    (a) due to the increased chance over time of checks actually being carried out that will show failures either under the manufacturing spec or under an arbitrary metric/protocol (or, for example, unused rejections based upon the experience of drives of the same model no d.t. the previous problems), as drives get older there will be more checks carried out on them.

    (b) & both environmental factors & usage which is outside of spec will have a cumulative effect over time - what the Google report terms (quite reasonably) 'survival of the fittest'; albeit that it then doesn't adjust for the potential consequences.


    Now, naturally, (a) will lead to drives being found to be faulty (whether they are or not) at some point after the manufacturer's spec or the arbitrary metric/protocol has been exceeded... ...whilst (b) leads to a shorter lifespan d.t. misuse of the drives.

    So, for (a), what it will do is shift earlier 'failures' (whether they are or not) to later within the lifespan... ...& similarly, for (b), it will increase the failure rate unfairly to later in the lifespan.



    in short, what this all means is that there's an inherent bias toward finding failures that are not catastrophic (ie 1(b) & (c)) later in the lifespan, than there is earlier...

    ...&, since at least 20% (up to 43%) of the drives returned have not failed at all - & these are (obviously) non-catastrophic failures, the finding of these are also going to be biased toward the end of the lifespan.

    So, whilst it is actually quite reasonable that the reported failure rate appears to increase over time, this, i would argue, is a case of 'correlation does not equal causation'.


    i am in no way suggesting that this either proves categorically that the underlying trend that these studies show is wrong or that the bathtub curve would automatically take its place if it actually is...

    ...however i believe that i have raised sufficient question marks to make the automatic assumption that these 2 studies are correct flawed - & have (i hope) reasonably argued that the specific data shown in the tables/curves/graphs presented need to be reevaluated as they do not appear to hold up to scrutiny.


    Either way though, what does appear to be the case is that, if you were to go back in time for ~6 years before 2007 (given that there was naturally a need for time to do the studies), the quoted afrs by the manufacturers were not valid based upon either actual failure rates experienced by the companies involved in the study (since even if we shift the curve & discount 43% of the entire failures over 5 years then there's still too high a failure rate) or the protocols/metrics that the techs at the time in those companies were using...

    ...at least within the environments they were within & with the uses that they were put to.
     
    Last edited: 11 Jul 2012
  16. 3lusive

    3lusive Minimodder

    Joined:
    5 Feb 2011
    Posts:
    1,121
    Likes Received:
    45
    The problem with your analysis is that you're looking at the results with a pre-conceived notion of what should be happening, and then trying to mould, or make excuses, as to why this isn't demonstrated in the data, instead of just describing what the results represent.

    My point is that there is no data or study which suggests a bathtub curve exists in relation to hard drive failure rates (and of course not in relation to SSDs which is a newer technology and hasn't been studied robustly), so why even use that as your framework of how hard drives live and die in the first place? If there is such evidence, show us.

    It seems more likely that this theory has been pushed because of how other electrical equipment operates, and it was taken for granted that this would be the case in drives (it also suits manufacturers who can downplay the risks of failure which their drives really exhibit).

    I mean, if it were true that there existed separate studies which suggested that their chance of failure follows such a pattern (decreasing at the beginning, constant for about 5 years, then increasing at the end of life period, instead of just increasing steadily with the age of the drive), then maybe there would be some weight behind it, but instead I don't see why you would believe such a thing to be true in relation to drives, especially when serious studies have contradicted it.

    Imagine if you'd never heard of a bathtub curve. It's very unlikely you would then suggest, after seeing the two studies posted above, that they follow such a pattern. You wouldn't look at the results and go, 'hmm, it seems we have a non-linear pattern emerging in the shape of a bathtub'. No, it seems more likely you would suggest that it's linear with the age of the drive.

    Or alternatively, if you focussed on Google's study, you might suggest that there is a kind of early infant mortality pattern (which standard doctrine predicts), but that the chance of failure increases much sooner than after the first 5 years, where it would, according to what you're saying, be entering into a 'normal operating life'. In fact, in Google's study the chance of failure increases dramatically after 2 years, increases slightly again at 3, then drops a little but stays moderately high. That is hardly a bathtub curve.

    The Carnegie one suggests that it is in fact linear and that the chance of failure does steadily increase with age, which makes more sense in my eyes anyway. The more you use a drive's internal components, the more you wear away at them and the more you increase its chances of failure.

    And by the way, if 43% (or 20/30% in Google's) of the drives returned were not really failed drives, why may I ask would they be particular to a period of the drive's life? Why would drives ageing two years, for instance, be more likely to be wrongly returned as 'failed' than drives ageing 5 years (for example)? I have a funny deja vu of your last analysis of the SSD return rates from the French etailer, remember (lol)?

    So far, I think there's good evidence to suggest the bathtub curve fails to properly account for the effect age (use) has on HDDs.

    EDIT: I have not yet read your directly above post - this is in response to your one posted at 04:57.
     
  17. PocketDemon

    PocketDemon Modder

    Joined:
    3 Jul 2010
    Posts:
    2,107
    Likes Received:
    139
    Whilst the bathtub curve is a standard model that describes, reasonably accurately, the failure rates which should be expected from a wide range of devices & components irl, this is part of the reason why analysing the studies needed re-writing... ...which i've done...

    ...& hence why i stated that -

    "i am in no way suggesting that this either proves categorically that the underlying trend that these studies show is wrong or that the bathtub curve would automatically take its place if it actually is...

    ...however i believe that i have raised sufficient question marks to make the automatic assumption that these 2 studies are correct flawed - & have (i hope) reasonably argued that the specific data shown in the tables/curves/graphs presented need to be reevaluated as they do not appear to hold up to scrutiny."


    (although i have altered the main conclusion at the end of this post having realised exactly what the studies are actually giving evidence for)

    * * * * *

    As to the 20-43% mistaken return rate, you simply cannot assume that this will be proportionate over the lifespan - but instead have to look at what the studies actually say.

    (Using the numbering from the previous post) whilst drives in 1(a) will clearly be picked up semi-instantly - there has to be a delay for those in 1(b) & 1(c) since, without any testing at all, they would appear to be working perfectly well.

    [NB naturally accepting that *some* drives in 1(b) will be picked up through raid dumps - though we have no data as to the frequency that these were taking place.]​


    in a multi-component system, i assume we can agree that, as you add components, the chances of a single occurrence of a failure of any one component increases over time...

    ...this obviously being why, when looking at raid arrays, for the HDDs alone we nominally talk about the risk of adding a 2nd drive multiplying the chance of a single drive by 2, for a 3rd drive by 3, & so on...

    [NB though this naturally ignores environmental factors that are outside of spec or some other component failing & either 'taking out' large nos of drives or making them appear to have failed - where this is arguably not the HDD's 'fault'.]​
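
    For anyone wanting the arithmetic behind that shorthand: the exact figure is 1 - (1 - p)^N for N drives, & multiplying by N is just the small-p approximation. A quick sketch, assuming a purely illustrative 3% annual failure probability per drive -

    ```python
    # The shorthand above is the small-p approximation of the exact figure:
    # P(at least one of N drives fails) = 1 - (1 - p)**N  ~  N * p  for small p.
    # The 3% annual per-drive failure probability is an assumption for illustration.
    p = 0.03

    for n in (1, 2, 3, 4, 8):
        exact = 1 - (1 - p) ** n
        approx = n * p
        print(f"{n} drive(s): exact {exact:.3%}, N*p approx {approx:.3%}")
    # e.g. 2 drives: exact 5.910%, approx 6.000% - close enough while p stays small
    ```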


    Now, it is clearly unreasonable to imagine that all of the drives in a massive array would be pulled from the array & entirely scanned individually for any possible error (again beyond raid dumps) that frequently (esp since Carnegie states that "the number of components in a single cluster approaches a million"), but one thing we do know from the Carnegie document is that -

    "For example, a common way for a customer to test a drive is to read all of its sectors to see if any reads experience problems, and decide that it is faulty if any one operation takes longer than a certain threshold. The outcome of such a test will depend on how the thresholds are chosen. Many sites follow a “better safe than sorry” mentality, and use even more rigorous testing. As a result, it cannot be ruled out that a customer may declare a disk faulty, while its manufacturer sees it as healthy. This also means that the definition of “faulty” that a drive
    customer uses does not necessarily fit the definition that a drive manufacturer uses to make drive reliability projections. In fact, a disk vendor has reported that for 43% of all disks returned by customers they find no problem with the disk."



    So, from this we can deduce that it will usually only be at a point where a problem of some kind is detected with either a drive (1(a) or 1(b) in the case that a raid dump suggests that there may be an issue) or elsewhere within 'the box' that any drive may be tested individually.

    This naturally introduces a delay in some drives from 1(b) & 1(c) being found to be faulty.


    Also, referring back to the note above, we have "environmental factors that are outside of spec or some other component failing & either 'taking out' large nos of drives or making them appear to have failed."

    in these cases again, using a drive outside of spec has a cumulative effect upon longevity (shortening the lifespan) & the chance of another component failing in a way that has one of these types of effects increases over time.

    * * * * *

    Now, as 2 quick asides, two things i neglected to add in are that -

    1. both of these studies neglect to include any data from the HDD manufacturers as to the 'burn in' failure rate.

    Since i assume you would reasonably agree that some drives would fail this testing, this will increase the early failure rate somewhat - albeit one that the end user would never see.

    Taking Google's figures (as, i believe, they show the most extreme failure rate) & assuming that -

    (a) the mis-reported failure rate is only 20% (rather than the 20-30% or 43%)

    (b) that the 20% is equally shared across the whole data set over time (which i do not agree with as i have argued that it should be weighted later) & discounted

    (c) all actual failures (under 1(a) or (b)) are discovered instantly within the data set (which i do not agree is remotely possible & i have argued that this biases everything later)

    (d) & that all of the drives are returned on the day that they actually are perceived to have failed (which i do not believe is remotely viable) & that this equals actual failures.

    - we only appear to require ~0.3% of the drives to fail the manufacturer's 'burn in' tests (& never reach the outside world) for the Y1 total to equal the Y3 total... So it would not be difficult to imagine that this would have a significant effect...

    in fact, it would potentially be easy enough to actually create the basis of a bathtub curve solely from this - & this data is 'supposed' to be included when looking at actual failure rates (there's a toy sketch of the arithmetic after the second aside below).


    2. & i have completely forgotten to include the fact that brand new replacement drives, that have sat in boxes for x years, will be added in as drives fail...

    [NB naturally, if you're building an array of, say, 200 drives you don't just buy 200 drives as any failure without an immediate replacement then creates an increased risk of failure for the entire array...

    ...& if we're talking about approaching a million drives in a cluster, you'd maybe need more than a couple of spares(?).]​

    ...& some of these are likely to contribute to a later failure rate in the overall lifespan simply by virtue of when they were purchased, despite them being (effectively) brand new & so actually failing early in their lifespan.
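
    Here's the toy sketch mentioned above - made-up yearly rates only (not Google's or Carnegie's actual figures), just to show the mechanics of discounting an assumed share of false returns & then adding back a pre-shipment 'burn in' failure rate that end users never see.

    ```python
    # Toy arithmetic only - made-up yearly rates, not Google's or Carnegie's
    # actual figures.  It just shows the mechanics: discount an assumed flat
    # share of false returns, then add back a pre-shipment 'burn in' failure
    # rate that end users never see, and year 1 closes the gap on later years.
    reported_arr = {1: 2.0, 2: 3.0, 3: 4.0, 4: 4.5, 5: 5.0}  # % per year, hypothetical
    false_return_share = 0.20                                 # assumed flat 20% discount
    burn_in_rate = 1.5                                        # % failed at the factory (assumed)

    adjusted = {year: rate * (1 - false_return_share) for year, rate in reported_arr.items()}
    adjusted[1] += burn_in_rate   # burn-in failures belong to the very start of life

    for year in sorted(adjusted):
        print(f"year {year}: adjusted rate ~ {adjusted[year]:.2f}%")
    # year 1 (~3.10%) now sits close to year 3 (~3.20%) under these assumed numbers,
    # pulling the front of the curve up towards a more bathtub-like shape
    ```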

    * * * * *

    Now, the whole purpose of publishing a study is (a) to put forward a theory & (b) so that theory can be reviewed for its efficacy.

    Without being cross or anything about it personally, if anyone wants to ignore the flaws in the two studies & look at the 'pretty graphs' then that's fine...

    ...but, to attempt to understand what is actually being shown, you also have to look at the premises that are used to create the data itself & any underlying trends that will skew the data any which way.

    (& this is the same with the figures from the French e-tailer last year btw)

    * * * * *

    Conclusion.

    That in mind, having just nipped out for a cigarette, i've realised that i need to change things slightly.


    Beyond the fact that the aggregated afrs given by the manufacturers for the lifespan were apparently very wrong between, roughly, 2001 & 2006 based upon the overall evidence (even if we took off 43% from the total as being false positives)...

    ...the *only* thing that these 2 studies might possibly show (assuming that drives were returned the instant that they 'failed' so there was negligible crossover from one time period to the next, etc) is the *perceived* failure rate of HDDs over time within large scale storage arrays.

    What they *do not* show in any way, for the reasons given, is the point in the drives' lifespan where failures, be they actual (1(a) or (b)) or erroneous (1(c)), have *actually* occurred.


    So, given this, whilst it may be reasonable to use their models within massive arrays so that, when failures are *perceived* to have occurred (rightly or wrongly), there are sufficient spare drives ready & waiting over a 5 year lifespan...

    ...there is no evidence that this provides any kind of model for *actual* drive failure - & to suggest so would be incorrect thinking.

    instead, the types of issues that i have raised have a reasonable likelihood of altering the pretty curves/charts/whatever very significantly if we were to be able to look at *actual* failures - again shifting data to earlier time points.


    We also need to accept that they *do not* suggest that they are modeling the average likelihood of one of a very small no of drives in a home user's setup failing at a given time in any way - & it would not be reasonable to assume that this is the case...

    ...not least since there is a greater likelihood of an issue occurring as time increases due to a vast increase in the no of components in massive arrays.

    (albeit, again, there is an increased likelihood of failure d.t. the increase in power cycles with the home user - though this 'should' be accounted for in the design & manufacturing processes when making consumer drives)

    But, even if we pretended that they did, again it would only show the *perceived* failure time (which again may or may not be erroneous) rather than the *actual* failure time...


    Now since, (again) based upon the evidence in the two studies it's clear that the figures/curves/whatever do not represent *actual* failure rates in either massive arrays or for the home user, they also do not disprove the bathtub curve as a reasonable approximation for *actual* failure rates.

    This is not to say that the bathtub curve is therefore automatically correct to describe the point of *actual* failures either within massive arrays or for a home user...

    ...but, since it is a good enough model for most mechanical & electronic components & systems, there's still a fair chance it *may* still prove to be a useful model in looking at approximating when *actual* HDDs failures may occur.


    However, if we then include the manufacturer's 'burn in' failure rate as described above - which is supposed to be done when looking at actual failure rates - then there is a reasonable argument for this to further increase the early failure rate to a point that exceeds any other data point... Making it far more likely that the bathtub curve *may* actually be correct.

    Well, there is no evidence to the contrary provided by the studies - though, to be 100% clear, (again) this *does not* automatically make the bathtub curve correct.

    * * * * *

    As to SSDs, okay, there's no evidence that they follow any model but, since it's necessary to have some premise to work from when looking at replacement times & the need for backups & whatnot, would it not be sensible to choose one that works well enough for most mechanical & electronic components & systems?

    Not least since, as we are both well aware, the *normal* trend with SSDs is that, after any major initial teething problems are solved, issues have occurred spasmodically with SSDs (from intel, all SF oems, Crucial, OCZ, etc, etc) which are then corrected through f/w updates which makes them less likely to 'fail' - either in a perceived or actual sense - & lead to returns.


    Or should we imagine that a model which only provides info on perceived problems for HDDs in massive arrays will apply to actual issues for the home user?
     
    Last edited: 11 Jul 2012
  18. Zinfandel

    Zinfandel Modder

    Joined:
    2 Aug 2010
    Posts:
    3,243
    Likes Received:
    198
    "Infant Mortality" is a particularly crass way to head up data on a graph about hard drives.
     
  19. Nexxo

    Nexxo * Prefab Sprout – The King of Rock 'n' Roll

    Joined:
    23 Oct 2001
    Posts:
    34,731
    Likes Received:
    2,210
    Like "Operation aborted" is making light of abortions? "Dead on Arrival" is being disrespectful to the many RTA or cardiac arrest victims who don't make it to hospital in time?

    Let's not get carried away in political correctness, please. Back on topic.
     
  20. esteban915

    esteban915 What's a Dremel?

    Joined:
    18 Oct 2009
    Posts:
    34
    Likes Received:
    0
    Wow, these must be the most in-depth replies I've ever had to a daft question! :thumb:

    So would it be fair to say: yes, it is ok to build a new PC with just a SSD, but sometime in the future it will fail, the same as a HDD will sometime in the future. :D

    Thanks for your help guys, it certainly makes an interesting read.
     
