We present analysis and findings from one of the largest disk failure prediction studies covering a total of 380,000 hard drives over a period of two months across 64 sites of a large leading data center operator.
Our proposed machine-learning-based models predict disk failures with a 0.95 F-measure and a 0.95 Matthews correlation coefficient (MCC) on average for a 10-day prediction horizon.
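For reference, both reported metrics can be computed from a binary confusion matrix. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from the study.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (F1): harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts: failed disks are the positive class.
    tp, tn, fp, fn = 95, 9895, 5, 5
    print(f"F-measure = {f_measure(tp, fp, fn):.3f}")
    print(f"MCC       = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike the F-measure, MCC also takes true negatives into account, which is useful here because healthy disks vastly outnumber failed ones.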
Findings:
SMART attributes do not always have strong predictive capability at long prediction horizons for all disks
The values of performance metrics (related to capacity, throughput, etc.):
Exhibit more variation before the actual drive failure
Show behavior distinguishable from that of healthy disks
Prediction can be further improved by incorporating location information (site, room, rack, and server)
Disks in a close spatial neighborhood:
Are affected by the same environmental factors (such as humidity and temperature)
Experience similar vibration levels (known to affect disk reliability)
Data:
ML Models:
Bayes classifier (Bayes)
Random forest (RF)
Gradient boosted decision trees (GBDT)
Long short-term memory network (LSTM)
Convolutional neural network with long short-term memory (CNN-LSTM)
Conclusion:
The SPL group (SMART, performance, and location features) performs best across all ML models; performance and location features improve the effectiveness of prediction
The improvement from adding location information is limited and is pronounced only in the presence of performance features
CNN-LSTM performs close to the best in all situations
There is a trade-off between models depending on which feature sets are available
Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to blocks that are not read during a certain time window.
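The write-only definition can be made concrete with a small trace analysis. This is a minimal sketch, not the authors' tooling. It assumes a hypothetical trace format of `(timestamp, op, block)` tuples, and it assumes the window looks forward from each write: a write counts as write-only if no read of the same block follows within `window` time units.

```python
from bisect import bisect_right
from collections import defaultdict


def write_only_ratio(trace: list, window: float) -> float:
    """Fraction of writes with no read of the same block in (ts, ts + window].

    trace: list of (timestamp, op, block) tuples, op being "R" or "W".
    """
    # Collect and sort the read timestamps of every block.
    reads = defaultdict(list)
    for ts, op, block in trace:
        if op == "R":
            reads[block].append(ts)
    for times in reads.values():
        times.sort()

    total = write_only = 0
    for ts, op, block in trace:
        if op != "W":
            continue
        total += 1
        times = reads.get(block, [])
        i = bisect_right(times, ts)  # index of the first read strictly after the write
        if i == len(times) or times[i] > ts + window:
            write_only += 1
    return write_only / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical toy trace, for illustration only.
    trace = [(0, "W", 1), (5, "R", 1), (10, "W", 2), (20, "W", 1), (100, "R", 2)]
    print(write_only_ratio(trace, window=30))
```

The result depends on the chosen window: a longer window gives later reads a chance to count, so fewer writes are classified as write-only.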
We propose ML-WP, a Machine Learning Based Write Policy, which reduces write traffic to SSDs by avoiding writing write-only data.
The main challenge in this approach is identifying write-only data in real time. The Naive Bayes algorithm was ultimately chosen.
Appropriate features:
last access timestamp
last address information
average write size
big request ratio (> 64KB)
small request ratio (< 8KB)
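To illustrate how a Naive Bayes classifier could use the five features above, here is a minimal from-scratch sketch. The notes do not say which Naive Bayes variant or feature encoding ML-WP uses, so the Gaussian variant, the numeric encoding of each feature, and all training values below are assumptions made for illustration.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Tiny Gaussian Naive Bayes; assumes features are independent per class."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # Floor the variance so constant features do not divide by zero.
            variances = [
                max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)
                for c, m in zip(cols, means)
            ]
            log_prior = math.log(len(rows) / len(X))
            self.stats[label] = (log_prior, means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(row, means, variances)
            )

        return max(self.stats, key=log_posterior)


# Hypothetical feature vectors, one per write request:
# [seconds since last access, distance from last address, average write size (KB),
#  big request ratio, small request ratio]
X = [
    [900.0, 4096.0, 128.0, 0.80, 0.10],
    [1200.0, 8192.0, 256.0, 0.90, 0.05],
    [800.0, 2048.0, 96.0, 0.70, 0.10],
    [5.0, 8.0, 4.0, 0.05, 0.90],
    [10.0, 16.0, 8.0, 0.10, 0.80],
    [3.0, 4.0, 6.0, 0.00, 0.85],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = write-only, 0 = likely to be read again

model = GaussianNaiveBayes().fit(X, y)

if __name__ == "__main__":
    request = [1000.0, 4000.0, 200.0, 0.85, 0.10]
    # Under a policy like ML-WP, a write predicted as write-only would skip the SSD cache.
    print("bypass SSD cache" if model.predict(request) == 1 else "write to SSD cache")
```

Naive Bayes suits the real-time constraint because prediction needs only one pass over a handful of per-class statistics.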
Experimental results show that, compared with the writeback policy widely deployed in industry, ML-WP decreases write traffic to the SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.