• Index of AI-for-Systems research.
  • Covers selected topics in the intelligent storage field.
  • Continuously updated.

AI For Systems

Disk Failure Prediction

  • FAST20 - Making Disk Failure Predictions SMARTer!
    • Slides
    • We present analysis and findings from one of the largest disk failure prediction studies covering a total of 380,000 hard drives over a period of two months across 64 sites of a large leading data center operator.
    • Our proposed machine-learning-based models predict disk failures with 0.95 F-measure and 0.95 Matthews correlation coefficient (MCC) on average for a 10-day prediction horizon.
    • Findings:
      • SMART attributes do not always have strong predictive capability at long prediction horizons for all disks
      • The values of performance metrics (related to capacity, throughput, etc.) of failing disks:
        • Exhibit more variation before the actual drive failure
        • Show behavior distinguishable from that of healthy disks
      • Prediction can be further improved by incorporating location information (site, room, rack, and server)
        • Disks in close spatial proximity:
          • Are affected by the same environmental factors (such as humidity and temperature)
          • Experience similar vibration levels (known to affect the reliability of disks)
      • Data: [figure]
      • ML Models:
        • Bayes classifier (Bayes)
        • Random forest (RF)
        • Gradient boosted decision trees (GBDT)
        • Long short-term memory network (LSTM)
        • Convolutional neural network with long short-term memory (CNN-LSTM)
      • Conclusion:
        • The SPL feature group (SMART + performance + location) performs best across all ML models, i.e., performance and location features improve the effectiveness of prediction
        • The improvement from adding location information is limited, and is pronounced only when performance features are present
        • CNN-LSTM performs close to the best in all situations
        • There is a trade-off between models depending on which feature sets are available
      • Prediction Horizon: [figure]
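
The two scores quoted above, F-measure and MCC, can both be computed from the four confusion-matrix counts. The sketch below is only an illustration of those standard definitions, not the paper's evaluation code; the function names and the example counts are hypothetical. MCC is useful for disk failure data because failures are rare, and it uses all four counts, including true negatives.

```python
import math


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced test set (failed disks are the positives).
tp, tn, fp, fn = 90, 9890, 10, 10
print(f"F-measure = {f_measure(tp, fp, fn):.3f}")
print(f"MCC       = {mcc(tp, tn, fp, fn):.3f}")
```

With so few positives, plain accuracy would be close to 1.0 even for a model that never predicts a failure, which is why the two scores above are the more informative ones here.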

Storage System Tuning

Optimize IO Behavior

  • DATE20 - A Machine Learning Based Write Policy for SSD Cache in Cloud Block Storage
    • Source Code
    • Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to blocks that have not been read during a certain time window.
    • We propose ML-WP, a Machine Learning Based Write Policy, which reduces write traffic to SSDs by avoiding writing write-only data.
    • The main challenge in this approach is identifying write-only data in real time. Naive Bayes was ultimately chosen as the classification algorithm.
    • Appropriate features:
      • last access timestamp
      • last address information
      • average write size
      • Big request ratio (> 64KB)
      • Small request ratio (< 8KB)
    • Experimental results show that, compared with the write-back policy widely deployed in industry, ML-WP decreases write traffic to the SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.
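
The definition of write-only data given above (a write to a block that is not read within a certain time window) can be turned into a labeling pass over an I/O trace. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the trace format `(timestamp, op, block)`, the function name `label_write_only`, and the window value are all hypothetical. Labels of this kind could serve as training targets for a classifier such as Naive Bayes.

```python
from bisect import bisect_right
from collections import defaultdict


def label_write_only(trace, window):
    """Label each write in a trace as write-only or not.

    trace:  list of (timestamp, op, block) tuples sorted by timestamp,
            where op is "R" or "W" (assumed format, for illustration only).
    window: length of the look-ahead window, in the same units as timestamp.
    Returns a list of (timestamp, block, is_write_only), one entry per write.
    """
    # Collect read timestamps per block; they stay sorted because the trace is.
    reads = defaultdict(list)
    for ts, op, block in trace:
        if op == "R":
            reads[block].append(ts)

    labels = []
    for ts, op, block in trace:
        if op != "W":
            continue
        block_reads = reads[block]
        i = bisect_right(block_reads, ts)  # index of first read strictly after the write
        read_soon = i < len(block_reads) and block_reads[i] <= ts + window
        labels.append((ts, block, not read_soon))
    return labels


# Hypothetical trace: block 1 is read 3 time units after its first write.
trace = [(0, "W", 1), (1, "W", 2), (3, "R", 1), (5, "W", 1), (20, "R", 2), (30, "W", 3)]
labels = label_write_only(trace, window=10)
write_only_ratio = sum(flag for _, _, flag in labels) / len(labels)
print(labels)
print(f"write-only ratio: {write_only_ratio:.2f}")
```

A write policy in the spirit of ML-WP would then send writes predicted to be write-only directly to the backend storage instead of the SSD cache; how that routing is done in the real system is not described in these notes.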