Improving Efficiency And Resilience In Large-Scale Computing Systems Through Analytics And Data-Driven Management