Case Studies

Managing a Large-scale Analytics Cluster

 

The Challenge

Our customer faced many challenges maintaining and monitoring a large-scale Hadoop big data cluster, including network, hardware and capacity issues. Performance degraded as data in the cluster grew, and the sheer number of nodes made commissioning and decommissioning difficult. In addition, our customer needed monitoring tools that worked effectively at scale to detect critical issues in real time. Few organizations have the breadth of knowledge and skills required to understand, identify and solve these challenges together.

 

Our Solution Features

ATG’s team of engineers and solutions architects has led many problem-solving projects of this kind. In doing so, team members draw on their deep knowledge of the latest analytic technologies and techniques to identify the right tools for the job. In this case, the team implemented an automated deployment of the Cloudera Manager monitoring software, with its advanced features for monitoring the health and performance of the analytics clusters themselves and of the tasks running within them. We established comprehensive monitoring and reporting on custom dashboards, giving our customer complete visibility into the cluster through built-in health checks and alerts customized to what mattered most to them. We also enabled centralized log management, which allowed our customer to aggregate logs across all services and hosts and make them searchable for simple troubleshooting, with integrated error alerts.

 

Benefits to the Customer Mission

 

Our solution strengthened the customer team’s ability to plan and troubleshoot across the analytics cluster, advancing their mission with sustained high availability across their large-scale multi-tenant environment. Key benefits included:

  • The operations team was able to manage multiple clusters more effectively, tune configurations and resourcing, and manage a wide range of user roles for self-service access.
  • The customer could address their scaling needs for the cluster and ensure zero downtime during scheduled cluster maintenance.
  • The built-in backup and disaster recovery capability allowed them to run even their most critical workloads risk-free.

 

An Acacia Group Company