Infrastructure monitoring is the key to effective management for any enterprise in today’s hyperconnected world. With the advent of new technologies and their associated threats, risks and vulnerabilities, it is vital to have a monitoring tool in place that raises alerts on time and notifies the right people so they can take the required action. Today many enterprises have invested in building monitoring tools, and some outsource infrastructure support monitoring to a vendor that remotely monitors and manages the infrastructure (RIMS).
When it comes to monitoring, many enterprises make the grave mistake of assuming that simply deploying a monitoring tool will ensure the right information reaches the right people at the right time. In reality they end up with too many alerts, unnecessary alerts, delayed alerts, unclear alerts, and alerts with no correlation to infrastructure objects. These problems have even led enterprises to conclude that the monitoring tool is at fault, which in most cases is not true.
It is important to maintain good hygiene in managing any monitoring tool so that it gives the right results. The following are best practices for effective infrastructure support management.
- Monitoring setup - Monitoring setup is not a one-time job; it is an ongoing process. Enterprises that treat monitoring as a one-time activity have not understood infrastructure support management.
- Set Clear Goals - Clearly understand the monitoring goals and configure the monitoring tool to meet them.
- Classify Objects or CIs based on Importance – Not all objects and configuration items (CIs) in your infrastructure are critical. Hence it is important to identify the critical ones, monitor them, set their thresholds properly and maintain them on a priority basis.
- Reduce the noise of alerts – Set a proper polling interval, retry count and retry wait time. Ensure event management tools are integrated with monitoring tools so that event correlation, event masking and event filtering are managed seamlessly. It is also recommended to drive this through a Configuration Management Database (CMDB) so that alerts from related CIs are correlated properly.
- Provide information for troubleshooting in multiple channels – It is important to alert the resources through multiple channels such as email, phone and chat, so that they have timely access to information. Using all channels at once should be reserved for critical issues. A notification schedule is the best way to manage this.
- Build to Scale – Always deploy and maintain a tool that can scale at any point in time. If you build only for today’s needs, the tool will not help you when the enterprise grows.
- Never allow a single point of failure – Always build a redundant monitoring tool set, or run the same tool against your infrastructure from multiple DCs/Regions/Zones, to avoid a single point of failure.
- Ensure your tool has reporting capability – The tool should have real-time reporting capability, as data is key to taking any corrective and preventive actions.
- Provide context of Monitoring Information – An alert should always give the resources the information they need to understand the object or CI, the source of the problem (metric), its dependencies (based on the CMDB), and the threshold value for action.
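Several of the practices listed above (a retry count to cut noise, channels chosen by criticality, and alerts that carry context) can be illustrated in a few lines of Python. This is only a minimal sketch and is not tied to any particular monitoring product: the `Check` class and its fields, the `evaluate` and `channels_for` functions, and the sample CI names and values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """Hypothetical per-CI check settings; every name here is illustrative."""
    ci: str                # configuration item being monitored
    metric: str            # metric that is polled
    threshold: float       # a polled value above this counts as a breach
    critical: bool         # whether the CI is classified as critical
    retry_count: int = 3   # consecutive breaches required before alerting
    depends_on: list = field(default_factory=list)  # related CIs, e.g. from a CMDB
    breaches: int = 0      # running count of consecutive breaches


def channels_for(check):
    """Critical CIs notify on every channel; others use email only."""
    return ["email", "phone", "chat"] if check.critical else ["email"]


def evaluate(check, value):
    """Return an alert dict once a breach persists, otherwise None."""
    if value <= check.threshold:
        check.breaches = 0      # recovered: reset the counter
        return None
    check.breaches += 1
    if check.breaches != check.retry_count:
        return None             # still retrying, or already alerted
    # The alert carries the context a responder needs to act on it.
    return {
        "ci": check.ci,
        "metric": check.metric,
        "value": value,
        "threshold": check.threshold,
        "depends_on": list(check.depends_on),
        "channels": channels_for(check),
    }


# Usage: three consecutive breaches on a critical CI produce one alert.
db = Check(ci="db-01", metric="cpu_percent", threshold=90.0,
           critical=True, depends_on=["app-01"])
alert = None
for sample in (95.0, 97.0, 99.0):
    alert = evaluate(db, sample)
print(alert)
```

In this sketch a single spike never pages anyone, a sustained breach alerts exactly once until the metric recovers, and only CIs marked critical use every channel. A real deployment would take these settings from the monitoring tool and the CMDB rather than hard-coding them.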
We have implemented these practices in our infrastructure support management, and they have helped businesses in various ways. Do you follow such practices in your enterprise?