As you can imagine, there was everything from Icinga, Nagios, Zabbix, Graphite, etc. (on the open source front) to what we call the big four: IBM, CA, HP, and BMC, plus islands of VMware vCenter Operations Suite, Oracle Enterprise Manager, and Microsoft System Center Operations Manager. I tried to offer some prescriptive advice, but the task at hand was daunting, to say the least. The CIO who created this project was responding to the massive cost of the licenses and people the business was essentially wasting by not managing monitoring centrally and not leveraging its scale.

I had an interesting day at work today trying to untangle some legacy shit and merge old stuff into a brand spanking new monitoring cluster.

It was a lot of iterating over the same stuff again and again: breaking things into pieces, finding out what they’re supposed to do, and trying to read the available manuals.

I felt like the dog meme above:

"LET ME IN I NEED TO GO BACK OUT AGAIN!"

I’ve spent some time wreaking havoc on an op5 installation today while trying to set up a load-balanced cluster with a couple of servers. The Scalable Monitoring documentation was straightforward, but I have a lot of tweaking to do before it works the way we want it to. Using an existing setup is always harder than installing from scratch, I guess.

The strangest thing from the instructions was the official way of pushing configuration with their clustered setup.

mon restart ; sleep 3 ; mon oconf push

Why would you use mon restart ; sleep 3 instead of stopping the monitor service, pushing the conf, and starting it again when you’re done? And why is the sleep exactly 3 seconds? I can’t leave this be…
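For comparison, here is a rough sketch of the stop, push, start ordering I had in mind. It assumes that mon stop and mon start behave as the counterparts of mon restart, and I have not confirmed that op5 supports pushing while the service is stopped. The MON variable and the push_config function are my own hypothetical names, there only so the sequence can be dry-run on a machine without op5.

```shell
#!/bin/sh
# Sketch only: stop -> push -> start, instead of restart ; sleep 3 ; push.
# Assumption: `mon stop` / `mon start` exist alongside `mon restart`.
# MON is a hypothetical override so the steps can be dry-run (e.g. MON=echo).
MON="${MON:-mon}"

push_config() {
    # Halt monitoring before touching the configuration.
    $MON stop || return 1

    # Push the configuration; remember the result but keep going.
    status=0
    $MON oconf push || status=$?

    # Always try to bring monitoring back up, even if the push failed.
    $MON start || return 1

    return "$status"
}
```

Running push_config with MON=echo only prints the three steps in order, which is a cheap way to check the sequence before trying it on a real cluster.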

There has to be some sort of underlying story that made them add a 3-second sleep to the official documentation.