Is agent-based or agentless monitoring best?
Rob Young has posted a few blog entries lately on the MySQL Enterprise monitoring software. His latest post claims that agent-based monitoring is equivalent to extensibility (MySQL Enterprise Monitor: Agent = Extensibility).
I think this is conflating two completely distinct properties of a monitoring solution. Cacti is extremely extensible, with a plugin-based architecture and templates and all kinds of other goodies; yet it is not agent-based (actually it lets you choose — now that’s extensibility). innotop is not agent-based, and it’s extremely extensible too. Basically everything inside innotop is a lookup table of anonymous subroutines and data structures that you can tweak pretty much infinitely with plugins and configuration files that get merged into the running code dynamically. Extensibility is completely orthogonal to whether the architecture is agent-based. What about WordPress? It’s ridiculously extensible and it has nothing to do with agents.
So now that we’ve clarified that, what can we say about agent-based or agentless architectures? Which is better?
It depends. What do you want to do? How big is your system? How close to real-time do you need to be? What other properties do you need?
Scaling? Need precision?
One claim might be that agent-based software scales better because it reduces the number of network connections. You have all these agents running independently on each monitored system; they collect the data they need to and that relieves the central system of doing it. Now, the theory goes, there are fewer connections between the central system and the monitored systems. Except that this completely misses the point: the agents have to connect to the central system to report their results (or be connected to and queried for results) — you don’t save any connections this way. So that’s not a valid argument.
Note: I do realize that some of the viewpoints I mention are absurd. I’m mentioning them because I’ve heard people earnestly say them as though they could be true. So hold the flame-throwers for just a bit…
Next we might point out the time-sensitive nature of monitoring. If you’re going to collect stats every minute, you may want them done exactly on the minute. A non-agent-based monitoring system may have to reach out to these remote systems and query for results, then wait; you could easily see each one-minute cycle beginning to take more than a minute with many systems! In fact, this is a problem with Cacti. Monitor too many systems (or monitor them in a silly way) and you can get overlapping executions. But this isn’t really about agent-based or agentless either, is it? It’s just about multi-threading, or the lack thereof. Systems that poll in sequence will always suffer from this problem. Doing it asynchronously is much smarter.
Here I want to be careful to point out that if you need to measure each system exactly on the minute, even asynchronous polling won’t save you. Get enough multi-threading going, and you can run into problems with too many threads, too. So there’s something to be said for agent-based monitoring if you have a lot of systems.
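To make the sequential-versus-concurrent point concrete, here is a minimal sketch in Python. It is not how Cacti or any other tool actually polls; `poll_host` is a hypothetical stand-in that just sleeps to simulate a slow remote query, and the host names and worker count are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_host(host, delay=0.05):
    """Hypothetical stand-in for one remote query; sleeps to simulate latency."""
    time.sleep(delay)
    return host, {"reachable": True}


def poll_sequentially(hosts, poll=poll_host):
    # One host after another: total time grows linearly with the host count.
    return dict(poll(h) for h in hosts)


def poll_concurrently(hosts, poll=poll_host, max_workers=8):
    # A bounded pool overlaps the waiting, but the cap on workers means
    # a large enough fleet will still stretch the polling cycle.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(poll, hosts))
```

Both functions return the same data; only the wall-clock time differs. The `max_workers` cap is the "too many threads" trade-off in miniature: raise it and you use more resources on the poller, lower it and the cycle takes longer.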
This also feels like a good time to mention that the MySQL Cacti templates I wrote will make your Cacti monitoring a lot more efficient — they get all their info in one go, so they don’t make many silly repeated calls to the MySQL server. (This is accomplished via some caching code that works around Cacti’s limitations.) And lest we forget, this type of monitoring generally does not need to be real-time or even close to it. Some of what the MySQL Enterprise Monitor can do does need to be precisely timed; but Cacti monitoring does not.
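The general idea of "get everything in one go, then serve repeated requests from a cache" can be sketched like this. To be clear, this is an illustration of the technique in Python, not the code in the templates themselves; the class name, the `fetch` callable, and the 60-second lifetime are all assumptions made for the example.

```python
import time


class CachedStats:
    """Serve many single-value lookups from one bulk fetch (illustrative only)."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable returning a dict of all stats at once
        self._ttl = ttl            # assumed cache lifetime in seconds
        self._clock = clock        # injectable clock, handy for testing
        self._data = None
        self._fetched_at = 0.0

    def get(self, key):
        now = self._clock()
        # Hit the server only when there is no data yet or the cache has expired.
        if self._data is None or now - self._fetched_at >= self._ttl:
            self._data = self._fetch()
            self._fetched_at = now
        return self._data[key]
```

With something like this in front of the server, a poller that asks for twenty values in a row causes one round trip instead of twenty.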
So if you need precise time-sensitive monitoring on a lot of systems, you might want to think seriously about agent-based monitoring. (By “a lot” I probably mean on the order of hundreds of servers.) However — all that data still has to come back to your central monitoring system somehow, so there’s no silver bullet; as long as you have a centralized monitoring system you will have scaling limitations. The only way around this is to decentralize, and I don’t know of a system capable of doing that today. I’m sure commenters will set me straight if I’m wrong.
Management overhead
On the other hand, you might also think seriously about the risks and management overhead of agent-based systems; what happens when you have 1000 servers each running an agent, consuming 1000 times however much memory and other resources, and opening 1000 security holes simultaneously when a flaw is found? What if the central system dies — is your agent-based system smart enough that the agents don’t all have to be reconfigured to talk to its replacement? Have you ever run a large-scale agent-based system of any type? What about a large-scale agentless system? These are the questions you should be weighing for yourself.
Personally I like polling if possible, and I want my servers to be absolutely bare-bones, especially if they are exposed to the Internet. For example, I don’t permit anything such as SNMP to be running on those servers. I want SSH and nothing else. Anything that wants to talk to that system and get information from it can SSH in and execute some standard Unix commands, like cat /proc/vmstat, and work from there. Standard Unix user-management, and sudo, can lock down precisely what that SSH user is permitted to do.
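As a rough sketch of what that looks like from the monitoring side, the Python below runs cat /proc/vmstat over SSH and turns the output into a dictionary. The `monitor` user name, the timeout, and the assumption of key-based authentication are mine, purely for illustration; the parsing relies only on the "name value" line format of /proc/vmstat.

```python
import subprocess


def parse_vmstat(text):
    """Turn 'name value' lines, as found in /proc/vmstat, into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def fetch_vmstat(host, user="monitor", timeout=10):
    # Assumes key-based SSH login for a restricted, hypothetical 'monitor' user.
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", "cat", "/proc/vmstat"],
        capture_output=True, text=True, check=True, timeout=timeout,
    )
    return parse_vmstat(result.stdout)
```

Nothing runs on the monitored server beyond sshd and cat, and all the parsing logic lives on the monitoring host, where it can be changed without touching production.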
It’s all marketing anyway
Going back to Rob’s post, there are a number of other claims about the benefits of agent-based monitoring, including “Minimal connections to the backend MySQL database” and “Application data sharding across replicated slaves.” I’m skeptical that these things can’t be achieved without an agent running (what does that last one even mean, in the context of a monitoring app?) I think I can (and to some extent I have done this already) build systems with these properties without agents. It strikes me that MySQL has taken a lot of hard questions about why they went with an agent-based architecture, and there’s some stiff competition from agentless systems who shall not be named — the post sounds a little defensive, if you ask me.
In a funny way, I think it’s kind of because they had a product on the market before the unnamed competition; they chose an agent-based architecture; now the competition is taking potshots at them. If they’d chosen agentless, I bet someone would have built an agent-based system, then pointed at them and snickered and put “we’re agent-based!” in their marketing materials. Everybody’s got a right to market. It’s not as though any of these “X method is best” claims is objective. It’s all a matter of convincing people that what you’re trying to sell them has value.
Summary
So what’s my opinion?
For small to medium-sized installations, I like the combination of Nagios and Cacti.
For anything above that, I don’t believe a truly fantastic solution exists yet.
What I’ve seen of the MySQL Enterprise Monitor has not overwhelmed me, and it’s special-purpose: if I monitor my MySQL servers with that, then I have to have something else to monitor all my other servers, like my mail server and my LDAP server and so on. That’s just more work for me. And yet, I can’t presently get all the MySQL fanciness I want with these more-generic systems. So, I conclude you currently cannot have your cake and eat it too.



Thanks Baron. You have made some great points.
We wanted to write a similar post ourselves, but that would have been considered biased.
While we were deciding the architecture of MONyog, we did a survey among 15,000 paying customers of SQLyog MySQL GUI. One of the most important pieces of feedback that we got from the survey was that most admins would avoid installing new software on a production system, unless it is absolutely critical (like security patches, etc).
We also figured out that for an agent based monitoring tool, the feature set of the software depends directly on the functionality provided by the agent. Therefore if we add more features in the monitoring software, we have to constantly force our customers to upgrade the agents! For example – MONyog 1.x used to collect data from “SHOW VARIABLES” and “SHOW STATUS” only. MONyog 1.5x started storing results from “SHOW PROCESSLIST”. The latest MONyog 2.x can now parse slow query logs and general query log. Our customers are able to take advantage of the increased feature set quickly because they were able to install the new versions of MONyog without touching the production servers.
Now consider the opposite: if MONyog were agent-based, we would have to frequently ask our customers to install new software on the production servers, as the new agents shipped with the newer versions would be responsible for sending the data required to support the new features. Of course, one can argue that we should have planned for all these features in the first version itself, but such “perfect” architecture can only exist in dreams! Software products rapidly evolve over time.
Historically, developers have created agent-based systems because it was somewhat cumbersome to collect OS data without having an agent installed on the same host. With SSH/SFTP on Linux and WMI on Windows becoming ubiquitous all monitoring data can be retrieved easily from a remote host. In MONyog, we read /proc directly to collect all OS data on Linux systems.
Today, if you want to roll your own monitoring tool and need to collect OS data then SSHD and WMI are your agents. They come pre-installed and they are at your command!
In summary, the claim that agent based monitoring tools are more “extensible” and “scalable” is not compelling as it is not backed by strong technical facts or benchmarks.
Rohit Nadhani (Webyog)
21 Aug 08 at 2:44 am
Just because it is agent based doesn’t mean it isn’t going to be stupid. Hyperic’s agent is particularly annoying in this respect: if you start turning up the amount of data it collects from mysql, it starts being a pretty major drag on the database.
Mysql’s agent isn’t much of a picnic either. We’ve had several cases of the agent eating 1GB of memory and an inordinate amount of cpu.
In the end I think our simple gmetric script for ganglia has been the most useful.
Jason
Jason Cook
21 Aug 08 at 2:54 am
In general I agree with the sentiment that agent vs agentless isn’t really an important point. However one nice thing about the agent based approach is how they are using that to integrate with mysql-proxy for the Enterprise Monitor 2.0 product currently in beta.
You can install the agent to start up proxy with query analyzer scripts on a different port (default 4040). Then your application can point to 3306 like normal, but if hell breaks loose or in non peak hours you can point it to 4040 and get some analysis, without having to restart mysqld or even tweaking any system variables. The agent relays the query analysis to the central monitor, and it’s actually pretty slick.
Ryan Thiessen
21 Aug 08 at 4:00 am
Baron,
Rob was not only talking about the monitoring aspect of the product, but also about the other possibilities the agent opens. That was what he meant by “Application data sharding across replicated slaves.”
Ryan mentions some of the stuff we are working on, and those are pretty hard to pull off without having software running in the middle. In fact, I can’t really imagine that being done by ssh’ing to the server at all. But that’s maybe just ENOCOFFEE :)
Kay Röpke
21 Aug 08 at 4:40 am
Quite a few of the architectural choices in the mysql enterprise monitor have something to do with the fact that the product is proprietary (not OSS) and in the way it’s marketed.
Some features could much better be handled from inside the MySQL server (such as the new query analysis foo), the elaborate hacks only come about because MySQL doesn’t want it in the server, they want it separately so they can leverage the mysql enterprise offering with it. Valid business choice, but we mustn’t try to find technical sense in it ;-)
Arjen Lentz
21 Aug 08 at 7:36 am
Arjen,
I have to say, both of your points are complete nonsense.
There is nothing in the architectural decisions that made it go agent based because it is non-OSS.. What exactly do you think that would be? I’ve been involved with the architectural decisions for that product for a *long time* now, and nothing has ever been because ‘it is closed source’.
Regarding ‘the elaborate hacks only come about because MySQL doesn’t want it in the server’, no no no no. They came about in MEM because we can get them in MEM *NOW*, and not have to stay on the Server’s release schedule. It’s all about adding benefits for customers *NOW* and not in 2 years time.
All others,
The real point nobody seems to mention:
Single central monitoring server == single point of monitoring failure.
That’s a great big FAIL right there.
Mark Leith
21 Aug 08 at 8:37 am
Mark,
I have to disagree with your point mentioning “Single central monitoring server == single point of monitoring failure”.
1. What prevents someone from starting another instance of an agent-less monitoring tool and point to the same set of servers?
2. What happens when the agent fails? How does a single central monitoring server help?
3. What happens when the agent starts consuming inordinate amount of memory and resources (as mentioned in the comment by Jason Cook)? You have no option left but to kill the agent. Keeping your production server alive is more important than running the monitoring agent! In an agent-less scenario a rogue application will not drain the resources of a production server.
Rohit Nadhani (Webyog)
21 Aug 08 at 9:03 am
Hey Baron,
I’m not sure how *anybody* could disagree with my pretty simple phrase of “Single central monitoring server == single point of monitoring failure”. ;)
However, here goes:
1. What prevents someone from starting another instance of an agent-less monitoring tool and point to the same set of servers?
– Nothing.. But.. You cause more load on the monitored server.. However again this is not a ‘single central monitoring server’.
2. What happens when the agent fails? How does a single central monitoring server help?
– Doesn’t happen in MEM yet, but, having a system that can monitor from the agent *or* the central server is the right approach in an ideal world. Not many tools do this yet.
3. What happens when the agent starts consuming inordinate amount of memory and resources (as mentioned in the comment by Jason Cook)? You have no option left but to kill the agent. Keeping your production server alive is more important than running the monitoring agent! In an agent-less scenario a rogue application will not drain the resources of a production server.
– Yes this *was* an issue in MEM (it stored all of its collected data in memory until it could send it to the central server. Now it does not; it stores a backlog of x minutes, in a rolling window). What the agent needs now is to be able to raise alerts when the central server is down (sendmail, SNMP), as well as buffering the data collection to disk.
Mark Leith
21 Aug 08 at 9:19 am
Oops, that was to Rohit, not Baron.
Mark Leith
21 Aug 08 at 9:21 am
Baron,
Thanks for your post on this, I think we agree that client vs agent-based architectures both have pros/cons depending on the environment to be monitored. No need to go into the details here, my thoughts are hashed out in my earlier posts. Ryan T. makes some good points on how the MEM agent easily enables advanced use of the MySQL Proxy; talking with customers, this is what most are looking for in terms of agent-based extensibility.
Thanks again for the lively debate, RobY
Rob Young
21 Aug 08 at 10:13 am
what happens when you have 1000 servers each running an agent, consuming 1000 times however much memory and other resources, and opening 1000 security holes simultaneously when a flaw is found? What if the central system dies — is your agent-based system smart enough that the agents don’t all have to be reconfigured to talk to its replacement?
Well, firstly, how do you install 1000 servers? 1000 agents? I’ve never been at a place that has managed over 200 servers that is still installing servers by hand; some kind of jumpstart/cfengine or kickstart/puppet combination is used (to create and maintain servers). In this way, reconfiguring or even reinstalling 1000 servers is not difficult at all, and I had to do this years ago while working at a university (jumpstart/cfengine there) when apache and openssl both had vulnerabilities.
Also, how easy is it to restore your monitoring server from a backup (assuming you’re taking a backup)? Is it harder to scp a configuration file 1000 times, if the central server dies, or is it harder to build a new server and add in 1000 configurations? It’s much more likely that you have an easy way to set up 1000 machines that have a common agent, than you have an easy way to set up 1 (or 2) central monitoring servers. (This is based merely on the fact that in your environment, you’ve found a solution to installing and configuring 1000 machines because there are so many of them, and you probably built the monitoring server by hand.)
And my final question — why nagios *and* cacti? Cacti has the ability to monitor and page (including dependencies), as well as graph — so why nagios as well?
Sheeri K. Cabral
24 Aug 08 at 11:09 pm
Has the author heard about Zabbix (www.zabbix.com)?
It supports either agent-based or agentless-based monitoring, supports decentralized monitoring (proxy-servers) and many more nice features.
P.S.: this is not an ad, I’m just a user of this system
Ivan
18 Nov 08 at 7:02 am