C.S. Lee had a post called NIDS: Administration, Management & Provisioning that asked some good questions about managing large numbers of NSM sensors. I have managed large numbers of sensors in the past, so thought I would take a shot at describing some of the ways I eased management as well as other methods I still look forward to trying. Since my post is long, I thought it better to write it here than stuff it all into a comment on geek00l's blog.
A couple of things to remember; first, there are almost always ways to improve complex systems management. Second, "perfect" is the enemy of "good enough". At some point you reach the point of diminishing returns, so the cost of additional improvement of the management or administration of the systems may not be worth the reward.
1. What tools do you use to manage all the NIDS, and why you choose them over others?SSH is obviously going to be one way to login to systems and do certain things. If it is something that you must do consistently, then scripts or other system management methods that I will discuss later are likely more appropriate. When using SSH for a large number of systems, don't forget that SSH keys and ssh-agent are your friend. With ssh-agent, you can login to all your systems with your SSH key after entering your passphrase only once. This simplifies running scripts that require logging into or copying files to each system.
- For example ssh, however I would like to know more about tools you use to manage massive NIDS instead of one, and the reason you choose it.
Also, when I talk about using SSH along with scripts, I'm also talking about using programs that support SSH as the transport protocol, for example rsync and rdist. Expect scripts are also a common way to roll your own centralized management of systems, but for C.S. Lee's 50+ system question, a dedicated application seems to be a better answer than only using scripts and logging in manually.
2. How do you perform efficient administration securely? For examples,I think these types of changes and updates will require a combination of tools, and the tools could depend in part on the operating system(s). If you have multiple operating systems then it also makes the management more complex, so ideally you want to standardize on an operating system as well as keeping the release versions identical whenever possible.
- System changes/updates
- NIDS tools' changes/updates
- NIDS rules' changes/updates
- NIDS Configuration files' changes/updates
- NIDS Policies' changes/updates
One thing I've mentioned in the past for system management is puppet.
Puppet lets you centrally manage every important aspect of your system using a cross-platform specification language that manages all the separate elements normally aggregated in different files, like users, cron jobs, and hosts, along with obviously discrete elements like packages, services, and files.Although I haven't yet had the chance to use puppet, it seems to have a good reputation. Another option is cfengine, though most people I have talked to that have experience with both seem to prefer puppet. Change management of configuration files, cron scripts, and other files like NIDS rules can definitely be handled by one of these central management tools.
Another thing to consider is whether your operating system or its vendor includes anything for these tasks. For instance, Red Hat Network Satellite can handle a lot of centralized management, including package management. NIDS/NSM sensors often need configuration changes from the standard distribution package for certain software, so being able to roll your own packages and push them to sensors automatically can drastically reduce system management overhead.
Although puppet seems to handle users, I've also written three posts about OpenLDAP for centralized management of users and groups [1, 2, 3]. With most current Linux or BSD, once LDAP is configured it is pretty easy to manage users, groups, and even sudo. Since I've worked in environments with not just large numbers of Linux systems, but also large numbers of users, LDAP was definitely useful. With a small number of users on large number of systems, I'm not sure that it would be needed.
For the security requirement, any good centralized management system better have some sort of authentication and encryption. Puppet supports a CA and SSL, cfengine supports RSA and Blowfish along with public-private keys, and Red Hat Satellite suports SSL and GPG. Other basics including host-based firewalls like iptables can also be useful for limiting exposure and access from the network.
Truthfully, I have mostly relied on home-grown scripts combined with SSH, rsync and/or rdist to push files or commands to Linux systems. However, with the number of systems I have managed, the up-front cost of implementing something like puppet, cfengine, or Satellite would be worth the long-term benefits.
3. Which method you like to use in order to manage them, and why? For example,I think this question is largely moot because it will usually be determined by the management tools you are using. For instance, Red Hat runs a daemon on the individual systems that will check in either with Red Hat Network or with your local Satellite Network.
- Server pushes rules update to all the sensors(Push)
- Sensors pull the rules update from server(Pull)
When using scripts, I will usually use a push simply because I like to login to one system then run the script that will connect to all the other systems to copy files or run a command.
3. NIDS health monitoring and self-healingThe obvious answer to monitoring processes is something like Nagios, an open source solution. Nagios can also handle restarting services or processes through event handlers. Realistically, any software that monitors services should have the ability to restart those services if needed. Another example of process monitoring and restarting is daemontools, but it does not really meet monitoring needs for an enterprise and is fairly limited. There are additional choices of monitoring software, as well.
- I'm talking about something like this, if the system is in incosistent state, operators will be notified. If certain process die, it should recover by itself.