25 June, 2009

Building an IR Team: Organization

This is my second post in a planned series. The first is called Building an IR Team: People.

How to organize an Computer Incident Response Team (CIRT) is a difficult and complex topic. Although there may be best practices or sensible guidelines, a lot will be dictated by the size of your team, the type and size of network environment, management, company policies and the abilities of analysts. I also believe that network security monitoring (NSM) and incident response (IR) are so intertwined that you really should talk about them and organize them together.

A few questions that come to mind when thinking of organization and hierarchy of the team:

  • Will you only be doing IR, or will you be responsible for additional security operations and security engineering?
  • What is the minimal amount of staffing you need to cover your hours of operation? What other coverage requirements do you have dictated by management, policies, or plain common sense?
  • How will the size of your team effect your hierarchy and organization?
  • Since being understaffed is the norm, how can you organize to improve efficiency without hurting the quality of work?
  • Can you train individuals or groups so you have redundancy in key job functions?
  • Referencing both physical and logical organization of the team, will they be centralized or distributed?
  • What is your budget? (Richard Bejtlich has had a number of posts about how much to spend on digital security, including one recently).
IR and other Security Operations
The first question really needs to be answered before you start answering all the rest. There are two basic models I have seen when organizing a response team. The simpler model is to have a response team that only performs incident response, often along with NSM or working directly with the NSM team. Even if the response team does not do the actual first tier NSM, the NSM team usually will function as a lower tier that escalates possible incidents to the IR team.

The more complex, but possibly more common, model is to have incident responders and NSM teams that also perform a number of other duties. I mentioned both security operations and security engineering in the bullet point. Examples of security operations and engineering could be penetration testing, vulnerability assessment, malware analysis, NSM sensor deployment, NSM sensor tuning, firewall change reviews or management, and more. The reason I say this model may be more common is the bottom line, money. It is also difficult to discretely define all these job duties without any overlap.

There are advantages and disadvantages to each model. For dedicated incident responders, advantages compared to the alternative include:
  • Specialization can promote higher levels of expertise.
  • Duties, obligations, procedures and priorities are clearer.
  • Documentation can probably be simplified.
  • IR may be more effective overall.
Disadvantages can include:
  • Money. If incident responders perform a narrow set of duties, you will probably need more total personnel to complete the same tasks.
  • Less flexibility with personnel.
  • Limiting duties exclusively to incident response may result in more burn-out. Although not a given, many people like the variety that comes with a wider range of duties.
Advantages of having incident responders also perform other security operations and engineering:
  • Money.
  • A better understanding of incident response can produce better engineering. A great example is tuning NSM sensors, where an engineer that does the tuning has a much better understanding of feedback and even sees the good and bad firsthand if the same person is also doing NSM or IR.
  • Similarly, other projects can promote a better understanding of the network, systems and security operations that may promote more efficient and accurate IR.
Disadvantages:
  • Conflicting priorities between IR and other projects.
  • More complex operating procedures.
  • Burn-out due to workload. (Yes, I listed burn-out as a disadvantage of both models).
  • Less specialization in IR will probably reduce effectiveness.
Staffing
Before deciding on the number of analysts you need for NSM and IR, you have to come to a decision on what hours you will maintain. This question is probably easier for smaller operations that don't have as much flexibility. If there is no budget for anything other than normal business hours, it is definitely easier to staff IR and security operations in general. Once you get to an enterprise or other organization that maintains some 24x7 presence, it starts getting stickier.

If you will have more than one shift, you will obviously have to decide the hours for each shift. It is important to build a slight overlap into the shifts so information can be passed from the shift that is ending to the shift that is starting. Both verbal and written communication, namely some kind of shift log, is important so any ongoing incidents, trends or other significant activity are not dropped. I will get into more detail when I write a future post, tentatively titled Building an IR Team: Communication and Documentation.

Organizing so each shift has the right people is significant. Obviously, the third shift will generally be seen as less desirable. Usually someone that is willing to work the third shift is trying to get into the digital security field, already has a day job, or is going to school. It is fine line between finding someone that will do a good job on the third shift but not immediately start looking for another job that has better hours, so you have to get a clear understanding of why people want to work the third shift and how long you expect them to stay on that shift. It can help to leave opportunities for third shift analysts to move to another shift since that can allow enough flexibility to keep the stand-outs rather than losing them to another job with more desirable hours.

I am not a big fan of rotating shifts. Though a lot of places seem to implement shifts by having everyone eventually rotate through each shift, I think it does not promote stability or employee satisfaction as much as each person having a dedicated shift.

Staffing can also be influenced by policy or outside factors. Businesses, government and military all will have certain information security requirements that must be met, and some of those requirements may influence your staffing levels or hours of operation.

Hierarchy
If you only have one or two analysts, you probably won't need to put much thought into your hierarchy. If you have a 24x7 operation with a number of analysts, you definitely need some sort of defined hierarchy and escalation procedures to define NSM and IR duties. Going back to the section on other security operations, you may also need to define how other duties fit into the hierarchy, procedures and priorities for analysts that handle NSM, IR, and/or additional duties.

At left is an example of an organizational chart when the IR Team also has other duties and operates in a 24x7 environment. In addition to rotating through NSM and IR duties, each analyst is a member of a team. This is just an example to show the thought process on hierarchy. There are certainly other operational security needs that I mentioned, may merit a dedicated team, but are not included in my example, for instance forensics or vulnerability assessment.

Each team has a senior analyst as the lead, and the senior analysts can also double as IR leads. It is crucial that every shift have a lead to define a hierarchy and prevent any misunderstandings about the chain of command and responsibilities.

For this example, let us say that your organizational requirements state two junior analysts per shift doing NSM and IR. You could create a schedule to rotate each junior analyst through the NSM/IR schedule, which means monitoring the security systems, answering the phone, responding to emails, investigating activity, and coordinating IR for the more basic incidents. You would also probably want one senior analyst designated as the lead for the day. The senior analyst can provide quality assurance, handle anything that needs to be escalated, do more in-depth IR, and task and coordinate the junior analysts. The senior analyst can also decide that the NSM and IR workloads require temporarily pulling people off their project or team tasks to bolster NSM or IR. Finally, it may be a good idea to have the senior analyst designated as the one coordinating and communicating with management.

While the senior analysts need to excel at both the technical duties and management, the shift leads need to facilitate communication between everyone on that particular shift, management, and other shifts. Though it is helpful if the shift lead is strong in a technical sense, I do not think the shift lead necessarily has to be the strongest technical person on the shift. He or she needs to be able to handle communication, escalation, delegation, and prioritization to keep both the shift members and management happy with each other. The shift lead is basically responsible for making sure the shift is happy and making sure the CIRT is getting what it needs from the shift.

The next diagram shows a group that is dedicated only to NSM and IR. Obviously, this model is much easier to organize and manage since the tasks are much narrower. Note that, even with this model where everyone is dedicated to NSM and IR without additional duties, proper NSM and IR may call for things like malware analysis, certainly forensics for IR, or giving feedback about the security systems' effectiveness to dedicated engineers.

As one last aside regarding the different models, I have to stress that vulnerability assessment and reporting is one of the biggest time sinks I have ever seen in a security operation. If you can only separate one task away from your NSM and IR team to another team, I strongly suggest it be vulnerability assessment. There are certainly a lot of arguments about how much or how little vulnerability assessment you should be doing in any organization, but most organizations do have requirements for it. As such, it is a good idea to have a separate vulnerability assessment team whenever possible because of the number of work-hours the process requires. Note that penetration testing is clearly distinct from vulnerability assessment, and requires a whole different type of person with a different set of skills.

Redundancy
Ideally, you want to minimize what some call "knowledge hoarding" on your team. If someone is excellent at a job, you need that person to share knowledge, not squirrel it away. Some think knowledge hoarding provides job security, but a good manager will recognize that an analyst that shares knowledge is much better than one that does not. From personal experience, I can also say that mentoring, training and sharing knowledge is a great way to reduce the number of calls you get during non-working hours. If I do not want to be bothered at home, I do my best to document and share everything I know so the knowledge is easily accessible even when I am not there.

Sharing knowledge provides redundancy and flexibility. That flexibility can also spread the workload more evenly when you have some people swamped with work and others underutilized. If someone is sick or too busy for a particular task, you do not want to be stuck with no redundancy. I suppose this is true of most jobs, but it can be a huge problem in IR. As an example, if a particular person is experienced at malware analysis and has automated the process without sharing the knowledge, someone else called on to do the work in a pinch will be much less efficient and may even try to manually perform tasks that have already been automated.

Certainly most groups of incident responders will have standouts that simply can't be replaced easily, but you should do your best to make sure every job function has redundancy and that every senior analyst has what you could call at least one understudy.

Distribution of Resources
If you are in a business that has multiple locations or it is a true enterprise, one thing to consider is the physical and logical distribution of your incident response team. Being physically located in one place can be helpful to communication and working relationships. Being geographically distributed can be more conducive to work schedules if the business spans many timezones. One thing that can greatly increase morale is providing as many tools as possible to do remote IR. Sending a team to the field for IR may be needed sometimes, but reducing the burden or even allowing work from home is a sure way to make your team happier.

Regardless, an IR team needs people in the field that can assist them when needed. Depending on the technical level of those field representatives, the duties may be as simple as unplugging a network cable or as advanced as starting initial data collection with a memory and disk capture. Most IR teams will need to have a good working relationship with support and networking personnel to help facilitate the proper response procedures.

I only touched on some of the possibilities for organizing both NSM and IR teams. As with anything, thought and planning will help make the organization more successful and efficient. The key is to reach a practical equilibrium given the resources you have to work with.

3 comments:

  1. This is a great post, thank you for posting.

    Can we really separate NSM and IR so cleanly and fairly ? From what I have read, NSM is just a set of technologies that provides access to alert, flow, content, statistics all correlated nicely.

    I would love to hear what you think about the relative work load of the two teams ? Wouldnt IR end up doing most of the work, including NSM stuff because they need it so badly to do their job ?

    ReplyDelete
  2. Thanks for the comment. Maybe I was unclear, but I definitely think that NSM and IR can't be clearly separated. In the first diagram, I should have really had "NSM/IR only" rather than "NSM only". If you use a tiered model, I think any escalated incidents definitely will require more thorough IR that may use a lot of additional resources besides NSM sensor data.

    On the other hand, some people may not view the first tier relying primarily on NSM as a true part of the IR team even when they are doing basic IR. I do see them as incident responders, but I also see them as primarily handling the simpler incidents and escalating more complex or serious incidents.

    ReplyDelete
  3. Thanks ! Great couple of articles.

    ReplyDelete