
11 March, 2013

Building an IR Team: Growth

This is a long overdue continuation of my posts regarding Building an Incident Response Team. I had a very rough outline of this post going all the way back to 2009! The good response I got on some of my previous posts on building IR teams made me come back and work on finishing the posts I had planned when I first started the series.

Previous posts:

  • Building an IR Team: People
  • Building an IR Team: Organization
  • Building an IR Team: Documentation

I believe one of the hardest things to deal with when building a successful IR team is growth. If you build an IR team that is successful and gets management buy-in as a result, there is a good chance that responsibilities, the amount of work, the number of incidents detected, and the size of the team will all grow. This will invariably cause growing pains, setbacks, and reevaluation of procedures.

I honestly could go on and on about dealing with the growth of an IR team. There are so many things to consider that it is daunting to plan for growth ahead of time instead of just dealing with the hurdles as they come. However, if you have a team that is growing it really helps to take a step back and plan for both immediate and long-term growth. It is so important that a fair amount of this post will reiterate what I have explicitly or implicitly said in some of my previous "Building an IR Team" posts.

There are a number of questions to keep in mind when an IR team grows:
  • What additional duties are driving the addition of positions?
  • Are the additional positions adequate to cover the new duties and responsibilities? If not, how can expectations be managed so superiors understand what is actually feasible?
  • Are the duties just a higher volume of what the team is already responsible for, or are there new areas that will require different types of team members and different types of training?
  • What works well now but may be problematic with a larger team? Do we need to restructure?
  • How do we maintain the success that led to the IR team's growth?
The last question is one of the most fundamental.

Relationships


At one point I worked on a team that, over the course of a few years, increased the number of personnel fourfold. This completely changed the dynamics of the team, from the lead all the way down to the most junior analyst. The more people you add, the more complex the relationships become. This applies not only to relationships within the team, but also relationships with other parts of your organization and management.

With such growth, it became a lot more important to clearly define roles, responsibilities, and the command structure, and to get management support for decisions.
  • Command structure: As the team grows, other groups in the company are less likely to know each person on the team, so in a lot of cases it is helpful to have a few key people who are known to those other groups. These key people don't always have to be the ones communicating with a specific group, but they can serve as a fallback when another group's first instinct is to be adversarial toward people they don't know.
  • Intra-team relationships: The more people you have, the more you have to keep an eye on the working relationships between members. When the team numbers in the single digits, it is almost natural to know all the ins and outs of the working relationships, for example who complements whom and who can be a good mentor to more junior analysts. As the team grows, it takes more conscious effort to track, and it requires more actively setting expectations for team members.
  • Management support and inter-team relationships: As a team gets bigger, its profile is raised throughout the company. This can make dealing with other groups easier, more difficult, or most likely a bit of both. As we all know, IR teams sometimes need to make decisions or do things that are unpopular and that people outside the team view as irritating, to say the least. It is very important to have management support when you invariably have conflicts with those outside the IR team. It's also important to have a manager who knows when to tell you that you're being unreasonable and the outside groups have a reasonable concern or complaint.
This is by no means a complete list of things to consider. The bottom line is that a larger team makes both intra- and inter-team relationships more complex.

Other Growing Pains


The simplest example of growing pains from my past was a team that was not gaining new areas of responsibility but was switching to coverage 24 hours a day, seven days a week. As I covered in another blog post, it is important to come up with the proper organization and make sure every shift is productive. Increasing the number of hours of coverage also obviously means hiring new analysts, plus possibly shifting current analysts to drastically different schedules.

Restructuring can often cause conflicts beyond those involving work schedules. On a small team, most people gravitate to a niche and can often be allowed to work in it as long as they can also handle the more generalized response duties. In a larger team, it's much harder to let members naturally gravitate toward certain areas while maintaining the ability to get all the work done. It certainly is nice to keep everyone happy and specializing in the areas they find most interesting, but it's not always realistic. One way to help with this is to follow the advice for redundancy in the "Organization" post, plus allow members to rotate through different areas of specialty. This keeps them from being stuck in one particular area while also providing redundancy of skills.

Another issue is formalizing reporting to some degree. In a team of a few people, it's readily apparent what each person is doing. When you have a score of people, you need both formal and informal reporting from shift leads, team leads, mentors, and even individual analysts to properly understand who is doing what, their workloads, what is working well, and what is not. The structure of a larger IR team probably needs to be more formal. Notice the "probably": I think it is safe to say there may be exceptions to all these points! The key is to find the proper balance that enables useful reporting while avoiding unneeded bureaucracy.

Hiring can also create growing pains. I must stress that you should do everything possible to maintain standards when hiring. That said, a bigger team can mean more room and opportunity for less experienced analysts. One weak link among five people is a much bigger deal than one weak link among 30, so a larger team can allow you to take a chance or two when hiring. I've always been an advocate of getting smart people that can learn and are legitimately interested in the field over those who have experience but less potential for growth, and a larger team can sometimes make this easier to justify.

Evaluation of Procedures and Operations


This advice really applies to all IR teams, but it becomes more important with growth. Incident response procedures that work well in a small team may not work as well with a larger group. Even if your team has not grown, you may want to regularly reevaluate IR workflow, reporting, or just about any existing procedures and standards of operations. Sometimes this means more clearly codifying what were once informal standards; other times it means completely rethinking how you operate because you now have several tiers of analysts. Having good metrics helps make reevaluation more objective and less subjective. Unfortunately, metrics are a huge topic that I can't address in this post, but there are many sites, papers, books, and more to help anyone interested in the topic.

Standards for working with the field may also need to change. If you are in an enterprise where the IR team often reaches out to "boots on the ground" like local system administrators or IT staff, areas of responsibility may need to change when the IR team is larger. I partially covered this when mentioning inter-team relationships. Even if your IR team is comfortable contacting those in the field directly, those managing the people in the field may want a more formal command structure so they can track requests and other communications from the IR team. Contacts in the field may also want their roles and responsibilities more clearly defined. This is easier to work through when the IR team only has a few people, but once there are dozens, it can cause problems if those in the field don't know upfront what the IR team expects and what qualifies as an unusual request.

Training


A larger IR team means the company is spending a lot more money on the team and security in general. It also means you may have enough team members to form a class-sized group. Whether you use in-house training, outsource, or a combination, a larger team means you will need to think about more formal training where a large group is in a classroom environment. This doesn't mean one-on-one or one-on-few mentoring and training should go away, but you will need to adapt to training larger groups. You also should consider setting aside money specifically for training if that was not done previously.

Be Flexible


Note that all of this is based on my experiences over the past 10 or more years, and it is just the tip of the iceberg. Different teams may have different issues to consider when growing; depending on the specific IR team, none of what I wrote may apply directly. I think there are two overriding concerns when an IR team grows. One is to be flexible as the team grows so your organization can really see what works and what does not. The other is to plan for the growth instead of just letting it happen haphazardly. Some teams do quite well with very little change after they've grown, while others may need drastic changes just from adding a few people or from analyst turnover.

Other Resources


There are some resources available to help with creating IR teams, and much of what applies at the creation of a team can apply to its growth. When a team goes from a few people to 20-30 people, you essentially are destroying the old team and creating a new one. Most of the questions considered when creating an IR team can be asked once again and reevaluated as the team grows.
Richard Bejtlich has posted on his blog about many aspects of building and maintaining SOCs, and also mentioned that he will have a chapter in his new book titled "Network Security Monitoring Operations," focused on sharing "the author’s experience building and leading a global Computer Incident Response Team (CIRT), such that readers can apply those lessons to their own operations." I presume anyone regularly reading my blog is already reading Taosecurity, and also anticipate that his new book will be quite useful.

I hope to have at least one more post in my "Building an IR Team" series. I may also have additional material, or collate and improve all my existing posts if I feel it is worthwhile.

04 June, 2012

A Practical Example of Non-technical Indicators and Incident Response

Once upon a time there was a network security analyst slash NSM engineer who, like any sane person, ran full packet capture, IDS/IPS, session capture, and passive fingerprinting inline at the ingress/egress of his home network. His setup was most similar to diagram two in IDS/IPS Placement on Home Network.

This security analyst was casually going about his business one day when he opened the basement door of his house and found a tennis ball wedged between the door frame and the storm door. “That’s odd!” he thought. “Who would do that?”

After removing the tennis ball, he thought, “Well, this storm door is really loud when it closes those last few inches. Maybe someone did it to quietly enter or exit the house.” It just so happens that the daughter of said analyst was in high school and her bedroom was down the hall from the basement door. He promptly entered her room and took a quick look around. Lo and behold, the screen from her window was under her bed and the window itself was unlocked. Since this room was on the ground floor, the analyst immediately had some good ideas about what was happening with the window and the basement door. Someone was sneaking in or out of the house!

The analyst confronted his teenage daughter when she got home from school and received denial after denial about any possible wrongdoing. The denials did not sound sincere.

Enter the network security monitoring. He stated, “I told you I would respect your privacy with your email and other electronic communications unless you gave me a reason not to. I consider you in violation of these Terms of Service and I’m going to see what you’ve been up to lately.”

At this point it was late in the evening and the analyst had to get up early for work. This was some years back when AIM was quite common, so he briefly used Sguil to look at recent sessions of AOL Instant Messenger traffic. He decided to get some sleep for work the next day and put off additional investigation. In the meantime, his daughter's privileges were highly restricted.

A day or two later, after trying to manually sift through some of the ASCII transcripts of the packet captures, the analyst quickly decided there was a better way. He whipped up a short shell script to loop through all the packet captures, run Dug Song's msgsnarf, and pipe the output into an HTML file for later examination. This required a little tweaking to make the HTML easily readable, but the script was fairly quick to write and test.
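
For the curious, here is a minimal sketch of that idea, written in Python rather than the original shell. It assumes dsniff's msgsnarf can read a saved capture via -p, and the /nsm/pcap path and output file name are hypothetical stand-ins:

    #!/usr/bin/env python3
    """Rough sketch: run msgsnarf over archived pcaps and emit one HTML page."""
    import glob
    import html
    import subprocess

    PCAP_GLOB = "/nsm/pcap/*.pcap"   # hypothetical archive location
    OUT_FILE = "aim_transcripts.html"

    rows = []
    for pcap in sorted(glob.glob(PCAP_GLOB)):
        # msgsnarf -p reads from a saved capture instead of a live interface
        proc = subprocess.run(["msgsnarf", "-p", pcap],
                              capture_output=True, text=True)
        for line in proc.stdout.splitlines():
            # Escape the chat text so it renders safely inside the HTML table
            rows.append(f"<tr><td>{html.escape(pcap)}</td>"
                        f"<td>{html.escape(line)}</td></tr>")

    with open(OUT_FILE, "w") as f:
        f.write("<html><body><table border='1'>\n")
        f.write("<tr><th>capture</th><th>message</th></tr>\n")
        f.write("\n".join(rows))
        f.write("\n</table></body></html>\n")
    print(f"wrote {len(rows)} lines to {OUT_FILE}")

Opening the resulting file in a browser gives the kind of chronological transcript described below.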
The next morning there were many hundreds of lines of AIM conversations to examine. He started with the most recent and worked backwards. After a few minutes he confirmed that his daughter had been sneaking out of the house to go to parties and get into other mischief.

Another conversation with his daughter finally led to her confession, a long discussion, and suitable punishment. Despite the severity of her actions, the HTML file containing the chat transcripts also contained a few endearing nuggets.

Daughter: OMG they know everything!
Accomplice: what do u mean everything
Daughter: my dad can read all my chats
Daughter: he does computer security for [company redacted]
Daughter: he’s a computer genius
Daughter: DAD I’M INNOCENT!

When the analyst later told this story to a colleague, the colleague remarked that the last few lines were the best Father's Day gift the analyst would ever receive.

I think there are a few obvious lessons here that can translate to network monitoring.

First, the initial indicator of the problem was in the physical world. Network security monitoring, and any other type of technical monitoring and prevention, will sometimes fail. I have experienced many times when phone calls from users were among the earliest indicators of malicious activity. Particularly in the case of insider threats, it's important to note that many initial indicators of malicious activity are non-technical: a person's behavior, a personnel action, or in this case a physical sign of a security problem.

Second, sometimes you need to be flexible to solve a problem quickly and with minimal effort. The analyst could have manually looked at the AIM traffic, but because he judged that the threat of another incident was already mitigated by talking to his daughter, digging up the traffic wasn't urgent. Instead, the analyst wrote a script that pulled all the traffic and converted it to a readable format. He also had the luxury of knowing that all the packet captures would still be there, since his home bandwidth at the time meant well more than 30 days of pcap storage.

Third, network monitoring is a means to an end. In this case, there was a security problem that could be addressed with the help of technical means. In many obvious cases you are trying to protect data. In other cases, you can be trying to protect people or things in the physical world that could be harmed if the wrong information is revealed. It is important to stay focused on what really matters and not get caught worrying about the wrong things because your instrumentation or technologies push you towards priorities that don’t make sense.

Last, attackers are not static. The daughter certainly learned the value of encryption, and even of out-of-band communication such as SMS over the phone network, if she did not want the network sensor recording her conversations in plain text. Advancing technology also makes attackers evolve, for instance the move from older forms of IM to Facebook chat or SMS.

15 July, 2009

Building an IR Team: Documentation

My third post on building an Incident Response (IR) team covers documentation. The first post was Building an IR Team: People, followed by Building an IR Team: Organization.

Good documentation promotes good communication and effective analysts. Documentation is not sexy, and can even be downright annoying to create and maintain, but it is absolutely crucial. Making it as painless and useful as possible will be a huge benefit to the IR team.

Since documentation and communication are so intertwined, I had planned on making one post to cover both topics. However, the amount of material I have for documentation made me decide to do a future post, Building an IR Team: Communication, and concentrate on keeping this post to a more digestible size.

There are quite a few different areas where a Computer Incident Response Team (CIRT) will need good documentation.

Incident Tracking
Since I am writing about computer IR teams, it is obvious that the teams will be dealing with digital security incidents. For an enterprise, you will almost certainly need a database back-end for your incidents, and even smaller environments may find it best to use a database to track incidents. You will need some sort of incident tracking system for many reasons, including but not necessarily limited to the following:

  • Tracking of incident status and primary responder(s)
  • Incident details
  • Response details and summary
  • Trending, statistics and other analysis
Tracking the status of incidents and who is responsible for them is one of the primary reasons for incident tracking. Some off-the-shelf software can support incident tracking, for instance help desk ticketing software or other tasking software. This type of software will certainly support the basic needs like status (assigned, in progress, open, closed, etc.) and who the incident is assigned to.

However, off-the-shelf software may not have great support for the incident details. A great example is IP addresses and ports. Logging IP addresses, names of systems, ports if applicable, and what type of vulnerability was exploited can be extremely useful for trending, statistics, and historical analysis. A dedicated field for IP addresses can be queried much more easily than a free-text field that happens to contain IP addresses. If I see that a particular IP address successfully attacked two systems in the previous shift, or a particular type of exploit was used successfully on two systems, I want to be able to quickly check how many times it happened in the past week. I also want to be able to pull that data out and use it to query my NSM data for similar activity that garnered no response from analysts.
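
As a minimal sketch of why this matters, assuming a simple SQLite back-end and hypothetical column names, a dedicated and indexed address field turns the "how many times did this source appear in the past week?" question into a one-line query:

    import sqlite3

    conn = sqlite3.connect("incidents.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS incidents (
            id          INTEGER PRIMARY KEY,
            opened_at   TEXT NOT NULL,      -- ISO-8601 timestamp
            status      TEXT NOT NULL,      -- assigned/in progress/open/closed
            src_ip      TEXT,               -- attacker address, if known
            dst_ip      TEXT,               -- victim address
            dst_port    INTEGER,
            exploit     TEXT,               -- vulnerability or exploit type
            summary     TEXT
        )""")
    # Index the address column so historical lookups stay fast
    conn.execute("CREATE INDEX IF NOT EXISTS idx_src ON incidents(src_ip)")

    # How many incidents involved this source in the past week?
    hits = conn.execute(
        "SELECT COUNT(*) FROM incidents "
        "WHERE src_ip = ? AND opened_at >= datetime('now', '-7 days')",
        ("203.0.113.7",)).fetchone()[0]
    print(f"incidents from that source this week: {hits}")

The same indexed column is what lets you export those addresses and feed them back into queries against your NSM data.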

Response details can be thought of as a log that is updated throughout the incident, from discovery to resolution. Having the details to look back on is extremely useful. You can use them for a technical write-up, an executive summary, recreating incidents in a lab environment, training, lessons learned, and more. My general thought process is that the more time spent documenting an incident, the more useful the documentation is likely to be.

Trending and statistical analysis can be used to help guide future response and look back at previous activity for anything that was missed, as I already mentioned. It is also extremely useful for reports to management, which can help gain political capital within the organization. What do I mean by political capital?

Say you have noticed anecdotally that you are getting owned via web servers over HTTP, and the malicious sites are usually already known to be malicious, for instance when searching Google or using an anti-malware toolbar. Your company has no web proxy, and you recommend one with the understanding that most of the malicious sites would be blocked by it. The problem is that the networking group does not want to re-engineer or reconfigure, and upper management does not think it is worth the money. With a thorough report and analysis using the information from incident tracking, and by using that data to show the advantages of the proxy solution, you can give CIRT or SOC management the political capital they need to get things moving when other parts of the company resist.

Standard Operating Procedures (SOP)
Although analysts performing IR need to be able to adapt and the tasks can be fluid, a SOP is still important for a CIRT. A SOP can cover a lot of material, including IR procedures, notification and contact information, escalation procedures, job functions, hours of operation, and more. A good SOP might even include the CIRT mission statement and other background to help everyone understand the underlying purpose and mission of the group.

The main goal of a SOP should be to document and detail all the standard or repetitive procedures, and it can even provide guidance on what to do if presented with a situation that is not covered in the SOP. As an example, a few bullet points of sections that might be needed in a SOP are:
  • Managing routine malware incidents
  • Analyzing short term trends
  • Researching new exploits and malicious activity
  • Overview of security functions and tools, e.g. NSM
  • More detailed explanation and basic usage information for important tools, e.g. how to connect to Sguil and who are the administrators of the system
Although a SOP will not cover every situation, the goal should be to make the team more efficient and provide a reference for tasks or procedures that are used repeatedly. I'm not a fan of hand-holding and like analysts to try and figure things out on their own, so I don't mind if analysts use different methods as long as the end results are consistent in both accuracy and format.

I also like analysts to think about the most efficient way to analyze an incident. Some may gather information and investigate using slightly different methodology, but each analyst should understand that something simple should be checked before something that takes a lot of time, particularly when the value of the information returned will be roughly equal. The analysis should use what my boss likes to call the "Does it make sense?" test. Gathering some of the simplest and most straightforward information first will usually point you in the right direction, and a SOP can help show how to do this.

Knowledge Base
A knowledge base can take many different forms and contains different types of information than a SOP, though there may be overlap. There are specific knowledge base applications, wikis, simple log applications, and even ticketing or tasking systems that provide some functionality for an integrated knowledge base. A knowledge base will often contain technical information, technical references, HOWTOs, white papers, troubleshooting tips, and various other types of notes and information.

One of my favorite options for a knowledge base is a wiki. You can see various open knowledge bases that are using wikis, for instance NSMWiki and Emerging Threats Documentation Wiki, but if you want organization- and job-specific knowledge bases then you will also need something to hold the information for your CIRT.

The reason I pick those two wikis as examples is that they contain exactly the type of information that is useful in a knowledge base for your CIRT. The main difference is that your knowledge will be specific to your organization. One good example is wiki entries for specific IDS rules as they pertain to your network, in other words an internal version of the Emerging Threats rule wiki. There may be shortcuts to take when investigating specific rules or other network activity to quickly determine the nature of the traffic, and a wiki is a good place to keep that information.

Similarly, documentation on setting up a NSM device, tuning, or maintenance can be very effectively stored and edited on a wiki. The ease of collaboration with a wiki helps keep the documentation useful and up to date. If properly organized, it lets anyone easily find the information needed to keep the team running smoothly. Some examples of documentation I have found useful to put in a wiki:
  • How to troubleshoot common problems on a NSM sensor
  • How to build and configure a NSM sensor
  • How to update and tune IDS rules
  • List and overview of scripts available to assist incident response
  • Overviews of each available IR tool
  • More detailed descriptions and usage examples of IR tools
  • Example IR walk-throughs using previously resolved incidents
  • Links to external resources, e.g. blogs, wikis, manuals, and vendor sites
One of the best ways I can think of to effectively communicate the usefulness of both a knowledge base and a SOP to senior technical personnel is by pointing out that better documentation makes it less likely that you are needed for help. Additionally, it is much faster to resolve an issue for another analyst if you can do something like refer the analyst to step-by-step instructions in the knowledge base. If you want fewer calls during off hours, make sure the analysts have the documentation they need.

Shift Logs
In an environment with multiple shifts, it is important to keep shift logs of notable activity, incidents, and any other information that needs to be passed to other shifts. Although I will also discuss this in Building an IR Team: Communication, the usefulness of connecting the shifts with a dedicated log is apparent. Given the amount of email and incident tickets generated in an environment that requires 24x7 monitoring, a shift log that quickly summarizes important and ongoing events helps separate the wheat from the chaff.

Since my feeling is that shift logs should be terse and quick to parse, what to use for logging may not be crucial. The first examples that come to my mind are software designed for shift logs, forum software, or blogging software. The main features needed are individual accounts to show who is posting, timestamps, and an effective search feature. Anything else is a bonus, though it may depend on exactly what you want analysts logging and what is being used to handle incident tracking.
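
If nothing off-the-shelf fits, those three features are small enough to build yourself. Here is a minimal sketch, assuming a single JSON-lines file as the store; the file name and command-line interface are purely illustrative:

    import json
    import sys
    import time

    LOG = "shiftlog.jsonl"  # hypothetical single-file store

    def post(author: str, text: str) -> None:
        """Append one timestamped, attributed entry."""
        entry = {"ts": time.strftime("%Y-%m-%d %H:%M:%SZ", time.gmtime()),
                 "author": author, "text": text}
        with open(LOG, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def search(term: str):
        """Case-insensitive substring search over all entries."""
        with open(LOG) as f:
            for line in f:
                entry = json.loads(line)
                if term.lower() in entry["text"].lower():
                    yield entry

    if __name__ == "__main__":
        if sys.argv[1] == "post":
            post(sys.argv[2], " ".join(sys.argv[3:]))
        else:  # e.g. ./shiftlog.py search beaconing
            for e in search(sys.argv[2]):
                print(f"{e['ts']} {e['author']}: {e['text']}")

Anything beyond accounts, timestamps, and search is a bonus, as noted above.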

One thing that is quite useful with the shift log is a summary post at the end of each shift, and then the analysts should verbally go over the summary at the shift change. This can help make sure the most significant entries are not missed and it gives the chance for the oncoming shift to ask questions before the outgoing shift leaves for the day.

As usual, I can't cover everything on the topic, but my goal is to provide a reference and get the gears turning. The need for good documentation is real, and it is important to use it to the IR team's advantage.

07 July, 2009

The "Does it make sense?" test

I was composing the next installment of my series on building an incident response team and started to include this, but then decided it deserves a separate entry.

Some time ago, my boss came up with what he calls the "Does it make sense?" test as a cheat sheet to help train new analysts and to use as a quick reference. When we refer to traffic making sense, we are asking whether the traffic is normal for the network.

This is very simple and covers some of the quickest ways an analyst can investigate a possible incident. Consider it a way to triage possible NSM activity or incidents. Using something like this can easily eliminate a lot of unnecessary and time-consuming analysis, or point out when the extra analysis is needed.

The "does it make sense" test:

  1. Determine the direction of the network traffic.
  2. Determine the IP addresses involved.
  3. Determine the locations of the systems (e.g. internal, external, VPN, whois, GeoIP).
  4. Determine the functions of the systems involved (e.g. web server, mail server, workstation).
  5. Determine protocols involved and whether they are "normal" protocols and ports that should be seen between the systems.
  6. When applicable, look at the packet capture and compare it to the signature/rule.
  7. Use historical queries on NSM systems and searches of documentation to determine past events that may be related to the current one.
Based on the above knowledge, does the traffic that caused the alert make sense or is it abnormal? Simple examples:
  1. A file server sending huge amounts of SMTP traffic over port 25 probably does not make sense, whether because of malicious activity or a misconfiguration.
  2. Someone connecting to a workstation on port 21 with FTP probably does not make sense.
  3. A DNS server sending and receiving traffic to another DNS server over port 53 does make sense. However, an analysis of the alert and the DNS traffic may still be needed to verify whether the traffic is malicious or not.
Remember, traffic that makes sense and is normal on one network may not be normal on another. Having a good baseline of your network traffic is extremely important before you can accurately determine what traffic makes sense and what does not. Even traffic that does not make sense is not automatically malicious.

Also remember, traffic that makes sense is not always friendly. A good attacker will make his network traffic look like it fits in with the baseline traffic, making it less likely to stick out.
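
To make the checklist concrete, here is a minimal sketch that encodes the first few steps as a triage helper. The network ranges, system-function table, and "normal" role/port pairs are hypothetical stand-ins for your own baseline data, and steps 3, 6, and 7 (whois/GeoIP, packet capture review, historical queries) are left out:

    from ipaddress import ip_address, ip_network

    # Hypothetical baseline data an analyst team might maintain
    INTERNAL = [ip_network("10.0.0.0/8"), ip_network("192.168.0.0/16")]
    FUNCTIONS = {"10.1.2.3": "mail server", "10.9.8.7": "workstation"}
    NORMAL = {("mail server", 25), ("dns server", 53), ("web server", 80)}

    def location(ip: str) -> str:
        """Step 3: internal vs. external (whois/GeoIP lookups omitted)."""
        addr = ip_address(ip)
        return "internal" if any(addr in net for net in INTERNAL) else "external"

    def makes_sense(src: str, dst: str, dst_port: int) -> bool:
        """Steps 1-5: does this flow fit the baseline at all?"""
        src_fn = FUNCTIONS.get(src, "unknown")
        dst_fn = FUNCTIONS.get(dst, "unknown")
        print(f"{src} ({src_fn}, {location(src)}) -> "
              f"{dst} ({dst_fn}, {location(dst)}) port {dst_port}")
        # Traffic "makes sense" if either endpoint's role normally uses this port
        return (dst_fn, dst_port) in NORMAL or (src_fn, dst_port) in NORMAL

    # Someone connecting to a workstation on port 21 probably does not make sense:
    print(makes_sense("10.1.2.3", "10.9.8.7", 21))  # -> False

As the caveats above stress, a pass or fail from a script like this is only a starting point; the analyst still decides whether the traffic is actually malicious.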

25 June, 2009

Building an IR Team: Organization

This is my second post in a planned series. The first is called Building an IR Team: People.

How to organize a Computer Incident Response Team (CIRT) is a difficult and complex topic. Although there may be best practices or sensible guidelines, a lot will be dictated by the size of your team, the type and size of the network environment, management, company policies, and the abilities of analysts. I also believe that network security monitoring (NSM) and incident response (IR) are so intertwined that you really should talk about them and organize them together.

A few questions that come to mind when thinking of organization and hierarchy of the team:

  • Will you only be doing IR, or will you be responsible for additional security operations and security engineering?
  • What is the minimal amount of staffing you need to cover your hours of operation? What other coverage requirements do you have dictated by management, policies, or plain common sense?
  • How will the size of your team affect your hierarchy and organization?
  • Since being understaffed is the norm, how can you organize to improve efficiency without hurting the quality of work?
  • Can you train individuals or groups so you have redundancy in key job functions?
  • Referencing both physical and logical organization of the team, will they be centralized or distributed?
  • What is your budget? (Richard Bejtlich has had a number of posts about how much to spend on digital security, including one recently).
IR and other Security Operations
The first question really needs to be answered before you start answering all the rest. There are two basic models I have seen when organizing a response team. The simpler model is to have a response team that only performs incident response, often along with NSM or working directly with the NSM team. Even if the response team does not do the actual first tier NSM, the NSM team usually will function as a lower tier that escalates possible incidents to the IR team.

The more complex, but possibly more common, model is to have incident responders and NSM teams that also perform a number of other duties. I mentioned both security operations and security engineering in the first bullet point above. Examples of security operations and engineering could be penetration testing, vulnerability assessment, malware analysis, NSM sensor deployment, NSM sensor tuning, firewall change reviews or management, and more. The reason I say this model may be more common is the bottom line, money. It is also difficult to discretely define all these job duties without any overlap.

There are advantages and disadvantages to each model. For dedicated incident responders, advantages compared to the alternative include:
  • Specialization can promote higher levels of expertise.
  • Duties, obligations, procedures and priorities are clearer.
  • Documentation can probably be simplified.
  • IR may be more effective overall.
Disadvantages can include:
  • Money. If incident responders perform a narrow set of duties, you will probably need more total personnel to complete the same tasks.
  • Less flexibility with personnel.
  • Limiting duties exclusively to incident response may result in more burn-out. Although not a given, many people like the variety that comes with a wider range of duties.
Advantages of having incident responders also perform other security operations and engineering:
  • Money.
  • A better understanding of incident response can produce better engineering. A great example is tuning NSM sensors: an engineer who also does NSM or IR gets direct feedback on the tuning and sees the good and the bad firsthand.
  • Similarly, other projects can promote a better understanding of the network, systems and security operations that may promote more efficient and accurate IR.
Disadvantages:
  • Conflicting priorities between IR and other projects.
  • More complex operating procedures.
  • Burn-out due to workload. (Yes, I listed burn-out as a disadvantage of both models).
  • Less specialization in IR will probably reduce effectiveness.
Staffing
Before deciding on the number of analysts you need for NSM and IR, you have to come to a decision on what hours you will maintain. This question is probably easier for smaller operations that don't have as much flexibility. If there is no budget for anything other than normal business hours, it is definitely easier to staff IR and security operations in general. Once you get to an enterprise or other organization that maintains some 24x7 presence, it starts getting stickier.

If you will have more than one shift, you will obviously have to decide the hours for each shift. It is important to build a slight overlap into the shifts so information can be passed from the shift that is ending to the shift that is starting. Both verbal and written communication, namely some kind of shift log, is important so any ongoing incidents, trends or other significant activity are not dropped. I will get into more detail when I write a future post, tentatively titled Building an IR Team: Communication and Documentation.

Organizing so each shift has the right people is significant. Obviously, the third shift will generally be seen as less desirable. Usually someone willing to work the third shift is trying to get into the digital security field, already has a day job, or is going to school. It is a fine line: you want someone who will do a good job on the third shift but not immediately start looking for another job with better hours, so you have to get a clear understanding of why people want to work the third shift and how long you expect them to stay on it. It can help to leave opportunities for third-shift analysts to move to another shift, since that flexibility lets you keep the stand-outs rather than losing them to another job with more desirable hours.

I am not a big fan of rotating shifts. Though a lot of places seem to implement shifts by having everyone eventually rotate through each shift, I think it does not promote stability or employee satisfaction as much as each person having a dedicated shift.

Staffing can also be influenced by policy or outside factors. Businesses, government and military all will have certain information security requirements that must be met, and some of those requirements may influence your staffing levels or hours of operation.

Hierarchy
If you only have one or two analysts, you probably won't need to put much thought into your hierarchy. If you have a 24x7 operation with a number of analysts, you definitely need some sort of defined hierarchy and escalation procedures to define NSM and IR duties. Going back to the section on other security operations, you may also need to define how other duties fit into the hierarchy, procedures and priorities for analysts that handle NSM, IR, and/or additional duties.

At left is an example of an organizational chart for when the IR team also has other duties and operates in a 24x7 environment. In addition to rotating through NSM and IR duties, each analyst is a member of a team. This is just an example to show the thought process on hierarchy. There are certainly other operational security needs I mentioned that may merit a dedicated team but are not included in my example, for instance forensics or vulnerability assessment.

Each team has a senior analyst as the lead, and the senior analysts can also double as IR leads. It is crucial that every shift have a lead to define a hierarchy and prevent any misunderstandings about the chain of command and responsibilities.

For this example, let us say that your organizational requirements state two junior analysts per shift doing NSM and IR. You could create a schedule to rotate each junior analyst through the NSM/IR schedule, which means monitoring the security systems, answering the phone, responding to emails, investigating activity, and coordinating IR for the more basic incidents. You would also probably want one senior analyst designated as the lead for the day. The senior analyst can provide quality assurance, handle anything that needs to be escalated, do more in-depth IR, and task and coordinate the junior analysts. The senior analyst can also decide that the NSM and IR workloads require temporarily pulling people off their project or team tasks to bolster NSM or IR. Finally, it may be a good idea to have the senior analyst designated as the one coordinating and communicating with management.

While the senior analysts need to excel at both the technical duties and management, the shift leads need to facilitate communication between everyone on that particular shift, management, and other shifts. Though it is helpful if the shift lead is strong in a technical sense, I do not think the shift lead necessarily has to be the strongest technical person on the shift. He or she needs to be able to handle communication, escalation, delegation, and prioritization to keep both the shift members and management happy with each other. The shift lead is basically responsible for making sure the shift is happy and making sure the CIRT is getting what it needs from the shift.

The next diagram shows a group that is dedicated only to NSM and IR. Obviously, this model is much easier to organize and manage since the tasks are much narrower. Note that, even with this model where everyone is dedicated to NSM and IR without additional duties, proper NSM and IR may call for things like malware analysis, certainly forensics for IR, or giving feedback about the security systems' effectiveness to dedicated engineers.

As one last aside regarding the different models, I have to stress that vulnerability assessment and reporting is one of the biggest time sinks I have ever seen in a security operation. If you can only separate one task away from your NSM and IR team to another team, I strongly suggest it be vulnerability assessment. There are certainly a lot of arguments about how much or how little vulnerability assessment you should be doing in any organization, but most organizations do have requirements for it. As such, it is a good idea to have a separate vulnerability assessment team whenever possible because of the number of work-hours the process requires. Note that penetration testing is clearly distinct from vulnerability assessment, and requires a whole different type of person with a different set of skills.

Redundancy
Ideally, you want to minimize what some call "knowledge hoarding" on your team. If someone is excellent at a job, you need that person to share knowledge, not squirrel it away. Some think knowledge hoarding provides job security, but a good manager will recognize that an analyst that shares knowledge is much better than one that does not. From personal experience, I can also say that mentoring, training and sharing knowledge is a great way to reduce the number of calls you get during non-working hours. If I do not want to be bothered at home, I do my best to document and share everything I know so the knowledge is easily accessible even when I am not there.

Sharing knowledge provides redundancy and flexibility. That flexibility can also spread the workload more evenly when you have some people swamped with work and others underutilized. If someone is sick or too busy for a particular task, you do not want to be stuck with no redundancy. I suppose this is true of most jobs, but it can be a huge problem in IR. As an example, if a particular person is experienced at malware analysis and has automated the process without sharing the knowledge, someone else called on to do the work in a pinch will be much less efficient and may even try to manually perform tasks that have already been automated.

Certainly most groups of incident responders will have standouts that simply can't be replaced easily, but you should do your best to make sure every job function has redundancy and that every senior analyst has what you could call at least one understudy.

Distribution of Resources
If you are in a business that has multiple locations or is a true enterprise, one thing to consider is the physical and logical distribution of your incident response team. Being physically located in one place can be helpful to communication and working relationships. Being geographically distributed can be more conducive to work schedules if the business spans many time zones. One thing that can greatly increase morale is providing as many tools as possible to do remote IR. Sending a team to the field for IR may sometimes be needed, but reducing that burden, or even allowing work from home, is a sure way to make your team happier.

Regardless, an IR team needs people in the field that can assist them when needed. Depending on the technical level of those field representatives, the duties may be as simple as unplugging a network cable or as advanced as starting initial data collection with a memory and disk capture. Most IR teams will need to have a good working relationship with support and networking personnel to help facilitate the proper response procedures.

I only touched on some of the possibilities for organizing both NSM and IR teams. As with anything, thought and planning will help make the organization more successful and efficient. The key is to reach a practical equilibrium given the resources you have to work with.

28 April, 2009

Building an IR Team: People

For some time I have been thinking about a series of posts about building an incident response team. I started in security as part of a very small Computer Incident Response Team (CIRT) that handled network security monitoring (NSM) and the ensuing security incidents. Although we were small, we had a very good core of people that helped us succeed and grow, probably well beyond anything we had imagined. We grew from a handful of people to four or five times our original size. While there were undoubtedly setbacks, we constantly got better and more efficient as the team grew.

As the first in this series, I definitely want to concentrate on people. I don't care what fancy tools, enormous budget, buy-in from management, or whatever else you have. If you don't have the right people, you'll have serious problems succeeding. Most of this is probably not unique to a response team, information security, or information technology.

Hiring
Of course, hiring is where it all starts. What do you look for in a candidate for an incident response team? Here are some of the things I look for.

  • Initiative: The last thing I want is someone that constantly needs hand-holding. Certainly people need help sometimes, and sharing knowledge and mentoring are huge, but you have to be able to work through the bumps and find solutions. A NSM operation or CIRT is not a help desk. Although you can have standard procedures, you have to be flexible, adapt, do a lot of research, and teach yourself whenever possible.
  • Drive: Most people who are successful in security seem to think of it as more than a job. They spend free time hacking away at tools, breaking things, fixing things, researching, reading, and more. I don't believe this kind of drive has to be all-consuming, because I certainly have plenty of outside interests. However, generally speaking there is plenty of time to be interested in information security outside of work while still having a life. I, and undoubtedly many successful security professionals, enjoy spending time reading, playing with new tools, and more. Finding this type of person is not actually difficult, but it can take some patience. Local security groups or mailing lists are examples of places to look for analysts to add to a team. Even if they have little work experience, by going to a group meeting or subscribing to mailing lists, they are already demonstrating some drive and initiative.
  • Communication skills: Although this may be more important for a senior analyst, being able to write and speak well is crucial. Knowing your audience is one of the most important skills. For instance, if you are writing a review of a recent incident that includes lessons learned, the end product will be different depending whether the review is for management or the incident responders on the team. Documentation, training, and reporting are other examples where good writing and speaking skills are important. I think good communication skills are underrated by many people in the field and IT in general, but the higher you look the better the chance you will find someone that realizes the importance of effective communication.
  • Background: Most of the successful NSM analysts and incident responders I know have a background in one or more of three core areas: networking, programming, or system administration. A person from each background will often have different strengths, so understanding the likely strengths of each background can go a long way toward filling a missing need on the team. You do not have to come from one of these backgrounds; it is just relatively common for the good analysts I know to have experience in these areas.
  • The wrong candidate in the wrong position: Do not be scared to turn down people that are wrong for the job. That seems obvious, but it is worth emphasizing. Along the same lines, if someone is not working out, take steps to correct the problems if possible, but do not be afraid to get rid of a person that is not right for the job. Try to understand exactly what you are looking for and where in the organization the person is most likely to excel.
Experience versus Potential
When filling a senior position, experience is definitely important. However, when filling a junior position I think automatically giving a lot of weight to information security experience can be a mistake. The last thing I want to do is hire someone who has experience but is overly reliant on technology rather than critical thinking skills. I don't mean to denigrate or automatically discount junior analysts that have experience, I just mean that I'd rather have someone with a lot of potential that needs a little more training in the beginning than what some would call a "scope dope", someone whose experience is looking at IDS alerts and taking them at face value with little correlation or investigation. If you have both experience and potential, great!

Training
Information security covers a huge range of topics, requires a wide range of skills, and changes quickly. Good analysts will want training, and if you don't give it to them you will wind up with a bunch of people that don't care about increasing their knowledge and skills as the ones that do want to learn look for greener pastures.

There are many different types of training in addition to what most people think of first, which is usually formal classes. Senior analysts mentoring junior analysts can be one of the most useful types of training because it is very adaptable and can be very environment-specific. "Brown-bag" sessions where people get together over lunch and analyze a recent incident or learn how to more efficiently investigate incidents can also work well. And when someone researches and learns new things on their own or with coworkers, as mentioned previously, that is also an excellent form of training. Load up a lab, attack it, and look at the traffic and resulting state of the systems.

Finally, do not forget about both undergraduate and graduate degrees. Though you may not consider them training, most people want to have the option open to either finish a degree or get an advanced degree in their off hours. There are a huge number of ways to provide training.

People versus Technology
Analysts are not the only ones that can rely too heavily on technology. Management will often take the stance that paying a lot of money for tools and subscriptions means two things: one, that the systems must be what they need and will do all the work for them; two, that the selling company has the systems optimally designed and configured for your environment. Just because you pay five or six figures for an IPS, IDS, anomaly detection, or forensics tool does not mean you can presume a corresponding decrease in the amount you need to spend on people. Any tool is worthless without the right people using it.

Turnover, Retention, Mobility, and Having Fun
A big part of creating and sustaining a successful response team is making sure the people you want to keep remain happy. There are a lot of things you need to do to retain the right people, including competitive pay, decent benefits, a chance for promotion, and a good work environment. Honestly, I think work environment is probably the most important factor. I know many analysts I have worked with have received offers of more money, but a good work environment has usually kept them from leaving. My boss has always said that the right environment is worth $X to him, and I feel the same way. Effective and enjoyable coworkers, management that listens, and all the little things are not worth giving up without substantial reasons. Some opportunities are impossible to pass up, but having an enjoyable work environment and management that "gets it" goes a long way towards reducing turnover.

Bottom Line
I believe getting a good group assembled is the most important thing to have an effective response team. Obviously, I kept the focus of this post relatively broad. I would love to see comments with additional input. I hope to post additional material about building a response team in the near future, possibly covering organizing the team, dealing with growth, and a few other topics.