15 July, 2009

Building an IR Team: Documentation

My third post on building an Incident Response (IR) team covers documentation. The first post was Building an IR Team: People, followed by Building an IR Team: Organization.

Good documentation promotes good communication and effective analysts. Documentation is not sexy, and can even be downright annoying to create and maintain, but it is absolutely crucial. Making it as painless and useful as possible will be a huge benefit to the IR team.

Since documentation and communication are so intertwined, I had planned on making one post to cover both topics. However, the amount of material I have for documentation made me decide to do a future post, Building an IR Team: Communication, and concentrate on keeping this post to a more digestible size.

There are quite a few different areas where a Computer Incident Response Team (CIRT) will need good documentation.

Incident Tracking
Since I am writing about computer IR teams, it is obvious that the teams will be dealing with digital security incidents. For an enterprise, you will almost certainly need a database back-end for your incidents. Even smaller environments may find it best to use a database to track incidents.You will need some sort of incident tracking system for many reasons, including but not necessarily limited to the following.

  • Tracking of incident status and primary responder(s)
  • Incident details
  • Response details and summary
  • Trending, statistics and other analysis
Tracking the status and who is responsible for specific incidents is one of the primary reasons for incident tracking. Some off-the-shelf software can support incident tracking, for instance help desk ticketing software or other tasking software. This type of software will certainly support the basic needs like status (assigned, in progress, open, closed, etc) and who the incident is assigned to.

However, off-the-shelf software may not have great support for the incident details. A great example is IP addresses and ports. Logging IP addresses, names of systems, ports if applicable, and what type of vulnerability was exploited can be extremely useful for trending, statistics, and historical analysis. A field for IP addresses can probably be more easily be queried than a full text field that contains IP addresses. If I see that a particular IP address successfully attacked two systems in the previous shift, or a particular type of exploit was used successfully on two systems, I want to be able to quickly check and see how many times it happened in the past week. I also want to be able to pull that data out and use it to query my NSM data to see if there was similar activity that garnered no response from analysts.

Reponse details can be thought of as a log that is updated throughout the incident, from discovery to resolution. Having the details to look back on is extremely useful. You can use the details for a technical write-up, an executive summary, to recreate incidents in a lab environment, for training, lessons learned, and more. My general thought process is that the longer it takes to document an incident, the more likely the documentation is to be useful.

Trending and statistical analysis can be used to help guide future response and look back at previous activity for anything that was missed, as I already mentioned. It is also extremely useful for reports to management that can also be used to help gain political capital within the organization. What do I mean by political capital?

Say you have noticed anecdotally that you are getting owned by web servers over HTTP, but the malicious sites are usually known to be malicious, for instance when searching Google or using a anti-malware toolbar. Your company has no web proxy and you recommend one with the understanding that most of the malicious sites would be blocked by the web proxy. The problem is that the networking group does not want to re-engineer or reconfigure, and upper management does not think it is worth the money. With a thorough report and analysis using the information from incident tracking, and by using that data to show the advantages of the proxy solution, you could provide the CIRT or SOC management the political capital they need to get things moving when faced with other parts of the company that resist.

Standard Operating Procedures (SOP)
Although analysts performing IR need to be able to adapt and the tasks can be fluid, a SOP is still important for a CIRT. A SOP can cover a lot of material, including IR procedures, notification and contact information, escalation procedures, job functions, hours of operation, and more. A good SOP might even include the CIRT mission statement and other background to help everyone understand the underlying purpose and mission of the group.

The main goal of a SOP should be to document and detail all the standard or repetitive procedures, and it can even provide guidance on what to do if presented with a situation that is not covered in the SOP. As an example, a few bullet points of sections that might be needed in a SOP are:
  • Managing routine malware incidents
  • Analyzing short term trends
  • Researching new exploits and malicious activity
  • Overview of security functions and tools, e.g. NSM
  • More detailed explanation and basic usage information for important tools, e.g. how to connect to Sguil and who are the administrators of the system
Although a SOP will not cover every situation, the goal should be to make the team more efficient and provide a reference for tasks or procedures that are used repeatedly. I'm not a fan of hand-holding and like analysts to try and figure things out on their own, so I don't mind if analysts use different methods as long as the end results are consistent in both accuracy and format.

I also like analysts to think about the most efficient way to analyze an incident. Some may gather information and investigate using slightly different methodology, but each analyst should understand that something simple should be checked before something that takes a lot of time, particularly when the value of the information returned will be roughly equal. The analysis should use what my boss likes to call the "Does it make sense?" test. Gathering some of the simplest and most straightforward information first will usually point you in the right direction, and a SOP can help show how to do this.

Knowledge Base
A knowledge base can take many different forms and contains different types of information than SOP, though there also may be overlap. There are specific knowledge base applications, wikis, simple log applications, and even ticketing or tasking systems that provide some functionality for an integrated knowledge base. A knowledge base will often contain technical information, technical references, HOWTOs, white papers, troubleshooting tips, and various other types of notes and information.

One of my favorite options for a knowledge base is a wiki. You can see various open knowledge bases that are using wikis, for instance NSMWiki and Emerging Threats Documentation Wiki, but if you want organization- and job-specific knowledge bases then you will also need something to hold the information for your CIRT.

The reason I pick those two wikis as examples is because they contain some of the exact type of information that is useful in a knowledge base for your CIRT. The main difference is that your knowledge will be specific to your organization. One good example are wiki entries for specific IDS rules as they pertain to your network, in other words an internal version of the Emerging Threats rule wiki. There may be shortcuts to take with regard to investigating specific rules or other network activity to quickly determine the nature of the traffic, and a wiki is a good place to keep that information.

Similarly, documentation on setting up a NSM device, tuning, or maintenance can be very effectively stored and edited on a wiki. The ease of collaboration with a wiki helps keep the documentation useful and up to date. If properly organized, someone could easily find information needed to keep the team running smoothly. Some example of documentation I have found useful when put it in a wiki:
  • How to troubleshoot common problems on a NSM sensor
  • How to build and configure a NSM sensor
  • How to update and tune IDS rules
  • List and overview of scripts available to assist incident response
  • Overviews of each available IR tool
  • More detailed descriptions and usage examples of IR tools
  • Example IR walk-throughs using previously resolved incidents
  • Links to external resources, e.g. blogs, wikis, manuals, and vendor sites
One of the best ways I can think of to effectively communicate the usefulness of both a knowledge base and a SOP to senior technical personnel is by pointing out that better documentation makes it less likely that you are needed for help. Additionally, it is much faster to resolve an issue for another analyst if you can do something like refer the analyst to step-by-step instructions in the knowledge base. If you want fewer calls during off hours, make sure the analysts have the documentation they need.

Shift Logs
In an environment with multiple shifts, it is important to keep shift logs of notable activity, incidents, and any other information that needs to be passed to other shifts. Although I will also discuss this in Building an IR Team: Communication, the usefulness of connecting the shifts with a dedicated log is apparent. Given the amount of email and incident tickets that are generated in an environment that requires 24x7 monitoring, having a shift log to quickly summarize important and ongoing event helps separate the wheat from the chaff.

Since my feeling is that shift logs should be terse and quick to parse, what to use for logging may not be crucial. The first examples that come to my mind are software designed for shift logs, forum software, or blogging software. The main features needed are individual accounts to show who is posting, timestamps, and an effective search feature. Anything else is a bonus, though it may depend on exactly what you want analysts logging and what is being used to handle incident tracking.

One thing that is quite useful with the shift log is a summary post at the end of each shift, and then the analysts should verbally go over the summary at the shift change. This can help make sure the most significant entries are not missed and it gives the chance for the oncoming shift to ask questions before the outgoing shift leaves for the day.

As usual, I can't cover everything on the topic, but my goal is to provide a reference and get the gears turning. The need for good documentation exists and documentation is important to use to the IR team's advantage.

07 July, 2009

The "Does it make sense?" test

I was composing the next installment of my series on building an incident response team and started to include this, but then decided it deserves a separate entry.

Some time ago, my boss came up with what he calls the "Does it make sense?" test as a cheat-sheet for help training new analysts and to use as a quick reference. When we refer to traffic making sense, we are asking whether the traffic is normal for the network.

This is very simple and covers some of the quickest ways an analyst can investigate a possible incident. Consider it a way to triage possible NSM activity or incidents. Using something like this can easily eliminate a lot of unnecessary and time-consuming analysis, or point out when the extra analysis is needed.

The "does it make sense" test:

  1. Determine the direction of the network traffic.
  2. Determine the IP addresses involved.
  3. Determine the locations of the systems (e.g. internal, external, VPN, whois, GeoIP).
  4. Determine the functions of the systems involved (e.g. web server, mail server, workstation).
  5. Determine protocols involved and whether they are "normal" protocols and ports that should be seen between the systems.
  6. When applicable, look at the packet capture and compare it to the signature/rule.
  7. Use historical queries on NSM systems and searches of documentation to determine past events that may be related to the current one.
Based on the above knowledge, does the traffic that caused the alert make sense or is it abnormal? Simple examples:
  1. A file server sending huge amounts of SMTP traffic over port 25 probably does not make sense, whether because of malicious activity or a misconfiguration.
  2. Someone connecting to a workstation on port 21 with FTP probably does not make sense.
  3. A DNS server sending and receiving traffic to another DNS server over port 53 does make sense. However, an analysis of the alert and the DNS traffic may still be needed to verify whether the traffic is malicious or not.
Remember, traffic that makes sense and is normal on one network may not be normal on another network. Having a good baseline of your network traffic is extremely important before you can accurately determine what traffic makes sense and what traffic does not makes sense. Even traffic that does not make sense is not automatically malicious.

Also remember, traffic that makes sense is not always friendly. A good attacker will make his network traffic look like it fits in with the the baseline traffic, making the traffic less likely to stick out.