A friend of mine is a security manager at a university. He keeps metrics for fun, mostly, he says, to torture people with. His project, over the years I've known him, is to constantly improve the service quality, reliability and security of the IT systems under his purview. As you can imagine, that's a job that encompasses management, technology and politics in almost equal degrees. So, he got into collecting metrics because he wanted to be able to bring actual data to the table whenever there was a question about what to do or not to do.
A few years ago, there was a bit of discussion about improving security. A number of researchers' systems and faculty systems had been compromised, and “something must be done” - but what?
My friend presented a short series of charts showing a breakdown of incidents by management domain: systems under IT's configuration management versus independently managed systems. It turned out that IT-managed systems were about half as likely to be compromised as independently managed systems. He also presented totals of time spent in incident response, which showed that not only were independently managed systems nearly twice as likely to be compromised, but that they took almost twice as much time to repair (presumably because they were unique configurations).
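A back-of-the-envelope version of that breakdown takes only a few lines of Python. Every field name and figure below is invented for illustration - this is a sketch of the kind of aggregation involved, not my friend's actual data.

```python
from collections import defaultdict

# Hypothetical per-incident records: (management domain, hours to repair).
# The values are made up to mirror the ratios described in the text.
incidents = [
    ("it-managed", 2.0),
    ("it-managed", 3.0),
    ("independent", 5.0),
    ("independent", 6.0),
    ("independent", 4.0),
    ("independent", 7.0),
]

# Hypothetical population sizes, needed to turn raw counts into rates.
systems_per_domain = {"it-managed": 400, "independent": 400}

counts = defaultdict(int)
hours = defaultdict(float)
for domain, h in incidents:
    counts[domain] += 1
    hours[domain] += h

for domain, population in systems_per_domain.items():
    rate = counts[domain] / population
    mean_repair = hours[domain] / counts[domain]
    print(f"{domain}: {rate:.1%} compromised, {mean_repair:.1f} h mean repair")
```

With these invented numbers, IT-managed systems come out half as likely to be compromised and roughly half as expensive to repair - the same shape of result the charts showed. The point is that none of this is possible unless each incident record carries the domain and time-spent fields.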
He told me he had a third chart up his sleeve that he never needed to use, which divided the amount of time spent recovering systems by operating system type, because a lot of the independently managed systems were running outdated versions of Windows XP.
The result of the exercise was a decision to bring more systems under IT's configuration management and to disincentivize independent management by investing less time in repairing independently managed systems. In other words: “If you get your desktop compromised, we're going to take it off the network and let you deal with it.” That policy spurred an almost immediate 75 percent migration to IT-managed systems.
That all sounds so simple, doesn't it? The trick, in this case, was to have all the necessary data on hand – slicing it up and presenting it was easy.
How do you figure out what data to present? In this kind of situation, it's pretty much a matter of looking at what data points you've collected, and then deciding which of them are relevant.
Obviously, in this case, the key data-collection decision was to collect incident data and to include operating system type, system ownership and time spent, at the time each incident was recorded. The whole exercise would have failed if the data had been pre-lumped into a simple incident count, rather than kept as granular as possible. Always remember that it's easy to merge fields (thereby giving up fine detail) when you want to, but it's much harder to pull detail back out of data that has already been merged. If you think about it this way, you'll see your data sets and their reduction as a form of compression - you want lossless compression wherever possible. Only merge the data at the last minute, unless you're going to run into storage problems if you don't pre-compact certain fields.
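The lossless-compression point can be made concrete with another sketch (again, all fields and values are invented): a granular incident log can be re-aggregated along whatever axis tomorrow's question demands, while a pre-lumped count cannot.

```python
# Hypothetical raw incident log, kept at full granularity.
incidents = [
    {"domain": "independent", "os": "windows-xp", "hours": 6.0},
    {"domain": "independent", "os": "windows-7",  "hours": 4.0},
    {"domain": "it-managed",  "os": "windows-7",  "hours": 2.0},
    {"domain": "it-managed",  "os": "linux",      "hours": 3.0},
]

# "Lossy compression": pre-lumping into a single count discards the
# detail needed to answer any later question.
incident_count = len(incidents)  # and that's all you'll ever know

# Keeping raw records lets you merge at the last minute, along any field.
def total_hours_by(field):
    totals = {}
    for rec in incidents:
        totals[rec[field]] = totals.get(rec[field], 0.0) + rec["hours"]
    return totals

print(total_hours_by("domain"))  # slice one way today...
print(total_hours_by("os"))      # ...another way tomorrow
```

The third chart my friend kept up his sleeve - repair time by operating system - is exactly the kind of after-the-fact slice that only works because the `os` field survived collection.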
• Keep your metrics relevant by reasoning backward from the problem at hand. “What data do I have that sheds light on this problem?” is the first question to ask.
• Be mindful of how you compact information out of your data, and avoid doing it until the last possible moment.
• When you're collecting your data, give some thought to how you might want to access it in the future and keep it as fine-grained as possible.
• Storage is cheap; time spent analyzing data is expensive.