Tuesday, December 09, 2008

Measuring Usability: A Task-Based Approach

I think we all know that the simplest practical measure of intelligence is how often someone agrees with you. On that scale, University of Ottawa Professor Timothy Lethbridge must be some kind of genius, because his course notes on Software Usability express my opinions on the topic even better and in more detail than I’ve yet managed to do myself. Specifically, he lists the following basic process for measuring usability:

- understand your users, and recognize that they fall into different classes
- understand the tasks that users will perform with the system
- pick a representative set of tasks
- pick a representative set of users
- define the questions you want to answer about usability
- pick the metrics that answer those questions
- have the users perform the tasks and measure their performance

This is very much the approach that I’ve been writing about, in pretty much the same words. Happily, Lethbridge provides additional refinement of the concepts. Just paging through his notes, some of his suggestions include:

- classifying users along several dimensions, including job type, experience with the tasks, general computer experience, personality type, and general abilities (e.g. language skills, physical disabilities, etc.). I’d be more specific and add dimensions such as analytical skills or technical knowledge.

- defining tasks based on use cases (I tend to call these business processes, but it’s pretty much the same); understanding how often each task is performed, how much time it takes, and how important it is; and testing different tasks for different types of users. “THIS STEP CAN BE A LOT OF WORK” the notes warn us, and, indeed, building the proper task list is probably the hardest step in the whole process.

- a list of metrics (a rough calculation sketch follows this list of suggestions):

- proficiency, defined as the time to complete the chosen tasks. That strikes me as an odd label, since I usually think of proficiency as an attribute of a user, not a system. The obvious alternative is efficiency, but as we’ll see in a moment, he uses that for something else. Maybe “productivity” would be better; I think this comes close to the standard definition of labor productivity as output per hour.

- learnability, defined as time to reach a specified level of proficiency.

- efficiency, defined as the proficiency of an expert. There’s no corresponding term for “proficiency of a novice”, which I think there should be. So maybe what you really need is “expert efficiency” and “novice efficiency”, or expert and novice “productivity”, discarding “proficiency” altogether.

- memorability, defined as proficiency after a period of non-use. If you discard proficiency, this could be “efficiency (or productivity) after a period of non-use”, which makes just as much sense.

- error handling, defined as the number of, or time spent on, deviations from the ideal way to perform a task. I’m not so sure about this one. After all, time spent on deviations is part of total time spent, which is already captured in proficiency or efficiency or whatever you call it. I’d rather see a measure of error rate, defined as the number or percentage of tasks performed correctly (by users with a certain level of training). Now that I think about it, none of Lethbridge’s measures incorporate any notion of output quality, which is a rather curious and important omission.

- satisfaction, defined subjectively by users on a scale of 1 to 5.

- plot a “learning curve” on the two dimensions of proficiency and training/practice time; the shape of the curve provides useful insights into novice productivity (what new users can do without any training), learnability (a steep early curve means people learn the system quickly), and eventual efficiency (the level of proficiency where the curve flattens out).

- even expert users may not make the best use of the system if they stop learning before they master all its features. So the system should lead them to explore new features by offering tips or making contextual suggestions.
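Since these metrics are the heart of the measurement scheme, here is a minimal sketch of how they might be computed from raw task observations. It is not from Lethbridge’s notes: the data layout, numbers, and function names are my own illustrative assumptions.

```python
from statistics import mean

# Each observation: (user_id, user_class, task_id, minutes_taken, completed_correctly).
# All of these records are invented for illustration.
observations = [
    ("u1", "expert", "build_campaign", 12.0, True),
    ("u2", "expert", "build_campaign", 15.0, True),
    ("u3", "novice", "build_campaign", 34.0, False),
    ("u4", "novice", "build_campaign", 29.0, True),
]

def productivity(obs, user_class=None):
    """Average time to complete the chosen tasks ('proficiency' in the notes),
    optionally restricted to one class of user."""
    times = [minutes for _, cls, _, minutes, _ in obs
             if user_class is None or cls == user_class]
    return mean(times)

def error_rate(obs):
    """Share of task attempts not completed correctly: the output-quality
    measure I find missing from the notes."""
    results = [ok for _, _, _, _, ok in obs]
    return 1 - sum(results) / len(results)

def learnability(practice_curve, target_minutes):
    """Hours of practice before task time first drops to a target level,
    i.e. 'time to reach a specified level of proficiency'."""
    for hours, minutes in practice_curve:
        if minutes <= target_minutes:
            return hours
    return None  # target never reached in the observed data

# One (invented) user's task time after 0, 2, 4 and 8 hours of practice.
curve = [(0, 45.0), (2, 30.0), (4, 18.0), (8, 14.0)]

print(f"Expert productivity (his 'efficiency'): {productivity(observations, 'expert'):.1f} min")
print(f"Novice productivity: {productivity(observations, 'novice'):.1f} min")
print(f"Error rate: {error_rate(observations):.0%}")
print(f"Hours to reach 20-minute proficiency: {learnability(curve, 20.0)}")
```

Memorability would reuse the same productivity calculation, just taken after a period of non-use, and plotting the practice-curve points directly would give the learning curve described above.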

At this point, we’re about halfway through the notes. The second half provides specific suggestions on:

- measuring learnability (e.g. by looking at features that make systems easy to learn);

- causes of efficiency problems (e.g. slow response time, lack of an easy step-by-step route to perform a task);

- choosing experts and what to do when experts are unavailable (basically, plot the learning curve of new users);

- measuring memorability (which may involve different retention periods for different types of tasks; and should also distinguish between frequently and infrequently used tasks, with special attention to handling emergencies)

- classifying errors (based on whether they were caused by user accidents or confusion [Lethbridge says that accidents are not the system’s fault while confusion is; this is not a distinction I find convincing]; also based on whether the user discovers them immediately or after some delay, whether the system points them out, or whether they are never made known to the user)

- measuring satisfaction (surveys should be based on real and varied work rather than just a few small tasks, should be limited to 10-15 questions, should use a “Likert Scale” of strongly agree to strongly disagree, and should vary the sequence and wording of questions)

- measuring different classes of users (consider their experience with computers, the application domain and the system being tested; the best way to measure proficiency differences is to compare the bottom 25% of users with the 3rd-best 25%, since this will eliminate outliers; see the sketch below)
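A small sketch of how I read that last quartile comparison: rank users by task time, average each quarter, and set the bottom 25% against the 3rd-best 25%. The times are invented, and the reading of “bottom” and “3rd best” is my own assumption.

```python
from statistics import mean

# Invented completion times (minutes) for 16 test users.
times = [8.0, 9.5, 10.0, 11.0, 12.5, 13.0, 14.0, 15.5,
         16.0, 18.0, 19.5, 21.0, 24.0, 27.0, 35.0, 60.0]

def quartile_averages(times):
    """Rank users from fastest to slowest and average each quarter."""
    ranked = sorted(times)
    q = len(ranked) // 4
    groups = [ranked[i * q:(i + 1) * q] for i in range(3)] + [ranked[3 * q:]]
    return [mean(g) for g in groups]

best, second, third, bottom = quartile_averages(times)
# The comparison described above would set `bottom` against `third`.
print(f"Best 25%: {best:.1f} min, 2nd: {second:.1f}, 3rd: {third:.1f}, bottom 25%: {bottom:.1f}")
```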

This is all good stuff. Of course, my own interest is applying it to measuring usability for demand generation systems. My main take-aways for that are:

1. defining the user types and tasks to measure is really important. But I knew that already.

2. choosing the actual metrics takes more thought than I’ve previously given it. Time to complete the chosen tasks (I think I’ll settle on calling it productivity) is clearly the most important. But learnability (which I think comes down to time to reach a specified level of expertise) and error rate matter too.

For marketing automation systems in particular, I think it’s reasonable to assume that all users will be trained in the tasks they perform. (This isn’t the case for other systems, e.g. ATM machines and most consumer Web sites, which are used by wholly untrained users.) The key to this assumption is that different tasks will be the responsibility of different users; otherwise, I’d be assuming that all users are trained in everything. So it does require determining which users will do which tasks in different systems.

On the other hand, assuming that all tasks are performed by experts in those tasks does mean that someone who is expert in all tasks (e.g., a vendor sales engineer) can actually provide a good measure of system productivity. I know this is a very convenient conclusion for me to reach, but I swear I didn’t start out aiming for it. Still, I do think it’s sound and it may provide a huge shortcut in developing usability comparisons for the Raab Guide. What it does do is require a separate focus on learnability so we don’t lose sight of that one. I’m not sure what to do about error rate, but I do know it has to be measured for experts, not novices. Perhaps when we set up the test tasks, we can include specific content that can later be checked for errors (a hypothetical sketch of this appears after these take-aways). Interesting project, this is.

3. the role of surveys is limited. This is another convenient conclusion, since statistically meaningful surveys would require finding a large number of demand generation system users and gathering detailed information about their levels of expertise. It would still be interesting to do some preliminary surveys of marketers to help understand the tasks they find important and, to the degree possible, to understand the system features they like or dislike. But the classic usability surveys that ask users how they feel about their systems are probably not necessary or even very helpful in this situation.

This matters because much of the literature I’ve seen treats surveys as the primary tool in usability measurement. This is why I’m relieved to find an alternative.
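Coming back to the error-rate idea in take-away 2, here is a hypothetical sketch of what “test tasks with checkable content” could look like: give each task an answer key and diff the tester’s output against it afterwards. The field names and values are invented.

```python
# Answer key for one test task (say, "set up a welcome email campaign"),
# plus the settings a tester actually produced. Everything here is invented.
expected = {"segment": "trial_users", "send_time": "09:00", "subject": "Welcome!"}
actual = {"segment": "trial_users", "send_time": "10:00", "subject": "Welcome!"}

errors = [field for field, value in expected.items() if actual.get(field) != value]
print(f"{len(errors)} of {len(expected)} checked fields wrong: {errors}")
# -> 1 of 3 checked fields wrong: ['send_time']
```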

As an aside: many usability surveys, such as SUMI (Software Usability Measurement Inventory), are proprietary. My research did turn up what looks like a good public version: Measuring Usability with the USE Questionnaire by Arnold M. Lund, from the Society for Technical Communication (STC) Usability SIG Newsletter of October 2001. The acronym USE stands for the three main categories: Usefulness, Satisfaction and Ease of Use/Ease of Learning. The article provides a good explanation of the logic behind the survey, and is well worth reading if you’re interested in the topic. The questions, which would be asked on a 7-point Likert scale, are listed below (a rough scoring sketch follows them):

Usefulness
- It helps me be more effective.
- It helps me be more productive.
- It is useful.
- It gives me more control over the activities in my life.
- It makes the things I want to accomplish easier to get done.
- It saves me time when I use it.
- It meets my needs.
- It does everything I would expect it to do.

Ease of Use
- It is easy to use.
- It is simple to use.
- It is user friendly.
- It requires the fewest steps possible to accomplish what I want to do with it.
- It is flexible.
- Using it is effortless.
- I can use it without written instructions.
- I don't notice any inconsistencies as I use it.
- Both occasional and regular users would like it.
- I can recover from mistakes quickly and easily.
- I can use it successfully every time.

Ease of Learning
- I learned to use it quickly.
- I easily remember how to use it.
- It is easy to learn to use it.
- I quickly became skillful with it.

Satisfaction
- I am satisfied with it.
- I would recommend it to a friend.
- It is fun to use.
- It works the way I want it to work.
- It is wonderful.
- I feel I need to have it.
- It is pleasant to use.
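And the promised scoring sketch. It is my own assumption rather than anything prescribed in Lund’s article: average the 7-point answers within each category, so different products can be compared category by category.

```python
from statistics import mean

# One respondent's 1-7 Likert answers, in the same order as the questions above.
# The numbers are invented; the categories and question counts follow the USE list.
responses = {
    "Usefulness":       [6, 5, 6, 4, 6, 5, 6, 5],
    "Ease of Use":      [5, 5, 4, 3, 4, 4, 5, 3, 4, 5, 4],
    "Ease of Learning": [6, 6, 6, 5],
    "Satisfaction":     [5, 5, 4, 5, 3, 4, 5],
}

for category, answers in responses.items():
    print(f"{category:<16} {mean(answers):.1f} / 7")
```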

Apart from the difficulties of recruiting and analyzing a large enough number of respondents, this type of survey only gives a general view of the product in question. In the case of demand generation, this wouldn’t allow us to understand the specific strengths and weaknesses of different products, which is a key objective of any comparative research. Any results from this sort of survey would be interesting in their own right, but couldn’t themselves provide a substitute for the more detailed task-based research.
