34.0.1 Introduction
With its roots in Psychology, Human-Computer Interaction (HCI) has always given an important role to experiments. For example, early experimental HCI papers include English et al. (1968), Card, Moran and Newell (1980) and Malone (1982). From these early days, the psychology-style experiment developed to become the basis for usability tests where different designs are compared through controlled studies (see, for example, Preece, Rogers and Sharp, 2010) and measured for important differences in the key aspects of usability, that is, efficiency, effectiveness and satisfaction (Hornbaek and Law, 2007). This was embodied in the proposed engineering approach to HCI, such as Dowell and Long (1989), where new interfaces would be engineered through the application of scientifically established knowledge about human factors to achieve predictable and measurable effects when people used the interfaces.
Alongside this, though, HCI has always recognised the far-reaching nature of technology in human lives and so has also increasingly been influenced by other disciplines such as the social sciences, cultural studies and, more recently, the arts and humanities. These different influences have brought their own research methods such as ethnomethodology (Suchman, 1995) and critical readings (Blythe et al., 2008). Additionally, the vision of engineering interfaces based on scientific principles has foundered as HCI has moved from merely worrying about the productivity of users to being concerned with the wider effects of systems on user experiences (McCarthy and Wright, 2005), including fun (Blythe et al., 2006), pleasure (Jordan, 2001), immersion (Jennett et al., 2008), and so on.
Nonetheless, experiments have maintained a steady presence within HCI and adapted, where appropriate, to address these more wide-ranging aims. They still embody the principle of gathering well-founded data to provide rigorous, empirically validated insights into interactions but made suitable for wider research questions. This can be seen, in part, through moves within the HCI community to develop an orientation to Interaction Science, of which theories and experiments form the cornerstone (Howes et al., 2014).
Also, with increasing maturity, there is now a generation of HCI researchers who have only ever been HCI researchers, unlike in the early years when people moved into HCI having done their early research in entirely different disciplines, such as Psychology, Physics, English, Mathematics and so on. Thus, texts are appearing that describe the research methods and, in particular, the experimental methods that these second-generation HCI researchers need. Overviews of experimental methods are given in Cairns and Cox (2008a), Lazar et al. (2009) and Gergle and Tan (2014). Purchase (2012) is entirely about the design and analysis of experiments in HCI. Additionally, there are texts that draw strongly on experimental methods, such as Tullis and Albert (2008) and Sauro and Lewis (2012), in order to demonstrate how to effectively measure user interactions with a view to improving the design of interactive systems. These texts therefore also understandably provide a lot of detail and insight about the design and conduct of experiments in HCI even if not in a purely research context.
There are also research articles that focus on specific aspects of experiments to help address the problems of experimental methods that are specific to their deployment in HCI. These articles address diverse topics like the quality of statistical analysis (Cairns, 2007), new statistical methods (Wobbrock et al., 2011) and the practicalities of running good experiments (Hornbaek, 2013).
This sets a challenge for the goal of this chapter. Experiments are clearly an important approach in the modern HCI researcher’s toolkit but there are several existing resources that describe experimental methods well and in more detail than is possible here. There are already overview chapters (including one by me!) that provide a starting resource for new researchers. What can this chapter usefully add? First, it should be noted that this is an Encyclopaedia chapter and therefore, if this is an important topic in HCI, it ought to be represented. This chapter aims to represent it. Secondly, I think there is a perspective on conducting experiments in HCI that has not been explicitly addressed, and that is the one I wish to present here. This is not to say that my esteemed colleagues already writing in this area are unaware of anything in this chapter, but rather that this is my particular synthesis based on my own experiences of using experimental methods and, more particularly, of teaching students to use them.
Understandably, students partly struggle to learn experimental methods because they are new and unfamiliar, but it seems some of the problems arise because they do not fully understand why they need to do things the way they are told to (whether by me or by textbooks). There seems to be a ritual mystique around experiments that is not made wholly clear. This is a problem noted in other disciplines, like Psychology, where students and researchers ritually follow the experimental method (Gigerenzer, 2004) but fail to identify when the method is invalid or ineffective. The aim of this chapter is thus to communicate concisely some of the core ideas behind the experimental method and to demonstrate how these play out in the conduct of actual experimental research in HCI. From this perspective, I identify three pillars that underpin good experiments. These are:
Experimental design
Statistical analysis
Experimental write-up
If any of these pillars is weak, then the contribution of an experiment is weak. It is only when all three work together that we can have confidence in the research findings arising from an experiment.
This chapter discusses each of these pillars in turn and along the way shows how each is related to the other. Before doing this though, it is necessary to think about what experiments are. This is philosophy. Too much philosophy is crippling (at least for an HCI researcher though obviously not for philosophers). But an unconsidered life is not worth living (Plato, 1956, The Apology). So here goes.
34.0.2 A Little Philosophy of Science
Though there is much disagreement about what science is, and indeed about whether scientific method is in fact a coherent, meaningful term (Feyerabend, 2010), there is general agreement that science is about theories and experiments. Theories are statements expressed mathematically (“E = mc²”) or linguistically (“digital games fulfil people’s need for autonomy”) that describe something about how the world works and therefore can be used to predict to some extent what may happen in certain situations (Chalmers, 1999, chap. 4). Experiments are somewhat harder to define as they depend on the domain, the research questions and even the scientist doing them. As such, they are full of craft skill and experience but basically they can be understood as tests or trials of ideas.
Historically, theory has dominated the philosophy of science, with much concern about what a theory is and how we know it is true. This is reflected in the dominant influences in the philosophy of science, such as Popper (1959) and Kuhn (1962), much of whose thought is on understanding the nature of scientific theory. In these approaches, experiments are tests of theories that are able to undermine and even destroy theories but are essentially subservient to them. If this really were the case, this would be a big problem for HCI. It is hard to point within HCI to large, substantial and robust theories that can easily predict how people will interact with a particular system and what the outcome will be. Fitts’ Law (MacKenzie, 1992) is a good example of a well-validated theory in HCI but, first, its specific implications for any differences in, say, button layout are so small as to be negligible (Thimbleby, 2013), and secondly, it is not even interpreted correctly when it is used to inform design. For instance, it is often used to justify putting things at the edges of screens because then buttons effectively have infinite width (e.g. Smith, 2012), but updates of Fitts’ Law to modern interfaces show that both height and width are important (MacKenzie and Buxton, 1992).
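For readers who like to see the arithmetic, here is a minimal sketch in Python of the point about width and height. The constants a and b, and the use of the smaller of width and height for a two-dimensional target, are illustrative assumptions rather than values or formulations taken from the cited papers.

```python
# A minimal sketch, not code from any cited paper. Fitts' Law (Shannon formulation)
# predicts movement time from target distance and size; a and b are invented constants.
import math

def fitts_mt(distance, width, a=0.1, b=0.15):
    """Predicted movement time (s) to hit a target of a given width at a given distance."""
    return a + b * math.log2(distance / width + 1)

def fitts_mt_2d(distance, width, height, a=0.1, b=0.15):
    # One illustrative way of handling 2D targets: use the smaller of width and height,
    # so an 'infinitely wide' but very short edge button is still slow to hit.
    return fitts_mt(distance, min(width, height), a, b)

print(f"{fitts_mt(200, 20):.2f} s")          # an ordinary button
print(f"{fitts_mt_2d(200, 2000, 10):.2f} s") # very wide but short: height dominates
```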
Just in case you are in doubt about the status of theory within HCI, contrast this with psychology where there are many well-established theories that predict people’s behaviour in a wide variety of contexts: perception and inattentional blindness (Simons and Chabris, 1999); anchoring biases in decision making (Kahneman, 2012); embodied cognition (Wilson, 2002) and so on. This is not to say that these theories play out so clearly in our ordinary day-to-day lives but they do appear robustly in the lab and often outside of it as well. HCI would be hard pushed to demonstrate any such substantial theory of interactions.
Fortunately for HCI, it has more recently been recognised that experiments have their own value independently of theories. This approach is termed new experimentalism, for example Chalmers (1999), and reflects the fact that sometimes experiments exist because of an idea but not necessarily one which has support from any existing theory and, in some cases, quite the opposite. A classic example of this is Faraday’s first demonstration of the electric motor. It was known in Faraday’s time that there was some interaction between electrical currents and magnetic fields but it was Faraday who clearly isolated this effect by showing that a suspended wire would consistently rotate around a magnet when an electric current was passed through it (Hacking, 1983). It took another 60 years for Maxwell to define the full theory of electromagnetism that would account for this, even though that theory immediately seemed obviously flawed as it predicted a constant speed of light regardless of the motion of the observer (which turned out to be true!).
If experiments are not testing theories then what are they? In one sense, they simply isolate a phenomenon of interest (Hacking, 1983). How we identified that phenomenon in the first place (by theory, hunch or blind chance) is irrelevant. Once we have isolated it and can reliably reproduce it, that experiment stands as something that requires explanation, much like Faraday’s motor did. Few experiments though really are pure chance. Instead, they arise from a concerted effort by one or more people to identify something interesting going on, to isolate a recognised phenomenon.
A more sophisticated account of experiments’ role in science is that an experiment is a severe test of an idea (Mayo, 1996). In this formulation, we may have an idea, be it an established theory or some hunch about how things work, and we set up a situation where that idea is able to be severely tested. For example, we may believe that digital games are able to improve the outcomes of psychotherapy. So we need to set up a situation in which it ought to be clear whether or not digital games have improved the outcomes of psychotherapy and in which the cause of any such improvement can be directly attributed to the digital games. This has immediate consequences for how such an experiment might look: fixed style of therapy; possibly even a fixed therapist or set of therapists; comparable patients; clear assessment of outcomes; and so on. But nonetheless, having levelled the playing field, we expect digital games to demonstrate their improvements. The experiment isolates the phenomenon of interest, in this example digital games in the context of psychotherapy, and sets it up so that if it is going to fail, it is obviously going to fail. Or if it succeeds, then it is clear that the digital game is the sole cause of success. And so it is by such careful experiments that the idea is severely tested and each experiment therefore provides evidence in support of it.
The other thing to note in this definition of severe testing is that, in setting up the experiment, there is a prediction that is being tested and the prediction has structure, namely that, in situations represented by the experiment, the outcome will be a certain way. This prediction is central to any experiment (Abelson 1995; Cairns and Cox 2008a) but it should also be noted that it is a causal prediction: when X happens, Y will follow. Both these points are important in how experimental methods are defined and how statistical analysis is conducted.
Another immediate implication of seeing experiments as severe tests is that no single experiment can be enough to test all the implications of a single idea (at least not one with any claim to have more than the narrowest impact). One experiment is only one test. There may be other tests where the idea might fail. Or other refinements of the test that are more severe and therefore better tests. Experiments do not live in isolation but need to form a cluster around a particular idea and, although passing tests is important for establishing the robustness of an idea, it is only evidence in support of the idea and never proof that the idea is correct.
As with any branch of philosophy, Mayo’s notion of severe testing is not the last word on what experiments are nor on how they fit with science (Mayo and Spanos, 2010). However, it clearly has a good basis and, more pragmatically, I have found it to be very useful in the context of HCI. Many HCI experiments do not have recourse to theory: the contexts, tasks and devices under consideration are too complex to fall under a simple, well-accepted theory. Instead, researchers tend to argue for why a particular interaction should lead to a particular outcome in certain circumstances and in doing so hope to advance understanding of these sorts of interactions. Perhaps this in time will lead to a theory but perhaps it may only help designers with some solid information. A traditional usability test done in the form of an experiment is the best example of this. The researchers are not interested in a formal theory but simply how something works in some particular system as it will be used in some particular context. The interaction being predicted is being put under a severe test, where it might not play out as hoped, and when it passes the test then there is good evidence that the interaction is as understood and so merits further use or investigation.
What is also interesting for the purposes of this chapter is that the notion of severe testing does explain many features (and problems) of experimental methods in HCI even though the methods used pre-date this philosophy. It is also worth noting that none of the authors mentioned who have previously written about experiments in HCI (including myself) have adopted the notion of experiments as severe tests. Nonetheless, the people who developed, deployed and advocated such experiments seemed to know what was important in a good experiment even if they did not have the philosophy to say why explicitly.
Having positioned experiments philosophically in HCI, each of the three pillars of experimental research is now considered in turn.
34.0.3 Experimental Design
Experimental design is basically the description of what exactly will go on in an experiment. Computer scientists often use the acronym GIGO, garbage in, garbage out, to describe the output from programmes based on poor quality data. So it is with experiments: the data from an experiment cannot be any better than the experimental design that produced it. Fortunately, HCI people are good at understanding the complexity of the design process which combines innovation, craft skill and sound knowledge (Lawson, 1997). Experimental design is no different in that regard from any other sort of design. What is perhaps deceptive to a new researcher is that experimental designs are written up in papers as objective descriptions of a state of affairs and the complex processes that led to that particular experiment are glossed over. In particular, the fact that an experimental design may have been iterated on paper, tested in a pilot or even failed entirely and iterated further may only merit the briefest of mentions, if any.
Given that this craft element of experimental design is rarely seen except in textbooks (e.g. Purchase, 2012), it can be hard to perceive the key features that are needed in a good design or the thoughts that need to go into the design process. But these features are important because, in the process of moving from a big idea to an experiment, there are many choices that need to be made and, as any designer will tell you, correct choices are not always obvious. At the same time, there are definitely choices that are wrong but this is not obvious when you are new to experimental design.
The starting point of an experiment as a severe test is to set up a situation that tests an idea and that idea must be causal: one thing should influence another. In HCI-style experiments, this is also expressed as seeing the effect of the independent variable on the dependent variable. The independent variable is what is in the experimenters’ control and the experimenter explicitly manipulates it. The dependent variable is the numerical measure of what the outcome of the manipulation should be: the data that the experimenter gathers.
However, holding in mind GIGO, it is vital to ensure that the data coming out of the experiment is of the highest quality possible. In the world of experiments, quality is equated with validity and generally four types of validity are important (Yin, 2003; Harris, 2008):
Construct validity
Internal validity
External validity
Ecological validity
The ordering here is not essential but can be understood as moving from the details of the experiment to the wider world. They of course are all necessary and relate to each other in ensuring the overall quality of an experiment as a severe test. There are also other types of validity and slightly different ways of slicing them up but these four provide a firm foundation.
34.0.3.1 Construct validity
Construct validity is about making sure that you are measuring what you think you are measuring. If this is not the case, despite measuring a dependent variable, the experiment is not testing the idea it is meant to. Accurate, meaningful measurement is therefore at the core of experiments in HCI.
Construct validity may seem trivial at first glance but in fact it is easy to mistake not only what a measure is but also what it means. Take something that is relatively uncontroversial to measure but relevant to HCI: time. Time, for our purposes, is well-defined and easily measured using a stopwatch or even clocks internal to the system being used. So where an experiment is looking at efficiency, the time it takes a person to complete a task may seem relatively uncontroversial: it is the time from being told to start the task to the time at which they stop it. But, thinking about this in the context of particular tasks, even time on task can be hard to specify. For example, suppose you are doing an experiment looking at people’s use of heating controllers, which are notoriously difficult to use (Cox and Young, 2000). The task might be something like setting the heating to come on every weekday at 6am and to go off at 9am. It is clear when people start the task but they may stop the task when they think they have completed it but not when they have actually completed it. For instance, they may have set it to come on every day at 6am or only on Mondays, both of which are potentially easier or harder tasks depending on the controller design. To use the measured time as task completion time would therefore be flawed. And even when people have completed all the steps of a task, they may have made a mistake so that the heating comes on at 6pm not 6am. The process of correctly checking the time could add to the completion time, particularly if checking can only be done through backtracking through the process. So an experimenter may choose to only consider those people who completed the task correctly but that could mean throwing away data that has potentially important insights for people who want to design better heating controllers. The people who did complete the task accurately may not be reliable representatives of the wider population, which introduces another problem into the measurement. So what is the right measure of time in this context?
Consider further the measurement of time of resumption after interruptions, or of initiation of an action after moving to a different device. These are interesting questions about the use of interactive devices but it is not simply a case of setting a stopwatch going. The experiment needs to be designed to make it clear what the meaningful points are at which to start and stop timing, and researchers need to make careful decisions about this (see, for example, Andersen et al., 2012).
Increasingly, HCI is not just concerned with objective measures like time but subjective measures related to user experience like appreciation of aesthetics, enjoyment of a game or frustration with a website. Though there is some move to use objective measures like eye-tracking (Cox et al., 2006) or physiological measures (Ravaja et al., 2006) these still need to be traced back to the actual experiences of users which can only be done through asking them about their experiences.
In some cases, a simple naïve question, “how immersed were you in that game (1-10)?”, can provide some insights but there is always the risk that participants are not answering the question in front of them but are in fact answering something they can answer, like how much they enjoyed the game (Kahneman, 2012), or what they think they are meant to answer (Field and Hole, 2003). Even so, questions of this sort are regularly seen in HCI and should be treated with doubt about their construct validity: how do we know that they really are measuring what they say they are?
Questionnaires are therefore useful tools for standardising and quantifying a wide range of subjective experiences. The idea behind a questionnaire is that people are not able to directly express their experiences accurately: one person’s immersion is another person’s enjoyment. Instead, within people’s thoughts there are internal constructs, also called latent variables, that are common to people’s thinking generally but hard for them to access directly. They are nonetheless believed to be meaningful and moreover can be measured. Such measurements can be compared between people provided they can be meaningfully revealed. Questionnaires aim to ask questions about different aspects of a single internal construct indirectly, through the thoughts which people are able to access. By combining the answers to these accessible thoughts, it is possible to assign meaningful numerical values to the (inaccessible) latent variables. For example, while a person may not really understand what is meant (academically) by immersion, they may be able to more reliably report that they felt as if they were removed from their everyday concerns, that they lost track of time and so on (Jennett et al., 2008). A questionnaire consisting of these items is more likely to build up a reliable picture of immersion.
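To make the idea of combining item responses concrete, here is a minimal sketch in Python. The item names, the reverse-scored item and the 1-5 response scale are invented for illustration and are not taken from any published immersion questionnaire; as the next paragraph stresses, building a valid scale takes far more work than this.

```python
# A minimal sketch: scoring a latent construct (immersion, say) by combining several
# accessible items rather than asking one direct question. All item names and the
# 1-5 scale are hypothetical.
import statistics

def scale_score(responses, reverse_scored=(), scale_max=5):
    """Average item responses (1..scale_max), flipping any reverse-scored items."""
    adjusted = [scale_max + 1 - v if item in reverse_scored else v
                for item, v in responses.items()]
    return statistics.mean(adjusted)

participant = {
    "removed_from_everyday_concerns": 4,
    "lost_track_of_time": 5,
    "mind_wandered_elsewhere": 2,   # a reverse-scored item: high values suggest low immersion
}
print(scale_score(participant, reverse_scored=("mind_wandered_elsewhere",)))
```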
Simply having a questionnaire is not enough to guarantee construct validity. Good questionnaire design is its own research challenge (Kline, 2000). It initially needs a great deal of care just to produce a questionnaire that has potential and then even more work to demonstrate that it has relevance in realistic contexts. HCI understandably relies a lot on questionnaires to help get access to subjective experiences but many questionnaires are not as robust as claimed. A questionnaire designed specifically for an experiment suffers the same faults as direct questions: social desirability, lack of consistent meaning and answering a different (easier) question. Moreover, a worse crime is to use an experiment to demonstrate the validity of a new questionnaire: “this experiment shows that gestural interfaces are easier to learn and moreover validates our questionnaire as a measure of ease of learning” is a circular argument showing only that the experiment showed something was different about gestural interfaces. Even where questionnaires have been designed with more care, there is still a big issue in HCI as to whether or not they are sufficiently well designed (Cairns, 2013).
Overall then, whether the measure to be used is about objective variables or subjective experiences, every experiment needs a measure of the effect of the experimental manipulation. All such measures have the potential to be flawed so care needs to be taken at the very least to use a measure that is plausible and justifiable.
34.0.3.2 Internal validity
If an experiment is set up to severely test whether influence X really does affect outcome Y, then it needs to be clear that any systematic changes in Y (now that we know we really are measuring Y, thanks to construct validity) are wholly due to the change in X. This is the issue of internal validity.
The most obvious threat to internal validity comes from confounding variables. These are things other than X that might potentially influence Y, so when Y changes, we cannot be sure that it is only X causing Y to change and consequently our experiment is not severe. Consider an experiment to test the effect of the information architecture of a company’s website on people’s trust in the company. In manipulating information architecture, an enthusiastic researcher might “make improvements” to other aspects of the website such as the use of colour or logos. Thus, when it comes to interpreting the results of the measure of trust in the company, any differences in trust between the different versions of the website might be due to the improved aesthetic qualities of the website and not the revised organisation of information on the website. The revised aesthetics is a confounding variable. There may be more subtle effects, though, that even a careful researcher cannot avoid. The information architecture may be revised in order to better categorise the services of the company but in doing so the revisions might result in navigation bars being shorter. This makes it quicker for users to make choices and hence improves their sense of progress, resulting in a better user experience all round, including their sense of trust in the company. This is not a consequence of improved information architecture but simply of shortening the menus: any reasonable shortening of menus might have been equally good!
Confounding variables can creep into an experiment in all sorts of ways that are nothing to do with the experimental manipulation. Some are easier to spot than others, for instance, having all men in one condition of the experiment and all women in another means that the sex of the participant is a confounding variable. A less obvious example is having only Google Chrome users in one condition and Firefox users in the other. This cannot be noticed at all unless the experimenter asks specifically about browser usage. In both cases, say in an experiment about speed of entering text on mobiles, there may be no apparent reason why such differences should be relevant but they cannot be ruled out as the reason for any systematic variation in outcomes. They are therefore potential confounds. Other confounds might be: time of day at which participants take part in the experiment; familiarity with the task, hardware or software being used; how the experimenter greets participants; the rooms in which the experiment is done; and so on. The list is potentially endless.
There are two general approaches to removing confounds. The first is to randomise participants between conditions of the experiment. The second is to experimentally control for potential confounds. In randomisation, the assumption is that if people are randomly allocated to experimental conditions, any differences between the conditions other than the manipulation are down to chance rather than systematic bias. Nonetheless, if you suspect there might be potential confounds, such as experience with a particular web browser, then asking about this is a good idea, if only to discount it as a confound. Furthermore, it also gives the opportunity to add such confounds as factors in the statistical analysis and so exercise statistical control over them. In experimental control, though, the approach is to remove variation between participants that might have confounding effects: experiments are all conducted in the same room; only men are asked to participate in the study (Nordin et al., 2014); people are screened for technology use before being allowed to participate; and so on. Even with experimental control, it is not possible to remove all possible confounds but only some of the worst ones that the experimenter could think of in advance.
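As a concrete illustration of the first approach, here is a minimal sketch in Python of randomly allocating participants to conditions. The participant codes and condition names are invented.

```python
# A minimal sketch of random allocation: shuffle the participants and deal them out
# across conditions so that no systematic difference (browser used, time of day,
# experience and so on) should line up with the experimental manipulation.
import random

def randomise(participants, conditions=("control", "new_design"), seed=None):
    rng = random.Random(seed)   # fix the seed only if the allocation must be reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Deal participants out like cards so group sizes differ by at most one.
    return {c: shuffled[i::len(conditions)] for i, c in enumerate(conditions)}

allocation = randomise([f"P{i:02d}" for i in range(1, 21)])
for condition, people in allocation.items():
    print(condition, people)
```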
Aside from confounds, there is a further threat to internal validity which arises from what I call experimental drift. In the process of developing any experiment, there are necessarily decisions that must be made about the participants used, the tasks they do, how the system is set up and so on. All of these decisions must have a degree of pragmatism about them: it is no good having a wonderful idea of how pre-industrial cultures are able to engage with touchscreen interfaces if you are unable to contact anyone from a pre-industrial culture! However, in certain contexts, what starts off as a sensible idea for an experiment is eroded through the practical concerns of devising the experiment. This is particularly prevalent in HCI in studies intended to examine some influence on the user-centred design process. For example, a very good question is whether usability design patterns (Tidwell, 2005) improve the usability of interactive systems. So of course the ideal experiment sets up a situation where the same system is developed by two equally skilled teams but one uses design patterns and one does not. The two designs are then evaluated for usability. There are very few commercial companies that would expend the effort and cost of having two teams develop the same system. So some non-commercial context is preferred. Fortunately, many researchers are in universities so, using the resources at hand, students are excellent replacements for commercial design teams and moreover multiple teams could be set up to do the same design task. Such students are most motivated if in fact they do the work as part of a module on HCI, and furthermore as the assessment for that module, and design patterns may be specifically taught to some but not other students to see how they are used. But already it is clear that there is a big move away from professional design teams using patterns to novice designers (students) using a technique that they may have only just learned. Aside from all the confounding variables that might be introduced along the way, even if the experiment gives the desired result, does it really show that when design patterns are used, the usability of the end product improves? The experiment is at best only obliquely testing the original proposed causal idea.
Internal validity, then, can only be maintained by a process of vigilance whereby an initial experimental design is iteratively reviewed both for its overall coherence and for possible confounds that the design might be introducing.
34.0.3.3 External validity
Experiments naturally reduce from a general question with wide applicability to the specifics of the experiment actually done. The question then becomes to what extent the results of the experiment do have the intended wider applicability. This is the external validity or generalisability of the experiment.
The external validity of any experiment is a matter of judgment. For a typical HCI experiment, certain people were asked to do certain tasks in a particular context. The external validity of the experiment is the extent to which its results generalise to other people doing other tasks in other contexts. To illustrate this, consider an experiment on the effect of accelerometer-based controls on the experience of playing digital games on mobile phones. The natural generalisation is from the sample of players to a wider audience of players in general. But what constitutes the wider audience? It depends very much on the people who participated: what was the range of their experience of games, of accelerometer-based controls, of mobile devices; what sort of person took part, be it young, old, men, women, children; and so on. A very large sample has the potential for greater generalisation but, even so, if that sample was collected from the undergraduate population of a university, which is very common (Sears, 1986), then generalising out to non-undergraduates may be unsound.
The experiment could also generalise to other mobile phones like the ones used in the experiment, so if the study used iPhones, Samsung Galaxy phones (of a similar size) might be a reasonable generalisation. But would the results apply to iPads, Kindles, or non-phones like the PS Vita or Nintendo DS? Similarly, would the results apply to other games as well? Usually such experiments use one or two games, much as all HCI experiments use a small set of tasks for participants to complete. Do the experimental findings therefore apply to all games? Or just games like the ones used? In which case, how much like?
What is interesting is that, in regard to this generalisation across devices and tasks, we rarely place the same emphasis in HCI as we do on generalising across people. A true experiment that wished to generalise to games should actually sample through the population of games just as we normally sample through the population of people. And then this should be factored into the analysis. But this is almost never done and the only example that I know of is Hassenzahl and Monk (2010), where they aim to overcome previous methodological problems in relating aesthetics and perceived usability by sampling through both participants and products. And this is just one aspect of the task. With things like accelerometer-based interactions, as with many interactive modalities, there are parameters and settings that could be tweaked that may make a marked difference to people’s experiences of the interactions.
The true extent of external validity, therefore, can be very hard to judge. My feeling is that, when using other people’s work, we are perhaps a bit keen to say that it has wider relevance than might be warranted so that we can use it in our own. All experiments are of course intended to be particular instances of a more general phenomenon but it would seem there is room in HCI to consider generalising all three aspects (H, C and I, not just H) much more explicitly in the design of experiments.
34.0.3.4 Ecological Validity
External validity is concerned with the extent to which the results of the experiment apply to other contexts, but only inasmuch as, if similar but not identical experiments were run, they would produce the same results. By contrast, ecological validity is concerned with the extent to which the experimental findings would have relevance in the real world in which people find themselves using interactive systems as part of their daily life.
A simple example of this might be to consider different versions of a map App to see whether people are able to use them to navigate better, let’s say, around the tourist sights of a city. Obviously, for the purposes of an experiment, participants would be given a navigation task which might be at a real tourist destination (something that would be easy for me living in the historically rich city of York). For example, we might ask participants to navigate from the train station to the city castle. This seems reasonable but is it really how tourists use map Apps? Perhaps some tourists set off with such a destination in mind and allow themselves to be distracted and attracted elsewhere along the way. Other tourists might simply be happy to wander aimlessly without anything but a vague idea for a destination until such time as they need to get back to their hotel, at which point they draw on the App not only to help them navigate but to tell them where they are. Or perhaps they wouldn’t use the App at all but instead prefer a traditional tourist guidebook that would give lots of information and especially tailored maps all in one handy package!
This relevance of experiments to real use is what ecological validity is all about in HCI and it is often carefully discussed. Many studies do take ecological validity seriously and strive to conduct studies in the most realistic context possible. The most extreme example is the kind of testing exemplified by Google, where different users simply see different interfaces. This is not a situation set up to see how people use these interfaces; it is people using these interfaces. It could not be more ecologically valid. However, the consequence of high ecological validity is often the loss of experimental control. In the context of real people using real systems, there can be many other factors that potentially influence the dependent variable. That is, ecologically valid studies have many potential confounds. Google may not need to worry about such things but, where effects are small or the aim is to develop underlying theories of interaction, more modest experiments will always need to make a trade-off between internal and ecological validity.
34.0.3.5 Validity as a whole
For all experiments, there is always compromise across all aspects of validity, at the very least ecological validity, because in order to achieve experimental control some aspects of the real world need to be constrained. Even then, though, an experiment does not have to be completely unrealistic; for example, Anna Cox (personal communication) has set up a well-devised experiment to look at how people manage email but one that naturally allows for an in-the-wild style of study in order to establish better ecological validity.
Depending on the complexity of the idea being tested in the experiment, different compromises need to be made. Internal validity can come at a cost to external validity. Too much experimental control results in the inability to generalise to a wide variety of tasks or even systems. Not enough experimental control and internal validity can be damaged. A general sample of the population is desirable but it can produce a lot of natural variation in an experiment that can mask any systematic differences that represent the goal of the experiment. Some aspects of user experience can simply be very difficult to measure, like culture or trust, so you may have to rely on a weak or, at best, a poorly validated construct in order to make progress in this area.
At the end of the day, all experiments are less than ideal. There are always compromises in validity and a researcher can only be honest and explicit about what those compromises were. And as already noted, one experiment is never enough so perhaps the best path is to acknowledge the compromises and do better (or at least something different) in the next experiment.
34.0.4 Statistical Analysis
Anyone who has a reputation (however modest) for being good at statistics will tell you that, on a more or less regular basis, they are approached with the following sort of request: “I’ve just run this experiment but I can’t work out how to analyse the data.” The situation arises in HCI (and Psychology in my experience) because the learning of the necessary statistical methods is not easy. This makes it time-consuming and effortful and furthermore tangential to the actual work of an HCI researcher – they didn’t undertake HCI research in order to learn a lot of statistics! Whereas a researcher may (relatively) quickly learn to see what might make a sensible experiment, it can take a lot more time to be confident about what would constitute appropriate statistical analysis (Hulsizer & Woolf, 2009).
The problem though is that the statistical analysis is not a ritual “tag-on” to experimental methods (Gigerenzer, 2004). It is essential for an experiment to succeed. In the face of natural variation, as exhibited by people when using interactive devices to achieve a variety of goals, there can be no certainty that the experimental manipulation really did influence the dependent variable in some systematic way. One way to see this is to consider an experiment with two conditions where each participant is scored on a test out of 100. You would be (and ought to be) very surprised if the mean score of 20 participants in each condition came out to be identical: you would suspect a cut-and-paste error at the very least. When the mean scores of a study come out as different between the conditions, this is to be expected. Statistics are needed to help identify when such differences are systematic and meaningfully related to the independent variable and when they are just natural variation that you might expect to see all the time.
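The point about natural variation is easy to demonstrate. The small Python sketch below, with invented numbers and not drawn from any real study, draws two groups of 20 scores out of 100 from exactly the same population: their means still differ, and a test such as the independent-samples t-test is what tells us whether such a difference is more than chance would produce.

```python
# Two groups of 20 scores drawn from the *same* population: the means will almost
# never be identical even though nothing systematic is going on.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
condition_a = rng.normal(loc=60, scale=15, size=20).clip(0, 100)  # scores out of 100
condition_b = rng.normal(loc=60, scale=15, size=20).clip(0, 100)  # same population

print(round(condition_a.mean(), 1), round(condition_b.mean(), 1))  # different, purely by chance
result = stats.ttest_ind(condition_a, condition_b)                 # independent-samples t-test
# A large p value here says the observed difference is the sort of thing chance produces.
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```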
Even when statistics can be done and have produced a strong result, there cannot necessarily be any certainty. Natural variation between two different random samples can occasionally be strong enough to look just like a systematic variation. It is just bad luck for the experimenter. A strong statistical result can at best be evidence in support of the experimental aims. Conversely, if an experimenter conducts a lot of tests on different aspects of the data then, by chance, some of them are quite likely to come out as indicating systematic variation. Statistics alone do not provide good evidence.
The strength of a statistical analysis comes from the notion of an experiment as a severe test. An experiment is set up to severely test a particular idea so that if the idea were incorrect, the experiment would be very likely to reveal so. The statistical analysis should serve to support the severe test by directly addressing the aim of the study and any other analysis does not really support the test (however interesting it might turn out to be).
Consider for example a study to look at the use of gestures to control music players in cars. Gestures might improve safety by not requiring drivers to attend to small screens or small buttons when selecting which track or station to listen to while driving. So an experiment is devised where drivers try out different styles of interaction while their lane-keeping performance is monitored (in a simulator!). In the analysis, it is found that there was not a statistically significant difference between interaction styles but, based on a good insight, the experimenter wondered if this might be due to differences in the handedness of the participants. Sure enough, when the dominant hand of the participants was factored into the analysis, the results were significant and moreover clearer to interpret. Result!
But this experiment is not a severe test of handedness differences in gestural interactions while driving because, if the experimenter were really interested in such differences, then there would have been a planned, not incidental, manipulation in the experiment. The experiment would not have looked like this one. For instance, the sample would deliberately have tried to balance across right- and left-handed people. But once this has been addressed, it may also be relevant which side of the road people are used to driving on because that determines which hand is expected to be free for gear changes and therefore for other hand gestures. The experimental design ought to factor this in as well. And so the experiment to test this idea is now quite different from the original experiment where handedness was incidental.
This explicitly builds on what I have previously called the gold standard statistical argument (Cairns & Cox, 2008). In that account, the aim of the experiment is important because prediction makes unlikely events interesting. Without prediction, unlikely events are just things that happen from time to time. This is the principle behind many magic tricks where very unlikely events, such as a hand of four aces or guessing the card a person has chosen in secret, which might happen by chance occasionally, are done on the first go in front of an audience. An experiment might exhibit the outcome by chance but it carries weight because it happened on the first go when the experimenter said it would. Under severe testing, it is even stronger than this though because the prediction is the foundation of the structure of the experiment. With a different prediction, a different experiment entirely would have been done.
What is interesting is that this has not always been apparent even to experienced researchers (Cohen, 1994). There is a lot of criticism of null hypothesis significance testing (NHST) which is the usual style of testing most people think of when talking about statistics. In NHST, an experiment has an alternative hypothesis which is the causal prediction being investigated in the experiment. It is very hard to “prove” that the alternative hypothesis holds in the face of natural variation so instead, experimenters put forward a null hypothesis that the prediction does not hold and there is no effect. This assumption is then used to calculate how likely it is to get the data from the experiment if there really is nothing going on. This results in the classic p value beloved of statistics. The p value is the probability of getting the data purely by chance, that is if the null hypothesis holds and the prediction, the alternative hypothesis, is wrong. The criticism of this approach is that the null hypothesis and not the alternative hypothesis is used in the statistical calculations and so we learn nothing about the probability of our prediction which is surely what we are really interested in (Cohen 1994).
But this criticism is not logically sound. The fallacy lies in overlooking that the whole experiment is devised around the prediction: the null hypothesis is the only place where the counter to the prediction is considered. (There is also a lot of woolly thinking around the meaning of probabilities in the context of experiments which is explicitly related to this but that’s for another time.)
Coming back to the researcher who has devised the experiment but not the statistical analysis, it is clear that they are in a mess for a lot of reasons. First, if they do not know what to analyse, then it suggests that the idea that the experiment can test is not properly articulated. Secondly, it also suggests that perhaps they are looking for a “significant result” which the experiment simply is not in a position to give, often described as fishing. Where the idea is well articulated but the analysis did not give the desired result, the remedy may simply be help for them to see that the “failure” of a well-devised experiment is of value in its own right. It’s not as cool as a significant result but it can be as insightful. Fishing for significance is not the answer.
Given the challenges of setting up an experiment as a severe test and the narrowness of the results it can provide, there is a fall-back position for researchers, namely, that experiments can be exploring the possible influences of many factors on particular interactions or experiential outcomes. In which case, the experiment is not a severe test because there is no particular idea that is being put under duress. Therefore, statistics cannot possibly provide evidence in support of any particular factor influencing the outcome of the experiment. What statistics can indicate is where some results are perhaps more unlikely than might be expected and therefore worthy of further investigation. This may seem like a rather weak “out” for more complex experimental designs but, for researchers who say it does not matter, my response would be to ask them what philosophy of experiment they are using instead. In the face of natural variation and uncertainty, what constitutes evidence is not easy to determine.
34.0.4.1 Safe analysis
Knowing there are pretty dangerous pitfalls in statistical analysis and that statistics is a challenging area to master, I recommend a solution well known to engineers: keep it simple stupid (KISS). According to severe testing, the experiment should put under duress an idea about how the world works to see if the idea is able to predict what will happen. But as was seen in the previous section, an actual experiment must narrow down many things to produce something that is only a test of a single aspect of the idea. No experiment can test the whole idea (Mayo, 1996). So recognising that, it is always better to have several experiments. There is no need to expect one experiment to deliver everything. Furthermore, if each separate experiment offers clear and unambiguous results then this is a lot better than one big experiment that has a complex design and an even more complex analysis. Simpler experiments are actually more severe tests because there is nowhere to hide: something either clearly passes the test or it does not.
So what are the simple designs that lead to simple, safe statistics? Here are some simple rules that I would recommend to any researcher:
No more than two independent variables, ideally one
At most three conditions per independent variable, ideally two
Only one primary dependent variable (though you should measure other things to account for alternative explanations, accidental confounds, checks on the experimental manipulation and so on)
These rules seem very restrictive but there is good reason for them: the statistical tests for designs like this are very straightforward. There are well-established parametric and non-parametric tests that are easy to understand, both in terms of applying them and in terms of interpreting them, and the ones for two conditions are even easier than the ones for more than two. It may seem unduly restrictive but keep in mind the idea of a severe test: if there are more variables, independent or dependent, what exactly is being severely tested? And if there are lots of conditions, is there a clear prediction about what will happen in each of the conditions? If not, then the experiment is not testing a well-formed prediction.
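To show how simple the “safe” analysis can be, here is a minimal sketch in Python for the most basic design of all: one independent variable with two conditions and one primary dependent variable. The task times are invented, and whether the parametric or non-parametric test is appropriate depends on the data actually collected.

```python
# One IV with two conditions, one primary DV (task time in seconds); data invented.
from scipy import stats

old_layout = [48.2, 51.7, 44.9, 60.3, 55.1, 47.8, 52.6, 49.0]
new_layout = [41.5, 39.8, 45.2, 38.7, 43.9, 40.1, 44.6, 37.9]

# Parametric test if the data look roughly normal...
t_res = stats.ttest_ind(new_layout, old_layout)
# ...or the usual non-parametric equivalent if not.
u_res = stats.mannwhitneyu(new_layout, old_layout, alternative="two-sided")

print(f"t-test:       t = {t_res.statistic:.2f}, p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney: U = {u_res.statistic:.1f}, p = {u_res.pvalue:.4f}")
```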
Let’s see how this might work in practice. Suppose you are interested in the best layout of interfaces on touchscreens for use by older people. Even just thinking about buttons, there are various factors that are potentially important: button size, button spacing, text on the buttons, the use of a grid or a staggered layout and so on. If you do not know anything at all about this area, the first thing would be to find out whether any of these factors alone influence the use of touchscreen devices by older people. Take, for instance, button size. The overall goal might be to determine which button size is best. But if you don’t even know how big an effect button size can have, then why try out a dozen? Try out the extreme ends, very small and very large, of what might be considered reasonable sizes. And actually, when you think about it, there is not really scope for a huge range on most tablets or smartphones. If the extreme ends do not make an appreciable difference then sizes that are more similar are not going to be any better. And if button size does make a difference, only then is it worth seeing how it might interact with button spacing.
The question remains of what is meant by “best” in a layout. Is it speed and accuracy? These two things generally are measured together as there is often a speed/accuracy trade-off (Salthouse, 1979). But even so, in the context of this work, which is more important? It may be accuracy provided the speed effect is modest. So there’s the primary dependent variable with speed simply offering a possible guard against explanations that have nothing to do with button size.
But of course, as good HCI researchers, we always worry about user experience. Which is preferred by users? That’s an entirely different issue. It, of course, may be measured alongside speed and accuracy but if one layout is highly accurate but less preferred which is best? And if accuracy is irrelevant, then do not measure that in the first place. It’s a red herring!
The temptation is to devise a complicated experiment that could look at all of these things at once but the analysis instantly escalates and also opens up the possibility of over-testing (Cairns, 2007). By contrast, it should be clear that a series of experiments that target different variables are able to give much more unambiguous answers. Moreover, if several different experiments build up a consistent picture then there is far more evidence for a “best layout” because one experiment is always open to the possibility of just getting a good result by chance. This is less the case with lots of experiments and even less so if the experiments are different from each other.
34.0.4.2 Interpreting analysis
Having devised an experiment that is simple to analyse, it is important also not to fall at the final hurdle of interpreting the results of the statistical tests. All statistics, under the traditional NHST style of statistical analysis, ultimately produce a statistic and a p value. The p value is the probability of the experiment having given the result it did purely by chance. The threshold of significance is almost always 0.05 so that if the probability of a chance outcome is less than 1 in 20, the experimental result is declared to be significant. There are other important thresholds as summarised in Table 1. These thresholds are all purely conventional and without specific scientific meaning outside of the convention but within the convention, if a test produces a p < 0.05 then it is deemed to be significant and the experiment has “worked.” That is, the experiment is providing evidence in support of the alternative hypothesis. If p comes out as more than 0.05 then the experiment has not worked and the null hypothesis is more likely.
But this is not really a fair picture. Because the thresholds are conventionalised, we need to recognise the apparent arbitrariness of the conventions and not make such black-and-white interpretations. So in particular, if p is close to 0.05 but slightly bigger than it, then there is the possibility that the experiment is working as intended but there is some issue, such as a small sample size, a source of noise in your data or an insufficiently focused task, that means the experiment is not able to give an unambiguous answer. Failure to meet the 0.05 threshold is not an indication that nothing is going on. Indeed, a non-significant result does not show that nothing is going on, only that this experiment has not shown what was expected to happen. So where p < 0.1 (but above 0.05) it is usual to describe this as approaching significance or marginally significant and it should be interpreted, very cautiously, as potentially becoming interesting.
p-value | Typical descriptors | Typical but incautious interpretation |
0.1 < p | Not significant | Prediction is not true |
0.05 < p ≤ 0.1 | Marginally significant, Approaching significance | Prediction is likely to be true but might not be |
0.01 < p ≤ 0.05 | Significant | Prediction is pretty sure to be true |
0.001 < p ≤ 0.01 | Significant, Highly significant | Prediction is really true |
p ≤ 0.001 | Significant, Highly significant | Prediction is really true |
Table 1: The conventional thresholds for interpreting p values and their inflexible interpretations
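For what it is worth, the conventional bands in Table 1 are easy to encode, as in the small Python helper below; but, as the surrounding discussion makes clear, the labels are conventions to be reported, not rigid truths to be believed.

```python
# The conventional descriptors from Table 1 as a helper; the bands are convention only.
def describe_p(p):
    if p <= 0.001:
        return "highly significant"
    if p <= 0.01:
        return "significant (often also reported as highly significant)"
    if p <= 0.05:
        return "significant"
    if p <= 0.1:
        return "marginally significant / approaching significance"
    return "not significant"

for p in (0.0004, 0.03, 0.07, 0.4):
    print(p, "->", describe_p(p))
```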
At the other extreme, where p is very small, say 0.001 or even less, this is not an indication that the result is more convincing or important than any other p value less than 0.05. Such values can certainly come out by chance (one-in-a-million chances occur a lot more often than you might think, depending on how you frame them) but if an effect was expected in the experiment then such small values ought to appear as well. This is why there is a move, within psychology at least, to always report effect sizes where possible. Effect sizes are measures of importance because they indicate how much effect the experimental manipulation had on the differences seen in the data. Unlike p values, though, there is no convention for what makes an important effect size; it depends on the type of study you are doing and what sort of effects are important.
One measure of effect that is used alongside t-tests is Cohen’s d. This statistic basically compares the difference in means between two experimental conditions against the base level of variation seen in participants. This variation is represented by the pooled standard deviation in the data collected but do not worry if you are not sure what that means (yet). A d value of 1 means (roughly) that the experimental conditions have as much effect as 1 standard deviation between participants, that is, the average variation between participants. That’s a really big effect because the experimental manipulation is easily perceived compared to natural variation. By contrast, d=0.1 means that the average variation between participants is a lot larger than the variation caused by the experimental conditions so the effect is not easily seen. What is interesting is that with a large sample it is possible, in fact common, to have p<0.001 but a small d value. This should be interpreted as a systematic difference in conditions but one which is easily swamped by other factors. Unlike p values, the importance of such small effects is heavily context dependent and so requires the experimenter to make an explicit interpretation.
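As a concrete sketch (in Python, with invented scores), Cohen's d is just the difference in condition means divided by the pooled standard deviation:

```python
# A minimal sketch of Cohen's d: difference in means over the pooled standard deviation.
import math
import statistics

def cohens_d(group_a, group_b):
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)  # sample variances
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

with_gesture = [62, 70, 68, 75, 66, 71, 69, 73]     # invented scores for two conditions
without_gesture = [58, 64, 61, 69, 60, 65, 63, 67]
print(f"d = {cohens_d(with_gesture, without_gesture):.2f}")  # well above 1: a large effect
```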
Thinking this through further, a small sample that gives a significant result is also more likely to be showing a big effect. That is, a smaller sample that reaches significance is potentially more convincing than a larger one! This runs counter to much advice on sample sizes in experiments but it is a consequence of thinking about experiments as severe tests (Mayo and Spanos, 2010). If an idea is able to predict a robust effect then it ought to be seen in small samples when put under a severe test. What small samples do threaten is generalisability, and they raise the risk of confounds from consistent differences that appear by chance within a small group of participants. Even so, where these problems do not arise, small samples provide more convincing evidence!
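A rough simulation can illustrate the severity point (this is only a sketch under assumed normally distributed data, not a formal power analysis): a genuinely large effect of around d = 1 is detected most of the time even with 20 participants per group, whereas a small effect of d = 0.1 rarely is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def proportion_significant(effect_d, n_per_group, runs=2000, alpha=0.05):
    """Simulate many small experiments on normally distributed data and
    count how often a two-sample t-test reaches significance."""
    hits = 0
    for _ in range(runs):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_d, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / runs

print(proportion_significant(1.0, n_per_group=20))  # large effect: significant in most runs
print(proportion_significant(0.1, n_per_group=20))  # small effect: significant only rarely
```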
The real snag with effect sizes is that they are only easily generated for parametric tests, that is, situations where the data follow the classic bell curve of the normal distribution. Many situations in HCI do not have such data. For example, data on the number of errors that people make typically has what is called a long-tailed distribution, with most participants making very few errors, a smaller number making several errors and one or two people making a lot of errors. Such data can only be robustly analysed using non-parametric tests, and these do not provide easy measures of effect size. There are emerging measures of effect size that can be used in non-parametric situations (Vargha and Delaney, 2000) but they are not yet widely used and therefore not widely understood. Hopefully this is merely a matter of time, and perhaps something that the HCI community, particularly within computer science, could be in a position to lead.
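One such measure is the Vargha-Delaney A statistic (Vargha and Delaney, 2000), which estimates the probability that a randomly chosen value from one condition is larger than one from the other, with 0.5 meaning no effect. A minimal Python sketch, using made-up error counts purely for illustration, might look like this:

```python
def vargha_delaney_A(x, y):
    """Vargha-Delaney A: the probability that a random observation from x
    exceeds a random observation from y, counting ties as a half.
    A = 0.5 indicates no effect; values near 0 or 1 indicate large effects."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))

# Illustrative (made-up) error counts for two interface designs:
errors_design_a = [0, 0, 1, 1, 2, 3, 9]
errors_design_b = [0, 1, 2, 2, 3, 4, 12]
print(vargha_delaney_A(errors_design_a, errors_design_b))
# About 0.36, i.e. below 0.5: design A tends to produce fewer errors.
```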
Overall then, statistical analysis should be planned as part of the experiment and wherever possible simpler designs preferred over more complex ones so that the interpretation is clear. Moreover, where possible, effect sizes offer a meaningful way to assess the importance of a result, particularly when results are very significant or marginally significant.
34.0.5 Experimental Write-up
It should be clear why both experimental design and statistical analysis are important pillars in the development of experiments. It may not be so clear why the write-up is sufficiently important to be considered a pillar. Of course, a report in the form of an article or dissertation is essential to communicate the experiment and gain recognition for the work done. Hornbaek (2013) articulates many of the important features of a write-up (and I recommend the reader read that article alongside this) in terms of enabling the reader to follow the chain of evidence being presented and to value it. To be clear, though, people do not write up experiments in order to be replicated: a researcher who replicates an experiment may struggle to get published even when replication seems essential (Ritchie et al., 2012). Hornbaek (ibid) suggests that a write-up can help experimental design but I would go further. An experimental write-up is an important pillar in the development of experiments because it forces a commitment to ideas and, from that, the experiment, its validity, its analysis and its meaning become scrutable.
Contrast this with describing an experiment in an academic conversation such as a supervision. Some of the details may yet be fluid and revised as a result of the conversation. This is the advantage of dialogue, as Socrates knew (Plato, 1973, Phaedrus). It is through talking and engaging with a person that their ideas and thoughts are best revealed. However, a dialogue is ephemeral and, YouTube notwithstanding, is generally only accessible to those present at the time. By writing the experimental description down, the details are usually presented in full (or at least substantially) and made available to others (usually supervisors) to understand and critique. As a consequence, it is possible to critique the experiment for validity in a measured and considered way.
If an experiment is found to be lacking in validity after it has been conducted then there is a problem: the data cannot be relied on to be meaningful. Or, put another way, the test was not severe and so contributes little to our understanding. I would argue that an experiment not only could be written up before it is conducted but should be, because without a write-up in advance, how can a researcher be sure what the experiment is, let alone whether it is good?
What particularly supports writing up in advance is that all experimental reports have a more or less established structure:
Title and Abstract
Motivation/Literature Review
Experimental Method
Results
Discussion
And within the Experimental Method section, there are also clearly defined subsections:
Hypothesis – the idea being tested
Participants – a summary of who took part
Design – specifying the independent and dependent variables
Materials
(Tasks)
Procedure
Tasks are not always considered separately but this is a particular issue for HCI, as discussed with reference to External Validity above, because the Tasks may in fact be the subject of study rather than the participants.
The literature review perhaps has a special place in the experimental write-up in that it is not about the experiment itself but rather about what it might mean. A literature review in this sense serves two purposes: first, to motivate the experiment as having some value; second, to give readers enough background to understand what the experiment is showing. The motivation for value comes from doing work that is either important or interesting. For instance, an experiment looking to reduce errors could be important because it could save time, money or even lives. Or it might be interesting simply because lots of people are looking at that particular problem, for instance, whether new game controllers lead to better user experiences. In the best cases, the experiment is both important and interesting by solving problems that people in the research community want solving and also by having an impact outside of the research community in ways that other people, communities or society value.
A literature review in itself, though, does not speak to what an experiment ought to be: there is no necessity in any experiment because, like any designed artefact, it must fit the context in which it is designed. Instead, a literature review perhaps scopes what a good experiment should look like by defining the gap in knowledge that a suitable experiment could fill. How an experimenter fills that gap is a matter of design. In fact, a literature review is not even necessary for doing a good quality experiment, but without one the risk is that you redo an existing experiment or, worse, do an experiment that no-one else is interested in.
There are many good textbooks on experimental write-ups. My favourite is Harris (2008) but that is purely a matter of taste. Such books give clear guidelines not only on the structure of an experimental report, in terms of these headings, but also on what should be expected under each heading. Rather than rehearse those details here, the goal is to relate the typical method section of an experimental write-up back to the previous two pillars of experimental work as described here.
Construct validity, or “are you measuring what you think you are measuring”, appears primarily in the design when the dependent variable is specified. This must relate to the concepts under scrutiny either through being obvious (time is a measure of efficiency), through comparison with other studies or because a case has been made for why it is a valid measure. The latter two ought to have been clearly established in the literature review. In particular, as has been mentioned, an experiment that is used to validate a questionnaire and test a different idea at the same time does not make sense. Also, having a valid construct means that the statistical analysis must focus only on that construct. There is no equivocation about what constitutes the severe test when it comes to the statistics.
Internal validity concerns confounds and the relevance of the experiment. Confounds typically become visible in the materials, tasks and procedure sections, where problems may arise from either what is used in the experiment or what is done in it. Participants may not be guided correctly, or there may be issues in training, and these are usually represented somewhere in the materials or procedure sections. Furthermore, participants themselves may be a source of confounds because some are more experienced than others in a way that is relevant to the experimental tasks. Initially, these sorts of problems may not be apparent but, in writing up these sections, doubts should not only be allowed to creep in but should be actively encouraged.
Internal validity also permeates the statistical analysis. Where an experiment has a clear causal connection to establish, the statistics should all be related to establishing whether or not that connection exists, at least in the data gathered. Other analysis may be interesting but does not contribute to the internal argument of the study nor can it constitute a severe test of a different idea.
External validity is about how well this experiment is representative of other experiments and therefore to what extent its results generalise and would be seen in other, similar experiments. Participants and tasks are obvious directions in which to generalise and it is these that need to be clear. It is not necessary to articulate specifically what the expected generalisation is (at least not in the method section) but rather to leave it to readers to make this call for themselves. But at the point of devising an experiment, a researcher should be able to ask themselves to what extent the experiment has a credible generalisation.
Ecological validity is the least easy to specify in a write-up and essentially has to be a judgment call over the entire structure of the experiment. The design and procedure sections basically describe what happens in the experiment. In writing up an experiment, there may be specific references to ecological validity for particular design decisions, particularly where it is being traded against the other forms of validity. In some cases, though, it is not necessary to refer explicitly to ecological validity but rather to let it be understood through the experiment as a whole. The reader must decide for themselves to what extent the experiment is sufficiently relevant to real-world interactions between people and systems.
The Discussion section is where a researcher is able to defend their choices and acknowledge the limitations of the experiment done. Usually, the discussion is considered only after the results are in because naturally, you cannot discuss the results till you know what they are. But this is not quite true. An experiment, strictly speaking, produces only one of two results: there is a significant difference in relation to the idea being tested or there is not a significant difference. Significance is a great result but what would be the discussion if it were not significant? Normally, the discussion looks at the limitations of the experiment in this case and suggests where there may be weaknesses leading to an unexpected null result. Yet this could be written in advance in (pessimistic) anticipation of just such a result. So why not do that first then make the experiment better? And if the experiment cannot be made better, then this can be thrown out as a challenge to future researchers to see if they can do better because when all is said and done, every experiment has its limitations.
It may of course be the case that there is no effect to be seen in the experiment despite a cogent argument, presumably earlier in the report, that there should be (otherwise why do the experiment at all). This highlights, though, a particular weakness of experiments. They are very good at showing when things are causally related but poor at demonstrating the absence of an effect. An absence of effect could be because the effect is hard to see, not because it is actually absent. Weaknesses in experimental design, measurement of variables or variability between participants could all account for failing to see an effect, and there cannot be any certainty. The only situation where a null result may carry more weight is where the effect has previously been strongly established but has failed to re-appear in some new context; even then, elements of the experiment may be preventing the effect from being seen. Where experiments are close to replications of other studies, then, null effects may start to get interesting, and in that case, once again, the discussion around the null result can be written in advance.
Where the results are significant, the experiment is still limited. It is only one test of one aspect of some larger idea. So where are the limitations that lead to the further work? What were the compromises made in order to produce this particular experiment? How might they be mitigated in future? What else could be done to test this same aspect of the idea a different way? And what other aspects ought to be tested too? It is perfectly possible to have two versions of the discussion ready to insert into a report depending on either outcome of the analysis.
There are of course subtleties in any experiment that produce unexpected outcomes. Yes, the result is significant but the participants did strange things, or were all heavy gamers, or hated iPhones. These need more careful discussion and really can only be written after the data has been gathered. But even accounting for this, most of the experimental write-up can be constructed before running a single participant: the literature review, the experimental method, the discussion and even the skeleton of the results, because the statistical analysis should also be known in advance. Moreover, if the experiment is not convincing to you or a colleague, then no time has been wasted in gathering useless data. Make the experiment have a convincing write-up before doing the study and it is much more likely to be a good experiment.
34.0.5.1 Summary
Each aspect of a write-up is necessary for a reason. The Method section reveals validity. The Discussion section accounts for the compromises made in validity and examines whether they were acceptable. The Results section provides the analysis to support the Discussion. Together they make any experiment scrutable to others but, more usefully, they can be used ahead of doing an experiment to make it scrutable to the experimenter.
A general maxim that I use when writing up is that I need to be so clear that if I am wrong it is obvious. I believe that it is through such transparency and commitment to honesty that science is able to advance (Feynman, 1992). How this plays out in experimental write-ups is that if the experiment is giving a wrong result (albeit one I cannot see), then the diligent reader should be able to see it for me. The intention of any researcher is to do the best experiment possible but even with the best will in the world, this is not always achievable. The write-up is about acknowledging this not only to the reader but also to yourself.
34.0.6 Summing up the experimental pillars
Experiments are an important method in the HCI researcher’s toolkit. They have a certain “look and feel” about them which is easy to identify and also moderately easy to emulate. The problem is that if the formal, apparently ritualistic, structures of an experiment are observed without understanding the purpose of the formalities, there is a real risk of producing an experiment that is unable to provide a useful research contribution, much like Cargo Cult Science (Feynman, 1992).
I have presented here what I think are key pillars in the construction of an effective experiment to reveal the reason for the formalities against the backdrop of experiments as severe tests of ideas. The experimental design is constructed to make the test valid and the write-up makes the validity scrutable. The analysis reveals whether the data supports the idea under test and so is an essential component of the experimental design. Furthermore, the write-up can be used as a constructive tool to allow a researcher to enter a dialogue with themselves and others about the effectiveness of the experiment before it has been carried out.
Of course, not every problem can be solved with an experiment but, where there is a clearly articulated idea about how the world works and how one thing influences another, an experiment can be a way to show this that is both rigorous and defensible. The strength of an experiment, though, comes from three things coming together: the experimental design to produce good data; the statistical analysis to produce clear interpretations; and the write-up to present the findings. Without any one of these, an experiment is not able to make a reliable contribution to knowledge. This chapter goes a step further and holds that, by taking each of these seriously before running an experiment, it is possible to produce better, more rigorous and more defensible experiments in Human-Computer Interaction.
34.0.7 References
Abelson, R. P. (1995) Statistics as Principled Argument. Lawrence Erlbaum Associates.
Andersen, E., O'Rourke, E., Liu, Y-E., Snider, R., Lowdermilk, J., Truong, D., Cooper, S. and Popovic, Z. (2012) The impact of tutorials on games of varying complexity. Proc. of ACM CHI 2012, ACM Press, 59-68
Blythe, M., Bardzell, J., Bardzell, S. and Blackwell, A. (2008) Critical issues in interaction design. Proc. of BCS HCI 2008 vol. 2, BCS, 183-184.
Blythe, M., Overbeeke, K., Monk, A.F., Wright, P.C. (2003) Funology: from usability to enjoyment. Kluwer Academic Publishers.
Cairns, P. (2007) HCI. . . not as it should be: inferential statistics in HCI research. Proc. of BCS HCI 2007 vol 1, BCS, 195-201.
Cairns, P. (2013) A commentary on short questionnaires for assessing usability. Interacting with Computers, 25(4), 327-330
Cairns, P., Cox, A. (2008a) Using statistics in usability research. In Cairns and Cox (2008b).
Cairns, P., Cox, A., eds (2008b) Research Methods for Human-Computer Interaction, Cambridge University Press.
Card, S. K., Moran, T. P. and Newell, A. (1980) The keystroke-level model for user performance time with interactive systems. Communications of the ACM, 23(7), 396-410.
Chalmers, A.F. (1999) What is this thing called science? 3rd edn. Open University Press
Cohen, J. (1994) The earth is round (p < .05). American Psychologist, 49(12), 997-1003.
Cox, A.L., Cairns, P., Berthouze, N. and Jennett, C. (2006) The use of eyetracking for measuring immersion. In: (Proceedings) workshop on What have eye movements told us so far, and what is next? at Proc. of CogSci2006, the Twenty-Eighth Annual Meeting of the Cognitive Science Society, Vancouver, Canada, July 26-29, 2006.
Cox, A. L., & Young, R. M. (2000). Device-oriented and task-oriented exploratory learning of interactive devices. Proc. of ICCM, 70-77.
Dowell, J., Long, J. (1989) Towards a conception for an engineering discipline of human factors. Ergonomics, 32(11), 1513-1535.
English, W.K., Engelbart, D.C. and Berman, M.L. (1967) Display-selection techniques for text manipulation, IEEE Trans. Hum. Factors Electron., vol. HFE-8, Mar. 1967, 5-15.
Feyerabend, P. K. (2010) Against Method, 4th edn. Verso.
Feynman, R. P. (1992) Surely you’re joking, Mr Feynman. Vintage.
Field, A., Hole, G. (2003) How to Design and Report Experiments. Sage Publications, London.
Gergle, D. and Tan, D. (2014). Experimental Research in HCI. In Olson, J.S. and Kellogg, W. (eds) Ways of Knowing in HCI. Springer, 191-227.
Gigerenzer, G. (2004) Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.
Hacking, I. (1983) Representing and intervening. Cambridge University Press.
Harris, P. (2008) Designing and Reporting Experiments in Psychology, 3rd edn. Open University Press.
Hassenzahl, M., Monk, A. (2010) The Inference of Perceived Usability From Beauty. Human–Computer Interaction, 25(3), 235-260.
Hornbaek, K. (2013) Some whys and hows of Experiments in Human-Computer Interaction. Foundations and Trends in Human–Computer Interaction, 5(4), 299–373.
Hornbæk, K. and Law, E. (2007) Meta-analysis of Correlations among Usability Measures, Proc. of ACM CHI 2007, ACM Press, 617-626.
Hulsizer, M. R, Woolf, L. M. (2009) A guide to teaching statistics. Wiley-Blackwell.
Jennett, C., Cox, A.L., Cairns, P., Dhoparee, S., Epps, A., Tijs, T., Walton, A. (2008). Measuring and Defining the Experience of Immersion in Games. International Journal of Human Computer Studies, 66(9), 641-661
Jordan, P. (2002) Designing Pleasurable Products. CRC Press.
Kahneman, D. (2012) Thinking Fast and Slow, Penguin.
Kline, P. (2000) A Psychometrics Primer. Free Association Books.
Kuhn, T. S. (1996) The Structure of Scientific Revolutions, 3rd edn. University of Chicago Press.
Lawson, B. (1997) How designers think: the process demystified. 3rd edn. Architectural Press.
Lazar, J., Feng, J. H., Hochheiser, H. (2009) Research Methods in Human-Computer Interaction. John Wiley & Sons.
MacKenzie, I. S. (1992) Fitts' law as a research and design tool in human-computer interaction. Human-Computer Interaction, 7(1), 91-139.
MacKenzie, I. S., Buxton, W. (1992) Extending Fitts’ Law to 2d tasks, Proc. of ACM CHI 1992, 219-226
Malone, T. W. (1982) Heuristics for designing enjoyable user interfaces: Lessons from computer games. Proc. of ACM CHI 1982, ACM Press, 63-68.
Mayo, D. (1996) Error and the Growth of Experimental Knowledge, University of Chicago Press.
Mayo, D, Spanos, A. eds (2010) Error and Inference. Cambridge University Press.
McCarthy, J., Wright, P. (2007) Technology as Experience, MIT Press
Nordin, A.I., Cairns, P., Hudson, M., Alonso, A., Calvillo Gamez, E. H. (2014) The effect of surroundings on gaming experience. In Proc. of 9th Foundations of Digital Games.
Popper, K. (1977) The logic of scientific discovery. Routledge.
Sharp, H., Preece, J., Rogers, Y. (2010) Interaction Design, 3rd edn. John Wiley & Sons.
Plato (1954) The Last Days of Socrates. Penguin.
Plato (1973) Phaedrus and Letters VII and VIII. Penguin.
Purchase, H. (2012) Experimental Human-Computer Interaction, Cambridge University Press.
Ravaja, N., Saari, T., Salminen, J., Kallinen, K. (2006) Phasic Emotional Reactions to Video Game Events: A Psychophysiological Investigation. Media Psychology, 8(4), 343-367.
Ritchie, S.J., Wiseman, R, French, C. C. (2012) Failing the Future: Three Unsuccessful Attempts to Replicate Bem's ‘Retroactive Facilitation of Recall’ Effect. PLoS ONE, 7(3): e33423.
Salthouse, T. A. (1979) Adult age and the speed-accuracy trade-off, Ergonomics, 22(7), 811-821.
Sauro, J., Lewis, J. R. (2012) Quantifying the user experience. Morgan Kaufmann.
Sears, D. O. (1986), College sophomores in the laboratory: Influences of a narrow data base on social psychology's view of human nature. Journal of Personality and Social Psychology, 51(3), 515-530.
Simons, D. J., Chabris, C. F. (1999) Gorillas in our midst. Perception, 28, 1059-1074.
Smith, J. (2012) Applying Fitts’ Law to Mobile Interface Design. http://webdesign.tutsplus.com/articles/applying-fitts-law-to-mobile-interface-design--webdesign-6919, accessed 28th April, 2014.
Suchman, L. (1987) Plans and Situated Actions, 2nd edn, Cambridge University Press.
Thimbleby, H. (2013) Action Graphs and User Performance Analysis. International Journal of Human-Computer Studies, 71(3), 276–302.
Tidwell, J. (2005) Designing Interfaces: Patterns for Effective Interaction Design. O’Reilly.
Tullis, T., Albert, W. (2010) Measuring the User Experience, Morgan Kaufmann.
Vargha, A., Delaney, H. D. (2000) A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2), 101-132.
Wilson, M. (2002) Six views of embodied cognition. Psychonomic Bulletin & Review, 9, 625–636.
Wobbrock, J.O., Findlater, L., Gergle, D. and Higgins, J.J. (2011) The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures. Proc. of ACM CHI 2011, ACM Press, 143-146.
Yin, R. K. (2003) Case Study Research: Design and Methods, 3rd edn. Sage Publications.