The measurement instruments used in research should be reliable and valid. The reliability of a measurement instrument is high when reapplying the instrument yields a value that is close (ideally identical) to the first value. The validity of a measurement instrument is high when the instrument actually measures the phenomenon it is supposed to measure.
Figure 16 illustrates the difference between reliability and validity. Assuming darts players all aim for bull’s eye, some players do a better job than others at using their body (arms, eyes, brain, and the will power to stay focused) as an instrument. The best darts players throw a pattern like the one in the bottom right panel. Players who consistently throw darts in a place other than bull’s eye show a high level of reliability – applying the same instrument gets them similar results. The reliable but not valid instrument shown in the bottom left panel is better than the one in the top left panel, which is neither reliable nor valid. The instrument in the top right panel is a bit better, if you have the option to use it enough times – the ‘average throw’ would land exactly in bull’s eye, but every single attempt is off.
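If you prefer numbers to darts, the four panels can be reproduced with a minimal simulation. The sketch below is a hypothetical illustration (bull’s eye at the origin, Gaussian throws, and arbitrarily chosen bias and spread values) that treats each player as a measurement instrument: a small systematic error of the average throw means high validity, and a small scatter around that average means high reliability.

```python
import numpy as np

rng = np.random.default_rng(42)
n_throws = 1000

# Each player is an instrument: 'bias' is systematic error (low validity),
# 'spread' is random error (low reliability). Values are illustrative.
players = {
    "top left (not reliable, not valid)":     {"bias": (3.0, 3.0), "spread": 2.0},
    "top right (not reliable, valid on avg)": {"bias": (0.0, 0.0), "spread": 2.0},
    "bottom left (reliable, not valid)":      {"bias": (3.0, 3.0), "spread": 0.3},
    "bottom right (reliable and valid)":      {"bias": (0.0, 0.0), "spread": 0.3},
}

for name, p in players.items():
    throws = rng.normal(loc=p["bias"], scale=p["spread"], size=(n_throws, 2))
    systematic = np.linalg.norm(throws.mean(axis=0))  # distance of the average throw from bull's eye
    scatter = throws.std(axis=0, ddof=1).mean()       # spread around the player's own average
    print(f"{name:41s} systematic error = {systematic:.2f}, scatter = {scatter:.2f}")
```

Running this shows why the top right instrument is only useful when applied many times: its average lands on target (low systematic error) while each individual throw scatters widely.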
Another analogy is the thermometer. If a parent thinks her child has a fever because the child looks pale, she can touch the child’s forehead to feel how warm it is. This is a first measure of the child’s temperature. Suppose the child feels warm. Using a thermometer will then give another (and quantitative) measure of the child’s temperature. The thermometer – assuming it is accurate – could confirm the parent’s first reading, or perhaps disconfirm it and reveal that the child does not have a fever. The original measure would then be a type I error: a false positive. The type II error – a false negative, not detecting a fever while the child does have one – could occur if the parent and the child are both feverish. The child being warm could go unnoticed if the parent is warm too.
The reliability of specific measures increases with the number of data points used to create the measure. Therefore, it is typically better to use multiple instruments to measure the same phenomenon. Survey and experimental research often include scales consisting of multiple items rather than a single question. The commonly used Cronbach’s Alpha coefficient gives you an impression of the reliability of a scale. Keep in mind, though, that the Alpha coefficient almost always increases with the number of items used, even if the additional items contribute little to the reliability of the measure (Cortina, 1993). This is a direct result of the definition of Alpha. You should also know that there are no absolute thresholds for what counts as an acceptable degree of reliability. As a rule of thumb, many researchers use values such as .70. With only three items, however, a value of .60 is pretty good, but with six items the same .60 indicates a low level of reliability. With only three items, .80 is excellent, but with six items it is merely ‘good’. Using more items typically has decreasing marginal utility: adding a fourth item to a scale of three tends to increase Alpha more strongly than adding a ninth item to a scale of eight, as the sketch below illustrates.
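To see why Alpha behaves this way, recall its definition: Alpha = k/(k − 1) × (1 − sum of the item variances / variance of the sum score), where k is the number of items. The sketch below is a hypothetical simulation (the factor loading, sample size, and scale lengths are chosen purely for illustration) in which every item measures the same trait equally well; it shows Alpha rising with k even though no item is better than another.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's Alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: each item reflects the same underlying trait equally
# well, plus independent noise, so all items are equally informative.
rng = np.random.default_rng(0)
n = 10_000
trait = rng.normal(size=n)

def simulate_item():
    return 0.65 * trait + rng.normal(size=n)

for k in (3, 4, 8, 9):
    scale = np.column_stack([simulate_item() for _ in range(k)])
    print(f"k = {k}: Alpha = {cronbach_alpha(scale):.2f}")
```

With these illustrative numbers, Alpha climbs from roughly .56 at three items to roughly .79 at nine, and the jump from three to four items is clearly larger than the jump from eight to nine – the decreasing marginal utility described above.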
The validity of measures is higher when they are not clouded by other influences. Common problems in survey research are that respondents fail to report accurately on their behavior and give more positive answers to hypothetical questions than to questions about facts. Observational measures of behavior are therefore to be preferred over self-reported measures, and factual questions (e.g., ‘Did you give to a charitable cause in the past month?’) over hypothetical ones (‘Would you give to a charity if asked?’). The tendency of respondents to report socially desirable attitudes and behaviors has often been studied as a ‘response bias’ (Meehl & Hathaway, 1946; Crowne & Marlowe, 1960). However, it is clear that this tendency is not a uniform personality trait with strong effects on different behaviors (Parry & Crossley, 1950); it depends on situational characteristics and on other personality characteristics. The current consensus is that measures of ‘social desirability’ reflect both substance and style, and that researchers should not use them to correct responses to other questions (McCrae & Costa, 1983; Connelly & Chang, 2016). A credible guarantee of partial anonymity (Joinson, 1999; Lelkes et al., 2012), a forgiving introduction (“For understandable reasons, some organizations find it difficult to partner with…”), asking indirect questions (Fisher, 1993), and several other techniques may get you more truthful responses (Krumpal, 2013).
In the example of partnerships we discussed in section 4.1, you could approach a sample of nonprofit organizations and ask them what works in partnerships with corporations and what went wrong in failed attempts. In the best case, you survey pairs of partners, who report on the partnership as well as on each other (Kelly & Conley, 1987; Robins, Caspi & Moffitt, 2000; Watson, Hubbard & Wiese, 2000).