Data shouldn’t be allowed to speak for itself

Dr Vaughn Tan is an assistant professor at University College London’s School of Management. He received a PhD in Organizational Behaviour from Harvard University in 2013.

Previously, he was an infantry signals logistician in the Republic of Singapore Army, then worked at Google on advertising, Earth, Maps, spaceflight, and Fusion Tables. In May 2008, he left Google to become a minion in the wood and sculpture program at the Anderson Ranch Art Foundation. Vaughn sometimes contributes to The Atlantic. His personal website is http://www.vaughntan.com/

This piece was originally written for a digital ethics forum run by the UCL Digital Ethics Forum in May 2019; it represents a management perspective on digital ethics.

The most critical failure mode for management research that uses digital data is mistaking abundant quantitative data for complete data when analysing it and drawing conclusions from it. To avoid this, researchers should theorise and interpret explicitly, stating their hypotheses and assumptions whenever they analyse data. Patterns discovered in data can prompt theorising, but patterns in data should not be mistaken for theory. In other words, data shouldn’t be allowed to speak for itself.

Management research studies people individually and collectively and is inherently oriented toward action. Management research aims to understand and then to influence (by changing management practice) how, for instance, organisations are run, how products are designed, how goods are sold, how employees are managed, and so forth. Digital ethics—the consideration of the ethical and prudential issues surrounding interaction in digital space and collection and use of digital information—is crucial for management research because:

1. A growing proportion of human action and interaction in the context of management happens digitally.
2. Data from digital action and interaction is becoming easier to store and analyse.

Both have ramifications for how researchers think about the process of doing research and drawing conclusions from it that are intended to inform practice. These ramifications are conditioned by the affordances and constraints of digital information in organisational settings.

Affordances

1. Large quantities of information can be collected, stored, and distributed relatively easily
2. Routine and ambient data is relatively easy to collect prospectively (e.g. sociometers) or mine from systems retrospectively (e.g. data from corporate email servers).

Constraints

1. Hard to understand what digital information about a focal entity (individual or organisation) remains unobserved or inaccessible
2. Non-digital aspects of the focal entity remain hard to observe
3. Identification of focal entities becomes easier (and robust anonymisation thus becomes harder) as datasets grow in size and are combined
4. Murky ethics of using data acquired retrospectively and opportunistically from routine user activity (such as cellphone location data)
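
The third constraint can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the employee records, the field names (`dept`, `office`, `start_year`) and the helper `fraction_unique` are all hypothetical and invented for this example. It counts how many records are unique, and therefore potentially identifiable, as more attributes from combined datasets become available. It covers only the combining of datasets, not their growth in size.

```python
from collections import Counter

# Hypothetical, made-up records; no real people or organisations.
records = [
    {"dept": "sales", "office": "London", "start_year": 2015},
    {"dept": "sales", "office": "London", "start_year": 2016},
    {"dept": "sales", "office": "Leeds", "start_year": 2015},
    {"dept": "sales", "office": "Leeds", "start_year": 2015},
    {"dept": "ops", "office": "London", "start_year": 2015},
    {"dept": "ops", "office": "London", "start_year": 2016},
    {"dept": "ops", "office": "Leeds", "start_year": 2016},
    {"dept": "ops", "office": "Leeds", "start_year": 2016},
]


def fraction_unique(rows, fields):
    """Share of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Each step stands in for joining one more dataset onto the first.
for fields in (["dept"], ["dept", "office"], ["dept", "office", "start_year"]):
    share = fraction_unique(records, fields)
    print(f"{' + '.join(fields)}: {share:.0%} of records are unique")
```

In this made-up data, no record is unique on department alone or on department and office together, but half become unique once start year is added. Real datasets will behave differently; the point is only that each additional attribute can shrink the group a person hides in.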

Implications

1. Conceptualising research permission becomes more complex.

The ethical consideration here lies in understanding both the legal (liability-oriented) and moral (ethics-oriented) imperatives for obtaining permission to use digital data in research. This is especially important for data collected opportunistically from the activity of respondents who are not the legal owners of the data they produce in the course of routine activity.

2. Analytic bias becomes harder to see.

Most importantly, the sheer size of large datasets encourages the implicit assumption that they are more complete and can “speak for themselves.” While large quantities of data can be collected relatively easily, it remains difficult to understand what types of data (both digital and non-digital) are missing. Analyses of a large digital dataset may thus be subject to unknown biases, and these biases may become less apparent as datasets grow larger.
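
A small simulation can make this concrete. The sketch below is an illustration under invented assumptions, not a model of any real dataset: the population (a trait with a true mean of 50), the selection rule (people with higher values are more likely to leave a digital trace) and the function names are all hypothetical. It shows the estimated mean staying off target while the standard error, the usual signal of precision, keeps shrinking as more data is collected.

```python
import math
import random
import statistics

TRUE_MEAN = 50.0  # assumed population mean, chosen for illustration


def observed_sample(n_population, rng):
    """Return the values that end up in a hypothetical digital dataset."""
    observed = []
    for _ in range(n_population):
        value = rng.gauss(TRUE_MEAN, 10.0)
        # Assumed selection mechanism: higher values are more likely
        # to be recorded. The analyst does not get to see this rule.
        p_recorded = 1.0 / (1.0 + math.exp(-(value - TRUE_MEAN) / 10.0))
        if rng.random() < p_recorded:
            observed.append(value)
    return observed


def summarise(n_population, seed=0):
    """Number of observations, sample mean and standard error of the mean."""
    rng = random.Random(seed)
    sample = observed_sample(n_population, rng)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return len(sample), mean, std_err


results = [summarise(n) for n in (1_000, 10_000, 100_000)]
for n_obs, mean, std_err in results:
    print(f"observed={n_obs:>6}  mean={mean:6.2f}  std_err={std_err:5.3f}")
```

Under these assumptions the gap between the observed mean and the true mean does not close as the dataset grows, while the shrinking standard error makes the estimate look increasingly trustworthy. Only an explicit statement of how the data came to be recorded would reveal the problem.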