Yesterday Jindřich asked me for a reason to treat data as a language. I really appreciate this little conversations in data philosophy, but after a while 140 characters get too limited. Half a year ago a similar discussion with Adrian became a small series of blog articles (in German). I doubt that you can find a simple and final answer to fundamental questions about data and meaning, because these questions touch the human condition. This is also the reason why we should never stop asking unless we give up being human beings.
Jindřich’s question first made me wonder because for me data obviously is a language. All data is represented as sequence of bits, which can easily be defined as formal language. But this argument is stupid and wrong. Although language can be described by formal languages (as introduced by Noam Chomsky), this description only covers synax and grammar. Above all, a description of language must not be confound with language itself: the map is not the territory.
But data is used to communicate just like natural (written) language. The vast amount and heterogenity of data sometimes makes us forget that all data is created by humans for humans. Let me start with a simple argument against the view of data as language. A lot of data is created by measuring nature. As nature is not language, measured data is not language. This argument is also wrong. We (humans) design measuring devices and define their language (sic!) in terms of units like length, duration, blood pressure, and so forth. These units do not exist independent from language, but only communicated via it. And most units describe much more complex and fuzzy concepts like “name”, and “money”, which only exist as social construct. A piece of data is a statement that can be false, true, nonsense, or all of it, depending on context. Just like language.
Nowadays we create a lot of data for machines. Is this an argument against nature as language? I don’t think so. We may say that a piece of data made a machine to perform some task, but the machine was designed to act in a specific way. Machines do not “understand” data, they just react. If I use an axe to cut a tree, I do not send the tree a message of data that it interprets to cut itself. Of course computers are much more complex then trees (and much simpler from another point of view). The chain of reaction is much more subtle. And most times there are more participants. If I create some data for a specific program, I do not communicate with the program itself, but with everyone involved in creating the program and its environment. This may sound strange but compare the situation with legal systems: a law is a piece of language, used to communicate to other people: “don’t step on the grass”. Unforunately society makes us think that laws are static and independent from us. In the same way people think that data is shaped by computers instead of people. Next time you get angry about a program, think about the vendor and programmer. Next time you get angry about a law, think about the lawmakers.
What follows from treating data as language? I think we need a new approach to data, a dedicated study of data. I would call this discipline data linguistics. Linguistics has many sub-fields concerned with particular aspects of natural language. The traditional division in syntax, semantics, and pragmatics only describes one direction to look at language. Anthropological linguistics and sociolinguistics study the relation between language and society, and historical linguistics studies the history and evolution of languages, only to mention a few disciplines. Suprisingly, the study of data is much more limited — up to now there is no data linguistics that studies data as language. The study of data is mainly focused on its form, for instance on the study of formal languages in computer science, the study of digital media in cultural studies and media studies, or the study of forms and questionnaire in graphic design and public administration (forms could be a good starting point for data linguistics).
There are some other fields that combine data and linguistics but from different viewpoints: computational linguistics studies natural language by computational means, similar to digital humanities in general. In one branch of data analysis, linguistic summaries of data are created based on fuzzy set theory. They provide natural language statements, that capture the main characteristics of data sets. Natural language processing analyses textual data by algorithmic methods. But data linguistics that analyses data in general is still waiting to be discovered. We can only conjecture possible reasons for this lack of research:
- Data is not seen as language.
- Digital data is a relatively new phenomenon. The creation of data on a large scale mainly began in the 20th century, so there is not enough time to historically explore the topic.
- In contrast to natural language, data is too heterogenious to justify a combined look at data in general.
- data seems to be well-defined, so no research is needed.
What do you think?
10 Kommentare »
Entschuldige, das Kommentarformular ist zurzeit geschlossen.