Is data a language? In search of the new discipline Data Linguistics

13. September 2010 at 01:24, 10 comments

Yesterday Jindřich asked me for a reason to treat data as a language. I really appreciate these little conversations in data philosophy, but after a while 140 characters become too limiting. Half a year ago a similar discussion with Adrian became a small series of blog articles (in German). I doubt that you can find a simple and final answer to fundamental questions about data and meaning, because these questions touch the human condition. This is also the reason why we should never stop asking, unless we give up being human beings.

Jindřich’s question first made me wonder, because for me data obviously is a language. All data is represented as a sequence of bits, which can easily be defined as a formal language. But this argument is stupid and wrong. Although language can be described by formal languages (as introduced by Noam Chomsky), this description only covers syntax and grammar. Above all, a description of language must not be confused with language itself: the map is not the territory.
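To make the point about bits concrete, here is a minimal sketch in Python. The bit string and the helper names `is_well_formed` and `to_bytes` are made up for illustration. The formal language of bit strings accepts every sequence of 0s and 1s, so membership in it tells us nothing about what a piece of data says. The same 32 bits can be read as text, as an integer, or as a floating-point number, depending on a convention that people agreed on.

```python
import re
import struct

# The formal language {0,1}*: every finite string of 0s and 1s is a member.
BIT_STRING = re.compile(r"[01]*")


def is_well_formed(bits: str) -> bool:
    """Return True if `bits` belongs to the formal language {0,1}*."""
    return BIT_STRING.fullmatch(bits) is not None


def to_bytes(bits: str) -> bytes:
    """Pack a bit string whose length is a multiple of 8 into bytes."""
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


# A hypothetical example: 32 bits that are syntactically valid, nothing more.
bits = "01000100011000010111010001100001"
raw = to_bytes(bits)

# Three readings of the same bits, each based on a human convention.
as_text = raw.decode("ascii")           # read as four ASCII characters
as_int = int.from_bytes(raw, "big")     # read as an unsigned big-endian integer
as_float = struct.unpack(">f", raw)[0]  # read as an IEEE 754 single-precision float

print(is_well_formed(bits), repr(as_text), as_int, as_float)
```

The check `is_well_formed` covers syntax only. Which of the three readings is "the" meaning is not contained in the bits themselves.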

But data is used to communicate just like natural (written) language. The vast amount and heterogeneity of data sometimes makes us forget that all data is created by humans for humans. Let me start with a simple argument against the view of data as language. A lot of data is created by measuring nature. As nature is not language, measured data is not language. This argument is also wrong. We (humans) design measuring devices and define their language (sic!) in terms of units like length, duration, blood pressure, and so forth. These units do not exist independently of language, but are only communicated through it. And most units describe much more complex and fuzzy concepts like „name“ and „money“, which only exist as social constructs. A piece of data is a statement that can be false, true, nonsense, or all of it, depending on context. Just like language.

Nowadays we create a lot of data for machines. Is this an argument against data as language? I don’t think so. We may say that a piece of data made a machine perform some task, but the machine was designed to act in a specific way. Machines do not „understand“ data, they just react. If I use an axe to cut a tree, I do not send the tree a message of data that it interprets to cut itself. Of course computers are much more complex than trees (and much simpler from another point of view). The chain of reaction is much more subtle, and most times there are more participants. If I create some data for a specific program, I do not communicate with the program itself, but with everyone involved in creating the program and its environment. This may sound strange, but compare the situation with legal systems: a law is a piece of language, used to communicate to other people: „don’t step on the grass“. Unfortunately society makes us think that laws are static and independent from us. In the same way people think that data is shaped by computers instead of people. Next time you get angry about a program, think about the vendor and programmer. Next time you get angry about a law, think about the lawmakers.

What follows from treating data as language? I think we need a new approach to data, a dedicated study of data. I would call this discipline data linguistics. Linguistics has many sub-fields concerned with particular aspects of natural language. The traditional division into syntax, semantics, and pragmatics only describes one direction from which to look at language. Anthropological linguistics and sociolinguistics study the relation between language and society, and historical linguistics studies the history and evolution of languages, to mention only a few disciplines. Surprisingly, the study of data is much more limited: up to now there is no data linguistics that studies data as language. The study of data is mainly focused on its form, for instance on the study of formal languages in computer science, the study of digital media in cultural studies and media studies, or the study of forms and questionnaires in graphic design and public administration (forms could be a good starting point for data linguistics).

There are some other fields that combine data and linguistics, but from different viewpoints: computational linguistics studies natural language by computational means, similar to digital humanities in general. In one branch of data analysis, linguistic summaries of data are created based on fuzzy set theory. They provide natural language statements that capture the main characteristics of data sets. Natural language processing analyses textual data by algorithmic methods. But a data linguistics that analyses data in general is still waiting to be discovered. We can only conjecture possible reasons for this lack of research:

  • Data is not seen as language.
  • Digital data is a relatively new phenomenon. The creation of data on a large scale mainly began in the 20th century, so there has not been enough time to explore the topic historically.
  • In contrast to natural language, data is too heterogeneous to justify a combined look at data in general.
  • Data seems to be well-defined, so no research is needed.

What do you think?



  1. I think there’s some future for this discipline. Currently, lots of the data we produce is still „well understood“, at least by the agents involved (you give a good account of the nature of these agents in your post above). A time will come when the data that our century produces (and it’s _lots_ of data) will have no „native speakers“ anymore, but it will still be represented. At least from that point on, a linguistic approach to it will make some sense from a cultural heritage PoV, possibly also for economic reasons.

    For now, there would possibly be only some very generic scientific value in such a discipline. Usage scenarios for the knowledge gained by such a discipline are still rare.

    There’s also some value in just thinking of data as a language. It sheds some new light on why standards are important. And for the same reasons those are important, the linguistic approach to data would also become important.

    As a rather esoteric side note: There _are_ people who think that nature _does_ communicate. From such a PoV, the linguistic approach to data would be „science“, i.e. making sense of the data nature’s providing.

    Comment by Hans-Werner — 13. September 2010 #

  2. Great post Jakob, now I get the problem I had with your Twitter question. For me, the problem with treating data as language lies in the speech-language distinction (probably first formulated by de Saussure). Applying the speech-language distinction to data, I see data rather as speech, whereas the formalization that is used in data, for example binary, is the language.
    So, this is basically the reason why I was uncomfortable with treating data as a language. However, I think you’ve raised a lot of interesting issues in this blog post, but I think they don’t apply on the level of data (speech), but on the level of the formalizations we use to record data (language). At the same time, I think it’s probably useful as well to research data as a kind of speech and to have an equivalent of phonetics or phonology.

    Comment by Jindřich Mynarz — 13. September 2010 #

  3. Thanks for your responses! Hans-Werner is right, up to now there are few practical use cases, but analyzing data by linguistic means can give some theoretical insights. The discipline also touches philosophy of information, but I don’t want to dig into this at this point. Standards are even more important for shaping data than rules of grammar, orthography, and spelling are in natural language – but they share similarities. For instance you can have data dialects and minor violations of standards, which you must discover to still make use of the data.

    Jindřich’s objection deals with the definition of „data“. I might also have used the term „information linguistics“, but this is already used for computational linguistics and natural language processing. As far as I understand de Saussure’s model, data can in fact be better described as speech. However, you will not only analyze data on the most basic level, but also its formalizations and schemas. In contrast to natural language, the rules of data are often explicitly written down as schemas, ontologies, etc. These can be analyzed in the same way, because they are also data. I think in data the speech/language distinction is less clear, but you are right: The description of data as language is problematic.

    By the way the current Code4Lib journal contains a very simple application of data linguistics. Jason Thomale describes how he reconstructed a specific variant of the MARC 245 field. In linguistic terms I would say, he partly deciphered a rare, ancient data language 🙂

    Comment by jakob — 22. September 2010 #

  4. The distinction between prescriptive linguistics (how language should be) and descriptive linguistics (how language is actually used) is crucial: Today most data is prescribed by standards. For instance Dublin Core defines a set of elements. But data linguistics should analyze whether and how these elements are really used as distinct entities in practice. I think there is a large gap between how data should be, and how data actually is. Just explaining differences as errors and misuse of standards is not enough: data always contains patterns – data linguistics should reveal these patterns whether reflected by explicit standards or not.

    Comment by jakob — 24. September 2010 #

  5. Wow. Jakob, you are coming up with so many BIG words: after ‚data‘ we now have ‚linguistics‘ and also terms like ‚meaning‘ keep coming up regularly. What’s lacking until now is the discussion about what a rule is :-). I believe the questions you touch are very important to understand the future of what’s coming in the „digital age“. I wish I had more time to dive into these topics again, because these are issues you can’t cover with one or two blog posts (or books)…

    Anyway, here are some thoughts to incite further discussion. I will talk about three ways in which data and language differ categorically. (Behind this is my understanding of ‚data‘, which seems to me to be much narrower than how you use the term, Jakob.) I can’t point out here how the following three points relate to each other or whether there are problems reconciling them with each other. Anyway, I think they should all be treated at the beginning of a discussion about data and linguistics.

    1. Data and language are part of separate realms of communication: We use – at least in the context of linguistics – the term ‚language‘ when we talk about communication between humans. In contrast, we tend to use data more and more in a computational context.
    In human communication there is no one-to-one equivalence between the speaker’s meaning and the meaning the listener understands. It was Wilhelm von Humboldt who put this principle of language into the sentence: Every understanding always also is a non-understanding. („Alles Verstehen ist … immer zugleich ein Nicht-Verstehen.“)
    In technical communication the communicating agents are machines, and the acts of coding and decoding are technical processes. Shannon and Weaver’s model of communication deals with technical communication where sender, channel and receiver are apparatuses. It is a popular mistake to impose this model on human communication and to derive wrong assumptions about language from it.

    2. In my understanding of the concept of data there’s also another important difference between data and languages: All data is written, while every language is primarily spoken, though many have also developed a written counterpart. And out of this big difference follow many, many further differences:
    – Data is persistent while (oral) language is transient.
    – All data (if you don’t widen the term to also denote sense-data, which I think nobody does in this discussion) breaks down to a limited set of types: The alphabet, numbers, mathematical and logical symbols, punctuation marks, musical notation etc.: All data is based on a code, a finite set of distinct types.
    – Data is normally found in a mono-modal medium; humans encounter it as visual (it is seen by the eye, though you might feel it with your fingers (braille) or experience it through other senses). In language, besides the auditory speech, we’ve got the visual gestures and facial expressions.

    3. (To get the rules into play…) No language develops by an initial statement of its rules and their subsequent adherence. Rules of language are regularities which change in the course of time because of the language being used. As you say data standards change as well but as far as I know they originate in rule statements, in an act of saying: If you follow the rules R1, R2, R3,… then you use the standard S. So much for sure: Natural languages don’t originate like this.

    So, following these categorical differences between data and language, the term ‚data linguistics‘ doesn’t make sense and can legitimately be called a contradictio in adiecto. But anyway, it certainly is very useful to think about data and information in distinction to oral and written language. And especially in comparison to the written language many similarities on the syntactical level can be found.

    Comment by Adrian — 24. September 2010 #

  6. Thanks Adrian, I need to think more about your comments. Just a short answer to points 2 and 3: we communicate via text messages and digitized speech, which is coded in data. There are ancient languages we only know in written form. So the written/oral distinction is not quite clear to me. I’d say that all data somehow originates in natural language, because you first have to describe it with words. For me data is like a specific dialect that evolved because of usage restrictions. It is as if someone with very limited abilities has tried to speak. Rules in data are more often prescriptive, but can also come up as ad-hoc conventions. Would you agree that data is the result of language but not language itself?

    Comment by jakob — 25. September 2010 #

  7. First, I’d like to make clear that I agree with you concerning your basic suggestion: I think a discipline/line of thought which deals with the philosophical, cultural and social questions concerning data would be a very good thing. I don’t know whether approaches in this direction already exist anywhere but you would probably know if it was like that.

    Another thing I have to point out: I haven’t yet developed a coherent terminology myself. In our preceding discussion I made clear that I would prefer to use the term ‚data‘ for all marks which are based on a code (a finite set of distinct types), whether some person interprets them or not. As I understand your search for a discipline or a line of thought which inquires into our use of data and the surrounding social and cultural practices, and as I believe that we both are obsessed with similar questions, using ‚data‘ as the fundamental term doesn’t fit into my current understanding. On the other side, ‚information‘ doesn’t seem to be better because it is even more semantically loaded by quite different previous usages…
    If data should be the central concept, I might call the discipline ‚Datalogy‘. And it might even be a subdiscipline of linguistics, especially of the linguistics which deals with the written word. As far as I know, human writing has its origin in keeping stock books. That means generating data in a very narrow sense could arguably have been the foremost use of the written word…

    Anyway, as I read over your post again, a very important fourth difference between data and language (probably more important than those already mentioned) occurred to me. You write:

    „A piece of data is a statement that can be false, true, nonsense, or all of it, depending on context. Just like language.“

    Though it is right that language and data can be „false, true, nonsense, or all of it“, language can be much more than that. At the latest with Austin’s „How to do things with words“ it is clear that language’s role in communication isn’t restricted to asserting facts. Language isn’t completely absorbed in declarative sentences. We have questions, commands, performative sentences (e.g. „I hereby declare you man and wife.“) and much more. As data only deals with assertions (and that is the reason why RDF legitimately only accounts for assertions, because RDF is a data model), it can only be a part of all language. This is a fundamental insight that any thinking about data should take into account from the beginning.

    Comment by Adrian — 25. September 2010 #

  8. A phenomenon that occurs in natural language can also occur in formalized ontologies: semantic change. I think one of the biggest challenges, given the enormous amount of data created by a huge number of people (who might have different levels of understanding), is how to reconcile inconsistencies and semantic drifts in data usage. See also

    Comment by Saskia — 12. October 2010 #

  9. @Saskia: good point: semantic changes. My interest is less in removing inconsistencies and semantic drifts in data usage, although this is surely needed. But if there is a semantic shift, there may be a good reason, so it’s more interesting to detect and describe this shift instead of removing it. But maybe I am just too lazy and I prefer pointing to problems instead of resolving them 😉

    Comment by jakob — 14. October 2010 #

  10. Jakob, just to clarify: like you, I wouldn’t intend to remove inconsistencies and changes in semantics, but trace them, put them in context and make sense of them, because despite possible departures from defined standards, we will still need to work with the data. This is what I meant by „reconcile“.

    Comment by Saskia — 15. October 2010 #
