Modeling is difficult

21. September 2011 um 00:33 2 Kommentare

Yesterday Pete Johnston wrote a detailed blog article about difficulties of “the right” modeling with SKOS, and FOAF in general, and about the proposed RDF property foaf:focus in particular. As Dan Brickley wrote in a recent mail “foaf:focus describes a link from a skos:Concept to ‘the thing itself’. Not every SKOS concept (in a thesauri of classification scheme) will have such a direct “thing”, but many do, especially concepts for people and places.”

Several statements in this discussion made me laugh and smile. Don’t get me wrong – I honor Pete, Dan, and the whole Semantic Web community, but there is a regular lack of philosophy and information science. There is no such thing as ‘the thing itself’ and all SKOS concepts are equal. Even the distinction between an RDF ‘resource’ and an SKOS ‘concept’ is artificial. The problem origins not from wrong modeling, which could be solved by the right RDF properties, but from different paradigms and cultures. There will always be different ways to describe the same ideas with RDF, because neither RDF nor any other technology will ever fully catch our ideas. These technologies are not about things but only about data. As William Kent wrote in Data Reality (1978): “The map is not the territory” (by the way, last year Chris Rusbridge has quoted Kent in the context of linked data). As Erik Wilde and Robert J. Glushko wrote in a great article (2008):

RDF has succeeded beyond the wildest expectations as a convenient format for encoding information in an open and easily computable fashion. But it is just a format, and the difficult work of analysis and modeling information has not and will never go away.

Ok, they referred not to “RDF” but to “XML”, so the quotation is wrong. But the statement is right for both data structuring methods. No matter if you put your data in XML, in RDF, or carve it in stone – there will never be a final model, because there’s more than one way to describe something.

Query parts in Linked Data URIs considered harmful

29. Mai 2011 um 01:40 Keine Kommentare

The generic syntax of URI (and of IRI with slightly different definition of the elements) is:

<scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]

RDF is based on “URI references” which have a different syntax, but all practical RDF data should follow the generic URI syntax. This is also implied by the Linked Data Principles which encourage you to use HTTP URIs. Furthermore it is a good advice not to include fragments in your URIs if you care about coexistence of the Web and the Semantic Web. Yes, there is RDF data with fragment parts and the so called Hash URIs are even given as one design pattern, but you can avoid a lot of trouble if you use URIs without them. By the way, fragment identifiers are also used a lot in JavaScript-based web applications that spoil the whole concept of REST as described in this recent in-depth article on Hash URIs.

I would even go further and say that well-designed URIs for Linked Data should also forgo the query part. A clean URI to be used as identifier in Linked Data should stick to this limited generic syntax:

http: <hierarchical part>

I do not argue to disallow URIs with query part, but I wonder how much they are really used and needed for resources other than Web resources. Especially URIs for non-information resources, should not have a query part. Most web applications will not distinguish between these two:

http://example.org?foo=1&bar=2

http://example.org?bar=2&foo=1

These are different URIs but equivalent URLs. Choosing URIs that are not equivalent to other URIs in common non-RDF applications is a good advice, isn’t it? If you choose a cleaner URI like http://example.org/foobar you could use additional query parts to refer to different representations (information resources) and services around the resource that is referenced by the base URI.

Research data, git hashes, and unbreakable links

5. Mai 2011 um 23:57 1 Kommentar

Yesterday I had a short conversation about libraries and research data. This topic seems to be trendy, for instance D-Lib magazine just had a special issue about it. I am not quite sure about the role of libraries for the management of research data. It looks like many existing projects at least aim at analyzing research data – this can get very complex because any data can be research data. Maybe librarians should better limit to what they can do best and stick to metadata. It is not the job of librarians to analyze traditional publications (there are scientists to do so, for instance in philology), so why should they start analyzing research data? It would be easier if we just treat research data as “blobs” (plain sequences of bytes) to not get lost in the details of data formats. There will be still enough metadata to deal with (although this metadata might better be managed by the users).

One argument in our discussion was that libraries might just use the distributed revision control system git. Git is also trendy, but among software developers that must track many files of source code with revisions and dependencies. Although git is great for source code and lousy for raw binary data, we could learn something from its architecture (actually there is an extension to git to better handle large binary files). I already knew that git uses hash sums and hash trees and was curious how it actually stores data and metadata.

Management of data in git is basically based on the SHA1 hashing algorithm, but you could also use another hashing method. This answer told me how git calculates the SHA1 for a chunk of data. Note that the name of a file is not part of the calculation, as the filename is no data but metadata. You can move around and rename a file; its hash remains the same. More details of how git stores data and metadata about collections of data chunks (filenames and commits) can be found in the git community book, in the git book and in this blog article how git stores your data.

Hashes are also used in peer-to-peer networks to reference files of unknown location. With distributed hash tables you can even decentralize the lookup mechanism. Of course someone still needs to archive the data, but if the data is stored at least somewhere in the system, it cannot get lost by wrong names or broken links. Instead of pointing to locations of files, metadata about research data should contain an unbreakable link to the data in form of its hash. Libraries that want to deal with research data can then focus on metadata. Access to data could be provided via BitTorrent or any other method. The problem of archiving is another issue that should better be solved independently from description and access.

A third trendy topic is linked data and RDF. You can use clean URIs like this to refer to any chunk of data: urn:sha1:cd50d19784897085a8d0e3e413f8612b097c03f1

To make it even more trendy (you are welcome to reuse my idea in your next library research project proposal ;-) put the data objects into the cloud. No more file names, no more storage media – data is just a link in form of a hash value and a big cloud that you can look up data chunks by their hash.

P.S: A short explanation why you can really replace any piece of (research) data by its hash: There are 2160 different SHA1 hash values. According to rules of probability the expected number of hashes that can be generated before an accidental collision (“birthday paradox’”) is 280. The sun will expand in around 5 billion years (less than 258 seconds from now, making life on earth impossible. That means until then we can still generate 2^44 (4 million) hashes per second and collisions are still unlikely. With cryptographic attacks the number can be smaller but it is still much larger than other sources of error.

Mapping bibliographic record subfields to JSON

13. April 2011 um 16:26 1 Kommentar

The current issue of Code4Lib journal contains an article about mapping a bibliographic record format to JSON by Luciano Ramalho. Luciano describes two approaches to express the CDS/ISIS format in a JSON structure to be used in CoudDB. The article already provoked some comments – that’s how an online journal should work!

The commentators mentioned Ross Singer’s proposal to serialize MARC in JSON and Bill Dueber’s MARC-HASH. There is also a MARC-JSON draft from Andrew Houghton, OCLC. The ISIS format reminded me at PICA format which is also based on fields and subfields. As noted by Luciano, you must preserves subfield ordering and allow for repeated subfields. The existing proposals use the following methods for subfields:

Luciano’s ISIS/JSON:

[ ["x","foo"],["a","bar"],["x","doz"] ]

Ross’s MARC/JSON:

"subfields": [ {"x":"foo"},{"a":"bar"},{"x":"doz"} ]

Bill’s MARC-HASH:

[ ["x","foo"],["a","bar"],["x","doz"] ]

Andrew’s MARC/JSON:

"subfield": [
  {"code":"x","data":"foo"},{"code":"a","data":"bar"},
  {"code":"x","data":"doz"} ]

In the end the specific encoding does not matter that much. Selecting the best form depends on what kind of actions and access are typical for your use case. However, I could not hesitate to throw my encoding used in luapica into the ring:

{ "foo", "bar", "doz",
  ["codes"] = {
    ["x"] = {1,3}
    ["a"] = {2}
}}

I think about further simplifying this to:

{ "foo", "bar", "doz", ["x"] = {1,3}, ["a"] = {2} }

If f is a field than you can access subfield values by position (f[1], f[2], f[3]) or by subfield code f[f.x[1]],f[f.a[1]],f[f.x[2]]. By overloading the table access method, and with additional functions, you can directly write f.x for f[f.x[1]] to get the first subfield value with code x and f:all("x") to get a list of all subfield values with that code. The same structure in JSON would be one of:

{ "values":["foo", "bar", "doz"], "x":[1,3], "a":[2] }
{ "values":["foo", "bar", "doz"], "codes":{"x":[1,3], "a":[2]} }

I think a good, compact mapping to JSON that includes an index could be:

[ ["x", "a", "x"], {"x":[1,3], "a":[2] },
  ["foo", "bar", "doz"], {"foo":[1], "bar":[2], "doz":[3] } ]

And, of course, the most compact form is:

["x","foo","a","bar","x","doz"]

Proposed changes in VIAF RDF

13. April 2011 um 13:42 2 Kommentare

The Virtual International Authority File (VIAF) is one of the distinguished showcases of international library community projects. Since more then five years, name authority files from different countries are mapped in VIAF. With VIAF you can look up records about authors and other people, and see which identifiers are used for the same person in different national library catalogs. For some people there are also links to bibliographic articles in Wikipedia (I think only English Wikipedia, but you can get some mappings to other Wikipedias via MediaWiki API), and I hope that there will be links to LibraryThing author pages, too.

However, for two reasons VIAF is not used as much as it could be: first not enough easy-to-understand documentation, examples, and simple APIs; and second difficulties to adopt technologies by potential users. Unfortunately the second reason is the larger barrier: many libraries cannot even provide a simple way to directly link to publications from and/or about a specific person, once you got the right person identifier from VIAF. If you cannot even provide such a fundamental method to link to your database, how should you be able to integrate VIAF for better retrieval? VIAF can do little about this lack of technical skills in libraries, it can only help integrating VIAF services in library software to some degree. This brings me to the other reason: you can always further improve documentation, examples, the design of you APIs, etc. to simplify use of your services. As a developer I found VIAF well documented and not very difficult to use, but there are many small things that could be made better. This is natural and a good thing, if you communicate with your users and adopt suggested changes, as VIAF does.

For instance yesterday Jeffrey A. Young, one of the developers behind VIAF at OCLC published a blog article about proposed changes to the RDF encoding of VIAF. I hope that other people will join the discussion so we can make VIAF more usable. There is also a discussion about the changes at the library linked data mailing list. And earlier this month, at the Code4Lib mailing list, there was a a controversial thread about the problems to map authority records that are not about people (see my statement here).

I appreciate the simplification of VIAF RDF and only disagree in some details. The current proposal is illustrated in this picture (copied from Jeffrey’s original article):

This looks straightforward, doesn’t it? But it only suits for simple one-to-one mappings. Any attempt to put more complex mappings into this scheme (as well as the existing VIAF RDF scheme) will result in a disaster. There is nothing wrong with simple one-to-one mappings, with SKOS you can even express different kinds of mappings (broader, narrower, exact, close), but you should not expect too much preciseness and detail. I wonder why at one side of the diagram links are expressed via foaf:focus and at the other side via owl:sameAs. In my opinion, as VIAF is about mapping authority files, all mapping links should use SKOS mapping properties. There is nothing wrong in declaring an URI like http://viaf.org/viaf/39377930/ to stand for both a foaf:Person, a rdaEnt:Person, and a skos:Concept. And the Webpage that gives you information about the person can also get the same URI (see this article for a good defense of the HTTP-303 mess). Sure Semantic Web purists, which still dream of hard artificial intelligence, will disagree. But in the end RDF data is alway about something instead of the thing itself. For practical use it would help much more to think about how to map complex concepts at the level of concept schemes (authority records, classifications, thesauri etc.) instead of trying to find a “right” model reality. As soon as we use language (and data is a specific kind of language), all we have is concepts. In terms of RDF: using owl:Thing instead of skos:Concept in most cases is an illusion of control.

XML Schema vs. Library APIs (OAI-PMH/SRU/unAPI…)

24. Februar 2011 um 18:33 2 Kommentare

Much of our work at GBV library network has to do with record formats and APIs. We harvest or get metadata records in a wide range of formats (with many different interpretations and misconstructions of these formats), convert records to a wide range of formats (with many special request how to interpret this formats), and provide records through various APIs. Some of these APIs allow you to select different record formats, for instance OAI-PMH (first published 2001), SRU (2003), and unAPI (2006). These APIs are based on HTTP for transport and XML for encoding of the records. There are also older APIs and encoding formats like Z39.50 and newer APIs like pure Linked Data and SPARQL for RDF. unAPI also supports non-XML formats, but in this article I will concentrate on XML-based formats.

The basic question (that I deal with since years) is “what exactely is a format and how do you refer to it?”. All three APIs provide a method for listing of all formats that are supported by a particular server. unAPI provides a “list of object formats”. Each format has a “name”, a “type” (which must be an official Internet media type), and an optional documentation URL (“docs”), which may refer to some human-readable documentation, or to an XML Schema (XSD) file. Here are three examples:

<format name="oai_dc" type="application/xml"
  docs="http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
/>
<format name="pubmed" type="application/xml"
  docs="http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html"
/>
<format name="mods" type="application/xml"
  docs="http://www.loc.gov/standards/mods/"
/>
<format name="marcxml" type="application/xml"
  docs="http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
/>

To avoid the uncertainty whether “docs” references a formal schema or a plain document, there should have been a “schema” attribute (first problem). To refer to a format in an unAPI request, you use the format’s “name”. In OAI-PMH you refer to a format by its “metadataPrefix”. You can get a list of supported formats with the ListMetadataFormats request. In addition to the “metadataPrefix” each format has the location of an XML Schema (“schema”) and an XML Namespace URI (“metadataNamespace”). In theory the latter is dispensable, because each XSD document declares a namespace URI in its “targetNamespace” attribute: Given a format with a schema that defines namespace “http://example.org/” like this

<xs:schema targetNamespace="http://example.org/">

I would expect records in this format to use this namespace, at least for the XML root element:

<record xmlns="http://example.org/">

The OAI-PMH specification does not explicitly say that the “metadataNamespace” must match the namespace in the schema file “schema”. What does it mean if they differ? (second problem).

In SRU a format is known as “schema”. A list of supported formats is contained in an explain request. Each schema has an optional “title”, a “name” (used to refer to schemas in the “recordSchema” HTTP parameter when doing a search query), an “identifier”, and an optional “location”. The “identifier” contains an additional URI, and the “location” contains a link to an XML Schema file or to some human-readable documentation (like the “docs” attribute in unAPI). There is a list of known schemas at the SRU page, for instance:

title and location name identifier
MODS Schema Version 3.0 mods info:srw/schema/1/mods-v3.0
MODS Schema Version 3.3 mods info:srw/schema/1/mods-v3.3
MARCXML marcxml info:srw/schema/1/marcxml-v1.1

Note that one name (for instance “mods”) can refer to several schemas, but one particular SRU server can only provide one particular format under this name. The additional identifier neither refers to a particular XML Schema (Third problem). The identifier may only give a hint which particular version or interpretation of a format is provided.

Does anyone really need this diverse methods to refer to formats? I found in practice you cannot rely on the claimed format anyway, unless you can automatically validate it. That’s what XML Schema can be used for. I don’t say that XML Schema is the best or only method to formally describe an XML-based format (personally I much bettter like RELAX NG), but if there is an XML Schema – shouldn’t this schema be enough to identify the format?. Is there really a need of four independent identifiers to refer to an XML-based format? In the worst case we have:

  • Schema Name (e.g. mods)
  • Schema Location (e.g. http://www.loc.gov/standards/mods/v3/mods-3-3.xsd)
  • Schema Identifier (e.g. info:srw/schema/1/mods-v3.3)
  • Schema Namespace (e.g. http://www.loc.gov/mods/v3)

This is bad design, because you cannot say which of the four is the right one and how they relate to each other. A clean solution would only have two identifiers for XML-based formats:

  • The local name, which is only unique for a particular API and a particular server
  • The global schema Location, which is a cool URI that resolves to an XML Schema file.

The Schema Namespace is included as “targetNamespace” in the XML Schema, and the Schema Identifier is delusion anyway. Either you can identify a format by a formal schema (that can also be used to validate records) or you just cannot guarantee which format your records will be in. Sure you can give some hints by linking to documentations, examples, and guidelines. But adding more identifiers is a fakery of control. You are still allowed to provide more specific formats, variants, application profiles, and interpretations under different names. But these formats don’t get more clear or usable if you give them a “Schema Identifier”. Does anyone uses SRU’s Schema Identifiers anyway? I think for XML we can better live with XML Schemas that the XML namespaces can be extracted from. An application can identify a format by its schema location, by the XML namespace, and/or by other information contained in the schema. Additional pointers to human-readable documentation are great. But don’t confuse description with identification if you need to refer to a data format.

P.S. At Code4lib mailing list Rob Sanderson pointed to our discussion we had about the same topic in 2009, and one of my earlier postings on XML4Lib also deals with SRU and namespaces.

Named Entity Recognition with DBPedia

15. Februar 2011 um 14:55 5 Kommentare

Yesterday the DBPedia team released DBPedia Spotlight, a named entity recognition service based on structured data extracted from Wikipedia. You can access the service via Web APIs or download the software as Open Source. I could not resist to feed Spotlight its own description:

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. Text annotation has the potential of enhancing a wide range of applications including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.

Pretty cool, isn’t it? Natural Language Processing (NLP) for information extraction seems to be the next hype after Web 2.0 and Semantic Web. I don’t neglect the innovative capabilities of DBPedia Spotlight and similar tools, but you should never forget that these are just tools, which won’t automatically solve information problems, or replace all other tools. Given the example above, there is little chance that an automatic system will extract you an exact topic of the text (for instance “named entity recognition based on data extracted from Wikipedia”) because this requires much background knowledge combining domain-specific expertise with common sense. By the way: as long as both Wikipedia and NLP-software is mainly written by white males, the result of will always mirror a limited world-view.

You can compare the results of Spotlight with similar open services:

I found little overlap between the different services. Spotlight seems to provide more results (depending on the Text) on an error rate between 10% and 30%. You could use such tools for automatic subject indexing based on abstracts and use the result at least for ranking. Unfortunately in library metadata we often have no full text or abstract to annotate. Furthermore many library entities have no DBPedia entry but catalogers create new authority records if needed. What do you think, named entity recognition and other NLP techniques can be used for in metadata land? Can we give up controlled subject indexing in libraries in favour of automatic NLP-based indexing on the one side and social tagging on the other? Or is room for all of these approaches, and how can you successfully combine them?

Linked Data/RDF/SPARQL Documentation Challenge

23. Januar 2011 um 13:49 4 Kommentare

In a blog posting earlier in January Richard Pop descibed the difficulty to get started with Linked Data for the average programmer: “there is a new format (RDF), new query language (SPARQL) and new code libraries to learn and install (if you can find them in the first place)”. For this reason Pop created a challenge to come up with documentation for the basic steps of developing a Linked Data/RDF/SPARQL application. You can find contributions by Bill Roberts, Christopher Gutteridge, Pezholio, Gunnar Aastrand Grimnes, Tom Morris, Jeni Tennison (and here), Niklas Lindström, Felix Ostrowski, and John Goodwin. Meanwhile there are £50 prizes for solutions in PHP, Ruby, and Phython.

I have engaged in RDF since years, but mostly from a theoretical point of view. This year I started creating Linked Data applications with RDF::Trine, a powerfull programming library by Gregory Williams to handle RDF data in Perl (see more at perlrdf.org). I contributed some minor patches to further simplify usage. Next month I will write a tutorial on how to create an RDF based application in Perl. Meanwhile I work on a Perl module (limited preview) to create classifications and thesauri in SKOS/RDF.

Data is not meaning – but a web badge

6. Januar 2011 um 00:57 3 Kommentare

I’m am sure that Douglas Adams and John Lloyd had a word for it: you know exactly what you mean, but not how to call it. Recently I tried to find information about : A particular kind of “web banner”, “button”, or “badge” with specific size, border, and two parts. I finally found out, that it is a 80×15 web badge as introduced by Antipixel in 2002. A helpful description of the format is given by ZwahlenDesign, who also points to two online badge creation tools: Brilliant Button Maker and Button Maker. Note that the tools use “Button” instead of “Badge” to refer to the same thing.

I created a web badge to promote a simple philosophical web standard: data is not meaning* Here is the data as 177 bytes hexdump:

89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 00 00 00 50 00 00 00 0F 01 03 00 00 00 49 07 DA CC 00 00 00 01 73 52 47 42 00 AE CE 1C E9 00 00 00 06 50 4C 54 45 FF FF FF 00 00 00 55 C2 D3 7E 00 00 00 59 49 44 41 54 08 D7 63 F8 0F 07 0C 0D 0C 50 C0 C8 B0 FF FF FF 0F D8 99 0D 10 E6 8E CF 7D 05 2D 7E 86 42 2E 85 0C BB 73 EF 6E 7E 76 C2 73 52 4A 23 C3 EE C4 3B 06 AD 7E 95 41 21 1B C1 A2 4F 3C 3C 8D 7C 26 32 EC 78 7B 77 43 8B 9F A7 90 4B 22 B2 09 D8 AD 40 72 03 C2 65 00 CA 67 45 A7 86 69 B7 81 00 00 00 00 49 45 4E 44 AE 42 60 82

If data was meaning, that should be all to say. But data is just a stream of bits, bytes, numbers, characters, strings, nodes, triples, or sometimes even words. You have to make use of it in a meaningful way. For instance you could give the data above to a specific piece of software like your web browser. Here comes the data again:

This PNG image was the smallest I was able to create with optimized colors, LZW compression etc. There is another piece of data, that is only eight bits more (178 bytes) and looks the same as GIF image:

I could also express the monochrome badge with one bit per pixel. That makes 80×15 = 1200 bits = 150 bytes uncompressed. The meaning could be the same, but not when only looking at your browser (because this piece of software cannot handle my “ad-hoc monochrome 80×15 web badge format”).

I also created a version with color. Feel free to use and modify as you like. In this case the PNG with 198 bytes is slightly larger than the GIF with 196 bytes.

PNG:

GIF:

* I was suprised that there were only seven Google hits for this phrase, none of them with the same meaning (sic!) that I try to express by this article. The badge was inspired by this important warning sign.

P.S: Enough data philosophy, time for music. There is so much more than one and zero and one and zero and one!

Is data a language? In search of the new discipline Data Linguistics

13. September 2010 um 01:24 10 Kommentare

Yesterday Jindřich asked me for a reason to treat data as a language. I really appreciate this little conversations in data philosophy, but after a while 140 characters get too limited. Half a year ago a similar discussion with Adrian became a small series of blog articles (in German). I doubt that you can find a simple and final answer to fundamental questions about data and meaning, because these questions touch the human condition. This is also the reason why we should never stop asking unless we give up being human beings.

Jindřich’s question first made me wonder because for me data obviously is a language. All data is represented as sequence of bits, which can easily be defined as formal language. But this argument is stupid and wrong. Although language can be described by formal languages (as introduced by Noam Chomsky), this description only covers synax and grammar. Above all, a description of language must not be confound with language itself: the map is not the territory.

But data is used to communicate just like natural (written) language. The vast amount and heterogenity of data sometimes makes us forget that all data is created by humans for humans. Let me start with a simple argument against the view of data as language. A lot of data is created by measuring nature. As nature is not language, measured data is not language. This argument is also wrong. We (humans) design measuring devices and define their language (sic!) in terms of units like length, duration, blood pressure, and so forth. These units do not exist independent from language, but only communicated via it. And most units describe much more complex and fuzzy concepts like “name”, and “money”, which only exist as social construct. A piece of data is a statement that can be false, true, nonsense, or all of it, depending on context. Just like language.

Nowadays we create a lot of data for machines. Is this an argument against nature as language? I don’t think so. We may say that a piece of data made a machine to perform some task, but the machine was designed to act in a specific way. Machines do not “understand” data, they just react. If I use an axe to cut a tree, I do not send the tree a message of data that it interprets to cut itself. Of course computers are much more complex then trees (and much simpler from another point of view). The chain of reaction is much more subtle. And most times there are more participants. If I create some data for a specific program, I do not communicate with the program itself, but with everyone involved in creating the program and its environment. This may sound strange but compare the situation with legal systems: a law is a piece of language, used to communicate to other people: “don’t step on the grass”. Unforunately society makes us think that laws are static and independent from us. In the same way people think that data is shaped by computers instead of people. Next time you get angry about a program, think about the vendor and programmer. Next time you get angry about a law, think about the lawmakers.

What follows from treating data as language? I think we need a new approach to data, a dedicated study of data. I would call this discipline data linguistics. Linguistics has many sub-fields concerned with particular aspects of natural language. The traditional division in syntax, semantics, and pragmatics only describes one direction to look at language. Anthropological linguistics and sociolinguistics study the relation between language and society, and historical linguistics studies the history and evolution of languages, only to mention a few disciplines. Suprisingly, the study of data is much more limited — up to now there is no data linguistics that studies data as language. The study of data is mainly focused on its form, for instance on the study of formal languages in computer science, the study of digital media in cultural studies and media studies, or the study of forms and questionnaire in graphic design and public administration (forms could be a good starting point for data linguistics).

There are some other fields that combine data and linguistics but from different viewpoints: computational linguistics studies natural language by computational means, similar to digital humanities in general. In one branch of data analysis, linguistic summaries of data are created based on fuzzy set theory. They provide natural language statements, that capture the main characteristics of data sets. Natural language processing analyses textual data by algorithmic methods. But data linguistics that analyses data in general is still waiting to be discovered. We can only conjecture possible reasons for this lack of research:

  • Data is not seen as language.
  • Digital data is a relatively new phenomenon. The creation of data on a large scale mainly began in the 20th century, so there is not enough time to historically explore the topic.
  • In contrast to natural language, data is too heterogenious to justify a combined look at data in general.
  • data seems to be well-defined, so no research is needed.

What do you think?

Powered by WordPress with Theme based on Pool theme and Silk Icons.
Entries and comments feeds. Valid XHTML and CSS. ^Top^