Data Task Questionnaire Results

Task types, task steps and task hold-ups

The tasks that respondents were doing mainly ranged in scale from Local Authority to Europe-wide tasks. Two tasks were global in their area of interest. Some tasks involved the creation of interactive mapping tools, as well as service or (web) application development. Other tasks were concerned with data manipulation and analysis, data visualisation using spatial and social statistics, data transformations from raw to linked data, or combining data from multiple sources.

We asked the respondents to describe their task in three to five main steps and to highlight the steps that presented them with the most difficulties and hold-ups. The most time-consuming aspects of the data tasks represented by the questionnaire responses were as follows:

  • getting hold of data that couldn’t be downloaded (CD through post)
  • sorting out poorly formatted data, manipulating and re-organising data, converting data
  • getting region boundaries to load quickly enough into Google Maps using a SPARQL query and a JavaScript application
  • turning free text into triples and locations into geographic entities that can be linked to
  • modelling the data in RDF
  • combining population figures with geographically referenced entities by time period
  • manually checking data that has been automatically processed
  • mapping between different data models
  • converting point data to regional aggregates
  • building links that make sense and can be easily harvested

Technology used for the data tasks

  • Several, depending on the use case. From scrapers to linked data tools.
  • Publish My Data for linked data and SPARQL endpoint. Ruby scripts for processing/preparing boundary data. JavaScript and the Google Maps API
  • Excel, QGIS and the internet
  • Laptop, R software environment, Illustrator
  • Web sites that can be searched for words or phrases that are geographic in nature
  • data transformation tool (I used Google Refine mainly)
  • A good programming language, PHP
  • Web browser, storage space, GIS for data management, software for modelling
  • Python, a spatial database (SQL Server and ArcSDE). The operating systems in this case are Windows Server 2008 and XP
  • Schema mapping software, data model, Web Feature Service based distributed architecture, GUIs
  • 1. linked data formats for common data representation 2. custom scripts to convert and manipulate datasets 3. mapping APIs to assist with visualisation
  • Primarily server-side scripting languages
  • Computer, GIS, CRM
  • PHP, Perl and C

Task timings

  • Most of the tasks described as one-off need to be repeated in a similar fashion with new data, or because data needs to be updated and/or maintained. It is therefore worth looking at the requirements for these tasks, as the challenges they encounter have relevance beyond the initial task.
  • The most significant reasons for hold-ups are the difficulty of getting hold of data, joining up data models, finding suitable data alignments, and differences in data suppliers’ technological capabilities.

Data coverage

Among the questionnaire responses, geographical coverage was fairly evenly spread across National, Regional and Local levels. Few conclusions can be drawn from this given the small sample size of fourteen respondents. What may be of more interest is the combination of coverages that any one task needs to handle, and the links that should be enabled between regions, and the data associated with those regions, at different levels of scale.

Types of data sources used

Again, the results for the types of data sources being used are not significant due to the small sample size, although there was a clear preference for administrative boundaries, statistical geographies and social statistics among the tasks represented. A greater number of responses to the questionnaire would better reflect the variety of tasks that exist and the full range of data sources being used.

Tasks also need to be seen in the context of which data source types are being used in combination.

When data source types are cross tabulated with the coverage that each task requires, there is a clustering in the responses that we received around combined National + Regional + Local coverage for Administrative geographies, Authority Boundaries, Statistics and Statistical geographies, Transport data and Addresses.

Information sources

The question about information sources invited respondents to describe each data item they use in their task: its type, where it comes from, what format it is in, and the terms of use attached to it.

A full list of the data items for each task was created, along with details of their source, where they are located, which organisation they originate from, what their identifiers are called and the purpose of the data item in the context of the task.

The purpose field could provide some interesting pointers about what the questionnaire respondents are doing with the data, and with the benefit of technical expertise, what software tools or services might help them achieve their data activity.

Possibilities could include:

  • Tools to combine and query data sources from multiple perspectives for the creation of dashboards to visualise social statistics, or a generic dashboard for data querying and visualisation
  • Tools to automatically match data held against postcodes to the various regions that each postcode sits in
  • Tools to link LSOA boundaries (and the data associated with these) to their respective Local Authority boundary
  • Some way to automate the matching of a location mentioned in text on a web site with all the URIs that could represent that location in various sources, such as GeoNames or DBpedia.
  • Following on from the point above, a way to align resolved location mentions, and their linked data representations, with the output from the EuroGeoNames project.
  • Matching functional sites to population data
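
As a rough illustration of the postcode and LSOA matching tools suggested above, the following Python sketch normalises a postcode, looks up the regions it sits in, and groups LSOAs under their Local Authority. The lookup table, its codes and the function names are all invented for illustration; a real tool would load a published postcode directory instead.

```python
# Hypothetical lookup table: every postcode and region code below is invented.
POSTCODE_LOOKUP = {
    "AB1 2CD": {"lsoa": "LSOA-001", "local_authority": "LA-A"},
    "AB1 3EF": {"lsoa": "LSOA-002", "local_authority": "LA-A"},
    "XY9 8ZZ": {"lsoa": "LSOA-003", "local_authority": "LA-B"},
}


def normalise_postcode(raw):
    """Upper-case a UK-style postcode and put one space before the last three characters."""
    compact = "".join(raw.split()).upper()
    return compact[:-3] + " " + compact[-3:]


def regions_for_postcode(raw, lookup=POSTCODE_LOOKUP):
    """Return the regions recorded for a postcode, or None if it is not in the table."""
    return lookup.get(normalise_postcode(raw))


def lsoas_by_local_authority(lookup=POSTCODE_LOOKUP):
    """Group LSOA codes under the Local Authority they belong to."""
    grouped = {}
    for regions in lookup.values():
        grouped.setdefault(regions["local_authority"], set()).add(regions["lsoa"])
    return grouped


if __name__ == "__main__":
    print(regions_for_postcode("ab12cd"))
    print(lsoas_by_local_authority())
```

Returning None for an unknown postcode, rather than raising an error, leaves it to the caller to decide whether unmatched records should be dropped or queued for manual checking.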

The results of the information sources question have been combined in a table showing where the data items sit in the ‘information space’ with regard to data formats and terms of use. This table shows that most of the data items used by the tasks represented in the questionnaire responses sit in the External Data (Open) category, and are either in machine-readable proprietary formats, such as ESRI shapefiles or Excel spreadsheets, or in a non-proprietary format or data standard. Only three data sources were described as external commercial data.

Data issues

The Data Issues part of the questionnaire asked about data quality issues, data management issues and work-arounds. Key points made by the respondents were as follows:

  • Data Formats and Metadata issues are the two areas where most of the respondents would like to see improvements in data management. Ordering and delivery mechanisms were the next most important, followed by pricing and licensing terms.
  • Need standardization of definitions where possible
  • Need clear licensing terms attached to the data
  • Not all data is available at the desired scale for the task…need means to interpolate
  • All data should ideally be online in web friendly format and machine readable
  • One respondent said they needed better identifiers and for data formats to be closer to RDF
  • One respondent mentioned the importance of provenance and update details.
  • Where data is in spreadsheet format, there needs to be a more consistent and ‘cleaner’ way to express data, following basic rules
  • Need to use well defined resources for common types of identifier
  • Open data needs to be more consistent
  • Non-government sources of data need to be more consistent

Possible tools or services to address these issues?

  • Automatic trawling of data to find where data items can be replaced by a URI (offered from a list of respected suppliers), or linked to a URI through a property.
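
A minimal sketch of this idea in Python is shown below, assuming a simple label-to-URI gazetteer. The gazetteer contents, the example.org URIs and the function name are placeholders invented for illustration, not the output of any real supplier.

```python
# Hypothetical gazetteer mapping normalised labels to URIs.
# The example.org URIs are placeholders, not real identifiers.
PLACE_URIS = {
    "edinburgh": "http://example.org/id/place/edinburgh",
    "glasgow": "http://example.org/id/place/glasgow",
}


def link_values(values, gazetteer=PLACE_URIS):
    """Pair each data value with a matching URI, or None when no match is found."""
    linked = []
    for value in values:
        # Collapse whitespace and ignore case before looking the label up.
        key = " ".join(value.split()).casefold()
        linked.append((value, gazetteer.get(key)))
    return linked


if __name__ == "__main__":
    for value, uri in link_values(["Edinburgh", " GLASGOW ", "Atlantis"]):
        print(repr(value), "->", uri)
```

Values that come back paired with None are the ones that would still need manual checking, which echoes the hold-up respondents reported around checking automatically processed data.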

Tools and services

The ways that current tools could be improved include:

  • Publish comparable metadata and pool the catalogues in one place
  • Automatic identification and linkage of key identifier types

Tool wish list included:

  • Tool to extract structured data from unstructured formats
  • An on-line point-within-polygon API would be useful – i.e. to find which statistical or admin regions contain a given point.
  • An online lat-long <-> easting-northing API would also be handy for many people
  • A tool to allow the specification of multiple data downloads from multiple sources using a standard interface/environment, rather than having to go off and find the data then specify it differently in each case
  • Better generic capabilities, e.g. coordinates to area conversion, aggregation of small areas to larger areas, interpolation from large to small, tools to convert data relating to one abstract geometry into another
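
To make the point-within-polygon request concrete, here is a minimal Python sketch of the standard ray-casting test, assuming planar coordinates (for example eastings and northings) and simple polygons without holes. The region names and square boundaries are invented for illustration, and points lying exactly on a boundary are not handled specially.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def regions_containing(x, y, regions):
    """Return the names of all regions whose boundary contains the point."""
    return [name for name, polygon in regions.items()
            if point_in_polygon(x, y, polygon)]


# Hypothetical nested regions, given as lists of (x, y) vertices.
EXAMPLE_REGIONS = {
    "ward": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "district": [(0, 0), (10, 0), (10, 10), (0, 10)],
}

if __name__ == "__main__":
    print(regions_containing(1, 1, EXAMPLE_REGIONS))
    print(regions_containing(5, 5, EXAMPLE_REGIONS))
```

An online service of the kind respondents asked for would wrap a test like this around real boundary data, most likely with a spatial index so that each query does not have to scan every polygon.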