How to access structured data

From Linking experiences of World War One
Revision as of 13:23, 11 November 2017 by GavinRobinson (talk | contribs) (Text replacement - ", UK" to ", British Army")


This page is a work in progress. Please help to improve it if you can, especially if you're familiar with the MediaWiki API.

This wiki aims to create pages for military units in the First World War, with each unit's page containing structured data in infobox templates. You are free to re-use this data for any purpose according to our Creative Commons Attribution Share Alike licence. This page gives some tips on how to access the data.

Data structures

Data about military units is structured using these templates:

See the documentation of each template for more information about parameters and values. The templates are adapted from Wikipedia but are not quite the same.
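
Once you have a page's source wikitext, the infobox parameters have to be pulled out of the template call itself. The following is only a minimal sketch of one way to do that in Python. The template name "Infobox example unit" and the parameter names are hypothetical placeholders, not the real templates used on this wiki, and positional parameters and triple-brace edge cases are ignored.

```python
import re


def extract_template(wikitext, name):
    """Return the body of the first {{name ...}} call, without the outer braces."""
    pattern = r"\{\{\s*" + re.escape(name) + r"\s*(?=[|}])"
    match = re.search(pattern, wikitext, re.IGNORECASE)
    if match is None:
        return None
    depth = 0
    i = match.start()
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[match.start() + 2:i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def split_top_level(body):
    """Split on '|' characters that are not inside nested {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def template_params(wikitext, name):
    """Return the named parameters of the first {{name ...}} call as a dict."""
    body = extract_template(wikitext, name)
    if body is None:
        return {}
    params = {}
    for part in split_top_level(body)[1:]:  # the first part is the template name
        key, sep, value = part.partition("=")
        if sep:  # positional parameters are skipped in this sketch
            params[key.strip()] = value.strip()
    return params


# Hypothetical wikitext, for illustration only.
SAMPLE_WIKITEXT = """{{Infobox example unit
| name = 1st Example Battalion
| parent = [[Example Regiment|Example Regt]]
| note = {{Some template|x=1}}
}}
Some text after the infobox."""

params = template_params(SAMPLE_WIKITEXT, "Infobox example unit")
for key, value in params.items():
    print(key, "=", value)
```

Counting braces rather than using a single regular expression keeps nested templates and piped wikilinks inside a value from being split in the wrong place.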

Unit pages also have some semi-structured data outside the infoboxes. They mostly have the same headings for personal narratives, media and other sources. References to official war diaries are under a standard heading controlled by {{War diary heading}}. There are several templates for information about war diaries, listed in Category:War diary templates.

British administrative units often have references to campaign medal rolls in a wikitable with a standard structure. For example, see Lincolnshire Regiment, British Army#Medal rolls.

MediaWiki action API

This site runs on MediaWiki. The action API is enabled for read queries, and you don't need to be logged in to use it. The API allows you to access wiki pages using the programming language of your choice.

The action API is fully documented in the MediaWiki manual.

The endpoint for this wiki is http://collaborativecollections.org/WorldWarOne/api.php
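
As an illustration, a read query for the current wikitext of one page might look something like the Python sketch below. It uses only the standard library and the standard `action=query`, `prop=revisions` and `rvprop=content` parameters. The shape of the JSON reply depends on the MediaWiki version the wiki is running, so the extraction function is an assumption that tries both the older layout and the newer "slots" layout. The sample response is made up so that the example runs without network access.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://collaborativecollections.org/WorldWarOne/api.php"


def build_wikitext_query(title):
    """Build an action API URL asking for the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Pull the page text out of a decoded JSON response, or return None."""
    for page in response["query"]["pages"].values():
        revisions = page.get("revisions")
        if not revisions:
            continue  # missing page, or no content returned
        revision = revisions[0]
        if "*" in revision:  # older response layout
            return revision["*"]
        # Newer layout (assumption: only the main slot is of interest).
        return revision["slots"]["main"]["*"]
    return None


def fetch_wikitext(title):
    """Call the live API. Not run as part of this example."""
    with urllib.request.urlopen(build_wikitext_query(title), timeout=30) as reply:
        return extract_wikitext(json.load(reply))


# Made-up response in the older layout, for illustration only.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "ns": 0,
                "title": "Lincolnshire Regiment, British Army",
                "revisions": [{"contentformat": "text/x-wiki", "*": "Example wikitext"}],
            }
        }
    }
}

print(build_wikitext_query("Lincolnshire Regiment, British Army"))
print(extract_wikitext(SAMPLE_RESPONSE))
```

To query the live site you would call `fetch_wikitext("Lincolnshire Regiment, British Army")` instead of using the sample response.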

Special:Export

You can export pages as XML files using Special:Export, which is documented in the MediaWiki manual. You don't have to be logged in to use this method. If you want to download large numbers of pages, it's quicker and easier than using API calls.

You can download all the pages in a category by putting the category name in the 'Add pages from category' box and then clicking 'Add'. You can add another category after this in the same way.

The quickest way to get every unit's page is to use Category:All units. If you only want a subset of units, you can use a more specific category. You can see what unit categories are available by looking at Category:Units or Category:Countries and manually clicking through their sub-categories.
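
If you would rather not click through sub-categories by hand, the action API's standard `list=categorymembers` query can list them. The Python sketch below only builds the request URL and reads a decoded reply. The sample response and the sub-category titles in it are invented for illustration, and the `continue` handling assumes a MediaWiki version that uses the newer continuation format.

```python
import urllib.parse

API_URL = "http://collaborativecollections.org/WorldWarOne/api.php"


def category_members_url(category, member_type="subcat", cmcontinue=None):
    """Build a URL listing members of a category ("subcat", "page" or "file")."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue is not None:
        params["cmcontinue"] = cmcontinue  # ask for the next batch of results
    return API_URL + "?" + urllib.parse.urlencode(params)


def member_titles(response):
    """Return the titles listed in one decoded JSON response."""
    return [member["title"] for member in response["query"]["categorymembers"]]


def next_continue(response):
    """Return the token for the next batch, or None if the list is complete."""
    return response.get("continue", {}).get("cmcontinue")


# Invented response, for illustration only.
SAMPLE_RESPONSE = {
    "continue": {"cmcontinue": "subcat|EXAMPLE|123", "continue": "-||"},
    "query": {
        "categorymembers": [
            {"pageid": 1, "ns": 14, "title": "Category:Example subcategory A"},
            {"pageid": 2, "ns": 14, "title": "Category:Example subcategory B"},
        ]
    },
}

print(category_members_url("Category:Units"))
print(member_titles(SAMPLE_RESPONSE))
print(next_continue(SAMPLE_RESPONSE))
```

Each sub-category title returned this way can be passed back in as `category` to walk further down the tree, or typed into the 'Add pages from category' box on Special:Export.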

If you check the 'Save as file' box on the Special:Export page and then click the 'Export' button, you will get an XML file containing the source text of each page wrapped in simple XML tags.

These are the most important tags:

  • <page> contains all downloaded revisions of a page, and metadata about the page.
    • <title> is the page name.
    • <revision> is one revision of the page, including source text and metadata. If you unchecked the box 'Include only the current revision, not the full history' on the Special:Export page, there might be more than one revision per page. If you leave the box checked, you only get the latest revision of each page but it will still be wrapped in a <revision> tag.
      • <timestamp> is the timestamp of the revision.
      • <contributor> contains details of the user who made the revision.
      • <text> contains the full source text of the page, as it would appear in the edit box if you were editing the page manually. There is no XML markup within the page text, so you can't access template parameters just by parsing the XML. Angle brackets inside the page text are escaped as &lt; and &gt;.

You can use the programming language of your choice to either parse the XML or just split the file into substrings.
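
For example, the tags described above could be read in Python with the standard library's ElementTree module, roughly as sketched here. The sample XML is a cut-down, invented export. Real export files declare an XML namespace whose exact URI depends on the MediaWiki version, so this sketch strips the namespace from each tag name rather than hard-coding it.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def child_text(element, name):
    """Return the text of the first direct child called name, or ''."""
    for child in element:
        if local_name(child.tag) == name:
            return child.text or ""
    return ""


def iter_pages(root):
    """Yield (title, timestamp, text) for the newest revision of each page."""
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        revisions = [el for el in page if local_name(el.tag) == "revision"]
        if not revisions:
            continue
        # ISO 8601 timestamps in the same format sort correctly as strings.
        latest = max(revisions, key=lambda rev: child_text(rev, "timestamp"))
        yield (
            child_text(page, "title"),
            child_text(latest, "timestamp"),
            child_text(latest, "text"),
        )


# Invented export data, for illustration only.
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example unit</title>
    <revision>
      <timestamp>2017-11-10T09:00:00Z</timestamp>
      <contributor><username>ExampleUser</username></contributor>
      <text>Old text</text>
    </revision>
    <revision>
      <timestamp>2017-11-11T13:23:00Z</timestamp>
      <contributor><username>ExampleUser</username></contributor>
      <text>New text&lt;ref&gt;A source&lt;/ref&gt;</text>
    </revision>
  </page>
</mediawiki>"""

# For a downloaded file, use ET.parse(path).getroot() instead of ET.fromstring.
pages = list(iter_pages(ET.fromstring(SAMPLE_XML)))
for title, timestamp, text in pages:
    print(title, timestamp, text)
```

Note that the XML parser unescapes &lt; and &gt; for you, so the text comes back as ordinary wikitext with its angle brackets restored.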

Scraping HTML

You can scrape data straight out of HTML pages instead of accessing the source wikitext. The infoboxes are marked up as tables. The first row spans two columns and contains:

Each subsequent row has two columns:

  • The first column is a <th> tag containing the parameter label (this does not always match the parameter's actual name in the template source code).
  • The second column is a <td> tag containing the parameter value. For parameters that are turned into wikilinks, the value will be inside <a> tags. Some values can contain <ref> tags, which will be rendered in HTML as <sup> tags.
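
The row structure described above can be read with Python's built-in html.parser module, as in this rough sketch. The sample HTML is invented and much simpler than a real page. The parser keeps every table row that has exactly one <th> followed by one <td>, which is an assumption about how to recognise infobox rows rather than a rule taken from the site's markup. It also drops the contents of <sup> tags so that footnote markers do not end up in the values.

```python
from html.parser import HTMLParser


class InfoboxRows(HTMLParser):
    """Collect (label, value) pairs from rows with one <th> and one <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None    # cells of the row being read, or None outside <tr>
        self._current = None  # [tag, text parts] of the cell being read
        self._skip = 0        # nesting depth inside <sup> (footnote markers)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("th", "td") and self._cells is not None:
            self._current = [tag, []]
        elif tag == "sup":
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._current is not None:
            kind, parts = self._current
            # Collapse runs of whitespace in the cell text.
            self._cells.append((kind, " ".join("".join(parts).split())))
            self._current = None
        elif tag == "tr" and self._cells is not None:
            if [kind for kind, _ in self._cells] == ["th", "td"]:
                self.rows.append((self._cells[0][1], self._cells[1][1]))
            self._cells = None
        elif tag == "sup" and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._current is not None and not self._skip:
            self._current[1].append(data)


# Invented HTML, for illustration only.
SAMPLE_HTML = """<table>
<tr><th colspan="2">1st Example Battalion</th></tr>
<tr><th>Country</th><td><a href="/WorldWarOne/Exampleland">Exampleland</a></td></tr>
<tr><th>Type</th><td>Infantry<sup><a href="#cite_note-1">[1]</a></sup></td></tr>
</table>"""

parser = InfoboxRows()
parser.feed(SAMPLE_HTML)
for label, value in parser.rows:
    print(label, "=", value)
```

Because the labels do not always match the template parameter names, you may still need the template documentation to map scraped labels back to parameters.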


Limitations of the data

This site is a work in progress. See Populating lists of military units for more details of coverage and progress. We currently have pages for over 7,000 units. These are mostly infantry and mounted units from the British Empire and the American Expeditionary Force.

The data on most existing pages is still quite rough. Many tactical relationships are probably missing, and dates for relationships and theatres are often only accurate to the nearest month: typically the start date is given as the first of the month and the end date as the last of the month, regardless of the actual date.
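
If you are analysing dates, it may help to flag ranges that look as if they have been rounded to whole months. The Python sketch below assumes you have already converted the start and end dates into `datetime.date` objects. How dates are written in the template parameters is not covered here, and the function name is just an illustration.

```python
import calendar
from datetime import date


def looks_month_rounded(start, end):
    """True if start is the 1st of a month and end is the last day of a month."""
    last_day = calendar.monthrange(end.year, end.month)[1]
    return start.day == 1 and end.day == last_day


print(looks_month_rounded(date(1916, 7, 1), date(1916, 11, 30)))
print(looks_month_rounded(date(1916, 7, 1), date(1916, 11, 18)))
```

A range that passes this check might still be exact, so treat the result as a hint that the precision is uncertain rather than as proof of rounding.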

American division-corps-army relationships should be more or less complete, with exact dates.

Some tactical relationships between British Empire divisions, corps and armies have now been added but the dates are nearly all missing, and some relationships are probably still missing because we don't have a complete list.

British administrative relationships are mostly complete for the units that have pages, but dates are likely to be missing. Most pages don't have a complete history of the unit's full names.