Repository Statistics - Trying to measure the wiggly worm

Wednesday, 27 March 2013

Introduction

Repository statistics on web traffic and downloads, and on the number and type of objects held, are an important tool for institutional repository managers: they help evaluate the success of the repository, plan marketing campaigns for external and internal users and, one hopes, provide evidence to the institution that the repository is delivering value. However, it is important to use repository statistics in context. This means not over-estimating what the data can tell you, not underestimating the possibility that statistics will be misused by the community, and staying aware of the limitations and bugs in existing repository software statistics packages. This guide looks at some of the basics of repository statistics, what they are useful for, and some of the risks involved in misusing them, and provides links to further resources.

Unfortunately, there are currently no reliable ways to share statistics across repositories, and for most repository software, getting substantial value out of statistics will take a substantial amount of work.

What Repository Statistics are available?

Repository statistics can provide both content and usage statistics:

  • Content statistics are based on metrics such as the number of objects in a repository. These can usually be broken down further by facets such as author, subject or publication type. Even in the absence of a reporting tool, it is usually easy enough to transcribe item counts from the user interface of the repository manually – or to gather them automatically, as in the sketch after this list.
  • Usage statistics packages attempt to measure web site visits by real people and filter out visits by machines, such as the web crawlers that search engines use to build their indexes. Usage statistics can tell you:
    • How many hits (page loads) and visits you are getting to your repository.
    • How many downloads the content files (usually PDFs) are getting, and how many times metadata pages have been viewed.
    • The likely geographical location of visitors, based on IP address (sometimes with interactive maps).
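As an illustration of gathering content statistics automatically rather than transcribing them by hand, here is a minimal Python sketch (standard library only) that tallies records by type from a repository's OAI-PMH feed – the same feed that ROAR harvests (see the Quick Start Guide below). The endpoint URL is a hypothetical placeholder, and real repositories may record publication type somewhere other than dc:type.

    # A minimal sketch: count repository records by dc:type via OAI-PMH.
    import urllib.request
    import xml.etree.ElementTree as ET
    from collections import Counter

    OAI = "http://repository.example.edu.au/oai"  # hypothetical endpoint
    NS = {
        "oai": "http://www.openarchives.org/OAI/2.0/",
        "dc": "http://purl.org/dc/elements/1.1/",
    }

    def count_by_type(endpoint):
        """Tally records by dc:type (article, thesis, ...), following
        resumption tokens so the whole repository is counted."""
        counts = Counter()
        url = endpoint + "?verb=ListRecords&metadataPrefix=oai_dc"
        while url:
            tree = ET.parse(urllib.request.urlopen(url))
            for rec in tree.iter("{http://www.openarchives.org/OAI/2.0/}record"):
                dctype = rec.find(".//dc:type", NS)
                counts[dctype.text if dctype is not None else "unknown"] += 1
            token = tree.find(".//oai:resumptionToken", NS)
            # tokens are assumed URL-safe here; quote them if yours are not
            url = (endpoint + "?verb=ListRecords&resumptionToken=" + token.text
                   if token is not None and token.text else None)
        return counts

    for dctype, n in count_by_type(OAI).most_common():
        print(f"{n:6d}  {dctype}")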

How are Repository Statistics useful?

Taken together, 'content' and 'usage' statistics can provide data to help demonstrate the effectiveness of the repository in fulfilling its intended aims.
Content statistics can be used for:

  • Reporting on institutional research areas by subject/publication type.
  • Reporting on individual researcher growth in publications.
  • Reporting on repository growth (and therefore institutional research growth) by individual researcher or faculty.

Usage statistics can be used:

  • As a tool for promoting repository deposits. This is one of the main strategies used in the Open Access repository movement (e.g. EPrints' built-in 'top downloads' list – but see the section below on potential problems).
  • As an analytical tool for identifying repository problems. To give a real example, many Australian repositories report that around one third of repository hits originate from search engines. In research for this page we came across an institution whose repository statistics showed virtually no hits originating from search engines – a potential problem worth investigating: perhaps the site is not well indexed, or some other configuration issue is hiding the search-engine traffic.
  • For demonstrating repository use institutionally, nationally and internationally.
  • For demonstrating popular research publications/areas of an institution.
  • For a variety of internal reporting processes.

Quick Start Guide

Our advice about approaching statistics in your repository is:

  1. Start with your repository’s statistics package and see what it does ‘out of the box’. You can contact other CAIRSS members, consult the spreadsheet we are compiling as a community, ask on mailing lists, or approach the vendor. Even if the stats don’t do what you want in the current release, one potential strategy is to wait for a new release of the software that does do what you require.
  2. For simple content statistics, and to benchmark against other repositories, register your repository with ROAR, as described on this page. This involves adding a link to your repository’s OAI-PMH feed on the ROAR site. ROAR harvests your repository and shows the growth of the content as a graph. Australian repositories are listed on this page. (It is worth checking that your OAI-PMH endpoint responds correctly before registering – see the sketch after this list.) Unfortunately, ROAR does not distinguish between the types of content in repositories in a very useful way; see the final section of this document on what we might do about that as a community in the future.
  3. If the built-in repository stats in your repository are not meeting requirements, then the simplest way to get detailed reporting is to add a service like Google Analytics or W3Counter (recommended by Tim McCallum – CAIRSS technical officer 2009-2011). These services require you to add some code to the page template used for your repository, and will track usage, referring sites, geographical data about readers and so on. They do not understand the details of your repository – such as the difference between navigation pages and item pages – but they are useful and very easy to deploy. Sign up for Google Analytics here (free up to 5,000,000 page views per month): https://www.google.com/analytics/provision/ W3Counter is here: http://www.w3counter.com/
  4. Looking ahead, the most promising development in repository statistics is PIRUS2 – a standards-based way to collect audited statistics that can potentially be shared and compared across repositories and journals. We recommend that the better-resourced CAIRSS institutions (i.e. those who have staff to set it up) start to explore this service. We return to this in the final section.
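A quick way to sanity-check the OAI-PMH feed mentioned in step 2 before registering it with ROAR is to issue the Identify verb and confirm the response parses. A minimal Python sketch, assuming a hypothetical endpoint URL:

    # Check that an OAI-PMH endpoint answers the Identify verb correctly,
    # printing the basic details a harvester such as ROAR would rely on.
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "http://repository.example.edu.au/oai"  # hypothetical endpoint
    NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    with urllib.request.urlopen(OAI + "?verb=Identify") as resp:
        root = ET.parse(resp).getroot()
    for field in ("repositoryName", "baseURL", "protocolVersion",
                  "earliestDatestamp"):
        el = root.find(f".//oai:{field}", NS)
        print(f"{field}: {el.text if el is not None else 'MISSING'}")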

How are CAIRSS Community Repository Managers currently utilising Repository Statistics?

65% of CAIRSS community repository managers report using both end-user and management/back-end repository statistics.

However, some CAIRSS Community repository managers report that the statistics component of their repository software has never worked properly, while others express concerns about the reliability of the data in the repository statistics currently on offer. When asked ‘What other types of statistics would you like to capture for/about your repository that your repository software does not currently offer?’, CAIRSS Community repository managers reported the following:

  • The ability to produce statistics reports based on Fields of Research (FoR) codes
  • A function to calculate and report HERDC points statistics
  • Statistics reports of records with full text attached (identifying restricted and open access)
  • Counts of publications (1) deposited and (2) processed and made live
  • Easily accessible statistics for individual researchers
  • OAI harvesting activity on the repository server by service providers
  • Better display of statistics (especially on metadata pages)
  • Number of edits/updates per day

Some responded to this question stating that they would simply like the repository statistics packages to work properly and deliver the basic statistics advertised for the product/repository software.

Technical approaches to collecting Repository Statistics

The goal of a usage stats package is to produce a useful report on who or what is using your repository. The basic mechanism for this is to:

  1. Count how many times the various resources that make up a repository are requested over the web (ie viewed or downloaded).
  2. Work out where those requests are coming from:
    1. Geographically (usually based on IP address).
    2. In terms of referrals: whether the request was referred by another site such as Google Scholar, a publisher, or the repository itself – or, in the absence of a referrer, arrived via a bookmark or a link in a PDF (see the sketch below).
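As a small illustration of step 2b, here is a Python sketch of classifying the recorded referrer of each hit. The hostname rules are illustrative assumptions, not a complete list, and the repository hostname is a hypothetical placeholder:

    # Classify where a request came from, based on its HTTP Referer value.
    from urllib.parse import urlparse

    REPOSITORY_HOST = "repository.example.edu.au"  # hypothetical

    def classify_referrer(referrer, repo_host=REPOSITORY_HOST):
        if not referrer or referrer == "-":
            return "direct"        # bookmark, typed URL, or a link in a PDF
        host = urlparse(referrer).hostname or ""
        if host == repo_host:
            return "internal"      # navigation within the repository itself
        if "scholar.google" in host:
            return "google-scholar"
        if "google." in host or "bing." in host:
            return "search-engine"
        return "other-site"        # e.g. a publisher or a course page

    # e.g. classify_referrer("http://scholar.google.com/scholar?q=...")
    # returns "google-scholar"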

But the raw numbers from such counting are not necessarily useful. Downloads or page-views will include requests by search-engine crawlers (‘robots’) and by repository administrators who may request a page several times in a row. Then there is the potential that someone might try to increase their stats by hitting ‘Refresh’ over and over in their browser. Stats programs are typically not that easily fooled, though, and will treat repeat requests from the same user differently from requests coming from identifiably different computers.

Technically, there are three main approaches to collecting repository usage statistics:

  • Using web-server statistics

One simple approach to repository statistics is to use web-server log files. Most repository applications sit inside a host web server (such as Apache). Standard log-file analysers such as AWStats (http://awstats.sourceforge.net/) can produce reports about the number of requests made on a repository, filtering out requests from machines.

[Screenshot: an example AWStats report from the Griffith repository]

AWStats and other basic stats packages are standard web infrastructure; you should be able to ask a systems administrator to install one for you. The problem with this approach is that, without additional processing, the logs are quite generic – they don’t distinguish between a repository item and other pages, for example. There have been some attempts to build open source packages that help make sense of statistics in repository terms, but our investigation shows that most of these efforts are no longer supported.
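To make the idea of ‘additional processing’ concrete, here is a sketch that reads an Apache combined-format access log and separates item views and PDF downloads from other traffic, discarding obvious robots. The URL patterns and robot strings are assumptions about one hypothetical repository layout, not universal rules:

    # Summarise an Apache combined-format access log in repository terms.
    import re
    from collections import Counter

    LOG_LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
        r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
    )
    ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")  # crude, illustrative

    def summarise(log_path):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                m = LOG_LINE.match(line)
                if not m or m.group("status") != "200":
                    continue  # skip unparseable lines and non-200 responses
                if any(h in m.group("agent").lower() for h in ROBOT_HINTS):
                    counts["robot"] += 1
                    continue
                path = m.group("path")
                if path.endswith(".pdf"):
                    counts["download"] += 1    # full-text download
                elif "/view/" in path:         # hypothetical item-page pattern
                    counts["item-view"] += 1
                else:
                    counts["other-page"] += 1  # navigation, search, browse...
        return counts

    # e.g. print(summarise("/var/log/apache2/access.log"))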

  • Using ‘bugs’

One of the easiest ways to count web site usage statistics is via services like Google Analytics, which work with a ‘bug’ in each page on your site. This is typically a small bit of JavaScript code that runs on every page load. Every time someone looks at a page, their browser sends a request to the stats provider (which could be Google, another service provider, or your own site), which then counts that page-view. One very substantial issue is that this approach will miss direct downloads of PDF files, which can be a very substantial and important segment of repository traffic. This approach also shares the problem of AWStats et al. in that it is not ‘repository aware’.
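To show the mechanism in miniature, here is a sketch of the counting side of a page ‘bug’: a tiny HTTP endpoint that records one page-view per beacon request. Real services such as Google Analytics layer visitor de-duplication, sessions and geolocation on top of this idea; the port, path and parameter name here are hypothetical:

    # A toy 'bug' collector: each page embeds a snippet that requests
    # /beacon?page=<its own URL>, and we count those requests.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs
    from collections import Counter

    page_views = Counter()

    class BeaconHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            url = urlparse(self.path)
            if url.path == "/beacon":
                page = parse_qs(url.query).get("page", ["unknown"])[0]
                page_views[page] += 1
            # a direct PDF download never runs the page's JavaScript,
            # so it never reaches this endpoint - the blind spot above
            self.send_response(204)  # empty reply; the page is undisturbed
            self.end_headers()

    HTTPServer(("localhost", 8000), BeaconHandler).serve_forever()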

  • Using built-in or add-on repository statistics functionality

Each repository package in use in Australia has some form of statistics reporting, but the offerings vary by version, and many have been customised. CAIRSS simply does not have the resources to cover all of the software and provide advice, so we recommend that user groups put pressure on vendors and open source maintainers for the features they need. CAIRSS can assist if asked.

The potential for problems

On the web, there is an economy of page-views and clicks, driven by advertising. Spammers have the resources to mount extremely widespread campaigns, sending out vast volumes of nonsense designed just to confuse filtering software (internet email traffic is estimated to be 97% spam), and the search engine optimisation industry has resulted in a web littered with sites designed to rank highly and attract clicks, usually for the purpose of hosting advertisements. We raise this issue in order to point out that the web is not a governed space. There is a risk that if repository statistics become a form of currency – linked to promotion, for example – the kinds of dirty tricks listed above will come into play in the academic world.

The simplest kind of dirty trick would be for a researcher to try to increase their own repository access statistics by repeatedly loading a page. Modern usage stats packages should not be so easily fooled, but it is worth checking your repository to see whether repeated clicking updates a simple counter (we have seen this naïve behaviour implemented in a commercial repository solution).
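For comparison, here is a sketch of the kind of de-duplication a well-behaved stats package applies, so that hammering ‘Refresh’ does not inflate a counter. The 30-second window echoes the double-click filtering in the COUNTER Code of Practice (see the final section), but the exact rule here is an assumption for illustration:

    # Repeat requests for the same item by the same visitor within a
    # short window are counted once.
    WINDOW_SECONDS = 30

    def filtered_count(events):
        """events: (timestamp_seconds, visitor_id, item_id) tuples in
        time order; returns the de-duplicated view count."""
        last_seen = {}  # (visitor, item) -> timestamp of last request
        count = 0
        for ts, visitor, item in events:
            key = (visitor, item)
            if key not in last_seen or ts - last_seen[key] > WINDOW_SECONDS:
                count += 1
            last_seen[key] = ts
        return count

    # Ten refreshes in quick succession by one visitor count as one view:
    hits = [(t, "visitor-1", "item-42") for t in range(10)]
    assert filtered_count(hits) == 1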

But there are more sophisticated ways to increase statistics that would be practically impossible to detect. The risk is that if download or page-view stats are used as metrics with real consequences – being used in promotion processes, say, or tied to funding – then there will be a real incentive for researchers to start trying to game them. There’s a business opportunity here: one could set up a ‘research impact optimising service’ which, one way or another, resulted in more page-clicks in the repository. Researchers could pay, downloads would go up, and it would all be done at arm’s length. On a simpler note, researchers could simply pay for advertisements that pointed readers at the repository when certain search terms were used.

So, our recommendation is that in order to keep statistics useful, they should not be tied to rewards or made too much of. It is certainly useful to be able to watch trends in your repository, and useful for researchers to see where their research is being consumed and in what volume (assuming they are not responsible for inflating the stats themselves, of course).

A much more reliable indicator, in that it is harder to manipulate, is citations. At this stage there is no practical way for an individual repository to do very much to track citations without doing deals with commercial entities, but some information on citations is available from Google Scholar and from commercial services. Some repository software has integration with commercial citation services – notably the Fez system as implemented at the University of Queensland and presented at the CAIRSS community day in 2010 by Andrew Martlew.

Comparing Repository Statistics: Apples and Oranges

While repository managers can use the tools described above to manage statistics on a per-site basis, there is no reliable way to compare or benchmark repositories, as the variety of tools in use report on different things in different ways.

In 2009 the CAIRSS central team proposed a CAIRSS service that would have taken quarterly snapshots of the contents of IRs in Australia, as exposed via OAI-PMH, which would have allowed detailed reporting of trends across types of resource, repository software type, etc. If there is any interest in this, it could be revisited in the 2011-and-onwards version of CAIRSS.

At the moment the range of repository statistics reporting tools being used by CAIRSS members is not suitable for benchmarking or comparison. As an APSR project (BEST) put it:
“The reliability of this information is limited as there is no standard form of presenting and filtering the data and no service which aggregates the information across institutions to allow higher level analysis and comparisons” (http://www.apsr.edu.au/best/index.htm)
If there is community interest in being able to do this kind of reporting, then the main issue is how to compare usage statistics.

Real apples/apples comparisons

The COUNTER project specifies a way for repositories to supply statistics about usage in a standard format, and requires periodic audits confirming that the software used to collect usage data is conservative in eliminating multiple clicks, hits from robots, etc. Only one of the repository solutions used by Australian institutions is on the COUNTER-compliant list – the BePress software.

There have been a number of international efforts to benchmark repositories and produce software toolkits, e.g. PIRUS and PIRUS2, and the Interoperable Repository Statistics project (IRS).

The PIRUS2 project has been successfully completed. You can download the PIRUS2 Final Report. As part of the exit strategy, the team drafted a COUNTER PIRUS Code of Practice.