Getting into Google, Google Scholar and other search engines Thursday, 5 September 2013

Quick start

If you want to get started straight away, we offer this summary of three things to do:

  1. Get into Google Scholar. The single best starting point for getting good exposure is to follow Google’s instructions for how to be indexed in Google Scholar and then contact them using this form. For more discussion, see the section Getting into Google Scholar below.
  2. Get harvested by as many aggregation sites as possible. To do this, follow the harvesting guide.
  3. Build a good site. Don’t spend time working on search engine optimization1; spend time on repository usability and on promoting the repository to your target audience, so that people cite your research outputs and link to them.
    • Use standard markup such as titles and headings to structure your page, and do not make any content dependent on JavaScript to display. Use meta elements for metadata; this is covered in detail in the Google Scholar guide.
    • A site with incoming links will be indexed by Google et al. without the repository manager doing anything, as long as you don’t ban web crawlers from indexing your site.
    • Google Scholar is more particular about what it indexes, hence the advice in point one. Follow their guide.
    • Make sure your repository has clear browse paths, e.g. by Author, Subject or Title, so users can discover content without searching. This also helps search engines’ crawlers find and index your content. Frequent updates will generally increase the frequency with which search engines index your site; frequent additions of fresh content may also influence search engine rankings, but this is not clear.

Getting into Google et al.

The basics

Since the early days of the web, there have been services which provide full-text search indexes of the web. These indexes are constructed by web-crawling software which essentially browses the web a page at a time and indexes its textual content. Broadly speaking, the crawlers follow links, and cannot discover pages that are not linked from somewhere.

Google famously improved search results about ten years ago with its PageRank algorithm, which took into account how many links pointed to each page in order to rank results. The indexing processes used by search engines today are trade secrets and differ from engine to engine, so trying to optimise your exposure by second-guessing them would be very resource intensive, not to mention risky: attempts to ‘game’ the search engines typically result in a site being dropped from the index.

Why be indexed?

Most repositories are designed to promote an institution’s research output by making it discoverable. The main source of discovery, by far, is search engines. Even though the majority of traffic to your repository may not come directly from a search engine, chances are that much of it comes from links and citations where the original discovery was via a search engine.

A recent, very informal survey of CAIRSS sites showed that typically around 40% of traffic comes from search engines, and almost all of that search engine traffic is from Google2.

Most CAIRSS respondents to the informal survey reported that aggregating sites such as Australian Research Online (ARO) and Trove account for only a few percent of overall traffic, but it is still important to be represented in such sites for a couple of reasons:

  • Their incoming links almost certainly raise the status of your repository with the search engines.
  • A 2008 survey of visitors to the ARO site showed that users were mainly researchers and students seeking information on a specific topic. These are the kinds of searches that are likely to result in increased downloads and citations.

It is very difficult to make direct comparisons between sites as there is a huge amount of variation in the way statistics are collected. CAIRSS is working on a guide to repository stats.

Strategies for being indexed

  • Make sure the repository is harvested by discovery services. See this page listing discovery services. As noted above, these are typically not currently major sources of traffic but they do contribute to the rank of your site. And Trove in particular is experiencing very rapid growth, so we may see increases in traffic from there as users searching for Australian content of all kinds discover things from institutional repositories alongside the many other kinds of content indexed in Trove.
  • Follow the advice set out in the Google Scholar Inclusion Guidelines for Webmasters. All of that advice is relevant to making your site visible in Google et al. as well as Google Scholar, with the exception of the Indexing Guidelines section.

Getting into Google Scholar

Google Scholar is very important as it is the largest open index of scholarship accessible to most readers. It does a good job of finding multiple versions of scholarly articles and theses across publisher and database sites as well as open access copies, and it provides interfaces that make it easy for users to export references into reference management software such as Zotero, EndNote and Mendeley (to name a few). Getting open access versions of articles into Google Scholar is a key way to further the OA agenda.

CAIRSS and the Trove team from the National Library of Australia have been in discussion with the Google Scholar team. Google Scholar is interested in:

  • Full text, by which they mean HTML and PDF versions of scholarly material. If your repository does not have a lot of full text it will not be well indexed by Google Scholar.
  • Material that they won’t find elsewhere, such as full-text thesis content.

Google Scholar is not set up to index discovery sites like the NLA’s Trove; it is designed to work on repositories which contain the full text of articles. So even if the NLA had all the metadata available to it, supplied over OAI-PMH in Dublin Core, it could not build an effective Google Scholar site without the full text. That said, being indexed by Trove is important as it will make your repository more visible to search engines and increase traffic (if only by a little).
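To make the OAI-PMH mechanism concrete: an aggregator harvests your repository with plain HTTP requests built from a handful of query parameters. A minimal sketch follows; the base URL is an invented placeholder, while `ListRecords` and `oai_dc` (Dublin Core) come from the OAI-PMH protocol itself.

```python
# Sketch of how an aggregator's OAI-PMH harvesting request is built.
# The repository base URL below is invented; oai_dc is the Dublin Core
# metadata format that every OAI-PMH repository must support.
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # incremental harvest since this date
    return base_url + "?" + urlencode(params)

url = list_records_url("http://repository.example.edu/oai", from_date="2010-01-01")
print(url)
```

An aggregator typically issues this request on a schedule, passing the date of its last successful harvest as `from` so it only fetches new and changed records.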

The Google Scholar documentation is very clear that the full-text download for an item must be ‘in a subdirectory’; that is, the URL must be ‘under’ the metadata page. To use a real example, this is the metadata page for an item in the USQ repository:

And the full text is in a ‘subdirectory’ (though no actual directories are involved):

This should not, in general, be a problem for your repository, as most are set up so that data streams are referenced in this way. But if, for example, you are using handles to reference PDF files then Google Scholar may not index them, although they may have coded something special for your particular repository software. This could also be a problem for some non-mainstream proprietary repository software, some of which appears to use indirect links. Contact CAIRSS if you have a problem.

The best way to be indexed in Google Scholar is to add metadata to your summary pages. This is covered in the Indexing Guidelines section of their guide. There are a variety of formats that can work, but one in particular is recommended. This is supported ‘out of the box’ by EPrints; for other platforms, CAIRSS can assist by networking you with sites using the same software.
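As a rough illustration of what that summary-page metadata looks like: Google Scholar’s guidelines describe `citation_*` meta elements (Highwire Press style) embedded in the HTML head of each item page. The tag names below are from those guidelines; the bibliographic values and URL are invented, and your repository software would normally generate these for you.

```python
# Sketch: render bibliographic fields as the <meta> elements that
# Google Scholar's inclusion guidelines describe. Field names are real
# (citation_title etc.); the example record is invented.
from html import escape

def scholar_meta_tags(record):
    """Render a dict of bibliographic fields as <meta> elements."""
    tags = []
    for name, value in record.items():
        values = value if isinstance(value, list) else [value]
        for v in values:  # repeat the element for multi-valued fields
            tags.append(f'<meta name="{escape(name)}" content="{escape(v)}">')
    return "\n".join(tags)

example = {
    "citation_title": "An Example Thesis",
    "citation_author": ["Smith, Jane", "Jones, Alex"],  # one tag per author
    "citation_publication_date": "2010/05/01",
    "citation_pdf_url": "http://repository.example.edu/123/thesis.pdf",
}
print(scholar_meta_tags(example))
```

Note the repeated `citation_author` element, one per author; repeating the tag is how multi-valued fields are expressed in this format.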

CAIRSS sites: if you have configured your repository, or want help doing so, then please contact CAIRSS and we will compile a technical guide, using information from the community, on how to set up the metadata for various software.

We are not going to reproduce Google’s very clear advice here, and recommend that CAIRSS sites follow their guide and then contact them using this form.


Not being indexed?

The number one reason you might not be able to find content in Google et al. is that your robots.txt file prevents access to their crawlers. This is worth checking. For example, CAIRSS discovered that one of the commercial software vendors recently shipped updates to their software which block access to web crawlers by default.
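One quick way to check is to feed your robots.txt to Python’s standard robot-exclusion parser and ask whether a crawler may fetch a given page. The policy and URL below are illustrative; the example mimics the kind of over-broad block that can accidentally hide a whole repository.

```python
# Check a robots.txt policy offline with the standard library parser.
# This example policy disallows everything for every user agent.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot (and every other crawler) is locked out of every page:
print(rp.can_fetch("Googlebot", "http://repository.example.edu/items/123"))
```

In practice you would fetch `http://your-repository/robots.txt` and test the real URLs of a few item pages and full-text files.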

Other common reasons for a lack of indexed pages would be:

  • If your site has had extended periods of down-time, search engines will eventually start dropping your content from the index.
  • If your site has changed its URLs without putting redirects in place so that the old URLs map to new ones. This is a basic governance process which needs to be in place for all of an institution’s websites, but it is particularly important for repositories. If you change domain names or software, or upgrade some software packages, then you may need to add redirect rules which permanently redirect browser clients to the new home of each URL. Note that using Handles does not automatically or magically fix this problem, for two main reasons:
    • The Handles database needs to be updated with new URLs, something which we have found that even commercial vendors do not do when upgrading a repository.
    • Many incoming links to the repository will use the ‘ordinary’ URL for full-text items and metadata pages; if these links break, it will undermine the integrity of your repository and its usefulness in exposing research via increased readership and citation rates.
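Redirect chains of the kind described above (old domain → interim URL → current URL) can be collapsed mechanically so that each old URL issues a single permanent redirect straight to its final destination. A minimal sketch, assuming you can export your redirect rules as an old-URL to new-URL mapping (the URLs are invented):

```python
# Collapse chains of redirects (a -> b -> c) so each old URL points
# straight at its final destination with one permanent redirect.
def flatten_redirects(rules):
    """rules: dict mapping old URL -> target URL. Returns a flattened copy."""
    flat = {}
    for old in rules:
        seen = set()
        target = rules[old]
        while target in rules and target not in seen:  # follow the chain
            seen.add(target)
            target = rules[target]
        flat[old] = target
    return flat

rules = {
    "http://old.example.edu/item/1": "http://www.example.edu/repo/item/1",
    "http://www.example.edu/repo/item/1": "http://repository.example.edu/item/1",
}
print(flatten_redirects(rules))
```

Single-hop redirects matter here for the referrer problem discussed below as well: the fewer hops, the more likely the original referrer (often a search engine) survives into your statistics.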

Not seeing traffic referred from search engines?

This is not necessarily a question to do with search engines, but, as noted above, we have found that some repository software uses a lot of redirects. The URL a user sees and clicks on may point to one university site, which then redirects to the repository. The stats module in the repository may then report that all the traffic is coming from university referrers, even though the original referring site has roughly a 40% chance of being a search engine.

The same thing is being indexed multiple times?

If the same content appears on your site at more than one URL, that can dilute the impact of the content in a search index. This can happen when the repository adds session information to its URLs, or when multiple views of the same content are set up. The solution is to add metadata to all the pages showing the same content to say which version is the canonical one. See this guide, which works for multiple search engines.
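The mechanism the guide describes is a `<link rel="canonical">` element in the page head pointing at the authoritative URL. A hedged sketch of emitting one (the URL is invented, and simply dropping the whole query string, as below, is only safe if no query parameters are meaningful on your site):

```python
# Emit the <link rel="canonical"> element that tells search engines which
# URL is the authoritative copy, stripping session/query noise first.
from urllib.parse import urlsplit, urlunsplit
from html import escape

def canonical_link(url):
    """Drop query string and fragment, then render the canonical link tag."""
    parts = urlsplit(url)
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(clean)}">'

print(canonical_link("http://repository.example.edu/item/123?session=abc123#view"))
```

Every page that shows the same item (session-decorated URLs, alternative views) would carry the same canonical element, so the search engine consolidates them into one index entry.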

Other resources

We will not attempt to provide a complete guide here, and we do not have the resources to keep one up to date. For general indexing/search issues, try Google’s webmaster help forum.


This document was compiled by Peter Sefton with input from the CAIRSS team and the CAIRSS community.

Copyright USQ, 2010. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia.


1For a blunt exposition on why Search Engine Optimization is not helpful see Spammers, Evildoers and Opportunists, by Derek Powazek.

2 There are lots of complexities here, for example many ARROW/VITAL sites banned crawlers to make previous versions of the software more stable, meaning they have small footprints in Google, and some sites have URLs which are re-directed a couple of times meaning the repository statistics module sees only referring traffic from the same institution.