Google Scholar API

 

Problem

We would like to find out whether an API for Google Scholar exists. If not, we might use a web scraper instead.

 

The Bad News

As of today (Feb 4, 2015), Google does not provide an API for Google Scholar data. An issue ticket requesting one has been open since 2008.

Alternate Solution

  1. Use a web scraper, scholar.py (https://github.com/ckreibich/scholar.py). It is a Python module that implements a querier and parser for Google Scholar's output; it can also be invoked as a command-line tool.

    • Examples of Usage:

      # Retrieve 1 article written by Einstein on quantum theory:
      python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"
      
      # Retrieve 1 article written by Einstein on quantum theory and save it to a text file:
      python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory" > test.txt
      
      # Retrieve 20 articles written by Einstein on quantum theory and save them to a text file:
      python scholar.py -c 20 --author "albert einstein" --phrase "quantum theory" > test.txt
      
      # Retrieve a BibTeX entry for that quantum theory paper:
      python scholar.py -c 1 -C 8987828492054530436 --citation bt
      
      # Retrieve five articles written by Einstein after 1970 where the title
      # does not contain the words "quantum" and "theory":
      python scholar.py -c 5 -a "albert einstein" -t --none "quantum theory" --after 1970
    • For a complete list of options, run

      python scholar.py


      This prints the following help text:

      Options:
        -h, --help                                show this help message and exit
        Query arguments:
          These options define search query arguments and parameters.
          -a AUTHORS, --author=AUTHORS            Author name(s)
          -A WORDS, --all=WORDS                   Results must contain all of these words
          -s WORDS, --some=WORDS                  Results must contain at least one of these words. Pass
                                                  arguments in form -s "foo bar baz" for simple words, and
                                                  -s "a phrase, another phrase" for phrases
          -n WORDS, --none=WORDS                  Results must contain none of these words. See -s|--some
                                                  re. formatting
          -p PHRASE, --phrase=PHRASE              Results must contain exact phrase
          -t, --title-only                        Search title only
          -P PUBLICATIONS, --pub=PUBLICATIONS     Results must have appeared in this publication
          --after=YEAR                            Results must have appeared in or after given year
          --before=YEAR                           Results must have appeared in or before given year
          --no-patents                            Do not include patents in results
          --no-citations                          Do not include citations in results
          -C CLUSTER_ID, --cluster-id=CLUSTER_ID  Do not search, just use articles in given cluster ID
          -c COUNT, --count=COUNT                 Maximum number of results
        Output format:
          These options control the appearance of the results.
          --txt                                   Print article data in text format (default)
          --txt-globals                           Like --txt, but first print global results too
          --csv                                   Print article data in CSV form (separator is "|")
          --csv-header                            Like --csv, but print header with column names
          --citation=FORMAT                       Print article details in standard citation format.
                                                  Argument Must be one of "bt" (BibTeX), "en" (EndNote),
                                                  "rm" (RefMan), or "rw" (RefWorks).
        Miscellaneous:
          --cookie-file=FILE                      File to use for cookie storage. If given, will read any
                                                  existing cookies if found at startup, and save resulting
                                                  cookies in the end.
          -d, --debug                             Enable verbose logging to stderr. Repeated options
                                                  increase detail of debug output.
          -v, --version                           Show version information
    • Limitations

      • No more than 20 entries can be fetched per query (Google Scholar shows at most 20 results per page).

      • It currently *only* processes the first results page; it is not a recursive crawler.
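
Given these limits, a wrapper script may want to cap the requested count before calling the tool. The following is a minimal sketch and not part of scholar.py: `build_scholar_command`, `run_scholar` and `MAX_RESULTS` are hypothetical names, and it assumes that `python` is on the PATH and that scholar.py sits in the current directory. It uses only the -c, --author, --phrase and --csv-header options listed in the help text above.

```python
import subprocess

# Hypothetical constant: scholar.py reads only the first results page (20 entries).
MAX_RESULTS = 20


def build_scholar_command(author, phrase, count, script="scholar.py"):
    """Build the argument list for one scholar.py query, capping count at 20."""
    if count < 1:
        raise ValueError("count must be at least 1")
    capped = min(count, MAX_RESULTS)
    return [
        "python", script,
        "-c", str(capped),
        "--author", author,
        "--phrase", phrase,
        "--csv-header",
    ]


def run_scholar(author, phrase, count):
    """Run the query and return scholar.py's standard output as text."""
    cmd = build_scholar_command(author, phrase, count)
    # check=True raises CalledProcessError if scholar.py exits with an error.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `run_scholar("albert einstein", "quantum theory", 50)` would send a live request to Google Scholar and ask for 20 results rather than 50. Passing the arguments as a list avoids shell quoting problems with multi-word author names and phrases.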


  2. Microsoft Academic Search does offer an API. You need to request a key, but otherwise it provides full programmatic access to the same results that the web interface returns.
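
If the scraper route (option 1) is taken, the output of `--csv-header` can be loaded into Python for further processing. The sketch below only illustrates reading "|"-separated text with a header row. `parse_scholar_csv` is a hypothetical helper, the sample data is made up, and the column names in it are illustrative assumptions rather than the tool's confirmed header. It also assumes that fields are not quoted.

```python
import csv
import io


def parse_scholar_csv(text):
    """Parse '|'-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="|",
        quoting=csv.QUOTE_NONE,  # assumption: the output does not quote fields
    )
    return [dict(row) for row in reader]


# Made-up sample data; the column names are illustrative assumptions.
sample = (
    "title|year|num_citations\n"
    "Example Paper A|1999|12\n"
    "Example Paper B|2004|3\n"
)

rows = parse_scholar_csv(sample)
for row in rows:
    print(row["title"], row["year"], row["num_citations"])
```

All values come back as strings, so numeric columns need an explicit `int(...)` conversion before sorting or summing. A title that itself contains "|" would split into extra columns under this approach.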