Google Scholar API
Problem
We want to find out whether Google Scholar offers an API. If it does not, we may have to fall back on a web scraper.
- Available resources:
- Answer found on the web:
"Google doesn't have an API for Scholar likely for the same reason they don't have an API for web search - it would get overwhelmed by applications creating aggregation platforms (and running continuous queries) versus applications that just run on-demand, user-initiated lookups (like Mendeley linking out to Google Scholar)."
The Bad News
As of today (Feb 4, 2015), Google does not provide an API for Google Scholar data. An issue ticket has been open since 2008:
Alternate Solution
- Use a web scraper, scholar.py (https://github.com/ckreibich/scholar.py): a Python module that implements a querier and parser for Google Scholar's output. It can also be invoked as a command-line tool.
Examples of Usage:
# Retrieve 1 article written by Einstein on quantum theory:
python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"

# Retrieve 1 article written by Einstein on quantum theory and save it to a text file:
python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory" > test.txt

# Retrieve 20 articles written by Einstein on quantum theory and save them to a text file:
python scholar.py -c 20 --author "albert einstein" --phrase "quantum theory" > test.txt

# Retrieve a BibTeX entry for that quantum theory paper:
python scholar.py -c 1 -C 8987828492054530436 --citation bt

# Retrieve five articles written by Einstein after 1970 where the title
# does not contain the words "quantum" and "theory":
python scholar.py -c 5 -a "albert einstein" -t --none "quantum theory" --after 1970
For a complete list of options, type
python scholar.py
This prints the following help text:

Options:
  -h, --help            show this help message and exit

  Query arguments:
    These options define search query arguments and parameters.

    -a AUTHORS, --author=AUTHORS
                        Author name(s)
    -A WORDS, --all=WORDS
                        Results must contain all of these words
    -s WORDS, --some=WORDS
                        Results must contain at least one of these words.
                        Pass arguments in form -s "foo bar baz" for simple
                        words, and -s "a phrase, another phrase" for phrases
    -n WORDS, --none=WORDS
                        Results must contain none of these words.
                        See -s|--some re. formatting
    -p PHRASE, --phrase=PHRASE
                        Results must contain exact phrase
    -t, --title-only    Search title only
    -P PUBLICATIONS, --pub=PUBLICATIONS
                        Results must have appeared in this publication
    --after=YEAR        Results must have appeared in or after given year
    --before=YEAR       Results must have appeared in or before given year
    --no-patents        Do not include patents in results
    --no-citations      Do not include citations in results
    -C CLUSTER_ID, --cluster-id=CLUSTER_ID
                        Do not search, just use articles in given cluster ID
    -c COUNT, --count=COUNT
                        Maximum number of results

  Output format:
    These options control the appearance of the results.

    --txt               Print article data in text format (default)
    --txt-globals       Like --txt, but first print global results too
    --csv               Print article data in CSV form (separator is "|")
    --csv-header        Like --csv, but print header with column names
    --citation=FORMAT   Print article details in standard citation format.
                        Argument Must be one of "bt" (BibTeX), "en" (EndNote),
                        "rm" (RefMan), or "rw" (RefWorks).

  Miscellaneous:
    --cookie-file=FILE  File to use for cookie storage. If given, will read
                        any existing cookies if found at startup, and save
                        resulting cookies in the end.
    -d, --debug         Enable verbose logging to stderr. Repeated options
                        increase detail of debug output.
    -v, --version       Show version information
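Since --csv and --csv-header use "|" as the separator, the output can be loaded with Python's standard csv module. The following is only a sketch: the function name parse_scholar_csv and the sample text (including its column names) are made up for illustration, and the real column names are whatever header line scholar.py prints.

```python
import csv
import io


def parse_scholar_csv(text):
    """Turn `--csv-header` style output (fields separated by "|") into dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="|",
        quoting=csv.QUOTE_NONE,  # treat quote characters in titles as plain text
    )
    return [dict(row) for row in reader]


# Made-up sample, NOT real scholar.py output; the first line plays the header.
sample = "title|year|num_citations\nAn example paper|1999|42\n"
rows = parse_scholar_csv(sample)
print(rows[0]["title"], rows[0]["year"])
```

All values come back as strings, and a title that itself contains "|" would be split into extra fields, so this is only suitable for a quick look at the results.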
Limitations
No more than 20 entries can be fetched per query (Google shows at most 20 results on one page).
It currently *only* processes the first results page. It is not a recursive crawler.
A closer look at how the query is constructed:
SCHOLAR_QUERY_URL = ScholarConf.SCHOLAR_SITE + '/scholar?' \
    + 'as_q=%(words)s' \
    + '&as_epq=%(phrase)s' \
    + '&as_oq=%(words_some)s' \
    + '&as_eq=%(words_none)s' \
    + '&as_occt=%(scope)s' \
    + '&as_sauthors=%(authors)s' \
    + '&as_publication=%(pub)s' \
    + '&as_ylo=%(ylo)s' \
    + '&as_yhi=%(yhi)s' \
    + '&as_sdt=%(patents)s%%2C5' \
    + '&as_vis=%(citations)s' \
    + '&btnG=&hl=en' \
    + '&num=%(num)s'
For example: http://scholar.google.com/scholar?&q=Dean+Karlan&hl=en&num=22
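To experiment with these parameters outside of scholar.py, a URL of the same shape can be assembled with the standard library. This is a rough sketch, not scholar.py's own code: build_query_url is a hypothetical helper, it covers only a subset of the as_* parameters from the template, and it only builds the URL without fetching anything.

```python
from urllib.parse import urlencode

# Same host as in the example URL; in scholar.py this is ScholarConf.SCHOLAR_SITE.
SCHOLAR_SITE = "http://scholar.google.com"


def build_query_url(words="", phrase="", authors="", after=None, before=None, num=20):
    """Assemble an advanced-search URL from a subset of the as_* parameters."""
    params = {
        "as_q": words,             # all of these words
        "as_epq": phrase,          # exact phrase
        "as_sauthors": authors,    # author name(s)
        "as_ylo": "" if after is None else after,    # earliest year
        "as_yhi": "" if before is None else before,  # latest year
        "hl": "en",
        "num": num,                # results per page
    }
    # urlencode escapes the values, e.g. spaces become "+".
    return SCHOLAR_SITE + "/scholar?" + urlencode(params)


url = build_query_url(phrase="quantum theory", authors="albert einstein", num=1)
print(url)
```

Using urlencode rather than manual string formatting avoids broken URLs when a search term contains characters such as "&" or "#".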
Potential fix for the code:
To get the next 20 results, append start=20 to the URL, for example:
http://scholar.google.com/scholar?start=20&q=Dean+Karlan&hl=en&num=20&as_sdt=0,7
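The paging idea could be sketched as follows. This is an assumption-laden illustration rather than a patch to scholar.py: page_urls is a hypothetical helper, it only generates the URLs (it does not fetch or parse them), and it assumes that Google honours start as a zero-based result offset as in the example URL.

```python
from urllib.parse import urlencode


def page_urls(query, total, page_size=20):
    """Yield one search URL per results page, advancing `start` by page_size."""
    for start in range(0, total, page_size):
        params = {"start": start, "q": query, "hl": "en", "num": page_size}
        yield "http://scholar.google.com/scholar?" + urlencode(params)


# Three pages of 20 results each: start=0, start=20, start=40.
urls = list(page_urls("Dean Karlan", 60))
for u in urls:
    print(u)
```

Each generated URL would still have to be fetched and passed through the parser, ideally with a pause between requests so as not to hammer the site.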
- Microsoft Academic Search does offer an API. You need to request a key, but other than that, it provides full programmatic access to what the application returns through the web interface.