Data Platform/Internal API requests

From Wikitech
This page documents how to query the MediaWiki Action API, the MediaWiki REST API, and the Wikimedia REST API internally from R and Python, rather than sending requests over the Internet. The code examples here were tested on stat1010.eqiad.wmnet.

Both the R and Python approaches assume that the HTTPS_PROXY, https_proxy, NO_PROXY, and no_proxy environment variables are already set. If they have been unset, refer to HTTP proxy for how to set them manually.
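Before running the examples, it can help to confirm that the four variables are actually present in the current session. The following is a minimal sketch; the helper name missing_proxy_vars is hypothetical, and the check only tests whether each variable is non-empty, not whether its value is correct.

```python
import os

# Variable names taken from the paragraph above.
PROXY_VARS = ("HTTPS_PROXY", "https_proxy", "NO_PROXY", "no_proxy")


def missing_proxy_vars(environ=os.environ):
    """Return the names of expected proxy variables that are unset or empty."""
    return [name for name in PROXY_VARS if not environ.get(name)]


missing = missing_proxy_vars()
if missing:
    print("Unset proxy variables:", ", ".join(missing))
else:
    print("All proxy variables are set.")
```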

Python

Fixing SSL certificate verification error

To avoid

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate

when running the code from a virtual environment (e.g. conda-analytics environment in JupyterHub on stat hosts), use:

import os

os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'

Thank you to Ben Tullis for figuring this out.[1]
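If the variable might already point at a custom bundle, a slightly more defensive variant sets it only when it is missing. This is a sketch under that assumption; the helper name use_system_ca_bundle is hypothetical, and it should be called before the first request is sent.

```python
import os

# Path taken from the snippet above.
SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def use_system_ca_bundle(environ=os.environ, bundle=SYSTEM_CA_BUNDLE):
    """Set REQUESTS_CA_BUNDLE to the system bundle unless it is already set.

    Returns the value of REQUESTS_CA_BUNDLE after the call.
    """
    return environ.setdefault("REQUESTS_CA_BUNDLE", bundle)


use_system_ca_bundle()
```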


Using requests

With the requests library:

import requests

url = 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php'

headers = {'Host': 'en.wikipedia.org'}

payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}

resp = requests.get(url, headers=headers, params=payload).json()
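In the example above, the URL always points at the internal endpoint, while the Host header names the wiki being queried. The sketch below wraps that pattern in a small helper so the same query can be built for a list of titles; the function name build_info_query is hypothetical, and using it with wikis other than en.wikipedia.org is an assumption that has not been tested here.

```python
# Internal read-only endpoint, taken from the example above.
INTERNAL_API_URL = "https://mw-api-int-ro.discovery.wmnet:4446/w/api.php"


def build_info_query(domain, titles):
    """Return (url, headers, params) for a prop=info query against one wiki.

    `domain` is sent as the Host header; `titles` is a list of page titles
    that the Action API expects joined with '|'.
    """
    headers = {"Host": domain}
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),
        "format": "json",
    }
    return INTERNAL_API_URL, headers, params


url, headers, params = build_info_query(
    "en.wikipedia.org",
    ["R_(programming_language)", "Python_(programming_language)"],
)
```

The three values can then be passed to requests.get(url, headers=headers, params=params) exactly as in the example above.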

Using mwapi

With the mwapi library (which also requires the REQUESTS_CA_BUNDLE environment variable to be set):

import mwapi

session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
session.headers['Host'] = 'en.wikipedia.org'

resp = session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)'
)

Thank you to Lucas Werkmeister for figuring this out.[2]

DataFrame from API response

To convert the response into a data frame, we can use DataFrame.from_dict from pandas:

import pandas as pd

page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')
        pageid  ns  title                          contentmodel  pagelanguage  pagelanguagehtmlcode  pagelanguagedir  touched               lastrevid   length
23862   23862   0   Python (programming language)  wikitext      en            en                    ltr              2022-05-17T15:11:33Z  1088356878  146500
376707  376707  0   R (programming language)       wikitext      en            en                    ltr              2022-05-16T17:53:29Z  1087609113  59925
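The orient='index' argument is needed because the pages object in the response is keyed by page ID, with one nested record per page. The sketch below illustrates that shape with a trimmed-down stand-in for resp['query']['pages'] (only four of the fields, with values copied from the output above) and flattens it to a list of rows without pandas.

```python
# Trimmed-down stand-in for resp['query']['pages']; values copied from the output above.
pages = {
    "23862": {
        "pageid": 23862,
        "ns": 0,
        "title": "Python (programming language)",
        "length": 146500,
    },
    "376707": {
        "pageid": 376707,
        "ns": 0,
        "title": "R (programming language)",
        "length": 59925,
    },
}

# One row per page: the outer keys (page IDs) are dropped, the inner records are kept.
rows = list(pages.values())

for row in rows:
    print(row["pageid"], row["title"], row["length"])
```

Passing rows to pd.DataFrame(rows) would produce the same columns, with a default integer index instead of the page IDs.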

R

Using the httr2 package:

library(httr2)

req <- request("https://mw-api-int-ro.discovery.wmnet:4446/w/api.php") %>%
    req_headers("Host" = "en.wikipedia.org")

req <- req %>%
    req_url_query(
        action = "query",
        prop = "info",
        titles = "R_(programming_language)|Python_(programming_language)",
        format = "json"
    )

# Work around the error "SSL certificate problem: unable to get local issuer certificate"
# by disabling peer certificate verification:
req <- req %>%
    req_options(ssl_verifypeer = 0)

# Perform the request:
resp <- req %>%
    req_perform() %>%
    resp_body_json()

To convert the response into a nice data frame we can use map_dfr from purrr and as_tibble from tibble:

library(tidyverse)

page_info <- resp$query$pages %>%
    map_dfr(as_tibble)
A tibble: 2 × 10
pageid  ns     title                          contentmodel  pagelanguage  pagelanguagehtmlcode  pagelanguagedir  touched               lastrevid   length
<int>   <int>  <chr>                          <chr>         <chr>         <chr>                 <chr>            <chr>                 <int>       <int>
23862   0      Python (programming language)  wikitext      en            en                    ltr              2022-05-17T15:11:33Z  1088356878  146500
376707  0      R (programming language)       wikitext      en            en                    ltr              2022-05-16T17:53:29Z  1087609113  59925

References

  1. T361024#9662135
  2. https://github.com/mediawiki-utilities/python-mwapi/issues/45