Data Platform/Data Lake/Traffic/Pageviews/Redirects

From Wikitech

This page discusses the following question: What happens when a request goes through redirects like

/wiki/Something Notable
/wiki/Something%20Notable
/wiki/Something_Notable

or

/index.php?title=Something Notable
/index.php?title=Something%20Notable
/index.php?title=Something_Notable

And how do we handle pageview identification and counting on those requests?

Types of redirects

The examples above show 2 kinds of redirects, but there are others. Here is a list of the possible cases:

Direct correct request

This is not a redirect, but it serves as a baseline to compare with the other examples. The browser sends a request for, say, Something_Notable, and Varnish responds with a 200. The cluster recognizes this as a pageview.

URI encodings performed by the browser

The browser applies these encodings before sending the request. For example, Something Notable becomes Something%20Notable, and "Awesome" becomes %22Awesome%22. They have no effect on the pageview computation, because the PageviewDefinition UDFs support both representations and ultimately normalize them.
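As an illustration of why both representations end up equivalent, here is a minimal sketch of that kind of normalization. It is not the actual PageviewDefinition code; the function name normalize_title is hypothetical, and the sketch assumes normalization amounts to percent-decoding plus converting spaces to underscores.

```python
from urllib.parse import unquote


def normalize_title(raw_title: str) -> str:
    """Percent-decode a requested title and replace spaces with underscores.

    Hypothetical helper for illustration only; the real UDFs may apply
    additional rules.
    """
    return unquote(raw_title).replace(" ", "_")


# Both the raw and the percent-encoded form map to the same title.
encoded = normalize_title("Something%20Notable")
raw = normalize_title("Something Notable")
quoted = normalize_title("%22Awesome%22")
```

With this sketch, encoded and raw are both Something_Notable, and quoted is "Awesome" including the double quotes.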

Capitalization of the first letter

Whenever a request is sent with a lower-case first letter, the response is a 301 whose target is the article with a capitalized first letter. The browser then sends a second request to the new target, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so it only counts the second request, to the correct page, as a pageview.

Conversion of spaces

Conversion of spaces (%20) to underscores is the same case as first-letter capitalization. Whenever a request is sent with spaces (%20) between words, the response is a 301 whose target is the article with underscores instead of spaces (%20). The browser then sends a second request to the new target, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so it only counts the second request, to the correct page, as a pageview.
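The two 301 cases above can be sketched as a single function that returns the redirect target when the requested title is not in canonical form. This is a simplified model under stated assumptions: the name canonical_redirect_target is hypothetical, and real title normalization on the server applies more rules than the two described here.

```python
from typing import Optional
from urllib.parse import unquote


def canonical_redirect_target(requested_title: str) -> Optional[str]:
    """Return the title a 301 would point to, or None if no 301 is needed.

    Models only the two rules described above: spaces become underscores
    and the first letter is capitalized.
    """
    decoded = unquote(requested_title)
    canonical = decoded.replace(" ", "_")
    if canonical:
        canonical = canonical[0].upper() + canonical[1:]
    return canonical if canonical != decoded else None


lower_case = canonical_redirect_target("something_Notable")
with_spaces = canonical_redirect_target("Something%20Notable")
already_canonical = canonical_redirect_target("Something_Notable")
```

In this model, the first two requests would receive a 301 to Something_Notable, and the third would be served directly.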

Other spellings covered by a redirect page

For any other spelling (alternate spellings, misspellings, abbreviations, translations, capitalizations, plural vs. singular, etc.) that has a page in the corresponding project acting as a hard redirect (its contents start with #REDIRECT [[<target>]]), Varnish or the server side handles the request and returns a 200 response with the contents of the target page, plus a small redirect note like "(Redirected from ...)". However, Varnish logs the redirect URL (before conversion). This is the only potentially problematic scenario, because the cluster computes a pageview for the redirect page even though the contents shown to the user are those of the target page. Nevertheless, it computes only 1 pageview; there are no duplicates.

Alternate spellings NOT covered by a redirect page

If no page exists that covers the requested spelling, Varnish or the server returns a 404, so no pageview is computed for that request.
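To tie the cases together, here is a small simulation of how a request log would be counted. It is a sketch only: it assumes a request counts as a pageview when its status is 200, whereas the real PageviewDefinition checks more conditions than the status code. The log entries, including the misspelled redirect page Something_Notible, are made up for illustration.

```python
from collections import Counter
from urllib.parse import unquote


def count_pageviews(log_entries):
    """Count requests with status 200, keyed by the normalized requested title."""
    counts = Counter()
    for title, status in log_entries:
        if status == 200:
            counts[unquote(title).replace(" ", "_")] += 1
    return counts


log = [
    ("Something_Notable", 200),    # direct correct request
    ("something_Notable", 301),    # capitalization redirect, not counted
    ("Something_Notable", 200),    # follow-up request after the 301
    ("Something%20Notable", 301),  # space conversion redirect, not counted
    ("Something_Notable", 200),    # follow-up request after the 301
    ("Something_Notible", 200),    # hard redirect page, logged under its own title
    ("Something_Unknown", 404),    # no page covers this spelling
]

counts = count_pageviews(log)
```

Each user visit yields exactly one counted pageview. The only surprise is the hard redirect, which is counted under Something_Notible rather than under the page whose contents were shown.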

Potential problems

Per article analyses

The only redirect scenario that can produce confusing (or arguably wrong) results is the alternate spelling covered by a redirect page. These requests do not alter global counts or per-project counts, but they do alter per-article analyses. For example, in the per-article endpoint of the Pageview API, the page "Barack_Obama":

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_Obama/daily/2016010100/2016010100

returns 26166 pageviews, whereas its redirect page "Barack_obama" (note the lower-case 'o'):

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_obama/daily/2016010100/2016010100

returns 30415 pageviews for the same period. All the users who generated these 30415 pageviews actually read the contents of Barack_Obama (capital 'O'), but we count them under Barack_obama (lower-case 'o'). The research paper mentioned by Aaron:

https://mako.cc/academic/hill_shaw-consider_the_redirect.pdf

suggests that 55% of the articles in the main namespace are redirects to other pages, so this is surely not a small proportion of pageviews or articles.
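To compare a page with one of its redirect pages, the per-article URL can be built from its parts, changing only the article segment. The sketch below follows the URL pattern of the example above; the function name per_article_url and its parameter names are hypothetical, and it only builds the URL without sending a request.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article Pageview API URL from its path segments."""
    # Percent-encode the article title so it is safe as a single path segment.
    segments = [project, access, agent, quote(article, safe=""),
                granularity, start, end]
    return BASE + "/" + "/".join(segments)


target_url = per_article_url("en.wikipedia", "Barack_Obama",
                             "2016010100", "2016010100")
redirect_url = per_article_url("en.wikipedia", "Barack_obama",
                               "2016010100", "2016010100")
```

The two URLs differ only in the capitalization of the article segment, yet the API reports separate counts for them.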

Possible solutions?

X-Analytics

Add a redirectedTo field to the X-Analytics header that holds the target URL of the redirect, and let the PageviewDefinition take the page title from the redirectedTo field whenever it is not empty. Note: if the request to the redirect page has ?redirect=no, the redirectedTo field should be left blank.
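A minimal sketch of how the proposal could work on the consuming side. It assumes the X-Analytics header is a semicolon-separated list of key=value pairs, and, for simplicity, that the proposed redirectedTo field carries the target title directly rather than a full URL. The function names and the example header values are hypothetical.

```python
def parse_x_analytics(header: str) -> dict:
    """Parse a 'key=value;key=value' header string into a dict."""
    fields = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def pageview_title(request_title: str, x_analytics: str) -> str:
    """Prefer the proposed redirectedTo field; fall back to the requested title."""
    redirected_to = parse_x_analytics(x_analytics).get("redirectedTo", "")
    return redirected_to or request_title


# Hard redirect: the pageview is attributed to the target page.
followed = pageview_title("Barack_obama", "ns=0;redirectedTo=Barack_Obama")
# Request with ?redirect=no: the field is blank, so the redirect page is kept.
not_followed = pageview_title("Barack_obama", "ns=0;redirectedTo=")
# Header without the field: behaviour is unchanged.
unchanged = pageview_title("Barack_Obama", "ns=0")
```

Falling back to the requested title when the field is missing or blank keeps existing counting behaviour intact for every request that is not a hard redirect.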