User:DCausse/Term Stats With Cirrus Dump

Dump data

It is now possible to dump the content of a cirrus index.

Data can be dumped with the dumpIndex maintenance script. For example, on deployment-bastion.deployment-prep.eqiad.wmflabs you can use the following commands to dump the simplewiki content and general indices:

mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki simplewiki --indexType general | gzip -c > dump-simplewiki-general.gz
mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki simplewiki --indexType content | gzip -c > dump-simplewiki-content.gz
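
The dumps are written in Elasticsearch bulk format, i.e. two lines per document: an action line followed by the document source (the import step below relies on this). You can peek at the first records to check that a dump looks sane:

# first two documents (four lines) of the content dump
zcat dump-simplewiki-content.gz | head -n 4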

In order to rebuild the index locally you will also need to dump the mappings and the settings:

curl http://deployment-elastic06:9200/simplewiki_content_first/_mapping/ > simplewiki_content_mapping.json
curl http://deployment-elastic06:9200/simplewiki_general_first/_mapping/ > simplewiki_general_mapping.json

curl http://deployment-elastic06:9200/simplewiki_general_first/_settings/ > simplewiki_general_settings.json
curl http://deployment-elastic06:9200/simplewiki_content_first/_settings/ > simplewiki_content_settings.json

Import data

On another host or your local machine you can recreate the same index:

You need to install the proper Elasticsearch plugins (see the install sketch after the list):

  • analysis-icu
  • experimental-highlighter-elasticsearch-plugin
  • extra
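
For an Elasticsearch 1.x install these are added with the bin/plugin script. This is a rough sketch only; the plugin coordinates and version numbers below are examples, check each plugin's README for the exact ones matching your Elasticsearch version:

# example coordinates/versions only, adjust to your ES version
bin/plugin --install elasticsearch/elasticsearch-analysis-icu/2.7.0
bin/plugin --install org.wikimedia.search.highlighter/experimental-highlighter-elasticsearch-plugin/0.0.11
bin/plugin --install org.wikimedia.search/extra/0.0.1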

Create the index with the same settings as the original (you need jq and curl installed):

jq -c '.simplewiki_content_first' < simplewiki_content_settings.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test' --data @-

And the mappings:

jq -c '{"page": .simplewiki_content_first.mappings.page}' < simplewiki_content_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test/_mapping/page' --data @-
jq -c '{"namespace": .simplewiki_content_first.mappings.namespace}' < simplewiki_content_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test/_mapping/namespace' --data @-

Import the data (you need GNU parallel installed):

zcat dump-simplewiki-content.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/simplewiki_content_test/_bulk --data-binary @- > /dev/null'
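
Once the import has finished you can refresh the index and sanity-check the document count (plain Elasticsearch APIs, run against the test index created above):

curl -XPOST 'http://localhost:9200/simplewiki_content_test/_refresh'
curl 'http://localhost:9200/simplewiki_content_test/_count?pretty'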

You can follow the same steps to import the general index by replacing all references to simplewiki_content with simplewiki_general.
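
Concretely, that means repeating the create, mapping and import commands with the general files and a general test index, keeping whichever mapping types actually appear in simplewiki_general_mapping.json:

jq -c '.simplewiki_general_first' < simplewiki_general_settings.json | curl -XPUT 'http://localhost:9200/simplewiki_general_test' --data @-
jq -c '{"page": .simplewiki_general_first.mappings.page}' < simplewiki_general_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_general_test/_mapping/page' --data @-
jq -c '{"namespace": .simplewiki_general_first.mappings.namespace}' < simplewiki_general_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_general_test/_mapping/namespace' --data @-
zcat dump-simplewiki-general.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/simplewiki_general_test/_bulk --data-binary @- > /dev/null'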

Dump term stats

To dump term stats I use a plugin named elasticsearch-index-termlist. The upstream plugin does not support ES 1.6 and has problems with some fields, so use my fork: https://github.com/nomoa/elasticsearch-index-termlist . If you just need the jar built for ES 1.6, grab it here: https://drive.google.com/file/d/0Bzo2vOqfrXhJZU1DTDdyanRkQUU/view?usp=sharing

Once the plugin is installed you can extract term stats with:

curl -XGET 'http://localhost:9200/simplewiki_content_test/_termlist?field=title.prefix'  > terms_title_prefix.json

TODO: add a list of the available fields.
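
In the meantime you can enumerate candidate fields from the mapping dumped earlier with jq. The first command lists the top-level properties of the page mapping; the second lists the sub-fields of title (such as title.prefix), assuming they are declared under the field's "fields" entry:

jq -r '.simplewiki_content_first.mappings.page.properties | keys[]' < simplewiki_content_mapping.json
jq -r '.simplewiki_content_first.mappings.page.properties.title.fields // {} | keys[]' < simplewiki_content_mapping.json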

Convert json to CSV

jq -r '.terms[] | [.term,.totalfreq,.docfreq] | @csv' < terms_title_prefix.json  > terms_title_prefix.csv

Now you can use R to inspect the data:

# install.packages("stringi")
# install.packages("data.table")

library(stringi)
library(data.table)

# Read the CSV: V1 = the term, V2 = totalFreq (always 0 for this field), V3 = docFreq
dat <- read.csv('/plat/cirrus-dump/terms_title_prefix.csv', header=FALSE, sep=",");

# calculate the prefix length
dat$tlength <- stri_length(dat$V1);

# Reorder the dataframe on prefix length and term freq
dat <- dat[order(dat$tlength, -dat$V3),]

# Create a data.table with prefixes of length from 1 to 10
all <- as.data.table(dat[dat$tlength<=10,]);
# Keep only data that have less than 10 chars and have doc freq > 10
highfreqs <- as.data.table(dat[dat$tlength<10 & dat$V3>10,])
# Keep only data that have less than 10 chars and have doc freq <= 10
lowfreqs <- as.data.table(dat[dat$tlength<10 & dat$V3<=10,])

# Rank in each term on freq grouped by length
all[,order:=rank(-V3,ties.method="first"),by=tlength]
highfreqs[,order:=rank(-V3,ties.method="first"),by=tlength]
lowfreqs[,order:=rank(-V3,ties.method="first"),by=tlength]

# Get the number of terms by length
counts <- table(all$tlength);

highfreqsCounts <- table(highfreqs$tlength);
lowfreqsCounts <- table(lowfreqs$tlength);

# number of terms by length that have doc freq > 10
plot(highfreqsCounts)

# number of terms by length that have doc freq <= 10
plot(lowfreqsCounts)

# Rough freq distribution by term length
# each dot is a term; terms of length 1 are spread over x in (1,2], terms of length 2 over (2,3], ...
plot(all$tlength + (all$order / counts[all$tlength]), all$V3, log="y", cex=0.002)