User:Oren
- communicate via http://etherpad.wikimedia.org/DeploymentPrep
- http://www.mediawiki.org/wiki/Extension:MWSearch
Some Java-Related Puppet Definitions
Classes
- apt
- apt::clean-cache
apt::clean-cache
Variables
- $apt_clean_minutes*: cronjob minutes - default uses ip_to_cron from module "common"
- $apt_clean_hours*: cronjob hours - defaults to 0
- $apt_clean_mday*: cronjob monthday - default uses ip_to_cron from module "common"
Requires: module common (http://github.com/camptocamp/puppet-common)
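A usage sketch for apt::clean-cache, following the variables listed above (the variable names come from this list; the values are illustrative, not from prod):

```puppet
# Hypothetical node scope: run the clean-cache cron at 03:15,
# leaving the monthday to the ip_to_cron default from module "common".
$apt_clean_minutes = "15"
$apt_clean_hours   = "3"
include apt::clean-cache
```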
Definitions
- apt::conf
- apt::key
- apt::sources_list
apt::conf
apt::conf {"99unattended-upgrade":
  ensure  => present,
  content => "APT::Periodic::Unattended-Upgrade \"1\";\n",
}
apt::key
apt::key {"A37E4CF5":
  source => "http://dev.camptocamp.com/packages/debian/pub.key",
}
apt::sources_list
apt::sources_list {"camptocamp":
  ensure  => present,
  content => "deb http://dev.camptocamp.com/packages/ etch puppet",
}
Jenkins
class jenkins {
  include apt
  apt::key {"D50582E6":
    source => "http://pkg.jenkins-ci.org/debian/jenkins-ci.org.key",
  }
  apt::sources_list {"jenkins":
    ensure  => "present",
    content => "deb http://pkg.jenkins-ci.org/debian binary/",
    require => Apt::Key["D50582E6"],
  }
  package {"jenkins":
    ensure  => "installed",
    require => Apt::Sources_list["jenkins"],
  }
  service {"jenkins":
    enable     => true,
    ensure     => "running",
    hasrestart => true,
    require    => Package["jenkins"],
  }
}
Search setup
using Maven + Ant to do the build + sync
- Getting Windows, Eclipse, Ant and Rsync to Play Nicely Together
- append text to a file with Ant
- search and replace with Ant
- finding a string in a file via Ant
client wikis
- MWSearch - needs to be configured with the IPs of all searchers
- OpenSearchXml - not sure if it is significant
both
notpeter can supply Debian packages:
- java
- ant
- svn checkout of lucenesearch
A script is required to make a file like file:///home/wikipedia/common/pmtpa.dblist with the list of all the wikis that should be indexed, in the same format
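A sketch of what such a dblist-generating script could look like, assuming the full list of wiki db names is available somewhere as a flat file (the source file, its comment convention, and the function name are assumptions, not the prod setup):

```shell
#!/bin/sh
# Sketch: build a pmtpa.dblist-style file (one wiki db name per line)
# from a hypothetical source list. Strips comment and blank lines,
# de-duplicates, and sorts, which matches the flat dblist format.
make_dblist() {
    src="$1"   # hypothetical input, e.g. an all.dblist-style file
    dst="$2"   # e.g. /home/wikipedia/common/pmtpa.dblist
    grep -v -e '^#' -e '^$' "$src" | sort -u > "$dst"
}
```

In prod the input would instead come from whatever records which wikis should be indexed; the point is only that the output format is a plain sorted list of db names.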
deployment-searchidx
- currently uses 600 GB of storage and 48 GB of RAM
needs access to
- LocalSettings.php of each wiki being indexed
- scripts - migrate to Puppet (assigned to notpeter)
- https://gerrit.wikimedia.org/r/#patch,sidebyside,1825,1,files/lucene/lucene.jobs.sh - that is a script that I made based on the scripts that run as crons in prod
- there are a couple of others for doing things like building the .jar on the search indexer and pushing it out to the various search boxes
- there are also some start/stop scripts that I want to turn into an init script
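A minimal shape those start/stop scripts could collapse into as one init script; "lsearchd" and the echo stubs are placeholders for the real daemon and the existing scripts:

```shell
#!/bin/sh
# Hypothetical init-script skeleton wrapping the existing start/stop
# scripts behind a single start/stop/restart entry point.
DAEMON="lsearchd"

do_start() { echo "starting $DAEMON"; }  # stand-in for the start script
do_stop()  { echo "stopping $DAEMON"; }  # stand-in for the stop script

dispatch() {
    case "$1" in
        start)   do_start ;;
        stop)    do_stop ;;
        restart) do_stop && do_start ;;
        *)       echo "Usage: $0 {start|stop|restart}" >&2; return 1 ;;
    esac
}
# an actual init script would end with: dispatch "$1"
```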
deployment-searcher01
search boxes have 16 or 32 GB of RAM, and appear to use 50-100 GB of storage
Local.config
diff --git a/templates/lucene/lsearch.conf.erb b/templates/lucene/lsearch.conf.erb
index 9da2958..ad5b5f2 100644
--- a/templates/lucene/lsearch.conf.erb
+++ b/templates/lucene/lsearch.conf.erb
@@ -1,16 +1,22 @@
+######################################################
+##### THIS FILE IS MANAGED BY PUPPET ####
+##### puppet:///templates/search/lsearch.conf.erb ####
+######################################################
+
 # By default, will check /etc/mwsearch.conf
 
 ################################################
 # Global configuration
 ################################################
 
+### TODO: restructure so this doesn't depend on NFS
 # URL to global configuration, this is the shared main config file, it can
 # be on a NFS partition or available somewhere on the network
 MWConfig.global=file:///home/wikipedia/conf/lucene/lsearch-global-2.1.conf
 
 # Local path to root directory of indexes
 Indexes.path=/a/search/indexes
 
 # Path to rsync
 Rsync.path=/usr/bin/rsync
 
@@ -28,21 +34,21 @@
 Search.updateinterval=0.5
 
 # In seconds, delay after which the update will be fetched
 # used to scatter the updates around the hour
 Search.updatedelay=0
 
 # In seconds, how frequently the dead search nodes should be checked
 Search.checkinterval=30
 
 # Disable wordnet aliases
-Search.disablewordnet=true
+Search.disablewordnet=<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>true<% else -%>false<% end -%>
 
 ################################################
 # Indexer related configuration
 ################################################
 
 # In minutes, how frequently is a clean snapshot of index created
 # 2880 = two days
 Index.snapshotinterval=2880
 
 # Daemon type (http is started by default)
@@ -50,41 +56,51 @@
 
 # Port of daemon (default is 8321)
 #Index.port=8080
 
 # Maximal queue size after which index is being updated
 Index.maxqueuecount=5000
 
 # Maximal time an update can remain in queue before being processed (in seconds)
 Index.maxqueuetimeout=120
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
 Index.delsnapshots=true
+<% end -%>
 
 ################################################
 # Log, ganglia, localization
 ################################################
 
 SearcherPool.size=6
 
+### TODO: restructure so this doesn't depend on NFS
 # URL to message files, {0} is replaced with language code, i.e. En
 Localization.url=file:///home/wikipedia/common/php/languages/messages
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
 # Pattern for OAI repo. {0} is replaced with dbname, {1} with language
 #OAI.repo=http://{1}.wikipedia.org/wiki/Special:OAIRepository
 OAI.username=lsearch2
 OAI.password=<%= lucene_oai_pass %>
+<% end -%>
 # Max queue size on remote indexer after which we wait a bit
 OAI.maxqueue=5000
 
 # Number of docs to buffer before sending to inc updater
 OAI.bufferdocs=500
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
+# UDP Logger config
+UDPLogger.port=51234
+UDPLogger.host=208.80.152.184
+<% end -%>
 
 # RecentUpdateDaemon udp and tcp ports
 #RecentUpdateDaemon.udp=8111
 #RecentUpdateDaemon.tcp=8112
 # Hot spare
 #RecentUpdateDaemon.hostspareHost=vega
 #RecentUpdateDaemon.hostspareUdpPort=8111
 #RecentUpdateDaemon.hostspareTcpPort=8112
 
 # Log configuration
Global.Config
http://noc.wikimedia.org/conf/lsearch-global-2.1.conf
# Logical structure, maps different roles to certain db
[Database]
{file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
enwiki : (nssplit,2)
enwiki : (nspart1,[0],true,20,500,2)
enwiki : (nspart2,[],true,20,500)
enwiki : (spell,40,10) (warmup,500)
mediawikiwiki, metawiki, commonswiki, strategywiki : (language,en)
commonswiki : (nssplit,2) (nspart1,[6]) (nspart2,[])
dewiki, frwiki : (spell,20,5)
dewiki, frwiki, itwiki, ptwiki, jawiki, plwiki, nlwiki, ruwiki, svwiki, zhwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[]) (warmup,0)
[Database-Group]
<all> : (titles_by_suffix,2) (tspart1,[ wiki|w ]) (tspart2,[ wiktionary|wikt, wikibooks|b, wikinews|n, wikiquote|q, wikisource|s, wikiversity|v])
sv-titles: (titles_by_suffix,2) (tspart1,[ svwiki|w ]) (tspart2,[ svwiktionary|wikt, svwikibooks|b, svwikinews|n, svwikiquote|q, svwikisource|src])
mw-titles: (titles_by_suffix,1) (tspart1, [ mediawikiwiki|mw, metawiki|meta ])
# Search hosts layout
[Search-Group]
# search 1 (enwiki)
search1: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search2: enwiki.nspart1.sub1.hl enwiki.spell #enwiki.nspart1.sub2.hl
search3: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search4: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search5: enwiki.nspart1.sub2.hl enwiki.spell #enwiki.nspart1.sub1.hl
search8: enwiki.prefix #enwiki.spell
search9: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search12: enwiki.spell
search13: enwiki.nspart2*
# disable en-titles using a non-existent hostname ending in "x"
search13x: en-titles*
search14: enwiki.nspart1.sub1.hl
search19: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
search20: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# search 2 (de,fr,jawiki)
search6: dewiki.nspart1 dewiki.nspart2 frwiki.nspart1 frwiki.nspart2 jawiki.nspart1 jawiki.nspart2
search6: itwiki.nspart1.hl
search15: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search16: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search17: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
# search 3 (it,nl,ru,sv,pl,pt,es,zhwiki)
search14: eswiki
#search20: eswiki
search7: itwiki.nspart1 ruwiki.nspart1 nlwiki.nspart1 svwiki.nspart1 plwiki.nspart1 ptwiki.nspart1 zhwiki.nspart1
#search7: itwiki.nspart1 itwiki.nspart2 nlwiki.nspart1 nlwiki.nspart2 ruwiki.nspart1 ruwiki.nspart2 svwiki.nspart1
#search9: svwiki.nspart2 plwiki.nspart1 plwiki.nspart2 ptwiki.nspart1 ptwiki.nspart2 zhwiki.nspart1 zhwiki.nspart2
search15: itwiki.nspart2 nlwiki.nspart2 ruwiki.nspart2 svwiki.nspart2 plwiki.nspart2 ptwiki.nspart2 zhwiki.nspart2
search15: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
#search15: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search15: ptwiki.nspart1.hl ptwiki.nspart2.hl
search16: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search16: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search16: ptwiki.nspart1.hl ptwiki.nspart2.hl
search17: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search17: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search17: ptwiki.nspart1.hl ptwiki.nspart2.hl
# search 2-3 interwiki/spellchecks
# disable titles by using a non-existent hostname ending in "x"
search10x: de-titles* ja-titles* it-titles* nl-titles* ru-titles* fr-titles*
search10x: sv-titles* pl-titles* pt-titles* es-titles* zh-titles*
search10: dewiki.spell frwiki.spell itwiki.spell nlwiki.spell ruwiki.spell
search10: svwiki.spell plwiki.spell ptwiki.spell eswiki.spell
# search 4
# disable spell/hl by using a non-existent hostname ending in "x"
search11x: commonswiki.spell commonswiki.nspart1.hl commonswiki.nspart1 commonswiki.nspart2.hl commonswiki.nspart2
search11: commonswiki.nspart1 commonswiki.nspart1.hl commonswiki.nspart2.hl
search11: commonswiki.nspart2
search11: *?
# disable tspart by using a non-existent hostname ending in "x"
search11x: *tspart1 *tspart2
search19: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.))*.spell
search12: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.|jawiki.|zhwiki.))*.hl
# prefix stuffs
search18: *.prefix
# stuffs to deploy in future
searchNone: *.related jawiki.nspart1.hl jawiki.nspart2.hl zhwiki.nspart1.hl zhwiki.nspart2.hl
searchNone: enwiki.spell enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# Indexers
[Index]
searchidx2: *
# Rsync path where indexes are on hosts, after default value put
# hosts where the location differs
# Syntax: host : <path>
[Index-Path]
<default> : /search
[OAI]
simplewiki : http://simple.wikipedia.org/w/index.php
rswikimedia : http://rs.wikimedia.org/w/index.php
ilwikimedia : http://il.wikimedia.org/w/index.php
nzwikimedia : http://nz.wikimedia.org/w/index.php
sewikimedia : http://se.wikimedia.org/w/index.php
alswiki : http://als.wikipedia.org/w/index.php
alswikibooks : http://als.wikibooks.org/w/index.php
alswikiquote : http://als.wikibooks.org/w/index.php
alswiktionary : http://als.wiktionary.org/w/index.php
chwikimedia : http://www.wikimedia.ch/w/index.php
crhwiki : http://chr.wikipedia.org/w/index.php
roa_rupwiki : http://roa-rup.wikipedia.org/w/index.php
roa_rupwiktionary : http://roa-rup.wiktionary.org/w/index.php
be_x_oldwiki : http://be-x-old.wikipedia.org/w/index.php
ukwikimedia : http://uk.wikimedia.org/w/index.php
brwikimedia : http://br.wikimedia.org/w/index.php
dkwikimedia : http://dk.wikimedia.org/w/index.php
trwikimedia : http://tr.wikimedia.org/w/index.php
arwikimedia : http://ar.wikimedia.org/w/index.php
mxwikimedia : http://mx.wikimedia.org/w/index.php
commonswiki: http://commons.wikimedia.org/w/index.php
[Namespace-Boost]
commonswiki : (0, 1) (6, 4)
<default> : (0, 1) (1, 0.0005) (2, 0.005) (3, 0.001) (4, 0.01), (6, 0.02), (8, 0.005), (10, 0.0005), (12, 0.01), (14, 0.02)
# Global properties
[Properties]
# suffixes to database name, the rest is assumed to be language code
Database.suffix=wiki wiktionary wikiquote wikibooks wikisource wikinews wikiversity wikimedia
# Allow only up to 500 results per page
Search.maxlimit=501
# Age scaling based on last edit, default is no scaling
# Below are suffixes (or whole names) with various scaling strength
AgeScaling.strong=wikinews
AgeScaling.medium=mediawikiwiki metawiki
#AgeScaling.weak=wiki
# Use additional per-article ranking data, more suitable for non-encyclopedias
AdditionalRank.suffix=mediawikiwiki metawiki
# suffix for databases that should also have exact-case index built
# note: this will also turn off stemming!
ExactCase.suffix=wiktionary jbowiki
# wmf-style init file, attempt to read OAI and lang info from it
# for sample see http://noc.wikimedia.org/conf/InitialiseSettings.php.html
#WMF.InitialiseSettings=file:///home/wikipedia/common/php-1.5/InitialiseSettings.php
#WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-deployment/wmf-config/InitialiseSettings.php
WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-config/InitialiseSettings.php
# Where common images are
Commons.wiki=commonswiki.nspart1
# Syntax: <prefix_name> : <comma separated list of namespaces>
# <all> is a special keyword meaning all namespaces
# E.g. all_talk : 1,3,5,7,9,11,13,15
[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15
[100] : 100
[101] : 101
[104] : 104
[105] : 105
[106] : 106
[0,6,12,14,100,106]: 0,6,12,14,100,106
[0,100,104] : 0,100,104
[0,2,4,12,14] : 0,2,4,12,14
[0,14] : 0,14
[4,12] : 4,12
todo
- what we would need for the indexer to work for the long term is some script that makes a file like file:///home/wikipedia/common/pmtpa.dblist with the list of all the wikis that should be indexed, in the same format
- push search resources into an Artifactory repository
- LocalSettings.php
- dump.bz2
repository layout
in /org/wikimedia/labs/
- bastion
- search
- search-test
- deployment-prep
- deployment-sql
- deployment-squid
- deployment-dbdump
- deployment-nfs-memc
- deployment-web
- deployment-indexer
- deployment-searcher
murder
https://github.com/lg/murder/blob/master/README.md
OAIRepository testing and what it does
here is a transcript of brion on OAIRepository
pop over to say https://en.wikipedia.org/wiki/Special:OAIRepository
at the HTTP auth prompt use user 'testing', pass 'mctest'. (Is this a public login? I think we should suppress it from the etherpad - if it were public, why would there be a login?)
see http://www.openarchives.org/OAI/openarchivesprotocol.html for general protocol documentation
to install locally... in theory:
make sure you've got OAI extension dir in place
and do the usual require "$IP/extensions/OAI/OAIRepo.php";
you'll only need the repository half
run maintenance/update.php to make sure it installs its tables...
which i think should work
as pages get edited/created/deleted, it'll internally record things into its table, and those records can get read out through the Special:OAIRepository iface. iirc it records the page id (?), possibly a rev id, and a created/edited/deleted state flag; then the interface slurps out current page content at request time.

it's meant to give you current versions of stuff, rather than to show you every individual change (eg, potentially multiple changes since your last query will be "rolled up" into one, and you just download the entire page text as of the last change). or if it's deleted, you get a marker indicating the page was deleted.

it's relatively straightforward, but doesn't always map to what folks want :) for search index updates it's good enough... as long as you're working with source, all you probably need is the 'ListRecords' verb.

here is a url to test OAI for search: https://en.wikipedia.org/wiki/Special:OAIRepository?verb=ListRecords&metadataPrefix=lsearch