User:Oren
- communicate via http://etherpad.wikimedia.org/DeploymentPrep
- http://www.mediawiki.org/wiki/Extension:MWSearch
Some Java-Related Puppet Definitions
Classes
- apt
- apt::clean-cache
apt::clean-cache
Variables
- $apt_clean_minutes*: cronjob minutes - default uses ip_to_cron from module "common"
- $apt_clean_hours*: cronjob hours - defaults to 0
- $apt_clean_mday*: cronjob monthday - default uses ip_to_cron from module "common"
Requires: module common (http://github.com/camptocamp/puppet-common)
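A usage sketch for apt::clean-cache, following the variables listed above (the variable names come from this list; the values are illustrative, not from prod):

```puppet
# Hypothetical node scope: run the clean-cache cron at 03:15,
# leaving the monthday to the ip_to_cron default from module "common".
$apt_clean_minutes = "15"
$apt_clean_hours   = "3"
include apt::clean-cache
```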
Definitions
- apt::conf
- apt::key
- apt::sources_list
apt::conf
apt::conf {"99unattended-upgrade":
  ensure  => present,
  content => "APT::Periodic::Unattended-Upgrade \"1\";\n",
}
apt::key
apt::key {"A37E4CF5":
  source => "http://dev.camptocamp.com/packages/debian/pub.key",
}
apt::sources_list
apt::sources_list {"camptocamp":
  ensure  => present,
  content => "deb http://dev.camptocamp.com/packages/ etch puppet",
}
Jenkins
class jenkins {
  include apt
  apt::key {"D50582E6":
    source => "http://pkg.jenkins-ci.org/debian/jenkins-ci.org.key",
  }
  apt::sources_list {"jenkins":
    ensure  => "present",
    content => "deb http://pkg.jenkins-ci.org/debian binary/",
    require => Apt::Key["D50582E6"],
  }
  package {"jenkins":
    ensure  => "installed",
    require => Apt::Sources_list["jenkins"],
  }
  service {"jenkins":
    enable     => true,
    ensure     => "running",
    hasrestart => true,
    require    => Package["jenkins"],
  }
}
Search setup
using Maven + Ant to do the build + sync
- Getting Windows, Eclipse, Ant and Rsync to Play Nicely Together
- append text to a file with Ant
- search and replace with Ant
- finding a string in a file via Ant
client wikis
- MWSearch - needs to be configured with the IPs of all searchers
- OpenSearchXml - not sure if it is significant
both
notpeter can supply Debian packages:
- java
- ant
- svn checkout of lucenesearch
A script is required to make a file like file:///home/wikipedia/common/pmtpa.dblist with the list of all the wikis that should be indexed, in the same format
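A sketch of what such a dblist-generating script could look like, assuming the full list of wiki db names is available somewhere as a flat file (the source file, its comment convention, and the function name are assumptions, not the prod setup):

```shell
#!/bin/sh
# Sketch: build a pmtpa.dblist-style file (one wiki db name per line)
# from a hypothetical source list. Strips comment and blank lines,
# de-duplicates, and sorts, which matches the flat dblist format.
make_dblist() {
    src="$1"   # hypothetical input, e.g. an all.dblist-style file
    dst="$2"   # e.g. /home/wikipedia/common/pmtpa.dblist
    grep -v -e '^#' -e '^$' "$src" | sort -u > "$dst"
}
```

In prod the input would instead come from whatever records which wikis should be indexed; the point is only that the output format is a plain sorted list of db names.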
deployment-searchidx
- currently uses 600 GB of storage and 48 GB of RAM
needs access to
- LocalSettings.php of each wiki being indexed
- scripts - migrate to Puppet (assigned to notpeter)
- https://gerrit.wikimedia.org/r/#patch,sidebyside,1825,1,files/lucene/lucene.jobs.sh - that is a script that I made based on the scripts that run as crons in prod
- there are a couple of others for doing things like building the .jar on the search indexer and pushing it out to the various search boxes
- there are also some start/stop scripts that I want to turn into an init script
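A minimal shape those start/stop scripts could collapse into as one init script; "lsearchd" and the echo stubs are placeholders for the real daemon and the existing scripts:

```shell
#!/bin/sh
# Hypothetical init-script skeleton wrapping the existing start/stop
# scripts behind a single start/stop/restart entry point.
DAEMON="lsearchd"

do_start() { echo "starting $DAEMON"; }  # stand-in for the start script
do_stop()  { echo "stopping $DAEMON"; }  # stand-in for the stop script

dispatch() {
    case "$1" in
        start)   do_start ;;
        stop)    do_stop ;;
        restart) do_stop && do_start ;;
        *)       echo "Usage: $0 {start|stop|restart}" >&2; return 1 ;;
    esac
}
# an actual init script would end with: dispatch "$1"
```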
deployment-searcher01
search boxes have 16 or 32 GB of RAM, and appear to use 50-100 GB of storage
Local.config
diff --git a/templates/lucene/lsearch.conf.erb b/templates/lucene/lsearch.conf.erb
index 9da2958..ad5b5f2 100644
--- a/templates/lucene/lsearch.conf.erb
+++ b/templates/lucene/lsearch.conf.erb
@@ -1,16 +1,22 @@
+######################################################
+##### THIS FILE IS MANAGED BY PUPPET ####
+##### puppet:///templates/search/lsearch.conf.erb ####
+######################################################
+
 # By default, will check /etc/mwsearch.conf
 
 ################################################
 # Global configuration
 ################################################
 
+### TODO: restructure so this doesn't depend on NFS
 # URL to global configuration, this is the shared main config file, it can
 # be on a NFS partition or available somewhere on the network
 MWConfig.global=file:///home/wikipedia/conf/lucene/lsearch-global-2.1.conf
 
 # Local path to root directory of indexes
 Indexes.path=/a/search/indexes
 
 # Path to rsync
 Rsync.path=/usr/bin/rsync
 
@@ -28,21 +34,21 @@
 Search.updateinterval=0.5
 
 # In seconds, delay after which the update will be fetched
 # used to scatter the updates around the hour
 Search.updatedelay=0
 
 # In seconds, how frequently the dead search nodes should be checked
 Search.checkinterval=30
 
 # Disable wordnet aliases
-Search.disablewordnet=true
+Search.disablewordnet=<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>true<% else -%>false<% end -%>
 
 ################################################
 # Indexer related configuration
 ################################################
 
 # In minutes, how frequently is a clean snapshot of index created
 # 2880 = two days
 Index.snapshotinterval=2880
 
 # Daemon type (http is started by default)
@@ -50,41 +56,51 @@
 
 # Port of daemon (default is 8321)
 #Index.port=8080
 
 # Maximal queue size after which index is being updated
 Index.maxqueuecount=5000
 
 # Maximal time an update can remain in queue before being processed (in seconds)
 Index.maxqueuetimeout=120
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
 Index.delsnapshots=true
+<% end -%>
 
 ################################################
 # Log, ganglia, localization
 ################################################
 
 SearcherPool.size=6
 
+### TODO: restructure so this doesn't depend on NFS
 # URL to message files, {0} is replaced with language code, i.e. En
 Localization.url=file:///home/wikipedia/common/php/languages/messages
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
 # Pattern for OAI repo. {0} is replaced with dbname, {1} with language
 #OAI.repo=http://{1}.wikipedia.org/wiki/Special:OAIRepository
 OAI.username=lsearch2
 OAI.password=<%= lucene_oai_pass %>
+<% end -%>
 # Max queue size on remote indexer after which we wait a bit
 OAI.maxqueue=5000
 
 # Number of docs to buffer before sending to inc updater
 OAI.bufferdocs=500
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
+# UDP Logger config
+UDPLogger.port=51234
+UDPLogger.host=208.80.152.184
+<% end -%>
 
 # RecentUpdateDaemon udp and tcp ports
 #RecentUpdateDaemon.udp=8111
 #RecentUpdateDaemon.tcp=8112
 # Hot spare
 #RecentUpdateDaemon.hostspareHost=vega
 #RecentUpdateDaemon.hostspareUdpPort=8111
 #RecentUpdateDaemon.hostspareTcpPort=8112
 
 # Log configuration
Global.Config
http://noc.wikimedia.org/conf/lsearch-global-2.1.conf
# Logical structure, maps different roles to certain db
[Database]
{file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
enwiki : (nssplit,2)
enwiki : (nspart1,[0],true,20,500,2)
enwiki : (nspart2,[],true,20,500)
enwiki : (spell,40,10) (warmup,500)
mediawikiwiki, metawiki, commonswiki, strategywiki : (language,en)
commonswiki : (nssplit,2) (nspart1,[6]) (nspart2,[])
dewiki, frwiki : (spell,20,5)
dewiki, frwiki, itwiki, ptwiki, jawiki, plwiki, nlwiki, ruwiki, svwiki, zhwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[]) (warmup,0)
[Database-Group]
<all> : (titles_by_suffix,2) (tspart1,[ wiki|w ]) (tspart2,[ wiktionary|wikt, wikibooks|b, wikinews|n, wikiquote|q, wikisource|s, wikiversity|v])
sv-titles: (titles_by_suffix,2) (tspart1,[ svwiki|w ]) (tspart2,[ svwiktionary|wikt, svwikibooks|b, svwikinews|n, svwikiquote|q, svwikisource|src])
mw-titles: (titles_by_suffix,1) (tspart1, [ mediawikiwiki|mw, metawiki|meta ])
# Search hosts layout
[Search-Group]
# search 1 (enwiki)
search1: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search2: enwiki.nspart1.sub1.hl enwiki.spell #enwiki.nspart1.sub2.hl
search3: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search4: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search5: enwiki.nspart1.sub2.hl enwiki.spell #enwiki.nspart1.sub1.hl
search8: enwiki.prefix #enwiki.spell
search9: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search12: enwiki.spell
search13: enwiki.nspart2*
# disable en-titles using a non-existent hostname ending in "x"
search13x: en-titles*
search14: enwiki.nspart1.sub1.hl
search19: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
search20: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# search 2 (de,fr,jawiki)
search6: dewiki.nspart1 dewiki.nspart2 frwiki.nspart1 frwiki.nspart2 jawiki.nspart1 jawiki.nspart2
search6: itwiki.nspart1.hl
search15: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search16: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search17: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
# search 3 (it,nl,ru,sv,pl,pt,es,zhwiki)
search14: eswiki
#search20: eswiki
search7: itwiki.nspart1 ruwiki.nspart1 nlwiki.nspart1 svwiki.nspart1 plwiki.nspart1 ptwiki.nspart1 zhwiki.nspart1
#search7: itwiki.nspart1 itwiki.nspart2 nlwiki.nspart1 nlwiki.nspart2 ruwiki.nspart1 ruwiki.nspart2 svwiki.nspart1
#search9: svwiki.nspart2 plwiki.nspart1 plwiki.nspart2 ptwiki.nspart1 ptwiki.nspart2 zhwiki.nspart1 zhwiki.nspart2
search15: itwiki.nspart2 nlwiki.nspart2 ruwiki.nspart2 svwiki.nspart2 plwiki.nspart2 ptwiki.nspart2 zhwiki.nspart2
search15: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
#search15: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search15: ptwiki.nspart1.hl ptwiki.nspart2.hl
search16: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search16: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search16: ptwiki.nspart1.hl ptwiki.nspart2.hl
search17: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search17: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search17: ptwiki.nspart1.hl ptwiki.nspart2.hl
# search 2-3 interwiki/spellchecks
# disable titles by using a non-existent hostname ending in "x"
search10x: de-titles* ja-titles* it-titles* nl-titles* ru-titles* fr-titles*
search10x: sv-titles* pl-titles* pt-titles* es-titles* zh-titles*
search10: dewiki.spell frwiki.spell itwiki.spell nlwiki.spell ruwiki.spell
search10: svwiki.spell plwiki.spell ptwiki.spell eswiki.spell
# search 4
# disable spell/hl by using a non-existent hostname ending in "x"
search11x: commonswiki.spell commonswiki.nspart1.hl commonswiki.nspart1 commonswiki.nspart2.hl commonswiki.nspart2
search11: commonswiki.nspart1 commonswiki.nspart1.hl commonswiki.nspart2.hl
search11: commonswiki.nspart2
search11: *?
# disable tspart by using a non-existent hostname ending in "x"
search11x: *tspart1 *tspart2
search19: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.))*.spell
search12: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.|jawiki.|zhwiki.))*.hl
# prefix stuffs
search18: *.prefix
# stuffs to deploy in future
searchNone: *.related jawiki.nspart1.hl jawiki.nspart2.hl zhwiki.nspart1.hl zhwiki.nspart2.hl
searchNone: enwiki.spell enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# Indexers
[Index]
searchidx2: *
# Rsync path where indexes are on hosts, after default value put
# hosts where the location differs
# Syntax: host : <path>
[Index-Path]
<default> : /search
[OAI]
simplewiki : http://simple.wikipedia.org/w/index.php
rswikimedia : http://rs.wikimedia.org/w/index.php
ilwikimedia : http://il.wikimedia.org/w/index.php
nzwikimedia : http://nz.wikimedia.org/w/index.php
sewikimedia : http://se.wikimedia.org/w/index.php
alswiki : http://als.wikipedia.org/w/index.php
alswikibooks : http://als.wikibooks.org/w/index.php
alswikiquote : http://als.wikibooks.org/w/index.php
alswiktionary : http://als.wiktionary.org/w/index.php
chwikimedia : http://www.wikimedia.ch/w/index.php
crhwiki : http://chr.wikipedia.org/w/index.php
roa_rupwiki : http://roa-rup.wikipedia.org/w/index.php
roa_rupwiktionary : http://roa-rup.wiktionary.org/w/index.php
be_x_oldwiki : http://be-x-old.wikipedia.org/w/index.php
ukwikimedia : http://uk.wikimedia.org/w/index.php
brwikimedia : http://br.wikimedia.org/w/index.php
dkwikimedia : http://dk.wikimedia.org/w/index.php
trwikimedia : http://tr.wikimedia.org/w/index.php
arwikimedia : http://ar.wikimedia.org/w/index.php
mxwikimedia : http://mx.wikimedia.org/w/index.php
commonswiki: http://commons.wikimedia.org/w/index.php
[Namespace-Boost]
commonswiki : (0, 1) (6, 4)
<default> : (0, 1) (1, 0.0005) (2, 0.005) (3, 0.001) (4, 0.01), (6, 0.02), (8, 0.005), (10, 0.0005), (12, 0.01), (14, 0.02)
# Global properties
[Properties]
# suffixes to database name, the rest is assumed to be language code
Database.suffix=wiki wiktionary wikiquote wikibooks wikisource wikinews wikiversity wikimedia
# Allow only up to 500 results per page
Search.maxlimit=501
# Age scaling based on last edit, default is no scaling
# Below are suffixes (or whole names) with various scaling strength
AgeScaling.strong=wikinews
AgeScaling.medium=mediawikiwiki metawiki
#AgeScaling.weak=wiki
# Use additional per-article ranking data, more suitable for non-encyclopedias
AdditionalRank.suffix=mediawikiwiki metawiki
# suffix for databases that should also have exact-case index built
# note: this will also turn off stemming!
ExactCase.suffix=wiktionary jbowiki
# wmf-style init file, attempt to read OAI and lang info from it
# for sample see http://noc.wikimedia.org/conf/InitialiseSettings.php.html
#WMF.InitialiseSettings=file:///home/wikipedia/common/php-1.5/InitialiseSettings.php
#WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-deployment/wmf-config/InitialiseSettings.php
WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-config/InitialiseSettings.php
# Where common images are
Commons.wiki=commonswiki.nspart1
# Syntax: <prefix_name> : <comma separated list of namespaces>
# <all> is a special keyword meaning all namespaces
# E.g. all_talk : 1,3,5,7,9,11,13,15
[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15
[100] : 100
[101] : 101
[104] : 104
[105] : 105
[106] : 106
[0,6,12,14,100,106]: 0,6,12,14,100,106
[0,100,104] : 0,100,104
[0,2,4,12,14] : 0,2,4,12,14
[0,14] : 0,14
[4,12] : 4,12
todo
- what we would need for the indexer to work for the long term is some script that makes a file like file:///home/wikipedia/common/pmtpa.dblist with the list of all the wikis that should be indexed, in the same format
- push search resources into an Artifactory repository
- LocalSettings.php
- dump.bz2
repository layout
in /org/wikimedia/labs/
- bastion
- search
- search-test
- deployment-prep
- deployment-sql
- deployment-squid
- deployment-dbdump
- deployment-nfs-memc
- deployment-web
- deployment-indexer
- deployment-searcher
murder
https://github.com/lg/murder/blob/master/README.md
OAIRepository testing and what it does
here is a transcript of brion on OAIRepository
pop over to say https://en.wikipedia.org/wiki/Special:OAIRepository
at the HTTP auth prompt use user 'testing', pass 'mctest'. (Is this a public login? I think we should suppress it from the etherpad - if it were public, why would there be a login?)
see http://www.openarchives.org/OAI/openarchivesprotocol.html for general protocol documentation
to install locally... in theory:
make sure you've got OAI extension dir in place
and do the usual require "$IP/extensions/OAI/OAIRepo.php";
you'll only need the repository half
run maintenance/update.php to make sure it installs its tables...
which i think should work
as pages get edited/created/deleted, it'll internally record things into its table, and those records can get read out through the Special:OAIRepository iface. iirc it records the page id (?), possibly a rev id, and a created/edited/deleted state flag; then the interface slurps out current page content at request time.

it's meant to give you current versions of stuff, rather than to show you every individual change (eg, potentially multiple changes since your last query will be "rolled up" into one, and you just download the entire page text as of the last change). or if it's deleted, you get a marker indicating the page was deleted.

it's relatively straightforward, but doesn't always map to what folks want :) for search index updates it's good enough... as long as you're working with source, all you probably need is the 'ListRecords' verb.

here is a url to test OAI for search: https://en.wikipedia.org/wiki/Special:OAIRepository?verb=ListRecords&metadataPrefix=lsearch