Machine Learning/Technical Meeting Notes

From Wikitech

2022-02-23

Feast

  • https://phabricator.wikimedia.org/T294434
  • Andy: what do we want to learn? what would be a good demo for an online feature store?
    • Just load a bunch of revscoring features in?
    • What kind of storage makes sense for us? Swift, parquet, sql?
  • Luca: How do we want to structure the +3 nodes in eqiad & codfw
    • we have codfw, eqiad still needs to be racked
    • score cache could be a separate Redis instance
  • Tobias: having separate instances would allow us to tune each instance.
  • Luca: Feast wants a single Redis endpoint (host & config); if we have multiple nodes, we may need a proxy in the middle.
  • Let's figure out how we save the registry, and also how to handle the single Redis endpoint.
  • Followup: how do we load data into Feast? (airflow) How much space do we need? etc.

ORES deploy

  • https://phabricator.wikimedia.org/T300195
  • Andy: was there an issue on beta?
  • Luca: There was a local proxy issue, which Taavi fixed; not sure if the deployment-prep VM is fixed.
    • Nothing is burning, no weird errors, things seem to work.
    • Might be nice for learning if Aiko wants to work on the task w/ Luca
  • Aiko: I would like to learn more about the difference between ORES & Lift Wing

Changeprop calling ORES

  • https://phabricator.wikimedia.org/T301409
  • Luca: we could post to eventgate in our model.py for Lift Wing
  • Tobias: could we do this in Istio on our side? a bit like request logging right?
  • Luca: it's just a simple POST request so we could do it in the Python code, could maybe try Knative Eventing but we are using an older version of Knative.
  • Tobias: agreed, doing it w/ a library or wrapper makes a lot of sense, only tricky bit is you don't want to delay the call.
  • Andy: does consistency matter? can we just fire off a post request via asyncio and then return our prediction to the user?
  • Luca: that should be fine, if some are missing it's not super problematic.
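
The fire-and-forget idea discussed above could look roughly like the sketch below. It is only an illustration under assumptions: `post_event`, `safe_post`, `predict` and the in-memory `sink` are hypothetical names, and the simulated sleep stands in for a real HTTP POST to EventGate from model.py.

```python
import asyncio

# Hypothetical stand-in for the HTTP POST to EventGate. A real async HTTP
# client would go here; the sleep only simulates network latency.
async def post_event(event: dict, sink: list) -> None:
    await asyncio.sleep(0.05)
    sink.append(event)

# Wrapper that swallows failures: a missing event is acceptable per the notes.
async def safe_post(event: dict, sink: list) -> None:
    try:
        await post_event(event, sink)
    except Exception:
        pass

# Strong references to in-flight tasks so they are not garbage-collected.
_background_tasks: set = set()

async def predict(rev_id: int, sink: list) -> dict:
    prediction = {"rev_id": rev_id, "score": 0.5}  # placeholder score
    # Schedule the POST without awaiting it, so the response is not delayed.
    task = asyncio.create_task(safe_post(prediction, sink))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return prediction

async def demo() -> tuple:
    sink: list = []
    prediction = await predict(12345, sink)
    delivered_at_return = len(sink)  # the POST has not completed yet
    await asyncio.gather(*_background_tasks)  # only so the demo can inspect the sink
    return prediction, delivered_at_return, len(sink)
```

Calling `asyncio.run(demo())` shows the prediction being returned before the event is delivered. The trade-off matches the discussion: the caller never waits on the POST, at the cost of occasionally losing an event.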

Lift Wing migration

  • https://phabricator.wikimedia.org/T301409
  • Luca: moving models will take space on the cluster with the current cgroup configs
    • was hoping the new system would not
  • Tobias: Lift Wing is not homebrew, which is an advantage.
  • Andy: the revscoring images are pretty big, also they include a ton of assets
    • other models won't necessarily be like this
  • Luca: we are halfway through migrating ORES images and cpu/memory is filling up.
    • maybe fine with 8 nodes?

2022-02-16

ORES deploy

  • Luca: we are unblocked, recent patches are now running on beta

API Gateway platform

  • Chris: I think there is now a big push to get it in a good place, which is awesome for us
  • Luca: We should connect and start making sure everything works as expected
    • header pathing map etc.

editquality migration

  • Luca: we may need to change Swift clusters to MOSS
    • the paths should stay the same

eventgate scores

2022-02-09

API Gateway

  • https://phabricator.wikimedia.org/T288789
  • Are we blocked?
    • Chris: short-term, Hugh & us will unblock; unsure of the long-term status of the project
    • Tobias: will let you know outcome of upcoming meeting, all our asks could be no big issue.

Transformers (again)

  • https://phabricator.wikimedia.org/T294419#7688032
  • Images are big (transformer + predictor)
    • also transformer + predictor both need to mount/load model into pod from storage
  • editquality will have 30+ isvcs running two large images
  • Chris: Do we want to use transformers on future models? Yes. The ORES models are a special case.
  • Kevin: the transformers seemed fine until we needed to load the models into the separate pods, now it seems really heavy.
  • Andy: my one argument for keeping the heavy transformers was that we could use it in an explainer, but that does not seem to work (maybe a kserve bug?)
  • Tobias: Forcing transformers architecture on revscoring models may not help us gain much, other than keeping it alive longer.
  • Luca: for editquality it might make sense to go back to predictor
    • the mw transformers are not async either so there is a bottleneck.
  • Kevin: Also, regarding maintainability, we are currently loading models + revscoring and all its dependencies in both the transformer and predictor. Loading them in only the predictor is much more maintainable. It's more DRY.

ORES migration

  • Chris: We should get all the models onto Lift Wing before ORES dies (hardware out of warranty, stretch is ending LTS in a few months, etc.) and then we can improve the models.
  • Luca: Looking at traffic in ores1001 - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=ores1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores
    • Lots of hosts doing nothing
  • Tobias: I wish we could easily sample say 5% of current ORES-bound traffic and see what happens to it on LW
  • Luca: We could use changeprop and see how it handles on LW
  • Chris: We could do an experiment, although every single model we are not able to migrate over is a conversation we need to have with the community.
  • Andy: It would be pretty easy to migrate editquality over to LW. We just need to add the model binary files to Thanos Swift and then update the helm files.
  • Chris: Let's load every single ORES model into LW. We got 110+ of them, let's start moving them.
  • Luca: For editquality, let's get rid of transformer, move to predictor-only and then start spinning up pods.
  • Andy: I will make a task and then Kevin and I can split the work from there.
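
One way to approximate the "sample 5% of ORES-bound traffic" idea is deterministic hash-based bucketing, sketched below. This is an assumption about how sampling could be done, not something decided in the meeting; `in_sample` and the request-id format are hypothetical.

```python
import hashlib

def in_sample(request_id: str, percent: float = 5.0) -> bool:
    """Return True if this request falls in the sampled share of traffic.

    Hashing the id (instead of drawing a random number) means the same
    request id is always routed the same way, which makes comparisons
    between ORES and Lift Wing reproducible.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% each).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100

# Example: count how many of 20,000 hypothetical request ids are sampled.
sampled = sum(in_sample(f"rev-{i}") for i in range(20_000))
fraction = sampled / 20_000
```

With a uniform hash, `fraction` should land close to 0.05. Requests for which `in_sample` returns True could then be mirrored to Lift Wing while the rest continue to ORES only.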

Staging Environment

  • Kevin: What is our staging environment going to be?
    • Chris: Should we do dev on staging?
  • Luca: We have ml-sandbox
    • Kevin: ml-sandbox is good
  • Andy: I think we were unsure of what the testing cluster would be used for. Also the cluster-local-gateway networking issues hadn't been solved on ml-sandbox yet, so we were unsure if we should continue maintaining our dev cluster. Things are good now and I think ml-sandbox is good for dev.

2022-01-26

ORES deploy planning

API Gateway Integration

  • https://phabricator.wikimedia.org/T288789
  • Chris: Where are we on this?
  • Tobias: Luca and I have been discussing our wants & needs, still need to get info about feasibility.
  • Chris: Let's figure out nice-to-haves, needs for production system and what we need to get to MVP.
  • Luca: All things we have asked for have almost been delivered, but we need to start testing the integration
    • Hugh has been very helpful in deploying changes to prod.

Image Recommendation?

  • Chris: it's not really a standalone ML model
    • we won't need to host this (built-in logistic regression feature in Elasticsearch)
  • Luca: where is this hosted / who owns this?
    • Chris: Cormac on platform, I believe.

2022-01-19

Lift Wing MVP check-in

Wikidata ORES spikes

  • https://phabricator.wikimedia.org/T299137
  • score_errored
  • no visibility
    • missing logs
  • Luca: i think it happens during feature extraction?
  • Luca: adding more logging would help us figure out what is happening
  • Chris: this will help keep ores stable while we migrate to lift wing
  • Andy: this could be helpful for us to see if there is a bug buried somewhere.
  • Luca: maybe not fix the bug but help us know where the issue is

ORES deploy

  • Andy: we need to deploy the new nlwiki articlequality model, maybe this week?
  • Luca: let's include the logging updates for the spikes
  • Andy: +1, i'm happy to do the deploy, maybe we can record it and use it as a side-by-side comparison video w/ lift wing?

ORES clients

  • Luca: let's catalog all clients, bots, and how people are calling ORES (APIs etc.); maybe a good starting task for Aiko?
  • Andy: +1 - we need local contacts for wiki communities too
    • Hal (privacy engineer) had suggested having 'service cards' that describe all downstream users for a model, might be good to have early-on for lift wing.
  • Luca: a preliminary list of users, tools, etc.

2022-01-12

Feast spike & hardware order

  • Luca: We should discuss our plans for the Feature store(s), we have to review procurement tasks for dcops this week. Current Plan:
    • 3 redis-like nodes in eqiad
    • 3 redis-like nodes in codfw
    • 2 nodes in eqiad (for the offline store, but it was in early stages, not even sure if we need those)
  • Online Feature Store Task: https://phabricator.wikimedia.org/T294434
  • Score Cache:
    • Luca: ORES models may not need a feature store right away
    • Chris: from product perspective, score cache would handle MVP use-case
    • Tobias: Having a cache and not needing it is lower risk than doing full feature store
  • Chris: let's try to use the same boxes for score cache and then later on the online feature-store.
    • build score-cache first, then progress to online-feature store
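
As a rough illustration of the score-cache-first plan, the sketch below shows a cache-aside lookup with a time-to-live. It uses an in-memory dict where the notes envision Redis-like nodes, and `ScoreCache`, `get_or_compute` and the key layout are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple

class ScoreCache:
    """Cache-aside score cache keyed by (model name, revision id)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        # key -> (expiry timestamp, cached score)
        self._entries: Dict[Tuple[str, int], Tuple[float, dict]] = {}

    def get_or_compute(self, model: str, rev_id: int,
                       compute: Callable[[], dict]) -> dict:
        key = (model, rev_id)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh entry: skip feature extraction and scoring
        score = compute()  # miss or expired: run the model
        self._entries[key] = (now + self._ttl, score)
        return score
```

A caller would wrap the expensive scoring path, for example `cache.get_or_compute("enwiki-damaging", rev_id, lambda: score_revision(rev_id))`, where `score_revision` is a placeholder for the actual model call. Swapping the dict for a Redis client would keep the same interface.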

nlwiki articlequality deployment

ORES migration

  • Luca: we will need to update clients / all users of ORES with new endpoints
    • New URLs will not be a simple redirect due to API gateway etc.
    • Chris: Let's start getting in touch with down-stream users, I will start asking around.

ORES wikidata spike

  • Luca: we are seeing occasional spikes: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=72&orgId=1&refresh=1m
  • ML monitoring
    • Prometheus -> Grafana
    • logs -> logstash(?)
    • Status codes
      • Tobias: 4xx is client screwed up, 5xx is we screwed up
    • Tracing
      • Tobias: there are some great rpc tracing tools that let you explore each step in a workflow, it would be helpful to have something similar
      • Andy: I've seen zipkin and jaeger recommended for distributed tracing in our stack
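
The 4xx/5xx split mentioned under status codes can be turned into separate error rates, as in the minimal sketch below. The function names and the sample status list are made up for illustration; real numbers would come from Prometheus or the request logs.

```python
from collections import Counter
from typing import Dict, Iterable

def classify(status: int) -> str:
    """Bucket an HTTP status code by who is responsible for the failure."""
    if 400 <= status <= 499:
        return "client_error"  # the caller sent a bad request
    if 500 <= status <= 599:
        return "server_error"  # the service failed
    return "other"

def error_rates(statuses: Iterable[int]) -> Dict[str, float]:
    """Return the share of client and server errors among all responses."""
    codes = list(statuses)
    counts = Counter(classify(code) for code in codes)
    total = len(codes) or 1  # avoid division by zero on an empty window
    return {
        "client_error": counts["client_error"] / total,
        "server_error": counts["server_error"] / total,
    }

# Hypothetical window of responses.
rates = error_rates([200, 200, 404, 500])
```

Tracking the two rates separately lets alerts fire on the server-side share alone, since a burst of 4xx responses usually points at a misbehaving client.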

Load Testing

2021-12-15

Transformers

Deployment Pipeline image issues