Search Platform/Weekly Updates/2024-02-09

From Wikitech

Summary

Further investigation of failed queries on the WDQS main graph shows that most come from a few sources, which gives us some confidence that we can improve the situation significantly by focusing on a small number of use cases.

Other projects are moving along nicely.

What we've accomplished

Improve multilingual zero-results rate

  • ICU token repair corpus is built and daily diffs are running. Reviewing diffs from enabling the ICU tokenizer. Mostly looks good, but there are a few things to track down. (Malayalam has the most unusual results and I'm having a little trouble figuring out what's going on; diffs from my regression test set aren't reproducing easily in focused testing. I'll get to the bottom of it eventually.)

WDQS graph splitting

Misc

  • Investigated, restarted, and backfilled a failed data pipeline. https://phabricator.wikimedia.org/T356030
  • We participated in a Unicode Consortium meeting about the Foundation's membership. Nothing concrete yet, but a lot of goodwill and promises to do introductions and work together in the future. This is especially timely with our current work on ICU token repair.