Show HN: BM25opt – 30-40x faster BM25 search algorithms (FOSS)

(github.com)

1 point | by jankovicsandras 9 hours ago

1 comments

compressedgas 8 hours ago
I expected them to be API compatible. Not that it matters to me but I had looked to see.
- jankovicsandras 7 hours ago
  This is a good point and was a difficult design decision. The reasons for changing the API are:
  - to make it easier to use with an untokenized corpus and untokenized questions
  - to fix tokenization issues (e.g. https://github.com/dorianbrown/rank_bm25/issues/38); rank_bm25 also provides no default tokenizer, and a naive split-on-whitespace is the wrong choice
  - to simplify the code considerably (far fewer SLOC)
  - to point out the similarities between the algorithms, for educational purposes and further development
  In practice, the differences are minimal (see Example 3: comparison with rank_bm25).
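To illustrate the tokenizer point, here is a minimal sketch of why splitting on whitespace can hurt term matching. It uses only the Python standard library. The function names `whitespace_tokenize` and `simple_tokenize` are hypothetical and are not taken from BM25opt or rank_bm25, and the regex tokenizer is just one plausible alternative, not necessarily what either library does.

```python
import re

def whitespace_tokenize(text):
    # Naive approach: punctuation stays attached and case is preserved.
    return text.split()

def simple_tokenize(text):
    # Assumed alternative: lowercase, then keep runs of word characters.
    return re.findall(r"\w+", text.lower())

doc = "Hello, world! BM25 is fast."
query = "hello world"

# Terms shared by query and document under each tokenizer.
overlap_naive = set(whitespace_tokenize(query)) & set(whitespace_tokenize(doc))
overlap_simple = set(simple_tokenize(query)) & set(simple_tokenize(doc))

print(whitespace_tokenize(doc))  # ['Hello,', 'world!', 'BM25', 'is', 'fast.']
print(simple_tokenize(doc))      # ['hello', 'world', 'bm25', 'is', 'fast']
print(overlap_naive)             # set()
print(overlap_simple)
```

With the naive split, the query shares no terms with the document, so any term-matching score such as BM25 would be zero for it; with the regex tokenizer both query terms match.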