mirror of https://github.com/laurent22/joplin.git synced 2024-12-15 09:04:04 +02:00
joplin/readme/spec/search_sorting.md
Shawn Axsom 5eb0417b1a
All: Sort search results by average of multiple criteria, including 'Sort notes by' field setting (#3777)
* Weight search results by most recently updated

As discussed here: https://github.com/laurent22/joplin/pull/3777#issuecomment-696491859
Before this commit, results were rarely sorted by date. Content weights and fuzziness were
determined, and then the first criteria to differ would win in sort order (and user_updated_time
was the last criteria checked).

Now the weight score itself will also include age of user_updated_time, surfacing fresh content.
At the current alpha level, results are weighted logarithmically, prioritizing mostly within the
last 30 days, and especially heavily within the past week.

* Updated unit tests to weight search results by last updated date

* Updated unit test title

* Fixed issue with weighted search engine test, and made it more deterministic using mock date

Date was being calculated only at the start of the test suite. It also wasn't using a set mock date, so the milliseconds between the real search engine calculations and the test calculation caused differences in results

* Added initial Search Engine spec

* Added Search Engine spec to README.md

* Renamed Search Sorting spec per laurent22's mentioned naming

* Revised copy in search sorting spec

Co-authored-by: Laurent <laurent22@users.noreply.github.com>
2020-10-09 21:51:11 +01:00


# Search Engine
The Search Engine powers the Search input in the note list and the Goto Anything dialog.
## Search algorithm
### Discretely using only the most critical parameter in sorting
Sorting occurs as the Search Engine processes results, after searching for and weighting these results.
Parameters include fuzziness, title matching, weight (based on BM25 and age), the completed status of to-dos, and the note's age.
The Search Engine compares results one parameter at a time, and the first parameter on which two results differ determines their order; it does not combine the parameters into a weighted average.
In effect, this means search results with note title matches will appear above all results that only matched the note body,
regardless of weight or other parameters.
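
The first-difference ordering described above can be sketched as a tuple sort key. This is a minimal illustration, not the Search Engine's actual code: the `SearchResult` class, its field names, and the order and direction in which the parameters are compared are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    # All field names here are hypothetical, chosen for illustration only.
    title: str
    fuzzy: bool            # True if the match relied on fuzzy matching
    title_match: bool      # True if the search term matched the note title
    weight: float          # combined BM25 and age weight
    todo_completed: bool   # True for a completed to-do
    updated_time: int      # Unix timestamp in milliseconds


def sort_key(r: SearchResult) -> tuple:
    # Tuples compare element by element, so the first differing
    # parameter decides the order and later ones only break ties.
    # The order and direction of each parameter are assumptions.
    return (
        r.fuzzy,             # exact matches before fuzzy ones
        not r.title_match,   # title matches before body-only matches
        -r.weight,           # higher weight first
        r.todo_completed,    # open to-dos before completed ones
        -r.updated_time,     # newer notes first
    )


results = [
    SearchResult("Body only, heavy", False, False, 9.5, False, 1_600_000_000_000),
    SearchResult("Title hit, light", False, True, 0.3, False, 1_500_000_000_000),
]
ranked = sorted(results, key=sort_key)
```

In this sketch `ranked[0]` is the title match, even though its weight is much lower and it is older, because the title parameter is checked before weight.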
### Determining weight as a sorting parameter
The Search Engine determines the weight parameter using both [BM25](https://en.wikipedia.org/wiki/Okapi_BM25)
and the number of days since last user update.
#### BM25
The Search Engine determines BM25 based on "term frequency-inverse document frequency."
The "TF–IDF" value increases proportionally to the number of times a word appears in the document
and is offset by the number of documents in the corpus that contain the word, which helps to adjust
for the fact that some words appear more frequently in general.
BM25 returns weight zero for a search term that occurs in more than half the notes,
so terms that are abundant across the notes have zero relevance with respect to BM25.
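
As a rough sketch of how such a score behaves, the following uses one common formulation of BM25 with the inverse document frequency clamped at zero, which reproduces the zero weight for terms found in more than half the notes. The function names, the parameter names, and the values `k1=1.2` and `b=0.75` (conventional defaults) are assumptions for illustration, not values taken from the Search Engine.

```python
import math


def idf(docs_with_term: int, total_docs: int) -> float:
    # Inverse document frequency, clamped so that it never goes negative.
    # Once a term appears in at least half the documents the ratio is <= 1,
    # the logarithm is <= 0, and the clamp returns zero.
    ratio = (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)
    return max(math.log(ratio), 0.0)


def bm25_term_score(
    tf: int,               # occurrences of the term in this document
    doc_len: int,          # number of tokens in this document
    avg_doc_len: float,    # average number of tokens per document
    docs_with_term: int,   # documents containing the term
    total_docs: int,       # documents in the corpus
    k1: float = 1.2,       # assumed default: term-frequency saturation
    b: float = 0.75,       # assumed default: length normalization strength
) -> float:
    # Term frequency saturates as tf grows and is normalized by document length.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(docs_with_term, total_docs) * tf * (k1 + 1) / norm
```

With a corpus of 100 notes, `idf(60, 100)` is zero, so a term found in 60 notes contributes nothing regardless of how often it appears in a given note, while a term found in 5 notes scores higher the more often it occurs.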
#### Days since last user update
Sorting increases the BM25 weight by the inverse number of days since the note was updated.
Recent notes will, therefore, be weighted highly in the search results.
This time-based weight decays logarithmically, becoming less of a factor than BM25 after months have passed.
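
One way to picture this age adjustment is the sketch below. The exact formula, the `alpha` scaling constant, and the half-day floor are assumptions chosen so that the boost is based on the inverse of the note's age in days and shrinks logarithmically; they are not confirmed details of the Search Engine.

```python
import math

MS_PER_DAY = 24 * 60 * 60 * 1000


def age_boost(days_since_update: float, alpha: float = 200.0) -> float:
    # Hypothetical boost: logarithm of (1 + inverse age in days).
    # The 0.5-day floor keeps very fresh notes from getting an unbounded boost.
    return alpha * math.log(1 + 1 / max(days_since_update, 0.5))


def weighted_score(bm25_weight: float, now_ms: int, user_updated_time_ms: int) -> float:
    # Adds the age boost on top of the BM25 weight.
    days = (now_ms - user_updated_time_ms) / MS_PER_DAY
    return bm25_weight + age_boost(days)
```

Under these assumptions the boost is largest for notes updated within the last day, drops quickly over the first week, and becomes small after a month, leaving BM25 as the dominant term for older notes.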