Refactoring Fuzzier score calculation

Return to front page

Why would I do that?

I've been working with Fuzzier for over four months now, and it is getting closer to being at 1.0/feature ready. With good amount of other things and QoL changes done since its inception.

The file path and search string comparison/score calculation is one likely the most important feature of the plugin. Having the ability to tweak and modify the search results, so it works the best for each of us is important when using a file finder.

I didn't know what the actual needs were when implementing the initial version and that is alright, but the time to improve is closing in and this time I need to do a better job and have a better idea what kind of structure I want.

I'll most likely start with what feels good and do some minor pivoting and refactoring later on. This usually gives me results when I have the time to really redo things. I've built it once, so it's almost guaranteed that the second time is better when done with time.

During this process, I also moved/created the documentation to a GitHub Wiki. With the goal that all of the documentation is in a single place. I also reduced the readme to a quick start guide.

Where to start

I had to mull for a while before I knew what kind of structures I wanted.

My initial though was to do the calculation in the FuzzyMatchContainer and just add a new sub class to it, but this would not have any positive effect on the project structure. Because I'm hoping to also make the structure a bit clearer with these changes, I decided not to handle the calculation in the entity itself.

I wanted to separate the results of the scores to different fields to make the actual logic and the results visible. This would allow a nice table structure to be shown on the test bench and make testing singular match weights much easier.

To accomplish this, I created a new class ScoreCalculator (really imaginative) that is used to calculate the different score values. This class is instantiated once per search to avoid having to create multiple instances. Just create it once with the correct parameters.

For example, only parse the search string once:

class ScoreCalculator(private val searchString: String) {
    private val lowerSearchString: String = searchString.lowercase()
    private val searchStringParts = lowerSearchString.split(" ")
    private val uniqueLetters = lowerSearchString.toSet()
    

To show and save the values clearly I created FuzzyScore class, which contains the different values. This is then used as a instance variable on the FuzzyMatchContainer class.

class FuzzyScore {
    var streakScore = 0
    var multiMatchScore = 0
    var partialPathScore = 0
    var filenameScore = 0
    
    fun getTotalScore(): Int {
    return streakScore + multiMatchScore + partialPathScore + filenameScore
    }
    }
    

Copying the current implementation

Do start, I want to implement more or less one-to-one copy of the current implementation. Then I can more easily start making some changes without having to worry about breaking things too much.

Preparing for the future

This is (hopefully) a recurring thing in this refactor, but I want to keep eye on some future needs and possible changes.

Tolerance

The easy first thing that jumps to me is tolerance, which is still waiting to be implemented. This would pretty simple, but just havent gotten around to it.

Tolerance prevents a non-match situation.

e.g. usually Fuzzier requires that each of the search string letters are found from the file path in the correct order. Tolerance then would be used to allow an x amount of these "failures" to happen before the program decides that a match is impossible.

When checking if the search string is longer than the current evaluated file path the tolerance must be accounted for in the length check.

// Check if the search string is longer than the file path, which results in no match
    if (lowerSearchString.length > currentFilePath.length) { // TODO: + tolerance when it is implemented
    return null
    }
    

The for loop

I want to process each letter of the search string against each letter of the current filepath. This requires a for loop and my goal is to think of anything that results in less processing.

As many times these days, it still seems that the best solution is to focus on simplicity and not get too tricky. I had big plans on how to optimize the processing and trying process the least amount of characters possible.

In the end, I went for the simple solution and while I'm doing some extra processing the end result is easily readable, ready for future changes and still not slow.

Could it be faster? Yes. Does it need to be faster? Currently, not in my opinion.

Streaks

One of the match weights is streak weight. During the processing, I'm keeping track of the longest streak and it is used in combination with the match weight to add points to the file path.

Multi match

By default same characters are not given any extra score for being present more than once. With multi match we can score them for each appearance on the file path and use the weight multiplier to adjust the scoring.

This requires processing each of the filepath characters with each of the search string characters, which just is more expensive. I don't want to do this unless explicitly necessary, so if no multi match is set we don't run this loop.

I decided to try something a bit more resource intensive, but much more simple. Which is to just iterate over each character of each file path and count the occurences.

Partial path match

For each "full" match of the search string with a partial path we

Filename match weight

This was initially going to be in the future section, but it seemed natural to implement at the same time.

Having a more weight on the actual filename when searching for files sounds like a good idea. This was raised as an issue by bavo to the Fuzzier issues.

(bavo also requested the alternative render mode to show only filename, which was awesome)

Testing

I created a new tests for each of the scoring functions and because they are now a separate scores the tests were really easy to make and feel much better than just checking the total score at the end.

I copied most of the legacy tests from the previous implementation and re-created those for the new process. They worked pretty much out of the box with only a few exceptions and even with these it was much nicer to count the singular scores instead of the compound.

Performance

Running some extra simple "performance tests" using the tensorflow repository, which is quite big with 30k files.

This is not nearly enough to be considered performance testing, but it gives some idea how good the performance is between the two implementations. I really need to do some actual performance testing soon.
I just wanted to have the confidence to release the new version knowing that it won't be noticeable slower than the old one.

Timing the process

I added the timing and count to the meat of the program; the content iterator that does the score calculation. I decided to only check the actual processing time that I was changing, disregarding the parts of the program that stay the same.

fun getContentIterator(projectBasePath: String, searchString: String, listModel: DefaultListModel<FuzzyMatchContainer>): ContentIterator {
    return ContentIterator { file: VirtualFile ->
    val currentTime = currentTimeMillis()
    // actual processing
    timer += currentTimeMillis() - currentTime
    count++
    }
    }
    

Results of the runs

With none of these going over 150 ms and the averages being similar with the multi match taking slighty longer while processing 33473 files, I'm not feeling worried about the performance impact. This is actually a bit better than what I expected.

Running the same six string sequence one-by-one three times on the tensorflow repository and taking the average processing time:

Improving list handling

My solution for re-implementing the html rendering was to create a limit on how many files would be shown on the file list at once. Looking at the code that was handling the file list sorting I realized that it wasn't that great.

val sortedList = listModel.elements().toList().sortedByDescending { it.getScore() }
listModel.clear()
sortedList.forEach { listModel.addElement(it) }

Here, the listModel contains all the files that had a score of one or more, leaving out the files that could not have had a match.
Sorting the whole list, clearing the list model and re-inputting all the files seemed a bit unnecessary. I mean, nobody needs to get a result of thousands of files on the popup.

Instead, I could implement the list limit and only keep the top n files at hand, and only process any file that has a larger score, than the minimum in the list. Fortunately, I've been brushing up with my algorithms lately with The Primeagen's "The Last Algorithms Course You'll Need" so I actually recognized that this looks like a heap.

In Kotlin, you can use Java.util classes, which contain the PriorityQueue. This is a priority heap and can be used with a custom comparator to handle the sorting of the elements. So when adding an element to the queue, it automatically places the item to the correct spot based on the comparator. In this case the score of the FuzzyMatchContainer.
Calling `.remove()` removes the first item on the queue, which is always the one with the smallest score.

The result is not quite as simple looking as the initial implementation, but I think it is clear enough.

/**
 * Process all the elements in the listModel with a priority queue to limit the size
 * and keeping the data sorted at all times
 *
 * Priority queue's size is limit + 1 to prevent any resizing
 * Only add entries to the queue if they have larger score than the minimum in the queue
 *
 * @return a sorted and sized list model
 */
fun sortAndLimit(listModel: DefaultListModel<FuzzyMatchContainer>): DefaultListModel<FuzzyMatchContainer> {
	val priorityQueue = PriorityQueue(listLimit + 1, compareBy(FuzzyMatchContainer::getScore))

	var minimumScore = -1
	listModel.elements().toList().forEach {
		if (it.getScore() > minimumScore) {
			priorityQueue.add(it)
			if (priorityQueue.size > listLimit) {
				minimumScore = priorityQueue.remove().getScore()
			}
		}
	}

	val result = DefaultListModel<FuzzyMatchContainer>()
	result.addAll(priorityQueue.toList().sortedByDescending { it.getScore() }) // Possibly an unnecessary sort, but using poll did not keep the order

	return result
}

With these changes I can easily set up the html rendering, because we can limit the result set to a manageable size. This also enables users to limit the list size even further if they'd like to have even more focused view.

I don't think I ever scroll more than 5 files when using Fuzzier.

Automating the publishing process

Until now, I've been uploading the updates to the plugin manually, but I think it's time to change this.

I want to automate the publishing process to publish a new version to the JetBrains marketplace when I create a new release tag in GitHub.

This means that there is no discrepancy between the GitHub release source and the marketplace binary and I don't have to worry about any local changes that might be present.

Setting up the action

This was remarkably easy.

Publishing using gradle

  1. Create a marketplace token
  2. Set the token to an env variable
  3. Run the gradle task publishPlugin

Setting up the GitHub action

  1. Create a new yml file for the action
  2. I modified the trigger to run on any tag that starts with "v"
  3. Add the token as a repository secret in GitHub
  4. Add an env variable for the token in the yml

The first release using the new publishing process is v0.20.0.

Next steps

I'll be keeping an eye on the GitHub Issues, so if you have any comments or ideas feel free to create an issue to the GitHub issue tracker or contact me with an email (found at the bottom of the page).

Because I really liked the html styled file list renders, I'll be targeting that next. I have a few ideas on how I could prevent the UI from freezing.

I'll also be looking into some light performance testing. This will most likely be the subject for a new article later.

With Fuzzier getting closer to 1.0 release I'm looking forward to moving to a new project and transitioning Fuzzier to maintenance mode. Although I'm not yet sure what the next project will be. I might be looking to do something with Rust.