Workspace Symbols #1876

SuperAuguste · 2024-04-28T04:18:11Z

Note: I plan on editing this write-up and publishing it someplace once this PR is merged :)

Introduction

We're seeking to add workspace symbols to ZLS: the ability to search for declarations by name across all the files in your editor's file tree. Building and querying such an index incurs significant overhead, which must be minimized to keep ZLS interactive in the editor environments where it runs.

N-grams and trigrams

An n-gram is a chunk of text of size n. It is produced by sliding a window of size n and stride 1 across a larger chunk of text, ending when a new window of size n cannot be created. A trigram is an n-gram where n = 3.

To obtain the trigrams of agent, for example, we obtain the first 3 characters, age, then shift our window by 1 to obtain gen, and again to obtain ent. Shifting our window by 1 again would not yield 3 characters, so we stop here. Thus, the trigrams of agent are age, gen, and ent.
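The sliding window described above can be sketched in a few lines. This is an illustration in Python rather than the Zig code in this PR, and the function name `trigrams` is an assumption made for the example.

```python
def trigrams(text: str) -> list[str]:
    """Return every 3-character window of `text`, in order, with stride 1."""
    n = 3
    # range() is empty when len(text) < n, so short inputs yield no trigrams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(trigrams("agent"))  # ['age', 'gen', 'ent']
print(trigrams("ab"))     # []
```

Note that a name shorter than three characters produces no trigrams at all, which is why such names cannot be found through a trigram index.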

N-grams are a nice way to run approximate searches over a large corpus of text: they let us consider not only the entirety, prefix, or suffix of a search target, but also all of its constituent parts. Trigrams also enable efficient large-scale regular expression searches (see Zoekt from my ex-employer Sourcegraph), but that's out of scope for this article.

Indexing

We need to index the name of every single global constant, variable, and function declaration. This is easily doable with the now-refactored DocumentScope, which lists all declarations in a single contiguous list.

Note that we could perform trigram indexing immediately during the construction of the DocumentScope, but that would incur overhead on every edit that we'd rather split into a separate task in our multithreaded setup to keep ZLS fast and responsive.

We can begin by attaching a flag, should_be_indexed_for_trigrams, to each declaration during the construction of the DocumentScope. The flag identifies whether the declaration is one of our search targets, which prevents locals and symbols with names shorter than three characters from being indexed.

During indexing, we iterate over the declarations for each document and find the trigrams for their names. We then create an inverse mapping from each trigram in the declaration's name to the declaration.

So for example, if declaration Declaration.Index(1) has name agent, our inverse mapping would look like this:

age -> [all other declarations containing trigram age ..., Declaration.Index(1)]
gen -> [all other declarations containing trigram gen ..., Declaration.Index(1)]
ent -> [all other declarations containing trigram ent ..., Declaration.Index(1)]
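A minimal sketch of building such an inverse mapping for one document, written in Python for illustration only. The names `build_index` and `trigrams` are hypothetical, declarations are represented by plain integer indices, and a dictionary of lists stands in for whatever multimap the real implementation uses.

```python
from collections import defaultdict


def trigrams(text: str) -> list[str]:
    """Return every 3-character window of `text`, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_index(names: list[str]) -> dict[str, list[int]]:
    """Map each trigram to the indices of the declarations whose name contains it."""
    index: dict[str, list[int]] = defaultdict(list)
    for decl_index, name in enumerate(names):
        # set() ensures a name with a repeated trigram (e.g. "aaaa") is listed once.
        for tri in set(trigrams(name)):
            index[tri].append(decl_index)
    return dict(index)


# Hypothetical declaration names; position in the list plays the role of Declaration.Index.
index = build_index(["main", "agent", "gentle"])
print(index["age"])  # [1]
print(index["gen"])  # [1, 2]
```

Because declarations are visited in increasing index order, every list comes out sorted, which is convenient if the lists are later intersected with a merge-style algorithm.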

This inverse mapping is constructed per-document. After it is constructed, we also construct a Binary Fuse filter to quickly disqualify documents that do not contain certain trigrams at query time.

Querying

We begin by obtaining the trigrams for our query. We then iterate through all our documents and check if each document contains each trigram in our query via the Binary Fuse filter, which cannot return false negatives but can return false positives, albeit with a very low false positive rate. This allows us to reduce our computation to only documents that likely have all our query trigrams, and is especially effective for longer queries.
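The candidate-selection step can be sketched as follows. This Python illustration swaps in a tiny Bloom filter as a stand-in, because a Binary Fuse filter is considerably more involved to build; the two share the property that matters here (no false negatives, occasional false positives). The class `BloomFilter`, the function `candidate_documents`, the filter sizes, and the file names are all assumptions made for the example.

```python
import hashlib


def trigrams(text: str) -> list[str]:
    """Return every 3-character window of `text`, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class BloomFilter:
    """Stand-in approximate-membership filter (NOT a Binary Fuse filter)."""

    def __init__(self, items, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set
        for item in items:
            for pos in self._positions(item):
                self.bits |= 1 << pos

    def _positions(self, item: str):
        # Derive several bit positions per item by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def might_contain(self, item: str) -> bool:
        # True for every inserted item; rarely True for items never inserted.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def candidate_documents(filters: dict[str, BloomFilter], query: str) -> list[str]:
    """Keep only documents whose filter accepts every trigram of the query."""
    wanted = trigrams(query)
    # A query shorter than three characters has no trigrams, so nothing is ruled out.
    return [doc for doc, f in filters.items()
            if all(f.might_contain(t) for t in wanted)]


filters = {
    "a.zig": BloomFilter(trigrams("agent")),
    "b.zig": BloomFilter(trigrams("main")),
}
# "a.zig" is guaranteed to be listed; "b.zig" appears only on a false positive.
print(candidate_documents(filters, "gent"))
```

Each extra query trigram is another chance for a non-matching document to be rejected, which is why the filter pays off most on longer queries.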

Once we've gathered our candidate documents, querying is essentially just performing an intersection.

We use a "merge intersection," which is ripped out of merge sort, to intersect lists. King tried beating this approach in a couple of purely hashmap-based ways, but with our setup which mostly involves many small inverse mappings (~10,000 trigrams with ~30 declarations each, for example), the merge intersection always won out.
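For readers unfamiliar with the technique, here is a minimal sketch of a merge intersection over two sorted lists of declaration indices. It is written in Python for illustration; the function name `merge_intersect` and the sample lists are assumptions, not code or data from this PR.

```python
def merge_intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two ascending lists in O(len(a) + len(b)) time."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1  # a[i] is too small to appear in b from position j onward
        elif a[i] > b[j]:
            j += 1  # b[j] is too small to appear in a from position i onward
        else:
            result.append(a[i])  # present in both lists
            i += 1
            j += 1
    return result


# Hypothetical lists for the two trigrams of the query "gent".
gen = [1, 4, 7, 9]
ent = [2, 4, 9, 12]
print(merge_intersect(gen, ent))  # [4, 9]
```

For a query with more than two trigrams, the result of one intersection can be fed into the next. The loop touches each element at most once and needs no hashing or extra allocation beyond the output list, which suits many small lists.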

I took a look at and partially implemented Fast Set Intersection in Memory, but it seems to be significant overkill for this sort of small intersection application. In Section 4, "Experimental Evaluation," they show that merge intersection performs rather well and sometimes comparably to the implementations shown in the paper for small intersection sizes, so that's what we're sticking with unless someone can find a better solution.

codecov · 2024-04-30T22:28:57Z

Codecov Report

Attention: Patch coverage is 72.00%, with 7 lines in your changes missing coverage. Please review.

Project coverage is 80.04%. Comparing base (93b7bbd) to head (1296105).

❗ Current head 1296105 differs from pull request most recent head 25ccfa4

Please upload reports for the commit 25ccfa4 to get more accurate results.

Files	Patch %	Lines
src/TrigramStore.zig	0.00%	5 Missing ⚠️
src/DocumentStore.zig	75.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1876      +/-   ##
==========================================
+ Coverage   78.29%   80.04%   +1.75%     
==========================================
  Files          35       35              
  Lines       10687    12410    +1723     
==========================================
+ Hits         8367     9934    +1567     
- Misses       2320     2476     +156


SuperAuguste closed this May 3, 2024

SuperAuguste reopened this May 3, 2024

SuperAuguste added 6 commits May 24, 2024 11:35

Trigram indexing names during DocScope construction

07bb2f0

Basic workspace symbols working

6d75192

Track handles in workspace folders, open all files in workspace

eedcaed

TrigramStore split

087c028

Use list

2fe497a

Use merge intersection, multimap

63ac820

SuperAuguste force-pushed the auguste/workspace-symbols branch from 1296105 to 63ac820 Compare May 28, 2024 22:16

Filter not optional

25ccfa4
