Workspace Symbols #1876

Draft. Wants to merge 7 commits into base: master.
SuperAuguste (Member) commented Apr 28, 2024

Note: I plan on editing this write-up and publishing it someplace once this PR is merged :)

Introduction

We're seeking to add workspace symbols to ZLS, that is, the ability to search for declarations by name across all the files in your editor's file tree. Building and querying the index incurs significant overhead, which must be minimized to keep ZLS interactive in the editor environments where it runs.

N-grams and trigrams

An n-gram is a chunk of text of size n. It is produced by sliding a window of size n and stride 1 across a larger chunk of text, ending when a new window of size n cannot be created. A trigram is an n-gram where n = 3.

To obtain the trigrams of agent, for example, we obtain the first 3 characters, age, then shift our window by 1 to obtain gen, and again to obtain ent. Shifting our window by 1 again would not yield 3 characters, so we stop here. Thus, the trigrams of agent are age, gen, and ent.
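The sliding window described above can be sketched in a few lines. This is illustrative Python rather than ZLS's Zig code, and the function name `ngrams` is made up for the example. It slices by Python characters, whereas a real implementation may well operate on bytes.

```python
def ngrams(text: str, n: int = 3) -> list[str]:
    """Slide a window of size n with stride 1 across text."""
    # range() is empty when len(text) < n, so inputs that are
    # too short simply produce no n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("agent"))  # ['age', 'gen', 'ent']
print(ngrams("ab"))     # []
```

Note that a name shorter than three characters produces no trigrams at all, so it can never be matched by a trigram lookup.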

N-grams are a nice way to execute approximate searches over a large corpus of text, since they let us consider not only the entirety, prefix, or suffix of a search target, but also all of its constituent parts. Trigrams also enable efficient large-scale regular expression searches (see Zoekt from my ex-employer Sourcegraph), but that's out of scope for this article.

Indexing

We need to index the name of every single global constant, variable, and function declaration. This is easily doable with the now-refactored DocumentScope, which lists all declarations in a single contiguous list.

Note that we could perform trigram indexing immediately during the construction of the DocumentScope, but that would incur overhead on every edit that we'd rather split into a separate task in our multithreaded setup to keep ZLS fast and responsive.

We can begin by attaching a flag, should_be_indexed_for_trigrams, to each declaration during the construction of the DocumentScope, identifying whether it's one of our search targets. This prevents locals and symbols with names shorter than three characters from being indexed.

During indexing, we iterate over the declarations for each document and find the trigrams for their names. We then create an inverse mapping from each trigram in the declaration's name to the declaration.

So for example, if declaration Declaration.Index(1) has name agent, our inverse mapping would look like this:

age -> [all other declarations containing trigram age ..., Declaration.Index(1)]
gen -> [all other declarations containing trigram gen ..., Declaration.Index(1)]
ent -> [all other declarations containing trigram ent ..., Declaration.Index(1)]
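As a rough sketch of the mapping above, assuming a document's declarations are just a list of names and a declaration index is its position in that list (both simplifications, and `build_trigram_index` is a hypothetical name, not a ZLS function):

```python
from collections import defaultdict


def trigrams(name: str) -> list[str]:
    # Sliding window of size 3, stride 1.
    return [name[i:i + 3] for i in range(len(name) - 2)]


def build_trigram_index(names: list[str]) -> dict[str, list[int]]:
    """Map each trigram to the indices of the declarations whose name contains it."""
    index: dict[str, list[int]] = defaultdict(list)
    for decl_index, name in enumerate(names):
        # set() so a name with a repeated trigram (e.g. "aaaa") is listed once.
        for tri in set(trigrams(name)):
            index[tri].append(decl_index)
    return dict(index)


index = build_trigram_index(["agency", "agent", "gentle"])
print(index["age"])  # [0, 1]
print(index["gen"])  # [0, 1, 2]
print(index["ent"])  # [1, 2]
```

Because declarations are visited in increasing index order, every list comes out sorted, which is the property a merge-style intersection relies on at query time.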

This inverse mapping is constructed per-document. After it is constructed, we also construct a Binary Fuse filter to quickly disqualify documents that do not contain certain trigrams at query time.

Querying

We begin by obtaining the trigrams for our query. We then iterate through all our documents and check if each document contains each trigram in our query via the Binary Fuse filter, which cannot return false negatives but can return false positives, albeit with a very low false positive rate. This allows us to reduce our computation to only documents that likely have all our query trigrams, and is especially effective for longer queries.
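The filtering step can be sketched as follows. To keep the example short, it swaps in a small Bloom filter for the Binary Fuse filter: the two are built differently, but both can report false positives and never false negatives, which is the only property this step depends on. The class, the function names, and the filter parameters are all assumptions made for illustration.

```python
import hashlib


def trigrams(name: str) -> list[str]:
    return [name[i:i + 3] for i in range(len(name) - 2)]


class BloomFilter:
    """Stand-in for a Binary Fuse filter: no false negatives, rare false positives."""

    def __init__(self, items, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0
        for item in items:
            for pos in self._positions(item):
                self.bits |= 1 << pos

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def candidate_documents(filters: dict[str, BloomFilter], query: str) -> list[str]:
    """Keep only documents whose filter admits every trigram of the query."""
    wanted = trigrams(query)
    return [doc for doc, f in filters.items()
            if all(f.might_contain(tri) for tri in wanted)]


documents = {"a.zig": ["agent", "parse"], "b.zig": ["render"]}
filters = {
    doc: BloomFilter(tri for name in names for tri in trigrams(name))
    for doc, names in documents.items()
}
# "a.zig" is guaranteed to be a candidate; "b.zig" is very likely filtered out.
print(candidate_documents(filters, "agent"))
```

Each extra query trigram is another chance to disqualify a document, which is why longer queries filter more effectively.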

Once we've gathered our candidate documents, querying is essentially just performing an intersection.

We use a "merge intersection," which is ripped out of merge sort, to intersect lists. King tried beating this approach in a couple of purely hashmap-based ways, but with our setup, which mostly involves many small inverse mappings (~10,000 trigrams with ~30 declarations each, for example), the merge intersection always won out.
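For readers unfamiliar with the technique, here is a minimal sketch of a merge intersection over two sorted lists of declaration indices, plus a helper that folds it over several lists. It is illustrative Python, not the code in this PR, and both function names are hypothetical.

```python
def merge_intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted lists with the two-pointer walk from merge sort."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a[i] cannot appear later in b, skip it
        elif a[i] > b[j]:
            j += 1          # b[j] cannot appear later in a, skip it
        else:
            result.append(a[i])
            i += 1
            j += 1
    return result


def intersect_all(lists: list[list[int]]) -> list[int]:
    """Intersect any number of sorted lists, stopping early once empty."""
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        if not result:
            break
        result = merge_intersect(result, other)
    return result


# Lists for the trigrams age, gen, and ent over ["agency", "agent", "gentle"]:
print(intersect_all([[0, 1], [0, 1, 2], [1, 2]]))  # [1]
```

The walk touches each element at most once and needs no hashing or extra allocation beyond the output, which fits the many-small-lists workload described above.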

I took a look at and partially implemented Fast Set Intersection in Memory, but it seems to be significant overkill for this sort of small intersection application. In Section 4, "Experimental Evaluation," they show that merge intersection performs rather well and sometimes comparably to the implementations shown in the paper for small intersection sizes, so that's what we're sticking with unless someone can find a better solution.

codecov bot commented Apr 30, 2024

Codecov Report

Attention: Patch coverage is 72.00000%, with 7 lines in your changes missing coverage. Please review.

Project coverage is 80.04%. Comparing base (93b7bbd) to head (1296105).

Current head 1296105 differs from pull request most recent head 25ccfa4

Please upload reports for the commit 25ccfa4 to get more accurate results.

| Files | Patch % | Lines |
| src/TrigramStore.zig | 0.00% | 5 Missing ⚠️ |
| src/DocumentStore.zig | 75.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1876      +/-   ##
==========================================
+ Coverage   78.29%   80.04%   +1.75%     
==========================================
  Files          35       35              
  Lines       10687    12410    +1723     
==========================================
+ Hits         8367     9934    +1567     
- Misses       2320     2476     +156     


SuperAuguste reopened this May 3, 2024