feat: Re-implement importer in Go by michaelkedar · Pull Request #4808 · google/osv.dev

michaelkedar · 2026-02-13T02:34:52Z

Big re-implementation of the importer (and importer-deleter) into Go:

- Implemented Go gitter getter
- Moved distinct logic (git/GCS/REST) into relevant files
- Removed some outdated things (e.g. public_logs_bucket)
  - I've removed the writing of ImportFindings, because I think the linter is currently also doing (and overwriting) these
- Parallelised processing of source repositories
  - Also parallelised downloading records and publishing to pubsub within a source repository via a shared worker pool
- Made interfaces so that the code is testable
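
The shared worker pool mentioned in the list above could look roughly like the following. This is a minimal sketch and not the code from this PR: `task`, `runPool`, `processSources`, and the sample data are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// task is a hypothetical unit of work, e.g. "download one record and publish it".
type task func() error

// runPool starts n workers that drain tasks from a shared channel and
// returns the number of tasks that failed.
func runPool(n int, tasks <-chan task) int {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks {
				if err := t(); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return failed
}

// processSources handles every source in parallel. All sources submit their
// per-record tasks to the same channel, so total concurrency is bounded by
// the pool size rather than by the number of sources.
func processSources(sources map[string][]string, workers int) int {
	tasks := make(chan task)
	var producers sync.WaitGroup
	for name, records := range sources {
		producers.Add(1)
		go func(name string, records []string) {
			defer producers.Done()
			for _, r := range records {
				r := r // copy for the closure (needed before Go 1.22)
				tasks <- func() error {
					if r == "" {
						return fmt.Errorf("%s: empty record", name)
					}
					return nil
				}
			}
		}(name, records)
	}
	// Close the channel once every producer is done so the workers exit.
	go func() {
		producers.Wait()
		close(tasks)
	}()
	return runPool(workers, tasks)
}

func main() {
	sources := map[string][]string{
		"source-a": {"A-1", "A-2", ""},
		"source-b": {"B-1"},
	}
	fmt.Println("failed tasks:", processSources(sources, 4))
}
```

Sharing one pool keeps a single large source from starving the others of workers, while still capping the number of simultaneous downloads and publishes.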

I'd appreciate people reviewing this to see how follow-able/self-documenting the code is, and point out where things may be unclear.

michaelkedar · 2026-02-16T01:02:20Z

go/internal/importer/schema_generated.json

This file gets copied here via go:generate from the osv-schema submodule.
I'm not really sure what the best practice is here.
We currently don't have any automation to update this, nor check that it is up to date.

michaelkedar · 2026-02-16T02:35:31Z

/gemini review

gemini-code-assist

Code Review

This is a significant and well-executed rewrite of the importer to Go. The code is well-structured, leveraging concurrency with worker pools and interfaces for testability. The separation of concerns for different source types (Git, GCS, REST) is clean. My review focuses on improving robustness, clarity, and maintainability in a few areas. Overall, great work on this large refactoring!

gemini-code-assist · 2026-02-16T02:37:59Z

go/internal/importer/importer.go

+	if srcTimestamp != nil {
+		msg.Attributes["src_timestamp"] = strconv.FormatInt(srcTimestamp.Unix(), 10)
+	} else {
+		msg.Attributes["src_timestamp"] = ""
+	}


Instead of setting src_timestamp to an empty string when it's not available, it's generally better to omit the attribute from the message altogether. This avoids potential parsing issues on the consumer side and makes the message cleaner.

	if srcTimestamp != nil {
		msg.Attributes["src_timestamp"] = strconv.FormatInt(srcTimestamp.Unix(), 10)
	}

The Python code did this, so I'm hesitant to change it without changing the worker first.
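
To illustrate why the two encodings matter to a consumer, here is a hedged sketch of how a subscriber might read the attribute. `parseSrcTimestamp` is a hypothetical helper, not the actual worker code. It treats a missing attribute and an empty string the same way, which is what a consumer would need to do for either producer behaviour to be safe.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// parseSrcTimestamp returns nil when the attribute is absent or empty,
// and an error when it is present but not a valid Unix timestamp.
func parseSrcTimestamp(attrs map[string]string) (*time.Time, error) {
	raw, ok := attrs["src_timestamp"]
	if !ok || raw == "" {
		return nil, nil
	}
	secs, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid src_timestamp %q: %w", raw, err)
	}
	t := time.Unix(secs, 0).UTC()
	return &t, nil
}

func main() {
	for _, attrs := range []map[string]string{
		{},
		{"src_timestamp": ""},
		{"src_timestamp": "1771209479"},
		{"src_timestamp": "not-a-number"},
	} {
		ts, err := parseSrcTimestamp(attrs)
		fmt.Println(ts, err)
	}
}
```

A consumer that instead calls `strconv.ParseInt` directly on the attribute would fail on the empty string, which is the parsing issue the review comment refers to.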

gemini-code-assist · 2026-02-16T02:37:59Z

go/internal/repos/repos.go

+	if err := os.MkdirAll(dir, 0755); err != nil {
+		return nil, fmt.Errorf("failed to create directory: %w", err)
+	}
+	cmd := exec.CommandContext(ctx, "tar", "-xf", tarPath, "-C", dir)


The code shells out to the tar command to extract the repository archive. This creates an external dependency on the tar binary being present and supporting zstd. To make the application more self-contained and portable, consider using Go libraries to handle the decompression and extraction, such as github.com/klauspost/compress/zstd for zstd decompression and the standard archive/tar package for tar extraction.

I'd like to do this in principle, but it's a lot of lines of code to actually do this in Go (especially iterating through the tar archive), versus just one for shelling out.
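
For a sense of the size of the pure-Go alternative, here is a rough sketch of the tar iteration using only the standard `archive/tar` package. It is not code from this PR. `extractTar` and `demo` are hypothetical names, and the sketch assumes the stream has already been decompressed (for a zstd archive, the reader would first be wrapped in a zstd decoder such as the library suggested above). It handles only directories and regular files.

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// extractTar unpacks an uncompressed tar stream into dir.
func extractTar(r io.Reader, dir string) error {
	root := filepath.Clean(dir)
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return nil // end of archive
		}
		if err != nil {
			return fmt.Errorf("reading tar: %w", err)
		}
		target := filepath.Join(root, hdr.Name)
		// Reject entries that would escape the destination (e.g. "../evil").
		if target != root && !strings.HasPrefix(target, root+string(os.PathSeparator)) {
			return fmt.Errorf("unsafe path in archive: %q", hdr.Name)
		}
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, 0o755); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
				return err
			}
			f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, hdr.FileInfo().Mode().Perm())
			if err != nil {
				return err
			}
			if _, err := io.Copy(f, tr); err != nil {
				f.Close()
				return err
			}
			if err := f.Close(); err != nil {
				return err
			}
		}
	}
}

// demo builds a one-file archive in memory, extracts it to a temporary
// directory, and returns the extracted file's contents.
func demo() (string, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("hello")
	hdr := &tar.Header{Name: "repo/README", Typeflag: tar.TypeReg, Mode: 0o644, Size: int64(len(body))}
	if err := tw.WriteHeader(hdr); err != nil {
		return "", err
	}
	if _, err := tw.Write(body); err != nil {
		return "", err
	}
	if err := tw.Close(); err != nil {
		return "", err
	}
	dir, err := os.MkdirTemp("", "extract")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := extractTar(&buf, dir); err != nil {
		return "", err
	}
	got, err := os.ReadFile(filepath.Join(dir, "repo", "README"))
	return string(got), err
}

func main() {
	content, err := demo()
	fmt.Println(content, err)
}
```

Even without symlink, hard link, or ownership handling, this is noticeably longer than the single `exec.CommandContext` call, which is the trade-off described in the reply above.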

another-rex · 2026-02-16T05:26:01Z

go/cmd/exporter/exporter.go

 		}
 		// Only log when we see a new ID prefix (i.e. roughly once per data source)
-		prefix := filepath.Base(path)
+		prefix := filepath.Base(obj.Name)


huh? How did this work before?

I changed Objects to return metadata instead of just filenames (to save on number of requests that the importer needs to make)

another-rex · 2026-02-18T04:52:31Z

go/internal/database/datastore/vulnerability.go

+	return &VulnerabilityStore{client: client}
+}
+
+func (s *VulnerabilityStore) ListBySource(ctx context.Context, source string, skipWithdrawn bool) iter.Seq2[*models.VulnSourceRef, error] {


Are there plans to split this source thing apart? Or is that more part of the database migration step?

In Datastore, it's annoying but I don't think it really impacts our ability to do queries
(I don't think we'd ever need to query a path without knowing the source, and the < source + ";" works fine)
I think it's fine to leave it here, but it does make sense to have these fields separate if/when we move to Postgres.
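
The `< source + ";"` trick relies on `;` being the byte immediately after `:` in ASCII. Below is a small sketch of the idea, assuming (this is not confirmed by the thread) that the combined field has the form `source:path`. `sourceRange` and `inRange` are hypothetical helpers, not part of the Datastore code in this PR.

```go
package main

import "fmt"

// sourceRange returns the half-open range [lo, hi) covering every key of
// the assumed form "source:path". ';' (0x3B) directly follows ':' (0x3A),
// so every key that starts with lo sorts before hi.
func sourceRange(source string) (lo, hi string) {
	return source + ":", source + ";"
}

// inRange reports whether key falls inside [lo, hi) under byte-wise ordering.
func inRange(key, lo, hi string) bool {
	return key >= lo && key < hi
}

func main() {
	lo, hi := sourceRange("ghsa")
	for _, key := range []string{"ghsa:a/b.json", "ghsa-extra:a.json", "ghsa", "go:x.json"} {
		fmt.Printf("%-20s %v\n", key, inRange(key, lo, hi))
	}
}
```

Only the first key falls inside the range, so a pair of inequality filters on the combined field is enough to list one source without a separate `source` property.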

another-rex · 2026-02-18T05:06:36Z

go/internal/importer/git.go

+	filterPath := func(p string) string {
+		if !strings.HasSuffix(p, sourceRepo.Extension) {
+			return ""
+		}
+		if dirPath := sourceRepo.Git.Path; dirPath != "" {
+			if !strings.HasSuffix(dirPath, "/") {
+				dirPath += "/"
+			}
+			if !strings.HasPrefix(p, dirPath) {
+				return ""
+			}
+		}
+		if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) {
+			return ""
+		}
+
+		return p
+	}


Suggested change

	filterPath := func(p string) string {
		// Has Extension suffix?
		if !strings.HasSuffix(p, sourceRepo.Extension) {
			return ""
		}
		// Has Git.Path Prefix?
		if dirPath := sourceRepo.Git.Path; dirPath != "" {
			if !strings.HasSuffix(dirPath, "/") {
				dirPath += "/"
			}
			if !strings.HasPrefix(p, dirPath) {
				return ""
			}
		}
		// Does it match an ignore pattern?
		if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) {
			return ""
		}

		return p
	}

Are these comments really adding anything?
like, Has Extension suffix? and strings.HasSuffix(p, sourceRepo.Extension) basically communicate the same amount of info

another-rex · 2026-02-19T00:10:18Z

go/internal/importer/rest.go

+			return err
+		}
+		req = req.WithContext(ctx)
+		resp, err := config.HTTPClient.Do(req)


Is there an err if response is not 200?

If so do we want to do the fallback when HEAD reqs are not supported?

If not, can we check if it returns 200 and if not log a warning?

err is nil if there's a non-2xx response.
Done, and moved to a checkHEAD function.
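
A status check along the lines discussed might look like this. It is a sketch only: the real `checkHEAD` is not shown in this thread, so the signature, the returned value, and the `demo` helper here are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkHEAD issues a HEAD request and returns the parsed Last-Modified time.
// http.Client.Do returns a nil error for non-2xx responses, so the status
// code has to be checked explicitly. A missing or malformed Last-Modified
// header is also reported as an error by http.ParseTime.
func checkHEAD(ctx context.Context, client *http.Client, url string) (time.Time, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, url, nil)
	if err != nil {
		return time.Time{}, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return time.Time{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return time.Time{}, fmt.Errorf("HEAD %s: unexpected status %s", url, resp.Status)
	}
	return http.ParseTime(resp.Header.Get("Last-Modified"))
}

// demo runs checkHEAD against a local test server that answers with the
// given status and reports whether checkHEAD returned an error.
func demo(status int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Last-Modified", "Mon, 16 Feb 2026 02:37:59 GMT")
		w.WriteHeader(status)
	}))
	defer srv.Close()
	_, err := checkHEAD(context.Background(), srv.Client(), srv.URL)
	return err != nil
}

func main() {
	fmt.Println("200 is an error:", demo(http.StatusOK))
	fmt.Println("405 is an error:", demo(http.StatusMethodNotAllowed))
}
```

A caller could treat the error from a 405 response as the signal to log a warning and fall back to a full GET, which is the fallback raised in the question above.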

another-rex · 2026-02-19T00:14:19Z

go/internal/importer/rest.go

+		timeToUpdate = lastModTime
+	}
+	sourceRepo.REST.LastUpdated = &timeToUpdate
+	sourceRepo.REST.IgnoreLastImportTime = false


Should we do the update to switch this back immediately? I guess it shouldn't matter if another importer run doesn't run at the same time.

If there is some transient error in the importer and it doesn't complete, I don't know if we want to write back immediately versus retrying.
And yeah, since only one importer may run at a time, I think this is okay.

another-rex · 2026-02-23T00:18:51Z

go/cmd/importer/main.go

 	}
 }
+
+// importerSampleRate returns the sample rate for the importer (not the individual vulnerability entries).


Can you explain the difference between importer and individual vulnerability entries here?

Added some more detail here and in vulnerabilitySampleRate()

another-rex

Probably should mention in the PR/commit description that this does not do git fetches anymore

michaelkedar · 2026-02-23T05:17:48Z

Probably should mention in the PR/commit description that this does not do git fetches anymore

It does still do a pull

google deleted a comment from gemini-code-assist bot Feb 13, 2026

michaelkedar force-pushed the 📥❌🐍 branch from 06bcec1 to f38470e Compare February 16, 2026 00:59

michaelkedar commented Feb 16, 2026

gemini-code-assist bot reviewed Feb 16, 2026

michaelkedar requested review from Ly-Joey, another-rex, cuixq and jess-lowe February 16, 2026 02:59

michaelkedar marked this pull request as ready for review February 16, 2026 02:59

michaelkedar requested a review from tobyhawker February 17, 2026 03:32

michaelkedar force-pushed the 📥❌🐍 branch from 715b8ce to b0e215a Compare February 17, 2026 05:36

another-rex reviewed Feb 19, 2026

michaelkedar added 15 commits February 19, 2026 03:56

wew

9e27a63

lad

5e09e50

tetsing

b6a4dd7

bucket efficiency

b08b4f9

deletions, tests and the like

cecf62f

git be gone

2ebb0f2

tidy tests a lil

af026e6

support strictness

9d1bd30

parallel deletions

b141cff

formatting

e3f35a4

rebase+gomod

cc7678c

deployment

d10cd59

REST timing things

7deaa5e

🤖🔎

358e0c4

buncha stuff

fe9e8df

michaelkedar force-pushed the 📥❌🐍 branch from 4f194a5 to fe9e8df Compare February 19, 2026 03:58

review

c8fc977

another thing

b4ca398

michaelkedar requested a review from another-rex February 19, 2026 05:35

michaelkedar added 2 commits February 20, 2026 05:24

send vuln proto in pubsub data

1419661

linter my beloathed

8acf865

another-rex previously approved these changes Feb 23, 2026

Merge remote-tracking branch 'upstream' into 📥❌🐍

5337d67

michaelkedar dismissed another-rex’s stale review via 5337d67 February 23, 2026 02:05

🤖🖋️🔥

4f60368

another-rex previously approved these changes Feb 23, 2026

another-rex reviewed Feb 23, 2026

michaelkedar requested review from a team and removed request for a team February 23, 2026 23:54

cuixq previously approved these changes Feb 24, 2026

Merge branch 'master' into 📥❌🐍

743c79e

michaelkedar dismissed stale reviews from cuixq and another-rex via 743c79e February 24, 2026 04:53

michaelkedar added 2 commits February 24, 2026 04:56

change wording of flag

0a3c80f

gitter endpoint & fall back to git binary

0a9025b

michaelkedar requested a review from another-rex February 24, 2026 05:28

another-rex approved these changes Feb 24, 2026

michaelkedar merged commit 49d11e3 into google:master Feb 24, 2026
19 checks passed

michaelkedar deleted the 📥❌🐍 branch February 24, 2026 22:52
