Skip to content

feat: Re-implement importer in Go#4808

Merged
michaelkedar merged 24 commits intogoogle:masterfrom
michaelkedar:📥❌🐍
Feb 24, 2026

Hidden character warning

The head ref may contain hidden characters: "\ud83d\udce5\u274c\ud83d\udc0d"
Merged

feat: Re-implement importer in Go#4808
michaelkedar merged 24 commits intogoogle:masterfrom
michaelkedar:📥❌🐍

Conversation

@michaelkedar
Copy link
Member

@michaelkedar michaelkedar commented Feb 13, 2026

Big re-implementation of the importer (and importer-deleter) into Go:

  • Implemented go gitter getter
  • Moved distinct logic (git/GCS/REST) into relevant files
  • Removed some outdated things (e.g. public_logs_bucket)
    • I've remove the writing of ImportFindings, because I think the linter is currently also doing (and overwriting) these
  • Parallelised processing of source repositories
    • also parallelised downloading records and publishing to pubsub within a source repository via a shared worker pool
  • Made interfaces so that the code is testable.

I'd appreciate people reviewing this to see how follow-able/self-documenting the code is, and point out where things may be unclear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file gets copied here via go:generate from the osv-schema submodule.
I'm not really sure what the best practice is here.
We currently don't have any automation to update this, nor check that it is up to date.

@michaelkedar
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a significant and well-executed rewrite of the importer to Go. The code is well-structured, leveraging concurrency with worker pools and interfaces for testability. The separation of concerns for different source types (Git, GCS, REST) is clean. My review focuses on improving robustness, clarity, and maintainability in a few areas. Overall, great work on this large refactoring!

Comment on lines +324 to +328
if srcTimestamp != nil {
msg.Attributes["src_timestamp"] = strconv.FormatInt(srcTimestamp.Unix(), 10)
} else {
msg.Attributes["src_timestamp"] = ""
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of setting src_timestamp to an empty string when it's not available, it's generally better to omit the attribute from the message altogether. This avoids potential parsing issues on the consumer side and makes the message cleaner.

	if srcTimestamp != nil {
		msg.Attributes["src_timestamp"] = strconv.FormatInt(srcTimestamp.Unix(), 10)
	}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python code did this, so I'm hesitant to change it without changing the worker first.

if err := os.MkdirAll(dir, 0755); err != nil {
return nil, fmt.Errorf("failed to create directory: %w", err)
}
cmd := exec.CommandContext(ctx, "tar", "-xf", tarPath, "-C", dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The code shells out to the tar command to extract the repository archive. This creates an external dependency on the tar binary being present and supporting zstd. To make the application more self-contained and portable, consider using Go libraries to handle the decompression and extraction, such as github.com/klauspost/compress/zstd for zstd decompression and the standard archive/tar package for tar extraction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to do this in principle, but it's a lot of lines of code to actually do this in go (especially iterating through the tar archive), versus just one for shelling out.

}
// Only log when we see a new ID prefix (i.e. roughly once per data source)
prefix := filepath.Base(path)
prefix := filepath.Base(obj.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? How did this work before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed Objects to return metadata instead of just filenames (to save on number of requests that the importer needs to make)

return &VulnerabilityStore{client: client}
}

func (s *VulnerabilityStore) ListBySource(ctx context.Context, source string, skipWithdrawn bool) iter.Seq2[*models.VulnSourceRef, error] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there plans to split this source thing apart? Or is that more part of the database migration step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Datastore, it's annoying but I don't think it really impacts our ability to do queries
(I don't think we'd ever need to query a path without knowing the source, and the < source + ";" works fine)
I think it's fine to leave it here, but it does make sense to have these fields separate if/when we move to postgres

Comment on lines +157 to +174
filterPath := func(p string) string {
if !strings.HasSuffix(p, sourceRepo.Extension) {
return ""
}
if dirPath := sourceRepo.Git.Path; dirPath != "" {
if !strings.HasSuffix(dirPath, "/") {
dirPath += "/"
}
if !strings.HasPrefix(p, dirPath) {
return ""
}
}
if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) {
return ""
}

return p
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
filterPath := func(p string) string {
if !strings.HasSuffix(p, sourceRepo.Extension) {
return ""
}
if dirPath := sourceRepo.Git.Path; dirPath != "" {
if !strings.HasSuffix(dirPath, "/") {
dirPath += "/"
}
if !strings.HasPrefix(p, dirPath) {
return ""
}
}
if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) {
return ""
}
return p
}
filterPath := func(p string) string {
// Has Extension suffix?
if !strings.HasSuffix(p, sourceRepo.Extension) {
return ""
}
// Has Git.Path Prefix?
if dirPath := sourceRepo.Git.Path; dirPath != "" {
if !strings.HasSuffix(dirPath, "/") {
dirPath += "/"
}
if !strings.HasPrefix(p, dirPath) {
return ""
}
}
// Does it match an ignore pattern?
if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) {
return ""
}
return p
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these comments really adding anything?
like, Has Extension suffix? and strings.HasSuffix(p, sourceRepo.Extension) basically communicate the same amount of info

return err
}
req = req.WithContext(ctx)
resp, err := config.HTTPClient.Do(req)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an err if response is not 200?

If so do we want to do the fallback when HEAD reqs are not supported?

If not, can we check if it returns 200 and if not log a warning?

Copy link
Member Author

@michaelkedar michaelkedar Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err is nil if there's a non-2xx response.
done, and moved to a checkHEAD function

timeToUpdate = lastModTime
}
sourceRepo.REST.LastUpdated = &timeToUpdate
sourceRepo.REST.IgnoreLastImportTime = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the update to switch this back immediately? I guess it shouldn't matter if another importer run doesn't run at the same time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is some transient error in the importer and it doesn't complete, I don't know if we want to write back immediately versus retrying.
And yeah, since only one importer may run at a time, I think this is okay.

another-rex
another-rex previously approved these changes Feb 23, 2026
}
}

// importerSampleRate returns the sample rate for the importer (not the individual vulnerability entries).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the difference between importer and individual vulnerability entries here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more detail here and in vulnerabilitySampleRate()

another-rex
another-rex previously approved these changes Feb 23, 2026
Copy link
Contributor

@another-rex another-rex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should mention in the PR/commit description that this does not do git fetches anymore

@michaelkedar
Copy link
Member Author

Probably should mention in the PR/commit description that this does not do git fetches anymore

It does still do a pull

@michaelkedar michaelkedar requested review from a team and removed request for a team February 23, 2026 23:54
cuixq
cuixq previously approved these changes Feb 24, 2026
@michaelkedar michaelkedar dismissed stale reviews from cuixq and another-rex via 743c79e February 24, 2026 04:53
@michaelkedar michaelkedar merged commit 49d11e3 into google:master Feb 24, 2026
19 checks passed
@michaelkedar michaelkedar deleted the 📥❌🐍 branch February 24, 2026 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants