Fix a task fingerprinting bug#2740
Conversation
|
This looks way better, than mine simple solution + test coverage added. |
c5d3920 to
d19a73f
Compare
|
@trulede I have refactored this PR to only have the task fingerprinting fix. |
There was a problem hiding this comment.
If you take a look at this PR there is another way to get an identity.
https://github.com/go-task/task/pull/2287/changes#diff-05ed798798e201a0fe6355799c9adb03b42b72710770a86ddecc4def33773c51R137
Its a bit more involved, but what I'm wondering is if the json.Marshal + xxh3 hash is have more (or less) impact on performance in comparison? I think here, mostly avoiding the JSON encoding, and directly hashing the fields necessary.
The other thing is the globbing of sources and generates. This is also an expensive operation. Can we do this fingerprinting after the task is compiled, and use the already globbed task.sources/generates, OR use this globbing result on the task to avoid a subsequent (and duplicate) globbing.
Note: there is another PR #2671 which introduces dynamic sources/generates. That probably also impacts on checksum operation (i.e. it would need to happen after the task is compiled.
Checks:
- Can/are checksums done after the Task is compiled (support dynamic sources/generates).
- Can the source/generates object on the task be used to avoid recalculation.
- Performance of identity generation.
|
Regarding the performance I went ahead and tested out a version which uses just hashstructure: func taskFingerprintKeyHashStructure(t *ast.Task) string {
name := taskIdentityName(t)
identity := fingerprintIdentity{
Task: name,
Dir: t.Dir,
Sources: globPatterns(t.Sources),
Generates: globPatterns(t.Generates),
}
hash, err := hashstructure.Hash(identity, hashstructure.FormatV2, nil)
if err != nil {
return normalizeFilename(name)
}
return normalizeFilename(fmt.Sprintf("%s-%d", name, hash))
}Its pretty much the exact same function with the hashing swapping out from json encode + xxh3 to use hashstructure. In addition to just hashstructure I also tried out manually writing to an fnv hasher with the individual fields. The thinking here was that maybe avoiding reflection could be a big performance win. func taskFingerprintKeyFnv(t *ast.Task) string {
name := taskIdentityName(t)
fnvHash := fnv.New64a()
_, err := fnvHash.Write([]byte(name))
if err != nil {
return normalizeFilename(name)
}
_, err = fnvHash.Write([]byte(t.Dir))
if err != nil {
return normalizeFilename(name)
}
if err := hashGlobs(t.Sources, fnvHash); err != nil {
return normalizeFilename(name)
}
if err := hashGlobs(t.Generates, fnvHash); err != nil {
return normalizeFilename(name)
}
return normalizeFilename(fmt.Sprintf("%s-%x", name, fnvHash.Sum64()))
}
func hashGlobs(globs []*ast.Glob, h io.Writer) error {
if len(globs) == 0 {
return nil
}
for _, glob := range globs {
if glob == nil {
continue
}
if glob.Negate {
if _, err := h.Write([]byte("!")); err != nil {
return err
}
}
if _, err := h.Write([]byte(glob.Glob)); err != nil {
return err
}
}
return nil
}Then I have this benchmark test: func BenchmarkTaskFingerprinting(b *testing.B) {
task := &ast.Task{
Task: "namespace:copy/single",
FullName: "namespace:copy/single",
Dir: "repro",
Sources: []*ast.Glob{
{Glob: "**/*.in"},
{Glob: "vendor/**", Negate: true},
{Glob: "**/*.tmpl"},
},
Generates: []*ast.Glob{
{Glob: "**/*.out"},
{Glob: "reports/**/*.json"},
},
}
prefix := normalizeFilename(task.Task) + "-"
b.ReportAllocs()
b.ResetTimer()
type benchmarkCase struct {
name string
fn func(task *ast.Task) string
}
cases := []benchmarkCase{
{name: "json+xxh3", fn: taskFingerprintKey},
{name: "hashstructure", fn: taskFingerprintKeyHashStructure},
{name: "fnv", fn: taskFingerprintKeyFnv},
}
for _, bc := range cases {
b.Run(bc.name, func(b *testing.B) {
for i := 0; i < b.N; i++ {
benchmarkTaskFingerprintKeySink = bc.fn(task)
if !strings.HasPrefix(benchmarkTaskFingerprintKeySink, prefix) {
b.FailNow()
}
}
})
}
}The results show hashstructure taking ~2x the CPU time as the json+xxh3 thats committed in this PR. The manual fnv hashing did improve on the json+xxh3 for speed (165 nanoseconds per operation reduction). Obviously with more sources/generates the time to create the fingerprint will increase but I wouldn't expect it to exceed low single digit milliseconds. |
Task Fingerprinting Bug
The first commit in this PR fixes a bug where two task invocations (such as in a for loop) inadvertently where writing the checksum or timestamp files for the task to the same location even though the tasks were executed with different arguments causing them to have different sources.
Reproduction
echo 1 >1.in && echo 2 >2.intask copycopy:singleonce for each *.in fileecho 2.2 > 2.incopy:singletask twice again with neither showing as up to date.Because only 2.in was changed, I was expecting the task to show one copy:single task as up to date and then re-copy 2.in to 2.out.
Fix
Instead of writing out the checksum/timestamps to a single file within the respective directory, the task is first fingerprinted. So instead of the
copy:singletask here recording the checksum/timestamp in a singlecopy-singlefile, it will take a hash of the normalized task name, working directory of the task and the declared sources/generates and store the checksum incopy-single-<hash>. This allows each distinct invocation of the sub-task with different arguments to independently manage whether it is up to date.Previously this PR also contained another fix but that has since been rolled into #2743