Go 中如何有效的比较字符串

当优化软件时字符串比较可能和你想的有些不同。特别是包括拆分跨 goroutines 的循环，找到一个更快的哈希算法，或者一些听起来更科学的方式。当我们做出这样的修改时，会有一种成就感。然而, 字符串比较通常是信息传递中（in a pipeline）的最大瓶颈。下面的代码段经常被使用，但它是最糟糕的解决方案 (参见下面的 benchmarks)，并导致了实际问题。

strings.ToLower(name) == strings.ToLower(othername)

ToLower==ToLower

ToLower

strings.ToLower

// Pseudo code
func CompareInsensitive(a, b string) bool {
    // loop over string a and convert every rune to lowercase
    for i := 0; i < len(a); i++ {  a[i] = unicode.ToLower(a[i])  }
    // loop over string b and convert every rune to lowercase
    for i := 0; i < len(b); i++ {  b[i] = unicode.ToLower(b[i])  }
    // loop over both a and b and return false if there is a mismatch
    for i := 0; i < len(a); i++ {
        if a[i] != b[i] {
            return false
        }
    }
    return true
}

nlen(a) + len(b) + len(a)

CompareInsensitive("fizzbuzz", "buzzfizz")

unicode.ToLower(a[0])unicode.ToLower(b[0])

CompareInsensitive

// Pseudo code
func CompareInsensitive(a, b string) bool {
    // a quick optimization. If the two strings have a different
    // length then they certainly are not the same
    if len(a) != len(b) {
        return false
    }

    for i := 0; i < len(a); i++ {
        // if the characters already match then we don't need to
        // alter their case. We can continue to the next rune
        if a[i] == b[i] {
            continue
        }
        if unicode.ToLower(a[i]) != unicode.ToLower(b[i]) {
            // the lowercase characters do not match so these
            // are considered a mismatch, break and return false
            return false
        }
    }
    // The string length has been traversed without a mismatch
    // therefore the two match
    return true
}

新函数效率更高。上限是一个字符串的长度而不是两个字符串的长度和。怎么看我们上面的比较？循环比较最多只有 8 次。甚至，如果第一个字符不同，那就只循环一次。我们优化使得比较操作减少了大约 20 倍！

stringsstrings.EqualFold

性能测试

// When both strings are equal
BenchmarkEqualFoldBothEqual-8                   20000000               124 ns/op
BenchmarkToLowerBothEqual-8                     10000000               339 ns/op

// When both strings are equal until the last rune
BenchmarkEqualFoldLastRuneNotEqual-8            20000000               129 ns/op
BenchmarkToLowerLastRuneNotEqual-8              10000000               346 ns/op

// When both strings are distinct
BenchmarkEqualFoldFirstRuneNotEqual-8           300000000             11.2 ns/op
BenchmarkToLowerFirstRuneNotEqual-8             10000000               333 ns/op

// When both strings have a different case at rune 0
BenchmarkEqualFoldFirstRuneDifferentCase-8      20000000               125 ns/op
BenchmarkToLowerFirstRuneDifferentCase-8        10000000               433 ns/op

// When both strings have a different case in the middle
BenchmarkEqualFoldMiddleRuneDifferentCase-8     20000000               123 ns/op
BenchmarkToLowerMiddleRuneDifferentCase-8       10000000               428 ns/op

EqualFold

这很重要么？

你可能认为 400 纳秒并不重要。大多数情况下你可能是对的。不管怎样，一些微小的优化处理像其他的处理一样简单。在这个例子中，要比原来的处理方式更简单。合格的工程师在日常工作中就会使用这些微小的优化处理方式。他们不会等到变成问题的时候才去优化软件，他们从开始的时候就写出优化的软件。就算是最优秀的工程师也不可能从 0 开始写出最优化的软件。不可能凭空想象出每个极端的案例然后优化它。并且，在我们提供给用户软件的时候，我们也无法预知用户的行为。不管怎样，在日常工作中加入这些简单的处理方式有益于延长软件的生命周期，预防将来可能不必要的瓶颈。就算那个瓶颈没什么影响，你也不会浪费你的付出。

本文由原创编译，荣誉推出