手撸golang GO与微服务 ChatServer之3 压测与诊断

Posted ioly

tags: .与微指令的执行周期对应的是    10天轻松搞定go服务    π 计算位数程序 golang    [Golang]OS系统    _ golang 引入相对路径 goroot   

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了手撸golang GO与微服务 ChatServer之3 压测与诊断相关的知识,希望对你有一定的参考价值。

缘起

最近阅读<<Go微服务实战>> (刘金亮, 2021.1)
本系列笔记拟采用golang练习之

案例需求(聊天服务器)
  • 用户可以连接到服务器。
  • 用户可以设定自己的用户名。
  • 用户可以向服务器发送消息,同时服务器也会向其他用户广播该消息。
目标(Day 3)
  • 在Day 2中, 已基本完成功能性编码, 测试也通过了, 今天尝试压测, 并诊断是否存在内存和routine泄漏
过程
  • 修改单元测试代码, 并发客户端增加到50个, 每个客户端随机1-3秒发送一条消息
  • 添加ILoggingService, 以更方便的采集全局日志
  • 使用pprof包的api, 采集heap和goroutine的诊断日志
  • 使用go tool pprof <诊断日志>, 观察是否存在内存和routine泄漏
  • 发现单元测试总是failed, 总有若干tChatClient, 服务端未打印关闭日志
  • 查看routine诊断日志, 发现存在routine泄漏
  • 在tChatClient中添加更详细的函数调用日志, 发现是closeChan通道写入阻塞导致
  • 修复routine泄漏的bug, 加大并发客户端到500个, 并进行多轮测试, 观察是否存在内存和routine泄漏
单元测试
  • ChatServer_test.go
  • 增大客户端并发数, 并添加pprof诊断
  • 诊断ILoggingService.AllLogs(), 判断所有客户端是否完成连接/断开的闭环
package chat_server

import (
    "fmt"
    cs "learning/gooop/chat_server"
    "math/rand"
    "os"
    "runtime"
    "runtime/pprof"
    "strings"
    "sync"
    "testing"
    "time"
)

func Test_ChatServer(t *testing.T) {
    fnAssertTrue := func(b bool, msg string) {
        if !b {
            t.Fatal(msg)
        }
    }

    // create dump file
    memProfile, err := os.Create("/home/ioly/chat_server_mem.profile")
    if err != nil {
        t.Fatal(err)
    }
    cpuProfile, err := os.Create("/home/ioly/chat_server_cpu.profile")
    if err != nil {
        t.Fatal(err)
    }
    rtnProfile, err := os.Create("/home/ioly/chat_server_routine.profile")
    if err != nil {
        t.Fatal(err)
    }
    err = pprof.StartCPUProfile(cpuProfile)
    if err != nil {
        t.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    port := 3333
    server := cs.NewChatServer()
    err = server.Open(port)
    if err != nil {
        t.Fatal(err)
    }

    clientCount := 500
    wg := &sync.WaitGroup{}

    rnd := rand.New(rand.NewSource(time.Now().UnixNano()))
    address := fmt.Sprintf("localhost:%v", port)
    for i := 0;i < clientCount;i++ {
        err, client := cs.DialChatClient(address)
        if err != nil {
            t.Fatal(err)
        }

        id := fmt.Sprintf("c%02d", i)
        //client.RecvHandler(func(client cs.IChatClient, msg cs.IMsg) {
        //    t.Logf("%v recv: %v\\n", id, msg)
        //})

        wg.Add(1)
        go func() {
            client.SetName(id)
            client.Send(&cs.NameMsg{id })

            n := 0
            duration := rnd.Intn(3) + 1
            for range time.Tick(time.Duration(duration) * time.Second) {
                client.Send(&cs.ChatMsg{id, fmt.Sprintf("msg %02d from %v", n, id) })

                n++
                if n > 10 {
                    break
                }
            }

            client.Close()
            wg.Done()
        }()
    }

    // wait and set status
    done := []bool{false}
    go func() {
        wg.Wait()
        done[0] = true
    }()

    // print passed seconds
    passedSeconds := 0
    for range time.Tick(time.Second) {
        passedSeconds++
        t.Logf("%v seconds passed", passedSeconds)

        if done[0] {
            break
        }
    }

    time.Sleep(5*time.Second)
    server.Close()

    // dump heap file
    runtime.GC()
    err = pprof.Lookup("heap").WriteTo(memProfile, 0)
    if err != nil {
        t.Fatal(err)
    }
    err = pprof.Lookup("goroutine").WriteTo(rtnProfile, 0)
    if err != nil {
        t.Fatal(err)
    }

    // check server logs
    logs := cs.Logging.AllLogs()
    fnHasLog := func(log string) bool {
        for _,it := range logs {
            if strings.Contains(it, log) {
                return true
            }
        }
        return false
    }
    for i := 0;i < clientCount;i++ {
        msg := fmt.Sprintf("tChatClient.postConnClosed, c%02d, serverFlag=false", i)
        fnAssertTrue(fnHasLog(msg), "expecting log: " + msg)

        msg = fmt.Sprintf("tChatServer.handleIncomingConn, clientCount=%v", i + 1)
        fnAssertTrue(fnHasLog(msg), "expecting log: " + msg)

        msg = fmt.Sprintf("tChatClient.postConnClosed, c%02d, serverFlag=true", i)
        fnAssertTrue(fnHasLog(msg), "expecting log: " + msg)

        msg = fmt.Sprintf("tChatServer.handleClientClosed, c%02d", i)
        fnAssertTrue(fnHasLog(msg), "expecting log: " + msg)
    }
}
测试输出

并发500, 仅末尾部分

tChatClient.beginWrite, write error, c496, serverFlag=true
tChatClient.beginWrite, <- closeChan, c496, serverFlag=true
tChatClient.postConnClosed, c496, serverFlag=true
tChatServer.handleClientClosed, c496
tChatServer.handleClientClosed, c496, clientCount=1
tChatClient.closeConn, c498, serverFlag=false
tChatClient.beginWrite, <- closeChan, c498, serverFlag=false
tChatClient.beginRead, read error, c498, serverFlag=false
tChatClient.postConnClosed, c498, serverFlag=false
tChatClient.beginRead, read error, c498, serverFlag=true
tChatClient.closeConn, c498, serverFlag=true
tChatClient.beginWrite, <- closeChan, c498, serverFlag=true
tChatClient.postConnClosed, c498, serverFlag=true
tChatServer.handleClientClosed, c498
tChatServer.handleClientClosed, c498, clientCount=0
    ChatServer_test.go:97: 34 seconds passed
--- PASS: Test_ChatServer (39.50s)
PASS
ok      command-line-arguments  39.535s
routine泄漏诊断(修复前)

有泄漏时的诊断日志(并发50)

$ go tool pprof ~/chat_server_routine.profile 
File: chat_server.test
Type: goroutine
Time: Mar 9, 2021 at 9:12am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 4, 100% of 4 total
Showing top 10 nodes out of 21
      flat  flat%   sum%        cum   cum%
         3 75.00% 75.00%          3 75.00%  runtime.gopark
         1 25.00%   100%          1 25.00%  runtime/pprof.runtime_goroutineProfileWithLabels
         0     0%   100%          1 25.00%  command-line-arguments.Test_ChatServer
         0     0%   100%          1 25.00%  learning/gooop/chat_server.(*tChatClient).beginWrite
         0     0%   100%          1 25.00%  learning/gooop/chat_server.(*tChatClient).closeConn
         0     0%   100%          1 25.00%  main.main
         0     0%   100%          1 25.00%  runtime.chanrecv
         0     0%   100%          1 25.00%  runtime.chanrecv1
         0     0%   100%          1 25.00%  runtime.chansend
         0     0%   100%          1 25.00%  runtime.chansend1
(pprof) 

可以看到closeConn函数阻塞

func (me *tChatClient) closeConn() {
    if !atomic.CompareAndSwapInt32(&me.closeFlag, 0, 1) {
        return
    }

    Logging.Logf("tChatClient.closeConn, %v, serverFlag=%v", me.name, me.serverFlag)
    me.closeChan <- true
}

从代码看, closeConn函数阻塞只有一种可能, 就是closeChan阻塞.
这个原因有点诡异, 因为已经加了atomic.CompareAndSwapInt32保护,
仍然导致阻塞, 唯一原因是并发环境下, closeChan尚未开始读取, 是无法写入的.

修复routine泄漏

就是给导致写阻塞的closeChan加上缓冲, 大小等于closeConn函数的可能调用入口数量

func openChatClient(conn net.Conn, serverFlag bool) IChatClient {
    it := &tChatClient{
        conn: conn,
        openFlag: 0,
        closeFlag: 0,
        serverFlag: serverFlag,

        closeChan: make(chan bool, 3),    // 给closeChan添加缓冲区!!
        sendChan: make(chan IMsg, gMaxPendingSend),

        name: "anonymous",
        sendLogs: []IMsg{},
        dropLogs: []IMsg{},
        recvLogs: []IMsg{},
    }
    it.open()
    return it
}
routine泄漏诊断(修复后)

修复后, 加大并发到500, 未发现明显routine泄漏(不存在业务代码导致挂起的routine)

$ go tool pprof ~/chat_server_routine.profile 
File: chat_server.test
Type: goroutine
Time: Mar 9, 2021 at 9:28am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 3, 100% of 3 total
Showing top 10 nodes out of 17
      flat  flat%   sum%        cum   cum%
         2 66.67% 66.67%          2 66.67%  runtime.gopark
         1 33.33%   100%          1 33.33%  runtime/pprof.runtime_goroutineProfileWithLabels
         0     0%   100%          1 33.33%  command-line-arguments.Test_ChatServer
         0     0%   100%          1 33.33%  main.main
         0     0%   100%          1 33.33%  runtime.chanrecv
         0     0%   100%          1 33.33%  runtime.chanrecv1
         0     0%   100%          1 33.33%  runtime.main
         0     0%   100%          1 33.33%  runtime/pprof.(*Profile).WriteTo
         0     0%   100%          1 33.33%  runtime/pprof.profileWriter
         0     0%   100%          1 33.33%  runtime/pprof.writeGoroutine
(pprof) 

(未完待续)

以上是关于手撸golang GO与微服务 ChatServer之3 压测与诊断的主要内容,如果未能解决你的问题,请参考以下文章