i have an application written in Go is doing message processing, need to pick up message from network (UDP) at a rate of 20K/second (potentially more), and each message can be up to UDP packet's maximum length (64KB-headersize), the program needs to decode this incoming packet and encode into another format and send to another network;

runtime.mallocgc
(pprof) top20 -cum (sync|runtime)
245.99s of 458.81s total (53.61%)
Dropped 487 nodes (cum <= 22.94s)
Showing top 20 nodes out of 22 (cum >= 30.46s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    440.88s 96.09%  runtime.goexit
     1.91s  0.42%  1.75%    244.87s 53.37%  sync.(*Pool).Get
    64.42s 14.04% 15.79%    221.57s 48.29%  sync.(*Pool).getSlow
    94.29s 20.55% 36.56%    125.53s 27.36%  sync.(*Mutex).Lock
     1.62s  0.35% 36.91%     72.85s 15.88%  runtime.systemstack
    22.43s  4.89% 41.80%     60.81s 13.25%  runtime.mallocgc
    22.88s  4.99% 46.79%     51.75s 11.28%  runtime.scanobject
     1.78s  0.39% 47.17%     49.15s 10.71%  runtime.newobject
    26.72s  5.82% 53.00%     39.09s  8.52%  sync.(*Mutex).Unlock
     0.76s  0.17% 53.16%     33.74s  7.35%  runtime.gcDrain
         0     0% 53.16%     33.70s  7.35%  runtime.gcBgMarkWorker
         0     0% 53.16%     33.69s  7.34%  runtime.gcBgMarkWorker.func2

the use of pool is the standard

// create this one globally at program init
var rfpool = &sync.Pool{New: func() interface{} { return new(aPrivateStruct); }}

// get
rf := rfpool.Get().(*aPrivateStruct)

// put after done processing this message
rfpool.Put(rf)

not sure am I doing wrong? or wonder what are the other ways could tune the GC to use less CPU%? the go version is 1.8

the list shows a lot of lock contention happened in the pool.getSlow src to pool.go at golang.org

(pprof) list sync.*.getSlow
Total: 7.65mins
ROUTINE ======================== sync.(*Pool).getSlow in /opt/go1.8/src/sync/pool.go
  1.07mins   3.69mins (flat, cum) 48.29% of Total
         .          .    144:       x = p.New()
         .          .    145:   }
         .          .    146:   return x
         .          .    147:}
         .          .    148:
      80ms       80ms    149:func (p *Pool) getSlow() (x interface{}) {
         .          .    150:   // See the comment in pin regarding ordering of the loads.
      30ms       30ms    151:   size := atomic.LoadUintptr(&p.localSize) // load-acquire
     180ms      180ms    152:   local := p.local                         // load-consume
         .          .    153:   // Try to steal one element from other procs.
      30ms      130ms    154:   pid := runtime_procPin()
      20ms       20ms    155:   runtime_procUnpin()
     730ms      730ms    156:   for i := 0; i < int(size); i++ {
    51.55s     51.55s    157:       l := indexLocal(local, (pid+i+1)%int(size))
     580ms   2.01mins    158:       l.Lock()
    10.65s     10.65s    159:       last := len(l.shared) - 1
      40ms       40ms    160:       if last >= 0 {
         .          .    161:           x = l.shared[last]
         .          .    162:           l.shared = l.shared[:last]
         .       10ms    163:           l.Unlock()
         .          .    164:           break
         .          .    165:       }
     490ms     37.59s    166:       l.Unlock()
         .          .    167:   }
      40ms       40ms    168:   return x
         .          .    169:}
         .          .    170:
         .          .    171:// pin pins the current goroutine to P, disables preemption and returns poolLocal pool for the P.
         .          .    172:// Caller must call runtime_procUnpin() when done with the pool.
         .          .    173:func (p *Pool) pin() *poolLocal {