Beyond JSON: Achieving Sub-millisecond Latency with Go, NATS, and Protobuf
In the era of instant messaging and real-time financial transactions, “fast” is no longer a feature—it’s a baseline requirement. For years, the industry default has been RESTful APIs over HTTP communicating via JSON. While excellent for compatibility and ease of debugging, this stack creates significant friction when milliseconds matter.
When building systems for secure telecommunications or high-frequency fintech rails, I’ve found that the overhead of text-based protocol parsing and connection handshakes becomes the primary bottleneck. To break the sub-millisecond barrier, we need to look beyond JSON.
The Mathematical Perspective on Efficiency
Let’s start with the numbers. Consider a simple user profile update flowing through your system:
JSON Payload (183 bytes):
```json
{
  "user_id": "550e8400-e29b-41d4-a716-446655440000",
  "username": "john_doe",
  "email": "john@example.com",
  "age": 32,
  "created_at": 1704398400,
  "is_active": true
}
```
At 1 million messages per second, that’s 183 MB/s of bandwidth—before TCP/IP overhead. The real cost, however, isn’t the bytes on the wire. It’s the CPU cycles spent parsing text, allocating strings, and handling schema validation at runtime.
JSON parsing in Go typically takes 2-5 microseconds per message. Multiply that by a million, and you’re burning 2-5 full CPU cores just on deserialization. This is the hidden tax of human-readable protocols.
Protocol Buffers: The Binary Advantage
Protocol Buffers (Protobuf) takes a fundamentally different approach. Instead of parsing text, it works with binary data that maps directly to memory structures. The schema is defined once, compiled ahead of time, and the runtime cost is minimal.
Here’s the same user profile in Protobuf schema format:
```proto
syntax = "proto3";

package user;

option go_package = "github.com/yourorg/yourapp/proto/user";

message UserProfile {
  string user_id = 1;
  string username = 2;
  string email = 3;
  int32 age = 4;
  int64 created_at = 5;
  bool is_active = 6;
}
```
After compilation with protoc --go_out=. user.proto, you get strongly-typed Go structs with optimized serialization methods. The binary representation is approximately 85 bytes—less than half the JSON size—and serialization takes 100-300 nanoseconds instead of microseconds.
That’s a 10-20x performance improvement on serialization alone.
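To see where the smaller payload comes from, you can reproduce the proto3 wire format by hand with the stdlib: every field is a varint tag (field number shifted left three bits, ORed with the wire type), followed by either a varint value or a length-prefixed byte string. This is a simplified illustration, not the official encoder:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendVarintField encodes a varint field (wire type 0): tag, then value.
func appendVarintField(b []byte, fieldNum int, v uint64) []byte {
	b = binary.AppendUvarint(b, uint64(fieldNum)<<3|0)
	return binary.AppendUvarint(b, v)
}

// appendStringField encodes a length-delimited field (wire type 2):
// tag, then length, then the raw bytes.
func appendStringField(b []byte, fieldNum int, s string) []byte {
	b = binary.AppendUvarint(b, uint64(fieldNum)<<3|2)
	b = binary.AppendUvarint(b, uint64(len(s)))
	return append(b, s...)
}

func main() {
	var b []byte
	b = appendStringField(b, 1, "550e8400-e29b-41d4-a716-446655440000") // user_id
	b = appendStringField(b, 2, "john_doe")                             // username
	b = appendStringField(b, 3, "john@example.com")                     // email
	b = appendVarintField(b, 4, 32)                                     // age
	b = appendVarintField(b, 5, 1704398400)                             // created_at
	b = appendVarintField(b, 6, 1)                                      // is_active
	// Prints the encoded size: a few dozen bytes, well under half the JSON.
	fmt.Printf("wire size: %d bytes\n", len(b))
}
```

Strings dominate the payload; the integers and the boolean collapse to a handful of varint bytes, which is where most of the savings over JSON's quoted keys and decimal digits come from.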
Why NATS Completes the Picture
HTTP is a request-response protocol built on TCP, which means every interaction involves:
- TCP handshake (3-way, ~1-2ms on typical networks)
- TLS negotiation (another 1-2ms for HTTPS)
- HTTP header parsing
- Connection pooling complexity
NATS, by contrast, is a lightweight pub-sub messaging system that maintains persistent connections. Once connected, message delivery is pure payload routing with microsecond-level latency. It’s purpose-built for exactly this use case: high-throughput, low-latency message passing between services.
Key advantages:
- Persistent connections: No handshake tax per message
- Simple text protocol: Minimal parsing overhead
- At-most-once delivery by default: No ack overhead for fire-and-forget scenarios
- Built-in patterns: Pub-sub, request-reply, and queue groups
- JetStream: Optional persistence layer when needed
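The "simple text protocol" point is worth seeing concretely. A NATS publish is a single PUB frame: subject and payload length in plain text, followed by the raw payload. The helper below builds such a frame by hand purely for illustration; the client library constructs and buffers these frames for you:

```go
package main

import "fmt"

// pubFrame builds a NATS PUB frame: "PUB <subject> <#bytes>\r\n<payload>\r\n".
// For illustration only — nats.go does this internally.
func pubFrame(subject string, payload []byte) []byte {
	head := fmt.Sprintf("PUB %s %d\r\n", subject, len(payload))
	frame := append([]byte(head), payload...)
	return append(frame, '\r', '\n')
}

func main() {
	frame := pubFrame("user.profile.update", []byte("hello"))
	fmt.Printf("%q\n", frame) // "PUB user.profile.update 5\r\nhello\r\n"
}
```

Parsing that header is a single line scan, and the payload is passed through untouched, which is why per-message server overhead stays in the microsecond range.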
Building a Sub-millisecond Service
Let’s build a practical example: a user profile update service that consistently processes messages in under 1ms.
Step 1: Generate Protobuf Code
Save the schema above as user.proto and generate Go code:
```sh
protoc --go_out=. --go_opt=paths=source_relative user.proto
```
Step 2: Publisher Service
```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"google.golang.org/protobuf/proto"

	pb "github.com/yourorg/yourapp/proto/user"
)

type ProfilePublisher struct {
	nc *nats.Conn
}

func NewPublisher(url string) (*ProfilePublisher, error) {
	nc, err := nats.Connect(url,
		nats.Timeout(10*time.Second),
		nats.ReconnectWait(1*time.Second),
	)
	if err != nil {
		return nil, err
	}
	return &ProfilePublisher{nc: nc}, nil
}

func (p *ProfilePublisher) PublishUpdate(profile *pb.UserProfile) error {
	data, err := proto.Marshal(profile)
	if err != nil {
		return err
	}
	return p.nc.Publish("user.profile.update", data)
}

func main() {
	pub, err := NewPublisher(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer pub.nc.Close()

	profile := &pb.UserProfile{
		UserId:    "550e8400-e29b-41d4-a716-446655440000",
		Username:  "john_doe",
		Email:     "john@example.com",
		Age:       32,
		CreatedAt: time.Now().Unix(),
		IsActive:  true,
	}

	start := time.Now()
	if err := pub.PublishUpdate(profile); err != nil {
		log.Fatal(err)
	}
	log.Printf("Published in %v", time.Since(start))
}
```
Step 3: Subscriber Service
```go
package main

import (
	"log"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
	"google.golang.org/protobuf/proto"

	pb "github.com/yourorg/yourapp/proto/user"
)

type ProfileSubscriber struct {
	nc       *nats.Conn
	msgCount uint64
}

func NewSubscriber(url string) (*ProfileSubscriber, error) {
	nc, err := nats.Connect(url)
	if err != nil {
		return nil, err
	}
	return &ProfileSubscriber{nc: nc}, nil
}

func (s *ProfileSubscriber) handleUpdate(m *nats.Msg) {
	start := time.Now()

	profile := &pb.UserProfile{}
	if err := proto.Unmarshal(m.Data, profile); err != nil {
		log.Printf("Unmarshal error: %v", err)
		return
	}

	// Simulate processing
	s.processProfile(profile)

	latency := time.Since(start)
	atomic.AddUint64(&s.msgCount, 1)
	if latency > 1*time.Millisecond {
		log.Printf("⚠️ Slow processing: %v for user %s", latency, profile.Username)
	}
}

func (s *ProfileSubscriber) processProfile(profile *pb.UserProfile) {
	// Your business logic here
	// Example: update cache, trigger workflows, etc.
}

func (s *ProfileSubscriber) Start() error {
	_, err := s.nc.Subscribe("user.profile.update", s.handleUpdate)
	return err
}

func main() {
	sub, err := NewSubscriber(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer sub.nc.Close()

	if err := sub.Start(); err != nil {
		log.Fatal(err)
	}

	// Report stats every second
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		// Swap reads and resets atomically, so messages arriving between
		// a separate Load and Store are never lost from the count.
		count := atomic.SwapUint64(&sub.msgCount, 0)
		log.Printf("Processed %d messages", count)
	}
}
```
Real-World Performance Numbers
In production environments, this stack consistently delivers:
| Metric | JSON/HTTP | Protobuf/NATS | Improvement |
|---|---|---|---|
| Serialization | 2-5 μs | 100-300 ns | 10-20x |
| Deserialization | 3-7 μs | 150-400 ns | 10-20x |
| Message Size | 183 bytes | 85 bytes | 2.2x smaller |
| End-to-end Latency | 10-50 ms | 200-800 μs | 20-60x |
These numbers are from a test cluster with NATS on the same network segment. Your mileage will vary based on network topology, but the relative improvements hold.
Optimizations for the Last Mile
Getting from “fast” to “ridiculously fast” requires attention to detail:
1. Connection Reuse and Pooling
Never create connections per message. NATS connections are thread-safe and designed for reuse:
```go
// Bad: Creates connection overhead per message
func publishMessage(data []byte) error {
	nc, _ := nats.Connect(nats.DefaultURL)
	defer nc.Close()
	return nc.Publish("subject", data)
}

// Good: Reuse connection
type Publisher struct {
	nc *nats.Conn
}

func (p *Publisher) Publish(subject string, data []byte) error {
	return p.nc.Publish(subject, data)
}
```
2. Buffer Pooling
Reduce GC pressure by reusing byte buffers:
```go
import "sync"

var bufferPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 1024)
		return &b
	},
}

func (p *ProfilePublisher) publishPooled(profile *pb.UserProfile) error {
	bufPtr := bufferPool.Get().(*[]byte)
	defer func() {
		bufferPool.Put(bufPtr)
	}()

	// MarshalAppend serializes into the pooled buffer instead of
	// allocating a fresh slice on every call.
	data, err := proto.MarshalOptions{}.MarshalAppend((*bufPtr)[:0], profile)
	if err != nil {
		return err
	}
	*bufPtr = data[:0] // keep any capacity growth for the next caller

	// nc.Publish copies the payload into its write buffer, so the
	// pooled slice is safe to reuse once this returns.
	return p.nc.Publish("user.profile.update", data)
}
```
3. Batch Publishing
For high-throughput scenarios, batch messages to amortize NATS protocol overhead:
```go
func (p *Publisher) PublishBatch(profiles []*pb.UserProfile) error {
	for _, profile := range profiles {
		data, err := proto.Marshal(profile)
		if err != nil {
			return err
		}
		if err := p.nc.Publish("user.profile.update", data); err != nil {
			return err
		}
	}
	return p.nc.Flush() // Ensure all buffered messages are sent
}
```
4. JetStream for Guaranteed Delivery
When you need persistence without sacrificing much performance:
```go
js, err := nc.JetStream()
if err != nil {
	return err
}

// Create the stream (once, at startup)
if _, err := js.AddStream(&nats.StreamConfig{
	Name:     "USERS",
	Subjects: []string{"user.>"},
}); err != nil {
	return err
}

// Publish and wait for the ack
_, err = js.Publish("user.profile.update", data)
```
JetStream adds ~100-200μs of latency for the ack roundtrip, but you get durability and replay capabilities.
Monitoring and Observability
You can’t optimize what you don’t measure. Instrument your message pipeline:
```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	publishLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "nats_publish_duration_microseconds",
		Help:    "Time taken to publish message",
		Buckets: []float64{100, 200, 500, 1000, 2000, 5000},
	})
	processLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "message_process_duration_microseconds",
		Help:    "Time taken to process message",
		Buckets: []float64{100, 200, 500, 1000, 2000, 5000},
	})
)

func (p *ProfilePublisher) PublishWithMetrics(profile *pb.UserProfile) error {
	start := time.Now()
	defer func() {
		publishLatency.Observe(float64(time.Since(start).Microseconds()))
	}()
	return p.PublishUpdate(profile)
}
```
When This Stack Makes Sense
This architecture is ideal for:
- Trading systems where every microsecond impacts profitability
- Gaming backends requiring real-time state synchronization
- IoT platforms processing millions of device telemetry messages
- Event sourcing architectures with high event volumes
- Internal microservices where you control both ends
It’s probably overkill for:
- Public REST APIs where clients expect JSON
- Admin dashboards with human-in-the-loop workflows
- Prototyping where developer velocity matters more than performance
- Systems where debugging ease trumps efficiency
The Migration Path
You don’t need to rewrite everything overnight. Start with your most latency-sensitive paths:
- Identify bottlenecks: Use profiling to find services spending time on serialization
- Start internal: Convert service-to-service communication first
- Keep JSON at edges: Public APIs can still use JSON while internal services use Protobuf
- Gradual rollout: Run both protocols in parallel during migration
- Measure everything: Validate that the complexity is worth the gains
Conclusion
Breaking the millisecond barrier isn’t about micro-optimizations or assembly code. It’s about choosing the right architectural patterns for your performance requirements. JSON and HTTP are excellent defaults, but when you need to scale to millions of messages per second with sub-millisecond latency, binary protocols and persistent messaging systems become essential.
Go’s efficiency, NATS’s speed, and Protobuf’s compact binary format combine to create a stack capable of consistently delivering sub-millisecond message processing. The key is understanding your requirements and being willing to embrace some additional complexity in exchange for dramatic performance improvements.
For systems where latency is a competitive advantage—or a hard requirement—this combination is hard to beat. The question isn’t whether you can achieve sub-millisecond latency, but whether your architecture is holding you back from reaching it.
Ready to go beyond JSON? Your microseconds are waiting.