Engineering & Infrastructure

• Oct 22, 2025 • 6 MINS READ

Building Real-Time Travel Insurance at Scale: Lessons from 700K+ Bookings

Shubham Chaturvedi, Akshay Patil

The Real Problem: More Than Just "Digitizing Insurance"

The most challenging problems aren't technical; they're about bridging the gap between what users need and what existing systems can deliver. When we started building our travel insurance platform, the problem seemed straightforward: make travel insurance instant and seamless during booking flows.

However, as we dug deeper, we uncovered the real challenges keeping India's travel insurance stuck in the analog age.

The Dependency Hell

The core issue wasn't just the slow manual processes, but the reliability nightmare of depending on multiple external systems:

Insurance Provider APIs: Downstream dependencies that could sometimes become unresponsive or return unexpected results
Travel Partner Systems: High-volume booking platforms that needed instant responses and couldn't tolerate delays in external APIs
Financial Settlement: Real money transactions that couldn't be lost, duplicated, or processed incorrectly
Regulatory Compliance: Every policy needed proper documentation, audit trails, and regulatory reporting

The fundamental question we faced: how do you promise real-time insurance when your downstream dependencies can fail at any time?

The Cascade Failure Problem

Before building our platform, we witnessed what happened when travel booking systems tried to integrate directly with insurance APIs:

Scenario 1: User books ticket → Travel platform calls insurance API → Insurance API times out → Entire booking flow fails → User abandons purchase → Lost revenue for everyone

Scenario 2 (worse): Booking completes → Insurance API fails silently → User believes they are insured, but no policy exists → When they actually need coverage, there is no record of their insurance

Traditional solutions tried to solve this with:

Synchronous retries: Making users wait 30+ seconds for multiple API attempts
Manual fallback: Customer service teams manually processing failed policies
Best-effort integration: An approach that "usually works", which isn't good enough for financial products

We found that none of these approaches could deliver the real-time, guaranteed reliability that modern travel platforms needed.

Why We Chose Go: Concurrency-First Architecture

After working with various programming languages, we needed one that could handle our specific challenges. Go's concurrency model was a strong fit.

The Concurrency Challenge

We needed to call multiple insurance providers simultaneously, handle thousands of concurrent bookings, and process async operations without blocking the main request flow. Go's goroutines made this simple:


// Instead of sequential calls (6000-8000ms total)
Call Insurer A → Wait 3000ms
Call Insurer B → Wait 3000ms
Call Insurer C → Wait 2000ms

// Concurrent calls with Go (1000-1200ms total)
Launch 3 goroutines simultaneously
Return first successful response

Memory Efficiency at Scale

With 100K+ monthly bookings, we needed predictable memory usage. Go's garbage collector and small goroutine overhead (2KB initial stack) meant we could handle thousands of concurrent requests without memory bloat.

Deployment Simplicity

Go's single-binary deployment was refreshing. With no dependency conflicts and no runtime version issues, we could just copy a binary and run it.

The Tech Stack: Battle-Tested Components for Unreliable Dependencies

Every component in our stack was chosen to solve specific reliability and scalability challenges:

AWS API Gateway + Authorization + ALB: The Shield Wall

Why This Combination?

API Gateway: Built-in DDoS protection, request validation, and rate limiting
Authorization Layer: JWT-based token validation, partner authentication, and request signing verification
ALB: Health checks and automatic failover between EKS instances via target groups
Target Groups: Intelligent routing and health monitoring of backend instances
Private EKS: No direct internet access = reduced attack surface

This setup gives us multiple layers of protection and automatic failover. When an EKS instance fails its health checks, the ALB stops routing to it and sends traffic to the remaining healthy instances.

Authorization Flow:

Partner makes request to API Gateway with JWT token
Custom authorizer validates token signature and claims
Authorizer checks partner permissions and rate limits
Valid requests proceed to ALB → Target Groups → EKS
Invalid requests rejected at authorization layer (no backend load)
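
The signature-and-claims step of this flow can be sketched with the Go standard library alone. This is a minimal illustration and not the authorizer described above: it assumes HS256-signed tokens, and the `Claims` fields, `issueToken`, and `verifyToken` are hypothetical names. Partner permission checks, rate limits, and request-signing verification are left out.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// Claims is a hypothetical claim set; the real token layout is an assumption.
type Claims struct {
	Partner string `json:"partner"`
	Exp     int64  `json:"exp"` // expiry, Unix seconds
}

var b64 = base64.RawURLEncoding // JWTs use unpadded URL-safe base64

// sign returns the HS256 signature of "header.payload".
func sign(signingInput string, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return b64.EncodeToString(mac.Sum(nil))
}

// issueToken exists only so the example can be exercised end to end.
func issueToken(c Claims, secret []byte) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	header := b64.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	signingInput := header + "." + b64.EncodeToString(payload)
	return signingInput + "." + sign(signingInput, secret), nil
}

// verifyToken checks structure, algorithm, signature, and expiry, in that order.
func verifyToken(token string, secret []byte, nowUnix int64) (*Claims, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("malformed token")
	}
	headerJSON, err := b64.DecodeString(parts[0])
	if err != nil {
		return nil, errors.New("malformed header")
	}
	var header struct {
		Alg string `json:"alg"`
	}
	if err := json.Unmarshal(headerJSON, &header); err != nil || header.Alg != "HS256" {
		return nil, errors.New("unsupported algorithm")
	}
	// Constant-time comparison avoids leaking signature bytes through timing.
	expected := sign(parts[0]+"."+parts[1], secret)
	if !hmac.Equal([]byte(expected), []byte(parts[2])) {
		return nil, errors.New("invalid signature")
	}
	payload, err := b64.DecodeString(parts[1])
	if err != nil {
		return nil, errors.New("malformed payload")
	}
	var c Claims
	if err := json.Unmarshal(payload, &c); err != nil {
		return nil, errors.New("malformed claims")
	}
	if nowUnix >= c.Exp {
		return nil, errors.New("token expired")
	}
	return &c, nil
}

func main() {
	secret := []byte("demo-secret") // placeholder; never hard-code real keys
	token, err := issueToken(Claims{Partner: "partner-a", Exp: time.Now().Add(time.Hour).Unix()}, secret)
	if err != nil {
		panic(err)
	}
	claims, err := verifyToken(token, secret, time.Now().Unix())
	fmt.Println(claims, err)
}
```

Rejecting the token before any backend call is what keeps invalid traffic off the ALB and EKS layers. In production a maintained JWT library would normally replace this hand-rolled check.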

PostgreSQL as Financial Ledger: ACID Guarantees

For insurance, data consistency isn't optional. PostgreSQL's ACID properties ensure that:

Money movements are always properly recorded
Policy states are never inconsistent
Audit trails are complete and reliable

While PostgreSQL handles our critical financial data, DynamoDB manages:

API request/response logs (millions of records)
Partner session data
Real-time integration status

The key insight: Use the right database for each use case rather than forcing everything into one solution.

The Game Changer For Us: Message Broker for Async Processing

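This is where we solved the "unreliable dependencies" problem. Instead of making users wait for slow downstream APIs, we implemented an async pattern: confirm the request to the user immediately, queue the policy request in the Message Broker, and let background workers call the insurer and retry on failure.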
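
The async flow can be sketched as follows. This is an illustration under stated assumptions, not the production code: a buffered Go channel stands in for the Message Broker, and `PolicyJob`, `acceptBooking`, `processQueue`, and `errInsurerDown` are hypothetical names. A real broker would also persist messages and redeliver them with a delay, which this sketch does not do.

```go
package main

import (
	"errors"
	"fmt"
)

// PolicyJob is a hypothetical message payload for one policy request.
type PolicyJob struct {
	BookingID string
}

// errInsurerDown simulates a transient downstream failure.
var errInsurerDown = errors.New("insurer unavailable")

// acceptBooking enqueues the job and answers immediately, so the booking
// flow never waits on an insurer API. It blocks only if the queue is full.
func acceptBooking(queue chan<- PolicyJob, bookingID string) string {
	queue <- PolicyJob{BookingID: bookingID}
	return "ACCEPTED"
}

// processQueue drains the queue until it is closed. Each job is attempted up
// to maxAttempts times; jobs that never succeed are returned separately so
// they can be parked for later reprocessing instead of being dropped.
func processQueue(queue <-chan PolicyJob, issue func(PolicyJob) error, maxAttempts int) (issued, parked []string) {
	for job := range queue {
		done := false
		for attempt := 1; attempt <= maxAttempts && !done; attempt++ {
			done = issue(job) == nil
		}
		if done {
			issued = append(issued, job.BookingID)
		} else {
			parked = append(parked, job.BookingID)
		}
	}
	return issued, parked
}

func main() {
	queue := make(chan PolicyJob, 16)
	fmt.Println(acceptBooking(queue, "BK-1"))
	fmt.Println(acceptBooking(queue, "BK-2"))
	close(queue)

	calls := 0
	flaky := func(PolicyJob) error {
		calls++
		if calls == 1 {
			return errInsurerDown // first call fails, the retry succeeds
		}
		return nil
	}
	issued, parked := processQueue(queue, flaky, 3)
	fmt.Println("issued:", issued, "parked:", parked)
}
```

The key property is that `acceptBooking` returns before any insurer is contacted; everything slow or failure-prone happens in the worker.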

The Self-Healing Architecture

This async pattern creates a self-healing system:

Immediate Response: Users get instant confirmation, travel booking flow continues
Resilient Processing: If insurer APIs are down, we retry automatically
Zero Data Loss: Message Broker guarantees message delivery, failed processes are retried
Graceful Degradation: System remains functional even if all insurers are down
Automatic Recovery: When insurers come back online, queued policies are processed automatically

Disaster Recovery Example:

11:00 AM: Major insurer API goes down
11:01 AM: 500 policies queue up in Message Broker (users still get instant responses)
11:30 AM: Circuit breakers route traffic to backup insurers
12:00 PM: Primary insurer comes back online
12:05 PM: All queued policies processed automatically

Result: Zero user impact, zero lost policies

Architecture Deep Dive: The Adapter Pattern That Scales

The heart of our system is an adapter-based architecture that decouples our business logic from the chaos of external APIs:

Multi-Insurer Abstraction Layer

Instead of tightly coupling our code to specific insurer APIs, we created a common interface:


type InsuranceProvider interface {
    CreatePolicy(request PolicyRequest) (*PolicyResponse, error)
    GetPolicyStatus(policyID string) (*PolicyStatus, error)
    CancelPolicy(policyID string) error
}

Each insurer implements this interface differently:

Insurer A: Uses XML/SOAP with custom authentication
Insurer B: REST APIs with OAuth2 tokens
Insurer C: Legacy HTTP with API keys

But our core business logic sees them all the same way.
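
To make the idea concrete, here is a small sketch of one adapter. The interface is the one shown above; everything else is an assumption for illustration: the request/response fields, the `legacyClient` type (a stand-in for a vendor API with its own vocabulary), and the `legacyAdapter` that translates between the two.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical shapes; the post does not show the real fields.
type PolicyRequest struct {
	BookingID string
	AmountINR int
}
type PolicyResponse struct{ PolicyID, Provider string }
type PolicyStatus struct{ PolicyID, State string }

type InsuranceProvider interface {
	CreatePolicy(request PolicyRequest) (*PolicyResponse, error)
	GetPolicyStatus(policyID string) (*PolicyStatus, error)
	CancelPolicy(policyID string) error
}

// legacyClient mimics a vendor API: certificates, paise, one-letter codes.
type legacyClient struct{ certs map[string]string }

func (c *legacyClient) Issue(ref string, paise int) (string, error) {
	if paise <= 0 {
		return "", errors.New("legacy: invalid premium")
	}
	certNo := "CERT-" + ref
	c.certs[certNo] = "A"
	return certNo, nil
}

func (c *legacyClient) Lookup(certNo string) (string, error) {
	code, ok := c.certs[certNo]
	if !ok {
		return "", errors.New("legacy: unknown certificate")
	}
	return code, nil
}

func (c *legacyClient) Void(certNo string) error {
	if _, ok := c.certs[certNo]; !ok {
		return errors.New("legacy: unknown certificate")
	}
	c.certs[certNo] = "X"
	return nil
}

// legacyAdapter hides the vendor vocabulary behind the common interface.
type legacyAdapter struct{ client *legacyClient }

// Compile-time proof that the adapter satisfies the interface.
var _ InsuranceProvider = (*legacyAdapter)(nil)

func newLegacyAdapter() *legacyAdapter {
	return &legacyAdapter{client: &legacyClient{certs: map[string]string{}}}
}

func (a *legacyAdapter) CreatePolicy(req PolicyRequest) (*PolicyResponse, error) {
	certNo, err := a.client.Issue(req.BookingID, req.AmountINR*100) // rupees to paise
	if err != nil {
		return nil, fmt.Errorf("create policy: %w", err)
	}
	return &PolicyResponse{PolicyID: certNo, Provider: "legacy"}, nil
}

func (a *legacyAdapter) GetPolicyStatus(policyID string) (*PolicyStatus, error) {
	code, err := a.client.Lookup(policyID)
	if err != nil {
		return nil, fmt.Errorf("get status: %w", err)
	}
	state := "UNKNOWN"
	switch code { // translate vendor codes into our own states
	case "A":
		state = "ACTIVE"
	case "X":
		state = "CANCELLED"
	}
	return &PolicyStatus{PolicyID: policyID, State: state}, nil
}

func (a *legacyAdapter) CancelPolicy(policyID string) error {
	return a.client.Void(policyID)
}

func main() {
	var provider InsuranceProvider = newLegacyAdapter()
	resp, err := provider.CreatePolicy(PolicyRequest{BookingID: "BK-1", AmountINR: 499})
	if err != nil {
		panic(err)
	}
	status, _ := provider.GetPolicyStatus(resp.PolicyID)
	fmt.Println(resp.PolicyID, status.State)
}
```

Callers hold only an `InsuranceProvider`, so adding another insurer means writing another adapter rather than touching business logic.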

Smart Provider Selection


// Business rules for provider selection
if req.TravelType == "DOMESTIC" && req.Amount < 10000 {
    return findProvider("PREFERRED") // Best rates for domestic travel
}

if req.Priority == "INSTANT" {
    return findProvider("FASTEST") // Fastest response times
}

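// Default: time-based rotation across providers for load distribution
// (Unix() returns int64, so the length must be converted before the modulo)
return providers[time.Now().Unix()%int64(len(providers))]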

This makes it easy to:

Add new insurance providers
Route traffic based on business rules
Handle provider failures gracefully
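
A time-based index sends every request within the same second to the same provider. If strict per-request round-robin is wanted, an atomic counter is one option. The sketch below keeps the two business rules from the snippet above and swaps only the default branch; `Selector`, `Request`, and the provider names are hypothetical, and `atomic.Uint64` assumes Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Request carries only the fields the selection rules look at.
type Request struct {
	TravelType string
	Amount     int
	Priority   string
}

// Selector applies the business rules first and falls back to round-robin.
// Use it through a pointer; the atomic counter must not be copied.
type Selector struct {
	preferred string   // best rates for small domestic bookings
	fastest   string   // lowest latency provider
	pool      []string // providers eligible for the default rotation
	next      atomic.Uint64
}

// Pick is safe for concurrent use: the only shared mutable state is the
// atomic counter, and rule-based picks do not advance it.
func (s *Selector) Pick(req Request) string {
	if req.TravelType == "DOMESTIC" && req.Amount < 10000 {
		return s.preferred
	}
	if req.Priority == "INSTANT" {
		return s.fastest
	}
	if len(s.pool) == 0 {
		return "" // no providers configured
	}
	n := s.next.Add(1) - 1 // 0, 1, 2, ... across all goroutines
	return s.pool[n%uint64(len(s.pool))]
}

func main() {
	s := &Selector{
		preferred: "insurer-a",
		fastest:   "insurer-b",
		pool:      []string{"insurer-a", "insurer-b", "insurer-c"},
	}
	fmt.Println(s.Pick(Request{TravelType: "DOMESTIC", Amount: 5000}))
	for i := 0; i < 4; i++ {
		fmt.Println(s.Pick(Request{TravelType: "INTERNATIONAL", Amount: 20000}))
	}
}
```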

Real-Time Processing: The 1200ms Promise

Our platform promises policy creation within 1200ms. Here's how we achieve this consistently:

Request Processing Pipeline

Stage 1: Validation (< 50ms)

Instant validation of booking details
Partner authentication checks via authorization layer
Basic data format verification

Stage 2: Provider Selection (< 100ms)

Business rule evaluation
Provider availability checks
Backup provider identification

Stage 3: Concurrent Processing (< 900ms)

Multiple insurer API calls simultaneously
First successful response wins
Automatic failover to backup providers

Stage 4: Response Formatting (< 150ms)

Standardized response creation
Async database storage
Notification queuing to Message Broker

Concurrent API Calls Strategy


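// Launch one goroutine per insurer. The channels are buffered so that
// late finishers never block after we have already returned.
resultChan := make(chan *PolicyResponse, 3)
errorChan := make(chan error, 3)

go callInsurer1(request, resultChan, errorChan)
go callInsurer2(request, resultChan, errorChan)
go callInsurer3(request, resultChan, errorChan)

// Return the first successful response. Fail early if every insurer
// errors, or when the 1200ms budget runs out.
timeout := time.After(1200 * time.Millisecond)
for failed := 0; failed < 3; {
    select {
    case response := <-resultChan:
        return response, nil // Success within 1200ms!
    case <-errorChan:
        failed++ // One insurer failed; keep waiting for the others
    case <-timeout:
        return nil, errors.New("insurer calls timed out")
    }
}
return nil, errors.New("all insurers failed")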
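
A self-contained variant of the same first-success strategy is sketched below using `context`, so that slower calls are cancelled once a winner is found. The names `insurerCall`, `firstSuccess`, `issueWithin`, and `fakeInsurer` are illustrative assumptions, and the fake insurers only simulate latency and failure.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type PolicyResponse struct{ Provider string }

// insurerCall is a hypothetical signature for one provider call.
type insurerCall func(ctx context.Context) (*PolicyResponse, error)

// firstSuccess runs all calls concurrently and returns the first success.
// It fails when every call has failed or the budget expires.
func firstSuccess(parent context.Context, budget time.Duration, calls []insurerCall) (*PolicyResponse, error) {
	if len(calls) == 0 {
		return nil, errors.New("no insurers configured")
	}
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel() // tells the slower calls to stop once we return

	type outcome struct {
		resp *PolicyResponse
		err  error
	}
	results := make(chan outcome, len(calls)) // buffered: goroutines never block
	for _, call := range calls {
		go func(c insurerCall) {
			resp, err := c(ctx)
			results <- outcome{resp, err}
		}(call)
	}

	var lastErr error
	for range calls {
		select {
		case out := <-results:
			if out.err == nil {
				return out.resp, nil
			}
			lastErr = out.err
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, fmt.Errorf("all %d insurers failed, last error: %w", len(calls), lastErr)
}

// issueWithin is a convenience wrapper for callers without a request context.
func issueWithin(budget time.Duration, calls ...insurerCall) (*PolicyResponse, error) {
	return firstSuccess(context.Background(), budget, calls)
}

// fakeInsurer simulates a provider that answers after delay, or fails.
func fakeInsurer(name string, delay time.Duration, fail bool) insurerCall {
	return func(ctx context.Context) (*PolicyResponse, error) {
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		if fail {
			return nil, fmt.Errorf("%s: unavailable", name)
		}
		return &PolicyResponse{Provider: name}, nil
	}
}

func main() {
	resp, err := issueWithin(1200*time.Millisecond,
		fakeInsurer("insurer-a", 300*time.Millisecond, false),
		fakeInsurer("insurer-b", 50*time.Millisecond, false),
		fakeInsurer("insurer-c", 10*time.Millisecond, true),
	)
	fmt.Println(resp, err)
}
```

Using one result channel that carries both the response and the error keeps the counting logic simple: the loop reads at most one outcome per call.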

Monitoring & Reliability: Achieving 99.9% Uptime

We built comprehensive monitoring without relying on external services:

Health Check System

Automated Health Checks Every 30 Seconds:

Database connectivity and response times
Insurer API availability and latency
Message Broker queue depth and processing rates
Memory usage and garbage collection metrics
Authorization layer performance and token validation times

Custom Alert Thresholds:

Error rate > 5%: Immediate Slack notification
Response time P95 > 1200ms: Performance alert
Queue depth > 1000 messages: Capacity alert
3 consecutive health check failures: Critical alert
Authorization failures > 10/minute: Security alert

Auto-Recovery Mechanisms

Our system automatically attempts to fix common issues:

Database Connection Issues:

Detect connection failure
Close existing connections
Wait 5 seconds
Attempt reconnection with exponential backoff
Send alert if recovery fails
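
The reconnection steps above can be sketched as a delay schedule plus a retry loop. This is an assumption-laden illustration: the 5-second base comes from the list above, while the 30-second cap, the attempt count, and the names `backoffDelays`, `reconnect`, `noSleep`, and `errDBDown` are hypothetical. The `connect` function stands in for closing old connections and opening a new one.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDBDown simulates a failed connection attempt.
var errDBDown = errors.New("database unreachable")

// backoffDelays returns the wait before each attempt: base, 2*base,
// 4*base, and so on, capped at max.
func backoffDelays(attempts int, base, max time.Duration) []time.Duration {
	delays := make([]time.Duration, 0, attempts)
	d := base
	for i := 0; i < attempts; i++ {
		if d > max {
			d = max
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// reconnect waits, then tries to connect, once per delay. The sleep function
// is injected so the loop can be exercised without real waiting; pass
// time.Sleep in production. A non-nil return is the cue to send an alert.
func reconnect(connect func() error, delays []time.Duration, sleep func(time.Duration)) error {
	if len(delays) == 0 {
		return errors.New("reconnect: no attempts configured")
	}
	var err error
	for _, d := range delays {
		sleep(d)
		if err = connect(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("reconnect failed after %d attempts: %w", len(delays), err)
}

// noSleep skips waiting; handy for demos and tests.
func noSleep(time.Duration) {}

func main() {
	delays := backoffDelays(4, 5*time.Second, 30*time.Second)
	fmt.Println(delays)

	attempts := 0
	err := reconnect(func() error {
		attempts++
		if attempts < 3 {
			return errDBDown
		}
		return nil
	}, delays, noSleep)
	fmt.Println("attempts:", attempts, "err:", err)
}
```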

Circuit Breaker Pattern:


type CircuitBreaker struct {
    mu          sync.Mutex // guards the fields below; Call may run concurrently
    failures    int
    lastFailure time.Time
    state       string // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(operation func() error) error {
    cb.mu.Lock()
    if cb.state == "open" {
        if time.Since(cb.lastFailure) > 30*time.Second {
            cb.state = "half-open" // Cool-down elapsed: let a trial call through
        } else {
            cb.mu.Unlock()
            return errors.New("circuit breaker open")
        }
    }
    cb.mu.Unlock()

    err := operation() // Run the call without holding the lock

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        if cb.failures >= 5 {
            cb.state = "open" // Stop trying
        }
    } else {
        cb.failures = 0
        cb.state = "closed" // All good
    }

    return err
}

The Self-Healing Promise: Disaster Recovery in Action

Our system handles various failure scenarios automatically:

Scenario 1: Primary Insurer API Down

09:00 AM: Primary insurer API starts returning errors
09:01 AM: Circuit breaker trips after 5 consecutive failures
09:01 AM: All new requests automatically route to backup insurer
09:02 AM: Queued requests start retrying with exponential backoff
12:00 PM: Primary insurer API recovers
12:01 PM: Circuit breaker automatically closes, traffic resumes

Impact: Zero user-facing errors, zero lost policies

Scenario 2: Complete System Overload

During peak booking times (holidays, festivals), our system automatically scales:

Peak Traffic Detected: 3x normal volume → Auto-scaling triggers additional EKS instances → Load balancer distributes traffic across target groups to new instances → Message Broker queues absorb burst traffic → Processing continues at steady rate → No requests timeout or get lost

Key Lessons: Building Systems That Work in the Real World

This is what we learnt by building systems for the real world:

What Worked Exceptionally Well

Go's Concurrency Model: The ability to handle thousands of goroutines with minimal overhead was game-changing. Our concurrent insurer calls reduced response time from 8000-10000ms to 1000-1200ms.
Async-First Design: Separating user-facing responses from downstream processing was the key architectural decision. It turned unreliable dependencies into a non-blocking background concern.
Circuit Breakers Everywhere: Every external dependency gets a circuit breaker. This single pattern prevented cascade failures that would have taken down our entire system.
PostgreSQL for Financial Data: Despite the allure of NoSQL, ACID guarantees were non-negotiable for insurance transactions. PostgreSQL's reliability saved us countless data consistency issues.
Authorization Layer: Adding a dedicated authorization layer before the load balancer prevented unauthorized traffic from ever reaching our backend services, significantly reducing load and improving security.

Figure: Master cloud infra setup

Our Hard-Learned Lessons

Don't Trust External APIs: We found that even enterprise systems can have reliability issues. Always build for failure, not success.
Monitoring is Not Optional: We spent 30% of our development time on monitoring and alerting. This investment paid for itself within the first month of production.
Database Connection Pools Are Critical: We initially underestimated connection pool configuration. Proper pooling improved our database performance by 300%.
Message Ordering Matters: For financial operations, we had to implement our own sequencing logic on top of the Message Broker's eventual consistency.
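
The post does not describe the actual sequencing logic, so the following is only one common way to approach it: the producer stamps each event with a per-policy sequence number, and the consumer buffers anything that arrives early. `Event`, `Resequencer`, and the event kinds are hypothetical, and this version is not safe for concurrent use without a lock.

```go
package main

import "fmt"

// Event is a hypothetical financial event for one policy.
type Event struct {
	PolicyID string
	Seq      uint64 // 1-based, assigned by the producer per policy
	Kind     string
}

// Resequencer applies events for each policy strictly in Seq order.
type Resequencer struct {
	next    map[string]uint64           // next expected Seq per policy
	pending map[string]map[uint64]Event // early arrivals waiting their turn
	apply   func(Event)
}

func NewResequencer(apply func(Event)) *Resequencer {
	return &Resequencer{
		next:    map[string]uint64{},
		pending: map[string]map[uint64]Event{},
		apply:   apply,
	}
}

// Receive buffers the event, then applies every event that is now in order.
func (r *Resequencer) Receive(e Event) {
	expected := r.next[e.PolicyID]
	if expected == 0 {
		expected = 1 // first event for this policy
	}
	if e.Seq < expected {
		return // already applied: a duplicate delivery
	}
	if r.pending[e.PolicyID] == nil {
		r.pending[e.PolicyID] = map[uint64]Event{}
	}
	r.pending[e.PolicyID][e.Seq] = e

	for {
		ev, ok := r.pending[e.PolicyID][expected]
		if !ok {
			break // gap: wait for the missing event
		}
		r.apply(ev)
		delete(r.pending[e.PolicyID], expected)
		expected++
	}
	r.next[e.PolicyID] = expected
}

func main() {
	r := NewResequencer(func(e Event) { fmt.Println(e.PolicyID, e.Seq, e.Kind) })
	r.Receive(Event{PolicyID: "POL-1", Seq: 2, Kind: "POLICY_ISSUED"})     // held back
	r.Receive(Event{PolicyID: "POL-1", Seq: 1, Kind: "PREMIUM_COLLECTED"}) // releases 1 and 2
	r.Receive(Event{PolicyID: "POL-1", Seq: 1, Kind: "PREMIUM_COLLECTED"}) // duplicate, ignored
}
```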

Here's What We Would Do Differently

Start with Distributed Tracing Earlier

Debugging issues across multiple insurers and async processing was challenging. Distributed tracing should have been a day-one requirement.

More Granular Metrics

Our initial metrics were too high-level. Per-partner, per-insurer, and per-endpoint granularity was crucial for optimization.

Load Testing with Real API Behavior

Our initial load tests used mock APIs that were far more reliable than real insurer APIs. Load testing should simulate real-world failures.
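
One way to make mocks behave more like real insurer APIs is to inject latency and failures on purpose. The sketch below is a hypothetical test double, not a description of the team's tooling: `flakyInsurer` fails every Nth request with a 503 and delays each response, and `probe` drives it in-process with `httptest`. A deterministic failure pattern is used so results are repeatable; a real load test might randomize it and serve the handler through `httptest.NewServer`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// flakyInsurer returns a handler that delays every response by latency and
// answers every failEvery-th request with 503. failEvery <= 0 disables failures.
func flakyInsurer(failEvery int, latency time.Duration) http.Handler {
	var (
		mu    sync.Mutex
		count int
	)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		count++
		n := count
		mu.Unlock()

		time.Sleep(latency) // simulate a slow downstream API
		if failEvery > 0 && n%failEvery == 0 {
			http.Error(w, "insurer unavailable", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, `{"status":"ISSUED"}`)
	})
}

// probe sends n sequential requests to the handler and collects status codes.
func probe(h http.Handler, n int) []int {
	codes := make([]int, 0, n)
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/policies", nil))
		codes = append(codes, rec.Code)
	}
	return codes
}

func main() {
	fmt.Println(probe(flakyInsurer(3, 20*time.Millisecond), 6))
}
```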

We Processed 7+ Lakh Policies With Zero Data Loss

After 12 months in production:

700,000+ total policies processed
100,000+ monthly bookings (current scale)
99.9% uptime maintained
Average response time: 800ms
P95 response time: 1100ms
Zero data loss incidents
Zero financial reconciliation issues

Future Evolution: Beyond Travel Insurance

Our adapter-based architecture positions us for expansion:

Technical Roadmap

Multi-region deployment for international travel coverage
Machine learning to identify high-risk corridors for better pricing
GraphQL APIs for more flexible partner integrations
Microservices decomposition for independent service scaling

Conclusion: Building Systems That Work in the Real World

Building a real-time insurance platform taught us that reliability isn't about perfect components, but about graceful handling of imperfect ones. Our success came from:

Accepting that external dependencies will fail and designing around that reality
Using async processing to decouple user experience from system complexity
Choosing battle-tested technologies over trendy ones
Investing heavily in monitoring and observability from day one
Building self-healing mechanisms that reduce operational overhead
Implementing robust authorization to protect backend services

The Indian travel insurance market was ready for disruption, not because of technology limitations, but because of integration and reliability challenges. By solving the fundamental problem of unreliable dependencies, we created a platform that processes 100,000+ monthly bookings with 99.9% uptime.

Our Most Important Lesson

Engineering isn't about using the latest frameworks or architectures. It's about understanding real-world constraints and building systems that work reliably despite those constraints. This platform represents everything we've learned about creating software that actually solves problems for real people. The 700,000+ travelers who got instant insurance coverage are proof that thoughtful engineering can transform an entire industry.

As India's digital economy grows, we anticipate that the principles we've applied here (reliability, scalability, and self-healing design) will become even more critical for the next generation of fintech platforms.
