Building Real-Time Travel Insurance at Scale: Lessons from 700K+ Bookings
Shubham Chaturvedi, Akshay Patil

The Real Problem: More Than Just "Digitizing Insurance"
The most challenging problems aren't technical; they're about bridging the gap between what users need and what existing systems can deliver. When we started building our travel insurance platform, the problem seemed straightforward: make travel insurance instant and seamless during booking flows.
However, as we delved deeper, we uncovered the true challenges that were holding India's travel insurance back in the analog age.
The Dependency Hell
The core issue wasn't just the slow manual processes, but the reliability nightmare of depending on multiple external systems:
- Insurance Provider APIs: Downstream dependencies that could sometimes become unresponsive or return unexpected results
- Travel Partner Systems: High-volume booking platforms that needed instant responses and couldn't tolerate delays in external APIs
- Financial Settlement: Real money transactions that couldn't be lost, duplicated, or processed incorrectly
- Regulatory Compliance: Every policy needed proper documentation, audit trails, and regulatory reporting
The fundamental challenge we faced: how do you promise real-time insurance when your downstream dependencies can fail?
The Cascade Failure Problem
Before building our platform, we witnessed what happened when travel booking systems tried to integrate directly with insurance APIs:
Scenario 1 (visible failure): User books ticket → Travel platform calls insurance API → Insurance API times out → Entire booking flow fails → User abandons purchase → Lost revenue for everyone
Scenario 2 (silent failure), which is worse: Booking completes → Insurance API fails silently → User believes they are insured, but no policy exists → When they actually need coverage, there is no record of their insurance
Traditional solutions tried to solve this with:
- Synchronous retries: Making users wait 30+ seconds for multiple API attempts
- Manual fallback: Customer service teams manually processing failed policies
- Best-effort integration: An approach that "usually works", which isn't good enough for financial products
We found that none of these approaches could deliver the real-time, guaranteed reliability that modern travel platforms needed.
Why We Chose Go: Concurrency-First Architecture
After working with various programming languages, we needed a language that could handle our specific challenges. Go's concurrency model was perfect for this:
The Concurrency Challenge
We needed to call multiple insurance providers simultaneously, handle thousands of concurrent bookings, and process async operations without blocking the main request flow. Go's goroutines made this simple:
// Instead of sequential calls (6000-8000ms total)
Call Insurer A → Wait 3000ms
Call Insurer B → Wait 3000ms
Call Insurer C → Wait 2000ms
// Concurrent calls with Go (1000-1200ms total)
Launch 3 goroutines simultaneously
Return first successful response
Memory Efficiency at Scale
With 100K+ monthly bookings, we needed predictable memory usage. Go's garbage collector and small goroutine overhead (2KB initial stack) meant we could handle thousands of concurrent requests without memory bloat.
Deployment Simplicity
Go's single binary deployment was refreshing: no dependency conflicts and no runtime version issues. We could just copy a binary and run it.
The Tech Stack: Battle-Tested Components for Unreliable Dependencies
Every component in our stack was chosen to solve specific reliability and scalability challenges:
AWS API Gateway + Authorization + ALB: The Shield Wall

Why This Combination?
- API Gateway: Built-in DDoS protection, request validation, and rate limiting
- Authorization Layer: JWT-based token validation, partner authentication, and request signing verification
- ALB: Health checks and automatic failover between EKS instances via target groups
- Target Groups: Intelligent routing and health monitoring of backend instances
- Private EKS: No direct internet access = reduced attack surface
This setup gives us multiple layers of protection and automatic failover. Once an EKS instance fails its health checks, the ALB routes new requests only to the remaining healthy instances.
Authorization Flow:
- Partner makes request to API Gateway with JWT token
- Custom authorizer validates token signature and claims
- Authorizer checks partner permissions and rate limits
- Valid requests proceed to ALB → Target Groups → EKS
- Invalid requests rejected at authorization layer (no backend load)
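The signature-and-claims step of this flow can be sketched with the standard library alone. This is a minimal illustration, not the platform's authorizer: it assumes HS256-signed tokens with a shared secret, and the `partner_id` claim name, `Claims` type, and `issueToken`/`verifyToken` helpers are hypothetical. Permission and rate-limit checks are left out.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"strings"
)

// Claims lists the claims this sketch checks. The field names are assumptions.
type Claims struct {
	PartnerID string `json:"partner_id"`
	ExpiresAt int64  `json:"exp"` // Unix seconds
}

// sign returns the base64url-encoded HMAC-SHA256 of signingInput.
func sign(signingInput string, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// issueToken builds a signed token. It exists only to exercise verifyToken.
func issueToken(c Claims, secret []byte) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	header := base64.RawURLEncoding.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	signingInput := header + "." + base64.RawURLEncoding.EncodeToString(payload)
	return signingInput + "." + sign(signingInput, secret), nil
}

// verifyToken checks the signature first and the claims second, so nothing in
// an unsigned payload is trusted. nowUnix is the current time in Unix seconds.
func verifyToken(token string, secret []byte, nowUnix int64) (*Claims, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("malformed token")
	}
	// The algorithm is fixed to HS256; the header's "alg" field is never trusted.
	expected := sign(parts[0]+"."+parts[1], secret)
	if !hmac.Equal([]byte(expected), []byte(parts[2])) {
		return nil, errors.New("invalid signature")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, errors.New("malformed payload")
	}
	var c Claims
	if err := json.Unmarshal(payload, &c); err != nil {
		return nil, errors.New("malformed claims")
	}
	if c.PartnerID == "" {
		return nil, errors.New("missing partner_id claim")
	}
	if nowUnix >= c.ExpiresAt {
		return nil, errors.New("token expired")
	}
	return &c, nil
}
```

In real use the caller would pass `time.Now().Unix()` as `nowUnix`. Rejecting the request as soon as `verifyToken` returns an error is what keeps invalid traffic away from the backend.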
PostgreSQL as Financial Ledger: ACID Guarantees
For insurance, data consistency isn't optional. PostgreSQL's ACID properties ensure that:
- Money movements are always properly recorded
- Policy states are never inconsistent
- Audit trails are complete and reliable
While PostgreSQL handles our critical financial data, DynamoDB manages:
- API request/response logs (millions of records)
- Partner session data
- Real-time integration status
The key insight: Use the right database for each use case rather than forcing everything into one solution.
The Game Changer For Us: Message Broker for Async Processing
This is where we solved the "unreliable dependencies" problem. Instead of making users wait for slow downstream APIs, we implemented an async pattern:

The Self-Healing Architecture
This async pattern creates a self-healing system:
- Immediate Response: Users get instant confirmation, travel booking flow continues
- Resilient Processing: If insurer APIs are down, we retry automatically
- Zero Data Loss: Message Broker guarantees message delivery, failed processes are retried
- Graceful Degradation: System remains functional even if all insurers are down
- Automatic Recovery: When insurers come back online, queued policies are processed automatically
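The accept-now, process-later idea behind these properties can be sketched as follows. This is an illustration under loud assumptions: a buffered channel stands in for the durable Message Broker (so it does not survive a restart), and `Processor`, `Accept`, and `ErrInsurerDown` are hypothetical names.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// ErrInsurerDown is a stand-in for any transient downstream failure.
var ErrInsurerDown = errors.New("insurer unavailable")

// Processor accepts bookings immediately and issues policies in the background.
// The buffered channel is an in-memory stand-in for a durable message broker.
type Processor struct {
	queue       chan string
	issue       func(bookingID string) error // downstream insurer call
	maxAttempts int
	baseDelay   time.Duration
	wg          sync.WaitGroup
	mu          sync.Mutex
	issued      map[string]bool
}

// NewProcessor starts one background worker. baseDelayMs is the first retry
// delay in milliseconds; it doubles after every failed attempt.
func NewProcessor(issue func(string) error, maxAttempts, baseDelayMs int) *Processor {
	p := &Processor{
		queue:       make(chan string, 1024),
		issue:       issue,
		maxAttempts: maxAttempts,
		baseDelay:   time.Duration(baseDelayMs) * time.Millisecond,
		issued:      make(map[string]bool),
	}
	p.wg.Add(1)
	go p.worker()
	return p
}

// Accept enqueues the booking and returns at once, so the caller never waits
// on the insurer.
func (p *Processor) Accept(bookingID string) error {
	select {
	case p.queue <- bookingID:
		return nil
	default:
		return errors.New("queue full")
	}
}

func (p *Processor) worker() {
	defer p.wg.Done()
	for id := range p.queue {
		delay := p.baseDelay
		for attempt := 1; attempt <= p.maxAttempts; attempt++ {
			if err := p.issue(id); err == nil {
				p.mu.Lock()
				p.issued[id] = true
				p.mu.Unlock()
				break
			}
			if attempt < p.maxAttempts {
				time.Sleep(delay) // back off before the next attempt
				delay *= 2
			}
		}
		// A real system would move still-failing jobs to a dead-letter queue.
	}
}

// Issued reports whether a policy has been issued for the booking.
func (p *Processor) Issued(bookingID string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.issued[bookingID]
}

// Close stops intake and blocks until every queued booking has been processed.
func (p *Processor) Close() {
	close(p.queue)
	p.wg.Wait()
}
```

A booking handler would call `Accept` and confirm to the user straight away. `Accept` must not be called after `Close`, since sending on a closed channel panics.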
Disaster Recovery Example:
- 11:00 AM: Major insurer API goes down
- 11:01 AM: 500 policies queue up in Message Broker (users still get instant responses)
- 11:30 AM: Circuit breakers route traffic to backup insurers
- 12:00 PM: Primary insurer comes back online
- 12:05 PM: All queued policies processed automatically
Result: Zero user impact, zero lost policies
Architecture Deep Dive: The Adapter Pattern That Scales
The heart of our system is an adapter-based architecture that decouples our business logic from the chaos of external APIs:
Multi-Insurer Abstraction Layer
Instead of tightly coupling our code to specific insurer APIs, we created a common interface:
type InsuranceProvider interface {
    CreatePolicy(request PolicyRequest) (*PolicyResponse, error)
    GetPolicyStatus(policyID string) (*PolicyStatus, error)
    CancelPolicy(policyID string) error
}
Each insurer implements this interface differently:
- Insurer A: Uses XML/SOAP with custom authentication
- Insurer B: REST APIs with OAuth2 tokens
- Insurer C: Legacy HTTP with API keys
But our core business logic sees them all the same way.
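To make the adapter idea concrete, here is a sketch of one adapter wrapping a vendor client that speaks a different vocabulary. Everything beyond the `InsuranceProvider` method set is an assumption for illustration: the fields of `PolicyRequest`, `PolicyResponse`, and `PolicyStatus`, and the in-memory `vendorClient`, which stands in for a real insurer SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// Common types shared by every adapter. Field names are assumptions.
type PolicyRequest struct {
	BookingID string
	Traveller string
}

type PolicyResponse struct {
	PolicyID string
	Provider string
}

type PolicyStatus struct {
	PolicyID string
	Active   bool
}

// InsuranceProvider mirrors the common interface described in the article.
type InsuranceProvider interface {
	CreatePolicy(request PolicyRequest) (*PolicyResponse, error)
	GetPolicyStatus(policyID string) (*PolicyStatus, error)
	CancelPolicy(policyID string) error
}

// vendorClient is a hypothetical insurer SDK with its own vocabulary
// ("certificates" instead of policies, string states instead of booleans).
type vendorClient struct {
	states map[string]string // certificate number -> "IN_FORCE" or "VOID"
}

func (v *vendorClient) IssueCertificate(ref, insured string) (string, error) {
	if ref == "" || insured == "" {
		return "", errors.New("vendor: missing reference or insured name")
	}
	certNo := fmt.Sprintf("CERT-%03d", len(v.states)+1)
	v.states[certNo] = "IN_FORCE"
	return certNo, nil
}

func (v *vendorClient) VoidCertificate(certNo string) error {
	if _, ok := v.states[certNo]; !ok {
		return errors.New("vendor: unknown certificate")
	}
	v.states[certNo] = "VOID"
	return nil
}

// vendorAdapter translates the common interface into vendor calls.
type vendorAdapter struct{ client *vendorClient }

// Compile-time check that the adapter satisfies the common interface.
var _ InsuranceProvider = (*vendorAdapter)(nil)

func newVendorAdapter() *vendorAdapter {
	return &vendorAdapter{client: &vendorClient{states: map[string]string{}}}
}

func (a *vendorAdapter) CreatePolicy(req PolicyRequest) (*PolicyResponse, error) {
	certNo, err := a.client.IssueCertificate(req.BookingID, req.Traveller)
	if err != nil {
		return nil, fmt.Errorf("create policy: %w", err)
	}
	return &PolicyResponse{PolicyID: certNo, Provider: "vendor"}, nil
}

func (a *vendorAdapter) GetPolicyStatus(policyID string) (*PolicyStatus, error) {
	state, ok := a.client.states[policyID]
	if !ok {
		return nil, errors.New("get policy status: unknown policy")
	}
	return &PolicyStatus{PolicyID: policyID, Active: state == "IN_FORCE"}, nil
}

func (a *vendorAdapter) CancelPolicy(policyID string) error {
	if err := a.client.VoidCertificate(policyID); err != nil {
		return fmt.Errorf("cancel policy: %w", err)
	}
	return nil
}
```

Business logic holds only an `InsuranceProvider` value, so swapping the SOAP, REST, or legacy HTTP insurer means writing another adapter of this shape rather than touching the callers.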
Smart Provider Selection
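// Business rules for provider selection.
// rrCounter is a package-level atomic.Uint64 (sync/atomic, Go 1.19+).
func selectProvider(req PolicyRequest, providers []InsuranceProvider) InsuranceProvider {
    if req.TravelType == "DOMESTIC" && req.Amount < 10000 {
        return findProvider("PREFERRED") // Best rates for domestic travel
    }
    if req.Priority == "INSTANT" {
        return findProvider("FASTEST") // Fastest response times
    }
    // Default: round-robin for load distribution, driven by a shared counter
    n := rrCounter.Add(1)
    return providers[n%uint64(len(providers))]
}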
This makes it easy to:
- Add new insurance providers
- Route traffic based on business rules
- Handle provider failures gracefully
Real-Time Processing: The 1200ms Promise
Our platform promises policy creation within 1200ms. Here's how we achieve this consistently:
Request Processing Pipeline
Stage 1: Validation (< 50ms)
- Instant validation of booking details
- Partner authentication checks via authorization layer
- Basic data format verification
Stage 2: Provider Selection (< 100ms)
- Business rule evaluation
- Provider availability checks
- Backup provider identification
Stage 3: Concurrent Processing (< 900ms)
- Multiple insurer API calls simultaneously
- First successful response wins
- Automatic failover to backup providers
Stage 4: Response Formatting (< 150ms)
- Standardized response creation
- Async database storage
- Notification queuing to Message Broker
Concurrent API Calls Strategy
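// createPolicyConcurrently fans the request out to three insurers and returns
// the first successful response (needs the "errors" and "time" packages).
func createPolicyConcurrently(request PolicyRequest) (*PolicyResponse, error) {
    // Buffered channels let slow insurers finish sending after we have returned
    resultChan := make(chan *PolicyResponse, 3)
    errorChan := make(chan error, 3)
    go callInsurer1(request, resultChan, errorChan)
    go callInsurer2(request, resultChan, errorChan)
    go callInsurer3(request, resultChan, errorChan)
    timeout := time.After(1200 * time.Millisecond)
    var lastErr error
    for failed := 0; failed < 3; {
        select {
        case response := <-resultChan:
            return response, nil // Success within 1200ms!
        case lastErr = <-errorChan:
            failed++ // Keep waiting for the remaining insurers
        case <-timeout:
            return nil, errors.New("no insurer responded within 1200ms")
        }
    }
    return nil, lastErr // Every insurer failed before the deadline
}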
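A self-contained variant of the same fan-out is sketched below, using `context` so that slower calls are told to stop once a winner is found. The `insurerCall` signature, the `fakeInsurer` simulator, and the `createPolicy` wrapper are hypothetical; real insurer clients would need to honour the context for cancellation to take effect.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// PolicyResponse fields are assumptions for illustration.
type PolicyResponse struct {
	PolicyID string
	Provider string
}

// insurerCall is a hypothetical signature for one provider call.
type insurerCall func(ctx context.Context) (*PolicyResponse, error)

// firstSuccess runs every call concurrently and returns the first success.
// It fails early when all calls have failed, and with ctx.Err() on timeout.
func firstSuccess(parent context.Context, timeout time.Duration, calls ...insurerCall) (*PolicyResponse, error) {
	if len(calls) == 0 {
		return nil, errors.New("no insurers configured")
	}
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // tells the slower calls to stop once we return

	type result struct {
		resp *PolicyResponse
		err  error
	}
	results := make(chan result, len(calls)) // buffered: senders never block
	for _, call := range calls {
		go func(call insurerCall) {
			resp, err := call(ctx)
			results <- result{resp, err}
		}(call)
	}

	var errs []error
	for range calls {
		select {
		case r := <-results:
			if r.err == nil {
				return r.resp, nil
			}
			errs = append(errs, r.err)
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, errors.Join(errs...) // every insurer failed (Go 1.20+)
}

// fakeInsurer simulates a provider that answers after delayMs milliseconds.
func fakeInsurer(name string, delayMs int, fail bool) insurerCall {
	return func(ctx context.Context) (*PolicyResponse, error) {
		select {
		case <-time.After(time.Duration(delayMs) * time.Millisecond):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		if fail {
			return nil, errors.New(name + ": unavailable")
		}
		return &PolicyResponse{PolicyID: "POL-" + name, Provider: name}, nil
	}
}

// createPolicy applies a latency budget in milliseconds to the fan-out.
func createPolicy(budgetMs int, calls ...insurerCall) (*PolicyResponse, error) {
	return firstSuccess(context.Background(), time.Duration(budgetMs)*time.Millisecond, calls...)
}
```

For example, `createPolicy(1200, fakeInsurer("A", 300, false), fakeInsurer("B", 50, false))` returns B's response. One design caveat: a cancelled call may already have created a policy on the insurer's side, so a production version would need a way to void or reconcile the losers.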
Monitoring & Reliability: Achieving 99.9% Uptime
We built comprehensive monitoring without relying on external services:
Health Check System
Automated Health Checks Every 30 Seconds:
- Database connectivity and response times
- Insurer API availability and latency
- Message Broker queue depth and processing rates
- Memory usage and garbage collection metrics
- Authorization layer performance and token validation times
Custom Alert Thresholds:
- Error rate > 5%: Immediate Slack notification
- Response time P95 > 1200ms: Performance alert
- Queue depth > 1000 messages: Capacity alert
- 3 consecutive health check failures: Critical alert
- Authorization failures > 10/minute: Security alert
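The thresholds above translate directly into a small rule evaluator. The sketch below assumes a hypothetical `Snapshot` of metrics collected by one health-check pass. How the metrics are gathered and where alerts are delivered is out of scope.

```go
package main

// Snapshot is a hypothetical set of metrics gathered by one health-check pass.
type Snapshot struct {
	ErrorRate              float64 // fraction of failed requests, 0.0 to 1.0
	P95LatencyMs           int
	QueueDepth             int
	ConsecutiveHealthFails int
	AuthFailuresPerMinute  int
}

// Alert names the rule that fired.
type Alert struct {
	Kind    string // "error-rate", "performance", "capacity", "critical", "security"
	Message string
}

// evaluate applies the alert thresholds listed in the article, in order.
func evaluate(s Snapshot) []Alert {
	var alerts []Alert
	if s.ErrorRate > 0.05 {
		alerts = append(alerts, Alert{"error-rate", "error rate above 5%: notify Slack immediately"})
	}
	if s.P95LatencyMs > 1200 {
		alerts = append(alerts, Alert{"performance", "P95 response time above 1200ms"})
	}
	if s.QueueDepth > 1000 {
		alerts = append(alerts, Alert{"capacity", "queue depth above 1000 messages"})
	}
	if s.ConsecutiveHealthFails >= 3 {
		alerts = append(alerts, Alert{"critical", "3 or more consecutive health check failures"})
	}
	if s.AuthFailuresPerMinute > 10 {
		alerts = append(alerts, Alert{"security", "authorization failures above 10 per minute"})
	}
	return alerts
}
```

Keeping the rules in one pure function makes the thresholds easy to unit test and to tune without touching the collection code.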
Auto-Recovery Mechanisms
Our system automatically attempts to fix common issues:
Database Connection Issues:
- Detect connection failure
- Close existing connections
- Wait 5 seconds
- Attempt reconnection with exponential backoff
- Send alert if recovery fails
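The wait-and-retry part of this recovery sequence might look like the sketch below. The doubling schedule, the 60-second cap used in the example, and the names `backoffDelays`, `reconnect`, and `noSleep` are assumptions. Closing stale connections and sending the alert are left to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDBDown is a stand-in for a connection failure.
var errDBDown = errors.New("database unreachable")

// backoffDelays returns the wait before each attempt: initial, then doubling,
// capped at max. Arguments are in seconds.
func backoffDelays(initialSec, maxSec, attempts int) []time.Duration {
	if attempts <= 0 {
		return nil
	}
	delays := make([]time.Duration, 0, attempts)
	d := time.Duration(initialSec) * time.Second
	limit := time.Duration(maxSec) * time.Second
	for i := 0; i < attempts; i++ {
		if d > limit {
			d = limit
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// reconnect waits, then calls connect, once per delay until connect succeeds.
// It returns the number of attempts made. sleep is injected so tests and dry
// runs can skip the real waiting; pass time.Sleep in production code.
func reconnect(connect func() error, delays []time.Duration, sleep func(time.Duration)) (int, error) {
	if len(delays) == 0 {
		return 0, errors.New("reconnect: no attempts configured")
	}
	var err error
	for i, d := range delays {
		sleep(d)
		if err = connect(); err == nil {
			return i + 1, nil
		}
	}
	// Recovery failed: the caller should raise an alert at this point.
	return len(delays), fmt.Errorf("reconnect failed after %d attempts: %w", len(delays), err)
}

// noSleep is a sleep function that returns immediately.
func noSleep(time.Duration) {}
```

With `backoffDelays(5, 60, 5)` the waits are 5s, 10s, 20s, 40s, and 60s, which matches the initial 5-second wait in the list above followed by exponential growth.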
Circuit Breaker Pattern:
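// Needs the "errors", "sync", and "time" packages.
type CircuitBreaker struct {
    mu          sync.Mutex // Guards the fields below; Call may run on many goroutines
    failures    int
    lastFailure time.Time
    state       string // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(operation func() error) error {
    cb.mu.Lock()
    if cb.state == "open" {
        if time.Since(cb.lastFailure) > 30*time.Second {
            cb.state = "half-open" // Cool-down elapsed: let a trial call through
        } else {
            cb.mu.Unlock()
            return errors.New("circuit breaker open")
        }
    }
    cb.mu.Unlock()

    err := operation() // Runs outside the lock so slow calls don't serialize

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        if cb.failures >= 5 {
            cb.state = "open" // Stop trying
        }
    } else {
        cb.failures = 0
        cb.state = "closed" // All good
    }
    return err
}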
The Self-Healing Promise: Disaster Recovery in Action
Our system handles various failure scenarios automatically:
Scenario 1: Primary Insurer API Down
- 09:00 AM: Primary insurer API starts returning errors
- 09:01 AM: Circuit breaker trips after 5 consecutive failures
- 09:01 AM: All new requests automatically route to backup insurer
- 09:02 AM: Queued requests start retrying with exponential backoff
- 12:00 PM: Primary insurer API recovers
- 12:01 PM: Circuit breaker automatically closes, traffic resumes
Impact: Zero user-facing errors, zero lost policies
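The routing behaviour in this timeline can be sketched as a priority list of providers, each guarded by its own breaker. This is an illustration only: the `breaker`, `provider`, `newProvider`, and `issueWithFailover` names are hypothetical, and the breaker is a compact variant written for this example.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errProviderDown is a stand-in for an insurer API error.
var errProviderDown = errors.New("provider unavailable")

// breaker is a minimal circuit breaker: it opens after threshold consecutive
// failures and lets a trial call through once cooldown has elapsed.
type breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openedAt  time.Time
}

// allow reports whether a call may be attempted right now.
func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.failures < b.threshold || time.Since(b.openedAt) >= b.cooldown
}

// record updates the breaker with the outcome of a call.
func (b *breaker) record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0 // success closes the breaker
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.openedAt = time.Now() // (re)start the cool-down
	}
}

// provider pairs an insurer call with its own breaker.
type provider struct {
	name  string
	issue func() error
	cb    *breaker
}

func newProvider(name string, threshold, cooldownSec int, issue func() error) *provider {
	cb := &breaker{threshold: threshold, cooldown: time.Duration(cooldownSec) * time.Second}
	return &provider{name: name, issue: issue, cb: cb}
}

// issueWithFailover tries providers in priority order, skipping any whose
// breaker is open, and returns the name of the provider that succeeded.
func issueWithFailover(providers ...*provider) (string, error) {
	if len(providers) == 0 {
		return "", errors.New("no providers configured")
	}
	var errs []error
	for _, p := range providers {
		if !p.cb.allow() {
			errs = append(errs, errors.New(p.name+": circuit breaker open"))
			continue
		}
		err := p.issue()
		p.cb.record(err)
		if err == nil {
			return p.name, nil
		}
		errs = append(errs, err)
	}
	return "", errors.Join(errs...) // Go 1.20+
}
```

Once the primary's breaker has opened, calls go straight to the backup without paying for a doomed request. After the cool-down, one trial call decides whether the primary takes traffic again.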
Scenario 2: Complete System Overload
During peak booking times (holidays, festivals), our system automatically scales:
Peak Traffic Detected: 3x normal volume → Auto-scaling triggers additional EKS instances → Load balancer distributes traffic across target groups to new instances → Message Broker queues absorb burst traffic → Processing continues at steady rate → No requests timeout or get lost
Key Lessons: Building Systems That Work in the Real World
This is what we learnt by building systems for the real world:
What Worked Exceptionally Well
- Go's Concurrency Model: The ability to handle thousands of goroutines with minimal overhead was game-changing. Our concurrent insurer calls reduced response time from 8000-10000ms to 1000-1200ms.
- Async-First Design: Separating user-facing responses from downstream processing was the key architectural decision. It turned unreliable dependencies into a non-blocking background concern.
- Circuit Breakers Everywhere: Every external dependency gets a circuit breaker. This single pattern prevented cascade failures that would have taken down our entire system.
- PostgreSQL for Financial Data: Despite the allure of NoSQL, ACID guarantees were non-negotiable for insurance transactions. PostgreSQL's reliability saved us countless data consistency issues.
- Authorization Layer: Adding a dedicated authorization layer before the load balancer prevented unauthorized traffic from ever reaching our backend services, significantly reducing load and improving security.

Our Hard-Learned Lessons
- Don't Trust External APIs: In our experience, even enterprise systems can have reliability issues. Always build for failure, not success.
- Monitoring is Not Optional: We spent 30% of our development time on monitoring and alerting. This investment paid for itself within the first month of production.
- Database Connection Pools Are Critical: We initially underestimated connection pool configuration. Proper pooling improved our database performance by 300%.
- Message Ordering Matters: For financial operations, we had to implement our own sequencing logic on top of the Message Broker, because the broker alone did not give us the ordering guarantees we needed.
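One common way to layer ordering on top of a broker is a per-key resequencer, sketched below. This is not necessarily the sequencing logic used on the platform: it assumes the producer stamps every event with a per-policy sequence number starting at 1, and the `Event` and `Resequencer` types are hypothetical.

```go
package main

// Event is a hypothetical financial event carrying a per-policy sequence number.
type Event struct {
	PolicyID string
	Seq      uint64 // 1, 2, 3, ... assigned by the producer
	Kind     string
}

// Resequencer releases events for each policy strictly in Seq order,
// buffering early arrivals and dropping redelivered duplicates.
// It is not safe for concurrent use without external locking.
type Resequencer struct {
	next    map[string]uint64           // next expected Seq per policy
	pending map[string]map[uint64]Event // early arrivals per policy
}

func NewResequencer() *Resequencer {
	return &Resequencer{
		next:    make(map[string]uint64),
		pending: make(map[string]map[uint64]Event),
	}
}

// Push accepts one event and returns the events that are now ready to apply,
// in order. The result is empty when the event is early or a duplicate.
func (r *Resequencer) Push(e Event) []Event {
	next := r.next[e.PolicyID]
	if next == 0 {
		next = 1 // first event seen for this policy
	}
	if e.Seq < next {
		return nil // already applied: duplicate delivery
	}
	buf := r.pending[e.PolicyID]
	if buf == nil {
		buf = make(map[uint64]Event)
		r.pending[e.PolicyID] = buf
	}
	buf[e.Seq] = e

	var ready []Event
	for {
		ev, ok := buf[next]
		if !ok {
			break // gap: wait for the missing sequence number
		}
		ready = append(ready, ev)
		delete(buf, next)
		next++
	}
	r.next[e.PolicyID] = next
	return ready
}
```

A consumer would call `Push` for every delivered message and apply only the returned events to the ledger. A production version would also persist the `next` counters so that ordering survives restarts.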
Here's What We Would Do Differently
- Start with Distributed Tracing Earlier: Debugging issues across multiple insurers and async processing was challenging. Distributed tracing should have been a day-one requirement.
- More Granular Metrics: Our initial metrics were too high-level. Per-partner, per-insurer, and per-endpoint granularity was crucial for optimization.
- Load Testing with Real API Behavior: Our initial load tests used mock APIs that were far more reliable than real insurer APIs. Load testing should simulate real-world failures.
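A mock that misbehaves on purpose is one way to act on that last point. The sketch below is a hypothetical fault-injecting handler built on `net/http/httptest`. The failure and delay pattern is deterministic (every Nth request) purely to keep the example testable, whereas a real load test would likely randomise it.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// flakyInsurer returns a handler that imitates an unreliable insurer API:
// every failEvery-th request gets a 503 and every slowEvery-th request is
// delayed by delayMs milliseconds. A value of 0 disables that behaviour.
func flakyInsurer(failEvery, slowEvery, delayMs int) http.Handler {
	var count atomic.Int64 // Go 1.19+
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n := int(count.Add(1))
		if slowEvery > 0 && n%slowEvery == 0 {
			time.Sleep(time.Duration(delayMs) * time.Millisecond)
		}
		if failEvery > 0 && n%failEvery == 0 {
			http.Error(w, "insurer unavailable", http.StatusServiceUnavailable)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"status":"ISSUED"}`))
	})
}

// probe sends total sequential requests straight to the handler (no network)
// and counts successes, failures, and responses slower than budgetMs.
func probe(h http.Handler, total, budgetMs int) (ok, failed, slow int) {
	budget := time.Duration(budgetMs) * time.Millisecond
	for i := 0; i < total; i++ {
		rec := httptest.NewRecorder()
		req := httptest.NewRequest(http.MethodPost, "/policies", nil)
		start := time.Now()
		h.ServeHTTP(rec, req)
		if time.Since(start) > budget {
			slow++
		}
		if rec.Code == http.StatusOK {
			ok++
		} else {
			failed++
		}
	}
	return ok, failed, slow
}
```

To exercise a real client over HTTP, the same handler can be served with `httptest.NewServer(flakyInsurer(3, 4, 100))` and the platform pointed at the server's URL.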
We Processed 7+ Lakh Policies With Zero Data Loss
After 12 months in production:
- 700,000+ total policies processed
- 100,000+ monthly bookings (current scale)
- 99.9% uptime maintained
- Average response time: 800ms
- P95 response time: 1100ms
- Zero data loss incidents
- Zero financial reconciliation issues
Future Evolution: Beyond Travel Insurance
Our adapter-based architecture positions us for expansion:
Technical Roadmap
- Multi-region deployment for international travel coverage
- Machine learning to identify high risk corridors for better pricing
- GraphQL APIs for more flexible partner integrations
- Microservices decomposition for independent service scaling
Conclusion: Building Systems That Work in the Real World
Building a real-time insurance platform taught us that reliability isn't about perfect components, but about graceful handling of imperfect ones. Our success came from:
- Accepting that external dependencies will fail and designing around that reality
- Using async processing to decouple user experience from system complexity
- Choosing battle-tested technologies over trendy ones
- Investing heavily in monitoring and observability from day one
- Building self-healing mechanisms that reduce operational overhead
- Implementing robust authorization to protect backend services
The Indian travel insurance market was ready for disruption, not because of technology limitations, but because of integration and reliability challenges. By solving the fundamental problem of unreliable dependencies, we created a platform that processes 100,000+ monthly bookings with 99.9% uptime.
Our Most Important Lesson
Engineering isn't about using the latest frameworks or architectures. It's about understanding real-world constraints and building systems that work reliably despite those constraints. This platform represents everything we've learned about creating software that actually solves problems for real people. The 700,000+ travelers who got instant insurance coverage are proof that thoughtful engineering can transform the entire industry.
As India's digital economy grows, we anticipate that the principles we've applied here (reliability, scalability, and self-healing design) will become even more critical for the next generation of fintech platforms.