In early 2024, we started a migration project that seemed simple at first: move a B2B SaaS platform from on-premises servers to AWS. The client handled compliance workflows and document processing for large European customers. The business case made sense, but the actual work was much more complicated.
We inherited a Java monolith that had grown over 12 years, reaching 4,000,000 lines of code. It was built on Spring and included custom internal frameworks with many quirks and workarounds from years of development. The database was a 1.5 TB PostgreSQL setup with a primary node and one replica. There were 6 application servers, shared NFS storage, HAProxy for load balancing, and manual deployments with little centralized monitoring.
The business gave us a clear mandate: move the system to AWS without rewriting it. No microservices, no long innovation projects, and no major re-architecture. The goal was to move it safely, lower operational risk, and modernize the underlying infrastructure.
That limitation was the right choice. Still, it took us almost a month of preparation before we even started working with AWS.
The First Reality Check: On-Prem to Cloud Migration on AWS
We were told the system was loosely coupled. Technically, modules were separated into packages, and there was an effort to keep boundaries. In practice, though, this was misleading.
Shared utility libraries were woven through almost every module. Several features relied on local disk behavior: assumptions baked in so deeply they weren't documented anywhere, just understood. Batch jobs reached directly into internal services. The database schema had deeply nested foreign key chains that made it hard to understand the blast radius of any change. And buried in configuration files, we found hardcoded IP addresses. Not many, just enough to break things silently inside a VPC.
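Scanning for those hardcoded addresses became a repeatable check rather than a one-off grep. A minimal sketch of the kind of scan involved, with a hypothetical file layout and property names (the real project's config tree was far messier):

```python
import re
from pathlib import Path

# Hypothetical config file with an IP literal baked in.
Path("conf").mkdir(exist_ok=True)
Path("conf/app.properties").write_text("db.host=10.0.12.5\nqueue.port=8161\n")

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_hardcoded_ips(root: str) -> list[tuple[str, str]]:
    """Return (file, ip) pairs for every IPv4 literal found under root."""
    hits = []
    for path in Path(root).rglob("*.properties"):
        for ip in IPV4.findall(path.read_text()):
            hits.append((str(path), ip))
    return hits

print(find_hardcoded_ips("conf"))  # -> [('conf/app.properties', '10.0.12.5')]
```

Every hit is a candidate for replacement with a DNS name before the move into a VPC, where the old addresses silently stop resolving to anything.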
Where documentation existed, it was outdated. Some production cron jobs had no documentation at all. We only found them by observing the system in action.
We spent nearly a month just mapping dependencies before starting any migration code. It felt slow at the time, but looking back, that month saved us from several costly problems during the migration.
On-prem to AWS migration strategy: Lift, stabilize, improve
We made a deliberate choice early on to avoid the "cloud-native purity" trap, the instinct to turn every migration into a modernization project at the same time. We had seen that approach add months of scope and risk without delivering proportional value at the end of it.
Instead, we structured the work in 5 sequential stages: containerize the application, move it to AWS, stabilize it under production conditions, optimize cost and performance, and only then consider architectural decomposition. The target architecture was straightforward: Amazon ECS on Fargate, Amazon RDS PostgreSQL in Multi-AZ, S3 for file storage, ElastiCache for Redis, an Application Load Balancer, CloudWatch for logging, and automated backups throughout.
We considered using Kubernetes but decided against it. The extra operational work was not worth it for the first phase of a migration that was already complex.
AWS Cloud migration: Containerization was not trivial
We dockerized the application largely as-is, which is exactly where the hidden assumptions started popping up. The application expected a persistent local disk. Temp files were being reused across processing flows, something that works fine on a long-lived server and breaks immediately in an ephemeral container. GC tuning that had been calibrated for bare-metal behavior produced erratic results under container memory limits. Background jobs that had always had abundant CPU suddenly behaved differently in CPU-constrained environments.
Our first load test caused containers to crash with out-of-memory errors. We went through 3 rounds of changes to stabilize things: setting -XX:MaxRAMPercentage, switching to G1GC with container-aware settings, increasing Fargate task memory by about 30%, and moving all file storage to S3. Each change needed testing, regression checks, and validation with production-like loads.
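The container-aware JVM settings ended up looking roughly like this. The values below are illustrative, not the ones we shipped; each service was tuned separately against production-like load:

```shell
# Size the heap relative to the container's memory limit instead of
# the host's RAM, and use G1GC with an explicit pause target.
export JAVA_TOOL_OPTIONS="\
  -XX:MaxRAMPercentage=75.0 \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200"
```

The key shift is from absolute -Xmx values, calibrated years ago for bare metal, to percentage-based sizing that follows the Fargate task's memory limit.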
This phase took longer than we expected and set the tone for the rest of the project: uncover assumptions early, test in real conditions, and don’t move forward until the system is truly stable.
The 1.5 TB PostgreSQL Migration
The database migration was the riskiest part of the project. We used AWS DMS for the initial copy, logical replication for syncing changes, and aimed for RDS PostgreSQL 14 as the target. The database had multi-column indexes that collectively exceeded 200 GB.
DMS struggled with them: replication lag climbed past six hours, and DMS memory usage spiked to the point where we had to intervene. The fix was to drop non-critical indexes before completing the migration, then rebuild them after cutover. That one decision reduced replication pressure dramatically and gave us a viable path to the cutover window we needed.
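The drop-and-rebuild step looked roughly like the following. Index and table names here are hypothetical; CONCURRENTLY keeps the operations from blocking writes while they run:

```sql
-- Before the full load: drop non-critical secondary indexes so DMS
-- does not have to maintain them row by row during replication.
DROP INDEX CONCURRENTLY IF EXISTS idx_documents_customer_status;

-- After cutover: rebuild on the RDS target, again without blocking writes.
CREATE INDEX CONCURRENTLY idx_documents_customer_status
    ON documents (customer_id, status);
```

The trade-off is that some read queries run slower until the rebuild completes, so the list of "non-critical" indexes has to be agreed on with whoever owns those query paths.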
The second problem was subtler. The on-prem PostgreSQL instance had been tuned over the years by people who understood its specific workload. When we moved to RDS, that accumulated tuning didn't come with it. We experienced checkpoint spikes and elevated write latency during high-load periods: not catastrophic, but enough to be a problem in production. Getting autovacuum and write behavior stable required tuning max_wal_size, autovacuum_vacuum_cost_limit, and checkpoint_completion_target across nearly two weeks of iterative testing.
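Because RDS settings live in parameter groups rather than postgresql.conf, each tuning iteration was applied along these lines. The group name and values are illustrative; all three parameters are dynamic, so they take effect without a reboot:

```shell
aws rds modify-db-parameter-group \
  --db-parameter-group-name prod-pg14 \
  --parameters \
    "ParameterName=max_wal_size,ParameterValue=16384,ApplyMethod=immediate" \
    "ParameterName=checkpoint_completion_target,ParameterValue=0.9,ApplyMethod=immediate" \
    "ParameterName=autovacuum_vacuum_cost_limit,ParameterValue=2000,ApplyMethod=immediate"
```

Raising max_wal_size spreads checkpoints out, and a high checkpoint_completion_target smooths the write burst each checkpoint produces, which is what the spikes we saw came down to.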
The cutover itself was planned for one hour of downtime. We ran 47 minutes over. A long-running transaction blocked the final sync, and we had to manually terminate sessions before we could switch endpoints. Not catastrophic. Genuinely stressful. We documented exactly what happened and used it to sharpen our runbook for future migrations.
AWS Budget Strategy: Cost Was Not an Afterthought
The client made it clear that moving to the cloud shouldn’t cause their operating costs to skyrocket. We treated this as a design constraint from the beginning, not something to fix later.
Before we committed any infrastructure, we modeled the workload: compute load patterns across peak and average periods, database growth trends, storage usage, and batch workload frequency. What that analysis revealed was that 35 to 40 percent of compute load was predictable and steady, while 60 percent of batch workloads were burst-based. Peak loads were concentrated around monthly processing windows. That profile shaped every cost decision that followed.
Savings plans and reserved instances. We structured a one-year Compute Savings Plan for baseline ECS capacity and Reserved Instances for RDS. The key decision was not to over-commit: we locked in approximately 60 percent of steady-state compute, leaving the rest flexible. Over-committing sounds like better savings on paper, but it erodes your ability to adjust as the system evolves. The result was a 28 to 32% reduction on the compute baseline and approximately 35% on RDS compared to full on-demand pricing. The client was initially skeptical about committing to reserved capacity. We walked them through a three-year ROI projection, and that made the decision straightforward.
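The sizing logic behind that 60 percent figure is simple to sketch: commit to a fraction of the observed steady floor, never the average, because peaks inflate the average and would push you into over-commitment. This is a toy illustration with made-up numbers, not our actual model:

```python
def commitment_target(hourly_usage, baseline_pct=0.6):
    """Size a compute commitment from observed hourly spend.

    Use the minimum sustained hourly spend (the steady floor) and
    commit to only a fraction of it, leaving headroom for the system
    to shrink as it gets optimized. Values are illustrative.
    """
    baseline = min(hourly_usage)  # the spend that is always there
    return round(baseline * baseline_pct, 2)

# Hypothetical hourly on-demand spend in USD: a steady floor around
# 40/h with monthly batch peaks reaching 100/h.
samples = [40, 42, 41, 95, 100, 43, 40, 41]
print(commitment_target(samples))  # -> 24.0
```

Committing at 24/h in this toy example guarantees the reserved capacity is always consumed; anything above the floor stays on flexible, on-demand, or spot capacity.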
For batch-heavy, non-critical jobs like data reconciliation, report generation, and historical reprocessing, we used ECS tasks on Fargate Spot and EC2 Spot Instances for some background workers. We only moved workloads that were idempotent, retry-safe, and not user-facing. This led to up to 70 percent savings on batch compute and an 18% drop in overall monthly infrastructure costs compared to using only on-demand resources.
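The property that made Spot safe was strict idempotency: every job checks a durable completion marker before doing work, so a task killed by a Spot interruption can simply be retried. A minimal sketch of the pattern, with an in-memory set standing in for the durable marker store (we used object keys in S3):

```python
def run_idempotent(job_id, completed, work):
    """Run `work` at most once per job_id, even across retries.

    `completed` stands in for a durable marker store; in production
    that would be something like an S3 key per finished job.
    """
    if job_id in completed:
        return "skipped"
    result = work()
    completed.add(job_id)  # record success only after the work finishes
    return result

done = set()
print(run_idempotent("recon-2024-03", done, lambda: "processed"))  # -> processed
print(run_idempotent("recon-2024-03", done, lambda: "processed"))  # -> skipped
```

Note the ordering: the marker is written only after the work completes, so an interruption mid-job leaves no marker and the retry runs the job again from scratch. That is exactly why only retry-safe workloads qualified.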
We started with higher-tier RDS storage as a buffer. After two months of monitoring usage, we moved older data to S3 using lifecycle policies, reduced the growth buffer, and cut storage costs by another 10 to 12%.
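The archival step used standard S3 lifecycle rules along these lines. The prefix, day counts, and storage classes below are illustrative, not the client's actual policy:

```json
{
  "Rules": [
    {
      "ID": "archive-old-exports",
      "Filter": { "Prefix": "exports/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

Once data that only exists for audit purposes moves out of RDS and down the S3 storage tiers, the expensive database volume only has to be sized for the working set.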
Taken together, the optimization strategy produced roughly 25% overall cost reduction compared to a fully on-demand AWS model, with predictable monthly spend, no hardware refresh cycles, and no manual failover staffing costs. Compared to the previous on-prem TCO, factoring in hardware, support, and downtime impact, the model reached cost neutrality within 18 months and came out ahead after that.
What we would do differently
Looking back, there are a few things we would do differently. We would simulate the production load earlier, before finishing containerization, rather than use it only for validation. We would tune RDS parameters before high-load testing, not during. We’d also wait for 30 days of real usage data before making final cost commitments, instead of relying only on pre-migration models. These changes wouldn’t have changed the overall outcome, but each would have made the project smoother.
Final thoughts
This project was about making a business-critical system stable without adding new risks while reducing the old ones: observability, automated CI/CD pipelines, predictable performance, and a cost model the client could plan around. The monolith was still a monolith, and that was intentional. Attempting to break apart a 12-year-old system while migrating its infrastructure would have multiplied risk without a proportional benefit. We sequenced the work properly.
Once the system was stable and the team was comfortable with the new environment, we started the next phase: gradually extracting services, separating domains, and breaking down the architecture. This work is much safer when you’re not also handling an active migration. That second phase is a topic for another time.
Cloud migration isn’t the final goal. It’s just the starting point.
If you’re evaluating what your current platform can realistically support and where its limits are starting to show, that conversation is worth having. Modernization doesn’t begin with replacing everything. It begins with understanding what you already have and strengthening it with purpose. Contact us to learn more about how we can help modernize legacy systems.
