A Year of Cloud Conversations
Anyone who has worked at an early-stage startup knows the key to finding out what users want...is to continuously ask users what they want.
Granted, it isn’t always easy. Finding the right people is hard, and asking the right questions is even harder. However, there’s something positively magical about getting unfiltered feedback from people dealing with the exact problems you’re trying to solve.
Over the past year, we’ve spoken to hundreds of cloud experts about their teams, their roles, and their infrastructure. We’ve gathered insight from senior DevOps engineers at major enterprises, product managers at SMBs, and CTOs at rapidly scaling startups. Every one of them had something valuable to say, and every one of them had their own unique challenges and preferred solutions. The more we listened, the clearer the story of Seaplane became.
Here’s what we’ve learned after a year of conversations about the cloud.
Long before we started to formally gather user feedback, we knew from casual conversations that a product like Seaplane couldn’t be one-size-fits-all right out of the gate. Partially because any startup claiming to solve all your problems is probably lying, and partially because there are a lot of roads to Rome. Not every company takes the same path to the cloud, and not every company builds its applications in the same way. As much as we believe everyone should use Seaplane, we don’t want to sell sawdust to a lumber mill, if you catch our drift.
Our Target Organizations
Setting out, we knew our target audience would primarily consist of companies working on larger, more complex application stacks serving a geographically distributed user base. Put simply, we knew we wanted to work with companies running cloud-native applications with users in more than one cloud region. This is still a pretty broad definition, but our thinking went something like this:
Startups are interesting because they care about velocity and getting their app to the world ASAP. They want to minimize friction for their users without worrying about the complexity involved with being highly available. Opportunity is everywhere, so their apps need to be everywhere too.
Scaleups and enterprises are interesting because they also care about velocity, but the complexity in scaling is where the real challenge lies. Add in concerns surrounding geographically optimized delivery, respecting local data regulations, and engaging with multiple providers at the same time, and you have a very interesting conversation.
Despite our desire to cast a wide net, we expected that the “global” nature of our platform would scare off the startups, and that scaleups and large businesses would be the only ones interested in talking with us. Instead, we wound up speaking to participants from companies with anywhere from 20 to 20,000 employees.
It turns out cloud pain is pretty universal — which is good news for a company trying to address cloud pain!
Our Target Roles
With the kind of organization we wanted to talk to out of the way, we focused on who in that organization would be a good source of insight on current cloud needs and challenges. We wanted a cross-section of our target organizations to learn how current cloud tools affect their day-to-day work. While application development is primarily the responsibility of engineering departments, we found cloud complexity often leaks into other departments as well. As a result, we included respondents in company leadership (CxO), engineering, and product management roles.
Going into our research, we wanted to keep the questions consistent but the structure relatively loose. It wasn’t uncommon for respondents to go down fascinating rabbit holes about the many and varied ways they tackled their infrastructure needs. We saw a lot of value in those tangents, so we wanted the feedback to be qualitative first and foremost. As a result, this blog will mostly examine the narrative throughlines of our conversations. Sorry to all the graph lovers out there: the data and visualizations will have to wait for part two!
It wasn’t long into our research before patterns emerged across participants, and we were able to re-contextualize all the inconveniences, problems, and roadblocks into four major challenges.
“We are already at a point with just our distributed services where an individual engineer can't reason about what is happening. It’s just impossible to do. A single human being can't pull this together.”
- Staff Engineer
Deploying an application is all well and good, but deploying to multiple regions is where the real “fun” starts. Region-aware routing, global and local load balancing, geo-aware data services, auto-scaling, and capacity planning across many “little” clusters are all (relatively) trivial for a single-region setup, but they instantly become more complex once additional regions are introduced. In fact, everything becomes so complex that many organizations, even the ones with the time and resources to execute, decided not to go multi-region despite the benefits.
“I have seen multi-region design taking down sites more than keeping them up. It’s intellectually challenging to make it work effectively. With enough constraints you can be truly multi-region, but having your cake and eating it too? Complex to pull that off.”
Multi-cloud was an even sorer spot. While there was a tacit understanding that a multi-cloud model would solve for things like vendor lock-in, using the best services on a per-app basis, disaster recovery, and meeting customers where they are, it was so far beyond most teams’ resources that they couldn’t even entertain the idea. They came into the conversation already knowing the benefits; they just couldn’t justify hiring the infrastructure engineers necessary to execute.
“Once our team grows and we have other performance concerns we’ll address those needs by going multi-cloud for reliability, but there’s a lot of internal scaling that needs to happen first. We’ve thought about it every time GCP goes down.”
- Engineering Team Lead
Consequently, most of the multi-cloud deployments we encountered were the product of mergers and acquisitions rather than purposeful design. Engineers relayed nightmarish tales about the painful and expensive ways they’ve wired together disparate clouds and, unsurprisingly, they weren’t keen on giving multi-cloud deployments another go.
While sales and marketing love a good market expansion, the same cannot always be said for engineering. There is a ton of work that goes into supporting new regions, so much so that we found multiple instances of expansion plans being altered, delayed, or even cancelled due to the high overall costs of adding new regions.
“A lot of times you don’t do multi-cloud/multi-region unless you have to. You wait until the last moment to decide if it’s worth it, and even then you don’t really know if it is. Too often you get started not realizing the complexity and cost.”
- DevOps Engineer
Some of our participants described launching in new countries while relying on existing infrastructure, then having to contend with the horrible latency that followed. Not only did this strategy alienate users, it introduced tension in the business between sales and engineering. Instead of being a cause for celebration, new users cropping up in different locations became a source of panic.
“We have a lot of different customers from a lot of different places. It’s always a struggle with latency.”
- Lead Developer
Almost every person we spoke to brought up testing as a key concern when discussing cloud deployments. One participant visibly shivered at the thought of how much untested code is currently in production, which would be funny if it wasn’t terrifying.
“Customer environments are so bespoke you have to tailor the solution to each customer — which means runHooks, scripting, all sorts of things and each thing needs to be tested. They may have tested in a development environment, but they haven’t gone any higher. It really is frightening!”
- Cloud Consultant
The differences and incompatibilities between testing, staging, and production environments were the main culprits. Existing cloud deployment setups are often very different from what engineers run locally, and there is no amount of testing that can catch every error when the deployment environment is completely different from the local testing environment.
This problem is best summarized as the “but it works on my machine” conundrum.
“Deployment is slow with low confidence. We find differences in production and testing and the local environments all the time. Releases take forever to get out! We have this ‘best in class’ system that doesn’t do the things we need as a business.”
The alternative solutions used by our participants were imperfect at best. Running entire copies of a production environment just for testing purposes was considered too costly to maintain and too burdensome on an already overtaxed engineering team. The result was many companies doing minimal testing and staying on the lookout for inevitable bugs and breaks.
Most participants were deployed on a single cloud provider, either in a single region or in an active-passive configuration, meaning that if their provider went down, they went down.
“If AWS goes down, we go down.”
- Lead Developer
In December 2021 alone, AWS had three major outages, and so long as there are fires, floods, and clumsy fingers, there will always be a risk of outage regardless of which provider you use.
It was surprising how many organizations accepted outage-related downtime as the cost of doing business on public clouds. Many participants described the problem as a business decision reached with simple math: money lost during an outage < money spent implementing a solution. This held true from the smallest organization we spoke with (a startup with around 20 people) to the largest (an enterprise with 18,000).
“It comes down to a cost benefit analysis. A lot of the time when you get to specific numbers of what it takes to be highly available across regions companies come back saying they lose less money being down for fifteen minutes than it would cost to implement this solution.”
- Cloud Consultant
The teams that did have active-passive deployments rarely used them, as switching over was often more manual and time-consuming than waiting for Amazon or Google to get their services back online. For those participants, their disaster recovery plans were more like disaster recovery hypotheticals.
“The biggest issue [with Disaster Recovery] is data swing. We use Replication, but it’s not an easy exercise, switching back and forth between environments. We don’t do it often.”
- Senior Director, DevOps
Solving Cloud Complexity
All of the problems we heard, big or small, stemmed from one thing: cloud complexity. Managing cloud infrastructure is hard. Learning multiple disparate systems is hard. Building a platform team is hard. Hiring qualified engineers is hard. Every single moving part of the cloud adds up to a level of complexity that cannot be addressed without a lot of time, resources, or both.
Generally, we found participants tackled that complexity in one of two ways.
DIY Cloud Customization
Larger organizations with headcounts at or above 1,000 tended to err on the side of hiring to mitigate cloud pain. With bigger budgets and bigger teams, they have the time and resources they need to do the heavy lifting of implementing custom infrastructure solutions.
This usually takes the form of Kubernetes implementations. While Kubernetes is certainly a toolbox for building containerized platforms, it still requires a lot of expertise to build with and a lot of engineering hours to maintain. Kubernetes can scale, but the complexity compounds with scale. Multi-region, multi-cloud deployments usually require an entire team to run.
“Many companies say Kubernetes is great — that Kubernetes is an abstraction layer, and engineers never have to think about infrastructure. That is a flat-out lie. As an industry, we add layers, but all those layers leak and don't hide the bottom completely. Up to the application level, engineers still need to know how it works all the way to the bottom.”
- Staff Engineer
All that being said, the level of customization is unmatched, and allows participants with mature infrastructure teams to create dynamic, usable platforms for the rest of the engineering organization. While the total cost of ownership is quite high given the need to hire additional engineers, vendor bills are usually lower overall.
Startups and scaleups generally turned to managed services to maintain their shipping velocity and avoid getting bogged down in infrastructure work. SMBs with geographically diverse customer bases tended to do the same to avoid unsustainable or unnecessary hiring.
Managed services (like Google Cloud Run or AWS Lambda) allowed respondents to deploy with relative ease, and presented a lower barrier to entry for organizations that otherwise didn’t have or didn’t need DevOps expertise. Managed services could theoretically do the heavy lifting.
“Leadership decided to use everything AWS offers. It allows us to scale quickly, but we have to accept the resulting vendor lock-in.”
- Senior Front End Engineer
However, managed services are expensive and you need a lot of them to support a full-stack application. Plus, the more services you use, the more locked into a single provider you become. The total cost of ownership is still lower than hiring a platform team and building it all yourself, but parsing the bill at the end of the month is nigh on impossible. Not to mention wiring all those services together is no easy task, and once you’re operating at scale you need so many services working together in so many complex ways that you end up hiring a platform team anyway. Even then, with a full platform team behind you, if your provider of choice doesn’t have the tool you need, you’re left with inevitable gaps.
“We use some services from Azure and sometimes there’s not really an equivalent in AWS or GCP. That’s really a pickle because you need an alternative or you have to do a lot of extra work to do it on another provider.”
- DevOps Engineer
The Ideal Solution
The perfect cloud solution, then, needs to combine the best of both managed services and custom solutions. It needs to have the easy implementation and maintenance of a managed service like Cloud Run, with the scale and customization of solutions like Kubernetes.
And thus, Seaplane was born. In the interest of avoiding a sales pitch, you can read more about what we’re building (and how we incorporated this research) in another blog.
The quest for knowledge is never-ending, just like our product development cycle!
We’re always conducting user interviews to better understand the problems we’re solving and the people we’re solving them for. If you have any thoughts on our findings, recognize these challenges in your own organization, or just have some insights into the industry — we’d love to hear from you!
You can reach out directly or schedule an interview with our research team. We promise they’re very friendly (and we promise they didn’t make us write that). In the meantime, you can subscribe to our newsletter for updates or contact us below to get early access.
Thanks for reading, and stay tuned for our next batch of findings!