Platform Engineering with John Peña
We sat down with Correlated CTO John Peña for a live Ask Me Anything (AMA) event to talk about his experience with platform engineering, scaling a startup, and effective developer tooling. You can listen to the AMA or read our recap below, and follow us @seaplane_io for details on upcoming industry events!
Thank you so much for being here John! Maybe you can start by giving us a little bit of a background on what you're working on today.
As I mentioned I'm the CTO and co-founder of Correlated. Our product allows you to send us data about your customers. Our goal is to tell you which ones are the best in terms of their likelihood to use your product as a paid customer or to expand their usage. We tell you about things like which customers are going to convert, and then we give you all kinds of tools to automate your sales process around these kinds of leads.
We’ve been around for a little under two years and we have a team of 18 based all across the U.S., 10 of whom are engineers. We’re working across all kinds of different areas: we do a lot of data pipeline work and ETL, and we do some amount of data science and analytics. We also work on the web and our web app. So we've built out a lot of infrastructure in a lot of different areas. I'm excited to talk to you all about it!
When we're looking at internal engineering platforms, what are some of your main goals when you think about your internal engineering team?
I think there's a handful of things that I like to pay attention to, to make sure that the team is both productive and doing work that is serving our customers. One of the biggest things that I try to focus on is the amount of time it takes for us to produce something in the form of, like, a pull request and then get it out into production in front of customers. I want that amount of time to be as low as possible.
The platform we've built uses quite a lot of continuous delivery patterns to make sure that when we put something out there — like when we put a pull request up and it gets reviewed — it gets merged into master and within 10 minutes it's in production. We deploy sometimes dozens of times a day across a bunch of different services so we've put a lot of effort into making sure that our CD pipeline is really quick, really stable. We know when there're problems and I could talk a little bit more about some of the tools we've adopted to do that.
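As a rough illustration of the metric John describes, the time between a merge and its production deploy can be tracked with a few lines of Python. This is only a sketch under stated assumptions: the sample timestamps are made up, the names lead_times and summarize are hypothetical, and the 10-minute target is simply the figure quoted above, not a description of Correlated's actual tooling.

```python
from datetime import datetime, timedelta
from statistics import median

# Assumed target, taken from the "within 10 minutes" figure in the interview.
TARGET = timedelta(minutes=10)


def lead_times(events):
    """Return merge-to-production durations for (merged_at, deployed_at) pairs."""
    return [deployed - merged for merged, deployed in events]


def summarize(events, target=TARGET):
    """Report the median lead time and the share of deploys within the target."""
    durations = lead_times(events)
    within = sum(1 for d in durations if d <= target)
    return {
        "median": median(durations),
        "within_target": within / len(durations),
    }


# Made-up sample data, for illustration only.
SAMPLE = [
    (datetime(2022, 5, 2, 9, 0), datetime(2022, 5, 2, 9, 7)),
    (datetime(2022, 5, 2, 11, 30), datetime(2022, 5, 2, 11, 39)),
    (datetime(2022, 5, 2, 15, 0), datetime(2022, 5, 2, 15, 14)),
    (datetime(2022, 5, 3, 10, 0), datetime(2022, 5, 3, 10, 8)),
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Watching the share of deploys that land within the target, rather than only the median, makes a single slow pipeline run visible.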
Another thing that we pay a lot of attention to is obviously the quality of our systems and we've come up with a number of different SLAs for the services we provide to make sure that they are the highest quality possible and also fast. So we've adopted a ton of different observability tools to make sure that, not only are things functioning properly, but when they're not we know about them long before our customers see issues in production.
We’ve had a lot of fun building this stuff and I love to geek out on it!
Could you tell us a little bit more about the tools that you guys are using and why you picked them?
We are on Google Cloud and we really like Google Cloud. We've gotten a lot of value out of it. The tools are better connected to each other than really anything else that we've come across in AWS or Azure or any of the other cloud providers.
We deploy all of our services on Cloud Run. I think of it as very highly managed Kubernetes that does auto scaling and has a very simple contract with the developer. The idea is you give Cloud Run a container, like a Docker image, and it just needs to run something on a certain port over HTTP, and then Google handles all of the auto-scaling, a lot of the provisioning of resources for that service, and you kind of just let it run.
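To make that contract concrete, here is a minimal sketch of the kind of process a container could run, using only the Python standard library. Cloud Run passes the port to listen on through the PORT environment variable (8080 by default); the handler, the resolve_port helper, and the response body are illustrative assumptions, not Correlated's code.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def resolve_port(environ=os.environ):
    """Read the listening port from PORT, falling back to 8080."""
    return int(environ.get("PORT", "8080"))


class Handler(BaseHTTPRequestHandler):
    """Answer every GET with a small plain-text body."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=None):
    """Build an HTTP server bound to all interfaces on the resolved port."""
    if port is None:
        port = resolve_port()
    return HTTPServer(("0.0.0.0", port), Handler)


if __name__ == "__main__":
    # The platform starts the container; the process only has to serve HTTP.
    make_server().serve_forever()
```

Everything else John mentions (scaling, provisioning, routing) sits outside this process, which is what keeps the developer-facing contract small.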
So all of our services, all of our data pipelines, they all run within Cloud Run. It has allowed us to really not spend a lot of time figuring out some of the scaling challenges that I've encountered in the past. We can defer a lot of the work to tune a service properly until long after it's in production. We can watch it over time to understand what kinds of resource usage patterns it uses and then tune it properly as we go.
It’s really easy to get started. We launch everything pretty quickly, we can auto scale, we can deploy, we can revert. We can do all these things sort of at the click of a button. So that's been really useful for us.
Beyond that, we use pretty traditional tools. We do a lot of our data processing just in Postgres. We have a pretty large Postgres instance that we use to manage all of our customer data and to sort of weave everything together and make sense of it. We’ve gotten a ton of value out of it and we’ve really just scratched the surface of its functionality. But we still kind of live in a batch relational world and so that's where Postgres is providing us value.
We also do a lot of data processing in BigQuery. Basically any kind of analytic workload happens in BigQuery. We started to do more with some of the other major data warehouses that our customers have brought in to integrate with us. So we integrate with BigQuery, Snowflake, and Redshift, and we've started to have a better understanding of what value each of those provides and how to tune them to get the best out of them. But a lot of work is still just done in Postgres. Kind of old school.
You guys are obviously building a SaaS tool, so essentially anybody in the world can sign up for the platform — meaning you need to comply with a bunch of data regulations. How does that come into play and how do you shield your engineers from that complexity?
That's a great question. The way I think about dealing with PII and other kinds of sensitive data is really to have a good set of primitives to work with that give you a lot of freedom in what you're able to do and allow you to not have to think too hard about it.
One of the things we've been spending a lot of time doing is segmenting our customer data into different logical sets such that they don't intermingle and we have a lot of safety in our access patterns. The holy grail for me, and I think this is something that really excites me about Seaplane, is the idea of not only segmenting customer data physically — like in different disks or different locations — but also segmenting the resources that you use to access that data.
The holy grail for me would be: customer comes in, they sign up, we deploy a whole new set of our services into a regionally segmented set of serverless instances. The same customer, same organization, would use the same physical and virtual instances to compute, to do any kind of analysis, and to do data access all within a locked region. That's where I think a lot of the SaaS industry is headed and what I'm really excited about for the future.
So the way I think about that is really just giving those good primitives where everything is region locked from the get-go. You can assume that data is not going to leave the country of origin or the region of origin and customers have an instance running in their region. And you can do all this local compute without having a single, main spot where you land all your data.
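One way to picture a region-locked primitive is a service instance that refuses to resolve data for a tenant outside its own region. The sketch below is purely illustrative: Tenant, ServiceInstance, CrossRegionAccess, and the region labels are hypothetical names, and it does not describe how Correlated or Seaplane implement this.

```python
from dataclasses import dataclass


class CrossRegionAccess(Exception):
    """Raised when a service tries to reach data outside its own region."""


@dataclass(frozen=True)
class Tenant:
    org_id: str
    region: str  # hypothetical region label, e.g. "eu" or "us"


@dataclass(frozen=True)
class ServiceInstance:
    region: str  # the region this instance is deployed in

    def data_store_for(self, tenant: Tenant) -> str:
        """Return the tenant's store location, but only within this region."""
        if tenant.region != self.region:
            raise CrossRegionAccess(
                f"instance in {self.region} cannot access {tenant.org_id} "
                f"data pinned to {tenant.region}"
            )
        # One logical store per organization, namespaced by region.
        return f"{self.region}/{tenant.org_id}"


if __name__ == "__main__":
    eu_instance = ServiceInstance(region="eu")
    print(eu_instance.data_store_for(Tenant("acme", "eu")))
```

Because the check lives in the primitive, application code that calls data_store_for never has to reason about where data is allowed to go.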
How do you keep that manageable then from an engineering perspective for the team? How do you keep the complexity to a level where it's sustainable?
What you need to do is push a lot of it to the DevOps side. I'm really happy to be developing this kind of product in 2022 versus 10 or 15 years ago where you needed to actually engage with a data center operator in the different areas where you’re working or where you need to work. You need to potentially manage data centers across the world. You have to worry about routing traffic to the right data center. You have to worry about data you're writing not leaving that data center.
Fast forward to 2022 and there are much better primitives for dealing with those kinds of patterns where, again, the contract you can have with a developer is that you just need to produce a service running in a container. And you have primitives, either through something like Cloud Run or what Seaplane provides, where you can say, “I want this container running in these N regions across the world.” They should be accessing a local instance of your data that's in that region and you, as a developer, really don't have to worry about “where is this thing running?” You just know it's running in the right place.
I think a lot of that gets pushed to the DevOps side where you can start to actually describe where something should run in your code using something like Terraform or Google’s Libraries. We use a tool called Pulumi which is very similar to Terraform. What we do is we encode all of this in code and what it allows us to do is, as developers, specify all of our infrastructure in our code base.
We don't have to worry about exactly how it gets deployed. The tool’s usually the one that's managing it for us. That's the contract we have with Pulumi and with Google Cloud. When we specify these things in code we're just going to describe how they run and how they should run and then the tool takes over and does all the deployments and puts things where they need to be. That's really great from a DevOps perspective, but it also allows us to put these things through code review. We can put a new Pulumi file or resource up in code review and we can have developers review it. The whole process gets automated from there. Pretty nice.
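The declarative idea behind infrastructure as code can be sketched without any particular tool: you describe the state you want, and a program works out the steps to get there. The example below is a simplified illustration only. It is not Pulumi's or Terraform's API, the resource names and fields are made up, and real tools also handle dependencies, stored state, and the actual cloud calls.

```python
def plan(desired, actual):
    """Compare desired and actual resources and list the steps to reconcile them."""
    steps = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# Hypothetical resources: what the code base declares...
DESIRED = {
    "api": {"image": "api:v2", "region": "us-central1"},
    "worker": {"image": "worker:v1", "region": "us-central1"},
}

# ...versus what is currently deployed.
ACTUAL = {
    "api": {"image": "api:v1", "region": "us-central1"},
    "legacy-cron": {"image": "cron:v1", "region": "us-central1"},
}

if __name__ == "__main__":
    for action, name in plan(DESIRED, ACTUAL):
        print(action, name)
```

Since the desired state is ordinary data in the repository, a change to it shows up as a diff in code review, which is the benefit John highlights.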
A question from Twitter: do you have any tips for companies that don’t have a DevOps function?
We don't have a DevOps team. In a sense, all of our developers are doing their own DevOps. My biggest recommendation is finding a tool that can manage your infrastructure as code early on, before you build out too much of your cloud infrastructure, where you can start to adopt this tool from the beginning and then put everything through it. It requires some amount of strictness from the beginning, but if you can manage that it's really, really helpful. It pays a lot of dividends.
There are a ton of tools. There's Terraform, there's Pulumi, there's CloudFormation, things like that. Adopting that from the beginning, like getting serious about it, training your team on it, pays a lot of dividends. And you don't necessarily have to do it from the beginning, you can adopt it over time. Pulumi has some great tools for importing things you've created by hand into your code base. But as much as possible you need to just be strict about doing it in code as part of your code base.
On the flip side, what if you’re already an established company? How do you transition to that kind of model smoothly?
When I was at Twitter, they were making the shift from running everything on bare metal infrastructure to running everything through Mesos — which was sort of a precursor to Kubernetes — but very much the same kind of contract where you have a container, you specify the resources, and it just magically gets run. I think you definitely need to start small and either identify individual services that can be reprovisioned within your cloud infrastructure, or reprovisioned using infrastructure as code, or start more horizontally where you identify a product area where you want to start to port things over.
The best place to start is doing things through Docker. If you can start to adopt Docker as your deployment format then it helps a lot. If you're running things either in Supervisor or systemd or something like that, putting those use cases over in Docker is often a required first step. You can simply run your services through Docker on a machine that's running containerized Linux or something like that. If you take that first step it really alleviates a lot of the pain down the road. Then from there you can start to actually specify how would something run in a more serverless environment.
The nice thing is that the serverless stuff is like pretty off the shelf these days. You don't really need to run your own Kubernetes or some sort of serverless infrastructure on your own, you can get it off the shelf. All the major cloud providers have something like that. So once you start to get things into this Dockerized format, adopting these products is a lot easier. Then as you start to adopt them, managing them through code is where you can start.
A question from Twitter: what’s the number one tip that you’d give an engineering leader building an internal developer platform from scratch?
I would think really hard about the technologies that you adopt and try to adopt as few of them as possible. Try to adopt things that are more tried and true, that either you have a lot of experience with or that have existed out there for a long time.
A good example is Postgres. Postgres is a really old technology, but it's still incredibly useful. You can do kind of anything in there and get to the point where you grow out of it over a really long amount of time. The nice thing about something like that existing out in the wild is that people have put every use case you can think of through it. It’s very easy, especially with a small team, to get answers to any kinds of questions you have. It might sound a bit trivial, but you could go to Stack Overflow to get pretty much any answer you want about something like Postgres.
When we started out we put a lot of things into Firebase and it had some really exciting promises that we were excited to use. But what we found is that there's all kinds of use cases that it doesn't cover, and being on the bleeding edge is not so exciting when you run into those. We wound up porting all those use cases over to Postgres and more and more we find ourselves just turning to it as the Swiss army knife.
So I think for engineering leaders that are starting up today, find a couple of those things that you can rely on, that you understand the basics of, and that have been around for a long time. Things like Postgres, Java, a lot of monitoring tools, logging tools that have been around for a long time. You'll get really far with them before they start to break on you.
A question from Twitter: what would you say is the most exciting thing happening in the cloud space right now? You mentioned using tried and true platforms, but what are you looking at that's new and exciting?
There are three things. One is serverless, another is infrastructure as code, and the third is OpenTelemetry. I kind of talked about the first two already, but the thing I'm really excited about right now is OpenTelemetry.
OpenTelemetry is a set of standards for observability data from your infrastructure and your services. There's reference implementations for pretty much every language where, when you're starting a service, you install OpenTelemetry. It's going to collect a lot of information about how your service is running, automatically. Then you can add your own custom metrics as you go and also add traces and logs. So what you get out of this is that a ton of tooling plugs into it. So as you're running your services, it makes observability for those services really, really easy and standard.
When we started we plugged OpenTelemetry into all of our services and pretty much for free we could plug into Datadog, Splunk, Grafana, Google Cloud Monitoring — like any monitoring tool out there that you could think of. We've swapped some in and out and we've landed on a few that we really like, but we were able to do that because we adopted OpenTelemetry early. It gives you this standard way of monitoring everything within your infrastructure. I think it's really cool.
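The reason backends can be swapped so cheaply is that services instrument against one interface and the destinations plug in behind it. The sketch below illustrates that separation with the standard library only. It is not the OpenTelemetry API: Tracer, InMemoryExporter, and the span record layout are hypothetical stand-ins for the real SDK and its exporters.

```python
import time
from contextlib import contextmanager


class InMemoryExporter:
    """Stand-in for a monitoring backend; collects finished spans in a list."""

    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class Tracer:
    """Times a block of work and hands the result to every configured exporter."""

    def __init__(self, exporters):
        self.exporters = list(exporters)

    @contextmanager
    def span(self, name, **attributes):
        start = time.perf_counter()
        record = {"name": name, "attributes": attributes, "error": None}
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            for exporter in self.exporters:
                exporter.export(record)


if __name__ == "__main__":
    backend = InMemoryExporter()
    tracer = Tracer([backend])
    with tracer.span("load_customers", org="acme"):
        time.sleep(0.01)  # placeholder for real work
    print(backend.spans)
```

Application code only ever calls tracer.span, so changing monitoring tools means changing the exporter list rather than touching every service.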
It's still early, it's still a bit bleeding edge, so I'm kind of violating my own principle I just talked about, but I do think it's really powerful and it stands to get even more powerful as it's developed more. I think the last mile of composability between all your different monitoring tools is starting to be solved by OpenTelemetry.
A question from Twitter: how do you keep up your speed while building a platform? Do you prioritize the platform or your product?
I think there's always a push and pull. It's hard to say that there's one thing that matters more than the other. I think ultimately when you're building a company, you have to let your product lead. Your customers are going to be asking for things every day that they need, and it's hard to prioritize every last thing. But I do think, to the extent that you are solving for customer value and leading with that, all the rest will follow.
That said, I think building a platform is all about finding efficiencies that pay out over time, doubling down on them as much as you can, and getting really good at understanding your tooling. Understanding what it's good at, understanding what problems it solves for you, understanding what it doesn't. I think that really does lead from building the product. As you build the product, you're going to see where it breaks. You're going to see where people are upset with it. You're going to hear from angry customers and you just have to keep going and keep building.
As you go, if you focus on building customer value, I think all the rest follows.
Thank you again to John for his time! You can learn more about Correlated on their website and follow them on Twitter and LinkedIn for updates.
We host regular events on Twitter so be sure to follow us there and join us for our next live AMA with experts like John!
Some quotes in this transcript have been edited for clarity.