Internal Developer Platforms with Jon Skarpeteig
We sat down with Signicat Tribe Lead, Global Platform Jon Skarpeteig for a live Ask Me Anything (AMA) event to talk about his experience building an internal developer platform for multiple teams and functions. You can read our recap below, or listen to the full recording on YouTube.
Jon, could you please introduce yourself?
My name is Jon Skarpeteig and I'm the Tribe Lead, Global Platform at Signicat. Signicat is an identity solutions service provider; we provide a lot of solutions for banking, finance, health insurance, and government [with] trusted services online. That’s the gist of it.
Can you tell us a little bit more about the situation at Signicat and why you decided to build what you did?
Very quick backstory: Signicat has, for many years, been a hypergrowth company, meaning very rapid organic growth [with] a strategy of acquisition. A couple of years back we had outgrown the startup phase, and then we had a couple more companies joining, so we had a need to standardize on a platform. That's kind of the birth of the Platform Tribe and of platform engineering as a company, on a couple of levels. Also, two years ago, we made a strategic shift: we're really going to bet on this and get a centralized, unified platform to consolidate the tech stack, with all the challenges that entails.
You mentioned that there's this combination of multiple companies into one, what were some of the challenges that you were trying to solve with the new platform approach?
The business side was like, “we're now solving the customer problem three times over after our first acquisitions, because they were in the same space, all kind of doing the same thing.”
We had to kind of come up with, okay, we want to do our “best of breed.” We want to select the [approach] that is best for the market, and solves customer problems. For all the customers, these are different technologies, different tech stacks, different programming languages. It was kind of hard to go with, let's say the library approach, where you have some common things. That means we went for containers and microservices. Then we started having a lot of microservices and microservices…they kind of have needs, so we need a lot of tooling to support it. Then we looked at, okay, how do we support all of these different microservices? We want to do this in a way that is centralized in an internal development platform so that it reduces the cognitive overhead of the teams and that we have operational efficiency, essentially.
Was everybody still in a monolith there?
We had a bit of everything. We had one that was clearly a monolith, a very big one. We had one that was clearly microservices. And then we had one that was kind of in between, they had a few services that were separated.
We spent quite some time in the beginning trying to figure out “what's the right approach that can unify all of this?” One very concrete example is that we were on different hosting providers, and now we want to start looking at standardizing on one hosting provider. That has a lot of impact, because it's going to influence everyone that runs on a hosting provider that isn't selected. It's going to be a lot of work to try to transition and to migrate. So that is one of the things that we spent quite some time on. We wanted it to be a consensus-based outcome for this one, because we really needed to get the buy-in from the different parts of the company in order to ensure that this would actually be successful.
How did you deal with the culture side of things? There are two trends in the industry right now. [One is] very prescriptive like, “this is how we're going to do it.” The other is more about getting what the developers want and accepting the risk of having ten-plus different systems running that you have to support. How do you guys deal with that push/pull internally?
These are not new concepts, but looking at the internal development platform as a product — meaning you will have customers, meaning you will need to do marketing — we did the approach where we started off with a small group and said “let's get something running, anything running, that solves an actual problem that fits an actual niche for the development community.”
Then we start out with, let's call it the “reference customers.” We had one product that was the first product out, so we started with that one. Then, we really tailor-make the platform to solve the problems of that product group so that we have a good reference and we have a good story. Really leveraging these product management techniques that you want to look at.
At least make sure that your reference customers are happy. Make sure that you solve their problems in an excellent way instead of making everybody miserable trying to come up with a general solution that fits everyone [which] doesn't necessarily exist.
I got a message here in the chat: was there any worry that you'd be optimizing too much for that single customer? Not making it generalizable enough for the other customers in the company?
Yes and no. We were quite aware of this so the solutions we chose, we always went through with the generic ones. We came up with some principles, some basic things that we wanted to keep in mind. One was that, if we could avoid [putting] complexity onto the development teams and rather deal with that complexity on the platform or at the platform level, then that was preferred.
We also had a strong preference to look for established tooling, established tech. If it wasn't objectively better, if there were any doubts, then we would go with the dominant player just to make sure that there's more familiarity, [it’s] easier to hire for, that kind of thing.
The last one, which was very important for us being in the industry that we are (security and compliance): it has to be secure by default and it has to be compliant by default. Then we implemented guardrails to make sure that we stay compliant, even though we don't necessarily know all the ins and outs of all the different setups we have. One example: we leveraged the Open Policy Agent to have policy-based rules around what you can and cannot do, to make sure that we really enforce compliance so that we don't mess it up.
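[Editor's illustration] The guardrail idea Jon describes can be sketched in a few lines. This is only a plain-Python sketch of "policy-based rules that decide what you can and cannot do"; it is not Rego, not the Open Policy Agent API, and not Signicat's actual policy set. The `Workload` fields, the three rules, and the registry prefix `registry.example.internal/` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    """Hypothetical, simplified description of a deployment request."""
    name: str
    image: str
    run_as_root: bool = False
    labels: dict = field(default_factory=dict)


# Each hypothetical rule returns a violation message, or None if satisfied.
def no_root(w):
    return "containers must not run as root" if w.run_as_root else None


def trusted_registry(w, allowed_prefix="registry.example.internal/"):
    if not w.image.startswith(allowed_prefix):
        return f"image must come from {allowed_prefix}"
    return None


def has_owner(w):
    return None if "owner" in w.labels else "missing 'owner' label"


RULES = [no_root, trusted_registry, has_owner]


def evaluate(w):
    """Return every violation; an empty list means the workload is allowed."""
    violations = []
    for rule in RULES:
        message = rule(w)
        if message is not None:
            violations.append(message)
    return violations
```

The design point is that the rules live in one central place on the platform side, so a product team gets a clear list of violations instead of having to know every compliance requirement up front.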
I imagine there's one happy path they can take and then there’re a lot of barriers on doing things the wrong way.
“Customers” for us [means] internal customers because we have a product portfolio that is not the platform. But, we did also make an active choice [to go with] the Golden Path approach. So the 80/20, we really optimized for the 80 percent. We will still support the specialized needs for the specialized tooling, but that is not the streamlined path with the documentation and boilerplate code and templates that you can just get out of the box.
Are you slowly working to put that 20% into the developer platform as well, or are you thinking the one-offs don't fit the model you've created?
Yeah, though we're gonna have some things that don't fit. One product that we offer is a qualified timestamping authority, and to be a qualified timestamping authority comes with a lot of compliance requirements. We need to have a hardware security module, we need to have lists of the people that have physical access to the hardware, all kinds of processes. It’s not something that we want to enforce on all the products. We have all the operations we have so that will stay kind of on the outside and I don't think that'll change.
Did you evaluate anything else beyond the approach you took? Maybe you can shine some light on why you didn't go that way.
Since the inception of the strategic shift we [said we] are no longer a startup, we’re a scale up. That means we have to look at how we can automate business processes and not customize contracts for every single customer. There was a lot of automation there that we didn't really have before. So with that as the outcome, we started working our way back. What do we need in order to accomplish this? It wasn't obvious.
If you have a monolith, that is extremely efficient to operate. It's quite easy. All the dependencies are known; you share the failure domain which is quite useful. We probably could have stuck on that one for a bit longer, but the second we got multiple programming languages and multiple tech stacks, trying to connect them together…we couldn't avoid having multiple services. We didn't really consider not running these services and interconnecting them, because that would mean reinventing some of our products. The cost factor of this was not really applicable.
Someone asked on Twitter: what about the budgeting for all of this? Obviously a lot of these tools, especially the managed ones, are a little bit more expensive. Did you do a rough calculation beforehand? How did you guys approach that?
The operational efficiency and scale up: that's the key to get right. The previous setup was based on virtual machines, and with our growth factor we had a significant amount of buffer that we could grow into so that we didn't have to provision machines in the middle of the night. We had some fairly manual capacity planning there, I'd say. And also, tied into the release process, we had blue-green deployment, which meant that we had enough capacity in the standby VMs to take the entire load when we did software upgrades.
With the new cloud native approach with containers, the bin packing is much more efficient. The autoscaler is fairly fast. That means that this overhead, we can reduce that significantly. So we ran the numbers on “okay, what's the hosting cost for us?” essentially, cost of goods sold. Then we compare the new possibilities of cloud native and internal developer platforms with this virtual machine-based [solution] with the overhead and the calculation is quite pretty [laughs].
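[Editor's illustration] The capacity argument above can be made concrete with a back-of-the-envelope sketch. All numbers and parameter names here (`growth_buffer`, `headroom`, a peak of 100 units) are made up for illustration and are not Signicat's figures; the model also ignores the temporary surge capacity a rolling update would need.

```python
def blue_green_vm_capacity(peak_load_units, growth_buffer=0.5):
    """Two full environments (active + standby), each sized for peak load
    plus a pre-provisioned growth buffer."""
    per_environment = peak_load_units * (1 + growth_buffer)
    return 2 * per_environment


def autoscaled_capacity(peak_load_units, headroom=0.2):
    """One environment that scales with load, keeping a smaller headroom
    because new capacity can be added quickly."""
    return peak_load_units * (1 + headroom)


def overhead_ratio(provisioned, peak_load_units):
    """How many units are provisioned per unit of peak load."""
    return provisioned / peak_load_units


peak = 100  # hypothetical peak load, in arbitrary capacity units
vm_based = blue_green_vm_capacity(peak)
cloud_native = autoscaled_capacity(peak)

print(f"VM + blue-green: {overhead_ratio(vm_based, peak):.1f}x peak")
print(f"Autoscaled:      {overhead_ratio(cloud_native, peak):.1f}x peak")
```

Under these assumed inputs, the standby environment and the growth buffer dominate the VM-based figure, which is the overhead Jon says the container-based setup let them reduce.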
We also got significant backing from our investors to really accelerate that journey. So we actually spent something like 30% of the developer capacity to really accelerate this journey. We got good backing and the numbers were really pretty because of the rearchitecting so that also enabled us to really accelerate the journey.
You did the calculations beforehand, is it matching up?
Surprisingly close, actually! I think we're off by something like 10%. Anybody that's tried to estimate like, moving the entire company over to a different hosting provider has got to know that's not the ideal growth scenario. That’s hard to predict.
When you look at the whole developer platform, obviously there's the cost of running it, but there's also the opportunity cost of building it. Did you guys take that into account at all?
[We had a business problem where we solved] the same customer problem three times over from three years of business and [two acquisitions in the same problem space]. That's an easy story. If you can reduce three companies into one that just gives you better operational excellence.
We had confusion on the sales side as well because it's like, okay, which one of these do I sell because they are so similar in many regards. That kind of helps. But I mean it's always a tradeoff, because it's a strategic shift. We're changing the architecture. We're changing the infrastructure. We're also changing the way we work because introducing platform engineering by itself isn't really useful unless you empower the teams with more team autonomy in the organization, make people actually responsible so we have this ability to own it. That is a big shift in the way of thinking and [a big shift in] who's responsible for what, so this is a huge risk.
So when we were kind of halfway through the initial stage we also got an external consulting company to really go through and ask “okay, what do you guys do, does this make sense?” as a risk mitigation to see if we were on the right track. They agreed with us and we got really good feedback, which made us more confident that we were on the right path to see this through.
A question from the chat which you touched on earlier: how did you get buy-in from the engineers? You need not only all the decision makers to decide that this is what they want to do, eventually you also want happy engineers using the platform. How did you approach that?
You will have the full range. Anybody that's worked with the customer over time, you know you will have some early adopters, you will have some enthusiastic ones, and then you will have somebody that's a bit on the fence and is just going to see how this plays out. You're going to face the full range, but our approach was that we were really keen on the network effect. So having somebody that wasn't afraid of trying it out see how it could benefit them, then really using that to showcase “this is how they did it, this is how it works for them.”
We had this transitioning journey. In the beginning it was very hands on. We had the platform group that was generally the enthusiastic ones, because to them we were going to offload some of the work and really automate more things which is quite fun to work with from the platform side, but we also realized that if nobody used the platform, it's not valuable. So in the beginning we were hands on, to the point where we're committing deployment code, helping build the CI/CD pipelines for the individual product teams, and then after a while that didn't scale anymore.
There were more and more teams joining, more and more products that we're looking to get to, and then we're more in a guiding phase. Like, we will write the how-to guides and getting started guides and the best practices to direct people. And then of course the target goal is the Golden Path that will be self explanatory. It’s like the obvious choice that this is the way to do it and other people are able to do this without getting guidance or getting somebody from the platform group. Right now we have teams in all of these three categories, but the ambition is to really shift towards the self-service internal development platform.
I imagine you also have some internal champions of the teams that kind of take over that role of doing the guidance and helping the newer engineers start with the platform.
With this being a strategic shift, there was also a very good backing from management and from the architecture group. So you had really the top down support, which is really useful when you're doing this big of a change.
A question from the chat: what were some of your milestones building this platform to decide that you were on the right track?
The first one was, okay, we have to agree on the general approach. We have to pick the best of breed, we have to pick the supporting tooling, specifically for the ones that were all-encompassing, like the hosting provider. Then we had a few iterations on the observability stack because we quickly discovered that just shipping the tools is not enough, it has to be batteries included.
Then we kind of defined this…we call it “ready for validation.” We had a very specific target where we could onboard a pilot customer on the new platform. Of course not everything is ready by then and not everything is going to be perfect, but it is sufficient enough that we can actually get customer validation. Getting the customer validation — that was what we really aimed for. That was the first milestone. Because without customer validation, we're afraid. We don't know if we are on the right track or not. So, that was the big one!
Then of course we had this “ready for service” milestone: now that we have the pilots done, we start actually onboarding regular customers. Those were the two biggest milestones.
It sounds a lot like a startup — getting an MVP ready, getting your first customer on there, getting feedback. Did you also have any milestones for the customers themselves or things you wanted to make sure that they were able to do as proof you built the right thing?
This is where it becomes a little bit strange because we already had existing products in the market. So the first iteration was about replicating that functionality in a cloud native, cost efficient way. But with the scaleup in mind, doubling down on self-service was a key thing. Really enabling a self-service journey to an extent that we hadn't before was very important for us.
But since we didn't really have any very good measurements from before, we don't really have a good baseline to compare it to. That is one of the, let's say “regrets”— that we didn't get the proper baseline so that we could have actual numbers to show for it. It didn't seem necessary at the time, but in hindsight that would have been very useful now to see if it’s actually working, if you can track this over time. So the baseline is always kind of skewed because we don't have good numbers from the previous setup.
What are some of the other things that you would do differently?
We also got some feedback on this from the third-party consultant firm that kind of went through it…I'm quite happy that we were able to leverage new technologies, really get the benefit of cloud native, so that was successful. But we also see that the tech alone is not enough [laughs]. If you are a developer, if you are in the tech space, then this kind of resonates with you like, “oh, I can now access production and roll out at will! I have canary deployments and all of these things that were hard to do before!”
But really getting predictability. Getting the business side more informed and engaged. Having a project plan that was a bit more detailed than what we had with a bit longer time horizon and not just focused on “we need a pilot, every focus is on the pilot!” But when we get there, then what? Getting some more clarity on that one, I think could have improved the journey organization.
If somebody's starting this journey today and building a developer platform in a situation similar to yours, what would you tell them to do or not to do?
Do what you're doing today! Engage with the community and get inspirations from others. Learn from others that are ahead of you on the curve. I found it immensely useful just to get the terminology right because it fits with the industry. There's a lot of good little snippets of things like “this is a good argument for this thing that I'm trying to explain to people outside of tech.”
You're not alone! Please be sure to learn from others and take it from there.
Thank you so much for your time, Jon. We really appreciate it.
Thanks for having me!
Thank you again to Jon for his time! You can learn more about Signicat on their website and follow them on Twitter and LinkedIn for updates.
We host regular platform engineering events (like this one!) so be sure to follow us on LinkedIn and Twitter and join us for our next live AMA with experts like Jon!
Some quotes in this transcript have been edited for clarity.