For a client of mine we used Azure B2C with custom user flows, and we used Azure Service Fabric. Both in a production environment. And we decided to move away from both.
Why this happened is something I am going to cover in this postmortem. I write this blog so others can better decide when to use these powerful tools and when to refrain from using them.
TL;DR
Azure B2C: a powerful IAM solution. The custom user flows: very specialist tooling that requires in-depth specialist knowledge. There is good documentation on it, but very few actual examples. With the documentation missing the nuances of real-world scenarios, it is very hard to learn by just using it.
- Azure B2C custom user flow documentation is lacking when it comes to actual production experiences.
- Azure B2C custom user flows use an XML configuration approach that is hard to understand, and not that many people are actually using it.
- Azure B2C is more limited in feature set than Azure AD. Not problematic, but deceptive.
Service Fabric: a powerful micro-service orchestration tool with optional, interesting programming models. Just don’t be fooled by the simplicity of the programming models it provides, like the stateful service and actors: they require serious thought to be effective, require specific architectures to work properly, and require additional administration to be useful in production environments.
- Missing documentation on actual SF production experiences.
- High administration costs for SF.
- Increased complexity with SF without being able to reap equally big benefits from it in the short run.
- Distributed services require their own architectural patterns to be usable.
- Moving code to a micro-service architecture is hard to do in small increments.
- You should have automated a lot before moving to SF: data should already be flowing smoothly throughout the landscape, and the same goes for services.
We were losing flexibility way too quickly without gaining as many benefits as we would have liked. It dawned on us that we had no benefit from IaaS and should look more at PaaS or even SaaS for a lot of the non-core systems we had.
The business case
It started with the client wanting a well-performing system. We all know a micro-services architecture can help with that, but it is hard to pull off. The promise of SF was to give almost infinite scalability when using the actor pattern, while at the same time providing a programmer-friendly solution with typed data storage and fluent remoting capabilities, and being blazing fast while doing so.
For their IAM they had this simple request: we need an application specific account to be created when a user does a sign-up, and we want our application specific authorization information on the token. To stay within the Microsoft ecosystem and leverage the power of the cloud, they looked to Azure B2C with custom user flows to provide this. On top of that, the pay-by-call offering of Azure is really competitive with alternative hosted solutions.
Implementing the SF infra
The first step was to move away from the previous solution. For hosting they had Function Apps, and for IAM they had plain Azure B2C without any user flows.
Spinning up an SF cluster on Azure proved to be simple. They put API-M in front of it with SSL offloading. And their Blazor WebAssembly didn’t even have to point at another endpoint, because it was already looking at API-M.
However, API-M’s offering isn’t without its downsides. It adds a substantial amount of delay to each call, it only allows 4000 calls/sec per instance, and you don’t get high availability with just 1 instance. Service Fabric does have a load balancer built into it, and it does have built-in capabilities to negotiate SSL connections without the app needing to bother.
So we decided to move away from API-M and connect directly into the cluster. This brought us to another issue: the built-in SSL wasn’t working. It just didn’t work, and it was impossible to pinpoint why the certificates didn’t get loaded. No one ever had this problem before (of course), and the issues that were reported came from local development systems not being able to run their local cluster. That shows how many people actually got to use this in production…
Since the SSL is for an application running on machines that can be wiped in an instant, getting certificates to it manually became the next challenge. We ended up getting the certificate through our own Key Vault connection on application start, so we had the certificates available when the connections got instantiated. Custom code in a place where I would rather rely on code from people smarter than me…
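To give an idea of what that workaround looked like, here is a minimal sketch, assuming the Azure.Security.KeyVault.Certificates and Azure.Identity packages; the vault and certificate names are placeholders, not our actual setup.

```csharp
using System;
using System.Security.Cryptography.X509Certificates;
using Azure.Identity;
using Azure.Security.KeyVault.Certificates;
using Microsoft.AspNetCore.Hosting;

// Illustrative only: pull the SSL certificate (with its private key) from Key Vault at
// startup and hand it to Kestrel, instead of relying on SF to load it from the node's
// certificate store.
public static class KestrelCertificateSetup
{
    public static X509Certificate2 LoadFromKeyVault()
    {
        var client = new CertificateClient(
            new Uri("https://my-vault.vault.azure.net/"),       // placeholder vault
            new DefaultAzureCredential());

        // DownloadCertificate returns the certificate together with its private key.
        return client.DownloadCertificate("my-ssl-cert").Value; // placeholder cert name
    }

    public static void UseCertificate(IWebHostBuilder webBuilder, X509Certificate2 cert)
    {
        webBuilder.ConfigureKestrel(options =>
            options.ConfigureHttpsDefaults(https => https.ServerCertificate = cert));
    }
}
```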
Another issue we ran into early was the size of the VM needed to run SF on. By default, SF isn’t configured correctly, so it blows up, uses all available disk space, and services just stop working. Really, really not cool, Microsoft! Yes, they say you need to run on bigger instances by default, but the problem is that the default configuration needs at least 100GB, and can still run into issues. The cleaning up just isn’t configured correctly. VMs with more than 100GB are very expensive, even for a test environment, and you need at least 3 to create a cluster.
Moving on to another issue we ran into early: quorum loss. SF runs a RAID-like, replicated setup of its own services and your application. But you need to make sure SF keeps its quorum, even more than your application does. While 3 nodes can survive the loss of 1, there are a lot of things that happen automatically in the background that can cause at least 1 VM to be unavailable, and if this coincides with another issue, your whole cluster is destroyed. We learned this the hard way. You need at least 5 machines if you don’t want to rebuild your cluster once in a while. And even that is tight.
Another thing you want to do is make sure SF knows early that your application is down. It uses health probes to know the health of its nodes, and it can use this information to decide on delaying impactful actions like upgrading a VM, moving your application to a new version, or, while moving to a new version, sensing that something is wrong with the new deployment. There are 3 ways to get a message to SF:
- On startup of your application, just throw an exception and exit the application if you think the code can’t run well; SF will pick this up through the event source it listens to.
- Get an event source message with the failure to SF manually, within the grace period of an upgrade.
- Provide an HTTP endpoint that the load balancer can look at to decide whether your application is healthy enough to receive load, and that the VM scale set can look at when deciding to do maintenance on 1 of the machines.
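Two of those signals, in a minimal sketch. The source id, property name and the minimal-API hosting are illustrative assumptions; in a real SF service the web host usually lives inside a communication listener.

```csharp
using System.Fabric;
using System.Fabric.Health;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

// Pushing a health report yourself, e.g. when a dependency check fails during the
// health-check grace period of an upgrade. 'partition' is the IStatefulServicePartition
// Service Fabric hands to your service replica.
public static class HealthSignals
{
    public static void ReportUnhealthy(IStatefulServicePartition partition, string reason) =>
        partition.ReportPartitionHealth(
            new HealthInformation("MyService", "DependencyCheck", HealthState.Error)
            {
                Description = reason
            });
}

// A plain HTTP probe endpoint the Azure load balancer / VM scale set can poll.
public static class ProbeHost
{
    public static void Main(string[] args)
    {
        var builder = WebApplication.CreateBuilder(args);
        builder.Services.AddHealthChecks();

        var app = builder.Build();
        app.MapHealthChecks("/health");   // point the probe at /health
        app.Run();
    }
}
```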
This moves us to reporting on the health of SF and your application(s). Another point of concern. Every dependency that can impact the state of your application needs to be visible. With SF, you are responsible for the smooth operation of a lot of components. That isn’t immediately clear when you first look at the offering. And thinking ‘it will probably run OK out of the box’ is a very naive approach: it won’t and it can’t. So to have a good understanding of the health of an application on SF, you need to know: if the application is running well, if all partitions are running well, if all standby nodes are running well, if the SF infra is running well, if the machine is running well, and if other infra components (like the load balancer) are running well. Getting these metrics out of all these systems uniformly, getting them on a dashboard, merging them with your application state… and then keeping everything healthy. That is an administrative task that requires a whole different skillset than the original ‘just add SF with Actors and it scales’ promise!
And now the hottest subject, the one that killed our application countless times. Migrations. You see, stateful services are a combination of data and code. And while upgrading, you essentially run 2 differently typed versions on the same data at the same time. This tight coupling of data and type gets really problematic. My personal belief is that to migrate you should not change your data format, but move over to a new version of the service instead. And then somehow get the data over to the new service, or let the old data live in the old service and use an intermediate service, like a librarian, to point to the right data. This is very costly to implement, I know. But this is how you should think if you go distributed-micro-service-always-available. The alternative to this is nightly migrations, with a lot of orchestration and tooling needed.
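For context on why the coupling bites: reliable collections serialize your state with DataContractSerializer by default, and during a rolling upgrade old and new code read the same serialized records. Here is a sketch of the only kind of change that stays painless, with illustrative type and field names.

```csharp
using System.Runtime.Serialization;

// Additive, backwards-compatible change: the v2 field is optional, so v1 code simply
// ignores it and v2 code tolerates its absence. Renames, type changes and re-partitioning
// are where the real migration pain starts.
[DataContract]
public class CustomerRecord
{
    [DataMember] public string Id { get; set; } = string.Empty;
    [DataMember] public string Name { get; set; } = string.Empty;

    // Added in v2 of the service.
    [DataMember(IsRequired = false)] public string? LoyaltyTier { get; set; }
}
```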
And while we are at it: the tooling around SF is very limited. My personal experience is that you want to write your own administrative interfaces as soon as possible. The documentation is in limbo between PowerShell flavors, moving away from an old version, and the SF API calls are very difficult to operate manually. But you need to have these tools ready to quickly respond to situations.
A specific tool that is not as useful as you would think is the backup service provided by SF. It is meant for returning to a point in time, not for disaster recovery. To put it mildly: it is not usable for a production system. You need the backup API to browse through the data, manually looking at the data (or manipulating it) is not possible without writing your own parser, it stores data per partition, and you need knowledge of the old application’s partitioning setup to properly put things back. Good luck using this tool, even with manual commands, to restore to a new cluster. Don’t rely on this backup with the current toolset to provide for a scenario like that. This backup service might be useful if you are going to invest time in how it works under the hood and build a lot of custom tooling around it, probably with your own database of metadata, so you can better support a few common recovery flows. We didn’t have the time to build those tools, and we didn’t count on needing them. So we had a few incidents that left permanent marks on the reputation of SF being a useful tool out of the box. The ‘business guys’ especially could not understand that, even though we had a backup, we could still suffer (partial) data loss.
Implementing micro-service patterns into SF-enabled code
Aside from SF being infra you need to administrate, programming on it should be a breeze. It provides a lot of interesting patterns through its APIs.
We did start with our application being a CQRS system. There is a lot of potential to scale up a system like that. When we started moving to SF, each domain had its own code base, but the API that brought it all together was just 1 solution. So while it looked like a micro-service with DDD-like boundaries, it behaved like a monolithic application in practice.
Then the first issue arose: how to change the code to leverage the Actor model? Can we just replace a DDD instance with an Actor?
No. There is this ‘minor’ notion of Actors not performing well if you put a lot of calls to 1 instance, or if you need to do a cross-actor query. So very quickly the Actor model wasn’t looking like that good of a fit without changing the whole code architecture. It works well if you have each user interacting with their own actor on a large scale, but if you have, for instance, a root DDD object and you put this into an actor, you instantly have a bottleneck that can’t be scaled.
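A sketch of the difference, with a hypothetical IBasketActor contract and a placeholder service URI; the contract itself doesn’t matter, the addressing does. One actor per user spreads load across the cluster, while a single shared ‘root’ actor serializes every call through one instance because of the actors’ turn-based concurrency.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Client;

// Hypothetical actor contract.
public interface IBasketActor : IActor
{
    Task AddItemAsync(string itemId);
}

public static class BasketCalls
{
    public static Task AddItemForUser(string userId, string itemId)
    {
        // One actor per user: the ActorId spreads instances over the partitions.
        var perUserActor = ActorProxy.Create<IBasketActor>(
            new ActorId(userId),
            new Uri("fabric:/MyApp/BasketActorService"));   // placeholder service URI

        return perUserActor.AddItemAsync(itemId);
        // If every request instead went to ActorProxy.Create<...>(new ActorId("root"), ...),
        // that single actor would become the unscalable bottleneck described above.
    }
}
```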
But luckily SF provides a more ‘basic’ stateful service. It works a bit like a document DB, but typed. So we implemented this in a service-repository pattern. Easy to start with, easy to understand. A stateful service for each domain. It works like this: the API gets a command that ends up in its own business logic (BL). This BL then calls the appropriate stateful service through remoting. This stateful service then looks in its typed data stores and changes things.
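Roughly, the stateful service side of that pattern looks like the sketch below, assuming the Microsoft.ServiceFabric.Services packages; names are illustrative, and the remoting listener registration (CreateServiceReplicaListeners) is omitted.

```csharp
using System.Fabric;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Remoting;
using Microsoft.ServiceFabric.Services.Runtime;

// The contract the business logic calls over remoting.
public interface ICustomerStore : IService
{
    Task UpsertAsync(CustomerRecord customer);   // CustomerRecord: the [DataContract] type sketched earlier
    Task<CustomerRecord?> GetAsync(string id);
}

// The stateful service keeps the typed data in a reliable dictionary.
public sealed class CustomerStoreService : StatefulService, ICustomerStore
{
    public CustomerStoreService(StatefulServiceContext context) : base(context) { }

    public async Task UpsertAsync(CustomerRecord customer)
    {
        var customers = await StateManager
            .GetOrAddAsync<IReliableDictionary<string, CustomerRecord>>("customers");

        using var tx = StateManager.CreateTransaction();
        await customers.SetAsync(tx, customer.Id, customer);
        await tx.CommitAsync();
    }

    public async Task<CustomerRecord?> GetAsync(string id)
    {
        var customers = await StateManager
            .GetOrAddAsync<IReliableDictionary<string, CustomerRecord>>("customers");

        using var tx = StateManager.CreateTransaction();
        var result = await customers.TryGetValueAsync(tx, id);
        return result.HasValue ? result.Value : null;
    }
}
```

The BL side then just creates a ServiceProxy for ICustomerStore and calls it like a local repository, which is exactly the programmer-friendly part of the promise.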
However, the nuances of how to use a distributed database are quickly lost if you ‘just implement’ this. For instance, if you create a single-partition stateful service, you can do magic in this thing. It is really fast (really, really fast), and since you are so close to the data, even multi-record spanning queries are very fast and consume very few resources. So yeah, if you have a DDD model and store the instances of your root object there, it can handle a lot of load. But with just 1 partition you are still limited to running on 1 machine in SF, so you can only scale up instead of out.
But how to run this on multiple partitions? Well, you can’t change this afterwards. Can’t. You see, SF doesn’t know how your data is partitioned, only you do. So… we basically got f*cked there. We ran this in production and it works very well on 1 partition, but we understood this wasn’t the way forward.
What we should have done instead was implement a ‘librarian’ that handles the cross-partition calls and points to the right partition. And have a data migration strategy to move from one partitioning strategy to the other smoothly, with code still running.
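The routing half of such a librarian could look like this minimal sketch, assuming the CustomerStore service from the earlier sketch runs with a uniform Int64 partition scheme; the hashing scheme is ours, SF only sees the resulting partition key.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using Microsoft.ServiceFabric.Services.Client;
using Microsoft.ServiceFabric.Services.Remoting.Client;

public static class CustomerStoreRouter
{
    private static readonly Uri ServiceUri = new("fabric:/MyApp/CustomerStore"); // placeholder

    public static ICustomerStore ForCustomer(string customerId)
    {
        // Stable hash of the customer id -> Int64 partition key. If the partitioning
        // strategy ever changes, this is the single place that knows about it; moving
        // the existing data still has to be handled separately.
        long partitionKey = BitConverter.ToInt64(
            SHA256.HashData(Encoding.UTF8.GetBytes(customerId)), 0);

        return ServiceProxy.Create<ICustomerStore>(
            ServiceUri, new ServicePartitionKey(partitionKey));
    }
}
```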
A bit on the partitioning of data: if you look at Cosmos DB, for instance, it is easy to give it a partitioning scheme that will net you a few hundred to a few thousand partitions, and there is no additional cost for each partition. With SF it works roughly the same, but it runs a few secondary instances of each partition for reliability. It still bundles things up per service/partition, and it’s not as fine-grained as you would think it would be. You don’t want to manually administrate these partitions or have to restore a backup manually.
Implementing B2C
Luckily the B2C implementation was easier to do. Actually, it was very easy to get the basics working.
However, problems started with the use of PKCE in our SPA applications. This is the new way to go with OpenID auth flows, but there isn’t that much experience with it on the internet yet, compared to the old ways. And B2C is in transition towards fully embracing PKCE, with functionality even missing that was possible in the old flows.
For instance, there is currently no possibility for server-to-server communication with tokens on B2C. Just… not there. We implemented our own API tokens to do so. Not cool, Microsoft, not cool!
And back to the issue with our SPA application: getting the Blazor WebAssembly app to properly inject endpoint-specific tokens for each different endpoint was not trivial. There are no examples of this because it should ‘just work’; with our setup it didn’t.
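The setup we converged on looked roughly like the sketch below, assuming the Microsoft.Authentication.WebAssembly.Msal package; the URIs, scope names and configuration section are placeholders. The idea is one named HttpClient per API, each with its own AuthorizationMessageHandler configured for that endpoint and scope.

```csharp
using System;
using Microsoft.AspNetCore.Components.WebAssembly.Authentication;
using Microsoft.AspNetCore.Components.WebAssembly.Hosting;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

var builder = WebAssemblyHostBuilder.CreateDefault(args);

// One HttpClient per API endpoint, each handler attaching only that endpoint's token.
builder.Services.AddHttpClient("OrdersApi", client =>
        client.BaseAddress = new Uri("https://api.example.com/orders/"))        // placeholder
    .AddHttpMessageHandler(sp => sp.GetRequiredService<AuthorizationMessageHandler>()
        .ConfigureHandler(
            authorizedUrls: new[] { "https://api.example.com/orders/" },
            scopes: new[] { "https://mytenant.onmicrosoft.com/orders-api/orders.readwrite" }));

builder.Services.AddMsalAuthentication(options =>
{
    builder.Configuration.Bind("AzureAdB2C", options.ProviderOptions.Authentication); // placeholder section
    options.ProviderOptions.DefaultAccessTokenScopes.Add(
        "https://mytenant.onmicrosoft.com/orders-api/orders.readwrite");
});

await builder.Build().RunAsync();
```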
Another thing is that the B2C examples are missing an important part: SSO. With the PKCE flow, a lot of the cookie and session durations have been shortened for increased security. But we still want our customers to have a not-that-much-interrupted flow when it comes to sign-up and sign-in. So it took us a lot of going through examples and experiences of others to piece together a working setup. We are using a couple of identity providers, so there was a lot of configuration to go through and keep working.
The main reason for using custom user flows was the ability to create accounts in our own application as part of the sign-up flow on B2C. As with the other B2C things, finding good Microsoft documentation was hard, but luckily there are a few people who share their own experiences, and we could piece together a working setup. The hardest part was understanding how to write these user flow steps and why it works this way. I still don’t understand each and every detail, but I can change some steps now.
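On the application side this boils down to an endpoint that the custom policy’s RESTful technical profile calls during sign-up. A minimal sketch, with placeholder claim names and route, and the actual account creation left out:

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var app = WebApplication.Create(args);

// The technical profile posts the configured input claims as a JSON body.
app.MapPost("/b2c/signup", (SignUpClaims claims) =>
{
    // Create the application-specific account here (placeholder).
    var appAccountId = Guid.NewGuid().ToString();

    // A flat JSON object goes back; the technical profile can map these properties
    // into output claims and from there onto the token.
    return Results.Ok(new { appAccountId, displayName = claims.DisplayName });
});

app.Run();

// Claims B2C sends us, matching the technical profile's InputClaims (illustrative).
public record SignUpClaims(string Email, string DisplayName, string ObjectId);
```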
Another thing we wanted was to have our own data on the tokens so we don’t have to hit the back-end each time. We got this working; it used most of the same machinery as the sign-up anyway, so we could rely on our previously acquired knowledge here.
But we did eventually move away from this ‘own data on the token’ for architectural reasons: you want tokens that are limited to the resource, and even to the action itself. We didn’t do this, so our token grew too big. And in addition to that, we wanted more control over the lifetime. This is probably possible to pull off using more of the custom user flows, but we really didn’t like the idea of going there. Something like IdentityServer is much better documented, and more people know how to use a technology like that.
The issues
Here is a brief summary of the problems we faced, as covered above:
- Missing documentation on actual SF production experiences.
- High administration costs for SF.
- Increased complexity with SF without being able to reap equally big benefits from it in the short run.
- Distributed services require their own architectural patterns to be usable.
- Moving code to a micro-service architecture is hard to do in small increments.
- You should have automated a lot before moving to SF: data should already be flowing smoothly throughout the landscape, and the same goes for services.
- Azure B2C custom user flow documentation is lacking when it comes to actual production experiences.
- Azure B2C custom user flows use an XML configuration approach that is hard to understand, and not that many people are actually using it.
- Azure B2C is more limited in feature set than Azure AD. Not problematic, but deceptive.
But there is more. If you look at the problems we were facing, there were underlying decisions and assumptions that turned these from ‘issues’ into ‘problematic’.
- Wanting quick performance and scalability rewards without spending much effort and time.
- Keeping the team small, with easy-to-find expertise.
- With code: keeping everything typed, and using remoting connections to make it much easier to follow calls throughout the landscape.
We couldn’t just add a lot of administrative tools for SF, that would go against our own principles. Tools like that need to be maintained and require in-depth knowledge. The same goes for B2C.
We also couldn’t just add a proper micro-service architecture. It would increase the complexity a lot, making it harder to find suitable people. We needed to keep this in check and use a simpler system.
To sum it up: we didn’t like where this was going. We were losing flexibility way too quickly without gaining as many benefits as we would have liked. It dawned on us that we had no benefit from IaaS and should look more at PaaS or even SaaS for a lot of the non-core systems we had.
Solutions, if we had continued down that road
One way to leverage the power of SF is to instantiate a new application for each customer. For example: if you host events and you want to host a new event, just spin up a new instance of your application for each event. This way you can utilize SF’s scalability much better, and even use its VM affinity system to offer different tiers of performance and availability.
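A minimal sketch of what that provisioning could look like with FabricClient, assuming an application type named EventAppType, version 1.0.0, is already provisioned on the cluster; all names and parameters are placeholders.

```csharp
using System;
using System.Collections.Specialized;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

public static class PerCustomerProvisioning
{
    public static Task CreateEventInstanceAsync(FabricClient fabric, string eventName)
    {
        // Placeholder application parameter handed to the new instance.
        var parameters = new NameValueCollection { { "EventName", eventName } };

        var description = new ApplicationDescription(
            new Uri($"fabric:/Events/{eventName}"),   // one named application per event
            "EventAppType",                           // placeholder application type name
            "1.0.0",                                  // placeholder version
            parameters);

        return fabric.ApplicationManager.CreateApplicationAsync(description);
    }
}
```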
Embrace micro-services. Make each micro-service truly able to live on its own: being discovered, being consumed, handling its own data. And then have interfaces on this service that allow it to move its data to another micro-service of the same kind, while allowing for version differences. For instance: you have an API that needs a store. Discover this store, connect to it, negotiate an interface version, get the data or mutate something, and return to the customer. But when you want to move to a new version, let the old one live, release a new one, and let this new one find its place. Some data will end up directly on this new service, other data will end up there eventually. The major pieces here are the versioning, the discovery and the migration handling. But if you do it like this, you can very easily re-partition services, break versions, do rolling upgrades/downgrades, and have services work independently of each other.
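None of this is an SF API; it is a contract you would have to define yourself. A sketch of what such a contract could look like, with all names being ours:

```csharp
using System.Threading.Tasks;

// Each service instance exposes which contract versions it speaks, and a newer
// deployment can pull (or be pushed) records from an older one while both stay online.
public interface IVersionedStore
{
    // Used during discovery/negotiation.
    Task<string[]> GetSupportedVersionsAsync();

    // Hand over batches of records to a successor service in a negotiated version.
    Task<MigrationBatch> ExportBatchAsync(string version, string? continuationToken);
    Task ImportBatchAsync(string version, MigrationBatch batch);
}

public record MigrationBatch(string? ContinuationToken, byte[][] Records);
```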
Get a proper micro-service orchestration tool with integrated SF flows to execute maintenance procedures. And add application-specific logic for the different applications to allow for even finer-grained maintenance control, like exploring the typed data, manually fixing things, and so on. You will probably have to develop this on your own; I couldn’t find it.
And on Azure B2C: just get someone with specialized expertise in Azure B2C. Creating and managing IAM systems is a trade of its own; doing it in Azure B2C is a product specialization on top of that.
Our next step
I tried my best to get SF to run within the constraints my client had in place, but it didn’t work out. Same with Azure B2C.
They concluded 3 things:
- Use more commonly used technologies so it is easier to find expertise for them.
- We want to keep the complexity low to keep the number of developers needed low. So use more commonly used patterns to solve our problems, use less custom code and use more readily available technology.
- Let’s get our code to a more scalable architecture first, and while we are getting there, decide on the PaaS/SaaS components we want to use.
For you
If you are still on SF or Azure B2C, here are some helpful things I found while working with them.
Service Fabric
A troubleshooting guide for most SF problems
https://github.com/Azure/Service-Fabric-Troubleshooting-Guides
If you want to edit some SF properties but the interface isn’t allowing you to do so, you have to use PowerShell. But there is an alternative: https://resources.azure.com. Especially useful for editing ‘fabricSettings’:
Select your subscription -> resourceGroups -> “your SF cluster resource group” -> “cluster name” -> providers -> Microsoft.ServiceFabric -> clusters -> “cluster name”
Don’t forget to change to ‘Read/Write’ and then use the ‘Edit’ button.
If you don’t see the edit button, your browsing went wrong somewhere. There are multiple views of your cluster data that can be very deceiving!
And try it out on a test cluster first! If you push your changes the upgrade will start immediately; depending on your cluster being bronze/silver/gold/platinum, the cluster is more or less graceful in keeping things running. Unless you break something of course, so back up the JSON before you alter it!
If you need PowerShell, you do need the Azure PowerShell (Az) module. This is their new way of doing things. For that, head over to https://docs.microsoft.com/en-us/powershell/azure/install-az-ps. The documentation of SF is a bit all over the place. There is regular PowerShell with SF PowerShell commands, Azure PowerShell with Azure-specific commands, and you can call the SF REST API from PowerShell. Just know these are all different. And some calls are deprecated, some are replaced, and some don’t have a replacement, so yeah… thanks Microsoft?
On getting the backup to work, I used these resources.
https://docs.microsoft.com/en-gb/azure/service-fabric/service-fabric-backuprestoreservice-quickstart-azurecluster
https://docs.microsoft.com/en-gb/azure/service-fabric/service-fabric-backuprestoreservice-configure-periodic-backup
A note on using the backup restore in the Explorer: I ended up using the SF REST API to restore, which I could get working reliably. I could not get the Explorer to work here.
Azure B2C
Interesting blog with a lot of real world examples
https://tsmatz.wordpress.com/2020/05/12/azure-ad-b2c-ief-custom-policy-walkthrough/
Some insights into session behavior
https://docs.microsoft.com/en-us/azure/active-directory-b2c/configure-tokens?pivots=b2c-custom-policy
If you want to integrate an API of your own in Azure B2C, this is the documentation for that piece
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/active-directory-b2c/restful-technical-profile.md