Plasmic Postmortem: invalid schema version stamp deployed (2023-10-09)
What happened
On 2023-10-09 at 13:24 UTC, we deployed an update to Plasmic that contained an incorrect “schema version stamp”, a version number that identifies the schema our project data is expected to follow. As a result, every project accessed after that deployment was stamped with this invalid schema version, which was higher than the correct one. (This did not affect the project data itself; only the version stamp was updated.)
On 2023-10-09 at 15:31 UTC (~2 hours later), we deployed another update to Plasmic, which contained the correct version stamp. However, because the correct version number is lower than the invalid one deployed earlier, opening the affected projects resulted in an error: the deployed code treats a schema version it does not recognize as data it does not know how to handle.
Unfortunately, even though the incorrect version stamp was applied only to projects accessed in this two-hour window, those projects included some core system projects that almost all projects use, so most projects could not be opened successfully.
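To illustrate the failure mode described above, here is a minimal sketch of a schema version guard and of how one over-stamped shared project blocks everything that uses it. It is not Plasmic's actual code: the names `Project`, `check_loadable`, and `UnknownSchemaVersionError`, and the version number 41, are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical value: the highest schema version this build understands.
SUPPORTED_SCHEMA_VERSION = 41


class UnknownSchemaVersionError(Exception):
    """Raised when stored data claims a schema newer than the code knows."""


@dataclass
class Project:
    name: str
    schema_version: int
    # Other projects this one uses, e.g. shared core system projects.
    dependencies: list[Project] = field(default_factory=list)


def check_loadable(project: Project, supported: int = SUPPORTED_SCHEMA_VERSION) -> None:
    """Refuse to load a project, or anything it uses, with an unknown version."""
    if project.schema_version > supported:
        raise UnknownSchemaVersionError(
            f"{project.name}: schema version {project.schema_version} "
            f"is newer than supported version {supported}"
        )
    for dependency in project.dependencies:
        check_loadable(dependency, supported)


if __name__ == "__main__":
    # A shared project stamped with the too-high version during the bad window.
    core = Project("core-system", schema_version=SUPPORTED_SCHEMA_VERSION + 1)
    # A project that was never touched in that window, but uses the shared one.
    site = Project("customer-site", SUPPORTED_SCHEMA_VERSION, [core])
    try:
        check_loadable(site)
    except UnknownSchemaVersionError as err:
        print(f"cannot open {site.name}: {err}")
```

Under these assumptions, `customer-site` carries a valid stamp of its own and still fails to open, because the check reaches the shared project it depends on.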
We became aware of the issue at 15:36 UTC and immediately began investigating. We began deploying a fix at 15:58 UTC. The fix reached our codegen service at 16:54 UTC, and the Studio was fixed at 17:17 UTC.
Impact
During the downtime, users saw an “Internal server error” when they tried to open a project, and later a persistent “An update to Plasmic is available” message that did not go away upon reload. The codegen cluster was also unable to generate code for new publishes. Our CDN continued to serve cached bundles where it could.
What went wrong
We run Plasmic deployments via a Jenkins job that checks out code from git and runs a deployment script. A recent change made this script generate a new schema version stamp when it detects that one is needed, and send that new stamp out for code review and approval. Unfortunately, the script failed to clean up the Jenkins workspace afterwards, and our Jenkins pipeline was not configured to remove all untracked files after checkout. This left dangling schema version stamp files in the workspace, which were picked up on the next run of the deployment script. That run packaged the stale version stamp files into a Docker image and deployed them to production, so the invalid version stamp was used in production until the next deployment.
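One way to guard against this class of problem is to have the deployment script refuse to package anything while the workspace contains untracked files. The sketch below is an assumption about what such a guard could look like, not the fix we shipped. It relies on `git status --porcelain`, which marks untracked paths with a `??` prefix; the function names and the file name `schema-version.stamp` are hypothetical.

```python
import subprocess


def untracked_paths(porcelain_output: str) -> list[str]:
    """Return the paths that `git status --porcelain` reports as untracked."""
    return [
        line[3:]  # drop the two status characters and the separating space
        for line in porcelain_output.splitlines()
        if line.startswith("?? ")
    ]


def assert_clean_workspace(repo_dir: str) -> None:
    """Abort before packaging if the checkout contains untracked files.

    Note: files matched by .gitignore are not listed by default, so this
    only catches leftovers that git itself considers untracked.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    leftovers = untracked_paths(result.stdout)
    if leftovers:
        raise RuntimeError(f"untracked files in workspace: {leftovers}")


if __name__ == "__main__":
    # Example output from a workspace with one modified and one leftover file.
    sample = " M deploy.sh\n?? schema-version.stamp\n"
    print(untracked_paths(sample))
```

A complementary option is to run `git clean -fdx` right after checkout, which deletes untracked and ignored files so that each run starts from a pristine tree. The check above fails loudly instead of cleaning silently, which makes a leftover file visible rather than hiding it.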
What we’re doing
We have fixed the immediate issue and restored Plasmic deployments.
It is never acceptable to us to cause service disruptions. This quarter, we have been focusing on improving our devops infrastructure to automate error-prone manual processes and speed up our testing pipelines as we ramp up our test coverage. After this incident, we are also exploring ways to test our deployment pipeline itself and to speed up hot-fixes, so that we can minimize downtime when an incident does happen.
We know that you rely on Plasmic for your critical business workflows, and we did not live up to your expectations. We are learning and working on improving our processes. Please reach out with any additional questions or concerns you might have.