Armory SLI/SLO Documentation

Integrations | People who know it well | Deep Dive Doc or Resources
Prometheus   | Jason                   |
Datadog      |                         |
New Relic    | Jason, Justin F?        |
Dynatrace    |                         |

The 301 training has good insights into the services and should give a head start on the deep dive docs: https://docs.google.com/document/d/1ij1Ou3fTDGOjS6CabfvNhS5JkisYV0Corhtk1V2uqX0/edit#

We could potentially repurpose some of these metrics to track and monitor the health of customer Spinnaker environments.


These are the biggest leading indicators:
  • Latency between services
  • 5xx responses
  • Orca task queue size
  • Clouddriver rate limit
  • Pipeline execution failure due to timeout
  • Exceptions in pipeline execution
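As a rough illustration of how the indicators above could feed an environment health check, here is a minimal Python sketch. The metric names, values, and thresholds are all hypothetical placeholders, not recommended limits and not names taken from any Spinnaker metrics endpoint.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    """One leading indicator with a hypothetical alert threshold."""
    name: str
    value: float
    threshold: float  # illustrative limit, not a recommended value

    def breached(self) -> bool:
        # An indicator is unhealthy when its value exceeds its threshold.
        return self.value > self.threshold


def unhealthy(indicators):
    """Return the names of indicators that exceed their thresholds."""
    return [i.name for i in indicators if i.breached()]


# Hypothetical snapshot of one customer environment (made-up numbers).
snapshot = [
    Indicator("inter_service_latency_p99_ms", 850.0, 1000.0),
    Indicator("http_5xx_ratio", 0.02, 0.01),
    Indicator("orca_task_queue_size", 120.0, 500.0),
    Indicator("clouddriver_rate_limit_hits_per_min", 0.0, 5.0),
    Indicator("pipeline_timeout_failures_per_hour", 7.0, 3.0),
    Indicator("pipeline_exceptions_per_hour", 1.0, 10.0),
]

print(unhealthy(snapshot))
```

In this made-up snapshot the 5xx ratio and the timeout failures are over their limits, so those two names are reported. In practice the values would come from whichever monitoring integration is in use.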

There are two big philosophies that Spinnaker follows:
  • End to end service ownership
  • Immutability


Justin Field's thoughts on owning microservices from cradle to grave:
  1. We need everything instrumented and reporting the metrics required to build dashboards and alerts on the four golden signals of site reliability engineering:
     • Latency: the time it takes to service a request.
     • Traffic: how much demand is being placed on the system, measured in a high-level, system-specific metric.
     • Errors: the rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (e.g., an HTTP 200 response with the wrong content), or by policy (e.g., any response slower than a committed latency target).
     • Saturation: how "full" a service is, measured as the fraction of capacity in use, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).
  2. We need log aggregation (DIY, Splunk, New Relic, Datadog, etc.).
  3. We need to enable distributed tracing throughout our Spinnaker ecosystem (Sleuth with Zipkin, New Relic, or Datadog).
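To make the four golden signals concrete, the following Python sketch derives one number per signal from a window of request samples. The `Request` record, the p95 choice, the `status >= 500` error rule, and the capacity inputs are assumptions made for illustration; a real setup would pull these from the metrics backend rather than compute them by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request sample: duration and HTTP status."""
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarize latency, traffic, errors, and saturation for one window."""
    durations = [r.duration_ms for r in requests]
    # Only explicit failures (5xx) are counted here.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "saturation": used_capacity / total_capacity,
    }


# Made-up samples: four requests observed over a two-second window.
samples = [
    Request(100.0, 200),
    Request(200.0, 200),
    Request(300.0, 500),
    Request(400.0, 200),
]
signals = golden_signals(samples, window_seconds=2, used_capacity=3, total_capacity=4)
print(signals)
```

Saturation is passed in as a used/total pair because the most constrained resource differs per service (memory, I/O, queue depth, and so on).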



Releases

Your team is responsible for providing a summary of changes in upstream services before a release is cut, so that customers (both internal and external) understand what is changing from release to release. Each summary should be added to the release notes for that release.
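One possible way to draft such a summary is to group commit subjects by type. The Python sketch below assumes subjects follow a `type(scope): description` pattern; that format, the `summarize` helper, and the sample subjects are all hypothetical and only illustrate the idea.

```python
import re
from collections import defaultdict

# Assumed commit subject format: "type(scope): description".
SUBJECT = re.compile(r"^(?P<type>\w+)(?:\((?P<scope>[^)]*)\))?:\s*(?P<desc>.+)$")


def summarize(service, subjects):
    """Group commit subjects by type into a plain-text release-note section."""
    groups = defaultdict(list)
    for subject in subjects:
        match = SUBJECT.match(subject)
        # Subjects that do not fit the assumed format land under "other".
        kind = match.group("type") if match else "other"
        desc = match.group("desc") if match else subject
        groups[kind].append(desc)
    lines = [f"{service} changes"]
    for kind in sorted(groups):
        lines.append(f"  {kind}:")
        lines.extend(f"    • {desc}" for desc in groups[kind])
    return "\n".join(lines)


# Made-up commit subjects since the last release.
notes = summarize("Orca", [
    "feat(queue): add retry backoff",
    "fix(pipeline): handle timeout in wait stage",
    "Bump dependencies",
])
print(notes)
```

The output is only a draft; someone who knows the service should still review it before it goes into the release notes.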

Netflix does this by taking the following steps:
  1. Engineer works on a feature in, say, Orca.
  2. PR gets merged in OSS.
  3. Engineer reviews commits since the last GitHub release.
  4. Engineer runs the action or script that creates a GitHub release of Orca.