Latency, the time it takes to service a request.
Traffic, a measure of how much demand is being placed on a system, measured in a high-level system-specific metric.
Errors, the rate of requests that fail, either explicitly(e.g., HTTP 500s)
Saturation, how"full" a service is. A measure of a system fraction, emphasizing the resources that are most constrained(e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O)
We need log aggregation(DIY, Splunk, New Relic, Datadog, etc.)
We need to enable distributed tracing through out our spinnaker Eco systems(Sleuth w/ Zipkin or New Relic, Datadog)
Your team is responsible for providing a summary of changes in upstream services before a release is cut so that customers(both internal and external) understand what is changing release to release. These should be added to the release notes on each release.
Netflix does this by taking the following steps:
Engineer works on a feature in, say, Orca
PR gets merged in OSS
Engineer reviews commits since last GitHub release.
Engineer runs the action/script/whatever to create a GitHub release of Orca.
Releases