What problem are we solving?
We want to be able to perform data analysis on Kubernetes scheduling and autoscaling behaviour.
What steps do we need to take to get there?
- We need to be able to import data from (say) multiple scheduling simulations
- We need to be able to compute metrics and generate nice visualizations based on the input data
- We need to be able to save all of the data analysis somewhere persistent (S3)
What questions do we need to answer?
Where do we get the data from?
- Basically the “only” option we have here is Prometheus. Any other solution for scraping metrics involves us re-inventing a huge amount of stuff that Prometheus already does.
- The real question here is, how do we get fine-enough-grained data so that we can do some reasonable analysis on it? We need at a minimum 1s resolution, I think. We can’t expect users to run their existing Prometheus at a 1s `scrape_interval`, so this essentially means we have to launch our own Prometheus pod per simulation.
- Option 1: rely on the Prometheus operator
- pros: this is what we’re already using in our stack, and I don’t think it would be hard for sk-ctrl to create a new Prometheus object that targets the things we want.
- cons: we introduce a dependency on the Prometheus operator into SimKube.
- Option 2: construct the config ourselves and ship it as fireconfig, Helm, or raw YAML
- pros: no extra dependency needed
- cons: ugh.
- Decision: we’re not expecting users to run this in their production clusters, and simulation clusters are supposed to be ephemeral, so we can care a little less about another dependency here. Let’s do the Prometheus operator because it’s easier (see the sketch below). At some point down the line, if someone complains, we can revisit.
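To make the operator route concrete, here’s a rough sketch of the kind of Prometheus object sk-ctrl could create per simulation; the name, namespace, label key, and retention below are illustrative guesses, not SimKube’s actual config:

```yaml
# Hypothetical per-simulation Prometheus, created via the prometheus-operator
# CRDs; every name and label here is made up for illustration.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sk-prom-my-simulation
  namespace: simkube
spec:
  replicas: 1
  scrapeInterval: 1s       # far below the typical 30s default; fine for an ephemeral cluster
  retention: 24h           # simulations are short-lived, no need to keep data long
  serviceMonitorSelector:  # scrape only the targets this simulation cares about
    matchLabels:
      simkube.io/simulation: my-simulation
```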
How do we get the data out?
- Prometheus sucks at exporting data. You can basically either dump the entire TSDB, at which point you have to spin up a new copy of Prometheus somewhere else to load it, or you can make targeted queries against your running Prometheus and export the results in a more standard, easier-to-use format.
- Option 1: dump the TSDB
- Pros: you get access to all the data, not just the data that you’ve configured
- Cons: you have to provide tooling to spin up a new Prometheus somewhere else once you’re ready to analyze it; also potentially large dataset
- Option 2: targeted export
- Pros: smaller, targeted dataset; don’t have to spin up another Prometheus
- Cons: what if you missed the data that you really needed from the simulation? How do I, the SimKube developer, know what metrics someone wants from their sim?
- Maybe Option 3: can we provide “export hooks” in the simulation? This could be as simple as PromQL queries written in the Simulation CR, or possibly we could let users inject callbacks that do the initial data export and formatting. These callbacks could potentially run as a sidecar, but that requires building a Docker image and a complete application, which is a pain.
- We could load this as a configmap so that multiple simulations could reference the same metrics config?
- We’ll want to think about the format for specifying these and how that maps to output format
- Decision: let’s try going down the Option 3 route and see how far that gets us.
- Update: we ended up using Prometheus remote write targets, which basically let users send their data anywhere they want (sketch below). The remote write targets need to be set up outside of SimKube, which is maybe a bit annoying, but I don’t really want SimKube to be responsible for configuring this.
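For reference, here’s roughly what wiring the simulation’s Prometheus up to a user-managed remote write target looks like in the operator’s Prometheus CR; the endpoint URL and the relabel rule are hypothetical, and standing up the receiving end is entirely on the user:

```yaml
# Hypothetical remoteWrite stanza in the per-simulation Prometheus spec;
# the sink URL is made up and must be provisioned outside SimKube.
spec:
  remoteWrite:
    - url: http://my-metrics-sink.example.com/api/v1/write
      writeRelabelConfigs:
        - sourceLabels: [__name__]
          regex: kube_pod_.*|scheduler_.*   # only ship the series you care about
          action: keep
```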
What format do we want to store the data in (both while we’re doing the analysis and after we’re done)?
- Option 1: CSV
- Pros: universal, easy to import ~anywhere
- Cons: hard to do structured data, hard to make queries against
- Option 2: SQL
- Pros: lets you do structured data, more flexibility in queries, (maybe?) faster to query?
- Cons: requires more machinery, harder to read, (maybe?) bigger contents, we need some kind of pre-defined schema
- Option 3: JSON/msgpack
- Pros: also universal, easy to de/serialize, lets you do some degree of structuring, doesn’t require a schema
- Cons: somewhat hard to query (although `jq` works pretty well; see the example below), (maybe?) slower to query; while you don’t have to have a schema, it’s almost always a good idea to have one; fair amount of boilerplate to deal with
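For comparison, a single exported series in the JSON option might look something like the sketch below; the field names are illustrative, not a committed schema. Ad-hoc queries against it stay pretty painless, e.g. `jq '.samples[] | select(.value > 4)'`:

```json
{
  "metric": "scheduler_pending_pods",
  "labels": { "simulation": "sim-42", "queue": "active" },
  "samples": [
    { "ts": 1700000000.0, "value": 3 },
    { "ts": 1700000001.0, "value": 5 }
  ]
}
```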