I am interested in compiling a list of techniques to generate fake time-series data that looks and behaves realistically. The goal is to make a mock API for developers to work against, without needing bulky sets of real data, which are annoying to deal with, especially as things change and new types of data are needed.
To achieve this, I think several specific things need to be addressed:
y_n = rand() * 10 + 100 for data that fluctuates randomly between 100 and 110 (assuming rand() returns a value in [0, 1)).
To make the mock API, I imagine we could catalog a set of metrics we want to be able to generate, with the following properties for each:
This reduces the problem from what we currently do (keeping entire data sets, which need to be replaced as our data-gathering techniques evolve) to just a dictionary of metrics and their definitions.
Then the mock API would accept requests for a set of metrics, the time range desired, and the resolution desired. The metrics would be computed and returned.
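As a rough illustration of that request shape, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `METRICS` catalog, the metric names `cpu_percent` and `uptime_seconds`, the `query` function, and the choice of Unix timestamps in seconds with a half-open time range. It is not a proposed design, just one way the pieces could fit together.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical catalog: metric name -> function of a Unix timestamp (seconds).
# The names and formulas are placeholders, not real metric definitions.
METRICS: Dict[str, Callable[[int], float]] = {
    # A smooth daily cycle that swings between 20 and 80.
    "cpu_percent": lambda ts: 50.0 + 30.0 * math.sin(2 * math.pi * (ts % 86400) / 86400),
    # A sawtooth that grows by one per second and wraps every 1,000,000 seconds.
    "uptime_seconds": lambda ts: float(ts % 1_000_000),
}


def query(names: List[str], start: int, end: int, step: int) -> Dict[str, List[Tuple[int, float]]]:
    """Return (timestamp, value) pairs for each requested metric.

    start is inclusive, end is exclusive, and step is the resolution in seconds.
    """
    if step <= 0:
        raise ValueError("step must be a positive number of seconds")
    return {
        name: [(ts, METRICS[name](ts)) for ts in range(start, end, step)]
        for name in names
    }


if __name__ == "__main__":
    # One hour of data at one-minute resolution.
    result = query(["cpu_percent", "uptime_seconds"], 0, 3600, 60)
    print(len(result["cpu_percent"]), result["cpu_percent"][0])
```

Because each value depends only on the timestamp, the same request always returns the same series, and nothing has to be stored between requests.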
To make this work correctly, the metrics need to be generated deterministically. That is, if I ask for metrics from 5am to 6am on a particular day, I should always get the same values for the metrics. And if I ask for a different time range, I’d get different values. What this means, in my opinion, is that there needs to be a closed-form function that produces the metric’s output for a given timestamp. (I think one-second resolution of data is fine enough for most purposes.)
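One possible way to get repeatable "random" values without storing anything is to hash the metric name together with the timestamp and scale the result. The Python sketch below is only an illustration of that idea, not a full answer to the question: the function names `unit_noise` and `sample`, the use of SHA-256, and the `base` and `spread` parameters are all assumptions made for the example.

```python
import hashlib
import struct


def unit_noise(metric: str, ts: int) -> float:
    """Map (metric, timestamp) to a repeatable float in [0, 1)."""
    digest = hashlib.sha256(f"{metric}:{ts}".encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned 64-bit big-endian integer.
    (n,) = struct.unpack(">Q", digest[:8])
    # Keep the top 53 bits so the quotient is exactly representable and < 1.0.
    return (n >> 11) / float(2**53)


def sample(metric: str, ts: int, base: float = 100.0, spread: float = 10.0) -> float:
    """Return a value in [base, base + spread) that depends only on the inputs."""
    return base + spread * unit_noise(metric, ts)


if __name__ == "__main__":
    # The same timestamp always yields the same value; no state is kept.
    print(sample("cpu", 1700000000) == sample("cpu", 1700000000))
```

Values produced this way are independent from one second to the next, so on their own they look like white noise. Realistic-looking series would presumably need something layered on top, such as a trend or a seasonal term computed from the timestamp.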
Does anyone have suggestions for how to do this?
The result will be open-sourced, so everyone who’s interested in such a programmatically generated dataset can benefit from it.