
[Feature Request] Divide data into chunks based on amount rather than time #43

Open
ptoews opened this issue Oct 6, 2020 · 2 comments
Labels: enhancement (New feature or request)

Comments


ptoews commented Oct 6, 2020

When I tried to sync a large database I ran into a few errors, for example Request Entity Too Large, which I could not yet fix by increasing the max-points-on-write parameter; similar issues with large amounts of data have already been discussed here. But that is not the main point of this issue.

My data consists of ~50k points contained within about one minute, and I tried to sync the last month. So to keep the number of points per chunk down, I would have to choose a chunk-interval of a few seconds, which results in a huge number of empty chunks for the rest of the month. So I wondered: what is the reason for dividing the data based on time rather than on the actual amount?
Granted, my example is a bit extreme, but in cases where the data distribution is uneven or has spikes, this approach might not be the best. Instead, it might be better to define a chunk size, for example 1000 points, and have syncflux query the first 1000 points, then the next 1000 points, and so on, resulting in evenly sized, adjustable chunks.
InfluxQL does support this with the LIMIT and OFFSET clauses.

I cannot think of a reason why partitioning data over time would be better than simply by amount as described. Am I missing something? What do you think?

toni-moreno added the enhancement (New feature or request) label on Oct 6, 2020
toni-moreno (Owner) commented

Hello @ptoews, partitioning data by the amount of series, in addition to chunking by time, is a great idea (it would give complete control over the amount of data per request). We could add, for example, a "max-series-by-time-chunk" parameter. I will keep it in mind for the next releases (I will also accept PRs with this new feature).

Until then, did you try disabling the max-body-size parameter? Perhaps it could help with large reads/writes.

Thank you a lot, @ptoews, for your suggestion.


ptoews commented Oct 24, 2020

Hi @toni-moreno, great to hear that!
I'm just wondering what the advantage of time-based partitioning is. A combination is certainly possible, but I don't see a use case for it, nor for the time-only case. The current issues with large copies seem much more important to me.

Maybe you can explain this to me; otherwise I would try to implement an amount-only partitioning solution.
