
[Feature request] Speed up GetSchema #25

Open
karn09 opened this issue Sep 3, 2019 · 3 comments
karn09 commented Sep 3, 2019

Loading the schema from a DB with a large number of measurements takes a long time. I've observed anywhere from 8-20 minutes before GetSchema completes.

I suspect the long load times are caused by this line:

mf[m.Name].Fields = GetFields(hac.Master.cli, db, m.Name, rp.Name)

This makes an individual API call per measurement to fetch its field keys.

I was thinking that it may be possible to use `show field keys on <sdb>`, so that the API responds with field keys for ALL measurements in the selected DB. I think this would work, but I haven't investigated whether there are any size limitations in InfluxDB's JSON responses or in the REST client used.
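The grouping step for a bulk fetch could look roughly like this. For `SHOW FIELD KEYS ON <db>`, the InfluxDB 1.x HTTP API returns one series per measurement, so the whole schema can be assembled in a single pass. This is a minimal Go sketch: the `Series` struct and `GroupFieldKeys` helper are illustrative stand-ins modelled on the JSON response shape, not the project's actual types.

```go
package main

import "fmt"

// Series mirrors one entry of an InfluxDB 1.x JSON response.
// For SHOW FIELD KEYS, each measurement comes back as its own
// series: Name is the measurement, Values are [fieldKey, fieldType]
// pairs. (Struct shape is an assumption for illustration.)
type Series struct {
	Name   string
	Values [][2]string // [fieldKey, fieldType]
}

// GroupFieldKeys turns the flat series list from a single bulk
// "SHOW FIELD KEYS ON <db>" call into a per-measurement map,
// replacing one API round-trip per measurement with one in total.
func GroupFieldKeys(series []Series) map[string][]string {
	fields := make(map[string][]string, len(series))
	for _, s := range series {
		for _, v := range s.Values {
			fields[s.Name] = append(fields[s.Name], v[0])
		}
	}
	return fields
}

func main() {
	resp := []Series{
		{Name: "disk", Values: [][2]string{{"free", "integer"}, {"used", "integer"}}},
		{Name: "mem", Values: [][2]string{{"available", "integer"}}},
	}
	fmt.Println(GroupFieldKeys(resp))
}
```

The trade-off is replacing N round-trips with one large JSON payload, which is exactly the size concern raised above.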

With 1000 measurements, the API took 12s to respond with a 1.72MB JSON payload. By comparison, a request for the fields of a single measurement took between 500-800ms over a small sample of requests.

An alternative could be splitting the list of measurements and fetching field keys in batches, but this could also be very slow. For example, `show field keys from disk,diskio,interrupts,kernel` would take upward of 12s, sometimes even returning an empty response. Maybe InfluxDB does not index for this sort of query?
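The batching idea can be sketched as query construction alone. In this Go sketch the function name, batch size, and quoting policy are all assumptions that would need tuning against a real server:

```go
package main

import (
	"fmt"
	"strings"
)

// batchFieldKeyQueries splits the measurement list into chunks of
// size n and builds one "SHOW FIELD KEYS FROM m1,m2,..." statement
// per chunk. (Hypothetical helper; identifier quoting here is the
// simple double-quote form.)
func batchFieldKeyQueries(measurements []string, n int) []string {
	var queries []string
	for i := 0; i < len(measurements); i += n {
		end := i + n
		if end > len(measurements) {
			end = len(measurements)
		}
		quoted := make([]string, 0, end-i)
		for _, m := range measurements[i:end] {
			quoted = append(quoted, `"`+m+`"`)
		}
		queries = append(queries, "SHOW FIELD KEYS FROM "+strings.Join(quoted, ","))
	}
	return queries
}

func main() {
	qs := batchFieldKeyQueries([]string{"disk", "diskio", "interrupts", "kernel", "mem"}, 4)
	for _, q := range qs {
		fmt.Println(q)
	}
}
```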

In my limited testing, I am running InfluxDB 1.7.7, with queries routed through influxdb-srelay. Queries made directly to the master were slightly faster, with all fields returned in 4s, and batches of 4 varying between 4-12s per request.

It would be awesome if we could set a command-line flag to force bulk loading of all field keys in a single request, or have some logic that automatically switches to bulk loading once a certain number of measurements is seen in one DB. If batching requests is workable with additional InfluxDB configuration, that would also be great.

I'd be happy to submit a PR with my proposed solution, but would appreciate some feedback on the correct approach to take.

sbengo (Collaborator) commented Sep 4, 2019

Hi @karn09, thanks for the detailed feedback.

I'm very surprised by the time spent retrieving the schema. How many measurements do you have in the DB, and what is their cardinality?

I think I understand what you propose, but I'm not sure a raw `show field keys on <sdb>` is the right approach. Since we allow measurement filters, we should not retrieve anything from filtered-out measurements (imagine that, due to an error, a measurement has generated 1M fields; a single query could bring down your DB).

Just to be sure, can you check whether the time spent doing `show field keys on <sdb> from [meas1, meas2,...,measN]` is equal to doing a bulk `show field keys on <sdb>`?

Thanks,
Regards!

karn09 (Author) commented Sep 4, 2019

Hi @sbengo, thanks for following up.

In one extreme case, we have 10391 measurements; however, a query for measurement cardinality reports 10389. On this DB, `show field keys on LARGE_DB` took 21.624s.

Given the high cardinality, I'm hesitant to load all measurement field keys via show field keys on <sdb> from [meas1, meas2,...,measN]. Instead I ran a small subset with 9 measurements.

`show field keys on LARGE_DB from interrupts,kernel,linux_sysctl_fs,mem,net,netstat,processes,soft_interrupts,swap,system` took 2m59.413s.

For comparison, I have another DB with a measurement cardinality of 13. A bulk `show field keys on SMALL_DB` took 0m0.301s. Passing all 13 measurements to `show field keys` took 0.780s.

You bring up valid points that I had not considered. I can see both approaches being problematic.

To try to accommodate filtering, I put together a query that uses a regex in the FROM clause. In the end I couldn't come up with anything that worked 100%, since I could not use negative lookarounds (InfluxDB's regex engine, Go's RE2, does not support them). Anyway, I suspect that this may be just as slow as passing a list of all the measurements for large DBs.

Perhaps a 'use at your own risk' warning could accompany an option to bulk-load fields, which would either disable filtering or apply filtering after all the fields are fetched.

sbengo (Collaborator) commented Sep 6, 2019

Hi @karn09, thanks for your feedback.
I have been able to reproduce the case, and the time difference between the commands still surprises me.

As you said, IMHO it's not acceptable to spend 8-20 min retrieving the schema.
@toni-moreno , what do you think?

Personally, I agree with implementing what you proposed. If you can, open an initial PR and we will discuss it there. Remember to apply the measurement filter after retrieving all fields with the bulk query.

Thanks,
Greetings
