[ENH] added 27 FPP3 datasets; accessor load_fpp3() that returns sktime object #6420
base: main
Conversation
…convert the data into sktime objects, e.g. pd_multiindex_hier
Thanks!
My main comment: is there a way to minimize the file-size footprint of this PR? Each file we add increases the size of sktime as a package, and makes it more inconvenient to install, download, etc. Some options for how to deal with this, below.
We could ship the files that are smaller than 30 kB, say, with sktime, but the remaining ones would bloat the repository.
A simple option would be to put the larger files into another repository, say sktime/sktime-datasets-fpp3, and download from there. Upload/download quotas apply once a repository exceeds 500 GB in size.
Here is an example of a repository that we planned to use in this way:
https://github.com/sktime/sktime-datasets
although currently it is only used as a fallback if the primary source is down.
Another option would be to put these into an optional, downloadable package, and make the loader require that as soft dependency.
Any thoughts or input, @benHeid, @JonathanBechtel, @yarnabrina?
@fkiraly I agree the file size footprint should be reduced. The best option would be to make the datasets available in a different repo or datastore, with a data loader as suggested. Is there an existing repo or website that is already hosting these data? Would it be possible, in that case, to access them by writing our own loader?
@ericjb mentioned https://github.com/holgern/pyRdatasets in his document, which might be an option. Given the relatively small size of the datasets, we would not run into download restrictions if we moved the datasets to a separate repository in the
I contacted both. In both cases I opened an issue related to this PR. I am hopeful that one, or both, of the maintainers will respond positively, but so far there has been no response. I don't want to delay this PR too long while waiting for a response. I looked at https://github.com/sktime/sktime-datasets, as mentioned by @fkiraly; it is not clear to me what would be involved in using it. Any suggestions on how to proceed would be welcome.
Oh, I see. Here are possible ways to proceed; I split these into handling the
There are different trade-offs to consider in how quickly we get to a functioning prototype, and given limited time we may also have to choose whether to add more features on the indexing side or the data-hosting side. Robustness of the data location is also a factor here. If you have some time you want to spend in a structured way on this, you could join the benchmarking-related workstream (onboarding of experienced contributors is May 31).
FYI @JonathanBechtel who has worked on datasets, and @benHeid, @hazrulakmal due to the benchmarking connection.
So, what do we do, @ericjb? Any preference? Happy to help.
As there was no response from either maintainer of the two candidate approaches (Rdatasets, pyRdatasets), I went back to the drawing board. As @benHeid pointed out, one should consider whether these datasets are already hosted somewhere. And of course, they are: on CRAN. And, by serendipity, I received the following announcement in my LinkedIn feed today: rdata: Read R datasets from Python (https://www.linkedin.com/pulse/rdata-read-r-datasets-from-python-pyopensci-z8obf/). As a proof-of-concept (POC), and with the help of GitHub Copilot, I wrote a Python function that takes a dataset name as input, grabs the fpp3 tarball from CRAN, downloads it into a temporary folder, finds the relevant dataset (which is in .rda format), reads it using this newly discovered rdata package, deletes the temporary folder, and returns the data as a DataFrame. I have attached my POC code so you can see the details. If this approach is acceptable, I need to modify my PR, because I included logic as part of reading the CSV file (to create the multiindex etc.) and that logic would now have to be moved a bit. I won't do that work until I get confirmation that this approach is acceptable.
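For readers without the attachment, the steps described above can be sketched roughly as follows. This is not the attached POC code; the CRAN version number is an assumption, and the `rdata.parser.parse_file` / `rdata.conversion.convert` calls reflect my reading of that package's documented API:

```python
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def cran_source_url(pkg, version):
    """URL of a package source tarball on CRAN."""
    return f"https://cran.r-project.org/src/contrib/{pkg}_{version}.tar.gz"


def load_fpp3_dataset(name, version="1.0.0"):
    """Fetch the fpp3 tarball and read ``data/<name>.rda`` via the rdata package.

    Returns the dataset as a pandas DataFrame; converting it to an sktime
    mtype (multiindex etc.) would be a separate step, as noted above.
    """
    import rdata  # soft dependency: pip install rdata

    with tempfile.TemporaryDirectory() as tmp:
        tarball = Path(tmp) / "pkg.tar.gz"
        urllib.request.urlretrieve(cran_source_url("fpp3", version), tarball)
        with tarfile.open(tarball) as tf:
            tf.extractall(tmp)  # temp folder is deleted on exit
        parsed = rdata.parser.parse_file(Path(tmp) / "fpp3" / "data" / f"{name}.rda")
        return rdata.conversion.convert(parsed)[name]
```

The temporary directory context manager takes care of the cleanup step; the network download and the rdata import only happen when a dataset is actually requested.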
Update: In the two hours since I added my most recent comment, I received an update from the maintainer of Rdatasets that he has added the datasets from fpp3. There are 13 such datasets (of the 27 in this PR). I submitted a new issue to Rdatasets requesting the addition of the remaining 14 (2 are from the tsibble package and 12 are from the tsibbledata package). Hopefully these will be added and we will have a simpler solution than my most recent proposal.
I see - so we have two potential mirrors in play? I would assume the design would then look like:
We could also adopt a dual-mirror approach:
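A dual-mirror loader would simply try each source in order and fall back on failure. A sketch under the assumption that both mirrors serve the same files by name (the URLs below are placeholders, not the actual hosts):

```python
import urllib.error
import urllib.request

# Placeholder mirror base URLs; the real list would name the chosen hosts,
# e.g. Rdatasets and a sktime-owned repository.
MIRRORS = [
    "https://example.org/primary-mirror/",
    "https://example.org/secondary-mirror/",
]


def fetch_with_fallback(filename, mirrors=MIRRORS, timeout=30):
    """Return the raw bytes of ``filename``, trying each mirror in order."""
    last_err = None
    for base in mirrors:
        try:
            with urllib.request.urlopen(base + filename, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_err = err  # remember the failure and try the next mirror
    raise RuntimeError(f"all mirrors failed for {filename}") from last_err
```

This keeps the loader working if the primary source is down, which was the stated role of sktime/sktime-datasets as a fallback.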
I would also recommend opening a new PR. I think the current state of this PR is useful, as it has logic going in a different direction, the "self-hosted" option. We should keep it open until we have converged on the final constellation.
It seems you are not against my proposal to use the rdata package to read the R .rda files. Please confirm, so I can go ahead with a first implementation that should provide access to all 27 datasets. Thanks!
Absolutely not (against the proposal), it is a good idea. Feel free to go ahead. I was just trying to pin down the design, and suggested working on the rdata route in another PR.
Turning this into a draft to avoid it being accidentally merged; keeping the PR open due to the potentially useful alternative code for indexing the datasets.
Adds 27 datasets from Hyndman's fpp3. The user entry point is sktime.datasets.load_fpp3(). The returned object has the appropriate mtype, with a multiindex as required. The load_fpp3() function could be of interest to the broader Python community, including those not otherwise using sktime. See Section 1 in the accompanying PDF for details.
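As a rough illustration of the target layout (not the actual loader output; the level names and values below are made up), a hierarchical mtype such as `pd_multiindex_hier` is a pandas DataFrame whose row MultiIndex has the series identifiers in the outer levels and the time index innermost:

```python
import pandas as pd

# Hypothetical hierarchical series: outer levels identify the series,
# the innermost level is the (quarterly) time index.
index = pd.MultiIndex.from_product(
    [["Australia"], ["NSW", "VIC"], pd.period_range("2000Q1", periods=4, freq="Q")],
    names=["Country", "State", "Quarter"],
)
y = pd.DataFrame({"Trips": range(8)}, index=index)
```

A loader returning data in this shape can hand it directly to sktime estimators that accept hierarchical input.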
PullRequestNotes.pdf