
[ENH] added 27 FPP3 datasets; accessor load_fpp3() that returns sktime object #6420

Draft
wants to merge 1 commit into base: main

Conversation

ericjb (Contributor) commented May 14, 2024

Adds 27 datasets from Hyndman's fpp3. The user entry point is sktime.datasets.load_fpp3(). The returned object has the appropriate mtype, with a MultiIndex where required. The load_fpp3() function could be of interest to the broader Python community, including those not otherwise using sktime. See Section 1 in the accompanying PDF for details.

PullRequestNotes.pdf
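Since the loader itself is only in the diff, here is a minimal usage sketch of the described entry point; the call signature follows the description above (dataset name in, sktime object out), and the dataset name "aus_production" is an illustrative pick from the fpp3/tsibbledata collection, not confirmed against the PR's list of 27:

```python
from sktime.datasets import load_fpp3

# "aus_production" is an illustrative dataset name; the exact 27 names
# supported are defined in the PR itself.
y = load_fpp3("aus_production")

# Hierarchical / multi-series datasets are returned in a MultiIndex-based
# mtype such as pd_multiindex_hier (a DataFrame with a row MultiIndex).
print(type(y))
print(y.index.nlevels)
```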

…convert the data into sktime objects, e.g. pd_multiindex_hier
fkiraly (Collaborator) left a comment

Thanks!

My main comment: is there a way to minimize the file size footprint of this PR?
Each file we add increases the size of sktime as a package, and makes it more inconvenient to install, download, etc.

Some options for how to deal with this are below.

We could ship the files smaller than, say, 30 kB with sktime, but the remaining ones would bloat the repository.

A simple option would be to put them into another repository and download from there. Upload/download quotas apply once a repository exceeds 500 GB in size. sktime/sktime-datasets-fpp3?

Here is an example of a repository that we planned to use in this way:
https://github.com/sktime/sktime-datasets
although currently it is only used as a fallback if the primary source is down.

Another option would be to put these into an optional, downloadable package, and make the loader require that as soft dependency.

Any thoughts or input, @benHeid, @JonathanBechtel, @yarnabrina?

benHeid (Contributor) commented May 19, 2024

> Any thoughts or input, @benHeid, @JonathanBechtel, @yarnabrina?

@fkiraly I agree the file size footprint should be reduced. The best approach would be to make the datasets available in a different repo or a datastore, with a data loader as suggested. Is there an existing repo or website that already hosts these data? If so, would it be possible to access them by writing our own loader?

fkiraly (Collaborator) commented May 20, 2024

@ericjb mentioned https://github.com/holgern/pyRdatasets in his document, which might be an option.

Given the relatively small size of the datasets, we would not run into download restrictions if we moved the datasets to a separate repository in the sktime org, either.

ericjb (Contributor, Author) commented May 21, 2024

I contacted both

  1. https://github.com/vincentarelbundock/Rdatasets - which stores over 2,000 R datasets, including fpp2 (this PR is related to fpp3)
  2. https://pypi.org/project/rdatasets/ - which is a Python package that accesses the datasets in the above.

In both cases I opened an issue related to this PR. I am hopeful that one, or both, of the maintainers will respond positively, but ... so far no response.

I don't want to delay this PR too long while waiting for a response. I looked at https://github.com/sktime/sktime-datasets as mentioned by @fkiraly, but it is not clear to me what would be involved in using it. Any suggestions on how to proceed would be welcome.

fkiraly (Collaborator) commented May 21, 2024

Oh, I see, Rdatasets does not have the fpp3 ones.

Here are possible ways to proceed; I split these into the sktime side and the data hosting side.

  1. sktime side:
  • we can add the loader as you suggest, and let's make the download location it points to flexible. Under your design, that should not be too difficult, although manual. This should ensure flexibility in handling different solutions for the data mirror, and in switching from one to a better one once it becomes available.
  • a luxury version would be finalizing DRAFT [ENH] experimental: datasets as objects #4333; that design has some abstraction around handling mirrors, so perhaps it is a second step rather than the first one. In case you are interested in helping out with this, it would be great for a benchmarking framework and for user discoverability. It would also abstract away the load/prep boilerplate for setting up a benchmarking experiment.
  2. data mirror side:
  • baseline version: we dump all the datasets into a repository in the sktime GitHub org. I can give you maintainer rights to that, and the download points to the blob storage of that repo. As the datasets are small and the total size is also small, this should be legitimate - although non-standard - usage of a repo: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github
  • "ship with sktime" version: small files are included in sktime, and larger files are put in the repo. So the loader needs to handle a sequence of mirrors - the package itself, and then the repo - checking the repo only if the file is not found in the sktime datasets folder. This kind of logic could be useful anyway, in case we end up working with multiple mirrors (see above, and the sketch after this list).
  • third party option: we ensure that one of the other packages provides the files that we want. This may take some time, though. Once this is done, we can treat the third party data package as a soft dependency and have the loader point to it.
  • own package: we create our own data package, instead of relying on a third party. This is not as complicated as it sounds, but someone needs to maintain it. We can use the sktime org for that, and use the template workflow we have for releasing packages. This would require some coordination.
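To make the "ship with sktime, fall back to a repo" idea concrete, here is a minimal sketch of the fallback logic; the helper name, local folder layout, and mirror URL (a hypothetical sktime/sktime-datasets-fpp3 repo) are assumptions, not decisions:

```python
import os
from urllib.request import urlretrieve

import pandas as pd

# Hypothetical locations: a data folder shipped inside sktime, and the raw
# blob storage of a (not yet existing) sktime/sktime-datasets-fpp3 repo.
_LOCAL_DIR = os.path.join(os.path.dirname(__file__), "data", "fpp3")
_MIRROR_URL = "https://raw.githubusercontent.com/sktime/sktime-datasets-fpp3/main/{name}.csv"


def _fetch_fpp3_csv(name, cache_dir=None):
    """Return the raw CSV for ``name``, preferring the bundled copy over the mirror."""
    local_path = os.path.join(_LOCAL_DIR, f"{name}.csv")
    if os.path.exists(local_path):
        return pd.read_csv(local_path)

    # not bundled: download once into a user cache, then read from there
    cache_dir = cache_dir or os.path.join(os.path.expanduser("~"), ".sktime", "fpp3")
    os.makedirs(cache_dir, exist_ok=True)
    cached_path = os.path.join(cache_dir, f"{name}.csv")
    if not os.path.exists(cached_path):
        urlretrieve(_MIRROR_URL.format(name=name), cached_path)
    return pd.read_csv(cached_path)
```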

There are different trade-offs to consider regarding how quickly we get to a functioning prototype, and given limited time we may also have to choose between adding more features on the indexing side or on the data hosting side. Robustness of the data location is also a factor here.

If you have some time you want to spend in a structured way on this, you could join the benchmarking related workstream (onboarding of experienced contributors is May 31).

fkiraly (Collaborator) commented May 21, 2024

FYI @JonathanBechtel who has worked on datasets, and @benHeid, @hazrulakmal due to benchmarking connection.

fkiraly added the labels enhancement (Adding new functionality) and module:datasets&loaders (data sets and data loaders) on May 21, 2024
fkiraly (Collaborator) commented May 22, 2024

So, what do we do, @ericjb? Any preference? Happy to help.

ericjb (Contributor, Author) commented May 23, 2024

As there was no response from either maintainer of the two candidate approaches (Rdatasets, pyRdatasets), I went back to the drawing board.

As @benHeid pointed out, one should consider if these datasets are already hosted somewhere. And of course, they are, on CRAN. And, by serendipity, I received the following announcement in my LinkedIn feed today: rdata: Read R datasets from Python (https://www.linkedin.com/pulse/rdata-read-r-datasets-from-python-pyopensci-z8obf/).

As a proof of concept (POC), and with the help of GitHub Copilot, I wrote a Python function that takes a dataset name as input, downloads the fpp3 tarball from CRAN into a temporary folder, finds the relevant dataset (which is in .rda format), reads it using this newly discovered rdata package, deletes the temporary folder, and returns the data as a DataFrame. I have attached my POC code so you can see the details.

If this approach is acceptable, I need to modify my PR, because I included logic as part of reading the CSV file (to create the MultiIndex, etc.) and that logic would now have to be moved a bit. I won't do that work until I get confirmation that this approach is acceptable.

get_fpp3.py.txt
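Since the attachment is not reproduced here, the following is a rough re-sketch of the described POC under stated assumptions: the CRAN URL is pinned to one fpp3 version (it changes over time), the .rda files are assumed to sit under fpp3/data/ inside the tarball, and the rdata calls follow that package's documented parse/convert two-step API.

```python
import tarfile
import tempfile
import urllib.request
from pathlib import Path

import pandas as pd
import rdata  # the pyOpenSci-announced package mentioned above

# Assumption: one pinned fpp3 source tarball; the version number changes over time.
FPP3_TARBALL = "https://cran.r-project.org/src/contrib/fpp3_0.5.tar.gz"


def get_fpp3_dataset(name: str) -> pd.DataFrame:
    """Download the fpp3 tarball, read ``<name>.rda`` with rdata, return a DataFrame."""
    with tempfile.TemporaryDirectory() as tmp:
        tarball = Path(tmp) / "fpp3.tar.gz"
        urllib.request.urlretrieve(FPP3_TARBALL, tarball)
        with tarfile.open(tarball) as tf:
            tf.extractall(tmp)
        # standard R package layout: datasets sit in the data/ folder of the package
        rda_path = Path(tmp) / "fpp3" / "data" / f"{name}.rda"
        parsed = rdata.parser.parse_file(rda_path)
        converted = rdata.conversion.convert(parsed)  # dict: R object name -> Python object
        return pd.DataFrame(converted[name])
```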

ericjb (Contributor, Author) commented May 23, 2024

Update: In the two hours since I added my most recent comment, I received an update from the maintainer of Rdatasets that he has added the datasets from fpp3. There are 13 such datasets (of the 27 in this PR). I submitted a new issue to Rdatasets requesting the addition of the remaining 14 (2 are from the tsibble package and 12 are from the tsibbledata package). Hopefully these will be added and we will have a simpler solution than my most recent proposal.

fkiraly (Collaborator) commented May 23, 2024

I see - so we have two potential mirrors in play?

I would assume the design would then look like this:

  • use one of the mirrors to get the dataset into numpy or pandas format
  • the sktime function downloads and converts the datasets into sktime format

Is that right?

We could also adopt a dual mirror approach (a rough sketch of such a dispatch follows below):

  • we add code to get this from either place, and the user can choose. Depending on the choice, they will require one or the other dependency. This may be useful for users in restricted environments that need packages approved, making it more likely that they have access to one of the options.
  • i.e., under the hood, we could add your loader code for option 1, and corresponding loader code for option 2
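A very rough sketch of what such a dual mirror dispatch could look like; the source parameter, URL pattern, and reuse of the POC function are assumptions for illustration, not code from this PR:

```python
import pandas as pd

# CSV mirror URL pattern used by the Rdatasets project (assumed here for illustration)
RDATASETS_CSV = "https://vincentarelbundock.github.io/Rdatasets/csv/{pkg}/{name}.csv"


def load_fpp3(name, package="fpp3", source="rdatasets"):
    """Dispatch between the two candidate mirrors (sketch; parameter names are assumptions)."""
    if source == "rdatasets":
        # option 1: CSV mirror, needs only pandas
        df = pd.read_csv(RDATASETS_CSV.format(pkg=package, name=name))
    elif source == "cran":
        # option 2: CRAN tarball route, needs the rdata soft dependency
        df = get_fpp3_dataset(name)  # the POC function sketched earlier in the thread
    else:
        raise ValueError(f"unknown source: {source!r}")
    # shared post-processing (not shown): build the MultiIndex, return the sktime object
    return df
```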

I would also recommend opening a new PR. I think the current state of this PR is useful, as it has logic going in a different direction, the "self-hosted" option. We should keep it open until we have converged on the final constellation.

ericjb (Contributor, Author) commented May 23, 2024

It seems like you are not against my proposal to use the rdata package to read the R .rda files. Please confirm if so, so I can go ahead with a first implementation that should provide access to all 27 datasets. Thanks.

fkiraly (Collaborator) commented May 23, 2024

Absolutely not (against the proposal); it is a good idea. Feel free to go ahead.

I was just trying to pin down the design, and suggested working on the rdata route in another PR.
For soft dependency checks, you can use _check_soft_dependencies.
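For illustration, a minimal sketch of how the loader could guard the rdata import; the import path for _check_soft_dependencies is the one commonly used in sktime around this time and may differ between versions:

```python
from sktime.utils.validation._dependencies import _check_soft_dependencies


def load_fpp3(name):
    # raise an informative error if the optional rdata package is not installed
    _check_soft_dependencies("rdata", severity="error")
    import rdata  # imported lazily, only after the check passes
    ...
```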

fkiraly marked this pull request as draft June 1, 2024 18:26

fkiraly (Collaborator) commented Jun 1, 2024

Turning this into a draft to avoid it being accidentally merged; keeping the PR open because of the potentially useful alternative code for indexing the datasets.

fkiraly added the label do not merge (should not be merged - e.g., CI diagnostic, experimental) on Jun 1, 2024