
[BUG] With Pandas 2.0.0, load_dataset raises TypeError: read_csv() got an unexpected keyword argument 'mangle_dupe_cols' #5744

Closed
keyboardAnt opened this issue Apr 13, 2023 · 6 comments

keyboardAnt commented Apr 13, 2023

The load_dataset function works with Pandas 1.5.3 (it only emits a FutureWarning) but crashes with Pandas 2.0.0.
For your convenience, I opened a draft pull request with a quick fix: #5745


  • The FutureWarning mentioned above:

```
FutureWarning: the 'mangle_dupe_cols' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'mangle_dupe_cols'
```
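For reference, a minimal call of this shape reproduces it (the CSV path is illustrative; any CSV loaded through the csv builder is affected):

```python
from datasets import load_dataset

# With pandas 1.5.3 this only emits the FutureWarning above;
# with pandas 2.0.0 it raises TypeError, because datasets
# forwards mangle_dupe_cols to pd.read_csv. Path is illustrative.
dataset = load_dataset("csv", data_files="train.csv")
```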
albertvillanova (Member) commented

Thanks for reporting, @keyboardAnt.

We haven't noticed any crash in our CI tests. Could you please share the specific load_dataset command that crashes on your side, so that we can reproduce it?

lhoestq (Member) commented Apr 21, 2023

This has been fixed in datasets 2.11
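If you're not sure which versions are installed, a quick check (plain importlib.metadata, nothing datasets-specific):

```python
from importlib.metadata import version

# The fix shipped in datasets 2.11; anything older still forwards
# mangle_dupe_cols to pd.read_csv and crashes on pandas >= 2.0.
print("datasets:", version("datasets"))
print("pandas:", version("pandas"))
```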

pratt3000 commented Mar 1, 2024

I am still getting this bug with the latest pandas and datasets libraries installed. Anyone else?

```python
from datasets import load_dataset

dataset = load_dataset("csv", data_files={"train": "/kaggle/working/train.csv", "test": "/kaggle/working/test.csv"})
print(dataset)
```



```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 3
      1 from datasets import load_dataset
----> 3 dataset = load_dataset("csv", data_files={"train":"/kaggle/working/train.csv", "test":"/kaggle/working/test.csv"})
      4 print(dataset)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1691, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1688 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1690 # Download and prepare data
-> 1691 builder_instance.download_and_prepare(
   1692     download_config=download_config,
   1693     download_mode=download_mode,
   1694     ignore_verifications=ignore_verifications,
   1695     try_from_hf_gcs=try_from_hf_gcs,
   1696     use_auth_token=use_auth_token,
   1697 )
   1699 # Build dataset for splits
   1700 keep_in_memory = (
   1701     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1702 )

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:605, in DatasetBuilder.download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    603         logger.warning("HF google storage unreachable. Downloading and preparing it from source")
    604 if not downloaded_from_gcs:
--> 605     self._download_and_prepare(
    606         dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    607     )
    608 # Sync info
    609 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:694, in DatasetBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    690 split_dict.add(split_generator.split_info)
    692 try:
    693     # Prepare split will record examples associated to the split
--> 694     self._prepare_split(split_generator, **prepare_split_kwargs)
    695 except OSError as e:
    696     raise OSError(
    697         "Cannot find data file. "
    698         + (self.manual_download_instructions or "")
    699         + "\nOriginal error:\n"
    700         + str(e)
    701     ) from None

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1151, in ArrowBasedBuilder._prepare_split(self, split_generator)
   1149 generator = self._generate_tables(**split_generator.gen_kwargs)
   1150 with ArrowWriter(features=self.info.features, path=fpath) as writer:
-> 1151     for key, table in logging.tqdm(
   1152         generator, unit=" tables", leave=False, disable=True  # not logging.is_progress_bar_enabled()
   1153     ):
   1154         writer.write_table(table)
   1155     num_examples, num_bytes = writer.finalize()

File /opt/conda/lib/python3.10/site-packages/tqdm/notebook.py:249, in tqdm_notebook.__iter__(self)
    247 try:
    248     it = super(tqdm_notebook, self).__iter__()
--> 249     for obj in it:
    250         # return super(tqdm...) will not catch exception
    251         yield obj
    252 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File /opt/conda/lib/python3.10/site-packages/tqdm/std.py:1170, in tqdm.__iter__(self)
   1167 # If the bar is disabled, then just walk the iterable
   1168 # (note: keep this check outside the loop for performance)
   1169 if self.disable:
-> 1170     for obj in iterable:
   1171         yield obj
   1172     return

File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/csv/csv.py:154, in Csv._generate_tables(self, files)
    152 dtype = {name: dtype.to_pandas_dtype() for name, dtype in zip(schema.names, schema.types)} if schema else None
    153 for file_idx, file in enumerate(files):
--> 154     csv_file_reader = pd.read_csv(file, iterator=True, dtype=dtype, **self.config.read_csv_kwargs)
    155     try:
    156         for batch_idx, df in enumerate(csv_file_reader):

TypeError: read_csv() got an unexpected keyword argument 'mangle_dupe_cols'
```

lhoestq (Member) commented Mar 1, 2024

Feel free to update datasets to fix this issue:

```
pip install -U datasets
```
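After upgrading, the call from the traceback above should go through; a quick sanity check (same Kaggle paths as before):

```python
from datasets import load_dataset

# With datasets >= 2.11 the CSV builder no longer forwards the
# deprecated mangle_dupe_cols kwarg to pd.read_csv, so this
# should work on pandas 2.x as well.
dataset = load_dataset(
    "csv",
    data_files={"train": "/kaggle/working/train.csv",
                "test": "/kaggle/working/test.csv"},
)
print(dataset)
```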

httplups commented

I am still having the same issue with datasets version >= 2.14.

zmoki688 commented Apr 9, 2024

Edit: Sorry, I found that our version is 2.2.1. Please ignore the comment below; this issue was already solved by this line:

```python
_PANDAS_READ_CSV_DEPRECATED_PARAMETERS = ["warn_bad_lines", "error_bad_lines", "mangle_dupe_cols"]
```
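A minimal sketch of how such a list is applied before calling pandas (the exact helper in datasets may differ; this is just the idea):

```python
import pandas as pd

# Kwargs that newer pandas no longer accepts in read_csv.
_PANDAS_READ_CSV_DEPRECATED_PARAMETERS = ["warn_bad_lines", "error_bad_lines", "mangle_dupe_cols"]

def read_csv_compat(file, read_csv_kwargs):
    # Drop the deprecated keys so pd.read_csv does not raise
    # "unexpected keyword argument" on pandas >= 2.0.
    kwargs = {k: v for k, v in read_csv_kwargs.items()
              if k not in _PANDAS_READ_CSV_DEPRECATED_PARAMETERS}
    return pd.read_csv(file, iterator=True, **kwargs)
```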

This issue still exists, as you can see in version 2.14:

```python
mangle_dupe_cols: bool = True
```

```python
"mangle_dupe_cols": self.mangle_dupe_cols,
```

That is, "mangle_dupe_cols" still exists in the arguments.

And this error occurs at this line:

```python
csv_file_reader = pd.read_csv(file, iterator=True, dtype=dtype, **self.config.pd_read_csv_kwargs)
```

where

```python
file == '~/llama/llama-recipes/recipes/finetuning/gtrain_10k.csv'
dtype == None
self.config.pd_read_csv_kwargs == {
    "sep": ",",
    "header": "infer",
    "index_col": None,
    "usecols": None,
    "mangle_dupe_cols": True,
    "engine": None,
    "true_values": None,
    "false_values": None,
    "skipinitialspace": False,
    "skiprows": None,
    "nrows": None,
    "na_values": None,
    "keep_default_na": True,
    "na_filter": True,
    "verbose": False,
    "skip_blank_lines": True,
    "thousands": None,
    "decimal": ".",
    "lineterminator": None,
    "quotechar": '"',
    "quoting": 0,
    "escapechar": None,
    "comment": None,
    "encoding": None,
    "dialect": None,
    "skipfooter": 0,
    "doublequote": True,
    "memory_map": False,
    "float_precision": None,
    "chunksize": 10000,
}
```

for me.

Here is where we got the error: meta-llama/llama-recipes#426
