
[BUG] With Pandas 2.0.0, load_dataset raises TypeError: read_csv() got an unexpected keyword argument 'mangle_dupe_cols' #5744

Closed
keyboardAnt opened this issue Apr 13, 2023 · 6 comments

keyboardAnt commented Apr 13, 2023

The load_dataset function works with Pandas 1.5.3 (it only emits a FutureWarning) but crashes with Pandas 2.0.0.
For your convenience, I opened a draft pull request with a quick fix: #5745


  • The FutureWarning mentioned above:

```
FutureWarning: the 'mangle_dupe_cols' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'mangle_dupe_cols'
```
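For reference, a minimal call of this shape reproduces it (the CSV path is illustrative; any CSV loaded through the csv builder is affected):

```python
from datasets import load_dataset

# With pandas 1.5.3 this only emits the FutureWarning above;
# with pandas 2.0.0 it raises TypeError, because datasets
# forwards mangle_dupe_cols to pd.read_csv. Path is illustrative.
dataset = load_dataset("csv", data_files="train.csv")
```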
albertvillanova (Member) commented

Thanks for reporting, @keyboardAnt.

We haven't noticed any crash in our CI tests. Could you please share the specific load_dataset command that crashes on your side, so that we can reproduce it?

lhoestq (Member) commented Apr 21, 2023

This has been fixed in datasets 2.11
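If you're not sure which versions are installed, a quick check (plain importlib.metadata, nothing datasets-specific):

```python
from importlib.metadata import version

# The fix shipped in datasets 2.11; anything older still forwards
# mangle_dupe_cols to pd.read_csv and crashes on pandas >= 2.0.
print("datasets:", version("datasets"))
print("pandas:", version("pandas"))
```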

pratt3000 commented Mar 1, 2024

I am still getting this bug with the latest pandas and datasets libraries installed. Anyone else?

```python
from datasets import load_dataset

dataset = load_dataset("csv", data_files={"train": "/kaggle/working/train.csv", "test": "/kaggle/working/test.csv"})
print(dataset)
```



```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 3
      1 from datasets import load_dataset
----> 3 dataset = load_dataset("csv", data_files={"train":"/kaggle/working/train.csv", "test":"/kaggle/working/test.csv"})
      4 print(dataset)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1691, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1688 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1690 # Download and prepare data
-> 1691 builder_instance.download_and_prepare(
   1692     download_config=download_config,
   1693     download_mode=download_mode,
   1694     ignore_verifications=ignore_verifications,
   1695     try_from_hf_gcs=try_from_hf_gcs,
   1696     use_auth_token=use_auth_token,
   1697 )
   1699 # Build dataset for splits
   1700 keep_in_memory = (
   1701     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1702 )

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:605, in DatasetBuilder.download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    603         logger.warning("HF google storage unreachable. Downloading and preparing it from source")
    604 if not downloaded_from_gcs:
--> 605     self._download_and_prepare(
    606         dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    607     )
    608 # Sync info
    609 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:694, in DatasetBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    690 split_dict.add(split_generator.split_info)
    692 try:
    693     # Prepare split will record examples associated to the split
--> 694     self._prepare_split(split_generator, **prepare_split_kwargs)
    695 except OSError as e:
    696     raise OSError(
    697         "Cannot find data file. "
    698         + (self.manual_download_instructions or "")
    699         + "\nOriginal error:\n"
    700         + str(e)
    701     ) from None

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1151, in ArrowBasedBuilder._prepare_split(self, split_generator)
   1149 generator = self._generate_tables(**split_generator.gen_kwargs)
   1150 with ArrowWriter(features=self.info.features, path=fpath) as writer:
-> 1151     for key, table in logging.tqdm(
   1152         generator, unit=" tables", leave=False, disable=True  # not logging.is_progress_bar_enabled()
   1153     ):
   1154         writer.write_table(table)
   1155     num_examples, num_bytes = writer.finalize()

File /opt/conda/lib/python3.10/site-packages/tqdm/notebook.py:249, in tqdm_notebook.__iter__(self)
    247 try:
    248     it = super(tqdm_notebook, self).__iter__()
--> 249     for obj in it:
    250         # return super(tqdm...) will not catch exception
    251         yield obj
    252 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File /opt/conda/lib/python3.10/site-packages/tqdm/std.py:1170, in tqdm.__iter__(self)
   1167 # If the bar is disabled, then just walk the iterable
   1168 # (note: keep this check outside the loop for performance)
   1169 if self.disable:
-> 1170     for obj in iterable:
   1171         yield obj
   1172     return

File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/csv/csv.py:154, in Csv._generate_tables(self, files)
    152 dtype = {name: dtype.to_pandas_dtype() for name, dtype in zip(schema.names, schema.types)} if schema else None
    153 for file_idx, file in enumerate(files):
--> 154     csv_file_reader = pd.read_csv(file, iterator=True, dtype=dtype, **self.config.read_csv_kwargs)
    155     try:
    156         for batch_idx, df in enumerate(csv_file_reader):

TypeError: read_csv() got an unexpected keyword argument 'mangle_dupe_cols'
```

lhoestq (Member) commented Mar 1, 2024

Feel free to update datasets to fix this issue:

```
pip install -U datasets
```
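After upgrading, the call from the traceback above should go through; a quick sanity check (same Kaggle paths as before):

```python
from datasets import load_dataset

# With datasets >= 2.11 the CSV builder no longer forwards the
# deprecated mangle_dupe_cols kwarg to pd.read_csv, so this
# should work on pandas 2.x as well.
dataset = load_dataset(
    "csv",
    data_files={"train": "/kaggle/working/train.csv",
                "test": "/kaggle/working/test.csv"},
)
print(dataset)
```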

httplups commented

I am still having the same issue with datasets version >= 2.14.

zmoki688 commented Apr 9, 2024

Edit: Sorry, I found that our version is 2.2.1. Please ignore the comment below; this issue was already solved by this line:

```python
_PANDAS_READ_CSV_DEPRECATED_PARAMETERS = ["warn_bad_lines", "error_bad_lines", "mangle_dupe_cols"]
```
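A minimal sketch of how such a list is applied before calling pandas (the exact helper in datasets may differ; this is just the idea):

```python
import pandas as pd

# Kwargs that newer pandas no longer accepts in read_csv.
_PANDAS_READ_CSV_DEPRECATED_PARAMETERS = ["warn_bad_lines", "error_bad_lines", "mangle_dupe_cols"]

def read_csv_compat(file, read_csv_kwargs):
    # Drop the deprecated keys so pd.read_csv does not raise
    # "unexpected keyword argument" on pandas >= 2.0.
    kwargs = {k: v for k, v in read_csv_kwargs.items()
              if k not in _PANDAS_READ_CSV_DEPRECATED_PARAMETERS}
    return pd.read_csv(file, iterator=True, **kwargs)
```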

This issue still exists, as you can see in version 2.14:

```python
mangle_dupe_cols: bool = True
```

```python
"mangle_dupe_cols": self.mangle_dupe_cols,
```

That is, "mangle_dupe_cols" still exists in the arguments.

And this error occurs at this line:

```python
csv_file_reader = pd.read_csv(file, iterator=True, dtype=dtype, **self.config.pd_read_csv_kwargs)
```

where

```python
file == '~/llama/llama-recipes/recipes/finetuning/gtrain_10k.csv'
dtype == None
self.config.pd_read_csv_kwargs == {
    "sep": ",",
    "header": "infer",
    "index_col": None,
    "usecols": None,
    "mangle_dupe_cols": True,
    "engine": None,
    "true_values": None,
    "false_values": None,
    "skipinitialspace": False,
    "skiprows": None,
    "nrows": None,
    "na_values": None,
    "keep_default_na": True,
    "na_filter": True,
    "verbose": False,
    "skip_blank_lines": True,
    "thousands": None,
    "decimal": ".",
    "lineterminator": None,
    "quotechar": '"',
    "quoting": 0,
    "escapechar": None,
    "comment": None,
    "encoding": None,
    "dialect": None,
    "skipfooter": 0,
    "doublequote": True,
    "memory_map": False,
    "float_precision": None,
    "chunksize": 10000,
}
```

for me.

Here is where we got the error: meta-llama/llama-recipes#426
