[tsv] add CLI option to use NUL as delimiter #2272

midichef · 2024-01-26T00:29:50Z

It's useful to parse output from GNU grep's -Z option. That produces lines that in Python are f'{filename}\0{line}\n', instead of the usual f'{filename}:{line}\n'.
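
As a rough illustration of the record shape described above, here is a minimal Python sketch that splits `grep -Z` style text into (filename, line) pairs. The function name `parse_grep_z` and the sample data are made up for this example; they are not part of VisiData or grep.

```python
def parse_grep_z(data: str):
    """Split grep -Z style output into (filename, line) pairs."""
    pairs = []
    # Records end with a newline; filename and matched line are separated by NUL.
    for record in data.split('\n'):
        if not record:
            continue  # skip the empty piece after the final newline
        # Split on the first NUL only, so a ':' in the filename is harmless.
        filename, sep, line = record.partition('\0')
        if sep:
            pairs.append((filename, line))
    return pairs


sample = 'a.txt\0first match\nb:c.txt\0second: match\n'
print(parse_grep_z(sample))
```

With a NUL separator, a filename such as `b:c.txt` stays unambiguous, which is not the case with the usual `:` separator.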

Right now the command line can't be used to specify a NUL delimiter, as in vd --delimiter="\0", because sys.argv strings are NUL-terminated and can't ever contain NUL.

My workarounds for now both use .visidatarc. Either add a temporary line:
vd.option('delimiter', '\x00', 'field delimiter to use for tsv/usv filetype', replay=True)
or add a new filetype to allow vd -f nsv:

@VisiData.api
def open_nsv(vd, p):
    tsv = TsvSheet(p.base_stem, source=p)
    tsv.delimiter = '\x00'
    tsv.reload()
    return tsv

Can open_nsv() be written without reload() right now? I couldn't think of another way to set delimiter for TsvSheet.


saulpw · 2024-01-26T00:35:55Z

The way to do this is:

@VisiData.api
def open_nsv(vd, p):
    return NsvSheet(p.base_stem, source=p)

class NsvSheet(TsvSheet):
    pass

NsvSheet.options.delimiter = '\x00'

Also is it literally not possible to pass a NUL character into Python from the CLI? Not even with $'\0'? That seems pretty egregious. Maybe we could make an empty separator mean NUL.

midichef · 2024-01-26T04:02:18Z

Ah right, thanks.

Yes, it is impossible for the shell to execute a program with arguments that contain a NUL character, as the argv passed to the exec*() system calls (in C) is an array of NUL-terminated strings.
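
The same limit can be observed from Python. This is a small sketch, assuming CPython, where I would expect `subprocess` to refuse the argument with a `ValueError` before any child process starts. The helper name `try_spawn_with_nul` is invented for the example.

```python
import subprocess
import sys


def try_spawn_with_nul():
    """Try to pass an argument containing NUL to a child Python process."""
    try:
        # The child would do nothing; the point is whether argv can be built at all.
        subprocess.run([sys.executable, '-c', 'pass', 'a\0b'], check=True)
    except ValueError as exc:
        return f'rejected: {exc}'
    return 'accepted'


print(try_spawn_with_nul())
```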

saulpw · 2024-01-27T02:19:50Z

Ah, of course. Well, I'd take a PR to make options.delimiter='' mean NUL-delimited, if you're up for it. We'll want to update the (new) docs at visidata/features/xsv_guide.py too.
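
A minimal sketch of that idea, not the code that was eventually merged: an empty delimiter option is mapped to NUL before splitting. `effective_delimiter` and `split_fields` are hypothetical names used only for illustration.

```python
def effective_delimiter(option_value: str) -> str:
    """Hypothetical helper: treat an empty delimiter option as NUL."""
    return '\0' if option_value == '' else option_value


def split_fields(line: str, option_value: str):
    """Split one row into fields using the effective delimiter."""
    return line.split(effective_delimiter(option_value))


print(split_fields('a.txt\0match', ''))  # empty option means NUL-delimited
print(split_fields('a\tb', '\t'))        # other delimiters are unchanged
```

The empty string is a convenient stand-in because it can be typed on a command line, as in --delimiter=, while a literal NUL cannot.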

midichef · 2024-01-29T01:32:54Z

Okay, great!

anjakefala · 2024-02-04T22:48:09Z

From @midichef

The issue is more complicated than I realized though. How should we handle comments when the delimiter is NUL?

i.e. TsvSheet.options.regex_skip = '^#.*' will currently skip over lines that look like comments. But it should definitely not do that when handling the output of find -print0.

My intuition is, regex_skip should not be used when the row delimiter is NUL, as we're not in classic TSV format any more.
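
To make the concern concrete, here is a toy row loader in Python that applies the skip regex only when rows are not NUL-delimited. It only models the intuition above; it is not VisiData's loader, and `load_rows` is a made-up name.

```python
import re


def load_rows(data, row_delimiter, regex_skip='^#.*'):
    """Toy loader: skip comment-like rows unless rows are NUL-delimited."""
    use_skip = bool(regex_skip) and row_delimiter != '\0'
    skip = re.compile(regex_skip) if use_skip else None
    rows = []
    for line in data.split(row_delimiter):
        if not line:
            continue  # drop entirely empty rows
        if skip and skip.match(line):
            continue  # drop rows that look like comments
        rows.append(line)
    return rows


find_output = '#notes.txt\0report.txt\0'  # shaped like find -print0 output
print(load_rows(find_output, '\0'))          # '#notes.txt' is kept
print(load_rows('#comment\nvalue\n', '\n'))  # the comment line is skipped
```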

midichef · 2024-02-09T07:53:53Z

There are two more issues where NUL as delimiter has a mismatch with the traditional TSV behavior.

The tsv loader assumes data is text, not binary.
It runs open_text_source(). This causes some unusual behavior.
If we read a NUL-separated file from disk, we get one row.
echo -n 'col\0' > one-row.nsv; vd -f tsv --row-delimiter= one-row.nsv
But if we pipe the same data:
echo -n 'col\0' |vd -f tsv --row-delimiter=
then the sheet has an extra row, containing just a newline. It happens because piped data passes through a RepeatFile. RepeatFile is for holding text data. If the data doesn't have a final newline, RepeatFile appends one. For text, that won't change its interpretation. But for NUL-delimited data, it makes the sheet gain a newline row.

I'm not sure what the right answer is here. The code that reads piped data makes quite strong assumptions that the piped data is text. (This is why binary file-guessing code like guess_zip() does not work on piped data.)

The tsv loader skips blank rows.
echo -n 'header\n\n\n\n\nval' |vd -f tsv makes two rows. So does:
echo -n 'header\0\0\0\0\0val' |vd -f tsv --row-delimiter=
I am not sure if this will surprise users or not.
That's done by if not line here:

visidata/visidata/loaders/tsv.py

Line 89 in 7c9799c

if not line or fp._regex_skip.match(line):
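
Both behaviors can be modeled in a few lines of Python. This is a simplified sketch under the assumption that piped input gets a final newline appended and that entirely empty rows are dropped; `ensure_trailing_newline` and `split_rows` are illustrative names, not VisiData functions.

```python
def ensure_trailing_newline(text):
    """Model of text-oriented buffering: append a newline if one is missing."""
    return text if text.endswith('\n') else text + '\n'


def split_rows(text, row_delimiter):
    """Split into rows, dropping entirely empty ones (like 'if not line')."""
    return [row for row in text.split(row_delimiter) if row]


from_disk = 'col\0'
from_pipe = ensure_trailing_newline(from_disk)

print(split_rows(from_disk, '\0'))  # one row
print(split_rows(from_pipe, '\0'))  # an extra row holding just a newline
print(split_rows('header\0\0\0\0\0val', '\0'))  # blank rows are dropped
```

The appended newline is not empty, so it survives the blank-row check and shows up as its own row.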

I'll think about these situations some more. For now, for my specific use cases, the code works well enough.

saulpw · 2024-02-09T09:06:38Z

The tsv loader skips blank rows.

Some TSV formats are delineated by \n\n, so without this, they load a blank row after every row by default. Also this only applies to entirely blank rows, so multi-column TSVs won't be affected (if you have a single-column TSV, in my mind that's just a text file and should be loaded with txt). Do you have a case where you want the blank lines to load?

midichef · 2024-02-20T04:52:13Z

Okay, makes sense. No, I don't have a case where I want the blank lines to load, the current behavior works for me.

midichef added the wishlist label Jan 26, 2024

anjakefala added the documentation label Jan 26, 2024

saulpw added a commit that referenced this issue Jan 26, 2024

[tsv] add XsvGuide for tsv/csv/lsv/usv #2272

7930061

anjakefala removed the documentation label Jan 27, 2024

anjakefala added the documentation label Jan 27, 2024

midichef mentioned this issue Jan 29, 2024

[tsv-] let options set NUL delimiters #2275

Merged

anjakefala closed this as completed in #2275 Feb 4, 2024

anjakefala reopened this Feb 4, 2024

midichef mentioned this issue Feb 8, 2024

[tsv] turn off regex skipping sometimes #2302

Merged

saulpw added wish granted and removed documentation labels May 24, 2024

saulpw closed this as completed May 25, 2024
