Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check fail_on on more paths in jpeg load #3293

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jcupitt
Copy link
Member

@jcupitt jcupitt commented Jan 19, 2023

Some truncated JPEG files can cause out of order reads since jpeg_read_scanlines() will longjmp() out before it gets a chance to update y_pos. So we need to check fail_on on the longjmp() path and the out of order path.

See: #3289

Some truncated JPEG files can cause out of order reads since
jpeg_read_scanlines() will longjmp out before it gets a chance to
update y_pos. So we need to check fail_on on the longjmp path and the
out of order path.

See: #3289
@jcupitt
Copy link
Member Author

jcupitt commented Jan 19, 2023

Using the truncated jpg in the linked discussion, before this patch I see:

$ vips copy trunc4.jpg x.png

(vips:283787): VIPS-WARNING **: 11:58:21.438: error in tile 0 x 856
VipsJpeg: Corrupt JPEG data: premature end of data segment
VipsJpeg: Corrupt JPEG data: found marker 0xd9 instead of RST1
VipsJpeg: out of order read at line 848
vips2png: unable to write to target x.png

(vips:283787): VIPS-WARNING **: 11:58:21.446: read gave 100 warnings

(vips:283787): VIPS-WARNING **: 11:58:21.446: VipsJpeg: Corrupt JPEG data: premature end of data segment
VipsJpeg: Corrupt JPEG data: found marker 0xd9 instead of RST1
VipsJpeg: out of order read at line 848
vips2png: unable to write to target x.png

memory: high-water mark 4.96 MB
error buffer: VipsJpeg: Corrupt JPEG data: premature end of data segment
VipsJpeg: Corrupt JPEG data: found marker 0xd9 instead of RST1
VipsJpeg: out of order read at line 848
vips2png: unable to write to target x.png
$ echo $?
1

And after:

$ vips copy trunc4.jpg x.png

(vips:288926): VIPS-WARNING **: 12:57:48.408: read gave 100 warnings

(vips:288926): VIPS-WARNING **: 12:57:48.408: VipsJpeg: Corrupt JPEG data: premature end of data segment
VipsJpeg: Corrupt JPEG data: found marker 0xd9 instead of RST1
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848

memory: high-water mark 5.26 MB
error buffer: VipsJpeg: Corrupt JPEG data: premature end of data segment
VipsJpeg: Corrupt JPEG data: found marker 0xd9 instead of RST1
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
VipsJpeg: out of order read at line 848
$ echo $?
0

@kleisauke
Copy link
Member

Wow, I never thought of reducing the severity of this "error".

I think this might help for #2757. Using the same reproducer as mentioned in #2757 (reply in thread) and testing this PR in combination with commit kleisauke@0eeabf8, I see this on both master branch and that changeset:

$ vips black x.jpg 539 960
$ VIPS_STALL=1 vips reducev x.jpg x.v 2
vips_worker_work_unit: stall done, releasing y = 0 ... [x3]
$ sha256sum x.v
7be26c512c54b5e966f1239ff215e84418659ad267f35d9a9c5dbc73b5359073  x.v
$ vips black x2.jpg 10000 10000
$ VIPS_STALL=1 vips reducev x2.jpg x2.v 2
vips_worker_work_unit: stall done, releasing y = 0 ... [x39]
$ sha256sum x2.v
220ee900f622fb50d24a0b629ad0108822b3e442ee587b85a860c1b50a70d325  x2.v

Thus, it produces identical images without causing an out of order read error. I double-checked just to be sure.

/me falls off chair

Perhaps the PNG and TIFF loaders needs something similar?:

vips_error( class->nickname,
_( "out of order read at line %d" ), png->y_pos );
return( -1 );

vips_error( "vipspng",
_( "out of order read at line %d" ), read->y_pos );
return( -1 );

vips_error( "tiff2vips",
_( "out of order read -- at line %d, "
"but line %d requested" ), rtiff->y_pos, r->top );
return( -1 );

FWIW, I confirm that a similar change to the spng loader also solves the issue mentioned in #3270 (comment).

@jcupitt
Copy link
Member Author

jcupitt commented Jan 21, 2023

Huh this is a strange one, you're right. I made a branch with this fail-on change, plus your remove seq from reduce patch:

https://github.com/libvips/libvips/tree/remove-seq-from-reduce

And I can't provoke an out of order read either:

vips black x.jpg 10000 10000
VIPS_STALL=1 vips reducev x.jpg x.v 2
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 6
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 2
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 4
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 8

All seem to work.

I wonder if this is due to this fail-on change though? Could it be caused by the new threadpool sizing system? That removes unnecessary threads from pools, so you're less likely to get threads running ahead of the group and triggering misordered reads.

@jcupitt
Copy link
Member Author

jcupitt commented Jan 21, 2023

Oh haha this fails instantly:

$ vips black x.png 10000 10000
$ VIPS_STALL=1 vips reducev x.png x.v 2
vips_worker_work_unit: stall done, releasing y = 0 ...

(vips:457884): VIPS-WARNING **: 13:59:09.310: error in tile 0 x 1264

(vips:457884): VIPS-WARNING **: 13:59:09.311: error in tile 0 x 1280

(vips:457884): VIPS-WARNING **: 13:59:09.311: error in tile 0 x 1296

(vips:457884): VIPS-WARNING **: 13:59:09.311: error in tile 0 x 1312
vips_worker_work_unit: stall done, releasing y = 0 ...
vipspng: out of order read at line 3856

@jcupitt
Copy link
Member Author

jcupitt commented Jan 21, 2023

I tried with a real image (not just black):

$ vips crop st-francis.jpg x.jpg 0 0 10000 10000

And verified that this produced the correct result:

VIPS_STALL=1 vips reducev x.jpg x.v 2
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 6
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 2
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 4
VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-concurrency 8

And it all seems to work (on the remove-seq-from-reduce branch).

If I try git master and remove the vips_sequential() from reducev, it fails instantly.

@jcupitt
Copy link
Member Author

jcupitt commented Jan 21, 2023

... so I think I don't have a good understanding of what's happening. I might go for a cycle ride and see if anything pops into my head.

@kleisauke
Copy link
Member

Indeed, it's a bit strange. Good idea to check non-black images as well. I'm also not sure why I never see the vips_error() message, I would assume that's still being printed.

Have a nice cycle ride!

@kleisauke
Copy link
Member

FWIW, I just pushed commit 937ecf7 to that branch, to ensure it doesn't use t[3] (as that is supposed to be the NULL terminator of the array).

@kleisauke
Copy link
Member

I'm also not sure why I never see the vips_error() message, I would assume that's still being printed.

Ah, this error is only displayed when a non-zero exit code is returned. It can be noticed when you pass --vips-leak (causing vips_error_buffer() to be printed on exit).

@kleisauke
Copy link
Member

I found an out of order read error in the wild with certain operations and parameters that would produce a non-deterministic image on the remove-seq-from-reduce branch.

Details
import pyvips

pyvips.leak_set(True)
pyvips.cache_set_max_files(0)

# https://wsrv.nl/zebra.jpg
# vipsthumbnail zebra.jpg -o x.jpg -s "2135x1800!"
filename = 'x.jpg'

im = pyvips.Image.thumbnail(filename, 720)
im = im.flipver()
im.write_to_file('x2.jpg')
im = None

# Ensure leak checker is run, see:
# https://github.com/libvips/libvips/commit/01a9947006753d6bd4674483c619c7d2ae79bfbc
pyvips.ffi.cdef('void vips_shutdown();')
pyvips.vips_lib.vips_shutdown()
master remove-seq-from-reduce
master remove-seq-from-reduce
(non-deterministic, so image may vary)

Though, the produced image on the master branch is also incorrect, so perhaps this is another issue.

@kleisauke
Copy link
Member

kleisauke commented Jan 22, 2023

Though, the produced image on the master branch is also incorrect, so perhaps this is another issue.

Ah, I wasn't focused enough, flipver() needs random access, so you'll need to do .copy_memory() before calling that (similar to the autorotate parameter in jpegload). Sorry for the noise.

@kleisauke
Copy link
Member

kleisauke commented Jan 25, 2023

FWIW, increasing the vshrink by two times seems to produce an incorrect and non-deterministic image on the remove-seq-from-reduce branch.

check-nondeterminism.sh
#!/usr/bin/env bash
set -ex

# Usage: ./check-nondeterminism.sh 100
n=$1

# Prepare test image
curl -LO https://github.com/libvips/vips-bench/raw/master/sample2.v
vips replicate sample2.v t1.v 40 30
vips extract_area t1.v x.jpg 0 0 10000 10000
rm t1.v sample2.v

# Do first run, check sha256 of produced image
VIPS_STALL=1 vips reducev x.jpg x.v 4 --vips-leak
expected_sha256=$(sha256sum x.v | awk '{ print $1 }')

# Check for any nondeterminism
for (( i = 0; i < n; i++ )); do
  echo "run $i"
  VIPS_STALL=1 vips reducev x.jpg x.v 4
  echo "$expected_sha256 x.v" | sha256sum --check
done

@kleisauke
Copy link
Member

kleisauke commented Feb 8, 2023

After a couple hours of debugging (:sweat_smile:), I might be close to a possible fix.

Commit 2603bba adapts the r->height when the sequential->y_pos is out of order. With that commit, I also enabled these debug prints:

VIPS_DEBUG_MSG( "vips_line_cache_build: max_tiles = %d\n",
block_cache->max_tiles );

VIPS_DEBUG_MSG( "vips_line_cache_gen: bumped max_tiles to %d\n",
block_cache->max_tiles );

Using the above check-nondeterminism.sh script, I see:

vips_line_cache_build: max_tiles = 64
vips_line_cache_gen: bumped max_tiles to 132

So, it bumped the max_tiles from 64 to 132. It still causes an out of order read error (I suspect this is because it dropped some cached tiles earlier). However, when hardcoding that value (132) here:

block_cache->max_tiles = VIPS_MAX( 2,
4 * n_lines / block_cache->tile_height );

It does not seem to cause an out of order read error.

This same "trick" can also be applied to this test:

$ vips black x.jpg 10000 10000
$ VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-leak
vips_line_cache_build: max_tiles = 64
vips_worker_work_unit: stall done, releasing y = 0 ...
vips_line_cache_gen: bumped max_tiles to 66
...
vips_threadset_free: peak of 8 threads
memory: high-water mark 49.61 MB
error buffer: VipsJpeg: out of order read at line 776

And hardcoding the max_tiles to 66 seems to fix it, for example:

$ vips black x.jpg 10000 10000
$ VIPS_STALL=1 vips reducev x.jpg x.v 2 --vips-leak
vips_line_cache_build: max_tiles = 66
vips_worker_work_unit: stall done, releasing y = 0 ... [x39]
vips_threadset_free: peak of 8 threads
memory: high-water mark 49.61 MB

@kleisauke
Copy link
Member

I think the issue is that some old tiles are being reused too early. By the time it bumps the max_tiles it's too late, because the old tiles are no longer there. To fix this, I tried starting with max_tiles == -1 and setting it when it hits a threshold, but that didn't work for all scenarios. Perhaps a better solution would be that reducev / shrinkv can somehow hint the number of tiles to cache?

While investigating this, I found a few minor improvements, which I just pushed to the remove-seq-from-reduce branch:
master...remove-seq-from-reduce

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants