Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp #74

Closed · cdzombak opened this issue Mar 14, 2018 · 15 comments · Fixed by #1430
Labels: size: hard · status: backlog · touches: API/CLI/user interface · touches: data/schema/architecture · touches: docs · type: refactor · why: functionality · why: performance · why: security

Comments

cdzombak (Contributor) commented Mar 14, 2018

My Pinboard export contains several bookmarks with identical timestamps (presumably from imports from Delicious years ago).

The first time I run archive.py, I end up with several archive directories named like 1317249309, 1317249309.0, 1317249309.1, …. These directory names correspond properly with entries in index.json as expected.

If I run archive.py a second time with the same input, it appears to rewrite index.json, assigning different numerical suffixes to the 1317249309 timestamp. The entries in index.json no longer correspond with the contents of those archive directories on disk.

You can reproduce this with the following JSON file (pinboard.json):

[{"href":"http:\/\/www.flickr.com\/groups\/photoshopsupport\/discuss\/72157600201629413\/","description":"Flickr: Discussing Index Of Topics: Compliments of LifeLive~ in Photoshop Support Group","extended":"","meta":"c9aa62c0eaa3c35a587903100870df43","hash":"8dd9951810c0eae6af67651341af5110","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography photoshop retouching"},
{"href":"http:\/\/allinthehead.com\/retro\/345\/whats-in-your-utility-belt","description":"What's In Your Utility Belt? \u2014 All in the head","extended":"","meta":"746e69822f36f2e78c16fc789a7545b5","hash":"ac4d0527bca6c7d6741fee117f45f631","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"php"},
{"href":"http:\/\/www.tyndellphotographic.com\/plasticwallet.html","description":"Plastic Wallet Boxes for Wallet sized photos","extended":"","meta":"c133eb53f29d97c35c3f31768ff7ce45","hash":"60bbf228c559518b818ed7d0ff997a69","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography supply"},
{"href":"http:\/\/www.arduino.cc\/","description":"Arduino - HomePage","extended":"","meta":"a80835b5f374965f5f8a5990da6cf2be","hash":"78532ff2155cd9feeac11aba18739bdc","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"arduino elecdiy"},
{"href":"http:\/\/mbed.org\/","description":"Rapid Prototyping for Microcontrollers | mbed","extended":"","meta":"644e8e0c9ae522eb1ca025c2af604f7d","hash":"fd2d014879e63a9aca6c18eb11e19b02","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy"},
{"href":"http:\/\/www.tasankokaiku.com\/jarse\/?p=268","description":"Jarse \u00bb Blog Archive \u00bb Kohtauskone","extended":"","meta":"8483f7b4d0423ddd0930142c55c909e3","hash":"e971d3670f0fe1b2638c343e458f88bd","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy arduino dmx512"}]

Run the following commands:

./archive.py ~/path/to/pinboard.json
# contents on disk match up with contents of index.json

./archive.py ~/path/to/pinboard.json
# timestamp suffixes in index.json have been changed and no longer match content on disk

pirate (Member) commented Mar 15, 2018

I'm 90% sure this is due to the faulty cleanup/merging code I added recently. Can you try checking out 2aae6e0 (a known good version that I use on my server) and seeing if the problem exists there?

pirate added the type: bug report and status: needs followup labels on Mar 15, 2018

cdzombak (Contributor, Author) commented Mar 15, 2018

I checked out 2aae6e0 locally and ran through the same process as described above. I still see the same thing — the same URL gets reassigned a different timestamp suffix each time I run the archiver, and index.json is no longer in sync with the disk.

FWIW, I'm going to solve this for myself by running the archiver on an export with these duplicate timestamps exactly once, then only running incremental updates via RSS, containing only newer entries with unique timestamps, in the future. That should at least let me avoid the issue.

cdzombak (Contributor, Author) commented:

> FWIW, I'm going to solve this for myself by running the archiver on an export with these duplicate timestamps exactly once, then only running incremental updates via RSS, containing only newer entries with unique timestamps, in the future. That should at least let me avoid the issue.

This does not work around the issue. After running archive.py on my Pinboard RSS feed containing only new links, all these very old links with duplicate timestamps seem to have been assigned different numbers, so my index.html/json are out of sync with what's on disk 😢

pirate (Member) commented Apr 17, 2018

Try pulling master or 1776bdf and let me know if it works.

pirate (Member) commented Apr 25, 2018

I'm thinking about abolishing the incremental timestamp de-duping scheme (1523763242.1, 1523763242.2, 1523763242.3, etc.) because it's not really deterministic and was only causing problems.

The design is similar to buckets in a hash table to handle collisions, so I propose we take further inspiration from our hash-table roots and dedupe timestamps with a hash instead of an incrementing number:

I'm testing this right now and will push the code to a branch soon:

from hashlib import sha256

url_hash = sha256(link['url'].encode('utf-8')).hexdigest()
uniqueish_suffix = str(int(url_hash, base=16))[:10]                # ~10^9 distinct values is probably enough imo
link['timestamp'] = f'{link["timestamp"]}.{uniqueish_suffix}'

# timestamp   hash_of_url
# 1523763242.0903298423

We might as well add a hash suffix to all links while we're at it. The timestamp.hash format as a primary key is very useful because it instantly makes all links unique while retaining the original timestamp order.
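
For illustration, here's a minimal sketch of that scheme as a standalone function (the function name is mine, not ArchiveBox's): two links sharing a timestamp always get distinct keys, and re-running produces identical keys every time, which is exactly what the incrementing-suffix approach couldn't guarantee.

from hashlib import sha256

def unique_key(url, timestamp):
    # stable <timestamp>.<suffix> key: the suffix derives only from the URL
    url_hash = sha256(url.encode('utf-8')).hexdigest()
    uniqueish_suffix = str(int(url_hash, base=16))[:10]
    return f'{timestamp}.{uniqueish_suffix}'

print(unique_key('http://www.arduino.cc/', '1317249309'))  # same output on every run
print(unique_key('http://mbed.org/', '1317249309'))        # distinct from the line above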

The real issue is migrating old archives to the new format. Right now a migration system doesn't really exist, and my last attempt to build one (util.py:cleanup_archive()) failed miserably and corrupted some people's archive folders. One of the main reasons I'm switching to Django is its excellent forwards & backwards migration system.

Whatever new timestamp deduping solution we end up choosing will need to come with a migration script that forces Bookmark Archiver to reindex the links and move old folders to the new format.

cdzombak (Contributor, Author) commented May 23, 2018

@pirate I finally had a chance to test this with the latest master (a532d11).

Following the reproduction instructions in the original post, I end up with directories on disk whose index pages seem to line up with what index.json expects, but on closer inspection the archive folders on disk contain resources from multiple archive entries, and the screenshots etc. are still mixed up. One example (note the flickr screenshots for a non-flickr site):

[screenshot, 2018-05-23: archive entry for a non-flickr site showing flickr screenshots]

pirate (Member) commented May 24, 2018

Thanks for the report @cdzombak, this is fairly critical, so I'll take a look as soon as I can. In the meantime, if you absolutely need it working, I suggest writing a little script to pre-process your links to ensure they have unique timestamps.
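
A minimal sketch of one such pre-processing script for a Pinboard JSON export (filenames are placeholders), which nudges duplicate timestamps forward by one second until every entry is unique:

import json
from datetime import datetime, timedelta

with open('pinboard.json') as f:
    links = json.load(f)

seen = set()
for link in links:
    t = datetime.strptime(link['time'], '%Y-%m-%dT%H:%M:%SZ')
    while t in seen:
        t += timedelta(seconds=1)   # bump duplicates forward until unique
    seen.add(t)
    link['time'] = t.strftime('%Y-%m-%dT%H:%M:%SZ')

with open('pinboard-unique.json', 'w') as f:
    json.dump(links, f)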

pirate added the size: hard label and removed the status: needs followup label on May 24, 2018

pirate (Member) commented Jun 11, 2018

I found one of the bugs:

https://github.com/pirate/bookmark-archiver/blob/master/util.py#L281

archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive' + folder, 'archive.org.txt')

Should be:

archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive', folder, 'archive.org.txt')
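
To see why the missing comma matters, here's a quick demonstration (the paths below are hypothetical):

import os

ARCHIVE_DIR = '/data'
folder = '1317249309'

# with string concatenation, the path separator is lost:
os.path.join(ARCHIVE_DIR, 'html/archive' + folder, 'archive.org.txt')
# -> '/data/html/archive1317249309/archive.org.txt'   (wrong folder)

# with a separate argument, os.path.join inserts the separator:
os.path.join(ARCHIVE_DIR, 'html/archive', folder, 'archive.org.txt')
# -> '/data/html/archive/1317249309/archive.org.txt'  (correct)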

Very sneaky one-character bug 🤦‍♂️.

It will be fixed on master shortly.

cdzombak (Contributor, Author) commented:

Oh, yikes. That's a tricky one to find.

If you let me know when that's fixed on master, I can re-run my test and let you know the result.

pirate changed the title from "Archiver inconsistently assigns timestamp suffixes to bookmarks with the same timestamps" to "Use sha256 of url as unique id instead of timestamp" on Aug 30, 2018

pirate (Member) commented Aug 30, 2018

@aurelg FYI, you might be interested in following this issue.

pirate added the why: functionality label on Oct 12, 2018
pirate pinned this issue on Dec 14, 2018

pirate (Member) commented Jan 22, 2019

A quick update for those waiting on this issue. This is still taking a lot of thought because there are some hard problems to consider, namely:

  • convenience of user access vs integrity of disk storage
    Timestamps convey valuable information about when the website was archived, which is why other sites like archive.org and archive.is use them in URLs. I think timestamps will remain the primary way for users to access archived resources, but for database integrity and on-disk storage, it's much better to have things bucketed by a unique, immutable key. Because ArchiveBox needs to generate static output, it can't just serve two web endpoints that refer to one folder layout; it has to have both folder layouts accessible on disk and indexed statically. This means we have to use symlinks or hardlinks to represent a single folder layout without duplicating files.

  • folder and URL layout
    We have to allow archives to be accessed by either hash OR timestamp to preserve backwards compatibility.
    If we change the directory structure, we'll have to create a second directory full of symlinks pointing to their equivalent folders (see the sketch after this list).
    Something like this could work:

    output/
        index.html
        index.json
        archive/
            <timestamp>     -> output/assets/<hash>
        assets/
            <hash>/
                index.html
                index.json
    ...
    
    
  • hash type
    Some background: https://blog.codinghorror.com/url-shortening-hashes-in-practice/

    I wanted to go with a base62 encoding of the first 32 bits of a sha256 for super-dense URL slugs, but unfortunately macOS has a case-insensitive filesystem, so that's a disaster waiting to happen. We don't want two archives written to the same folder, and I'd rather explicitly pick a smaller hash encoding that works for everyone than attempt to offer two different hash options to users as a config var.

    It seems dangerous to go with something so obscure for a potentially long-term project, but maybe a base32 encoding of a few more sha256 bytes could work for URL- and filesystem-safe storage:

    In [1]: base32_crockford.encode(int(hashlib.sha256(url).hexdigest(), 16) % (10 ** 32))   # keep 32 decimal digits (~106 bits) of the sha256
    Out[1]: '7P6HMQR2VTC7P6HMQR2VTC'

    https://github.com/ulid/spec or https://github.com/jbittel/base32-crockford

  • migration
    We have to carefully move all the archive data to the new format and link everything, and we only get one try because many people will run it the moment it's released.

  • django server (this is done now)
    The next highest-priority issue is migrating to the new CLI format + django server, and I think it will make this problem slightly easier because the database can keep track of timestamps and map them to hashes on disk.
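
Here's a rough sketch of the symlink idea from the layout above (all paths and names are illustrative placeholders, not ArchiveBox's actual code):

import os

output_dir = 'output'
url_hash = '7P6HMQR2VTC'   # placeholder hash slug
timestamp = '1317249309'   # original archive timestamp

# canonical storage lives under assets/<hash>/
asset_dir = os.path.join(output_dir, 'assets', url_hash)
os.makedirs(asset_dir, exist_ok=True)

# archive/<timestamp> is just a symlink, so no files are duplicated
archive_dir = os.path.join(output_dir, 'archive')
os.makedirs(archive_dir, exist_ok=True)
link_path = os.path.join(archive_dir, timestamp)
if not os.path.islink(link_path):
    os.symlink(os.path.relpath(asset_dir, archive_dir), link_path)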

Plan:

Rather than implement hashed storage on the current CLI ArchiveBox, I think I want to build the django server first, because it will allow me to run safe, rewindable migrations on the archive data without destroying people's folders by accident.

1. create django server and script to load existing archive folder into db
2. add sha256 hash field with database migration
3. serve both urls `/<hash>/example.com/index.html` and `/<timestamp>/example.com/index.html`
4. export archive to new folder layout using new sha256 hash folders
5. continue serving both url types with data from new folder layout

This migration will apply to users of the ./archive CLI command as well. Once the initial django version is released, all subsequent versions will automatically migrate the data format forward to the latest schema when they start. This should be a mostly invisible process for users, as almost all migrations are non-destructive, and we will prompt the user with an explanation before performing destructive ones.
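
As a sketch of steps 2-3, a hypothetical model carrying both keys might look like this (model and field names are mine, not ArchiveBox's actual schema); lookups by either field could then back both URL styles while the on-disk layout migrates underneath:

import hashlib
from django.db import models

class Snapshot(models.Model):
    url = models.URLField(unique=True)
    timestamp = models.CharField(max_length=32, db_index=True)  # legacy key, kept for old URLs
    url_hash = models.CharField(max_length=64, unique=True)     # sha256 hex, the new unique id

    def save(self, *args, **kwargs):
        if not self.url_hash:
            self.url_hash = hashlib.sha256(self.url.encode('utf-8')).hexdigest()
        super().save(*args, **kwargs)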

If any of you have ideas or input on this process, any help is welcome.

pirate changed the title from "Use sha256 of url as unique id instead of timestamp" to "Uniquely identify URLs by hash of url instead of archive timestamp" on Jan 22, 2019
pirate unpinned this issue on Mar 28, 2019

karlicoss (Contributor) commented:

Hey @pirate, thanks for your response! Some thoughts:

  • convenience of user access vs integrity of disk storage

    > I think timestamps will remain the primary way for users to access archived resources

    > This means we have to use symlinks or hardlinks to represent a single folder layout without duplicating files.

    Symlinks sound like a good compromise. However, there will still be an issue when two symlinks clash due to identical timestamps, right? But at least it won't damage the actual backups.

    Have to say, I don't really understand the concept of using historic timestamps from, say, a Pinboard backup or Chrome history. You can't retrieve the page as it was at that timestamp (sadly!), so the only relevant timestamp is the current time, isn't it?
    Also, if you're using historic timestamps and happen to have the same URL coming in from several sources, would they all end up as different archived directories? Sounds a bit wasteful...

  • hash type
    sha256 is just 64 characters as hex, right? For URL shortening that's a problem, agreed. But as part of an archive URL, which you presumably wouldn't have to access that often, I don't think it's too bad.

pirate (Member) commented Apr 16, 2019

Oh, I'm already halfway through the migration process away from timestamps; I forgot to update this issue :) (Edit: it's ended up taking longer than I expected.)

Most of these problems go away as we start to use django more heavily, as the export folder structure can be changed dramatically now that we have a SQL database as the single source of truth with safe migrations.

In v0.4.0 I've already added hashes, and in a subsequent version they will become the primary unique key.

The archive will be served by django, with static folder exports becoming optional. This allows us to provide both timestamp and hash-based URLs via django, and the static export format can be selected by specifying a flag like:

archivebox export --folders=timestamp
# or
archivebox export --folders=hash

I might even add an option to do both with symlinks as discussed above, but for now I think letting the user decide is the simplest solution. Once we hear feedback from users on the new >v0.4.0 system, we can decide how to proceed with export formatting.

pirate added the touches: data/schema/architecture and status: wip labels and removed the why: functionality and type: bug report labels on Apr 23, 2019

pirate (Member) commented Dec 19, 2022

We should use one of the better-specified implementations of this idea, like ULID (linked above), instead of crockford-base32 directly:

 01AN4Z07BY      79KA1307SR9X4MV3

|----------|    |----------------|
 Timestamp          Randomness
   48bits             80bits
   10char             16char
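
For reference, ULID generation per the spec above is simple enough to sketch in a few lines (a minimal, unoptimized version):

import os
import time

ALPHABET = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'   # Crockford base32: no I, L, O, U

def ulid():
    ts = int(time.time() * 1000)                  # 48-bit millisecond timestamp
    rand = int.from_bytes(os.urandom(10), 'big')  # 80 bits of randomness
    value = (ts << 80) | rand                     # 128 bits total
    chars = []
    for _ in range(26):                           # 26 chars: 10 timestamp + 16 random
        chars.append(ALPHABET[value & 0x1F])
        value >>= 5
    return ''.join(reversed(chars))

print(ulid())   # lexicographically sortable by creation time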

pirate changed the title from "Uniquely identify URLs by hash of url instead of archive timestamp" to "Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp" on Jan 31, 2023
pirate added the why: functionality, why: security, why: performance, touches: docs, status: backlog, touches: API/CLI/user interface, and type: refactor labels and removed the status: wip label on Jun 13, 2023

pirate (Member) commented May 12, 2024

WIP: https://github.com/ArchiveBox/ArchiveBox/pull/1430/files
