Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp #74

Closed · cdzombak opened this issue Mar 14, 2018 · 15 comments · Fixed by #1430
Labels: size: hard · status: backlog · touches: API/CLI/user interface · touches: data/schema/architecture · touches: docs · type: refactor · why: functionality · why: performance · why: security

Comments

cdzombak (Contributor) commented Mar 14, 2018

My Pinboard export contains several bookmarks with identical timestamps (presumably from imports from Delicious years ago).

The first time I run archive.py, I end up with several archive directories named like 1317249309, 1317249309.0, 1317249309.1, …. These directory names correspond properly with entries in index.json as expected.

If I run archive.py a second time with the same input, it appears to rewrite index.json, assigning different numerical suffixes to the 1317249309 timestamp. The entries in index.json no longer correspond with the contents of those archive directories on disk.

You can reproduce this with the following JSON file (pinboard.json):

[{"href":"http:\/\/www.flickr.com\/groups\/photoshopsupport\/discuss\/72157600201629413\/","description":"Flickr: Discussing Index Of Topics: Compliments of LifeLive~ in Photoshop Support Group","extended":"","meta":"c9aa62c0eaa3c35a587903100870df43","hash":"8dd9951810c0eae6af67651341af5110","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography photoshop retouching"},
{"href":"http:\/\/allinthehead.com\/retro\/345\/whats-in-your-utility-belt","description":"What's In Your Utility Belt? \u2014 All in the head","extended":"","meta":"746e69822f36f2e78c16fc789a7545b5","hash":"ac4d0527bca6c7d6741fee117f45f631","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"php"},
{"href":"http:\/\/www.tyndellphotographic.com\/plasticwallet.html","description":"Plastic Wallet Boxes for Wallet sized photos","extended":"","meta":"c133eb53f29d97c35c3f31768ff7ce45","hash":"60bbf228c559518b818ed7d0ff997a69","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography supply"},
{"href":"http:\/\/www.arduino.cc\/","description":"Arduino - HomePage","extended":"","meta":"a80835b5f374965f5f8a5990da6cf2be","hash":"78532ff2155cd9feeac11aba18739bdc","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"arduino elecdiy"},
{"href":"http:\/\/mbed.org\/","description":"Rapid Prototyping for Microcontrollers | mbed","extended":"","meta":"644e8e0c9ae522eb1ca025c2af604f7d","hash":"fd2d014879e63a9aca6c18eb11e19b02","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy"},
{"href":"http:\/\/www.tasankokaiku.com\/jarse\/?p=268","description":"Jarse \u00bb Blog Archive \u00bb Kohtauskone","extended":"","meta":"8483f7b4d0423ddd0930142c55c909e3","hash":"e971d3670f0fe1b2638c343e458f88bd","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy arduino dmx512"}]

Run the following commands:

./archive.py ~/path/to/pinboard.json
# contents on disk match up with contents of index.json

./archive.py ~/path/to/pinboard.json
# timestamp suffixes in index.json have been changed and no longer match content on disk

pirate (Member) commented Mar 15, 2018

I'm 90% sure this is due to the faulty cleanup/merging code I added recently. Can you try checking out 2aae6e0 (a known good version that I use on my server) and seeing if the problem exists there?

pirate added the type: bug report and status: needs followup labels on Mar 15, 2018

cdzombak (Contributor, Author) commented Mar 15, 2018

I checked out 2aae6e0 locally and ran through the same process as described above. I still see the same thing — the same URL gets reassigned a different timestamp suffix each time I run the archiver, and index.json is no longer in sync with the disk.

FWIW, I'm going to solve this for myself by running the archiver on an export with these duplicate timestamps exactly once, then only running incremental updates via RSS, containing only newer entries with unique timestamps, in the future. That should at least let me avoid the issue.

cdzombak (Contributor, Author) commented:

> FWIW, I'm going to solve this for myself by running the archiver on an export with these duplicate timestamps exactly once, then only running incremental updates via RSS, containing only newer entries with unique timestamps, in the future. That should at least let me avoid the issue.

This does not work around the issue. After running archive.py on my Pinboard RSS feed containing only new links, all these very old links with duplicate timestamps seem to have been assigned different numbers, so my index.html/json are out of sync with what's on disk 😢

pirate (Member) commented Apr 17, 2018

Try pulling master or 1776bdf and let me know if it works.

pirate (Member) commented Apr 25, 2018

I'm thinking about abolishing the incremental timestamp de-duping scheme (1523763242.1, 1523763242.2, 1523763242.3, etc.) because it's not really deterministic and was only causing problems.

The design is similar to buckets in a hash table to handle collisions, so I propose we take further inspiration from our hash-table roots and dedupe timestamps with a hash instead of an incrementing number:

I'm testing this right now and will push the code to a branch soon:

from hashlib import sha256

url_hash = sha256(link['url'].encode('utf-8')).hexdigest()
uniqueish_suffix = str(int(url_hash, base=16))[:10]                # ~10^9 distinct values is probably enough imo
link['timestamp'] = f'{link["timestamp"]}.{uniqueish_suffix}'

# timestamp   hash_of_url
# 1523763242.0903298423

We might as well add a hash suffix to all links while we're at it. The timestamp.hash format as a primary key is very useful because it instantly makes all links unique while retaining the original timestamp order.
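
For illustration, here's a minimal sketch of that scheme as a standalone function (the function name is mine, not ArchiveBox's): two links sharing a timestamp always get distinct keys, and re-running produces identical keys every time, which is exactly what the incrementing-suffix approach couldn't guarantee.

from hashlib import sha256

def unique_key(url, timestamp):
    # stable <timestamp>.<suffix> key: the suffix derives only from the URL
    url_hash = sha256(url.encode('utf-8')).hexdigest()
    uniqueish_suffix = str(int(url_hash, base=16))[:10]
    return f'{timestamp}.{uniqueish_suffix}'

print(unique_key('http://www.arduino.cc/', '1317249309'))  # same output on every run
print(unique_key('http://mbed.org/', '1317249309'))        # distinct from the line above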

The real issue is migrating old archives to the new format. Right now a migration system doesn't really exist, and my last attempt to build one (util.py:cleanup_archive()) failed miserably and corrupted some people's archive folders. One of the main reasons I'm switching to Django is its excellent forwards & backwards migration system.

Whatever new timestamp deduping solution we end up choosing will need to come with a migration script that forces Bookmark Archiver to reindex the links and move old folders to the new format.

cdzombak (Contributor, Author) commented May 23, 2018

@pirate I finally had a chance to test this with the latest master (a532d11).

Following the reproduction instructions in the original post, I end up with directories on disk whose index pages seem to line up with what index.json expects, but on closer inspection the archive folders on disk contain resources from multiple archive entries, and the screenshots etc. are still mixed up. One example (note the flickr screenshots for a non-flickr site):

[screenshot, 2018-05-23: archive entry for a non-flickr site showing flickr screenshots]

pirate (Member) commented May 24, 2018

Thanks for the report @cdzombak, this is fairly critical, so I'll take a look as soon as I can. In the meantime, if you absolutely need it working, I suggest writing a little script to pre-process your links to ensure they have unique timestamps.
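
A minimal sketch of one such pre-processing script for a Pinboard JSON export (filenames are placeholders), which nudges duplicate timestamps forward by one second until every entry is unique:

import json
from datetime import datetime, timedelta

with open('pinboard.json') as f:
    links = json.load(f)

seen = set()
for link in links:
    t = datetime.strptime(link['time'], '%Y-%m-%dT%H:%M:%SZ')
    while t in seen:
        t += timedelta(seconds=1)   # bump duplicates forward until unique
    seen.add(t)
    link['time'] = t.strftime('%Y-%m-%dT%H:%M:%SZ')

with open('pinboard-unique.json', 'w') as f:
    json.dump(links, f)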

pirate added the size: hard label and removed the status: needs followup label on May 24, 2018

pirate (Member) commented Jun 11, 2018

I found one of the bugs:

https://github.com/pirate/bookmark-archiver/blob/master/util.py#L281

archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive' + folder, 'archive.org.txt')

Should be:

archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive', folder, 'archive.org.txt')
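
To see why the missing comma matters, here's a quick demonstration (the paths below are hypothetical):

import os

ARCHIVE_DIR = '/data'
folder = '1317249309'

# with string concatenation, the path separator is lost:
os.path.join(ARCHIVE_DIR, 'html/archive' + folder, 'archive.org.txt')
# -> '/data/html/archive1317249309/archive.org.txt'   (wrong folder)

# with a separate argument, os.path.join inserts the separator:
os.path.join(ARCHIVE_DIR, 'html/archive', folder, 'archive.org.txt')
# -> '/data/html/archive/1317249309/archive.org.txt'  (correct)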

Very sneaky one-character bug 🤦‍♂️.

It will be fixed on master shortly.

cdzombak (Contributor, Author) commented:

Oh, yikes. That's a tricky one to find.

If you let me know when that's fixed on master, I can re-run my test and let you know the result.

pirate changed the title from "Archiver inconsistently assigns timestamp suffixes to bookmarks with the same timestamps" to "Use sha256 of url as unique id instead of timestamp" on Aug 30, 2018

pirate (Member) commented Aug 30, 2018

@aurelg FYI, you might be interested in following this issue.

pirate added the why: functionality label on Oct 12, 2018
pirate pinned this issue on Dec 14, 2018

pirate (Member) commented Jan 22, 2019

A quick update for those waiting on this issue. This is still taking a lot of thought because there are some hard problems to consider, namely:

  • convenience of user access vs integrity of disk storage
    Timestamps convey valuable information about when the website was archived, which is why other sites like archive.org and archive.is use them in URLs. I think timestamps will remain the primary way for users to access archived resources, but for database integrity and on-disk storage, it's much better to have things bucketed by a unique, immutable key. Because ArchiveBox needs to generate static output, it can't just serve two web endpoints that refer to one folder layout; it has to have both folder layouts accessible on disk and indexed statically. This means we have to use symlinks or hardlinks to represent a single folder layout without duplicating files.

  • folder and URL layout
    We have to allow archives to be accessed by either hash OR timestamp to preserve backwards compatibility.
    If we change the directory structure, we'll have to create a second directory full of symlinks pointing to their equivalent folders (see the sketch after this list).
    Something like this could work:

    output/
        index.html
        index.json
        archive/
            <timestamp>     -> output/assets/<hash>
        assets/
            <hash>/
                index.html
                index.json
    ...
    
    
  • hash type
    Some background: https://blog.codinghorror.com/url-shortening-hashes-in-practice/

    I wanted to go with a base62 encoding of the first 32 bits of a sha256 for super-dense URL slugs, but unfortunately macOS has a case-insensitive filesystem, so that's a disaster waiting to happen. We don't want two archives written to the same folder, and I'd rather explicitly pick a smaller hash encoding that works for everyone than attempt to offer two different hash options to users as a config var.

    It seems dangerous to go with something so obscure for a potentially long-term project, but maybe a base32 encoding of a few more sha256 bytes could work for URL- and filesystem-safe storage:

    In [1]: base32_crockford.encode(int(hashlib.sha256(url).hexdigest(), 16) % (10 ** 32))   # keep 32 decimal digits (~106 bits) of the sha256
    Out[1]: '7P6HMQR2VTC7P6HMQR2VTC'

    https://github.com/ulid/spec or https://github.com/jbittel/base32-crockford

  • migration
    We have to carefully move all the archive data to the new format and link everything, and we only get one try because many people will run it the moment it's released.

  • django server (this is done now)
    The next highest-priority issue is migrating to the new CLI format + django server, and I think it will make this problem slightly easier because the database can keep track of timestamps and map them to hashes on disk.
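
Here's a rough sketch of the symlink idea from the layout above (all paths and names are illustrative placeholders, not ArchiveBox's actual code):

import os

output_dir = 'output'
url_hash = '7P6HMQR2VTC'   # placeholder hash slug
timestamp = '1317249309'   # original archive timestamp

# canonical storage lives under assets/<hash>/
asset_dir = os.path.join(output_dir, 'assets', url_hash)
os.makedirs(asset_dir, exist_ok=True)

# archive/<timestamp> is just a symlink, so no files are duplicated
archive_dir = os.path.join(output_dir, 'archive')
os.makedirs(archive_dir, exist_ok=True)
link_path = os.path.join(archive_dir, timestamp)
if not os.path.islink(link_path):
    os.symlink(os.path.relpath(asset_dir, archive_dir), link_path)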

Plan:

Rather than implement hashed storage on the current CLI ArchiveBox, I think I want to build the django server first, because it will allow me to run safe, rewindable migrations on the archive data without destroying people's folders by accident.

1. create django server and script to load existing archive folder into db
2. add sha256 hash field with database migration
3. serve both urls `/<hash>/example.com/index.html` and `/<timestamp>/example.com/index.html`
4. export archive to new folder layout using new sha256 hash folders
5. continue serving both url types with data from new folder layout

This migration will apply to users of the ./archive CLI command as well. Once the initial django version is released, all subsequent versions will automatically migrate the data format forward to the latest schema when they start. This should be a mostly invisible process for users, as almost all migrations are non-destructive, and we will prompt the user with an explanation before performing destructive ones.
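
As a sketch of steps 2-3, a hypothetical model carrying both keys might look like this (model and field names are mine, not ArchiveBox's actual schema); lookups by either field could then back both URL styles while the on-disk layout migrates underneath:

import hashlib
from django.db import models

class Snapshot(models.Model):
    url = models.URLField(unique=True)
    timestamp = models.CharField(max_length=32, db_index=True)  # legacy key, kept for old URLs
    url_hash = models.CharField(max_length=64, unique=True)     # sha256 hex, the new unique id

    def save(self, *args, **kwargs):
        if not self.url_hash:
            self.url_hash = hashlib.sha256(self.url.encode('utf-8')).hexdigest()
        super().save(*args, **kwargs)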

If any of you have ideas or input on this process, any help is welcome.

pirate changed the title from "Use sha256 of url as unique id instead of timestamp" to "Uniquely identify URLs by hash of url instead of archive timestamp" on Jan 22, 2019
pirate unpinned this issue on Mar 28, 2019

karlicoss (Contributor) commented:

Hey @pirate, thanks for your response! Some thoughts:

  • convenience of user access vs integrity of disk storage

    > I think timestamps will remain the primary way for users to access archived resources

    > This means we have to use symlinks or hardlinks to represent a single folder layout without duplicating files.

    Symlinks sound like a good compromise. However, there will still be an issue when two symlinks clash due to identical timestamps, right? But at least it won't damage the actual backups.

    Have to say, I don't really understand the concept of using historic timestamps from, say, a Pinboard backup or Chrome history. You can't retrieve the page as it was at that timestamp (sadly!), so the only relevant timestamp is the current time, isn't it?
    Also, if you're using historic timestamps and happen to have the same URL coming in from several sources, would they all end up as different archived directories? Sounds a bit wasteful...

  • hash type
    sha256 is just 64 characters as hex, right? For URL shortening that's a problem, agreed. But as part of an archive URL, which you presumably wouldn't have to access that often, I don't think it's too bad.

pirate (Member) commented Apr 16, 2019

Oh, I'm already halfway through the migration process away from timestamps; I forgot to update this issue :) (Edit: it's ended up taking longer than I expected.)

Most of these problems go away as we start to use django more heavily, as the export folder structure can be changed dramatically now that we have a SQL database as the single source of truth with safe migrations.

In v0.4.0 I've already added hashes, and in a subsequent version they will become the primary unique key.

The archive will be served by django, with static folder exports becoming optional. This allows us to provide both timestamp and hash-based URLs via django, and the static export format can be selected by specifying a flag like:

archivebox export --folders=timestamp
# or
archivebox export --folders=hash

I might even add an option to do both with symlinks as discussed above, but for now I think letting the user decide is the simplest solution. Once we hear feedback from users on the new >v0.4.0 system, we can decide how to proceed with export formatting.

pirate added the touches: data/schema/architecture and status: wip labels and removed the why: functionality and type: bug report labels on Apr 23, 2019

pirate (Member) commented Dec 19, 2022

We should use one of the better-specified implementations of this idea, like ULID (linked above), instead of crockford-base32 directly:

 01AN4Z07BY      79KA1307SR9X4MV3

|----------|    |----------------|
 Timestamp          Randomness
   48bits             80bits
   10char             16char
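
For reference, ULID generation per the spec above is simple enough to sketch in a few lines (a minimal, unoptimized version):

import os
import time

ALPHABET = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'   # Crockford base32: no I, L, O, U

def ulid():
    ts = int(time.time() * 1000)                  # 48-bit millisecond timestamp
    rand = int.from_bytes(os.urandom(10), 'big')  # 80 bits of randomness
    value = (ts << 80) | rand                     # 128 bits total
    chars = []
    for _ in range(26):                           # 26 chars: 10 timestamp + 16 random
        chars.append(ALPHABET[value & 0x1F])
        value >>= 5
    return ''.join(reversed(chars))

print(ulid())   # lexicographically sortable by creation time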

pirate changed the title from "Uniquely identify URLs by hash of url instead of archive timestamp" to "Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp" on Jan 31, 2023
pirate added the why: functionality, why: security, why: performance, touches: docs, status: backlog, touches: API/CLI/user interface, and type: refactor labels and removed the status: wip label on Jun 13, 2023

pirate (Member) commented May 12, 2024

WIP: https://github.com/ArchiveBox/ArchiveBox/pull/1430/files
