Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp #74
I'm 90% sure this is due to the faulty cleanup/merging code I added recently. Can you try checking out 2aae6e0 (a known good version that I use on my server) and seeing if the problem exists there? |
I checked out 2aae6e0 locally and ran through the same process as described above. I still see the same thing: the same URL gets reassigned a different timestamp suffix each time I run the archiver, and the entries in `index.json` stop matching the folders on disk. |
This does not work around the issue. After running `archive.py` a second time, the suffixes are reshuffled just as before. |
Try pulling master or 1776bdf and let me know if it works. |
I'm thinking about abolishing the incremental timestamp de-duping (the `1523763242.0`, `1523763242.1`, … suffixes). The design is similar to buckets in a hash table handling collisions, so I propose we take further inspiration from our hash-table roots and dedupe timestamps with a hash instead of an incrementing number. I'm testing this right now, I will push the code soon to a branch:

```python
from hashlib import sha256

url_hash = sha256(link['url'].encode('utf-8')).hexdigest()
uniqueish_suffix = str(int(url_hash, base=16))[:10]  # ~10^9 possibilities is probably enough imo
link['timestamp'] = f'{link["timestamp"]}.{uniqueish_suffix}'

# timestamp   hash_of_url
# 1523763242.090329842341
```

We might as well add a hash suffix to all links while we're at it.

The real issue is migrating old archives to the new format. Right now a migration system doesn't really exist, and my last attempt to build one stalled. Whatever new timestamp deduping solution we end up choosing will need to come with a migration script to force BA to reindex the links and move old folders to the new format.
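A minimal sketch of what such a migration script might look like (hypothetical and untested; assumes an `output/index.json` with a top-level `links` list and per-link folders under `output/archive/<timestamp>/`):

```python
import json
import os
from hashlib import sha256

OUTPUT_DIR = 'output'                              # assumed layout
ARCHIVE_DIR = os.path.join(OUTPUT_DIR, 'archive')

def hash_suffix(url):
    """Same uniqueish suffix as proposed above: first 10 digits of the sha256."""
    return str(int(sha256(url.encode('utf-8')).hexdigest(), base=16))[:10]

with open(os.path.join(OUTPUT_DIR, 'index.json')) as f:
    index = json.load(f)

for link in index['links']:
    old_folder = os.path.join(ARCHIVE_DIR, link['timestamp'])
    base_ts = link['timestamp'].split('.')[0]      # strip any .0/.1 dedup suffix
    new_ts = '{}.{}'.format(base_ts, hash_suffix(link['url']))
    new_folder = os.path.join(ARCHIVE_DIR, new_ts)
    if os.path.isdir(old_folder) and not os.path.exists(new_folder):
        os.rename(old_folder, new_folder)          # move the folder to its new name
    link['timestamp'] = new_ts                     # keep the index in sync

with open(os.path.join(OUTPUT_DIR, 'index.json'), 'w') as f:
    json.dump(index, f, indent=4)
```

|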
@pirate I finally had a chance to test this with the latest `master`. Following the reproduction instructions in the OP issue, I end up with directories on disk whose index pages seem to line up with what `index.json` says, but the issue isn't fully resolved. |
Thanks for the report @cdzombak, this is fairly critical, I'll take a look as soon as I can. In the meantime, if you absolutely need it working, I suggest writing a little script to pre-process your links to ensure they have unique timestamps, something like the sketch below.
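A rough sketch of that pre-processing step (hypothetical and untested; assumes a Pinboard JSON export where each entry has a `time` field like `2011-09-28T21:15:09Z`):

```python
import json
from datetime import datetime, timedelta

TIME_FMT = '%Y-%m-%dT%H:%M:%SZ'

with open('pinboard.json') as f:       # assumed input filename
    links = json.load(f)

seen = set()
for link in links:
    t = datetime.strptime(link['time'], TIME_FMT)
    while t.strftime(TIME_FMT) in seen:
        t += timedelta(seconds=1)      # nudge duplicates forward one second
    link['time'] = t.strftime(TIME_FMT)
    seen.add(link['time'])

with open('pinboard-unique.json', 'w') as f:
    json.dump(links, f, indent=2)
```

|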
I found one of the bugs: https://github.com/pirate/bookmark-archiver/blob/master/util.py#L281

```python
archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive' + folder, 'archive.org.txt')
```

Should be:

```python
archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive', folder, 'archive.org.txt')
```

Very sneaky one-character bug 🤦‍♂️. It will be fixed on master shortly.
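To spell out what that one comma changes (a quick illustration, using a folder name from this issue):

```python
import os

folder = '1317249309'

# buggy: '+' glues the folder name onto 'archive', producing a sibling directory
os.path.join('html/archive' + folder, 'archive.org.txt')
# -> 'html/archive1317249309/archive.org.txt'

# fixed: the folder becomes a separate path component
os.path.join('html/archive', folder, 'archive.org.txt')
# -> 'html/archive/1317249309/archive.org.txt'
```

|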
Oh, yikes. That's a tricky one to find. If you let me know when that's fixed on `master`, I'll give it another try. |
@aurelg fyi you might be interested in following this issue |
A quick update for those waiting on this issue. This is still taking a lot of thought because there are some hard problems to consider, mainly around how to safely migrate existing archives and what should serve as each link's stable unique ID.
Plan: Rather than implement hashed storage on the current CLI ArchiveBox, I think I want to build the django server first, because it will allow me to run safe, rewindable migrations on the archive data without destroying people's folders by accident.

This migration will take place for users of the new django-based version once it's released. If any of you have ideas or input on this process, any help is welcome.
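To make the idea concrete, here's a hypothetical sketch (not ArchiveBox's actual schema) of a django model where the URL hash, not the timestamp, is the stable unique key:

```python
from hashlib import sha256

from django.db import models

class Snapshot(models.Model):
    url = models.URLField(max_length=2048)
    url_hash = models.CharField(max_length=64, unique=True, editable=False)
    timestamp = models.CharField(max_length=32)   # legacy folder name, kept for display

    def save(self, *args, **kwargs):
        # identity is derived from the URL, so timestamp collisions can't corrupt it
        self.url_hash = sha256(self.url.encode('utf-8')).hexdigest()
        super().save(*args, **kwargs)
```

|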
Hey @pirate, thanks for your response! Some thoughts: |
Most of these problems go away as we start to use django more heavily, as the export folder structure can be changed dramatically now that we have a SQL database as the single source of truth with safe migrations. In v0.4.0 I've already added hashes, and in a subsequent version they will become the primary unique key. The archive will be served by django, with static folder exports becoming optional-only. This allows us to provide both timestamp-based and hash-based URLs via django, and the static export format can be selected by specifying a flag like:

```bash
archivebox export --folders=timestamp
# or
archivebox export --folders=hash
```

I might even add an option to do both with symlinks as discussed above, but for now I think letting the user decide is the simplest solution. Once we hear feedback from users on the new >v0.4.0 system we can decide how to proceed with export formatting.
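The "both with symlinks" idea might look roughly like this (hypothetical; the folder layout and function name are illustrative, not actual v0.4.0 behavior):

```python
import os

ARCHIVE_DIR = 'output/archive'   # assumed layout

def alias_timestamp_to_hash(timestamp, url_hash):
    """Keep the hash-named folder canonical; expose the timestamp as a symlink."""
    canonical = os.path.join(ARCHIVE_DIR, url_hash)
    alias = os.path.join(ARCHIVE_DIR, timestamp)
    if os.path.isdir(canonical) and not os.path.lexists(alias):
        # relative symlink, so the archive folder stays relocatable
        os.symlink(url_hash, alias)
```

|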
We should use one of these better implementations instead of crockford-base32 directly:
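For context on what those libraries wrap, a minimal illustrative Crockford base32 encoder (the alphabet drops I, L, O, and U to avoid ambiguous characters; no padding or checksum, so this is a sketch only):

```python
import os

CROCKFORD_ALPHABET = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'  # 32 symbols, no I/L/O/U

def crockford_b32(data: bytes) -> str:
    n = int.from_bytes(data, 'big')
    chars = []
    while n:
        n, rem = divmod(n, 32)
        chars.append(CROCKFORD_ALPHABET[rem])
    return ''.join(reversed(chars)) or '0'

print(crockford_b32(os.urandom(16)))  # a short, human-readable unique ID
```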
|
My Pinboard export contains several bookmarks with identical timestamps (presumably from imports from Delicious years ago).

The first time I run `archive.py`, I end up with several archive directories named like `1317249309`, `1317249309.0`, `1317249309.1`, …. These directory names correspond properly with entries in `index.json`, as expected.

If I run `archive.py` a second time with the same input, it appears to rewrite `index.json`, assigning different numerical suffixes to the `1317249309` timestamp. The entries in `index.json` no longer correspond with the contents of those archive directories on disk.

You can reproduce this with the following JSON file (`pinboard.json`):

Run the following commands:
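An illustrative stand-in for the attachment and commands (hypothetical, not the original files; assumes a Pinboard-style export and that `archive.py` takes the export file as its argument):

```json
[
  {"href": "https://example.com/a", "description": "First bookmark",  "time": "2011-09-28T22:35:09Z"},
  {"href": "https://example.com/b", "description": "Second bookmark", "time": "2011-09-28T22:35:09Z"},
  {"href": "https://example.com/c", "description": "Third bookmark",  "time": "2011-09-28T22:35:09Z"}
]
```

```bash
python archive.py pinboard.json   # first run: folders 1317249309, 1317249309.0, 1317249309.1
python archive.py pinboard.json   # second run: the suffixes in index.json get reshuffled
```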