Broken "Glacier_remote_uploads_duplicates" bug link #85

Open
soxofaan opened this issue Jul 3, 2021 · 3 comments

Comments

@soxofaan
Contributor

soxofaan commented Jul 3, 2021

# deletion, retaining (ie. not listing) one of each identical archive id. This
# is useful to work around this bug:
# http://git-annex.branchable.com/bugs/Glacier_remote_uploads_duplicates/

this link https://git-annex.branchable.com/bugs/Glacier_remote_uploads_duplicates/ does not work anymore (404 Not Found)

@soxofaan
Contributor Author

soxofaan commented Jul 3, 2021

bug was removed from tracker by http://source.git-annex.branchable.com/?p=source.git;a=commit;h=b949e8504506dee6a2844bf61c0c9cf617fe9585

commit b949e8504506dee6a2844bf61c0c9cf617fe9585
Author: Joey Hess <[email protected]>
Date:   Tue Apr 19 13:46:11 2016 -0400

    remove old closed bugs and todo items to speed up wiki updates and reduce size

    Remove closed bugs and todos that were last edited or commented before Q3 2015.

@soxofaan
Contributor Author

soxofaan commented Jul 3, 2021

What follows here is an attempt to reconstruct the content of that bug report at the time it was removed:

Please describe the problem.

Other references:

#19
http://git-annex.branchable.com/special_remotes/glacier/#comment-a2b05b8dc2d640ee498d90398f02931c

Background

  • Glacier doesn't support keys that the client selects, unlike S3. If you upload to Glacier, Glacier assigns a unique ID, not the client.
  • Glacier does support an "archive description" which is immutable. It also provides this "archive description" in an inventory listing, together with the unique IDs.
  • An "archive description" is not a unique key. It's perfectly possible to upload multiple archives to Glacier with the same "archive description".
  • glacier-cli uses the "archive description" field as an upload identifier, since the unique IDs are unfriendly to users. However, since they are potentially ambiguous identifiers, it also supports disambiguation using the ID itself. See "Addressing Archives" in README.md for details.

The Problem

This is what I believe is happening in the two reports referenced above. When git-annex is used without --trust-glacier, it can end up uploading the same data multiple times. From git-annex's point of view, it cannot verify that the data is already in Glacier, so it uploads again, expecting an overwrite operation if the key is already in Glacier. Since glacier-cli maps the key to an "archive description" that can be duplicated, this is not what happens. Instead, a second archive is uploaded.

When git-annex later does a "checkpresent" operation, glacier-cli fails. This is because the request is ambiguous, since there are two archives in Glacier with the same "key". The error message could be better here, but I believe that the behaviour is correct.
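
The mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyVault`, `ids_for` and `checkpresent` are hypothetical names made up for illustration, not the Glacier API or glacier-cli's code. It shows that an upload never overwrites, so a second upload under the same description makes a later lookup by description ambiguous.

```python
import itertools


class ToyVault:
    """In-memory stand-in for a Glacier vault (hypothetical, not the real API)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.archives = {}  # archive ID -> (description, data)

    def upload(self, description, data):
        # The service, not the client, assigns the ID; nothing is overwritten.
        archive_id = f"id-{next(self._counter)}"
        self.archives[archive_id] = (description, data)
        return archive_id

    def ids_for(self, description):
        # All archive IDs whose description matches the requested key.
        return sorted(
            archive_id
            for archive_id, (desc, _data) in self.archives.items()
            if desc == description
        )


def checkpresent(vault, key):
    """Return True or False, or raise if the key names more than one archive."""
    ids = vault.ids_for(key)
    if len(ids) > 1:
        raise LookupError(f"ambiguous key {key!r}: {ids}")
    return len(ids) == 1
```

In this model, uploading the same key twice leaves two archives behind, and `checkpresent` raises instead of answering, which mirrors the failure reported above.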

Discussion

glacier-cli can find out what data Glacier claims to have using an inventory retrieval. However, this retrieval takes about four hours and can be out of date (eg. if someone else recently deleted the archive from another client). Thus, I can understand git-annex's desire not to trust this data or a cache of it.

However, whatever we do, it is impossible to map an "upload or overwrite on key X" type command to Glacier. We'll always end up with duplicates. Even if git-annex stored the Glacier archive IDs, there is no API to replace an existing archive with the same ID, and inventories are out of date even before we retrieve them.

Workaround

If the problem is as I think it is, always applying --trust-glacier should prevent the problem from occurring in most cases, since git-annex will run "checkpresent" and glacier-cli will confirm that the archive exists.

To fix the problem after it has occurred, it should be sufficient to delete duplicates using glacier-cli, since they should be identical to each other. Some enhancement of the glacier-cli archive list command would help here.

Update 10 June 2013: I've pushed a glacier-cli update and helper script in commit b68835. This adds a --force-ids option to glacier archive list, which the helper script glacier-list-duplicates.sh uses to identify duplicates that can be removed. If you're affected by this issue, I suggest that you use this helper to identify and fix your problem by removing the duplicates. Please do so carefully by checking that the output of the helper is correct before you use it to delete the duplicates. See the comments at the top of the helper script for usage information.
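
The idea behind such a helper can be sketched as follows. This is not the helper script itself: the function name `duplicates_to_delete` and the input format (pairs of archive ID and description) are assumptions for illustration, and the real output format of `glacier archive list --force-ids` is not reproduced here. The sketch keeps the first archive seen for each description and lists the rest as deletion candidates.

```python
from collections import defaultdict


def duplicates_to_delete(inventory):
    """Return archive IDs to delete, retaining one archive per description.

    `inventory` is an iterable of (archive_id, description) pairs; this
    input shape is an assumption made for this sketch.
    """
    by_description = defaultdict(list)
    for archive_id, description in inventory:
        by_description[description].append(archive_id)

    doomed = []
    for ids in by_description.values():
        # Keep the first ID of each group and list the others.
        doomed.extend(ids[1:])
    return doomed
```

Like the workaround above, this assumes that archives sharing a description hold identical data, so the list should be checked before anything is deleted.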

[[fixed|done]], at least for the only well-working case for glacier, where
only one repository can access glacier directly. --[[Joey]]

Comments

Comment 1 - by joey - 2013-05-23T15:55:16Z

Please beware of the warning on the man page when using --trust-glacier-inventory:

Be careful using this, especially if you or someone else might
have recently removed a file from Glacier. If you try to drop
the only other copy of the file, and this switch is enabled, you
could lose data!

While I'm inclined to want git-annex to store the necessary mappings from keys to glacier IDs in the git-annex branch, which would allow uploads/downloads from multiple repositories to the same glacier repository, it will not help with this problem. The git-annex branch can be out of date too.

It seems that what's needed is a separate form of the checkpresent hook, that's used when deciding whether to copy data to glacier.
We want this to trust the glacier inventory. But we don't want to trust the glacier inventory when moving data to glacier, or when running git annex drop! (unless --trust-glacier-inventory is specified). I think this would be easy to add. If you're up for testing a patch, I could do it today.

BTW, there does seem to be a workaround that avoids duplicate copies to glacier:

git annex copy --to glacier --not --in glacier

While normally copy checks the inventory to see if a key has been sent to glacier, and so will re-send, the --not --in glacier trusts the location tracking information, so if git-annex has sent the key before, it will skip the copy.

Comment 2 - by joey - 2013-05-23T15:57:08Z

I suppose another way to fix it along similar lines would be to make git annex copy always trust location tracking information when deciding whether to copy. I'm not sure how I feel about this though -- it might make things less robust in situations where git annex copy is run as a backup, and location tracking could have gotten out of date.

Comment 3 - by joey - 2013-05-23T15:59:37Z

It's also worth noting that the assistant always trusts the location log when deciding whether to send a key to a remote. So I think it will not trigger this bug. It seems only git annex copy will. (Well, maybe git annex move too in an edge case.)

Comment 4 - by Justin - 2013-05-27T22:24:44Z

If you're up for testing a patch, I could do it today.

I'm happy to test a patch. I haven't successfully compiled git-annex on my Mac, which is the only computer I have for the next month or so, but it wasn't too hard to get it to work on my Linux box.

Comment 5 - by joey - 2013-05-29T17:54:11Z

I started to make a branch with the change I suggested, but then I had another idea.

The checkpresent hook can return either True or False, or fail with a message if it cannot successfully check the remote. Currently for glacier, when --trust-glacier is not set, it always returns False. Crucially, in the case when a file is in glacier, this is telling git-annex it's not there, so copy re-uploads it. What if instead, when the glacier inventory is missing a file, it returns False, and when the glacier inventory has a file, it fails unless --trust-glacier is set?

The result would be:

  • git annex copy --to glacier would only send things not listed in inventory. If a file is listed in the inventory, copy would complain that --trust-glacier is not set, and not re-upload the file.
  • git annex drop would only trust that glacier has a file when --trust-glacier is set. Behavior unchanged.
  • git annex move --to glacier, when the file is not listed in inventory, would send the file, and delete it locally. Behavior unchanged.
  • git annex move --to glacier, when the file is listed in inventory, would only trust that glacier has the file when --trust-glacier is set.
  • git annex copy --from glacier / git annex get, when the file is located in glacier, would trust the location log, and attempt to get the file from glacier.

This seems like it should do the right thing in all cases, but I have not tested it. I've pushed a glacier branch with this change.
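
The rule proposed in this comment can be written down as a small decision function. This is a sketch in Python rather than git-annex's actual Haskell code, and `proposed_checkpresent`, `copy_should_upload` and `CheckFailed` are hypothetical names. It only encodes the three outcomes described above: False when the inventory lacks the file, True when the inventory lists it and the inventory is trusted, and a failure otherwise.

```python
class CheckFailed(Exception):
    """The remote could not be checked reliably (hypothetical name)."""


def proposed_checkpresent(in_inventory, trust_glacier):
    # Inventory does not list the file: report it as absent.
    if not in_inventory:
        return False
    # Inventory lists the file and the user chose to trust it.
    if trust_glacier:
        return True
    # Inventory lists the file but is not trusted: refuse to answer.
    raise CheckFailed("inventory lists the file, but --trust-glacier is not set")


def copy_should_upload(in_inventory, trust_glacier):
    """Model of the copy case: upload only when the check says 'absent'."""
    try:
        return not proposed_checkpresent(in_inventory, trust_glacier)
    except CheckFailed:
        # copy complains and skips the file instead of re-uploading it.
        return False
```

Under this model a file already listed in the inventory is never re-uploaded by copy, whether or not the inventory is trusted.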

Comment 6 - by Robie - 2013-06-10T17:24:34Z

This seems reasonable to me.

One other way that you could end up with a duplicate: if glacier-cli's cache is not up to date. For example: hosts A and B both have (the same) annex with the same Glacier special remote defined. Host A copies a file to Glacier. On host B, the glacier-cli cache doesn't know about the file, and so a copy to Glacier on host B also succeeds. When the cache is later brought up to date with glacier vault sync, the duplicate appears.
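
The two-host scenario can be replayed with a toy simulation. Everything here is hypothetical: the `Host` class, its `copy` and `sync` methods, and the shared list standing in for the vault are assumptions for illustration and do not reflect how glacier-cli stores its cache.

```python
class Host:
    """A client with its own, possibly stale, view of the vault (toy model)."""

    def __init__(self, name, vault):
        self.name = name
        self.vault = vault  # shared list of (archive_id, description)
        self.cache = set()  # descriptions this host believes are in the vault

    def copy(self, key):
        # Skip the upload only if the local cache already knows the key.
        if key in self.cache:
            return False
        self.vault.append((f"{self.name}-{len(self.vault) + 1}", key))
        self.cache.add(key)
        return True

    def sync(self):
        # Bring the cache up to date with what the vault really holds.
        self.cache = {description for _archive_id, description in self.vault}
```

If host A and host B each copy the same key before either syncs, both uploads succeed and the vault ends up with two archives for that key. Syncing afterwards only reveals the duplicate; it does not remove it.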

I'm not sure what we can do about this. Perhaps we need to accept that duplicates will occur, and handle them more gracefully.

Comment 7 - by joey - 2013-06-11T14:38:19Z

Ok, I've merged the glacier branch into master. I would still be happy to see some testing of this before my next release (in a week).

I guess I'll close this bug report. There are certainly still problems that can happen if there are multiple repositories all writing to glacier independently. Seems to me that one good way to deal with this is to set up a single remote that is configured to be a gateway to glacier.

Comment 8 - by Jimmy - 2013-11-18T00:00:32Z -- subject: For those on Mac OS X

The duplicates script fails because the BSD/MacOS version of uniq doesn't support the -D option.

You can work around this by installing the GNU version using Homebrew ('brew install coreutils') and then replacing the 'uniq' in the script with 'guniq' (Homebrew prefixes the coreutils with "g" by default).
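
For readers who would rather not install coreutils, the behaviour of GNU `uniq -D` (print every line that belongs to a run of adjacent identical lines) can be approximated in a few lines of Python. This is a sketch only: `all_repeated` is a hypothetical name, and any other options the helper script passes to uniq are not shown in this thread and are not handled here.

```python
from itertools import groupby


def all_repeated(lines):
    """Yield every line that is part of a run of adjacent identical lines."""
    for _line, run in groupby(lines):
        run = list(run)
        if len(run) > 1:
            yield from run
```

As with uniq itself, the input must already be sorted (or at least grouped) for duplicates to be adjacent.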

I still seem to be running into this bug using git annex version 4.20131106 and 'git annex copy --to glacier' without the '--not --in glacier' flags. It's not a problem to use the extra flags, but I wasn't originally aware of this issue and the duplicates don't always seem to occur. I'll do some more testing and see whether I can reliably predict what will create duplicates and what won't.

@basak
Owner

basak commented Jul 4, 2021

Thank you for the report. Maybe it would be easiest to "deep link" directly into the git history instead - for example to http://source.git-annex.branchable.com/?p=source.git;a=blob;f=doc/bugs/Glacier_remote_uploads_duplicates.mdwn;h=75014a5e049a397cef2dd8745e98d4094319e1f6;hb=a9c7260adb7f6336270fe1af90484abb3bbb3991 - plus all the comments?
