-
Notifications
You must be signed in to change notification settings - Fork 722
/
bfg-ish
executable file
·470 lines (427 loc) · 22.7 KB
/
bfg-ish
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
#!/usr/bin/env python3
"""
This is a re-implementation of BFG Repo Cleaner, with some changes...
New features:
* pruning unwanted objects streamlined (automatic repack) and made more robust
(BFG makes user repack manually, and while it provides instructions on how
to do so, it won't successfully remove large objects in cases like unpacked
refs, loose objects, or use of --no-blob-protection; the robustness details
are bugfixes, so are covered below.)
* pruning of commits which become empty (or become degenerate and empty)
* creation of new replace refs so folks can access new commits using old
(unabbreviated) commit hashes
* respects and uses grafts and replace refs in the rewrite to make them
permanent (this is half new feature, half bug fix; thus also mentioned
in bugfixes below)
* auto-update of commit encoding to utf-8 (as per fast-export's default;
could pass --preserve-commit-encoding to FilteringOptions.parse_args() if
this isn't wanted...)
Bug fixes:
* Works for both packfiles and loose objects
(With BFG, if you don't repack before running, large blobs may be retained.)
(With BFG, any files larger than core.bigFileThreshold are thus hard to
remove since they will not be packed by a gc or a repack.)
* Works for both packed-refs and loose refs
(As per BFG issue #221, BFG fails to properly walk history unless packed.)
* Works with replace refs
(BFG operates directly on packfiles and packed-refs, and does not
understand replace refs; see BFG issue #82)
* Updates both index and working tree at end of rewrite
(With BFG and --no-blob-protection, these are still left out-of-date. This
is a doubly-whammy principle-of-least-astonishment violation: (1) users
are likely to accidentally commit the "staged" changes, re-introducing the
large blobs or removed passwords, (2) even if they don't commit the
changes the index holding them will prevent gc from shrinking the repo.
Fixing these two glaring problems not only makes --no-blob-protection
safe to recommend, it makes it safe to make it the default.)
* Fixes the "protection" defaults
(With BFG, it won't rewrite the tree for HEAD; it can't reasonably switch
to doing so because of the bugs mentioned above with updating the index
and working tree. However, this behavior comes with a surprise for users:
if HEAD is "protected" because users should manually update it first, why
isn't that also true of the other branches? In my opinion, there's no
user-facing distinction that makes sense for such a difference in
handling. "Protecting" HEAD can also be an error-prone requirement for
users -- why do they have to manually edit all files the same way
--replace-text is doing and why do they have to risk dirty diffs if they
get it slightly different (or a useless and ugly empty commit if they
manage to get it right)? Finally, a third reason this was in my opinion a
bad default was that it works really poorly in conjunction with other
types of history rewrites, e.g. --subdirectory-filter,
--to-subdirectory-filter, --convert-to-git-lfs, --path-rename, etc. For
all three of these reasons, and the fixes mentioned above to make it safe,
--no-blob-protection is made the default.)
* Implements privacy improvements, defaulting to true
(As per BFG #139, one of the BFG maintainers notes problematic issues
with the "privacy" handling in BFG, suggesting options which could be
added to improve the story. I implemented those options, except that I
felt --private should be the default and made the various non-private
choices individual options; see the --use-* options.)
Other changes:
* Removed the --convert-to-git-lfs option
(As per BFG issues #116 and #215, and git-lfs issue #1589, handling LFS
conversion is poor in BFG and not recommended; other tools are suggested
even by the BFG authors.)
* Removed the --strip-biggest-blobs option
(I philosophically disagree with offering such an option when no
mechanism is provided to see what the N biggest blobs are. How is the
user supposed to select N? Even if they know they have three files
which have been large, they may be unaware of others in history. Even
if there aren't any other files in history and the user requests to
remove the largest three blobs, it might not be what they want: one of
the files might have had multiple versions, in which case their request
would only remove some versions of the largest file from history and
leave all versions of the second and third largest files as well as all
but three versions of the largest file. Finally, on a more minor note,
what is done in the case of a tie -- remove more than N, less than N, or
just pick one of the objects tieing for Nth largest at random? It's
ill-defined.)
...even with all these improvements, I think filter-repo is the better tool,
and thus I suggest folks use it. I have no plans to improve bfg-ish
further. However, bfg-ish serves as a nice demonstration of the ability to
use filter-repo to write different filtering tools, which was its purpose.
"""
"""
Please see the
***** API BACKWARD COMPATIBILITY CAVEAT *****
near the top of git-filter-repo.
"""
import argparse
import fnmatch
import os
import re
import subprocess
import tempfile
try:
import git_filter_repo as fr
except ImportError:
raise SystemExit("Error: Couldn't find git_filter_repo.py. Did you forget to make a symlink to git-filter-repo named git_filter_repo.py or did you forget to put the latter in your PYTHONPATH?")
subproc = fr.subproc
def java_to_fnmatch_glob(extended_glob):
if not extended_glob:
return None
curly_re = re.compile(br'(.*){([^{}]*)}(.*)')
m = curly_re.match(extended_glob)
if not m:
return [extended_glob]
all_answers = [java_to_fnmatch_glob(m.group(1)+x+m.group(3))
for x in m.group(2).split(b',')]
return [item for sublist in all_answers for item in sublist]
class BFG_ish:
def __init__(self):
self.blob_sizes = {}
self.filtered_blobs = {}
self.cat_file_proc = None
self.replacement_rules = None
self._hash_re = re.compile(br'(\b[0-9a-f]{7,40}\b)')
self.args = None
def parse_options(self):
usage = 'bfg-ish [options] [<repo>]'
parser = argparse.ArgumentParser(description="bfg-ish 1.13.0", usage=usage)
parser.add_argument('--strip-blobs-bigger-than', '-b', metavar='<size>',
help=("strip blobs bigger than X (e.g. '128K', '1M', etc)"))
#parser.add_argument('--strip-biggest-blobs', '-B', metavar='NUM',
# help=("strip the top NUM biggest blobs"))
parser.add_argument('--strip-blobs-with-ids', '-bi',
metavar='<blob-ids-file>',
help=("strip blobs with the specified Git object ids"))
parser.add_argument('--delete-files', '-D', metavar='<glob>',
type=os.fsencode,
help=("delete files with the specified names (e.g. '*.class', '*.{txt,log}' - matches on file name, not path within repo)"))
parser.add_argument('--delete-folders', metavar='<glob>',
type=os.fsencode,
help=("delete folders with the specified names (e.g. '.svn', '*-tmp' - matches on folder name, not path within repo)"))
parser.add_argument('--replace-text', '-rt', metavar='<expressions-file>',
help=("filter content of files, replacing matched text. Match expressions should be listed in the file, one expression per line - by default, each expression is treated as a literal, but 'regex:' & 'glob:' prefixes are supported, with '==>' to specify a replacement string other than the default of '***REMOVED***'."))
parser.add_argument('--filter-content-including', '-fi', metavar='<glob>',
type=os.fsencode,
help=("do file-content filtering on files that match the specified expression (eg '*.{txt,properties}')"))
parser.add_argument('--filter-content-excluding', '-fe', metavar='<glob>',
type=os.fsencode,
help=("don't do file-content filtering on files that match the specified expression (eg '*.{xml,pdf}')"))
parser.add_argument('--filter-content-size-threshold', '-fs',
metavar='<size>', default=1048576, type=int,
help=("only do file-content filtering on files smaller than <size> (default is 1048576 bytes)"))
parser.add_argument('--preserve-ref-tips', '--protect-blobs-from', '-p',
metavar='<refs>', nargs='+',
help=("Do not filter the trees for final commit of the specified refs, only in the history before those commits (by default, filtering options affect all commits, even those at ref tips). This is not recommended."))
parser.add_argument('--no-blob-protection', action='store_true',
help=("allow the BFG to modify even your *latest* commit. Not only is this highly recommended, it is the default. As such, this option does not actually do anything and is provided solely for compatibility with BFG. To undo this option, use --preserve-ref-tips and specify HEAD or the current branch name"))
parser.add_argument('--use-formerly-log-text', action='store_true',
help=("when updating commit hashes in commit messages also add a [formerly OLDHASH] text, possibly violating commit message line length guidelines and providing an inferior way to lookup old hashes (replace references are much preferred as git itself will understand them)"))
parser.add_argument('--use-formerly-commit-footer', action='store_true',
help=("append a `Former-commit-id:` footer to commit messages. This is an inferior way to lookup old hashes (replace references are much preferred as git itself will understand them)"))
parser.add_argument('--use-replace-blobs', action='store_true',
help=("replace any removed file by a `<filename>.REMOVED.git-id` file. Makes history ugly as it litters it with replacement files for each one you want removed, but has a small chance of being useful if you find you pruned something incorrectly."))
parser.add_argument('--private', action='store_true',
help=("this option does nothing and is provided solely for compatibility with bfg; to undo it, use the --use-* options"))
parser.add_argument('--massive-non-file-objects-sized-up-to',
metavar='<size>',
help=("this option does nothing and is provided solely for compatibility with bfg"))
parser.add_argument('repo', type=os.fsencode,
help=("file path for Git repository to clean"))
args = parser.parse_args()
# Sanity check on args.repo
if not os.path.isdir(args.repo):
raise SystemExit("Repo not found: {}".format(os.fsdecode(args.repo)))
dirname, basename = os.path.split(args.repo)
if not basename:
dirname, basename = os.path.split(dirname)
if not dirname:
dirname = b'.'
if basename == b".git":
raise SystemExit("For non-bare repos, please specify the toplevel directory ({}) for repo"
.format(os.fsdecode(dirname)))
return args
def convert_replace_text(self, filename):
tmpfile, newname = tempfile.mkstemp()
os.close(tmpfile)
with open(newname, 'bw') as outfile:
with open(filename, 'br') as infile:
for line in infile:
if line.startswith(b'regex:'):
beg, end = line.split(b'==>')
end = re.sub(br'\$([0-9])', br'\\\1', end)
outfile.write(b'%s==>%s\n' % (beg, end))
elif line.startswith(b'glob:'):
outfile.write(b'glob:' + java_to_fnmatch_glob(line[5:]))
else:
outfile.write(line)
return newname
def path_wanted(self, filename):
if not self.args.delete_files and not self.args.delete_folders:
return filename
paths = filename.split(b'/')
dirs = paths[0:-1]
basename = paths[-1]
if self.args.delete_files and any(fnmatch.fnmatch(basename, x)
for x in self.args.delete_files):
return False
if self.args.delete_folders and any(any(fnmatch.fnmatch(dirname, x)
for dirname in dirs)
for x in self.args.delete_folders):
return False
return True
def should_filter_path(self, filename):
def matches(basename, glob_list):
return any(fnmatch.fnmatch(basename, x) for x in glob_list)
basename = os.path.basename(filename)
if self.args.filter_content_including and \
not matches(basename, self.args.filter_content_including):
return False
if self.args.filter_content_excluding and \
matches(basename, self.args.filter_content_excluding):
return False
return True
def filter_relevant_blobs(self, commit):
for change in commit.file_changes:
if change.type == b'D':
continue # deleted files have no remaining content to filter
if change.mode in (b'120000', b'160000'):
continue # symlinks and submodules aren't text files we can filter
if change.blob_id in self.filtered_blobs:
change.blob_id = self.filtered_blobs[change.blob_id]
continue
if self.args.filter_content_size_threshold:
size = self.blob_sizes[change.blob_id]
if size >= self.args.filter_content_size_threshold:
continue
if not self.should_filter_path(change.filename):
continue
self.cat_file_proc.stdin.write(change.blob_id + b'\n')
self.cat_file_proc.stdin.flush()
objhash, objtype, objsize = self.cat_file_proc.stdout.readline().split()
# FIXME: This next line assumes the file fits in memory; though the way
# fr.Blob works we kind of have that assumption baked in elsewhere too...
contents = self.cat_file_proc.stdout.read(int(objsize))
if not any(x == b"0" for x in contents[0:8192]): # not binaries
for literal, replacement in self.replacement_rules['literals']:
contents = contents.replace(literal, replacement)
for regex, replacement in self.replacement_rules['regexes']:
contents = regex.sub(replacement, contents)
self.cat_file_proc.stdout.read(1) # Read trailing newline
blob = fr.Blob(contents)
self.filter.insert(blob)
self.filtered_blobs[change.blob_id] = blob.id
change.blob_id = blob.id
def munge_message(self, message, metadata):
def replace_hash(matchobj):
oldhash = matchobj.group(1)
newhash = metadata['commit_rename_func'](oldhash)
if newhash != oldhash and self.args.use_formerly_log_text:
newhash = b'%s [formerly %s]' % (newhash, oldhash)
return newhash
return self._hash_re.sub(replace_hash, message)
def commit_update(self, commit, metadata):
# Strip out unwanted files
new_file_changes = []
for change in commit.file_changes:
if not self.path_wanted(change.filename):
if not self.args.use_replace_blobs:
continue
blob = fr.Blob(change.blob_id)
self.filter.insert(blob)
change.blob_id = blob.id
change.filename += b'.REMOVED.git-id'
new_file_changes.append(change)
commit.file_changes = new_file_changes
# Filter text of relevant files
if self.replacement_rules:
self.filter_relevant_blobs(commit)
# Replace commit hashes in commit message with 'newhash [formerly oldhash]'
if self.args.use_formerly_log_text:
commit.message = self.munge_message(commit.message, metadata)
# Add a 'Former-commit-id:' footer
if self.args.use_formerly_commit_footer:
if not commit.message.endswith(b'\n'):
commit.message += b'\n'
lastline = commit.message.splitlines()[-1]
if not re.match(b'\n[A-Za-z0-9-_]*: ', lastline):
commit.message += b'\n'
commit.message += b'Former-commit-id: %s' % commit.original_id
def get_preservation_info(self, ref_tips):
if not ref_tips:
return []
cmd = 'git rev-parse --symbolic-full-name'.split()
p = subproc.Popen(cmd + ref_tips,
stdout = subprocess.PIPE,
stderr = subprocess.STDOUT)
ret = p.wait()
output = p.stdout.read()
if ret != 0:
raise SystemExit("Failed to translate --preserve-ref-tips arguments into refs\n"+fr.decode(output))
refs = output.splitlines()
ref_trees = [b'%s^{tree}' % ref for ref in refs]
output = subproc.check_output(['git', 'rev-parse'] + ref_trees)
trees = output.splitlines()
return dict(zip(refs, trees))
def revert_tree_changes(self, preserve_refs):
# FIXME: Since this function essentially creates a new commit (with the
# original tree) to replace the commit at the ref tip (which has a
# filtered tree), I should update the created refs/replace/ object to
# point to the newest commit. Also, the double reset (see comment near
# where revert_tree_changes is called) seems kinda lame. It'd be easy
# enough to fix these issues, but I'm very unmotivated since
# --preserve-ref-tips/--protect-blobs-from is a design mistake.
updates = {}
for ref, tree in preserve_refs.items():
output = subproc.check_output('git cat-file -p'.split()+[ref])
lines = output.splitlines()
if not lines[0].startswith(b'tree '):
raise SystemExit("Error: --preserve-ref-tips only works with commit refs")
num = 1
parents = []
while lines[num].startswith(b'parent '):
parents.append(lines[num][7:])
num += 1
assert lines[num].startswith(b'author ')
author_info = [x.strip()
for x in re.split(b'[<>]', lines[num][7:])]
aenv = 'GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_AUTHOR_DATE'.split()
assert lines[num+1].startswith(b'committer ')
committer_info = [x.strip()
for x in re.split(b'[<>]', lines[num+1][10:])]
cenv = 'GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL GIT_COMMITTER_DATE'.split()
new_env = {**os.environ.copy(),
**dict(zip(aenv, author_info)),
**dict(zip(cenv, committer_info))}
assert lines[num+2] == b''
commit_msg = b'\n'.join(lines[num+3:])+b'\n'
p_s = [val for pair in zip(['-p',]*len(parents), parents) for val in pair]
p = subproc.Popen('git commit-tree'.split() + p_s + [tree],
stdin = subprocess.PIPE, stdout = subprocess.PIPE,
env = new_env)
p.stdin.write(commit_msg)
p.stdin.close()
if p.wait() != 0:
raise SystemExit("Error: failed to write preserve commit for {} [{}]"
.format(ref, tree))
updates[ref] = p.stdout.read().strip()
p = subproc.Popen('git update-ref --stdin'.split(), stdin = subprocess.PIPE)
for ref, newvalue in updates.items():
p.stdin.write(b'update %s %s\n' % (ref, newvalue))
p.stdin.close()
if p.wait() != 0:
raise SystemExit("Error: failed to write preserve commits")
def run(self):
bfg_args = self.parse_options()
preserve_refs = self.get_preservation_info(bfg_args.preserve_ref_tips)
work_dir = os.getcwd()
os.chdir(bfg_args.repo)
bfg_args.delete_files = java_to_fnmatch_glob(bfg_args.delete_files)
bfg_args.delete_folders = java_to_fnmatch_glob(bfg_args.delete_folders)
bfg_args.filter_content_including = \
java_to_fnmatch_glob(bfg_args.filter_content_including)
bfg_args.filter_content_excluding = \
java_to_fnmatch_glob(bfg_args.filter_content_excluding)
if bfg_args.replace_text and bfg_args.filter_content_size_threshold:
# FIXME (perf): It would be much more performant and probably make more
# sense to have a `git cat-file --batch-check` process running and query
# it for blob sizes, since we may only need a small subset of blob sizes
# rather than the sizes of all objects in the git database.
self.blob_sizes, packed_sizes = fr.GitUtils.get_blob_sizes()
extra_args = []
if bfg_args.strip_blobs_bigger_than:
extra_args = ['--strip-blobs-bigger-than',
bfg_args.strip_blobs_bigger_than]
if bfg_args.strip_blobs_with_ids:
extra_args = ['--strip-blobs-with-ids',
bfg_args.strip_blobs_with_ids]
if bfg_args.use_formerly_log_text:
extra_args += ['--preserve-commit-hashes']
new_replace_file = None
if bfg_args.replace_text:
if not os.path.isabs(bfg_args.replace_text):
bfg_args.replace_text = os.path.join(work_dir, bfg_args.replace_text)
new_replace_file = self.convert_replace_text(bfg_args.replace_text)
rules = fr.FilteringOptions.get_replace_text(new_replace_file)
self.replacement_rules = rules
self.cat_file_proc = subproc.Popen(['git', 'cat-file', '--batch'],
stdin = subprocess.PIPE,
stdout = subprocess.PIPE)
self.args = bfg_args
# Setting partial prevents:
# * remapping origin remote tracking branches to regular branches
# * deletion of the origin remote
# * nuking unused refs
# * nuking reflogs
# * repacking
# While these are arguably desirable things, BFG documentation assumes
# the first two aren't done, so for compatibility turn them all off.
# The third is irrelevant since BFG has no mechanism for renaming refs,
# and we'll manually add the fourth and fifth back in below by calling
# RepoFilter.cleanup().
fr_args = fr.FilteringOptions.parse_args(['--partial', '--force'] +
extra_args)
self.filter = fr.RepoFilter(fr_args, commit_callback=self.commit_update)
self.filter.run()
if new_replace_file:
os.remove(new_replace_file)
self.cat_file_proc.stdin.close()
self.cat_file_proc.wait()
need_another_reset = False
if preserve_refs:
self.revert_tree_changes(preserve_refs)
# If the repository is not bare, self.filter.run() already did a reset
# for us. However, if we are preserving refs (and the repository isn't
# bare), we need another since we possibly updated HEAD after that
# reset (FIXME: two resets is kinda ugly; would be nice to just do
# one).
if not fr.GitUtils.is_repository_bare('.'):
need_another_reset = True
if not os.path.isabs(os.fsdecode(bfg_args.repo)):
bfg_args.repo = os.fsencode(os.path.join(work_dir, os.fsdecode(bfg_args.repo)))
self.filter.cleanup(bfg_args.repo, repack=True, reset=need_another_reset)
if __name__ == '__main__':
bfg = BFG_ish()
bfg.run()
# Show the same message BFG does, even if we don't copy the rest of its
# progress output. Make this program feel slightly more authentically BFG.
# :-)
print('''
--
You can rewrite history in Git - don't let Trump do it for real!
Trump's administration has lied consistently, to make people give up on ever
being told the truth. Don't give up: https://www.rescue.org/topic/refugees-america
--
''')