2.3. Fixing up too-large datasets
The previous section highlighted the problems of too-large monorepos and advised strategies to prevent them. This section introduces some strategies to clean and fix up datasets that have gotten out of hand size-wise. If there are use cases you would want to see discussed here, or if you want to propose solutions, please get in touch.
2.3.1. Getting contents out of Git
Let’s say you did a datalad run (manual) with an analysis that put too
many files under version control by Git, and you want to see them gone.
Sticking to the FSL FEAT analysis example from earlier, you may, for example,
want to get rid of every tsplot directory, as it contains results that are
irrelevant for you.
Note that there is no way to drop these files, as they are stored in Git rather than
git-annex. Removing
the files with plain file system operations (rm, git rm) does not
shrink your dataset either. The files are snapshotted, and even though they no longer exist in
the current state of your dataset, they still exist in – and thus clutter
– your dataset's history. In order to really get committed files out of Git,
you need to rewrite history. And for this you need heavy machinery:
git-filter-repo[1].
It is a powerful and potentially dangerous tool for rewriting Git history.
Treat this tool like a chainsaw: very helpful for heavy-duty tasks, but also
life-threatening. The command
git-filter-repo <path-specification> --force will "filter out", i.e., remove,
all files but the ones specified in <path-specification> from the dataset's
history. Before you use it, please make sure to read its help page thoroughly.
Installing git-filter-repo
git-filter-repo is not part of Git and needs to be installed separately.
Its GitHub repository contains
detailed installation instructions. It can be installed via pip
(pip install git-filter-repo), and it is available via standard package managers
for macOS and some Linux distributions (mostly rpm-based ones).
The general procedure you should follow is this:

1. datalad clone (manual) the repository. This is a safeguard to protect your dataset should something go wrong. The clone you are creating will be your new, cleaned-up dataset.
2. datalad get (manual) all the dataset contents by running datalad get . in the clone.
3. git-filter-repo what you don't want anymore (see below).
4. Run git annex unused and a subsequent git annex dropunused all to remove stale file contents that are no longer referenced.
5. Finally, do some aggressive garbage collection with git gc --aggressive.
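Why the final cleanup steps matter can be sketched with plain git in a throwaway repository: objects made unreachable by a rewrite are only physically deleted once reflogs expire and git gc prunes them (temporary repo, made-up file names):

```shell
# Sketch: unreachable objects survive until reflog expiry + garbage collection.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=you@example.com -c user.name=you commit -q --allow-empty -m "init"
echo "large payload" > big.dat
git add big.dat
git -c user.email=you@example.com -c user.name=you commit -q -m "add big file"
blob=$(git rev-parse HEAD:big.dat)
# Rewind history so the commit holding big.dat becomes unreachable ...
git reset -q --hard HEAD~1
# ... but its blob is still kept alive by the reflog:
git cat-file -e "$blob" && echo "blob still present"
# Expire reflogs, then garbage-collect; only now is the content truly gone:
git reflog expire --expire=now --all
git gc --aggressive --prune=now --quiet
git cat-file -e "$blob" 2>/dev/null || echo "blob pruned"
```

In a real dataset, git annex unused / dropunused play the analogous role for annexed content, and git gc --aggressive does the final pruning on the Git side.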
To get a handle on the git-filter-repo step, consider a directory
structure similar to this exemplary run-wise FEAT analysis output structure:
$ tree
sub-*/run-*_<task>-<level>.feat
├── custom_timing_files
├── logs
├── reg
├── reg_standard
│ ├── reg
│ └── stats
├── stats
└── tsplot
Each such sub-* directory contains about 3000 files, the majority of
which are irrelevant text files in tsplot/.
In order to remove them for all subjects and runs from the dataset history,
the following command can be used:
$ git-filter-repo --path-regex '^sub-[0-9]{2}/run-[0-9]{1}*.feat/tsplot/.*$' --invert-paths --force
The option --path-regex and the regular expression '^sub-[0-9]{2}/run-[0-9]{1}*.feat/tsplot/.*$'[2]
match all file paths inside of the tsplot/ directories of all subjects and
runs.
The option --invert-paths then inverts this path specification, so that
only the files in tsplot/ are filtered out. Note that
non-regex-based path specifications are also possible, for example with the options
--path-match or --path-glob, or with a specification placed in a file.
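Since a history rewrite is destructive, it can pay off to sanity-check the path regex against a few example paths with grep first. The pattern below is a slightly adapted illustration (the literal dot before feat is escaped here, an adjustment of mine), and the example paths are hypothetical:

```shell
# Dry-run the path pattern against candidate paths before rewriting history.
pattern='^sub-[0-9]{2}/run-[0-9].*\.feat/tsplot/.*$'
printf '%s\n' \
    'sub-01/run-1_task-nback.feat/tsplot/tsplot_zstat1.txt' \
    'sub-01/run-1_task-nback.feat/stats/zstat1.nii.gz' \
    | grep -E "$pattern"
# only the tsplot/ path is printed; the stats/ path does not match
```

A quick check like this confirms that results you want to keep (e.g., anything under stats/) are not caught by the pattern.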
Please see the manual of git-filter-repo for more information.
Footnotes