Data files are a special case for source code repositories: generally speaking, they should not be stored in the source code repo itself. Instead, they should be kept in a separate storage area and retrieved only when needed (e.g. updating test data, performing local installations, etc.). Think of "source code" as files that are made up of text, are required for compilation, and can be easily inspected by the human eye (and diffed by the diff program). Think of "data files" as files that are stored as binary and are not required for compilation.

This page describes how to work with git-annex, the software used for storing and retrieving data files in the freesurfer git repo.

Initial Setup

Based on the information on the Freesurfer_github page, the remotes of your freesurfer repo working directory should look something like:

git remote -v 
 
  datasrc       file:///space/freesurfer/repo/annex.git (fetch)
  datasrc       file:///space/freesurfer/repo/annex.git (push)
  origin        git@github.com:zkaufman/freesurfer.git (fetch)
  origin        git@github.com:zkaufman/freesurfer.git (push)
  upstream      git@github.com:freesurfer/freesurfer.git (fetch)
  upstream      git@github.com:freesurfer/freesurfer.git (push)

Users outside the Martinos Center, who do not have access to the local filesystem, should instead point the datasrc remote at the public-facing server:

git remote -v

  datasrc       https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git (fetch)
  datasrc       https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git (push)
  origin        git@github.com:zkaufman/freesurfer.git (fetch)
  origin        git@github.com:zkaufman/freesurfer.git (push)
  upstream      git@github.com:freesurfer/freesurfer.git (fetch)
  upstream      git@github.com:freesurfer/freesurfer.git (push)
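
If the datasrc remote is not configured yet, the choice between the two URLs above could be scripted roughly as follows. This is only a sketch: the helper name datasrc_url is hypothetical, and it assumes the local annex path is visible only on machines inside the Martinos Center.

```shell
# Hypothetical helper: print the datasrc URL appropriate for this machine.
# Assumption: the local annex directory exists only inside the Martinos Center.
datasrc_url() {
    local_repo="${1:-/space/freesurfer/repo/annex.git}"
    if [ -d "$local_repo" ]; then
        # Local filesystem access is available.
        echo "file://$local_repo"
    else
        # Fall back to the public-facing server.
        echo "https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git"
    fi
}

# Usage (run inside your freesurfer working directory):
#   git remote add datasrc "$(datasrc_url)"
#   git remote -v
```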

Adding a data file

The following example assumes we want to add a data file called 'testdata.tar.gz' to the 'distribution' directory; substitute its path (e.g. distribution/testdata.tar.gz) for <filename>:

git annex add <filename>
git commit -a -m "Added new file"
git annex copy --to datasrc
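
The three steps above can be wrapped in a small function so that they are always run together. This is a minimal sketch, not an official script: the name annex_add_file is hypothetical, and unlike the commands above it commits only what is already staged (no -a) and copies only the named file to datasrc.

```shell
# Hypothetical wrapper around: annex add, commit, copy to datasrc.
# Usage: annex_add_file <path> [commit message]
annex_add_file() {
    file="$1"
    msg="${2:-Added new file}"
    if [ ! -e "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    # Stop at the first step that fails.
    git annex add "$file" &&
    git commit -m "$msg" &&
    git annex copy --to datasrc "$file"
}

# Usage:
#   annex_add_file distribution/testdata.tar.gz "Added new test data"
```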

Getting a data file

To retrieve the contents of a data file:

git fetch datasrc   # 'git annex sync' might also work here
git annex get mri_em_register/testdata.tar.gz

Get only the data files required for build time checks (1.9 GB)

git annex get --metadata fstags=makecheck .

Get only the data files required for local installation (4.3 GB)

git annex get --metadata fstags=makeinstall .

Retrieve everything under the current directory (not recommended)

git annex get .
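
The tag-based retrievals above differ only in the tag value, so they can be folded into one helper. A sketch under stated assumptions: annex_get_tagged is a hypothetical name, and the only accepted tags are the two used on this page (makecheck and makeinstall).

```shell
# Hypothetical helper: fetch only the data files carrying a given fstags value.
# Usage: annex_get_tagged <makecheck|makeinstall> [path]
annex_get_tagged() {
    tag="$1"
    path="${2:-.}"
    case "$tag" in
        makecheck|makeinstall) ;;
        *)
            echo "unknown tag: $tag (expected makecheck or makeinstall)" >&2
            return 1
            ;;
    esac
    # Refresh annex information from datasrc, then get the tagged subset.
    git fetch datasrc &&
    git annex get --metadata "fstags=$tag" "$path"
}

# Usage:
#   annex_get_tagged makecheck
#   annex_get_tagged makeinstall distribution
```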

Modifying a data file

To modify the contents of a data file, first unlock it (which replaces the symlink with a regular, editable copy of the file), then modify it, then re-add it to the annex:

git annex unlock mri_em_register/testdata.tar.gz
<modify contents of tar file>
git annex add mri_em_register/testdata.tar.gz
git commit -am "New test data"
git push
git annex copy --to datasrc
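
Because a locked data file is a symlink into the annex, a quick check before editing can confirm that the unlock step took effect. A minimal sketch: the helper name annex_is_locked is hypothetical, and it assumes the repository keeps locked files as symlinks, as described above.

```shell
# Hypothetical check: succeeds if the given path is still a symlink,
# i.e. the annexed file has not been unlocked yet.
annex_is_locked() {
    [ -L "$1" ]
}

# Usage:
#   git annex unlock mri_em_register/testdata.tar.gz
#   if annex_is_locked mri_em_register/testdata.tar.gz; then
#       echo "still locked; do not edit" >&2
#   fi
```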

Tagging

git-annex provides the ability to tag data files. Freesurfer uses tags so that subsets of the data can be retrieved without having to download everything. The data files have been broken down into the following 3 categories:

  1. Those required for build time checks (tagged makecheck)

  2. Those required for a local installation (tagged makeinstall)

  3. Everything else (untagged)

It is essential that data files get the proper tag(s) so that our servers and disk space are not overwhelmed when only a known subset of the data is required.

Display metadata

To show all the metadata associated with a file:

git annex metadata mri_em_register/testdata.tar.gz

Assign metadata

To assign a tag to an existing data file:

git annex metadata mri_em_register/testdata.tar.gz -s fstags=makecheck
git annex sync

We can also append tags:

git annex metadata mri_em_register/testdata.tar.gz -s fstags+=makeinstall
git annex sync
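
When several files need the same tag, the append form can be applied to all of them in one call, followed by a single sync. A sketch only: annex_tag_files is a hypothetical name, and it relies on git annex metadata accepting more than one path.

```shell
# Hypothetical helper: append one fstags value to one or more data files,
# then sync once at the end.
# Usage: annex_tag_files <tag> <file> [file ...]
annex_tag_files() {
    if [ "$#" -lt 2 ]; then
        echo "usage: annex_tag_files <tag> <file> [file ...]" >&2
        return 1
    fi
    tag="$1"
    shift
    # '+=' adds the tag without removing tags already set on the files.
    git annex metadata -s "fstags+=$tag" "$@" &&
    git annex sync
}

# Usage:
#   annex_tag_files makecheck mri_em_register/testdata.tar.gz
```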

List all files with a given tag

git annex find --metadata fstags=makecheck

Administrative Stuff

Mirroring

The git annex repo exists on the local file system in the following directory:

/space/freesurfer/repo/annex.git

The public-facing git annex repo exists on the local file system in the following directory (mounted by our server):

/cluster/pubftp/dist/freesurfer/repo/annex.git

Currently we "mirror" the two repos daily using the following commands:

ssh pinto   # the following commands must be run on machine pinto
rsync -av /space/freesurfer/repo/annex.git/* /cluster/pubftp/dist/freesurfer/repo/annex.git
git update-server-info
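
For a daily job, the two mirroring commands could be wrapped as below. This is a sketch under assumptions, not the script actually in use: mirror_annex is a hypothetical name, and it assumes git update-server-info is meant to run inside the public copy of the repo.

```shell
# Hypothetical wrapper for the daily mirror (run on machine pinto).
# Usage: mirror_annex [source repo] [public repo]
mirror_annex() {
    src="${1:-/space/freesurfer/repo/annex.git}"
    dst="${2:-/cluster/pubftp/dist/freesurfer/repo/annex.git}"
    # Copy the repo contents, then refresh the info files in the public copy.
    rsync -av "$src"/* "$dst" &&
    ( cd "$dst" && git update-server-info )
}

# Usage:
#   mirror_annex
```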

The proper way to mirror would be as follows:

Synced branches

There are a lot of these 'synced' branches in the repo. (See https://github.com/freesurfer/freesurfer) I've observed that they tend to get created whenever the git annex sync command is used to update the metadata of a data file. What's the deal with them, and can I safely delete them? Should I delete them? Is there a way to update only the metadata of a data file without issuing git annex sync, which seems to wreak havoc on the repo?

git-annex branch

Suppose a user has a clone in an unknown state and wants to update. Typically this is done via git pull upstream dev. But when using git-annex it seems that the user also needs to issue a git fetch datasrc command. Is this the proper workflow from a user perspective? It seems a bit of a burden on the user if so. Is there a cleaner way?