#acl LcnGroup:read,write,delete,revert All:read

<<TableOfContents>>

Data files are a special case for source code repositories: generally speaking, they should not be stored in a source code repo. Instead, they should be kept in a separate storage area and retrieved only when needed (e.g. updating test data, performing local installations, etc). Think of "source code" as files which are made up of text, are required for compilation, and can be easily inspected by the human eye (and diff'ed by the '''diff''' program). Think of "data files" as files which are stored as binary and are not required for compilation.

This page describes how to work with git-annex, the software used for storing and retrieving data files in the freesurfer git repo. Additional documentation can be found on the git-annex website (https://git-annex.branchable.com/walkthrough/).

== Initial Setup ==

Based on the information in the [[Freesurfer_github|Freesurfer_github]] page, the remotes of your freesurfer repo working directory should look something like:

{{{
git remote -v
 
  datasrc file:///space/freesurfer/repo/annex.git (fetch)
  datasrc file:///space/freesurfer/repo/annex.git (push)
  origin git@github.com:zkaufman/freesurfer.git (fetch)
  origin git@github.com:zkaufman/freesurfer.git (push)
  upstream git@github.com:freesurfer/freesurfer.git (fetch)
  upstream git@github.com:freesurfer/freesurfer.git (push)
}}}

Users outside the Martinos Center, who do not have access to the local filesystem, should instead point the '''datasrc''' remote at the public-facing server:

{{{
git remote -v

  datasrc https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git (fetch)
  datasrc https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git (push)
  origin git@github.com:zkaufman/freesurfer.git (fetch)
  origin git@github.com:zkaufman/freesurfer.git (push)
  upstream git@github.com:freesurfer/freesurfer.git (fetch)
  upstream git@github.com:freesurfer/freesurfer.git (push)
}}}
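
As a minimal sketch of how the '''datasrc''' remote could be configured, the helper below prints the local file URL when the annex directory is visible and the public URL otherwise. The function name {{{datasrc_url}}} and its optional argument are hypothetical and are not part of Freesurfer's tooling; the two URLs are the ones listed above.

```shell
# Hypothetical helper (not part of Freesurfer): pick the datasrc URL
# depending on whether the Martinos Center filesystem is mounted.
datasrc_url() {
    # $1: path of the local annex repo (defaults to the path used on this page)
    local_repo="${1:-/space/freesurfer/repo/annex.git}"
    if [ -d "$local_repo" ]; then
        printf 'file://%s\n' "$local_repo"
    else
        printf '%s\n' 'https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git'
    fi
}

# Usage, from inside a freesurfer clone:
#   git remote add datasrc "$(datasrc_url)"
# or, if the remote already exists:
#   git remote set-url datasrc "$(datasrc_url)"
```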

== Adding a data file ==

Adding a data file to the repo is something generally only the Freesurfer source code administrator should do. For one, only users at the Martinos Center have write access to the filesystem. The following example assumes we want to add a sample script and a data file called 'testdata.tar.gz' to the 'distribution' directory:

{{{
git checkout -b new-feature
git add <scriptname>
git annex add testdata.tar.gz
git commit -m "Added a new test script and data file"
git push
git annex copy --to datasrc
}}}
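
As a quick sanity check after the steps above, the sketch below tests whether a path is a locked annexed file, meaning a symlink that points into {{{.git/annex/objects}}}. The helper name {{{is_annexed}}} is hypothetical, and the check assumes the symlink-based (locked) mode that the rest of this page describes.

```shell
# Hypothetical helper: succeed if the path is a symlink whose target
# points into a .git/annex/objects directory (a locked annexed file).
is_annexed() {
    [ -L "$1" ] || return 1
    case "$(readlink "$1")" in
        *.git/annex/objects/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage after 'git annex add testdata.tar.gz':
#   is_annexed testdata.tar.gz && echo "testdata.tar.gz is annexed"
```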

== Getting a data file ==

To retrieve the contents of a data file:
{{{
git fetch datasrc   # alternatively, 'git annex sync' might work here
git annex get mri_em_register/testdata.tar.gz
}}}

Get only the data files required for build time checks (1.9 GB):
{{{
git annex get --metadata fstags=makecheck .
}}}

Get only the data files required for local installation (4.3 GB):
{{{
git annex get --metadata fstags=makeinstall .
}}}

Retrieve everything under the current directory (not recommended):
{{{
git annex get .
}}}
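
The tagged retrievals above could be wrapped in a small function that rejects unknown tags before calling git-annex. This is only a sketch: the function name {{{annex_get_subset}}} and the {{{ANNEX_DRY_RUN}}} variable are hypothetical, while the {{{git annex get --metadata}}} invocation is the one shown above.

```shell
# Hypothetical wrapper: fetch one of the tagged subsets described on this page.
# Set ANNEX_DRY_RUN=1 to print the command instead of running it.
annex_get_subset() {
    case "$1" in
        makecheck|makeinstall) ;;
        *) echo "unknown tag: $1 (expected makecheck or makeinstall)" >&2
           return 2 ;;
    esac
    if [ "${ANNEX_DRY_RUN:-0}" = "1" ]; then
        echo "git annex get --metadata fstags=$1 ."
    else
        git annex get --metadata "fstags=$1" .
    fi
}

# Usage:
#   annex_get_subset makecheck
```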

== Modifying a data file ==

To modify the contents of a data file, first unlock it (which eliminates the symlink), then modify it, then re-add it to the annex:
{{{
git annex unlock mri_em_register/testdata.tar.gz
<modify contents of tar file>
git annex add mri_em_register/testdata.tar.gz
git commit -am "New test data"
git push
git annex copy --to datasrc
}}}

== Tagging ==

git-annex provides the ability to tag data files. Freesurfer utilizes tags so that subsets of the data can be retrieved without having to download everything. The data files have been broken down into the following 3 categories:

 1. Those being required for build time checks (tagged '''makecheck''')
 1. Those required for a local installation (tagged '''makeinstall''')
 1. Everything else (untagged)

It is essential that data files get the proper tag(s) so that our servers and disk space are not overwhelmed when only a known subset of the data is required.

=== Display metadata ===

To show all the metadata associated with a file:

{{{
git annex metadata mri_em_register/testdata.tar.gz
}}}

=== Assign metadata ===

Assigning metadata to a data file is the job of a source code administrator, similar to adding a data file. When adding metadata to a git-annex file, it is best to start with a clean checkout of the repository and be on the 'dev' branch. Then add the tag as follows:

{{{
git annex metadata mri_em_register/testdata.tar.gz -s fstags=makecheck
git annex sync
}}}

We can also append tags:
{{{
git annex metadata mri_em_register/testdata.tar.gz -s fstags+=makeinstall
git annex sync
}}}


There is no need to perform any commits, pushes, or pull requests at this point. (The administrator will see updated '''git-annex''' and '''synced/git-annex''' branches on the github profile, but this can be ignored.) As long as the administrator has write access to the Freesurfer github page, everything should be all set.
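
When several files need the same tag, the append form shown above can be applied in a loop. The sketch below is one way to do that; the function name {{{annex_tag_files}}} and the {{{ANNEX}}} override variable are hypothetical and exist only so the commands can be previewed.

```shell
# Hypothetical helper: append one tag to each of several annexed files.
# ANNEX defaults to "git annex"; set ANNEX=echo to preview the arguments.
annex_tag_files() {
    tag="$1"
    shift
    for f in "$@"; do
        # "fstags+=" appends to the existing values instead of replacing them
        ${ANNEX:-git annex} metadata "$f" -s "fstags+=$tag" || return 1
    done
}

# Usage:
#   annex_tag_files makecheck mri_em_register/testdata.tar.gz
#   git annex sync
```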


=== List all files with a given tag ===
{{{
git annex find --metadata fstags=makecheck
}}}
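
To estimate how much data a tag covers, the listing above can be combined with a small summing filter. This is a sketch under the assumption that your git-annex version supports the {{{--format}}} option of {{{git annex find}}} with the {{{${bytesize}}}} variable; the function name {{{sum_gb}}} is hypothetical.

```shell
# Sum a column of byte counts read from stdin and print the total in GB
# (1 GB is taken as 10^9 bytes here).
sum_gb() {
    awk '{ total += $1 } END { printf "%.1f GB\n", total / 1e9 }'
}

# Usage (assumes 'git annex find --format' with the ${bytesize} variable):
#   git annex find --metadata fstags=makecheck --format='${bytesize}\n' | sum_gb
```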

== Administrative Stuff ==

=== Mirroring ===

The git annex repo exists on the local file system in the following directory:

{{{
/space/freesurfer/repo/annex.git
}}}

The public-facing git annex repo exists on the local file system in the following directory (mounted by our server):

{{{
/cluster/pubftp/dist/freesurfer/repo/annex.git
}}}

Currently we "mirror" the two repos daily using the following commands:

{{{
ssh pinto   # must be run on machine pinto
rsync -av /space/freesurfer/repo/annex.git/* /cluster/pubftp/dist/freesurfer/repo/annex.git
git update-server-info
}}}

Mirroring could also be achieved with pure git/git-annex commands (git push, git annex copy, or just git annex sync --content), but that would actually be more demanding and take longer for such a straightforward full mirror. The only additional option that could benefit the rsync call is --delete-after, which cleans up files on the receiving end (including within .git/objects) that have been removed from the source.
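
The mirror step with {{{--delete-after}}} could be sketched as a function like the one below. The function name {{{mirror_annex}}} and the {{{RSYNC}}} override variable are hypothetical. Note one deliberate difference from the command above: the source is given with a trailing slash rather than {{{/*}}}, so that rsync can see deletions at the top level of the repository too.

```shell
# Sketch of the daily mirror as a function. RSYNC defaults to "rsync";
# set RSYNC=echo to preview the arguments without copying anything.
mirror_annex() {
    src="${1:-/space/freesurfer/repo/annex.git}"
    dst="${2:-/cluster/pubftp/dist/freesurfer/repo/annex.git}"
    # The trailing slash on src copies its contents into dst.
    ${RSYNC:-rsync} -av --delete-after "$src/" "$dst" || return 1
}

# Usage (on machine pinto), followed by refreshing the info files that
# git needs when the repo is served over plain HTTP:
#   mirror_annex
#   git -C /cluster/pubftp/dist/freesurfer/repo/annex.git update-server-info
```

Running it once with {{{RSYNC=echo}}}, or adding rsync's {{{--dry-run}}} option, is a cheap way to confirm what would be deleted before the first real run.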


=== Synced branches ===

Q: There are a lot of these 'synced' branches in the repo (see https://github.com/freesurfer/freesurfer). I've observed that they tend to get created whenever the {{{git annex sync}}} command is used to update the metadata of a data file. What's the deal with them? Can I safely delete them, and should I?

A: 'synced' branches are the result of the "pushes" annex does upon a {{{git annex sync}}} command, since it shouldn't and/or might not be able to update already existing branches with the same names on the remote end. In your case, in principle, you can safely remove them if necessary, both locally and on remotes (e.g. via {{{git push github-remote :synced/dev}}} and so on for other branches).

Q: Is there a way to update only the metadata of a data file without issuing {{{git annex sync}}}, which seems to wreak havoc on the repo?

A: Issuing {{{sync}}} is not the only way to "update metadata" on the remote. Some of the mystery behind git-annex can be dispelled by looking at the core of how {{{git annex}}} operates. Virtually '''all''' of the information annex cares about (where a file is available from, what metadata is associated with a file, what the descriptions of the remotes are, etc) is contained within the "git-annex" branch. That branch is just a regular git branch, so you (or git annex) can fetch and push it like any other branch. That is what {{{git annex sync}}} does for you: it fetches, merges, and pushes. As mentioned, you can fetch and push it manually as well (e.g. via {{{git push github-remote git-annex}}}). The one thing you should not do is try to merge it yourself. Let {{{git annex}}} do that when it deems it necessary, because it performs a special merge of the files contained within the git-annex branch, based on the timestamps it records for each entry within that branch (you can check out that branch and see what it contains). If needed, you can call {{{git annex merge}}} to trigger the merge of the git-annex branch.

Overall, you can consider the git-annex branch to be a "log" of the actions annex has performed on data files. When you issue {{{git annex}}} commands locally ("get", "drop", "metadata --set", etc), you modify your local git-annex branch to reflect the action you have taken (you can use {{{git log -p git-annex}}} after running a git-annex command to see what annex recorded in that branch). After modifying it locally, if you would like to share the changed state of what annex keeps within the git-annex branch, you need to push it, first making sure that you also have all the changes to that branch from the remote. So you do not really need {{{git annex sync}}} for the majority of use cases: you can just run {{{git annex merge}}} (to merge in information from the remote) and {{{git push REMOTE git-annex}}} (to share your new information about data files with the remote). Conversely, you do not need to push the git-annex branch to the remote if your local changes did not produce any new information that is useful to others (e.g. new or modified metadata).
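
The manual alternative to {{{git annex sync}}} described in this answer can be sketched as a short function. The function name {{{share_annex_branch}}} and the {{{RUN}}} variable are hypothetical. The explicit {{{git fetch}}} is included on the assumption that the remote's git-annex branch has not been fetched yet.

```shell
# Hypothetical helper: share git-annex branch changes without 'git annex sync'.
# Set RUN=echo to print the commands instead of executing them.
share_annex_branch() {
    remote="$1"
    run="${RUN:-}"
    $run git fetch "$remote" &&            # bring in the remote's git-annex branch
    $run git annex merge &&                # let git-annex do its special merge
    $run git push "$remote" git-annex      # publish the merged branch
}

# Usage:
#   share_annex_branch REMOTE
```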

=== git-annex branch ===

Q: Suppose a user has a clone in an unknown state and wants to update. Typically this is done via a {{{git pull upstream dev}}}. But when using git-annex it seems that the user also needs to issue a {{{git annex fetch datasrc}}} command as well. Is this the proper workflow from a user perspective? It seems a bit of a burden on the user if so. Is there a cleaner way?

A: It is a proper workflow. In general, if data files do not change often, "git annex get" (NB: note that I have replaced "fetch" with "get") does not need to be issued every time. So overall it is not that much of a burden, in my opinion. A git hook (post-merge and/or post-checkout) could be set up to always run such an {{{annex get}}} command, but in my opinion that would be overkill.
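
For anyone who does want the hook mentioned in this answer, the sketch below writes a post-merge hook that fetches the build-time test data. The function name {{{install_annex_hook}}} is hypothetical, the choice of the '''makecheck''' tag is an assumption, and the function overwrites any existing post-merge hook in the target directory.

```shell
# Hypothetical helper: install a post-merge hook that fetches the
# build-time test data after every merge (for example after 'git pull').
install_annex_hook() {
    hooks_dir="${1:-.git/hooks}"
    hook="$hooks_dir/post-merge"
    # Overwrites any existing post-merge hook in hooks_dir.
    cat > "$hook" <<'EOF'
#!/bin/sh
# Assumption: only the makecheck subset is wanted after each merge.
git annex get --metadata fstags=makecheck .
EOF
    chmod +x "$hook"
}

# Usage, from the top level of a freesurfer clone:
#   install_annex_hook
```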

=== Multiple Sources ===

Q: Suppose we want to add multiple sources for files held in git-annex. How does one do that, and how does the user setup account for these multiple sources?