Differences between revisions 5 and 11 (spanning 6 versions)
Revision 5 as of 2017-04-18 10:41:38
Size: 2866
Editor: AndrewHoopes
Comment:
Revision 11 as of 2017-05-19 12:03:10
Size: 4906
Editor: ZekeKaufman
Comment:
Data files are a special case for source code repositories: generally speaking, they should not be stored in a source code repo. Instead, they should be kept in a separate storage area and retrieved only when needed (e.g. updating test data, performing local installations, etc.). Think of "source code" as files that are made up of text, are required for compilation, and can be easily inspected by eye (and compared with the diff program). Think of "data files" as files that are stored as binary and are not required for compilation.

This page describes how to work with git-annex, the software used for storing and retrieving data files in the freesurfer git repo.

Initial Setup

Based on the information on the Freesurfer_github page, the remotes of your freesurfer working directory should look something like:

git remote -v 
 
  datasrc       file:///space/freesurfer/repo/annex.git (fetch)
  datasrc       file:///space/freesurfer/repo/annex.git (push)
  origin        git@github.com:zkaufman/freesurfer.git (fetch)
  origin        git@github.com:zkaufman/freesurfer.git (push)
  upstream      git@github.com:freesurfer/freesurfer.git (fetch)
  upstream      git@github.com:freesurfer/freesurfer.git (push)

Users outside the Martinos Center, who do not have access to the local filesystem, should instead point the datasrc remote at the public-facing server:

git remote -v

  datasrc       https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git (fetch)
  datasrc       https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git (push)
  origin        git@github.com:zkaufman/freesurfer.git (fetch)
  origin        git@github.com:zkaufman/freesurfer.git (push)
  upstream      git@github.com:freesurfer/freesurfer.git (fetch)
  upstream      git@github.com:freesurfer/freesurfer.git (push)
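
The choice between the two datasrc URLs can be sketched as a small shell helper. This is only an illustration: datasrc_url is a hypothetical function name, and the check assumes that seeing the local annex directory is a good enough test for being inside the Martinos Center.

```shell
# Hypothetical helper (not part of freesurfer or git-annex): print the
# datasrc URL that matches the machine we are on.
datasrc_url() {
    # Assumption: the local repo path is visible only inside the Martinos Center.
    local_repo="${1:-/space/freesurfer/repo/annex.git}"
    if [ -d "$local_repo" ]; then
        printf 'file://%s\n' "$local_repo"
    else
        printf '%s\n' 'https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git'
    fi
}

# Usage, from inside the freesurfer working directory (commented out so that
# sourcing this snippet has no side effects):
#   git remote add datasrc "$(datasrc_url)"
#   git remote set-url datasrc "$(datasrc_url)"   # if datasrc already exists
```

Running git remote -v afterwards should show one of the two listings above.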

Adding a data file

The following example adds a data file, for instance 'testdata.tar.gz' in the 'distribution' directory. Replace <filename> with the path to the file:

git annex add <filename>
git commit -a -m "Added new file"
git annex copy --to datasrc

Getting a data file

To retrieve the contents of a data file:

git fetch datasrc   # 'git annex sync' might work here as well
git annex get mri_em_register/testdata.tar.gz
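
To see whether the contents of a file have actually been retrieved, it can help to look at the symlink itself. The sketch below assumes the annexed files are kept locked, so that each one appears in the working tree as a symlink whose target exists only once the content has been fetched. annex_link_status is a hypothetical name used for illustration.

```shell
# Hypothetical helper: report whether an annexed file's content is available.
# Assumption: locked annexed files are symlinks into the annex object store,
# and the link dangles until 'git annex get' has fetched the content.
annex_link_status() {
    f="$1"
    if [ -L "$f" ]; then
        # [ -e ] follows the symlink, so it fails for a dangling link.
        if [ -e "$f" ]; then
            echo present
        else
            echo missing
        fi
    elif [ -e "$f" ]; then
        echo regular   # ordinary file (possibly an unlocked annexed file)
    else
        echo absent    # no such path at all
    fi
}

# Usage:
#   annex_link_status mri_em_register/testdata.tar.gz
```

A result of "missing" suggests that git annex get still needs to be run for that file.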

Get only the data files required for build-time checks (1.9 GB):

git annex get --metadata fstags=makecheck .

Get only the data files required for local installation (4.3 GB):

git annex get --metadata fstags=makeinstall .

Retrieve everything under the current directory (not recommended):

git annex get .

Modifying a data file

To modify the contents of a data file, first unlock it (which eliminates the symlink), then modify it, then re-add it to the annex:

git annex unlock mri_em_register/testdata.tar.gz
<modify contents of tar file>
git annex add mri_em_register/testdata.tar.gz
git commit -am "New test data"
git push
git annex copy --to datasrc
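
The same sequence can be wrapped in a function so that no step is forgotten. This is a minimal sketch, not an official freesurfer script: run and update_annexed_file are hypothetical helpers, the DRY_RUN variable is an assumption added for safe previewing, and the file path is passed to git annex copy explicitly.

```shell
# Hypothetical helper: with DRY_RUN=1, print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# usage: update_annexed_file FILE MESSAGE COMMAND [ARGS...]
# COMMAND is whatever regenerates FILE once it has been unlocked.
update_annexed_file() {
    if [ "$#" -lt 3 ]; then
        echo "usage: update_annexed_file FILE MESSAGE COMMAND [ARGS...]" >&2
        return 2
    fi
    file="$1"
    message="$2"
    shift 2
    # Stop at the first failing step so a broken file is never pushed.
    run git annex unlock "$file" &&
        run "$@" &&
        run git annex add "$file" &&
        run git commit -am "$message" &&
        run git push &&
        run git annex copy --to datasrc "$file"
}

# Preview the steps without touching the repo ('testdata' is a made-up directory):
#   DRY_RUN=1 update_annexed_file mri_em_register/testdata.tar.gz "New test data" \
#       tar -czf mri_em_register/testdata.tar.gz testdata
```

Chaining the steps with && means a failed unlock or commit leaves the remaining commands unexecuted.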

Tagging

git-annex provides the ability to tag data files. Freesurfer uses tags so that subsets of the data can be retrieved without having to download everything. The data files have been broken down into the following three categories:

  1. Those required for build-time checks (tagged makecheck)

  2. Those required for a local installation (tagged makeinstall)

  3. Everything else (untagged)

It is essential that data files get the proper tag(s) so that our servers and disk space are not overwhelmed when only a known subset of the data is required.

Display metadata

To show all the metadata associated with a file:

git annex metadata mri_em_register/testdata.tar.gz

Assign metadata

To assign a tag to an existing data file:

git annex metadata mri_em_register/testdata.tar.gz -s fstags=makecheck
git annex sync

We can also append tags:

git annex metadata mri_em_register/testdata.tar.gz -s fstags+=makeinstall
git annex sync
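
Because a mistyped tag would silently leave a file out of the makecheck or makeinstall subsets, a small guard can be useful. The sketch below is an assumption-laden illustration: valid_fstag and tag_file are hypothetical names, and the list of allowed values is taken from the two tags named on this page.

```shell
# Hypothetical guard: accept only the two fstags values used on this page.
valid_fstag() {
    case "$1" in
        makecheck|makeinstall) return 0 ;;
        *) return 1 ;;
    esac
}

# usage: tag_file FILE TAG
# Appends TAG to the file's fstags field (the same += form shown above),
# refusing anything that is not a known tag.
tag_file() {
    file="$1"
    tag="$2"
    if ! valid_fstag "$tag"; then
        echo "unknown fstags value: $tag" >&2
        return 1
    fi
    git annex metadata "$file" -s "fstags+=$tag"
}

# Usage (run 'git annex sync' afterwards, as above):
#   tag_file mri_em_register/testdata.tar.gz makecheck
```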

List all files with a given tag

git annex find --metadata fstags=makecheck

Mirroring

The git annex repo exists on the local file system in the following directory:

/space/freesurfer/repo/annex.git

The public-facing git annex repo exists on the local file system in the following directory (mounted by our server):

/cluster/pubftp/dist/freesurfer/repo/annex.git

Currently we "mirror" the two repos daily using the following commands:

ssh pinto   # the following commands must be run on the machine pinto
rsync -av /space/freesurfer/repo/annex.git/* /cluster/pubftp/dist/freesurfer/repo/annex.git
git update-server-info
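
The current daily procedure can be wrapped in a function, sketched below. mirror_annex is a hypothetical name, and pointing git update-server-info at the public copy is an assumption: the commands above do not say which directory it is run from, but the public copy is the one served over HTTPS.

```shell
# Hypothetical wrapper around the current mirroring commands.
# Arguments default to the two repo paths given on this page.
mirror_annex() {
    src="${1:-/space/freesurfer/repo/annex.git}"
    dst="${2:-/cluster/pubftp/dist/freesurfer/repo/annex.git}"
    # Same rsync call as above: copy everything inside the internal repo.
    rsync -av "$src"/* "$dst" || return 1
    # Assumption: the helper files should be refreshed in the public copy,
    # so name that repo explicitly instead of relying on the current directory.
    git --git-dir="$dst" update-server-info
}

# Usage, after logging in to pinto:
#   mirror_annex
```

Naming the target repo with --git-dir avoids depending on where the command happens to be run.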

The proper way to mirror would be as follows: