<!--
.. title: How GitLab can help in research reproducibility
.. slug: gitlab-repro
.. date: 2017-08-25 14:08:25 UTC-04:00
.. tags: reproducibility
.. category: Professional Life
.. link: https://gitlab.com/VickySteeves/personal-website/blob/master/posts/2017-08-25.html
.. description:
.. type: text
-->
<!DOCTYPE html>
<html>
<body>
<p><strong>See this post on GitLab's blog <a href="https://about.gitlab.com/2017/08/25/gitlab-and-reproducibility/">here</a>.</strong></p>
<p>NYU reproducibility librarian Vicky Steeves shares why GitLab is her choice for ongoing collaborative research, and how it can help overcome challenges with sharing code in academia.</p>
<p>GitLab is a great platform for active, ongoing, collaborative research. It enables folks to work together easily and share that work in the open. This is especially relevant given the problems in sharing code in academia, across time and people.</p>
<!-- TEASER_END -->
<img src="http://phdcomics.com/comics/archive/phd031214s.gif" alt="phd-code-comic">
<p>It's no surprise that GitLab, a platform for collaborative coding and Git repository hosting, has features for reproducibility that researchers can leverage for their own and their communities' benefit.</p>
<h3 id="what-exactly-is-reproducibility-">What exactly is reproducibility?</h3>
<p>Reproducibility is a core component in a variety of work, from software engineering to research. For software engineers, the ability to reproduce errors or functionality is key to development. For researchers, reproducibility is about independently verifying results and methods, building on top of previous work, and increasing the impact, visibility, and quality of research. Y'know. That Sir Isaac Newton quote in every reproducibility presentation ever: "If I have seen further, it is by standing on the shoulders of giants."</p>
<p>Like all things, reproducibility exists on a spectrum. I like Stodden et al.'s definitions from the <a href="http://stodden.net/icerm_report.pdf">2013 ICERM report</a>, so I'll use those:</p>
<table class="table-bordered">
<thead>
<tr>
<th style="text-align:center">ICERM Report Definitions</th>
<th style="text-align:center">Potential Real-World Examples</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reviewable Research: Sufficient detail for peer review and assessment</td>
<td>The code and data are openly available</td>
</tr>
<tr>
<td>Replicable Research: Tools are available to duplicate the author's results using their data</td>
<td>The tools (software) used in the analysis are freely available for others to confirm results</td>
</tr>
<tr>
<td>Confirmable Research: Main conclusions can be attained independently without author's software</td>
<td>Others can reach the conclusion using similar tools, not necessarily the same as the author, or on a different operating system</td>
</tr>
<tr>
<td>Auditable Research: Process and tools archived such that it can be defended later if necessary</td>
<td>The tools, environment, data, and code are put into a preservation-ready format</td>
</tr>
<tr>
<td>Open/Reproducible Research: Auditable research made openly available</td>
<td>Everything above is made available in a repository for others to examine and use</td>
</tr>
</tbody>
</table>
&nbsp;
<p>The last row there is the goal: <strong>open and reproducible research</strong>. Releasing code and data is key to open research, but not necessarily enough for reproducibility. This is where the concept of computational reproducibility becomes important, where whole environments are captured. You could also look at it this way:</p>
<img src="https://osf.io/8rx9y/download" alt="reproducibility-pyramid" height="50%" width="50%" style="display: block; margin: 0 auto;">
<h3 id="how-can-gitlab-help-">How can GitLab help?</h3>
<p>There are a few solutions out there, including containers (such as Docker or Singularity) for active research, and <a href="http://o2r.info/">o2r</a> and <a href="https://reprozip.org">ReproZip</a> for capturing and reproducing completed research. For this post, I'm going to focus on active research and containers.</p>
<p>I like GitLab for research reproducibility because it makes working together simple and seamless. There's no hacking together 100 different third-party services. GitLab has hosting, LFS, and integrated Continuous Integration for free, for both public and private repositories! Everything is integrated in a single GitLab repository which, if made publicly available, can enable secondary users to reproduce results in a more streamlined fashion. You can also keep these private to a group: you control the visibility of everything in one repository, in one place, as opposed to updating permissions across multiple services.</p>
<p>There are a few key features that set GitLab apart when it comes to containers and reproducibility. The first is that GitLab doesn't use a third-party service for continuous integration. It ships with CI runners which can use Docker images from GitLab's registry. Basically, you can pick an image from the Docker Container Registry, a secure, private Docker registry, and GitLab CI runs each job in its own separate and isolated container based on that image.</p>
<p><img src="https://about.gitlab.com/images/ci/arch-1.jpg" alt="gitlab-ci-repro"></p>
<p>If you don't feel like using the GitLab registry, you can also use images from DockerHub or a custom Docker container you're already using locally. These can be integrated with GitLab CI, and if made public, any secondary users can use them as well!</p>
<h3 id="let-s-look-at-an-example">Let&#39;s look at an example</h3>
<p>This process is set up in a single file, a <code>.gitlab-ci.yml</code>. Another feature that makes my life easier: GitLab can syntax-check the CI config files! The <code>.gitlab-ci.yml</code> file describes the pipelines and stages, each of which has a different function and can have its own tags, produce its own artifacts, and reuse artifacts from other stages. These stages can also run in parallel if needed. Here's an example of what a basic config file looks like with R:</p>
<pre><code>image: jangorecki/r-base-dev
test:
  script:
    - R CMD build . --no-build-vignettes --no-manual
    - PKG_FILE_NAME=$(ls -1t *.tar.gz | head -n 1)
    - R CMD check "${PKG_FILE_NAME}" --no-build-vignettes --no-manual --as-cran
</code></pre><p>And here's an example of building a website using GitLab and the static site generator Nikola:</p>
<pre><code>image: registry.gitlab.com/paddy-hack/nikola:7.8.7
test:
  script:
    - nikola build
  except:
    - master
pages:
  script:
    - nikola build
  artifacts:
    paths:
      - public
  only:
    - master
</code></pre>
<p>It's also worth noting that you can use different containers per step in your workflow, if you outline it in your <code>.gitlab-ci.yml</code>. If your data collection script runs in one environment but your analysis script needs another, that's perfectly fine using GitLab, and others have the information to reproduce it easily! Another feature that sets GitLab apart is that a build of one project can trigger a build of another, AKA multi-project pipelines. For those of you working with big data, you can automatically spin up and down VMs to make sure your builds get processed immediately with GitLab's CI as well.</p>
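<p>To make the per-step containers idea concrete, here's a minimal sketch rather than a config from a real project. The job names, the script names (<code>collect.py</code>, <code>analysis.R</code>), the <code>data/</code> folder, and the image tags are all placeholders I've assumed for illustration, and the last job assumes a GitLab version recent enough to support the <code>trigger</code> keyword, with <code>my-group/my-paper</code> as a made-up project path:</p>
<pre><code># Hypothetical example: every name below is a placeholder
stages:
  - collect
  - analyze
  - downstream

# Data collection runs in a Python container
collect:
  stage: collect
  image: python:3.6
  script:
    - python collect.py
  artifacts:
    paths:
      - data/

# Analysis runs in a separate R container
analyze:
  stage: analyze
  image: rocker/r-ver:3.4.1
  script:
    - Rscript analysis.R

# Multi-project pipeline: start a build in another project
build_paper:
  stage: downstream
  trigger:
    project: my-group/my-paper
</code></pre>
<p>Because <code>analyze</code> sits in a later stage than <code>collect</code>, it should receive the <code>data/</code> artifacts by default, so the two environments stay separate while the files still flow between them.</p>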
<p>Here are some other great resources and examples of using GitLab to make research more reproducible:</p>
<ul>
<li><a href="https://gitlab.com/jangorecki/r.gitlab.ci">Gitlab-CI for R packages</a></li>
<li><a href="https://bertelsen.ca/example-gitlab-ci-yml-for-r-projects/">Example gitlab-ci.yml for R Projects</a></li>
<li><a href="http://www.jonzelner.net/statistics/make/docker/reproducibility/2016/05/31/reproducibility-pt-1/">Blog Post explaining GitLab + reproducibility - Jon Zelner</a></li>
<li><a href="https://gitlab.com/jzelner/reproducible-stan">GitLab repo accompanying blog post - Jon Zelner</a></li>
<li><a href="https://www.nersc.gov/assets/Uploads/2017-02-06-Gitlab-CI.pdf">Continuous Integration with Gitlab - Tony Wildish</a></li>
</ul>
</body>
</html>