At the first MURUG meeting of 2015, Ellen gave a great presentation on the many virtues of using the version control and code sharing software packages: Git and GitHub.
MURUG has its own GitHub account where the R scripts, data, and other useful material from the MURUG workshops can be accessed. Sharing code is an important part of the collaboration process, and I’m certain that we can all benefit from using version control software to better manage our code. We strongly encourage MURUG members to become acquainted with GitHub and use this fantastic and free tool to share code and develop collaborations.
Like any new software, getting started with GitHub can be a bit overwhelming and challenging. In this series of posts we hope to make that process a little easier by providing some additional information, and a step-by-step guide to getting started with Git, GitHub and RStudio.
Picture this scenario: You’ve just finished writing a long R script involving dozens of custom functions, some complex statistical analyses, and a whole host of fancy graph outputs. Very wary of losing all your hard work, you carefully navigate up to the top of RStudio and click “Save”. Great, “Analysis.R” is now saved for continued work tomorrow.
The next morning you attend a seminar and a colleague tells you that there is a much better way of analysing your data: “You should use the X package” or “You need to use the Y statistical method”. You now have two choices. Option 1 is to decide your previous work was complete rubbish, delete a whole bunch of your code, and start again from scratch. Option 2 is to make the required changes to your code and click “Save as”: now “AnalysisV2.R” has sprung into existence.
If your work goes anything like mine, it won’t be long before you end up with “AnalysisV3.R” and then “AnalysisV4.R” and so on. If you are really creative, you might end up with files like “AnalysisV4_tryingnewidea_v2.R” and “AnalysisV19_otherideaswerebad_trythisinstead.R”. This only gets worse when you try to share code with your colleagues. As you are perhaps aware, it doesn’t take very long at all before you end up with a mess of different files. The solution? Version Control Software.
As the name suggests, version control software allows you to track different ‘versions’ of your project. All changes to your code (or other files) are recorded (i.e. saved) and you can revert back and undo changes from any point in time. Whenever you make changes to your code, at the end of the day for example, you take a ‘snapshot’ of your project (or ‘commit’ the changes to the repository in version control parlance). No more multiple files with different versions of the same code. No more creative file names. Just one file, a simple file name, and the entire history of all changes to the code at your fingertips. What’s not to love?
With version control software you can restore older versions of a file very quickly and easily. What is the advantage of this? You can’t mess things up! It is not possible (or at least very difficult) to wreck your code and lose all of your important information. Accidentally saved over your entire R script with an empty file? No problem, just revert back to the last commit. Realised that all the stuff you did yesterday was wrong and you should not have deleted all that code from two days ago? Simple, revert back and continue from the version of the code that you trust.
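As a rough sketch of the "save over your script with an empty file" situation, here is how the recovery might look with Git on the command line. The temporary directory, the placeholder name and email, and the one-line script contents are assumptions made for the demonstration; only the file name “Analysis.R” comes from the scenario above.

```shell
# Work in a throwaway directory so no real project is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity, set for this demo repository only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Save a first version of the script and take a snapshot (a commit).
echo 'x <- c(1, 2, 3)' > Analysis.R
git add Analysis.R
git commit -q -m "Add first version of analysis"

# Accidentally overwrite the script with an empty file...
: > Analysis.R

# ...then restore the committed version of the file.
git checkout -- Analysis.R
restored="$(cat Analysis.R)"
echo "$restored"
```

The same steps can be carried out through the RStudio interface; the commands are shown here only to make the idea concrete.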
What has changed?
One of the additional problems of having many different files, each containing a new and improved version of your code, is that it is very easy to lose track of the differences between the files. What was different between Version 2 and Version 3? I have trouble remembering what I did yesterday, so if the code was from last month or last year, I have no idea of the difference between the two. Sometimes this is where the creative file names come into the picture. But it is usually a bit difficult to capture the nuances of your new and improved code in a file name. Plus, it looks really messy.
With version control software, each time you commit your file and save the changes you are required to enter a short description of what was changed: e.g., “Removed bug in regression model and added new plotting feature”. In addition, the version control software has a great feature where you can very quickly see exactly what was changed in the contents of a file: for example, which lines were deleted and which new ones were added.
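To illustrate the commit description and the line-by-line comparison, here is a minimal sketch using Git on the command line. The R code inside the file, the first commit message, and the placeholder identity are invented for the example; the second commit message is the one quoted above.

```shell
# Throwaway repository for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# First snapshot.
printf 'fit <- lm(y ~ x)\n' > Analysis.R
git add Analysis.R
git commit -q -m "Add regression model"

# Change the script and commit again with a short description.
printf 'fit <- lm(y ~ x, data = d)\nplot(fit)\n' > Analysis.R
git commit -q -a -m "Removed bug in regression model and added new plotting feature"

# The descriptions of all commits, newest first.
history="$(git log --format=%s)"
echo "$history"

# Lines removed (prefixed with -) and added (prefixed with +)
# between the previous commit and the latest one.
changes="$(git diff HEAD~1 HEAD -- Analysis.R)"
echo "$changes"
```

In the diff output, the old model line appears with a leading minus sign and the two new lines appear with a leading plus sign, which answers the "what was different between versions?" question without relying on memory or file names.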
Backup of your code
If you use a centralised version control server (GitHub, for example), your entire project is fully backed up. In the unfortunate situation of a computer meltdown, you can simply recover all your work by downloading it from the centralised repository.
Collaboration and Sharing
The added advantage of using a centralised version control server like GitHub is that it is very simple to share code and collaborate on projects with your colleagues. GitHub will be explained in more detail later. But first let’s talk about version control software.
Version Control Software
Version control systems (VCS) have been around for many decades, and a number of different VCS software packages have been developed. Here we will only focus on open source version control software, although a number of proprietary VCS do exist.
Version control systems can be broadly grouped into two categories: centralised and distributed systems.
Centralised systems follow a client-server model, where a single central repository is located on a server, and all users of the repository commit changes to this central copy. One of the oldest VCSs, Concurrent Versions System (CVS), follows the centralised approach and has been widely used. Subversion (SVN) was developed to address some of the perceived deficiencies in CVS and is another popular VCS that follows the client-server model.
With a distributed system, each user has an entire copy of the version control repository on their local machine, with a complete history of all changes made to the project. Each user has the option to sync their local repository with a centralised system (GitHub for example). This way, there is no single master copy which is vulnerable to being corrupted, but every user who uses the repository has a complete copy of everything.
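The "every user has a complete copy" idea can be sketched with Git on the command line. In this sketch a local bare repository stands in for a hosted service such as GitHub, and the directory names (shared.git, alice, bob), the placeholder identity, and the file contents are all assumptions made for the demonstration.

```shell
# Throwaway working area.
base="$(mktemp -d)"
cd "$base"

# A bare repository plays the role of the shared, hosted copy.
git init -q --bare shared.git

# First user: create a project, commit twice, and publish it.
git init -q alice
cd alice
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'x <- 1' > Analysis.R
git add Analysis.R
git commit -q -m "First version"
echo 'y <- 2' >> Analysis.R
git commit -q -a -m "Second version"
git remote add origin "$base/shared.git"
git push -q origin HEAD
cd "$base"

# Second user: cloning copies the files together with the full history.
git clone -q "$base/shared.git" bob
commits_in_clone="$(git -C bob rev-list --count HEAD)"
echo "$commits_in_clone"
```

Both commits are present in the second user's clone, so the full history would survive even if the shared copy or the first user's machine were lost.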
Although there are several different open source systems that follow the distributed approach, the most popular version control system (at least currently) appears to be Git.
Although a bit challenging to begin using, Git is very powerful and is well worth learning and using. RStudio can be integrated with Git and GitHub which makes saving your changes and sharing your code with the world a breeze.
In the next part we will discuss how to install and get started with Git.