UKB-RAP/DNAnexus Gotchas
I’d like to use this page as a living document covering gotchas I’ve encountered when using the UK Biobank Research Analysis Platform (UKB-RAP), a special instance of the DNAnexus platform for working with this dataset. Some of the items below are covered, at least in part, in the official documentation or the Community Forums, but they weren’t obvious at first glance, or weren’t explained in a way I understood well. Hopefully my explanations will help others using the platform save a little time.
In general, I’ll cover
- Interactive use of the data
- Writing applets and pipelines
Interactive Use of UKBB Data on the RAP
Accessing project files from JupyterLab
The DXJupyterLab app allows you to run JupyterLab, optionally with Spark/Hail enabled. If you’re developing in Jupyter, you might notice that files you upload to your DNAnexus project (normally accessible under /mnt/project) can’t be found by the code running in your notebook. This is because dxfuse only exposes files that were already present in the project at the time the filesystem was mounted.
If you want to use a file you uploaded to your project after you started up your Spark cluster, remount dxfuse by running the following Bash code in a notebook cell:
cat > /home/dnanexus/.dxfuse_manifest.json <<EOF
{
  "files" : [],
  "directories" : [
    {
      "proj_id" : "$DX_PROJECT_CONTEXT_ID",
      "folder" : "/",
      "dirname" : "/project"
    }
  ]
}
EOF
umount /mnt
mkdir -p /mnt
/home/dnanexus/dxfuse -readOnly /mnt /home/dnanexus/.dxfuse_manifest.json
Working With Project Files on the Cloud Workstation
When running a job via a utility such as dx swiss-army-knife, your project data (i.e. the files you see in the web interface) are available via dxfuse at /mnt/project. This is very useful when running things non-interactively (like in a pipeline), but you may find yourself wanting to interactively debug a step in a script you’re running with swiss-army-knife by booting up a Cloud Workstation and operating on files under that mountpoint. However, by default, the Cloud Workstation doesn’t mount your project files.
The most convenient way I’ve found to work with project files on the Cloud Workstation is to download the dxfuse binary for Linux to the Cloud Workstation, and simply mount the project to a directory of your choosing. I keep the binary as a file in DNAnexus, so I don’t have to fetch it over the network every time. You could include this in a startup script, or image your Cloud Workstation in order to not have to do this every time you want to work with the data.
To give a more concrete example, you could do something like the below to set up a folder project/ in your workstation home directory, in which the data will be available:
mkdir project # Mountpoint for our project data
dx download [file ID for the dxfuse binary] && chmod +x ./dxfuse
./dxfuse ./project [project ID]
You can then navigate the directory structure much as you would on the site. Keep in mind that this filesystem is read-only, which can make running tools on data mounted this way frustrating, since temporary files can’t be created alongside the files being operated on. Also, because the mounted filesystem is read-only, you need to use dx upload to add any new files to your UKB-RAP project.
In order to perform that kind of upload, remember that you first need to run the below (as mentioned in the official docs):
unset DX_WORKSPACE_ID
dx cd $DX_PROJECT_CONTEXT_ID:
Writing Applets and Pipelines
Including a large file in your applet
You may need to include a large file in your app(let), such as a raw FASTA human genome reference. Normally you could include this in the resources/ subdirectory of your app(let), which is appropriate for small files and binaries, but not something big…
For very large files, you could use the Upload Agent to upload them as ordinary files, and then reference the file object every time you run the applet. The disadvantage is that the file needs to be copied to the run container every time you run a job, so you waste time on file transfer before the job can even start, and thus waste EC2 instance time. You could also use dxfuse as mentioned above to stream the file to your worker, but depending on the file type and use case, performance can become a serious issue.
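As a rough sketch of the upload-and-reference pattern (project IDs, folders, and file names below are placeholders, and I’m assuming the Upload Agent binary `ua` is on your PATH):

```shell
# Hypothetical sketch: upload a large reference once with the Upload Agent,
# then pass the resulting file object to each job instead of bundling it
# in resources/. All IDs and file names are placeholders.

upload_reference() {
  # The Upload Agent ("ua") handles large files via parallel, resumable chunks
  ua --project "$1" --folder /references GRCh38_full.fa
}

run_with_reference() {
  # Pass the uploaded file object as a job input; it gets transferred to the
  # worker when the job starts, which is where the hidden cost lives
  dx run swiss-army-knife \
    -iin="$1" \
    -icmd='samtools faidx GRCh38_full.fa'
}
```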
There isn’t a good universal solution as far as I’ve seen. I include this section mainly to encourage you to be mindful of file transfer times as a hidden cost in your DNAnexus-based pipelines.
Working with DRAGEN WGS Data (in PLINK Format)
When running e.g. SAIGE-GENE, it’s often most convenient to use the PLINK2 files representing the WGS genotyping information available from UKBB. However, there’s a catch: the files were generated without a phenotype column, so you need to pass --no-pheno to PLINK2 or it won’t open any of the files you specify.
Keep in mind that the UKBB-provided files are alt-first; the alternate allele is listed first in the *.pvar file. This allele ordering needs to be specified to downstream tools such as SAIGE.
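To make both gotchas concrete, here’s a hedged sketch (file prefixes and variant lists are placeholders, and the SAIGE --AlleleOrder option spelling should be verified against the SAIGE documentation for your version):

```shell
# Hypothetical sketch; all file name prefixes are placeholders.

subset_wgs_plink() {
  # --no-pheno: the UKBB-provided files lack a phenotype column, and PLINK2
  # refuses to open them without this flag
  plink2 --pfile ukb_wgs_c1 \
         --no-pheno \
         --extract gene_variants.txt \
         --make-bed --out gene_subset
}

saige_gene_step2() {
  # Assumption: your SAIGE version exposes --AlleleOrder to flag that the
  # UKBB files list the alternate allele first
  Rscript step2_SPAtests.R \
    --bedFile=gene_subset.bed \
    --bimFile=gene_subset.bim \
    --famFile=gene_subset.fam \
    --AlleleOrder=alt-first
}
```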