Analysis Pipeline
This guide walks through some of the key steps in analyzing molecular inversion probe data. We use a test data set composed of FASTQ files and a sample list.
Note
This guide does not cover the download or demux apps. The test data provided has already been demultiplexed.
Download Test Data
The test data set can be downloaded here or from the command line:
# Download and untar directory
wget -qO- https://baileylab.brown.edu/MIPTools/download/test-data.tar.gz | tar -xvz
The test data set contains five directories holding the test data, species resources, and project resources:
tree -FL 1 test-data
#> test-data
#> ├── DR1_project_resources/
#> ├── hg38_host/
#> ├── pf_species_resources/
#> ├── test_data/
#> └── test_design_resources_pf/
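Before moving on, it can help to confirm that the archive extracted completely. The following is a minimal sketch, not part of MIPTools: the function name check_test_data_layout is hypothetical, and the directory names are taken from the listing above.

```shell
# Hypothetical helper (not part of MIPTools): confirm that the five
# directories shown in the tree listing exist under the given root.
check_test_data_layout() {
    root="${1:-test-data}"
    missing=0
    for d in DR1_project_resources hg38_host pf_species_resources \
             test_data test_design_resources_pf; do
        if [ ! -d "${root}/${d}" ]; then
            echo "missing: ${root}/${d}" >&2
            missing=1
        fi
    done
    # Returns 0 when every directory is present, 1 otherwise.
    return "${missing}"
}

# Usage: check_test_data_layout test-data && echo "layout OK"
```

The function only checks that the directories exist. It does not inspect their contents.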
Wrangle Data
After downloading the test data, the next step is to run the wrangler app. We first create a directory to store our wrangler analysis and copy our sample list into it:
mkdir test-data/wrangler
cp test-data/test_data/sample_list.tsv test-data/wrangler/
We additionally define several parameters needed to wrangle data:
experiment_id='test_run'
sample_list='sample_list.tsv'
probe_sets_used='DR1,VAR4'
sample_sets_used='JJJ'
cpu_number=10
min_capture_length=30
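The value cpu_number=10 may exceed the number of cores on a smaller machine. This guide does not say how the wrangler app behaves in that case, so the sketch below simply caps the request at the number of online cores. The helper name cap_cpu_number is hypothetical and not part of MIPTools.

```shell
# Hypothetical helper (not part of MIPTools): cap a requested CPU count
# at the number of cores currently online, falling back to 1 if the
# core count cannot be determined.
cap_cpu_number() {
    requested="$1"
    available="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    if [ "${requested}" -gt "${available}" ]; then
        echo "${available}"
    else
        echo "${requested}"
    fi
}

# Usage: cpu_number="$(cap_cpu_number 10)"
```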
For details on what each flag represents, consult the man page for the app or the built-in documentation (singularity run --app wrangler miptools_dev.sif -h). With the parameters defined, we can run the wrangler app:
singularity run \
  -B test-data/DR1_project_resources:/opt/project_resources \
  -B test-data/test_data/fastq:/opt/data \
  -B test-data/wrangler:/opt/analysis \
  --app wrangler miptools_dev.sif \
  -e "${experiment_id}" -l "${sample_list}" -p "${probe_sets_used}" \
  -s "${sample_sets_used}" -c "${cpu_number}" -m "${min_capture_length}"
The wrangler app saves its main outputs as compressed files in the wrangler directory. It also writes a nohup file containing any error and warning messages logged by the app. This file should be empty if all went well. In our example run, the nohup file was empty and the main outputs were aggregated into three files:
run_test_run_wrangled_20220314.txt.gz
extractInfoByTarget.txt.gz
extractInfoSummary.txt.gz
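These checks can be scripted. The sketch below is not part of MIPTools: the function name check_wrangler_run is hypothetical, and the default log name nohup.out is an assumption, so pass the actual path as the second argument if it differs. A missing log is treated the same as an empty one.

```shell
# Hypothetical helper (not part of MIPTools): report whether a wrangler
# run looks clean.
#   $1 = wrangler directory
#   $2 = log file (the default name nohup.out is an assumption)
check_wrangler_run() {
    dir="${1:-test-data/wrangler}"
    log="${2:-${dir}/nohup.out}"
    status=0
    # A non-empty log means the app recorded errors or warnings.
    if [ -s "${log}" ]; then
        echo "log has content: ${log}" >&2
        status=1
    fi
    # Each expected output must exist and pass gzip's integrity test.
    # An unmatched glob stays literal, so it is reported as missing.
    for f in "${dir}"/run_*_wrangled_*.txt.gz \
             "${dir}/extractInfoByTarget.txt.gz" \
             "${dir}/extractInfoSummary.txt.gz"; do
        if ! gzip -t "${f}" 2>/dev/null; then
            echo "missing or corrupt: ${f}" >&2
            status=1
        fi
    done
    return "${status}"
}

# Usage: check_wrangler_run test-data/wrangler && echo "wrangler run OK"
```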
Tip
After confirming that the wrangler app ran successfully, we recommend deleting the wrangler/analysis directory. This removes many small files and frees up disk space.
rm -rf test-data/wrangler/analysis
Variant Calling
To further process our data and to call and analyze variants, we use an interactive Jupyter notebook launched through the jupyter app. Our main variant calling method uses the Freebayes software, a Bayesian genetic variant detector. While we have optimized the algorithm for calling on molecular inversion probes (MIPs), we use an interactive environment for calling and initial assessment to give the user more room for customization.
Before running the jupyter app, we must create a new directory in which we will run our variant calling pipeline:
mkdir test-data/variant
Then we can start our Jupyter notebook:
singularity run \
-B test-data/DR1_project_resources:/opt/project_resources \
-B test-data/pf_species_resources:/opt/species_resources \
-B test-data/wrangler:/opt/data \
-B test-data/variant:/opt/analysis \
--app jupyter miptools_dev.sif
Instructions for accessing the notebook will be printed to the terminal. Follow them to open the Jupyter notebooks in a web browser. For more information, refer to the FAQ of the jupyter app. Next, navigate to the analysis directory. The analysis-of-test-data-Freebayes notebook contains a demonstration of processing data, variant calling, and additional data analysis.