We’ve built the dccvalidator tool to streamline the process of data validation and QA/QC. As the PsychENCODE (PEC) Knowledge Portal has grown to more than 39 contributing labs and over 90,000 data files, we’ve realized a need to standardize our approach to data curation. We therefore built an application that performs many of the routine data quality checks we previously conducted by hand, in the hope that it will help you, the data contributor, get your data checked, validated, and shared easily and quickly.
The application is hosted on a Shiny server here.
To use this application you must:
Some portions of the app submit data to Synapse. This allows curators at Sage to troubleshoot issues if needed. No one outside the Sage curation team will be able to download the data.
A biospecimen is a sample of material such as tissue, cells, DNA, RNA, or protein that has a unique identifier associated with it: specimenID. The same biospecimen may be characterized in multiple assay types; in this case, the unique identifier should remain the same. We strongly recommend that you do not name specimens using individual identifiers. Where multiple sequencing libraries are prepared from a single biospecimen, LibraryID is an available key. Replicates are tracked using integers and the keys
A manifest is a .tsv or .txt file in which each row is an entry for a data file to be uploaded to Synapse. Details of a manifest are described in the Uploading and Downloading Data in Bulk Synapse User Guide. While a metadata file is stored on Synapse as a flat file, with select variables added as file annotations, all variables in a manifest file become annotations on the file in that row. To successfully upload a file, you must specify the local path to the file and the Synapse ID of the folder in the
Yes, Synapse supports Provenance! Provenance can be leveraged to connect raw data to reprocessed or summarized data. Populate the used column in the manifest with the synID. To link multiple files, separate the synIDs with semicolons: used = synID;synID.
Yes, with Provenance. Populate the executed column with the URL of your GitHub repo.
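As a rough illustration of the provenance columns described above, the following Python sketch writes a one-row manifest as tab-separated text. The used and executed column names and the semicolon separator come from this guide; the parent column name is an assumption based on the Synapse manifest format, and every path, Synapse ID, and URL shown is a hypothetical placeholder.

```python
import csv
import io


def provenance_cell(ids):
    """Join several Synapse IDs (or URLs) into one manifest cell, separated by ';'."""
    return ";".join(ids)


def write_manifest(rows, handle):
    """Write manifest rows as tab-separated text with a fixed column order."""
    # 'parent' is an assumed column name; 'used' and 'executed' are from the guide.
    columns = ["path", "parent", "used", "executed"]
    writer = csv.DictWriter(
        handle, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


rows = [
    {
        "path": "/data/sample1_counts.tsv",  # hypothetical local file
        "parent": "syn00000000",  # hypothetical Staging folder ID
        "used": provenance_cell(["syn11111111", "syn22222222"]),  # hypothetical inputs
        "executed": "https://github.com/example-lab/example-pipeline",  # hypothetical repo
    }
]

buffer = io.StringIO()
write_manifest(rows, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

In practice you would write to a .tsv file rather than an in-memory buffer, and add any annotation columns your metadata templates call for.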
Each study in PEC will have accompanying documentation in the PEC portal. Here is an example of study documentation in the Accelerating Medicines Partnership in Alzheimer’s Disease portal, developed by Sage Bionetworks.
You can submit your documentation through the dccvalidator app on the Documentation page. There should be a study description for the whole study, and an assay description for each assay that was performed. These can be in a single file, or you can upload multiple files to the assay description section.
With a new study, there may not yet be a Staging folder in the PEC Knowledge Portal. Please contact us at PEC_SageAdmin@synapse.org.
Each study should include metadata that would help a new researcher understand and reuse the data. In most cases, we will expect 4 files:
Metadata file templates are available in the PsychENCODE Knowledge Portal resources.
If you don’t see a template for the assay(s) in your study, please send a request for a new schema to PEC_SageAdmin@synapse.org. We depend on your expertise to develop schemas that capture the most pertinent metadata!
The data validation portion of the app allows you to upload metadata files (as .csv) and the manifest (as .tsv or .txt) and view the results of a series of automated checks.
Examples of the types of checks we perform are:
- All required columns from the templates are present
- Individuals and specimens have unique identifiers
- Metadata terms conform to a controlled vocabulary, where applicable
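If you would like to pre-check a file locally before using the app, the three kinds of checks listed above can be approximated in a few lines of Python. This is only a minimal sketch and not the dccvalidator implementation; the function names, the individualID, tissue, and assay columns, and the allowed vocabulary are hypothetical, and the real required columns and vocabularies come from the metadata templates.

```python
import csv
import io


def check_required_columns(rows, required):
    """Return the required column names that are absent from the data."""
    present = set(rows[0].keys()) if rows else set()
    return [col for col in required if col not in present]


def check_unique(rows, column):
    """Return the values that appear more than once in a column."""
    seen, duplicates = set(), []
    for row in rows:
        value = row.get(column)
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def check_vocabulary(rows, column, allowed):
    """Return non-blank values in a column that are not in the allowed set."""
    return sorted(
        {row[column] for row in rows if row.get(column) and row[column] not in allowed}
    )


# Hypothetical biospecimen metadata with one duplicate ID and one invalid term.
sample_csv = "\n".join(
    [
        "specimenID,individualID,tissue",
        "S1,I1,prefrontal cortex",
        "S2,I1,cerebellum",
        "S2,I2,PFC",
    ]
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))

missing = check_required_columns(
    rows, ["specimenID", "individualID", "tissue", "assay"]
)
duplicates = check_unique(rows, "specimenID")
invalid = check_vocabulary(rows, "tissue", {"prefrontal cortex", "cerebellum"})

print("Missing columns:", missing)
print("Duplicate specimenIDs:", duplicates)
print("Terms outside the vocabulary:", invalid)
```

Each function returns a list of problems, so an empty list means that check passed.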
We also provide a summary of the files you have uploaded, showing the number of individuals, specimens, and files. We visualize the data in each column by its data type to help spot unexpected missing values.
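To give a sense of what such a summary contains, here is a small Python sketch that counts distinct individuals, specimens, and files, and tallies blank cells per column. It is an assumption-laden illustration rather than the app's own code: the app presents this information visually, whereas this sketch only prints counts, and the column names other than specimenID, along with all the sample values, are hypothetical.

```python
def summarize(metadata_rows, manifest_rows):
    """Count distinct individuals and specimens, and the number of files listed."""
    individuals = {r["individualID"] for r in metadata_rows if r.get("individualID")}
    specimens = {r["specimenID"] for r in metadata_rows if r.get("specimenID")}
    return {
        "individuals": len(individuals),
        "specimens": len(specimens),
        "files": len(manifest_rows),
    }


def missing_counts(rows):
    """Count blank or absent values in each column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or value.strip() == "":
                counts[column] += 1
    return counts


# Hypothetical metadata and manifest rows, as they might be read from files.
metadata_rows = [
    {"individualID": "I1", "specimenID": "S1", "tissue": "cerebellum"},
    {"individualID": "I1", "specimenID": "S2", "tissue": ""},
    {"individualID": "I2", "specimenID": "S3", "tissue": "cerebellum"},
]
manifest_rows = [
    {"path": "a.fastq", "specimenID": "S1"},
    {"path": "b.fastq", "specimenID": "S2"},
    {"path": "c.fastq", "specimenID": "S3"},
    {"path": "d.bam", "specimenID": "S3"},
]

summary = summarize(metadata_rows, manifest_rows)
missing = missing_counts(metadata_rows)
print(summary)
print(missing)
```

A nonzero count for a column you expected to be complete is a cue to review that column before uploading.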
The Data Analysis Core will reprocess common data types during PEC Phase II. Currently, common data types are RNASeq, ATACSeq, ChIPSeq and all single cell data. For common data types, only fastq files are required. For other data types, please provide fastq and bam files.
count matrices - RNASeq, ChIPSeq and ATACSeq data all produce count matrices. These file types are especially useful for data users who want to compare their own datasets without bioinformatic processing.
peaks - A processed data type specific to ChIPSeq output.
Once data has passed validation and the PEC data curators have granted edit permissions to the Staging folder, data may be uploaded. Bulk file upload is possible through the web UI or the R, Python and command line clients.
Data is uploaded to a Staging folder, private to each individual group. Once curated, data is moved to a PEC folder for a limited period of time, where all consortium members have access to the data via the PEC Team. Finally, data is made public to Synapse users in a Data folder. All data upload takes place in the PsychENCODE Knowledge Portal Synapse Project. While access to the project is public, restrictions are applied to the Staging and PEC folders to make sure the data remains private for the appropriate period of time.
Synapse IDs are always preserved (i.e. IDs remain associated to the file).
Please send questions to PEC_SageAdmin@synapse.org.