CollaboratoR Workflow
Overview
As stated in the project's main README, this package assists data synthesis research by automating data validation and aggregation for researchers with different skill sets. Collaborators enter data and metadata into a set of CSV-formatted data files or, optionally, into user-friendly Google Sheets. The package validates the data automatically using a simple data definition table and easy-to-write standardized rules, and the workflow tracks changes to the data using git.
Components
- (optionally) Google Sheets for entry of data and metadata
  - other systems can be used with a manual export to CSV; this package can pull from Google Sheets automatically
- metadata and data stored in common tabular data formats
  - CSV is the basis for using version control
  - this package enables the use of user-friendly Google Sheets for editing these files
- R readr package for validating data column names and data types
- R validate package for validating data content
- git to track changes to data and metadata via CSV text files
- (optionally) a git hosting service to share results via the web
  - GitHub or other git services, such as GitLab
  - on-premises GitLab servers can be used (our institution provides https://gitlab.msu.edu)
Setup/Configuration
- Install using instructions in the project’s README
- See the vignette on Google Sheets API setup for how to set up Google Workspace.
- This package relies on configuration values in the .Renviron file
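As a minimal sketch of how configuration values like these can be read in R: the variable names `COLLAB_DATA_DIR` and `COLLAB_SHEET_LIST` below are hypothetical placeholders, not the package's actual keys (see the README for those). The example writes a throwaway `.Renviron`-style file so that it runs on its own.

```r
# Hypothetical settings; the real variable names are defined by the package.
env_file <- tempfile(fileext = ".Renviron")
writeLines(
  c("COLLAB_DATA_DIR=data", "COLLAB_SHEET_LIST=sheet_list.csv"),
  env_file
)

# Load the KEY=value pairs into the current R session's environment.
readRenviron(env_file)

# Fetch a required setting, failing early with a clear message if it is unset.
get_setting <- function(name) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) {
    stop("Missing configuration value: ", name, call. = FALSE)
  }
  value
}

data_dir <- get_setting("COLLAB_DATA_DIR")
sheet_list <- get_setting("COLLAB_SHEET_LIST")
```

In normal use R reads `~/.Renviron` (or a project-level `.Renviron`) at startup, so the explicit `readRenviron()` call is only needed after editing the file in a running session.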
Workflow Summary
- Write your data entry protocol (outside of this system but crucial for organization)
- Set up git repositories to hold data as it flows through
- Create metadata and configuration
  - column definitions for the main data sheet (CSV or Google Sheets)
  - create validation rules (YAML format)
  - data lists such as species lists in CSV or
  - list/table for tracking data entry sheets
  - CollaboratoR configuration in .Renviron
- Enter data into tabular form
  - CSV or Google Sheets
  - add entry to list of URIs (Google Sheets URLs or CSV file paths)
- Use CollaboratoR package to accumulate metadata
  - read metadata: column specification and list of data sheet URIs
  - read the list of data sheets (CSV file names or Google Sheets URLs)
  - read validation rules, written in YAML and stored in an RData file
- Automated validation and report
  - read data files (from the list of URIs in CSV or Google Sheets)
  - check data format against the schema (from the column spec)
  - check category columns against lists (e.g. species lists)
  - check data values against rules (from the YAML file)
  - print a report of errors for the user
- Save to git
  - sheets that passed validation are committed to git
- Combine to master list(s)
  - a master database
Workflow
Write your data entry protocol
This is beyond the scope of this documentation and package, but it is crucial for successful data synthesis projects: the process for finding, extracting, and formatting every data point must be defined up front. This protocol determines the columns in the datasheet and the column definitions in the metadata.
Create metadata and configuration
Create a table describing the columns in each of the data files, based on the protocol. This table must have the following columns:
| Column | Description |
|---|---|
| col_name | exact name of column as it will appear in the datasheet |
| col_description | human-readable description based on your data collection protocol |
| col_type | name of the datatype, see datatype list below |
| example | human-readable example for guidance |
col_name,col_description,col_type,example
id,"id formed from species group and sample location",character,"SetosaMBG"
Sepal.L.,"Sepal length (mm)",double,5.1
Sepal.W.,"Sepal width (mm)",double,3.5
Petal.L.,"Petal length (mm)",double,1.4
Petal.W.,"Petal width (mm)",double,0.2
group_id,"grouping variable",factor,"SetosaMBG"
whos,"collector of measurements",character,"ESA"
site,"name of the site where the data was collected",character,"Main Site"
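One way a column definition table like this could be turned into a readr column specification is sketched below. This is an illustration under assumptions, not the package's own implementation: the helper `collector_for()` is hypothetical, and it maps only a small subset of possible `col_type` names.

```r
library(readr)

# Column definitions, in the layout described in the table above.
defs_file <- tempfile(fileext = ".csv")
writeLines(c(
  "col_name,col_description,col_type,example",
  "id,\"id formed from species group and sample location\",character,SetosaMBG",
  "Sepal.L.,\"Sepal length (mm)\",double,5.1",
  "group_id,\"grouping variable\",factor,SetosaMBG"
), defs_file)
col_defs <- read_csv(defs_file, col_types = cols(.default = col_character()))

# Hypothetical helper: map a col_type name to a readr collector.
collector_for <- function(type_name) {
  switch(type_name,
    character = col_character(),
    double    = col_double(),
    integer   = col_integer(),
    factor    = col_factor(),
    stop("Unsupported col_type: ", type_name, call. = FALSE)
  )
}

# Build one named collector per defined column and combine them into a spec.
collectors <- lapply(col_defs$col_type, collector_for)
names(collectors) <- col_defs$col_name
col_spec <- do.call(cols, collectors)

# A small data sheet; the second row has a value that is not a valid double.
data_file <- tempfile(fileext = ".csv")
writeLines(c(
  "id,Sepal.L.,group_id",
  "SetosaMBG,5.1,SetosaMBG",
  "SetosaMBG,not_a_number,SetosaMBG"
), data_file)
sheet <- read_csv(data_file, col_types = col_spec)

# Columns required by the definitions but absent from the sheet.
missing_cols <- setdiff(col_defs$col_name, names(sheet))
# Cells that could not be parsed as the declared type.
parse_issues <- problems(sheet)
```

Reading with an explicit specification means a value of the wrong type becomes `NA` and is recorded in `problems()`, rather than silently changing the type of the whole column.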
- create validation rules (YAML format)
- data lists such as species lists in CSV or
- list/table for tracking data entry sheets
- CollaboratoR configuration in .Renviron
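The sketch below shows what a YAML rule file for the validate package and a simple category list check might look like. The rule names, the thresholds, and the species list contents are invented for illustration; only the column names come from the example definitions above.

```r
library(validate)

# A YAML rule file in the layout the validate package reads.
rules_file <- tempfile(fileext = ".yaml")
writeLines(c(
  "rules:",
  "- expr: Sepal.L. > 0",
  "  name: sepal_length_positive",
  "  label: Sepal length is positive",
  "  description: Sepal length is measured in mm and must be greater than zero.",
  "- expr: Petal.W. < Petal.L.",
  "  name: petal_shape",
  "  label: Petal is longer than wide",
  "  description: Petal width is expected to be smaller than petal length."
), rules_file)

rules <- validator(.file = rules_file)

# Illustrative category list and data sheet (values are made up).
species_list <- data.frame(group_id = c("SetosaMBG", "VersicolorMBG"))
sheet <- data.frame(
  group_id = c("SetosaMBG", "UnknownXYZ"),
  Sepal.L. = c(5.1, -1),
  Petal.L. = c(1.4, 1.5),
  Petal.W. = c(0.2, 0.3)
)

# Evaluate every rule against every row and tabulate passes and fails.
result <- confront(sheet, rules)
rule_summary <- summary(result)[, c("name", "passes", "fails")]

# Category values in the sheet that do not appear in the reference list.
unknown_groups <- setdiff(unique(sheet$group_id), species_list$group_id)
```

Keeping the rules in a YAML text file means they can be reviewed and versioned in git alongside the data they describe.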
Enter data into tabular form
- CSV or Google Sheets
- add entry to list of URIs (Google Sheets URLs or CSV file paths)
Use CollaboratoR package to accumulate metadata
- read metadata: column specification and list of data sheet URIs
- read the list of data sheets (CSV file names or Google Sheets URLs)
- read validation rules, written in YAML and stored in an RData file
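Reading the tracking list and loading each data sheet could be sketched as follows. The `uri` column name and the helpers `is_google_sheet()` and `read_data_sheet()` are hypothetical, and the Google Sheets branch assumes the googlesheets4 package with authentication already configured; only the CSV branch is exercised here.

```r
library(readr)

# Hypothetical helper: decide whether a URI points at a Google Sheet.
is_google_sheet <- function(uri) {
  grepl("^https://docs\\.google\\.com/spreadsheets/", uri)
}

# Hypothetical helper: read one data sheet from either kind of source.
read_data_sheet <- function(uri) {
  if (is_google_sheet(uri)) {
    # Requires the googlesheets4 package and prior authentication.
    googlesheets4::read_sheet(uri)
  } else {
    # Read everything as text here; types are checked in a later step.
    read_csv(uri, col_types = cols(.default = col_character()))
  }
}

# A local data sheet and a tracking list that points at it.
sheet_file <- tempfile(fileext = ".csv")
writeLines(c("id,site", "SetosaMBG,Main Site"), sheet_file)

tracking_file <- tempfile(fileext = ".csv")
write_csv(data.frame(uri = sheet_file), tracking_file)

# Load every sheet named in the tracking list, keyed by its URI.
tracking <- read_csv(tracking_file, col_types = cols(uri = col_character()))
sheets <- lapply(tracking$uri, read_data_sheet)
names(sheets) <- tracking$uri
```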
Initiate automated validation and report
- read data files (from the list of URIs in CSV or Google Sheets)
- check data format against the schema (from the column spec)
- check category columns against lists (e.g. species lists)
- check data values against rules (from the YAML file)
- print a report of errors for the user
Save to git
- sheets that passed validation are committed to git
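Committing only the sheets that passed validation might look like the sketch below. It uses the gert package as one possible git interface, which is an assumption; the package itself may call git differently. The file names, validation outcomes, and commit author are made up for illustration.

```r
library(gert)

# A throwaway repository standing in for the project's data repository.
repo <- tempfile("validated-data-")
dir.create(repo)
git_init(repo)

# Hypothetical validation outcome per data sheet.
validation_passed <- c("site_a.csv" = TRUE, "site_b.csv" = FALSE)
for (f in names(validation_passed)) {
  writeLines("id,site", file.path(repo, f))
}

# Stage and commit only the files whose validation succeeded.
passed_files <- names(validation_passed)[validation_passed]
git_add(passed_files, repo = repo)
git_commit(
  "Add data sheets that passed validation",
  author = git_signature("Data Bot", "data-bot@example.org"),
  repo = repo
)

# Files now tracked by git; the failed sheet stays untracked.
committed <- git_ls(repo = repo)$path
```

Leaving failed sheets out of the commit keeps the repository history limited to data that satisfied the rules at the time it was recorded.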