Big questions, big data: starting small.

Large-scale applications can be overwhelming when developing new algorithms. Starting with small data subsets is essential: it keeps our progress trackable and helps us learn about our data quickly before scaling up our analysis.

Setting up a project

Before engaging with data manipulation, we need a place to host it. You can use build_project() to set up a standardized folder structure together with a README file describing the purpose of each folder. This structure mirrors the infrastructure of the MAS database. The more members of the lab use it, the easier it will be to exchange data and code.
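
As a minimal sketch, a call might look like the following; note that the argument names are assumptions for illustration only, so check the function's documentation for its actual signature.

    # Hypothetical call: create a standardized project skeleton
    # with a README file describing each folder.
    build_project(
      path = "~/projects/my_analysis",  # hypothetical: where to create the project
      readme = TRUE                     # hypothetical: write the README files
    )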

Determining what we need

Before integrating new, messy data into your project, you might want to look into the MAS data catalog. This registry offers an overview of each dataset in the MAS database and points us to existing documentation detailing how the data were produced.
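
If the catalog can be queried as a table (a hypothetical catalog.csv export with a dataset_name column is assumed here purely for illustration), a quick first look could be:

    # Hypothetical: browse an exported copy of the MAS data catalog
    catalog <- read.csv("catalog.csv")

    # Keep only the datasets relevant to our question
    subset(catalog, grepl("gps", dataset_name, ignore.case = TRUE))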

Compiling a sandbox

build_sandbox() helps us extract a subset of the MAS database that keeps the standards of the full database. This way, we can test our algorithms locally and later migrate the code into the main database to scale up our analysis (see the documentation on system variables and relative paths).
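
A sketch of compiling a sandbox, assuming hypothetical arguments for which datasets to extract and where to write them:

    # Hypothetical call: extract a small, standards-compliant subset
    # of the MAS database into a local folder.
    build_sandbox(
      datasets = c("gps_2021", "acoustic_2021"),  # hypothetical dataset names
      destination = "~/sandboxes/test_run"        # hypothetical local path
    )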

Registering data

Once we have our sandbox, we can use compile_metadata() to register all collected datasets. This function creates a portable flat database equivalent to the one used in the parent infrastructure. With this database, every MAS tool works with the sandbox just as it would in the parent database.
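
For illustration, registering the sandbox contents might look like this (the argument name is an assumption, not the documented interface):

    # Hypothetical call: scan the sandbox and build the flat
    # metadata database that the MAS tools expect.
    compile_metadata(
      path = "~/sandboxes/test_run"  # hypothetical: sandbox to register
    )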