find-data.Rmd
The MAS database has several types of data, including gridded (e.g. land cover), vector (e.g. range maps) and tabular (e.g. crop statistics). Because each of these formats has its own particularities, each requires a different form of access. Currently, `masDMT` offers two main access functions: `access_grid()` and `access_vector()`. Both functions require a product identifier, such as those described in this page.
`access_grid()` helps find and read gridded data, such as data derived from remote sensing. It provides access to one dataset at a time, returning it as a `SpatRaster` object with n layers, where n is the number of time steps matched by the query. In the following example, we access ESA’s CCI land cover, hosted under the identifier `CCI_landCover/landCover`, and limit our search to the period between 2001-01-01 and 2002-12-31. As shown below, this results in a `SpatRaster` covering the dataset’s full extent, with two layers, for 2001 and 2002, respectively.
access_grid('CCI_landCover/landCover', range=c('2001-01-01', '2002-12-31'))
## class : SpatRaster
## dimensions : 200, 266, 2 (nrow, ncol, nlyr)
## resolution : 0.002775493, 0.002776722 (x, y)
## extent : 23.33491, 24.0732, -1.811076, -1.255731 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## source : file64c43777121a_25796.vrt
## names : CCI_landCover/C~00_10arcSec.tif, CCI_landCover/C~00_00arcSec.tif
## time : 2001-01-01 to 2002-01-01
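The way a date range translates into layers can be sketched in base R. This is only an illustration, not the internals of `access_grid()`: the `records` table reuses the file names and dates listed for this dataset elsewhere on this page, and `select_by_range()` is a hypothetical helper that keeps every record whose time span overlaps the requested range.

```r
# Hypothetical catalogue of yearly files (paths and dates as listed on this page)
records <- data.frame(
  path = c("CCI_landCover/CCI_landCover-landCover_20010000_10arcSec.tif",
           "CCI_landCover/CCI_landCover-landCover_200200000_00arcSec.tif"),
  start = as.Date(c("2001-01-01", "2002-01-01")),
  end = as.Date(c("2001-12-31", "2002-12-31"))
)

# select_by_range() is an illustrative helper, not part of masDMT.
# A record is kept when its [start, end] interval overlaps the query range.
select_by_range <- function(records, range) {
  range <- as.Date(range)
  keep <- records$start <= range[2] & records$end >= range[1]
  records[keep, , drop = FALSE]
}

both_years <- select_by_range(records, c("2001-01-01", "2002-12-31"))
only_2002 <- select_by_range(records, c("2002-03-01", "2002-06-30"))
nrow(both_years)
format(only_2002$start)
```

Under this overlap rule, each selected record would contribute one layer to the result, which matches the two layers reported above.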
Unlike other raster processing packages developed for R, `terra` accounts for the growing data volumes handled by the remote sensing community. For that reason, most of its functions include arguments that enable the use of additional CPU cores. However, it still assumes that we want to process the full raster, which can exceed the available RAM when dealing with global datasets. Cropping functions exist, but they create a new file with the subset data, which occupies space in either physical or virtual memory. This can add a large, and arguably unnecessary, processing overhead that slows down our algorithms. Under such limitations, one might choose to crop the larger dataset before running an analysis. However, this can also be inadequate: if we routinely process data at multiple spatial and temporal scales, subsetting rasters in preparation for every situation costs both time and storage.
While `terra` can’t do on-the-fly subsetting, we can trick the package into accessing only a subset of the data at a time. To do so, `access_grid()` makes use of Virtual Rasters (VRT). Given a spatial bounding box, the function calls `gdalbuildvrt` and creates a VRT file in the default temporary folder. This file acts as a pointer that records where the desired raster data starts and which files compose the requested subset. To demonstrate this feature, we build on the previous example and request a specific spatial extent. Notice how the reported metadata changes.
access_grid('CCI_landCover/landCover', range=c('2001-01-01', '2002-12-31'), bbox=c(23.5, -1.5, 24, -1.3))
## class : SpatRaster
## dimensions : 72, 180, 2 (nrow, ncol, nlyr)
## resolution : 0.002775493, 0.002776722 (x, y)
## extent : 23.5, 23.99959, -1.499924, -1.3 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## source : file64c4788932f7_25796.vrt
## names : CCI_landCover/C~00_10arcSec.tif, CCI_landCover/C~00_00arcSec.tif
## time : 2001-01-01 to 2002-01-01
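The VRT idea can be tried outside `masDMT` with a minimal sketch. It assumes the `terra` package is installed and uses `terra::vrt()`, which wraps `gdalbuildvrt`; the two small rasters are synthetic stand-ins for yearly files, and the options shown (`-separate` to stack files as layers, `-te` to set the target extent) are one plausible configuration, not necessarily the one `access_grid()` uses.

```r
# Sketch only: assumes the terra package is installed; skipped otherwise.
if (requireNamespace("terra", quietly = TRUE)) {
  # Two small synthetic rasters standing in for yearly files
  files <- file.path(tempdir(), c("demo_2001.tif", "demo_2002.tif"))
  for (f in files) {
    r <- terra::rast(nrows = 20, ncols = 20,
                     xmin = 23, xmax = 25, ymin = -2, ymax = 0)
    terra::values(r) <- rep_len(1:10, terra::ncell(r))
    terra::writeRaster(r, f, overwrite = TRUE)
  }

  # Bounding box ordered as xmin, ymin, xmax, ymax (the order gdalbuildvrt -te expects)
  bbox <- c(23.5, -1.5, 24, -1.3)

  # Build a VRT that stacks both files and points only at the requested window
  vrt_file <- tempfile(fileext = ".vrt")
  v <- terra::vrt(files, vrt_file,
                  options = c("-separate", "-te", bbox),
                  overwrite = TRUE)
  print(v)
}
```

The VRT is a small XML file, so no pixel values are copied when it is created; data are read from the source files only when the subset is actually used.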
`access_vector()` helps find and read vector data, such as data derived from field surveys. Just like `access_grid()`, it requires a dataset identifier and accepts a spatial bounding box to perform spatial subsetting. However, it does not allow for temporal queries. This choice reflects the fact that vector data from different origins can have very different tabular structures with contrasting fields, making it difficult to establish a standard content. However, given that all vector data in our database is stored as spatialite objects, we can use dedicated SQL queries. We can provide `access_vector()` with a bounding box (for spatial subsetting, just like with `access_grid()`) and an SQL query (for thematic subsetting), making it easier to handle very large vector objects. In this example, we use `access_vector()` to extract country polygons for a spatial subset and for a thematic subset. Note that spatial subsetting does not clip vector data, but rather returns the entries that intersect the bounding box.
# read full dataset
access_vector('GADM/level0')
## Reading layer `level0' from data source
## `C:\Users\rr70wedu\AppData\Local\Temp\RtmpQZ0Zxa\temp_libpath752c110149ba\masDMT\extdata\GADM\GADM-level0_20200000_NA.spatialite'
## using driver `SQLite'
## Simple feature collection with 3 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -3.25492 ymin: 4.738751 xmax: 3.851701 ymax: 12.41835
## Geodetic CRS: WGS 84
## Simple feature collection with 3 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -3.25492 ymin: 4.738751 xmax: 3.851701 ymax: 12.41835
## Geodetic CRS: WGS 84
## gid_0 name_0 GEOMETRY
## 1 BEN Benin MULTIPOLYGON (((1.941528 6....
## 2 GHA Ghana MULTIPOLYGON (((-2.035416 4...
## 3 TGO Togo MULTIPOLYGON (((1.288195 6....
# read spatial subset
access_vector('GADM/level0', bbox=c(-2, 1, 4, 5))
## Reading layer `level0' from data source
## `C:\Users\rr70wedu\AppData\Local\Temp\RtmpQZ0Zxa\temp_libpath752c110149ba\masDMT\extdata\GADM\GADM-level0_20200000_NA.spatialite'
## using driver `SQLite'
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -3.25492 ymin: 4.738751 xmax: 1.19177 ymax: 11.1733
## Geodetic CRS: WGS 84
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -3.25492 ymin: 4.738751 xmax: 1.19177 ymax: 11.1733
## Geodetic CRS: WGS 84
## gid_0 name_0 GEOMETRY
## 1 GHA Ghana MULTIPOLYGON (((-2.035416 4...
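The "intersect, not clip" behaviour can be illustrated with a base R sketch. It is a simplification: it compares feature bounding boxes (copied from the printed output on this page) with the query box, whereas a real spatial filter tests the geometries themselves. `intersects_bbox()` is a hypothetical helper, and the query box is assumed to be ordered as `c(xmin, ymin, xmax, ymax)`.

```r
# Feature bounding boxes copied from the printed output on this page
features <- data.frame(
  gid_0 = c("BEN", "GHA"),
  xmin = c(0.774574, -3.25492),
  ymin = c(6.23514, 4.738751),
  xmax = c(3.851701, 1.19177),
  ymax = c(12.41835, 11.1733)
)

# intersects_bbox() is an illustrative helper, not part of masDMT.
# bbox is assumed to be ordered as c(xmin, ymin, xmax, ymax).
intersects_bbox <- function(features, bbox) {
  features$xmin <= bbox[3] & features$xmax >= bbox[1] &
    features$ymin <= bbox[4] & features$ymax >= bbox[2]
}

bbox <- c(-2, 1, 4, 5)
selected <- features[intersects_bbox(features, bbox), ]
selected
```

Only Ghana overlaps the query box, and its bounding box is returned unchanged, extending well beyond the query box, because the feature is selected rather than cut.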
If you are new to a given dataset, you might feel unsure how to proceed, especially if it is memory-intensive. Building thematic queries is difficult without knowing which fields are available, but reading every record just for that purpose can be prohibitive. For such situations, `access_vector()` allows users to retrieve only the first records in a dataset. Through the `limit` argument, you can specify how many records to read. In the example below, we’ll read the first entry.
# select the first record
access_vector('GADM/level0', limit=1)
## Reading layer `level0' from data source
## `C:\Users\rr70wedu\AppData\Local\Temp\RtmpQZ0Zxa\temp_libpath752c110149ba\masDMT\extdata\GADM\GADM-level0_20200000_NA.spatialite'
## using driver `SQLite'
## Simple feature collection with 3 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -3.25492 ymin: 4.738751 xmax: 3.851701 ymax: 12.41835
## Geodetic CRS: WGS 84
## Simple feature collection with 3 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -3.25492 ymin: 4.738751 xmax: 3.851701 ymax: 12.41835
## Geodetic CRS: WGS 84
## gid_0 name_0 GEOMETRY
## 1 BEN Benin MULTIPOLYGON (((1.941528 6....
## 2 GHA Ghana MULTIPOLYGON (((-2.035416 4...
## 3 TGO Togo MULTIPOLYGON (((1.288195 6....
# select a thematic subset (Benin) using an SQL query
access_vector('GADM/level0', query='SELECT * FROM level0 WHERE gid_0 = "BEN"')
## Reading query `SELECT * FROM level0 WHERE gid_0 = "BEN"' from data source `C:\Users\rr70wedu\AppData\Local\Temp\RtmpQZ0Zxa\temp_libpath752c110149ba\masDMT\extdata\GADM\GADM-level0_20200000_NA.spatialite'
## using driver `SQLite'
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 0.774574 ymin: 6.23514 xmax: 3.851701 ymax: 12.41835
## Geodetic CRS: WGS 84
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 0.774574 ymin: 6.23514 xmax: 3.851701 ymax: 12.41835
## Geodetic CRS: WGS 84
## gid_0 name_0 GEOMETRY
## 1 BEN Benin MULTIPOLYGON (((1.941528 6....
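For readers who prefer to query a file directly, the same two steps (peek at one record, then filter by a field) can be sketched with `sf::st_read()` and its `query` argument. This assumes the `sf` package is installed. The layer below is synthetic, with points instead of country polygons, and it is written to a GeoPackage as a stand-in for a spatialite file, since both are SQLite-based.

```r
# Sketch only: assumes the sf package is installed; skipped otherwise.
if (requireNamespace("sf", quietly = TRUE)) {
  # Synthetic stand-in for a country layer (points instead of polygons)
  countries <- sf::st_sf(
    gid_0 = c("BEN", "GHA", "TGO"),
    name_0 = c("Benin", "Ghana", "Togo"),
    geometry = sf::st_sfc(
      sf::st_point(c(2.3, 9.3)),
      sf::st_point(c(-1.0, 7.9)),
      sf::st_point(c(1.0, 8.6)),
      crs = 4326
    )
  )
  dsn <- tempfile(fileext = ".gpkg")
  sf::st_write(countries, dsn, layer = "level0", quiet = TRUE)

  # Peek at the first record to learn which fields exist
  first_record <- sf::st_read(dsn, query = "SELECT * FROM level0 LIMIT 1",
                              quiet = TRUE)
  # Thematic subset using one of those fields
  benin <- sf::st_read(dsn, query = "SELECT * FROM level0 WHERE gid_0 = 'BEN'",
                       quiet = TRUE)
  names(first_record)
  benin$name_0
}
```

Reading a single record first keeps memory use low while revealing the field names needed to write the thematic query.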
Both `access_grid()` and `access_vector()` provide an “easy way in” to the MAS database. However, you might find this limiting if you have your own preferences regarding how to handle raster and vector data. But don’t fear: there is an alternative. Both functions use `list_data()` to find relevant data entries, and so can you. When provided with a unique data identifier (or a vector with several identifiers), the function identifies all the corresponding records and returns a `data.frame` with their characteristics. This includes, e.g., the spatial extent of the records, their format and data type, their start and end dates and, perhaps more importantly, the location of the corresponding files (i.e. `path`). The example below shows the query results for `CCI_landCover/landCover` and `GADM/level0`.
dataset | subdataset | resolution | pixel_size | path | format | type | nr_bytes | nr_files | start | end | modified |
---|---|---|---|---|---|---|---|---|---|---|---|
CCI_landCover | landCover | 10arcSec | 0.0027755 | CCI_landCover/CCI_landCover-landCover_20010000_10arcSec.tif | grid | INT1U | 1 | NA | 2001-01-01 | 2001-12-31 | 2021-10-25 17:14:22 |
CCI_landCover | landCover | 00arcSec | 0.0027755 | CCI_landCover/CCI_landCover-landCover_200200000_00arcSec.tif | grid | INT1U | 1 | NA | 2002-01-01 | 2002-12-31 | 2021-10-25 17:14:22 |
GADM | level0 | NA | NA | GADM/GADM-level0_20200000_NA.spatialite | vector | NA | NA | 3 | 2020-01-01 | 2020-12-31 | 2021-10-25 17:14:22 |
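A table like this one can feed your own reading code. The sketch below rebuilds a few of its columns by hand, rather than calling `list_data()`, and assumes that `path` is relative to a database root folder; `db_root` is a hypothetical placeholder to replace with your own location.

```r
# Subset of the columns shown above, rebuilt by hand for illustration
records <- data.frame(
  dataset = c("CCI_landCover", "CCI_landCover", "GADM"),
  path = c("CCI_landCover/CCI_landCover-landCover_20010000_10arcSec.tif",
           "CCI_landCover/CCI_landCover-landCover_200200000_00arcSec.tif",
           "GADM/GADM-level0_20200000_NA.spatialite"),
  format = c("grid", "grid", "vector")
)

# Hypothetical root folder of the database (an assumption, not a masDMT default)
db_root <- "/path/to/MAS"

# Split the file locations by data format
grid_files <- file.path(db_root, records$path[records$format == "grid"])
vector_files <- file.path(db_root, records$path[records$format == "vector"])
grid_files
vector_files
```

From here, the grid paths could be passed to a raster reader such as `terra::rast()` and the vector paths to `sf::st_read()`, using whichever options you prefer.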
If no identifier is given, the function will return a summary table, as demonstrated below. This table aggregates multi-date records into single entries and reports on the corresponding start and end dates. This can be useful to acquire an overview of the database, helping you e.g. recall dataset identifiers or verify the existence of a given dataset.
dataset | subdataset | resolution | pixel_size | path | format | type | nr_bytes | nr_files | start | end | modified |
---|---|---|---|---|---|---|---|---|---|---|---|
CCI_landCover | landCover | 10arcSec | 0.0027755 | CCI_landCover/CCI_landCover-landCover_20010000_10arcSec.tif | grid | INT1U | 1 | NA | 2001-01-01 | 2001-12-31 | 2021-10-25 17:14:22 |
CCI_landCover | landCover | 00arcSec | 0.0027755 | CCI_landCover/CCI_landCover-landCover_200200000_00arcSec.tif | grid | INT1U | 1 | NA | 2002-01-01 | 2002-12-31 | 2021-10-25 17:14:22 |
GADM | level0 | NA | NA | GADM/GADM-level0_20200000_NA.spatialite | vector | NA | NA | 3 | 2020-01-01 | 2020-12-31 | 2021-10-25 17:14:22 |
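The aggregation described for the summary table can be sketched in base R. This is an illustration of the idea, not the code behind `list_data()`: records sharing a dataset and subdataset are collapsed into one entry that keeps the earliest start and the latest end date. The `nr_records` column is an addition made here for clarity.

```r
# Per-file records, with dates as listed on this page
records <- data.frame(
  dataset = c("CCI_landCover", "CCI_landCover", "GADM"),
  subdataset = c("landCover", "landCover", "level0"),
  start = as.Date(c("2001-01-01", "2002-01-01", "2020-01-01")),
  end = as.Date(c("2001-12-31", "2002-12-31", "2020-12-31"))
)

# Group records by their "dataset/subdataset" identifier
id <- paste(records$dataset, records$subdataset, sep = "/")

# Collapse each group into one row spanning its full time range
summary_table <- do.call(rbind, lapply(split(records, id), function(x) {
  data.frame(
    dataset = x$dataset[1],
    subdataset = x$subdataset[1],
    nr_records = nrow(x),
    start = min(x$start),
    end = max(x$end)
  )
}))
summary_table
```

The row names of `summary_table` double as the identifiers expected by the access functions, which makes the table convenient for looking them up.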