Searching for data: the databrowser method
All files available in the project are scanned and indexed by a data search server. This allows you to query the server with an almost immediate response time. To search for data you can use the databrowser method of the freva Python module. Let’s import the freva module first:
[1]:
import freva
Now inspect the help menu:
[2]:
help(freva.databrowser)
Help on function databrowser in module freva._databrowser:
databrowser(*, multiversion: 'bool' = False, batch_size: 'int' = 5000, uniq_key: "Literal['file', 'uri']" = 'file', time: 'str' = '', time_select: "Literal['flexible', 'strict', 'file']" = 'flexible', **search_facets: 'Union[str, list[str], int]') -> 'Union[dict[str, dict[str, int]], dict[str, list[str]], Iterator[str], int]'
Find data in the system.
You can either search for files or data facets (variable, model, ...)
that are available. The query is of the form key=value. <value> might
use *, ? as wildcards or any regular expression.
Parameters
----------
**search_facets: Union[str, Path, in, list[str]]
The facets to be applied in the data search. If not given
the whole dataset will be queried.
time: str
Special search facet to refine/subset search results by time.
This can be a string representation of a time range or a single
time step. The time steps have to follow ISO-8601. Valid strings are
``%Y-%m-%dT%H:%M`` to ``%Y-%m-%dT%H:%M`` for time ranges and
``%Y-%m-%dT%H:%M``. **Note**: You don't have to give the full string
format to subset time steps ``%Y``, ``%Y-%m`` etc are also valid.
time_select: str, default: flexible
Operator that specifies how the time period is selected. Choose from
flexible (default), strict or file. ``strict`` returns only those files
that have the *entire* time period covered. The time search ``2000 to
2012`` will not select files containing data from 2010 to 2020 with
the ``strict`` method. ``flexible`` will select those files as
``flexible`` returns those files that have either start or end period
covered. ``file`` will only return files where the entire time
period is contained within *one single* file.
uniq_key: str, default: file
Chose if the solr search query should return paths to files or
uris, uris will have the file path along with protocol of the storage
system. Uris can be useful if the the search query result should be
used libraries like fsspec.
multiversion: bool, default: False
Select all versions and not just the latest version (default).
batch_size: int, default: 5000
Size of the search query.
Returns
-------
Iterator :
If ``all_facets`` is False and ``facet`` is None an
iterator with results.
Example
-------
Search for files in the system:
.. execute_code::
import freva
files = freva.databrowser(project='obs*', institute='cpc',
time_frequency='??min',
variable='pr')
print(files)
print(next(files))
for file in files:
print(file)
break
Search for files between a two given time steps:
.. execute_code::
import freva
file_range = freva.databrowser(project="obs*", time="2016-09-02T22:15 to 2016-10")
for file in file_range:
print(file)
The default method for selecting time periods is ``flexible``, which means
all files are selected that cover at least start or end date. The
``strict`` method implies that the *entire* search time period has to be
covered by the files. Using the ``strict`` method in the example above would
only yield one file because the first file contains time steps prior to the
start of the time period:
.. execute_code::
import freva
file_range = freva.databrowser(project="obs*", time="2016-09-02T22:15 to 2016-10", time_select="strict")
for file in file_range:
print(file)
The databrowser expects a list of key=value pairs. The order of the pairs doesn’t matter. Most importantly, you don’t need to split the search according to the type of data you are looking for: you can search observations, reanalysis, and model data all at the same time. All searches are also case insensitive. Instead of file paths you can also search for the attributes themselves, for example the list of available variables that satisfy certain constraints (e.g. sampled 6-hourly, from a certain model, etc.).
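The wildcard and case-insensitivity rules can be illustrated without a running search server. The following is a minimal sketch that uses Python's fnmatch module on a small, made-up catalogue. The catalogue entries and the helper match_facets are hypothetical; they only mimic the matching behaviour described here and say nothing about how the search server implements it.

```python
from fnmatch import fnmatchcase

# Hypothetical, hand-written catalogue entries (not real search results).
CATALOGUE = [
    {"project": "observations", "model": "cpc", "variable": "pr", "time_frequency": "30min"},
    {"project": "observations", "model": "cpc", "variable": "tas", "time_frequency": "1hr"},
    {"project": "reanalysis", "model": "era5", "variable": "pr", "time_frequency": "6hr"},
]


def match_facets(entry, **search_facets):
    """Return True if every key=value pair matches the entry.

    Values may contain the wildcards * and ?; the comparison ignores case.
    """
    for key, pattern in search_facets.items():
        value = entry.get(key, "")
        if not fnmatchcase(value.lower(), str(pattern).lower()):
            return False
    return True


# The order of the key=value pairs does not influence the result.
hits = [
    entry
    for entry in CATALOGUE
    if match_facets(entry, variable="PR", project="obs*", time_frequency="??min")
]
print(hits)
```

Only the first entry satisfies all three constraints at once, because every key=value pair narrows the result further.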
[3]:
files = freva.databrowser(project="observations", variable="pr", model="cp*")
files
[3]:
<generator object SolrFindFiles._search at 0x7fe156c1eb90>
This returns a so-called iterator. The advantage of an iterator is that results are only loaded into memory when they are needed; nothing is preloaded. To access the files you can either loop through the iterator or convert it to a list:
[4]:
list(files)
[4]:
['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022200-201609022230.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022100-201609022130.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022000-201609022030.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021900-201609021930.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021800-201609021830.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021700-201609021730.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021600-201609021630.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021500-201609021530.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021400-201609021430.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021300-201609021330.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021200-201609021230.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021100-201609021130.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021000-201609021030.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020900-201609020930.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020800-201609020830.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020700-201609020730.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020600-201609020630.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020500-201609020530.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020400-201609020430.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020300-201609020330.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020200-201609020230.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020100-201609020130.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020000-201609020030.nc']
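Because the result is an iterator, it can only be consumed once. The sketch below shows this general Python behaviour with a hypothetical stand-in generator, fake_databrowser, which is not part of freva and yields made-up paths.

```python
from itertools import islice


def fake_databrowser(n_files):
    """Hypothetical stand-in for a search that yields paths one by one."""
    for i in range(n_files):
        yield f"/data/pr_file_{i:02d}.nc"


files = fake_databrowser(24)
first = next(files)            # fetch only the first result
some = list(islice(files, 3))  # fetch the next three results
rest = list(files)             # consume everything that is left
leftover = list(files)         # the iterator is now exhausted
print(first, len(some), len(rest), len(leftover))
```

If the results are needed more than once, it is usually simplest to store the list in a variable instead of iterating over the generator again.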
In some cases it is useful to know how many files the databrowser finds for certain search constraints. In such cases you can use the count flag to count the number of matching files instead of getting the files themselves.
[5]:
freva.databrowser(project="observations", variable="pr", model="cp*", count=True)
[5]:
<generator object SolrFindFiles._search at 0x7fe1561482e0>
Sometimes it is useful to subset the data you’re interested in by time. To do so you can use the time search key to select single time steps or whole time ranges. For example, let’s get the files for a certain time range:
[6]:
list(freva.databrowser(project="observations", time="2016-09-02T22:15 to 2016-10"))
[6]:
['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc',
'/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022200-201609022230.nc']
The default method for selecting time periods is flexible, which means that all files are selected that cover at least the start or the end date. The strict method implies that the entire search time period has to be covered by the files. Using the strict method in the example above would only yield one file, because the first file contains time steps prior to the start of the time period:
[7]:
list(freva.databrowser(project="observations", time="2016-09-02T22:15 to 2016-10", time_select="strict"))
[7]:
['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc']
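The three time_select modes can be thought of as interval comparisons between the time range of a file and the time range of the search. The sketch below is one reading of the help text, not the server's implementation: the helper is_selected is hypothetical, and the end of the search period (2016-10) is assumed to mean the end of October.

```python
from datetime import datetime


def is_selected(file_start, file_end, search_start, search_end, time_select="flexible"):
    """Decide whether a file is selected (hypothetical helper)."""
    if time_select == "flexible":
        # The file overlaps the search period.
        return file_start <= search_end and file_end >= search_start
    if time_select == "strict":
        # The file lies entirely within the search period.
        return search_start <= file_start and file_end <= search_end
    if time_select == "file":
        # The whole search period lies within this one file.
        return file_start <= search_start and search_end <= file_end
    raise ValueError(f"unknown time_select: {time_select!r}")


# Assumed bounds for the search "2016-09-02T22:15 to 2016-10".
search = (datetime(2016, 9, 2, 22, 15), datetime(2016, 10, 31, 23, 59))

# Time ranges taken from three of the file names listed above.
files = {
    "201609022100-201609022130": (datetime(2016, 9, 2, 21, 0), datetime(2016, 9, 2, 21, 30)),
    "201609022200-201609022230": (datetime(2016, 9, 2, 22, 0), datetime(2016, 9, 2, 22, 30)),
    "201609022300-201609022330": (datetime(2016, 9, 2, 23, 0), datetime(2016, 9, 2, 23, 30)),
}

flexible = [name for name, (start, end) in files.items() if is_selected(start, end, *search)]
strict = [
    name
    for name, (start, end) in files.items()
    if is_selected(start, end, *search, time_select="strict")
]
print(flexible)
print(strict)
```

Under these assumptions the flexible mode keeps the two files that overlap the search period, while strict keeps only the file that starts after 22:15, which is consistent with the two outputs shown above.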
The time format has to follow the ISO-8601 standard. Time ranges are indicated by the to keyword, such as 2000 to 2100 or 2000-01 to 2100-12 and the like. Single time steps are given without the to keyword.
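To make the structure of such time strings concrete, here is a minimal sketch that splits a string on the to keyword and parses each part with the standard library. The helpers parse_step and parse_time are hypothetical. In this sketch a partial date such as 2016-10 resolves to its earliest instant, which is an assumption; how the search server expands partial dates is not covered here.

```python
from datetime import datetime

# Accepted layouts, from most to least specific.
FORMATS = ("%Y-%m-%dT%H:%M", "%Y-%m-%dT%H", "%Y-%m-%d", "%Y-%m", "%Y")


def parse_step(text):
    """Parse a single, possibly partial, ISO-8601 time step."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"not a supported time step: {text!r}")


def parse_time(time):
    """Return (start, end); a single time step yields start == end."""
    start, sep, end = time.partition(" to ")
    if not sep:
        step = parse_step(start)
        return step, step
    return parse_step(start), parse_step(end)


print(parse_time("2016-09-02T22:15 to 2016-10"))
print(parse_time("2000"))
```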
You might also want to know which values an attribute can take after a certain search is done. For this you use the facet flag (facets are the attributes used to search for and subset the data). For example, to see all facets that are available in the observations project:
[8]:
freva.databrowser(project="observations", all_facets=True)
[8]:
<generator object SolrFindFiles._search at 0x7fe156148580>
Likewise you can inspect all values of the model facet in the databrowser:
[9]:
freva.databrowser(facet="model")
[9]:
<generator object SolrFindFiles._search at 0x7fe1561486d0>
Note: If you don’t give any search constraints, as in the case above, the command will query the whole data server.
You can also retrieve information on how many facets are found by the databrowser by giving the count flag:
[10]:
freva.databrowser(facet="model", count=True)
[10]:
<generator object SolrFindFiles._search at 0x7fe156148820>
Reverse search is also possible. You can query the metadata of a given file:
[11]:
file_to_query = next(freva.databrowser()) # Get a file
file_to_query
[11]:
'/tmp/user_data/user-runner/eur-11b/clex/UM-RA2T/Bias-correct/hr/user_data/hr/r0i0p0/v20241114/tas/tas_hr_UM-RA2T_Bias-correct_r0i0p0_197001041800-197001050300.nc'
[12]:
freva.databrowser(file=file_to_query, all_facets=True)
[12]:
<generator object SolrFindFiles._search at 0x7fe156148430>
Example: Using the databrowser to open datasets with xarray
[13]:
import xarray as xr
dset = xr.open_mfdataset(freva.databrowser(variable="pr", project="observations"), combine="by_coords")
dset
[13]:
<xarray.Dataset> Size: 54MB
Dimensions:  (time: 48, lat: 412, lon: 687)
Coordinates:
  * time     (time) datetime64[ns] 384B 2016-09-02 ... 2016-09-02T23:30:00
  * lon      (lon) float32 3kB 255.0 255.1 255.2 255.3 ... 304.8 304.9 305.0
  * lat      (lat) float32 2kB 15.06 15.14 15.21 15.28 ... 44.83 44.9 44.97
Data variables:
    pr       (time, lat, lon) float32 54MB dask.array<chunksize=(1, 1, 687), meta=np.ndarray>
Attributes: (12/44)
    CDI:                    Climate Data Interface version 1.9.8 (https://mpi...
    history:                Fri Jun 18 18:30:24 2021: cdo -O -s -f nc4 -z zip...
    Conventions:            CF-1.6
    geospatial_bounds:      POLYGON ((-59.363 -180, 59.363 -180, 59.363 180, ...
    time_coverage_end:      1998-01-31:00.00
    time_coverage_start:    1998-01-01:00.00
    ...                     ...
    grid:                   8 x 8 km x km
    grid_label:             gn
    calendar:               gregorian
    cmor_version:           2.9.1
    initialization_method:  1
    physics_version:        1