Searching for data: the databrowser method

All files available in the project are scanned and indexed by a data search server, so queries return almost immediately. To search for data from Python, you can use the databrowser method of the freva Python module. Let’s import the freva module first:

[1]:
import freva

Now inspect the help menu:

[2]:
help(freva.databrowser)
Help on function databrowser in module freva._databrowser:

databrowser(*, multiversion: 'bool' = False, batch_size: 'int' = 5000, uniq_key: "Literal['file', 'uri']" = 'file', time: 'str' = '', time_select: "Literal['flexible', 'strict', 'file']" = 'flexible', **search_facets: 'Union[str, list[str], int]') -> 'Union[dict[str, dict[str, int]], dict[str, list[str]], Iterator[str], int]'
    Find data in the system.

    You can either search for files or data facets (variable, model, ...)
    that are available. The query is of the form key=value. <value> might
    use *, ? as wildcards or any regular expression.

    Parameters
    ----------
    **search_facets: Union[str, Path, in, list[str]]
        The facets to be applied in the data search. If not given
        the whole dataset will be queried.
    time: str
        Special search facet to refine/subset search results by time.
        This can be a string representation of a time range or a single
        time step. The time steps have to follow ISO-8601. Valid strings are
        ``%Y-%m-%dT%H:%M`` to ``%Y-%m-%dT%H:%M`` for time ranges and
        ``%Y-%m-%dT%H:%M``. **Note**: You don't have to give the full string
        format to subset time steps ``%Y``, ``%Y-%m`` etc are also valid.
    time_select: str, default: flexible
        Operator that specifies how the time period is selected. Choose from
        flexible (default), strict or file. ``strict`` returns only those files
        that have the *entire* time period covered. The time search ``2000 to
        2012`` will not select files containing data from 2010 to 2020 with
        the ``strict`` method. ``flexible`` will select those files as
        ``flexible`` returns those files that have either start or end period
        covered. ``file`` will only return files where the entire time
        period is contained within *one single* file.
    uniq_key: str, default: file
        Chose if the solr search query should return paths to files or
        uris, uris will have the file path along with protocol of the storage
        system. Uris can be useful if the the search query result should be
        used libraries like fsspec.
    multiversion: bool, default: False
        Select all versions and not just the latest version (default).
    batch_size: int, default: 5000
        Size of the search query.

    Returns
    -------
    Iterator :
        If ``all_facets`` is False and ``facet`` is None an
        iterator with results.


    Example
    -------

    Search for files in the system:

    .. execute_code::

        import freva
        files = freva.databrowser(project='obs*', institute='cpc',
                                  time_frequency='??min',
                                  variable='pr')
        print(files)
        print(next(files))
        for file in files:
            print(file)
            break

    Search for files between a two given time steps:

    .. execute_code::

        import freva
        file_range = freva.databrowser(project="obs*", time="2016-09-02T22:15 to 2016-10")
        for file in file_range:
            print(file)

    The default method for selecting time periods is ``flexible``, which means
    all files are selected that cover at least start or end date. The
    ``strict`` method implies that the *entire* search time period has to be
    covered by the files. Using the ``strict`` method in the example above would
    only yield one file because the first file contains time steps prior to the
    start of the time period:

    .. execute_code::

        import freva
        file_range = freva.databrowser(project="obs*", time="2016-09-02T22:15 to 2016-10", time_select="strict")
        for file in file_range:
            print(file)

The databrowser expects a list of key=value pairs. The order of the pairs doesn’t matter. You don’t need to split the search by the type of data you are looking for: observations, reanalysis, and model data can all be searched at the same time. All searches are case insensitive. You can also search for attributes themselves instead of file paths. For example, you can list the variables that satisfy a certain constraint (e.g. sampled 6-hourly, from a certain model, etc.).
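The wildcard and case-insensitivity rules can be mimicked locally with Python's `fnmatch` module. The sketch below only illustrates the matching semantics; it does not show how the search server evaluates a query. The `matches` helper and the sample values are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(value: str, pattern: str) -> bool:
    """Case-insensitive glob match (* = any run of characters, ? = one character)."""
    # Lower-casing both sides makes the comparison case insensitive,
    # independent of the operating system.
    return fnmatchcase(value.lower(), pattern.lower())


# Hypothetical facet values, chosen for illustration only.
models = ["CPC", "cpc-v2", "ERA5", "MPI-ESM"]
print([m for m in models if matches(m, "cp*")])

time_frequencies = ["30min", "1hr", "6hr", "day"]
print([t for t in time_frequencies if matches(t, "??min")])
```

Here `cp*` selects every value starting with "cp" in any letter case, and `??min` selects values made of exactly two characters followed by "min".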

[3]:
files = freva.databrowser(project="observations", variable="pr", model="cp*")
files
[3]:
<generator object SolrFindFiles._search at 0x7f110030d3f0>

This returns a so-called iterator. The advantage of an iterator is that results are only loaded into memory when they are needed; nothing is preloaded. To access the files you can either loop through the iterator or convert it to a list:

[4]:
list(files)
[4]:
['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022200-201609022230.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022100-201609022130.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022000-201609022030.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021900-201609021930.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021800-201609021830.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021700-201609021730.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021600-201609021630.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021500-201609021530.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021400-201609021430.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021300-201609021330.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021200-201609021230.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021100-201609021130.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021000-201609021030.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020900-201609020930.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020800-201609020830.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020700-201609020730.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020600-201609020630.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020500-201609020530.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020400-201609020430.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020300-201609020330.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020200-201609020230.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020100-201609020130.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020000-201609020030.nc']
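The lazy behaviour of such an iterator can be illustrated with a plain Python generator. This is a minimal sketch: `fake_search` is a hypothetical stand-in that yields made-up paths and is not part of freva.

```python
from itertools import islice
from typing import Iterator


def fake_search(n_files: int) -> Iterator[str]:
    """Hypothetical stand-in for a search; yields one path at a time."""
    for i in range(n_files):
        # Each path is only created when the consumer asks for it.
        yield f"/data/pr_file_{i:03d}.nc"


results = fake_search(1000)

# Preview the first three results without producing the other 997.
first_three = list(islice(results, 3))
print(first_three)

# The iterator continues where the preview stopped.
remaining = list(results)
print(len(remaining))

# A generator can be consumed only once; afterwards it is empty.
print(list(results))
```

If the results are needed more than once, convert the iterator to a list first and reuse the list.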

In some cases it might be useful to know how many files the databrowser finds for certain search constraints. In such cases you can use the count flag to count the number of found files instead of getting the files themselves.

[5]:
freva.databrowser(project="observations", variable="pr", model="cp*", count=True)
[5]:
<generator object SolrFindFiles._search at 0x7f110030d2a0>

Sometimes it might be useful to subset the data you’re interested in by time. To do so you can use the time search key to subset time steps and whole time ranges. For example, let’s get the files for a certain time range:

[6]:
list(freva.databrowser(project="observations", time="2016-09-02T22:15 to 2016-10"))
[6]:
['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022200-201609022230.nc']

The default method for selecting time periods is flexible, which means all files are selected that cover at least the start or end date. The strict method implies that the entire search time period has to be covered by the files. Using the strict method in the example above would only yield one file, because the other file starts at 22:00 and therefore contains time steps prior to the start of the time period (22:15):

[7]:
list(freva.databrowser(project="observations", time="2016-09-02T22:15 to 2016-10", time_select="strict"))
[7]:
['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc']
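The three selection methods can be reproduced locally as a rough sketch. It is one plausible reading of the help text, not freva's implementation. It assumes that the file period can be read from the `YYYYMMDDhhmm-YYYYMMDDhhmm` stamp in the file name and that the partial end date `2016-10` means the end of October; `file_period` and `select` are hypothetical helpers.

```python
import re
from datetime import datetime

# Matches a "<start>-<end>" stamp of the form YYYYMMDDhhmm-YYYYMMDDhhmm.
STAMP = re.compile(r"_(\d{12})-(\d{12})\.nc$")


def file_period(path: str) -> tuple[datetime, datetime]:
    """Read start and end time from the file name (assumed naming scheme)."""
    match = STAMP.search(path)
    if match is None:
        raise ValueError(f"no time stamp in {path!r}")
    start, end = (datetime.strptime(s, "%Y%m%d%H%M") for s in match.groups())
    return start, end


def select(paths, start, end, mode="flexible"):
    """Filter paths by comparing each file period with the search period."""
    selected = []
    for path in paths:
        f_start, f_end = file_period(path)
        if mode == "flexible":  # any overlap with the search period
            keep = f_start <= end and f_end >= start
        elif mode == "strict":  # file lies entirely inside the search period
            keep = f_start >= start and f_end <= end
        elif mode == "file":  # one single file covers the whole search period
            keep = f_start <= start and f_end >= end
        else:
            raise ValueError(f"unknown mode {mode!r}")
        if keep:
            selected.append(path)
    return selected


paths = [
    "pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc",
    "pr_30min_CPC_cmorph_r1i1p1_201609022200-201609022230.nc",
    "pr_30min_CPC_cmorph_r1i1p1_201609022100-201609022130.nc",
]
# Assumption: the partial end date "2016-10" is read as the end of October.
start, end = datetime(2016, 9, 2, 22, 15), datetime(2016, 10, 31, 23, 59)
print(select(paths, start, end, "flexible"))
print(select(paths, start, end, "strict"))
```

Under these assumptions the flexible method keeps the 22:00 and 23:00 files, while the strict method keeps only the 23:00 file, in line with the two outputs above.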

The time format has to follow the ISO-8601 standard. Time ranges are indicated by the to keyword, such as 2000 to 2100 or 2000-01 to 2100-12 and the like. Single time steps are given without the to keyword.
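To check a time string before sending a query, the format rules can be sketched with the standard library. This is an illustration only: `parse_step` and `parse_time` are hypothetical helpers, and partial values such as `2016-10` are simply parsed to the first instant of the period. How freva expands partial dates internally is not covered here.

```python
from datetime import datetime

# Accepted layouts, from most to least specific.
FORMATS = ("%Y-%m-%dT%H:%M", "%Y-%m-%d", "%Y-%m", "%Y")


def parse_step(text: str) -> datetime:
    """Parse a full or partial ISO-8601 time step."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next, less specific layout
    raise ValueError(f"not a supported time step: {text!r}")


def parse_time(query: str) -> tuple[datetime, datetime]:
    """Split a query on the 'to' keyword; a single step yields start == end."""
    first, sep, last = query.partition(" to ")
    start = parse_step(first)
    end = parse_step(last) if sep else start
    return start, end


print(parse_time("2016-09-02T22:15 to 2016-10"))
print(parse_time("2000"))
```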

You might also want to know which values an attribute can take after a certain search. For this you use the facet flag, or all_facets to list every facet (facets are the attributes used to search for and subset the data). For example, to see all facets that are available in the observations project:

[8]:
freva.databrowser(project="observations", all_facets=True)
[8]:
<generator object SolrFindFiles._search at 0x7f110030d7e0>

Likewise, you can inspect all values of the model facet in the databrowser:

[9]:
freva.databrowser(facet="model")
[9]:
<generator object SolrFindFiles._search at 0x7f110030d930>

Note: If you don’t give any search constraints, as in the case above, the command will query the whole data server.

You can also retrieve information on how many facets are found by the databrowser by giving the count flag:

[10]:
freva.databrowser(facet="model", count=True)
[10]:
<generator object SolrFindFiles._search at 0x7f110030da80>
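The return annotation in the help text lists `dict[str, list[str]]` and `dict[str, dict[str, int]]` among the possible results. Presumably these are the shapes of a facet listing without and with counts, although that mapping is an assumption here. The sketch below builds both shapes from a few hypothetical metadata records and does not call freva.

```python
from collections import Counter, defaultdict

# Hypothetical metadata records; real records come from the search server.
records = [
    {"project": "observations", "model": "cpc", "variable": "pr"},
    {"project": "observations", "model": "cpc", "variable": "pr"},
    {"project": "reanalysis", "model": "era5", "variable": "tas"},
]

# Tally how often each value occurs for every facet.
counts: dict[str, Counter] = defaultdict(Counter)
for record in records:
    for facet, value in record.items():
        counts[facet][value] += 1

# Shape 1: facet -> sorted list of distinct values.
facet_values = {facet: sorted(c) for facet, c in counts.items()}
# Shape 2: facet -> {value: number of records}.
facet_counts = {facet: dict(c) for facet, c in counts.items()}

print(facet_values["model"])
print(facet_counts["model"])
```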

Reverse search is also possible. You can query the metadata of a given file:

[11]:
file_to_query = next(freva.databrowser()) # Get a file
file_to_query
[11]:
'/tmp/user_data/user-runner/eur-11b/clex/UM-RA2T/Bias-correct/hr/user_data/hr/r0i0p0/v20231020/tas/tas_hr_UM-RA2T_Bias-correct_r0i0p0_197001041800-197001050300.nc'
[12]:
freva.databrowser(file=file_to_query, all_facets=True)
[12]:
<generator object SolrFindFiles._search at 0x7f110030dbd0>

Example: Using the databrowser to open datasets with xarray

[13]:
import xarray as xr
dset = xr.open_mfdataset(freva.databrowser(variable="pr", project="observations"), combine="by_coords")
dset
[13]:
<xarray.Dataset>
Dimensions:  (time: 48, lon: 687, lat: 412)
Coordinates:
  * time     (time) datetime64[ns] 2016-09-02 ... 2016-09-02T23:30:00
  * lon      (lon) float32 255.0 255.1 255.2 255.3 ... 304.7 304.8 304.9 305.0
  * lat      (lat) float32 15.06 15.14 15.21 15.28 ... 44.75 44.83 44.9 44.97
Data variables:
    pr       (time, lat, lon) float32 dask.array<chunksize=(1, 1, 687), meta=np.ndarray>
Attributes: (12/44)
    CDI:                    Climate Data Interface version 1.9.8 (https://mpi...
    history:                Fri Jun 18 18:30:24 2021: cdo -O -s -f nc4 -z zip...
    Conventions:            CF-1.6
    geospatial_bounds:      POLYGON ((-59.363 -180, 59.363 -180, 59.363 180, ...
    time_coverage_end:      1998-01-31:00.00
    time_coverage_start:    1998-01-01:00.00
    ...                     ...
    grid:                   8 x 8 km x km
    grid_label:             gn
    calendar:               gregorian
    cmor_version:           2.9.1
    initialization_method:  1
    physics_version:        1