Databrowser python module#

The following section gives an overview of the databrowser client Python module. Please see the Installation and configuration section for how to install and configure the library.

TLDR: Too long didn’t read#

To query the databrowser and search for data you have three different options. You can use the following methods:

  • freva_client.databrowser: The main class for searching data is the freva_client.databrowser class. After creating an instance of the databrowser class with your specific search constraints you can retrieve all files or URIs that match your search constraints. You can also retrieve a count of the number of objects matching the search, get an overview of the available metadata and create an intake-esm catalogue from your search. Searching for URIs instead of file paths can be useful to get information on the storage system where the files or object stores are located.

  • freva_client.databrowser.metadata_search(): This class method lists all search categories (facets) and their values.

  • freva_client.databrowser.count_values(): You can count the occurrences of search results with this method.
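
The three entry points above can be combined in one small helper. The following is only an illustrative sketch: summarise is a hypothetical name, and the class is passed in as an argument so that the helper can be tried out without a running databrowser server. With a configured installation you would pass freva_client.databrowser.

```python
from typing import Any, Dict


def summarise(browser_cls: Any, **search_keys: str) -> Dict[str, Any]:
    """Collect the three kinds of information described above.

    `browser_cls` is expected to behave like `freva_client.databrowser`.
    """
    db = browser_cls(**search_keys)  # search object, iterable over files or URIs
    return {
        "n_objects": len(db),  # number of objects matching the search
        "facets": browser_cls.metadata_search(**search_keys),
        "counts": browser_cls.count_values(**search_keys),
    }


# Usage, assuming freva_client is installed and configured:
#   from freva_client import databrowser
#   print(summarise(databrowser, experiment="cmorph"))
```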

Library Reference#

Below you can find more detailed documentation.

Client software freva evaluation system framework (freva):

Freva, the free evaluation system framework, is a data search and analysis platform developed by the atmospheric science community for the atmospheric science community. With the help of Freva, researchers can:

  • quickly and intuitively search for data stored at typical data centers that host many datasets.

  • create a common interface for user defined data analysis tools.

  • apply data analysis tools in a reproducible manner.

The code described here is currently in testing phase. The client and server library described in the documentation only support searching for data. If you need to apply data analysis plugins, please visit the

class freva_client.databrowser(*facets: str, uniq_key: Literal['file', 'uri'] = 'file', flavour: Literal['freva', 'cmip6', 'cmip5', 'cordex', 'nextgems'] = 'freva', time: str | None = None, host: str | None = None, time_select: Literal['flexible', 'strict', 'file'] = 'flexible', stream_zarr: bool = False, multiversion: bool = False, fail_on_error: bool = False, **search_keys: str | List[str])#

Find data in the system.

You can search for either files or URIs. URIs give you information on the storage system where the files or objects you are looking for are located. The query is of the form key=value. For value you can use wild cards such as *, ? or any regular expression.
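
To get a feel for how the * and ? wild cards behave, the sketch below applies shell-style patterns to a few facet values with Python's standard fnmatch module. This is only a local illustration: the actual matching happens on the databrowser server and may differ in detail, for example in how upper and lower case are treated. The values are taken from the examples on this page, and matching is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from typing import List

# Facet values as they appear in the examples in this document.
values = ["cmorph", "oc5", "cpc", "cpc-cmorph", "nodc"]


def matching(pattern: str) -> List[str]:
    """Return the values that match a shell-style wild card pattern."""
    return [value for value in values if fnmatchcase(value, pattern)]


print(matching("cpc*"))  # '*' matches any run of characters
print(matching("oc?"))   # '?' matches exactly one character
```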

Parameters#

*facets: str

If you are not sure about the correct search keys you can use positional arguments to search for any matching entries. For example ‘era5’ would allow you to search for any entries containing era5, regardless of project, product etc.

**search_keys: str

The search constraints applied in the data search. If not given the whole dataset will be queried.

flavour: str, default: freva

The Data Reference Syntax (DRS) standard specifying the type of climate datasets to query. Use the databrowser.overview() class method to retrieve information on the available search flavours and their different search keys.

time: str, default: “”

Special search key to refine/subset search results by time. This can be a string representation of a time range or a single timestamp. The timestamps have to follow ISO-8601. Valid strings are %Y-%m-%dT%H:%M to %Y-%m-%dT%H:%M for time ranges or %Y-%m-%dT%H:%M for single timestamps. Note: You don’t have to give the full string format to subset time steps: %Y, %Y-%m etc. are also valid.
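
If your time bounds are already available as datetime objects, a string in the format described above can be built with strftime. The following is a minimal sketch; time_constraint and TIME_FORMAT are hypothetical names that are not part of freva_client.

```python
from datetime import datetime
from typing import Optional

TIME_FORMAT = "%Y-%m-%dT%H:%M"


def time_constraint(start: datetime, end: Optional[datetime] = None) -> str:
    """Build a value for the `time` search key from datetime objects."""
    if end is None:
        return start.strftime(TIME_FORMAT)  # single timestamp
    return f"{start.strftime(TIME_FORMAT)} to {end.strftime(TIME_FORMAT)}"


print(time_constraint(datetime(2016, 9, 2, 22, 10)))
print(time_constraint(datetime(2016, 9, 2), datetime(2016, 9, 3)))
```

The resulting string can then be passed on as the time argument.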

time_select: str, default: flexible

Operator that specifies how the time period is selected. Choose from flexible (default), strict or file. strict returns only those files whose entire time period is covered by the search period: the time search 2000 to 2012 will not select files containing data from 2010 to 2020 with the strict method. flexible will select those files, because flexible returns files that have either the start or the end of the period covered. file will only return files where the entire time period is contained within one single file.
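
The three modes can be pictured as simple interval comparisons. The sketch below is one possible reading of the description above and not the server implementation: flexible is modelled as "the periods overlap", strict as "the file lies inside the search period" and file as "the search period lies inside one file". Years stand in for full timestamps, and is_selected is a hypothetical helper name.

```python
from typing import Tuple

Period = Tuple[int, int]  # (first year, last year), both inclusive


def is_selected(mode: str, search: Period, file: Period) -> bool:
    """Decide whether a file period is selected for a search period."""
    s_start, s_end = search
    f_start, f_end = file
    if mode == "flexible":  # the two periods overlap
        return f_start <= s_end and f_end >= s_start
    if mode == "strict":  # the file lies entirely inside the search period
        return s_start <= f_start and f_end <= s_end
    if mode == "file":  # the search period lies entirely inside one file
        return f_start <= s_start and s_end <= f_end
    raise ValueError(f"unknown time_select mode: {mode!r}")


search = (2000, 2012)
for mode in ("flexible", "strict", "file"):
    print(mode, is_selected(mode, search, (2010, 2020)))
```

For the search period 2000 to 2012 and a file covering 2010 to 2020, only the flexible mode selects the file in this model.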

uniq_key: str, default: file

Choose whether the solr search query should return paths to files or URIs. URIs contain the file path along with the protocol of the storage system. URIs can be useful if the search query result should be used with libraries like fsspec.
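
Because a URI carries the protocol of the storage system, it can be inspected before deciding how to open an object. The following sketch uses only the Python standard library; the three locations are made-up examples and storage_protocol is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def storage_protocol(location: str) -> str:
    """Return the protocol of a URI, or 'file' for a plain path."""
    return urlsplit(location).scheme or "file"


# Hypothetical locations, for illustration only.
for location in (
    "slk:///arch/project/data.nc",
    "https://example.org/bucket/data.zarr",
    "/work/project/data.nc",
):
    print(storage_protocol(location), location)
```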

host: str, default: None

Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.

stream_zarr: bool, default: False

Create a zarr stream for all search results. When set to true the files are served in zarr format and can be opened from anywhere.

multiversion: bool, default: False

Select all versions and not just the latest version (default).

fail_on_error: bool, default: False

Make the call fail if the connection to the databrowser could not be established.

Attributes#

url: str

The URL of the currently selected databrowser API server.

metadata: dict[str, list[str]]

The available search keys, or metadata, found for the applied search constraints. This can be useful for reverse searches.

Example#

Search for the cmorph datasets. Suppose we know that the experiment name of this dataset is cmorph. We can then create an instance of the databrowser class using the experiment search constraint. If you just ‘print’ the created object you will get a quick overview:

Code

from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
print(db)

Results

databrowser(flavour=freva, host=http://localhost:7777/api/databrowser, multi_version=False, experiment=cmorph)

After having created the search object you can acquire different kinds of information like the number of found objects:

Code

from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
# Number of objects matching the search
print(len(db))

Results

49

Or you can retrieve the combined metadata of the search objects, that is all search keys associated with this search:

Code

from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
print(db.metadata)

Results

{'cmor_table': ['30min'], 'dataset': ['obs-fs', 'obs-hsm', 'obs-swfit'], 'driving_model': [], 'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'format': ['nc', 'zarr'], 'fs_type': ['posix'], 'grid_id': [], 'grid_label': ['gn'], 'institute': ['cpc'], 'level_type': ['2d'], 'model': ['cpc', 'cpc-cmorph'], 'product': ['grid'], 'project': ['observations'], 'rcm_name': [], 'rcm_version': [], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'user': [], 'variable': ['pr']}
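
The metadata dictionary can be processed like any other Python dictionary. As a small sketch, the snippet below works on a shortened copy of the result above (so that it runs without a server), drops the facets that have no values and lists the facets that still have more than one value, which are candidates for refining the search.

```python
from typing import Dict, List

# Shortened copy of the result shown above.
metadata: Dict[str, List[str]] = {
    "dataset": ["obs-fs", "obs-hsm", "obs-swfit"],
    "driving_model": [],
    "experiment": ["cmorph"],
    "format": ["nc", "zarr"],
    "rcm_name": [],
    "variable": ["pr"],
}

# Drop facets without values.
populated = {key: values for key, values in metadata.items() if values}
# Facets with more than one value are candidates for refining the search.
ambiguous = sorted(key for key, values in populated.items() if len(values) > 1)

print(sorted(populated))
print(ambiguous)
```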

Most importantly, you can retrieve the locations of all encountered objects:

Code

from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
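# Iterating over the search object yields one location per object.
for file in db:
    pass
# The search object can be passed to anything that consumes an iterable.
all_files = sorted(db)
print(all_files[0])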

Results

/home/runner/work/freva-nextgen/freva-nextgen/freva-rest/src/databrowser_api/mock/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020000-201609020030.nc

You can also set a different flavour, for example according to cmip6 standard:

Code

from freva_client import databrowser
db = databrowser(flavour="cmip6", experiment_id="cmorph")
print(db.metadata)

Results

{'table_id': ['30min'], 'dataset': ['obs-fs', 'obs-hsm', 'obs-swfit'], 'driving_model': [], 'member_id': ['r1i1p1'], 'experiment_id': ['cmorph'], 'format': ['nc', 'zarr'], 'fs_type': ['posix'], 'grid_id': [], 'grid_label': ['gn'], 'institution_id': ['cpc'], 'level_type': ['2d'], 'source_id': ['cpc', 'cpc-cmorph'], 'activity_id': ['grid'], 'mip_era': ['observations'], 'rcm_name': [], 'rcm_version': [], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'frequency': ['1min'], 'user': [], 'variable_id': ['pr']}
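
Comparing the two metadata results shows which freva search keys correspond to which cmip6 search keys, because the same values appear under different names. The sketch below writes this correspondence down as a dictionary and uses it to rename search keys. The mapping is read off the two results on this page only; FREVA_TO_CMIP6 and to_cmip6 are hypothetical names and not part of freva_client.

```python
from typing import Dict

# Key pairs read off the two metadata results above (freva -> cmip6).
FREVA_TO_CMIP6: Dict[str, str] = {
    "cmor_table": "table_id",
    "ensemble": "member_id",
    "experiment": "experiment_id",
    "institute": "institution_id",
    "model": "source_id",
    "product": "activity_id",
    "project": "mip_era",
    "time_frequency": "frequency",
    "variable": "variable_id",
}


def to_cmip6(search_keys: Dict[str, str]) -> Dict[str, str]:
    """Rename freva-flavour search keys; unknown keys are kept unchanged."""
    return {FREVA_TO_CMIP6.get(key, key): value for key, value in search_keys.items()}


print(to_cmip6({"experiment": "cmorph", "realm": "atmos"}))
```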

Sometimes you don’t know the exact names of the search keys and want to retrieve all file objects that match a certain category. For example, to get all ocean reanalysis datasets you can apply the ‘reana*’ search key as a positional argument:

Code

from freva_client import databrowser
db = databrowser("reana*", realm="ocean", flavour="cmip6")
for file in db:
    print(file)

Results

https://swift.dkrz.de/v1/dkrz_a32dc0e8-2299-4239-a47d-6bf45c8b0160/freva_test/model/obs/reanalysis/reanalysis/NOAA/NODC/OC5/mon/ocean/Omon/r1i1p1/v20200101/hc700/hc700_mon_NODC_OC5_r1i1p1_201201-201212.zarr
/home/runner/work/freva-nextgen/freva-nextgen/freva-rest/src/databrowser_api/mock/data/model/obs/reanalysis/reanalysis/NOAA/NODC/OC5/mon/ocean/Omon/r1i1p1/v20200101/hc700/hc700_mon_NODC_OC5_r1i1p1_201201-201212.nc
/arch/bb1203/freva_test/model/obs/reanalysis/reanalysis/NOAA/NODC/OC5/mon/ocean/Omon/r1i1p1/v20200101/hc700/hc700_mon_NODC_OC5_r1i1p1_201201-201212.nc

If you don’t have direct access to the data, for example because you are not logged in to the computer where the data is stored, you can set stream_zarr=True. The data will then be provisioned in zarr format and can be opened from anywhere. Bear in mind that zarr streams expire if they are not accessed in time. Since the data can be accessed from anywhere, you also have to authenticate before you can access it. Refer also to the freva_client.authenticate() method.

Code

from freva_client import authenticate, databrowser
token_info = authenticate(username="janedoe")
db = databrowser(dataset="cmip6-fs", stream_zarr=True)
zarr_files = list(db)
print(zarr_files)

Results

['http://localhost:7777/api/freva-data-portal/zarr/c48f9024-f61e-5682-9651-7b9a21d81048.zarr', 'http://localhost:7777/api/freva-data-portal/zarr/ff8b9acf-b652-5ce3-a26c-58f9bc4884a3.zarr']

After you have created the paths to the zarr files you can open them:

import xarray as xr
# Send the access token as an HTTP header with every request.
dset = xr.open_dataset(
   zarr_files[0],
   chunks="auto",
   engine="zarr",
   storage_options={"headers":
        {"Authorization": f"Bearer {token_info['access_token']}"}
   }
)
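
If several zarr streams are opened, the storage options can be built once and reused. The helper below is a minimal sketch under the assumption that the HTTP access goes through fsspec's HTTP file system, which forwards extra keyword arguments such as headers to the underlying requests. bearer_storage_options is a hypothetical name, and the token value is made up.

```python
from typing import Any, Dict, Mapping


def bearer_storage_options(token_info: Mapping[str, Any]) -> Dict[str, Dict[str, str]]:
    """Build `storage_options` that send the access token with every request."""
    return {"headers": {"Authorization": f"Bearer {token_info['access_token']}"}}


# Hypothetical token, for illustration only.
print(bearer_storage_options({"access_token": "abc123"}))
```
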
classmethod count_values(*facets: str, flavour: Literal['freva', 'cmip6', 'cmip5', 'cordex', 'nextgems'] = 'freva', time: str | None = None, host: str | None = None, time_select: Literal['flexible', 'strict', 'file'] = 'flexible', multiversion: bool = False, fail_on_error: bool = False, extended_search: bool = False, **search_keys: str | List[str]) Dict[str, Dict[str, int]]#

Count the number of objects in the databrowser.

Parameters#

*facets: str

If you are not sure about the correct search keys you can use positional arguments to search for any matching entries. For example ‘era5’ would allow you to search for any entries containing era5, regardless of project, product etc.

flavour: str, default: freva

The Data Reference Syntax (DRS) standard specifying the type of climate datasets to query.

time: str, default: “”

Special search facet to refine/subset search results by time. This can be a string representation of a time range or a single timestamp. The timestamp has to follow ISO-8601. Valid strings are %Y-%m-%dT%H:%M to %Y-%m-%dT%H:%M for time ranges and %Y-%m-%dT%H:%M for single timestamps. Note: You don’t have to give the full string format to subset time steps: %Y, %Y-%m etc. are also valid.

time_select: str, default: flexible

Operator that specifies how the time period is selected. Choose from flexible (default), strict or file. strict returns only those files whose entire time period is covered by the search period: the time search 2000 to 2012 will not select files containing data from 2010 to 2020 with the strict method. flexible will select those files, because flexible returns files that have either the start or the end of the period covered. file will only return files where the entire time period is contained within one single file.

extended_search: bool, default: False

Retrieve information on additional search keys.

host: str, default: None

Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.

multiversion: bool, default: False

Select all versions and not just the latest version (default).

fail_on_error: bool, default: False

Make the call fail if the connection to the databrowser could not be established.

**search_keys: str

The search constraints to be applied in the data search. If not given the whole dataset will be queried.

Returns#

dict[str, dict[str, int]]:

Dictionary giving, for each search facet/key, the number of objects found for each of its values.

Example#

Code

from freva_client import databrowser
print(databrowser.count_values(experiment="cmorph"))

Results

{'ensemble': {'r1i1p1': 49}, 'experiment': {'cmorph': 49}, 'institute': {'cpc': 49}, 'model': {'cpc': 25, 'cpc-cmorph': 24}, 'product': {'grid': 49}, 'project': {'observations': 49}, 'realm': {'atmos': 49}, 'time_aggregation': {'mean': 49}, 'time_frequency': {'1min': 49}, 'variable': {'pr': 49}}
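
The nested result can be summarised with plain dictionary operations. As a small sketch, the snippet below works on a shortened copy of the result above (so that it runs without a server) and prints, for each facet, the total number of objects and the most frequent value.

```python
from typing import Dict

# Shortened copy of the result shown above.
counts: Dict[str, Dict[str, int]] = {
    "experiment": {"cmorph": 49},
    "model": {"cpc": 25, "cpc-cmorph": 24},
    "variable": {"pr": 49},
}

for facet, values in counts.items():
    total = sum(values.values())  # objects counted for this facet
    most_common = max(values, key=values.get)  # value with the highest count
    print(f"{facet}: {total} objects, most common value: {most_common}")
```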

Code

from freva_client import databrowser
print(databrowser.count_values("model"))

Results

{'ensemble': {}, 'experiment': {}, 'institute': {}, 'model': {}, 'product': {}, 'project': {}, 'realm': {}, 'time_aggregation': {}, 'time_frequency': {}, 'variable': {}}

Sometimes you don’t know the exact names of the search keys and want to count all file objects that match a certain category. For example, to count all ocean reanalysis datasets you can apply the ‘reana*’ search key as a positional argument:

Code

from freva_client import databrowser
print(databrowser.count_values("reana*", realm="ocean", flavour="cmip6"))

Results

{'member_id': {'r1i1p1': 3}, 'experiment_id': {'oc5': 3}, 'institution_id': {'noaa': 3}, 'source_id': {'nodc': 3}, 'activity_id': {'reanalysis': 3}, 'mip_era': {'observations': 3}, 'realm': {'ocean': 3}, 'time_aggregation': {'mean': 3}, 'frequency': {'mon': 3}, 'variable_id': {'hc700': 3}}
intake_catalogue() esm_datastore#

Create an intake esm catalogue object from the search.

This method creates an intake-esm catalogue from the current object search. Instead of having the original files as target objects you can also choose to stream the files via zarr.

Returns#

intake_esm.core.esm_datastore: intake-esm catalogue.

Raises#

ValueError: If user is not authenticated or catalogue creation failed.

Example#

Let’s create an intake-esm catalogue that allows for streaming the target data as zarr:

Code

from freva_client import databrowser
db = databrowser(dataset="cmip6-fs", stream_zarr=True)
cat = db.intake_catalogue()
print(cat.df)

Results

                                                 uri project  ... fs_type grid_label
0  http://localhost:7777/api/freva-data-portal/za...   CMIP6  ...   posix         gn
1  http://localhost:7777/api/freva-data-portal/za...   CMIP6  ...   posix         gn

[2 rows x 13 columns]
property metadata: Dict[str, List[str]]#

Get the metadata (facets) for the current databrowser query.

You can retrieve all information that is associated with your current databrowser search. This can be useful for reverse searches, for example for retrieving metadata of object stores or file/directory names.

Example#

Reverse search: retrieving metadata from a known file:

Code

from freva_client import databrowser
db = databrowser(uri="slk:///arch/*/CPC/*")
print(db.metadata)

Results

{'cmor_table': ['30min'], 'dataset': ['obs-hsm'], 'driving_model': [], 'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'format': ['nc'], 'fs_type': ['posix'], 'grid_id': [], 'grid_label': ['gn'], 'institute': ['cpc'], 'level_type': ['2d'], 'model': ['cpc'], 'product': ['grid'], 'project': ['observations'], 'rcm_name': [], 'rcm_version': [], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'user': [], 'variable': ['pr']}
classmethod metadata_search(*facets, flavour, time, host, time_select, multiversion, fail_on_error, extended_search, **search_keys) Dict[str, List[str]]#

Search for data attributes (facets) in the databrowser.

The method queries the databrowser for available search facets (keys) like model, experiment etc.

Parameters#

*facets: str

If you are not sure about the correct search keys you can use positional arguments to search for any matching entries. For example ‘era5’ would allow you to search for any entries containing era5, regardless of project, product etc.

flavour: str, default: freva

The Data Reference Syntax (DRS) standard specifying the type of climate datasets to query.

time: str, default: “”

Special search facet to refine/subset search results by time. This can be a string representation of a time range or a single timestamp. The timestamp has to follow ISO-8601. Valid strings are %Y-%m-%dT%H:%M to %Y-%m-%dT%H:%M for time ranges and %Y-%m-%dT%H:%M for single timestamps. Note: You don’t have to give the full string format to subset time steps: %Y, %Y-%m etc. are also valid.

time_select: str, default: flexible

Operator that specifies how the time period is selected. Choose from flexible (default), strict or file. strict returns only those files whose entire time period is covered by the search period: the time search 2000 to 2012 will not select files containing data from 2010 to 2020 with the strict method. flexible will select those files, because flexible returns files that have either the start or the end of the period covered. file will only return files where the entire time period is contained within one single file.

extended_search: bool, default: False

Retrieve information on additional search keys.

multiversion: bool, default: False

Select all versions and not just the latest version (default).

host: str, default: None

Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.

fail_on_error: bool, default: False

Make the call fail if the connection to the databrowser could not be established.

**search_keys: str, list[str]

The facets to be applied in the data search. If not given the whole dataset will be queried.

Returns#

dict[str, list[str]]:

Dictionary with a list of search facet values for each search facet key.

Example#

Code

from freva_client import databrowser
all_facets = databrowser.metadata_search(project='obs*')
print(all_facets)

Results

{'ensemble': ['r1i1p1'], 'experiment': ['cmorph', 'oc5'], 'institute': ['cpc', 'noaa'], 'model': ['cpc', 'cpc-cmorph', 'nodc'], 'product': ['grid', 'reanalysis'], 'project': ['observations'], 'realm': ['atmos', 'ocean'], 'time_aggregation': ['mean'], 'time_frequency': ['1min', 'mon'], 'variable': ['hc700', 'pr']}
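
The facet values returned by a metadata search can be used to drive follow-up searches. The sketch below works on a shortened copy of the result above (so that it runs without a server) and builds one set of search keys per combination of facet values. combinations is a hypothetical helper name, and not every combination necessarily matches any data.

```python
from itertools import product
from typing import Dict, Iterator, List

# Shortened copy of the result shown above.
all_facets: Dict[str, List[str]] = {
    "experiment": ["cmorph", "oc5"],
    "realm": ["atmos", "ocean"],
}


def combinations(facets: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one set of search keys per combination of facet values."""
    keys = sorted(facets)
    for values in product(*(facets[key] for key in keys)):
        yield dict(zip(keys, values))


for search_keys in combinations(all_facets):
    print(search_keys)  # could be passed on as databrowser(**search_keys)
```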

You can also search for all metadata matching a search string:

Code

from freva_client import databrowser
spec_facets = databrowser.metadata_search("obs*")
print(spec_facets)

Results

{'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'institute': ['cpc'], 'model': ['cpc', 'cpc-cmorph'], 'product': ['grid'], 'project': ['observations'], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'variable': ['pr']}

Get all models that have a given time step:

Code

from freva_client import databrowser
model = databrowser.metadata_search(
    project="obs*",
    time="2016-09-02T22:10"
)
print(model)

Results

{'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'institute': ['cpc'], 'model': ['cpc', 'cpc-cmorph'], 'product': ['grid'], 'project': ['observations'], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'variable': ['pr']}

Reverse search: retrieving metadata from a known file:

Code

from freva_client import databrowser
res = databrowser.metadata_search(file="/arch/*CPC/*")
print(res)

Results

{'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'institute': ['cpc'], 'model': ['cpc'], 'product': ['grid'], 'project': ['observations'], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'variable': ['pr']}

Sometimes you don’t know the exact names of the search keys and want to retrieve the metadata of all file objects that match a certain category. For example, to get the metadata of all ocean reanalysis datasets you can apply the ‘reana*’ search key as a positional argument:

Code

from freva_client import databrowser
print(databrowser.metadata_search("reana*", realm="ocean", flavour="cmip6"))

Results

{'member_id': ['r1i1p1'], 'experiment_id': ['oc5'], 'institution_id': ['noaa'], 'source_id': ['nodc'], 'activity_id': ['reanalysis'], 'mip_era': ['observations'], 'realm': ['ocean'], 'time_aggregation': ['mean'], 'frequency': ['mon'], 'variable_id': ['hc700']}
classmethod overview(host: str | None = None) str#

Get an overview of the available search options.

If you don’t know which search flavours or search keys you can use for searching the data, this method gives you an overview of what is available.

Parameters#

host: str, default None

Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.

Returns#

str: A string representation of what is available.

Example#

Code

from freva_client import databrowser
print(databrowser.overview())

Results

Available search flavours:
- freva
- cmip6
- cmip5
- cordex
- nextgems
Search attributes by flavour:
  cmip5:
  - experiment
  - member_id
  - fs_type
  - grid_label
  - institution_id
  - model_id
  - project
  - product
  - realm
  - variable
  - time
  - time_aggregation
  - time_frequency
  - cmor_table
  - dataset
  - format
  - grid_id
  - level_type
  cmip6:
  - experiment_id
  - member_id
  - fs_type
  - grid_label
  - institution_id
  - source_id
  - mip_era
  - activity_id
  - realm
  - variable_id
  - time
  - time_aggregation
  - frequency
  - table_id
  - dataset
  - format
  - grid_id
  - level_type
  cordex:
  - experiment
  - ensemble
  - fs_type
  - grid_label
  - institution
  - model
  - project
  - domain
  - realm
  - variable
  - time
  - time_aggregation
  - time_frequency
  - cmor_table
  - dataset
  - driving_model
  - format
  - grid_id
  - level_type
  - rcm_name
  - rcm_version
  freva:
  - project
  - product
  - institute
  - model
  - experiment
  - time_frequency
  - realm
  - variable
  - ensemble
  - time_aggregation
  - fs_type
  - grid_label
  - cmor_table
  - format
  - grid_id
  - level_type
  - dataset
  - time
  nextgems:
  - experiment
  - member_id
  - fs_type
  - grid_label
  - institution_id
  - source_id
  - project
  - experiment_id
  - realm
  - variable_id
  - time
  - time_reduction
  - time_frequency
  - cmor_table
  - dataset
  - format
  - grid_id
  - level_type

property url: str#

Get the url of the databrowser API.

Example#

Code

from freva_client import databrowser
db = databrowser()
print(db.url)

Results

http://localhost:7777/api/databrowser