Databrowser python module#
The following section gives an overview of the usage of the databrowser client python module. Please see the Installation and configuration section on how to install and configure the library.
TLDR: Too long didn’t read#
To query the databrowser and search for data you have three different options. You can use the following methods:
freva_client.databrowser
: The main class for searching data is the freva_client.databrowser class. After creating an instance of the databrowser class with your specific search constraints you can retrieve all files or URIs that match your search constraints. You can also retrieve a count of the number of objects matching the search, get an overview of the available metadata, and create an intake-esm catalogue from your search. Searching for URIs instead of file paths can be useful to get information on the storage system where the files or object stores are located.
freva_client.databrowser.metadata_search()
: This class method lists all search categories (facets) and their values.
freva_client.databrowser.count_values()
: You can count the occurrences of search results with this method.
Library Reference#
Below you can find more detailed documentation.
Client software freva evaluation system framework (freva):
Freva, the free evaluation system framework, is a data search and analysis platform developed by the atmospheric science community for the atmospheric science community. With the help of Freva, researchers can:
quickly and intuitively search for data stored at typical data centers that host many datasets.
create a common interface for user defined data analysis tools.
apply data analysis tools in a reproducible manner.
The code described here is currently in testing phase. The client and server library described in the documentation only support searching for data. If you need to apply data analysis plugins, please visit the
- class freva_client.databrowser(*facets: str, uniq_key: Literal['file', 'uri'] = 'file', flavour: Literal['freva', 'cmip6', 'cmip5', 'cordex', 'nextgems', 'user'] = 'freva', time: str | None = None, host: str | None = None, time_select: Literal['flexible', 'strict', 'file'] = 'flexible', stream_zarr: bool = False, multiversion: bool = False, fail_on_error: bool = False, **search_keys: str | List[str])#
Find data in the system.
You can search for either files or URIs. URIs give you information on the storage system where the files or objects you are looking for are located. The query is of the form key=value. For value you might use wildcards such as *, ? or any regular expression.
Parameters#
- *facets: str
If you are not sure about the correct search keys you can use positional arguments to search for any matching entries. For example ‘era5’ would allow you to search for any entries containing era5, regardless of project, product etc.
- **search_keys: str
The search constraints applied in the data search. If not given the whole dataset will be queried.
- flavour: str, default: freva
The Data Reference Syntax (DRS) standard specifying the type of climate datasets to query. You can use the databrowser.overview() class method to retrieve information on the available search flavours and their different search keys.
- time: str, default: “”
Special search key to refine/subset search results by time. This can be a string representation of a time range or a single timestamp. The timestamps have to follow ISO-8601. Valid strings are %Y-%m-%dT%H:%M to %Y-%m-%dT%H:%M for time ranges or %Y-%m-%dT%H:%M for single timestamps. Note: You don’t have to give the full string format to subset time steps: %Y, %Y-%m etc. are also valid.
- time_select: str, default: flexible
Operator that specifies how the time period is selected. Choose from flexible (default), strict or file. strict returns only those files that have the entire time period covered. The time search 2000 to 2012 will not select files containing data from 2010 to 2020 with the strict method. flexible will select those files, as flexible returns those files that have either the start or the end of the period covered. file will only return files where the entire time period is contained within one single file.
- uniq_key: str, default: file
Choose whether the Solr search query should return paths to files or URIs. URIs contain the file path along with the protocol of the storage system. URIs can be useful if the search query result should be used by libraries like fsspec.
- host: str, default: None
Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.
- stream_zarr: bool, default: False
Create a zarr stream for all search results. When set to true the files are served in zarr format and can be opened from anywhere.
- multiversion: bool, default: False
Select all versions and not just the latest version (default).
- fail_on_error: bool, default: False
Make the call fail if the connection to the databrowser could not be established.
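The three time_select modes described above can be pictured as comparisons between two time intervals. The sketch below is a plain-Python illustration only, not the server implementation. The helper name `matches` is hypothetical, and reading strict as "the file period lies within the searched period" and file as "the searched period lies within one file" is an assumption drawn from the descriptions above.

```python
from datetime import datetime


def matches(mode: str, search: tuple, file_span: tuple) -> bool:
    """Hypothetical helper: does a file's time span satisfy the search period?"""
    s_start, s_end = search
    f_start, f_end = file_span
    if mode == "flexible":
        # Any overlap between the two periods is enough.
        return f_start <= s_end and f_end >= s_start
    if mode == "strict":
        # Assumed reading: the file must not reach outside the searched period.
        return s_start <= f_start and f_end <= s_end
    if mode == "file":
        # Assumed reading: one file must span the whole searched period.
        return f_start <= s_start and s_end <= f_end
    raise ValueError(f"unknown time_select mode: {mode!r}")


# The case from the description: search 2000 to 2012, file covers 2010 to 2020.
search = (datetime(2000, 1, 1), datetime(2012, 12, 31))
file_span = (datetime(2010, 1, 1), datetime(2020, 12, 31))
for mode in ("flexible", "strict", "file"):
    print(mode, matches(mode, search, file_span))
```

For the documented case, flexible selects the file while strict does not, which agrees with the description of the two modes.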
Attributes#
- url: str
The URL of the currently selected databrowser API server.
- metadata: dict[str, str]
The available search keys, or metadata, found for the applied search constraints. This can be useful for reverse searches.
Example#
Search for the cmorph datasets. Suppose we know that the experiment name of this dataset is cmorph; we can therefore create an instance of the databrowser class using the experiment search constraint. If you just print the created object you will get a quick overview:
from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
print(db)
databrowser(flavour=freva, host=http://localhost:7777/api/databrowser, multi_version=False, experiment=cmorph)
After having created the search object you can acquire different kinds of information like the number of found objects:
from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
print(len(db))  # Number of objects matching the search
49
Or you can retrieve the combined metadata of the search objects.
from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
print(db.metadata)
{'cmor_table': ['30min'], 'dataset': ['obs-fs', 'obs-hsm', 'obs-swfit'], 'driving_model': [], 'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'format': ['nc', 'zarr'], 'fs_type': ['posix'], 'grid_id': [], 'grid_label': ['gn'], 'institute': ['cpc'], 'level_type': ['2d'], 'model': ['cpc', 'cpc-cmorph'], 'product': ['grid'], 'project': ['observations'], 'rcm_name': [], 'rcm_version': [], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'user': [], 'variable': ['pr']}
Most importantly, you can retrieve the locations of all encountered objects:
from freva_client import databrowser
db = databrowser(experiment="cmorph", uniq_key="uri")
for file in db:
    pass
all_files = sorted(db)
print(all_files[0])
/home/runner/work/freva-nextgen/freva-nextgen/freva-rest/src/databrowser_api/mock/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609020000-201609020030.nc
You can also set a different flavour, for example according to cmip6 standard:
from freva_client import databrowser
db = databrowser(flavour="cmip6", experiment_id="cmorph")
print(db.metadata)
{'table_id': ['30min'], 'dataset': ['obs-fs', 'obs-hsm', 'obs-swfit'], 'driving_model': [], 'member_id': ['r1i1p1'], 'experiment_id': ['cmorph'], 'format': ['nc', 'zarr'], 'fs_type': ['posix'], 'grid_id': [], 'grid_label': ['gn'], 'institution_id': ['cpc'], 'level_type': ['2d'], 'source_id': ['cpc', 'cpc-cmorph'], 'activity_id': ['grid'], 'mip_era': ['observations'], 'rcm_name': [], 'rcm_version': [], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'frequency': ['1min'], 'user': [], 'variable_id': ['pr']}
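Comparing the two metadata outputs above shows that the same values appear under different key names depending on the flavour. The following sketch renames freva-flavoured keys to their cmip6 counterparts. The mapping table is read off those two outputs and is an assumption for illustration, not an official mapping shipped with the library; `to_cmip6` is a hypothetical helper.

```python
# Key pairs read off the two example outputs (freva -> cmip6).
# This table is an assumption for illustration, not an official mapping.
FREVA_TO_CMIP6 = {
    "cmor_table": "table_id",
    "ensemble": "member_id",
    "experiment": "experiment_id",
    "institute": "institution_id",
    "model": "source_id",
    "product": "activity_id",
    "project": "mip_era",
    "time_frequency": "frequency",
    "variable": "variable_id",
}


def to_cmip6(metadata: dict) -> dict:
    """Rename freva-flavoured keys; keys without an entry are kept unchanged."""
    return {FREVA_TO_CMIP6.get(key, key): values for key, values in metadata.items()}


freva_meta = {
    "experiment": ["cmorph"],
    "model": ["cpc", "cpc-cmorph"],
    "realm": ["atmos"],
}
print(to_cmip6(freva_meta))
```

In practice the flavour argument does this translation on the server side, so a local table like this is only useful for comparing results obtained with different flavours.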
Sometimes you don’t know the exact names of the search keys and want to retrieve all file objects that match a certain category. For example, to get all ocean reanalysis datasets you can apply the ‘reana*’ search key as a positional argument:
from freva_client import databrowser
db = databrowser("reana*", realm="ocean", flavour="cmip6")
for file in db:
    print(file)
https://swift.dkrz.de/v1/dkrz_a32dc0e8-2299-4239-a47d-6bf45c8b0160/freva_test/model/obs/reanalysis/reanalysis/NOAA/NODC/OC5/mon/ocean/Omon/r1i1p1/v20200101/hc700/hc700_mon_NODC_OC5_r1i1p1_201201-201212.zarr
/home/runner/work/freva-nextgen/freva-nextgen/freva-rest/src/databrowser_api/mock/data/model/obs/reanalysis/reanalysis/NOAA/NODC/OC5/mon/ocean/Omon/r1i1p1/v20200101/hc700/hc700_mon_NODC_OC5_r1i1p1_201201-201212.nc
/arch/bb1203/freva_test/model/obs/reanalysis/reanalysis/NOAA/NODC/OC5/mon/ocean/Omon/r1i1p1/v20200101/hc700/hc700_mon_NODC_OC5_r1i1p1_201201-201212.nc
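To get a feeling for what the * and ? wildcards select, the patterns can be tried locally on a handful of facet values. The sketch below uses Python's fnmatch module as a stand-in. This is an assumption for illustration only: the actual matching is done by the search backend on the server and may differ in details such as case handling.

```python
from fnmatch import fnmatch

# Facet values taken from the example outputs in this section.
values = ["reanalysis", "grid", "observations", "cmorph"]

# '*' matches any run of characters.
star_hits = [v for v in values if fnmatch(v, "reana*")]
# '?' matches exactly one character.
single_hits = [v for v in values if fnmatch(v, "gri?")]

print(star_hits)
print(single_hits)
```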
If you don’t have direct access to the data, for example because you are not logged in to the computer where the data is stored, you can set stream_zarr=True. The data will then be provisioned in zarr format and can be opened from anywhere. Bear in mind that zarr streams will expire if they are not accessed in time. Since the data can be accessed from anywhere you will also have to authenticate before you are able to access the data. Refer also to the freva_client.authenticate() method.
from freva_client import authenticate, databrowser
token_info = authenticate(username="janedoe")
db = databrowser(dataset="cmip6-fs", stream_zarr=True)
zarr_files = list(db)
print(zarr_files)
['http://localhost:7777/api/freva-data-portal/zarr/c48f9024-f61e-5682-9651-7b9a21d81048.zarr', 'http://localhost:7777/api/freva-data-portal/zarr/ff8b9acf-b652-5ce3-a26c-58f9bc4884a3.zarr']
After you have created the paths to the zarr files you can open them:
import xarray as xr
dset = xr.open_dataset(
    zarr_files[0],
    chunks="auto",
    engine="zarr",
    # The access token is sent as a bearer token in the HTTP request headers.
    storage_options={
        "headers": {"Authorization": f"Bearer {token_info['access_token']}"}
    },
)
- classmethod count_values(*facets: str, flavour: Literal['freva', 'cmip6', 'cmip5', 'cordex', 'nextgems', 'user'] = 'freva', time: str | None = None, host: str | None = None, time_select: Literal['flexible', 'strict', 'file'] = 'flexible', multiversion: bool = False, fail_on_error: bool = False, extended_search: bool = False, **search_keys: str | List[str]) Dict[str, Dict[str, int]] #
Count the number of objects in the databrowser.
Parameters#
- *facets: str
If you are not sure about the correct search keys you can use positional arguments to search for any matching entries. For example ‘era5’ would allow you to search for any entries containing era5, regardless of project, product etc.
- flavour: str, default: freva
The Data Reference Syntax (DRS) standard specifying the type of climate datasets to query.
- time: str, default: “”
Special search facet to refine/subset search results by time. This can be a string representation of a time range or a single timestamp. The timestamp has to follow ISO-8601. Valid strings are %Y-%m-%dT%H:%M to %Y-%m-%dT%H:%M for time ranges and %Y-%m-%dT%H:%M for single timestamps. Note: You don’t have to give the full string format to subset time steps: %Y, %Y-%m etc. are also valid.
- time_select: str, default: flexible
Operator that specifies how the time period is selected. Choose from flexible (default), strict or file. strict returns only those files that have the entire time period covered. The time search 2000 to 2012 will not select files containing data from 2010 to 2020 with the strict method. flexible will select those files, as flexible returns those files that have either the start or the end of the period covered. file will only return files where the entire time period is contained within one single file.
- extended_search: bool, default: False
Retrieve information on additional search keys.
- host: str, default: None
Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.
- multiversion: bool, default: False
Select all versions and not just the latest version (default).
- fail_on_error: bool, default: False
Make the call fail if the connection to the databrowser could not be established.
- **search_keys: str
The search constraints to be applied in the data search. If not given the whole dataset will be queried.
Returns#
- dict[str, dict[str, int]]:
Dictionary giving, for each search facet/key, the number of objects found for each of its values.
Example#
from freva_client import databrowser
print(databrowser.count_values(experiment="cmorph"))
{'ensemble': {'r1i1p1': 49}, 'experiment': {'cmorph': 49}, 'institute': {'cpc': 49}, 'model': {'cpc': 25, 'cpc-cmorph': 24}, 'product': {'grid': 49}, 'project': {'observations': 49}, 'realm': {'atmos': 49}, 'time_aggregation': {'mean': 49}, 'time_frequency': {'1min': 49}, 'variable': {'pr': 49}}
from freva_client import databrowser
print(databrowser.count_values("model"))
{'ensemble': {}, 'experiment': {}, 'institute': {}, 'model': {}, 'product': {}, 'project': {}, 'realm': {}, 'time_aggregation': {}, 'time_frequency': {}, 'variable': {}}
Sometimes you don’t know the exact names of the search keys and want to count all objects that match a certain category. For example, to count all ocean reanalysis datasets you can apply the ‘reana*’ search key as a positional argument:
from freva_client import databrowser
print(databrowser.count_values("reana*", realm="ocean", flavour="cmip6"))
{'member_id': {'r1i1p1': 3}, 'experiment_id': {'oc5': 3}, 'institution_id': {'noaa': 3}, 'source_id': {'nodc': 3}, 'activity_id': {'reanalysis': 3}, 'mip_era': {'observations': 3}, 'realm': {'ocean': 3}, 'time_aggregation': {'mean': 3}, 'frequency': {'mon': 3}, 'variable_id': {'hc700': 3}}
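Because count_values returns a nested dictionary (facet, then value, then count), the result is easy to summarise with ordinary dictionary operations. The sketch below works on a shortened copy of the output printed for experiment="cmorph" earlier in this section, so it runs without a server; the variable names are only illustrative.

```python
# Shortened copy of the count_values(experiment="cmorph") output shown earlier.
counts = {
    "experiment": {"cmorph": 49},
    "model": {"cpc": 25, "cpc-cmorph": 24},
    "variable": {"pr": 49},
}

# Total number of objects per facet.
totals = {facet: sum(values.values()) for facet, values in counts.items()}
print(totals)

# Most frequent value per facet.
top = {facet: max(values, key=values.get) for facet, values in counts.items()}
print(top)
```

Here every facet adds up to the same total of 49, which is the number of objects reported by len(db) for the same search.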
- intake_catalogue() esm_datastore #
Create an intake esm catalogue object from the search.
This method creates an intake-esm catalogue from the current object search. Instead of having the original files as target objects you can also choose to stream the files via zarr.
Returns#
intake_esm.core.esm_datastore: intake-esm catalogue.
Raises#
ValueError: If user is not authenticated or catalogue creation failed.
Example#
Let’s create an intake-esm catalogue that allows for streaming the target data as zarr:
from freva_client import databrowser
db = databrowser(dataset="cmip6-hsm", stream_zarr=True)
cat = db.intake_catalogue()
print(cat.df)
                                                 uri project  ... grid_label format
0  http://localhost:7777/api/freva-data-portal/za...   CMIP6  ...         gn     nc

[1 rows x 14 columns]
- property metadata: Dict[str, List[str]]#
Get the metadata (facets) for the current databrowser query.
You can retrieve all information that is associated with your current databrowser search. This can be useful for reverse searches, for example for retrieving the metadata of object stores or file/directory names.
Example#
Reverse search: retrieving metadata from a known file:
from freva_client import databrowser
db = databrowser(uri="slk:///arch/*/CPC/*")
print(db.metadata)
{'cmor_table': ['30min'], 'dataset': ['obs-hsm'], 'driving_model': [], 'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'format': ['nc'], 'fs_type': ['posix'], 'grid_id': [], 'grid_label': ['gn'], 'institute': ['cpc'], 'level_type': ['2d'], 'model': ['cpc'], 'product': ['grid'], 'project': ['observations'], 'rcm_name': [], 'rcm_version': [], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'user': [], 'variable': ['pr']}
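The uri search key used above carries the protocol of the storage system in front of the path. When post-processing search results it can be handy to separate the two. The sketch below does this with the standard library; the helper `split_location` is hypothetical and the example locations are made up to resemble the ones shown in this section.

```python
from urllib.parse import urlsplit


def split_location(uri: str) -> tuple:
    """Hypothetical helper: return (protocol, path) for a URI or plain path."""
    parts = urlsplit(uri)
    # Plain file paths carry no scheme; treat them as local files.
    return (parts.scheme or "file", parts.path)


# Made-up locations that mimic the kinds of results shown in this section.
uris = [
    "slk:///arch/bb1203/freva_test/example.nc",
    "https://swift.dkrz.de/v1/example.zarr",
    "/home/runner/work/example.nc",
]
for uri in uris:
    print(split_location(uri))
```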
- classmethod metadata_search(*facets: str, flavour: Literal['freva', 'cmip6', 'cmip5', 'cordex', 'nextgems', 'user'] = 'freva', time: str | None = None, host: str | None = None, time_select: Literal['flexible', 'strict', 'file'] = 'flexible', multiversion: bool = False, fail_on_error: bool = False, extended_search: bool = False, **search_keys: str | List[str]) Dict[str, List[str]] #
Search for data attributes (facets) in the databrowser.
The method queries the databrowser for available search facets (keys) like model, experiment etc.
Parameters#
- *facets: str
If you are not sure about the correct search keys you can use positional arguments to search for any matching entries. For example ‘era5’ would allow you to search for any entries containing era5, regardless of project, product etc.
- flavour: str, default: freva
The Data Reference Syntax (DRS) standard specifying the type of climate datasets to query.
- time: str, default: “”
Special search facet to refine/subset search results by time. This can be a string representation of a time range or a single timestamp. The timestamp has to follow ISO-8601. Valid strings are %Y-%m-%dT%H:%M to %Y-%m-%dT%H:%M for time ranges and %Y-%m-%dT%H:%M for single timestamps. Note: You don’t have to give the full string format to subset time steps: %Y, %Y-%m etc. are also valid.
- time_select: str, default: flexible
Operator that specifies how the time period is selected. Choose from flexible (default), strict or file. strict returns only those files that have the entire time period covered. The time search 2000 to 2012 will not select files containing data from 2010 to 2020 with the strict method. flexible will select those files, as flexible returns those files that have either the start or the end of the period covered. file will only return files where the entire time period is contained within one single file.
- extended_search: bool, default: False
Retrieve information on additional search keys.
- multiversion: bool, default: False
Select all versions and not just the latest version (default).
- host: str, default: None
Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.
- fail_on_error: bool, default: False
Make the call fail if the connection to the databrowser could not be established.
- **search_keys: str, list[str]
The facets to be applied in the data search. If not given the whole dataset will be queried.
Returns#
- dict[str, list[str]]:
Dictionary with a list of search facet values for each search facet key.
Example#
from freva_client import databrowser
all_facets = databrowser.metadata_search(project='obs*')
print(all_facets)
{'ensemble': ['r1i1p1'], 'experiment': ['cmorph', 'oc5'], 'institute': ['cpc', 'noaa'], 'model': ['cpc', 'cpc-cmorph', 'nodc'], 'product': ['grid', 'reanalysis'], 'project': ['observations'], 'realm': ['atmos', 'ocean'], 'time_aggregation': ['mean'], 'time_frequency': ['1min', 'mon'], 'variable': ['hc700', 'pr']}
You can also search for all metadata matching a search string:
from freva_client import databrowser
spec_facets = databrowser.metadata_search("obs*")
print(spec_facets)
{'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'institute': ['cpc'], 'model': ['cpc', 'cpc-cmorph'], 'product': ['grid'], 'project': ['observations'], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'variable': ['pr']}
Get all models that have a given time step:
from freva_client import databrowser
model = databrowser.metadata_search(
    project="obs*", time="2016-09-02T22:10"
)
print(model)
{'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'institute': ['cpc'], 'model': ['cpc', 'cpc-cmorph'], 'product': ['grid'], 'project': ['observations'], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'variable': ['pr']}
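The time argument used above accepts full or partial ISO-8601 timestamps and ranges written as "start to end". The sketch below shows one way such strings could be checked locally before sending a query. It is not the parser used by the library: the helper names are hypothetical, only four of the possible formats are tried, and a partial end stamp is mapped to the start of that period, which the server may handle differently.

```python
from datetime import datetime

# Formats from most to least specific; partial timestamps are allowed.
FORMATS = ("%Y-%m-%dT%H:%M", "%Y-%m-%d", "%Y-%m", "%Y")


def parse_stamp(text: str) -> datetime:
    """Parse a full or partial ISO-8601 timestamp."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported time string: {text!r}")


def parse_time(value: str) -> tuple:
    """Split 'start to end' into two stamps; a single stamp gives start == end."""
    start, _, end = value.partition(" to ")
    return parse_stamp(start), parse_stamp(end or start)


print(parse_time("2016-09-02T22:10"))
print(parse_time("2000 to 2012-06"))
```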
Reverse search: retrieving metadata from a known file:
from freva_client import databrowser
res = databrowser.metadata_search(file="/arch/*CPC/*")
print(res)
{'ensemble': ['r1i1p1'], 'experiment': ['cmorph'], 'institute': ['cpc'], 'model': ['cpc'], 'product': ['grid'], 'project': ['observations'], 'realm': ['atmos'], 'time_aggregation': ['mean'], 'time_frequency': ['1min'], 'variable': ['pr']}
Sometimes you don’t know the exact names of the search keys and want to retrieve the metadata of all objects that match a certain category. For example, to get the metadata of all ocean reanalysis datasets you can apply the ‘reana*’ search key as a positional argument:
from freva_client import databrowser
print(databrowser.metadata_search("reana*", realm="ocean", flavour="cmip6"))
{'member_id': ['r1i1p1'], 'experiment_id': ['oc5'], 'institution_id': ['noaa'], 'source_id': ['nodc'], 'activity_id': ['reanalysis'], 'mip_era': ['observations'], 'realm': ['ocean'], 'time_aggregation': ['mean'], 'frequency': ['mon'], 'variable_id': ['hc700']}
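A metadata_search result can be used to decide which search keys are still worth constraining: facets with a single value are already pinned down, while facets with several values still split the result. The sketch below works on a shortened copy of the project='obs*' output printed earlier in this section, so it runs without a server; the variable names are only illustrative.

```python
# Shortened copy of the metadata_search(project='obs*') output shown earlier.
all_facets = {
    "experiment": ["cmorph", "oc5"],
    "model": ["cpc", "cpc-cmorph", "nodc"],
    "project": ["observations"],
    "realm": ["atmos", "ocean"],
}

# Facets with more than one value can be used to narrow the search further.
ambiguous = {key: values for key, values in all_facets.items() if len(values) > 1}
print(sorted(ambiguous))
```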
- classmethod overview(host: str | None = None) str #
Get an overview of the available search options.
If you don’t know which search flavours or search keys you can use for searching the data, you can use this method to get an overview of what is available.
Parameters#
- host: str, default None
Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.
Returns#
str: A string representation of what is available.
Example#
from freva_client import databrowser
print(databrowser.overview())
Available search flavours:
- freva
- cmip6
- cmip5
- cordex
- nextgems
- user
Search attributes by flavour:
cmip5:
- experiment
- member_id
- fs_type
- grid_label
- institution_id
- model_id
- project
- product
- realm
- variable
- time
- time_aggregation
- time_frequency
- cmor_table
- dataset
- format
- grid_id
- level_type
cmip6:
- experiment_id
- member_id
- fs_type
- grid_label
- institution_id
- source_id
- mip_era
- activity_id
- realm
- variable_id
- time
- time_aggregation
- frequency
- table_id
- dataset
- format
- grid_id
- level_type
cordex:
- experiment
- ensemble
- fs_type
- grid_label
- institution
- model
- project
- domain
- realm
- variable
- time
- time_aggregation
- time_frequency
- cmor_table
- dataset
- driving_model
- format
- grid_id
- level_type
- rcm_name
- rcm_version
freva:
- project
- product
- institute
- model
- experiment
- time_frequency
- realm
- variable
- ensemble
- time_aggregation
- fs_type
- grid_label
- cmor_table
- format
- grid_id
- level_type
- dataset
- time
- user
nextgems:
- experiment
- member_id
- fs_type
- grid_label
- institution_id
- source_id
- project
- experiment_id
- realm
- variable_id
- time
- time_reduction
- time_frequency
- cmor_table
- dataset
- format
- grid_id
- level_type
user:
- project
- product
- institute
- model
- experiment
- time_frequency
- realm
- variable
- ensemble
- time_aggregation
- fs_type
- grid_label
- cmor_table
- format
- grid_id
- level_type
- dataset
- time
- user
- property url: str#
Get the url of the databrowser API.
Example#
from freva_client import databrowser
db = databrowser()
print(db.url)
http://localhost:7777/api/databrowser
- classmethod userdata(action: Literal['add', 'delete'], userdata_items: List[str | Dataset] | None = None, metadata: Dict[str, str] | None = None, host: str | None = None, fail_on_error: bool = False) None #
Add or delete user data in the databrowser system.
Manage user data in the databrowser system by adding new data or deleting existing data.
For the “add” action, the user can provide data items (file paths or xarray datasets) along with metadata (key-value pairs) to categorize and organize the data.
For the “delete” action, the user provides metadata as search criteria to identify and remove the existing data from the system.
Parameters#
- action: Literal[“add”, “delete”]
The action to perform: “add” to add new data, or “delete” to remove existing data.
- userdata_items: List[Union[str, xr.Dataset]], optional
A list of user file paths or xarray datasets to add to the databrowser (required for “add”).
- metadata: Dict[str, str], optional
Key-value metadata pairs to categorize the data (for “add”) or search and identify data for deletion (for “delete”).
- host: str, optional
Override the host name of the databrowser server. This is usually the URL where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.
- fail_on_error: bool, optional
Make the call fail if the connection to the databrowser could not be established.
Raises#
- ValueError
If the operation fails or required parameters are missing for the specified action.
- FileNotFoundError
If no user data is provided for the “add” action.
Example#
Adding user data:
from freva_client import authenticate, databrowser
import xarray as xr
token_info = authenticate(username="janedoe")
filenames = (
    "../freva-rest/src/databrowser_api/mock/data/model/regional/cordex/output/EUR-11/"
    "GERICS/NCC-NorESM1-M/rcp85/r1i1p1/GERICS-REMO2015/v1/3hr/pr/v20181212/*.nc"
)
filename1 = (
    "../freva-rest/src/databrowser_api/mock/data/model/regional/cordex/output/EUR-11/"
    "CLMcom/MPI-M-MPI-ESM-LR/historical/r0i0p0/CLMcom-CCLM4-8-17/v1/fx/orog/v20140515/"
    "orog_EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_CLMcom-CCLM4-8-17_v1_fx.nc"
)
xarray_data = xr.open_dataset(filename1)
databrowser.userdata(
    action="add",
    userdata_items=[xarray_data, filenames],
    metadata={"project": "cmip5", "experiment": "myFavExp"}
)
1 have been successfully added to the databrowser. 1 files were duplicates and not added.
Deleting user data:
from freva_client import authenticate, databrowser
token_info = authenticate(username="janedoe")
databrowser.userdata(
    action="delete",
    metadata={"project": "cmip5", "experiment": "myFavExp"}
)
User data deleted successfully
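Since userdata raises different errors depending on the action, it can be convenient to check the arguments locally before contacting the server. The sketch below mirrors the documented error cases in a hypothetical helper, `check_userdata_args`, which is not part of freva_client. The rule that “delete” needs at least one metadata key is an assumption based on the description of the delete action.

```python
from typing import Dict, List, Optional


def check_userdata_args(
    action: str,
    userdata_items: Optional[List[object]] = None,
    metadata: Optional[Dict[str, str]] = None,
) -> None:
    """Hypothetical pre-flight check mirroring the documented error cases."""
    if action not in ("add", "delete"):
        raise ValueError(f"action must be 'add' or 'delete', got {action!r}")
    if action == "add" and not userdata_items:
        # Documented: missing user data for "add" raises FileNotFoundError.
        raise FileNotFoundError("no user data given for the 'add' action")
    if action == "delete" and not metadata:
        # Assumption: deletion needs at least one metadata search key.
        raise ValueError("metadata search keys are required for 'delete'")


# Both calls pass the check and return None.
check_userdata_args("add", ["data.nc"], {"project": "cmip5"})
check_userdata_args("delete", metadata={"project": "cmip5", "experiment": "myFavExp"})
```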