Resource Manager¶
With ResourceManager, VASCA provides a utility that helps manage the raw input
data. The data volumes processed by VASCA are generally large, and use cases as well
as computation and storage resources can vary. ResourceManager adds an
abstraction layer that is flexible enough to adapt to varying contexts while exposing a
consistent API to the rest of VASCA’s pipeline functions.
As an example, the GALEX data for the proof-of-principle study was processed by downloading raw data from MAST to a local directory and running the pipeline on a regular office laptop. This directory was then cloud-synced via DESY’s NextCloud service so that multiple users could work collaboratively on the same dataset.
Another use case is the unit testing for this package as well as this tutorial, which should both work in a development environment and in the GitHub continuous integration workflows.
Configuration¶
Configuration files are used to specify file locations, environment variables and even
specific data products that are relevant for the processing of a specific instrument’s
raw data. Users can freely edit these files to add data location items as the
use case requires. ResourceManager runs consistency checks and warns
about any misconfiguration. So try it out!
.env
¶
Text file located at the root directory of the package. The resource manager reads it
at initialization and uses dotenv
to set the environment variables temporarily at run time. See .env_template
when using VASCA for the first time.
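To illustrate what setting variables "temporarily at run time" means, here is a minimal sketch using only the standard library. It is a simplified stand-in for what dotenv does, not VASCA's actual loading code. The variable name VASCA_DEFAULT is taken from the example output on this page; the helper `read_env_file` and the path value are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Write a throwaway .env file; the path value is only a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("# VASCA storage roots\nVASCA_DEFAULT=/path/to/vasca\n")
    env_values = read_env_file(env_file)

# Variables set this way only live as long as the current Python process.
os.environ.update(env_values)
print(os.environ["VASCA_DEFAULT"])
```

Because the variables are only set inside the running process, nothing is written to the shell configuration, and a different .env file can be used per checkout or per machine.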
resource_envs.yml
¶
Configuration file specifying the required environment variables and their associated attributes, such as a name, a project name and a short description that helps other users understand what each variable is used for.
resource_catalog.yml
¶
Configuration file that associates directory or file items to specific environment variables. Each item has a name, description, type, and path attribute.
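The kind of consistency check described above can be sketched on plain dictionaries shaped like the parsed metadata shown in the example on this page. This is an illustration under assumptions, not ResourceManager's actual validation: the function `check_metadata`, the set of allowed types, and the deliberately broken item are hypothetical.

```python
# Attributes every catalog item is expected to carry (from the description above)
REQUIRED_ITEM_KEYS = {"name", "description", "type", "path"}
# Assumption: only these two item types occur, as in the example output
ALLOWED_TYPES = {"file", "directory"}


def check_metadata(catalog, envs):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    env_storages = {attrs.get("storage") for attrs in envs.values()}
    for storage, items in catalog.items():
        # Every storage system needs an environment variable pointing to it
        if storage not in env_storages:
            problems.append(f"storage '{storage}' has no environment variable")
        for item_id, item in items.items():
            missing = REQUIRED_ITEM_KEYS - set(item)
            if missing:
                problems.append(f"{storage}:{item_id} is missing {sorted(missing)}")
            elif item["type"] not in ALLOWED_TYPES:
                problems.append(
                    f"{storage}:{item_id} has unknown type '{item['type']}'"
                )
    return problems


catalog = {
    "vasca": {
        0: {
            "name": "test_resources",
            "description": "Data used for VASCA development",
            "type": "directory",
            "path": "/vasca/test/resources",
        },
        1: {"name": "broken_item", "type": "file"},  # deliberately incomplete
    }
}
envs = {"VASCA_DEFAULT": {"storage": "vasca", "project": "vasca"}}

problems = check_metadata(catalog, envs)
for problem in problems:
    print(problem)
```

Collecting all problems in a list, rather than stopping at the first one, lets a user fix a freshly edited configuration file in a single pass.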
Note
The YAML configuration files are stored under the vasca module in a subdirectory named resource_metadata.
Example¶
Initialize the ResourceManager and see what metadata it parsed from the config files.
from pprint import pprint
from vasca.resource_manager import ResourceManager
rm = ResourceManager()
# Resource item catalog
pprint(rm.metadata["catalog"], sort_dicts=False)
{'sas_cloud': {0: {'name': 'gal_visits_list',
'description': 'Complete list of all GALEX visits with NUV '
'exposure.',
'type': 'file',
'path': '/GALEX_visits_list/GALEX_visits_list_qualvars.fits'},
1: {'name': 'gal_fields',
'description': 'Collection of GALEX dataproducts for fileds '
'of interest.',
'type': 'directory',
'path': '/GALEX_fields'},
2: {'name': 'gal_gphoton',
'description': 'Collection of GALEX gphoton runs.',
'type': 'directory',
'path': '/GALEX_gPhoton'},
3: {'name': 'gal_visits_list_qualsel',
'description': 'Complete list of all GALEX visits.',
'type': 'file',
'path': '/GALEX_visits_list/GALEX_visits_list_qualsel.fits'}},
'lustre': {0: {'name': 'gal_ds_visits_list',
'description': 'Complete list of all GALEX drift-scan visits.',
'type': 'file',
'path': '/GALEX_DS_GCK_visits_list/GALEX_DS_GCK_visits_list.fits'},
1: {'name': 'gal_ds_fields',
'description': 'Collection of GALEX drift scan dataproducts',
'type': 'directory',
'path': '/GALEX_DS_GCK_fields'},
2: {'name': 'gal_gphoton',
'description': 'Collection of GALEX gphoton runs.',
'type': 'directory',
'path': '/GALEX_gPhoton'}},
'vasca': {0: {'name': 'test_resources',
'description': 'Data used for VASCA development',
'type': 'directory',
'path': '/vasca/test/resources'},
1: {'name': 'gal_visits_list',
'description': 'Complete list of all GALEX visits with NUV '
'exposure.',
'type': 'file',
'path': '/vasca/test/resources/GALEX_visits_list.fits'},
2: {'name': 'docs_resources',
'description': 'Data used for VASCA documentation',
'type': 'directory',
'path': '/docs/tutorial_resources'}}}
# Resource environment variables
pprint(rm.metadata["envs"], sort_dicts=False)
{'VASCA_DEFAULT': {'storage': 'vasca',
'project': 'vasca',
'description': 'Default environment variable for data '
'management in the VASCA package',
'set': True,
'path': '/home/runner/work/vasca/vasca'},
'UC_SAS_VASCARCAT': {'storage': 'sas_cloud',
'project': 'vascarcat',
'description': 'Resources for the UV variability catalog '
'project on DESY Sync & Share. Remote '
'directory: '
'ULTRASAT-data/uc_science/vascarcat',
'set': True,
'path': '/dev/null'},
'UC_LUSTRE_VASCARCAT': {'storage': 'lustre',
'project': 'vascarcat',
'description': 'Resoruces for the UV variability '
'catalog project on DESY LUSTRE. Not '
'used at the moment.',
'set': True,
'path': '/dev/null'}}
The main functionality is retrieving paths to specific resource items from the ResourceManager:
rpath = rm.get_path(resource="gal_visits_list", storage="vasca")
print(rpath)
/home/runner/work/vasca/vasca/vasca/test/resources/GALEX_visits_list.fits
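Comparing this result with the metadata printed earlier suggests that the returned path is the storage root from the environment variable followed by the item's catalog path. The sketch below reproduces that composition on a small excerpt of the metadata. The function `resolve_path` is hypothetical and simplified; the real get_path performs additional validation.

```python
def resolve_path(metadata, resource, storage):
    """Join a storage root with a catalog item's path (simplified sketch)."""
    catalog = metadata["catalog"]
    if storage not in catalog:
        raise KeyError(f"Unknown storage system '{storage}'.")
    # Find the root path of the environment variable tied to this storage
    root = next(
        (
            env["path"]
            for env in metadata["envs"].values()
            if env["storage"] == storage
        ),
        None,
    )
    if root is None:
        raise KeyError(f"No environment variable for storage '{storage}'.")
    for item in catalog[storage].values():
        if item["name"] == resource:
            # Catalog paths start with '/', so plain concatenation is enough
            return root + item["path"]
    raise KeyError(f"Unknown resource '{resource}'.")


# Excerpt of the metadata shown in the cell outputs above
metadata = {
    "catalog": {
        "vasca": {
            1: {
                "name": "gal_visits_list",
                "type": "file",
                "path": "/vasca/test/resources/GALEX_visits_list.fits",
            }
        }
    },
    "envs": {
        "VASCA_DEFAULT": {"storage": "vasca", "path": "/home/runner/work/vasca/vasca"}
    },
}

resolved = resolve_path(metadata, "gal_visits_list", "vasca")
print(resolved)
```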
All paths returned by get_path are verified, so the returned file or directory can be expected to exist:
from pathlib import Path
Path(rpath).exists()
True
Otherwise, an actionable error message is raised.
Resource name not found:
rm.get_path(resource="foo", storage="vasca")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[7], line 1
----> 1 rm.get_path(resource="foo", storage="vasca")
File /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/vasca/resource_manager.py:323, in ResourceManager.get_path(self, resource, storage)
321 # validate resource name
322 if resource not in resource_list:
--> 323 raise KeyError(
324 str.format(
325 "Unknown resource '{}'. Select one from {}.",
326 resource,
327 resource_list_verbose,
328 )
329 )
331 # get resource metadata
332 success = False
KeyError: "Unknown resource 'foo'. Select one from ['gal_visits_list(sas_cloud:0)', 'gal_fields(sas_cloud:1)', 'gal_gphoton(sas_cloud:2)', 'gal_visits_list_qualsel(sas_cloud:3)', 'gal_ds_visits_list(lustre:0)', 'gal_ds_fields(lustre:1)', 'gal_gphoton(lustre:2)', 'test_resources(vasca:0)', 'gal_visits_list(vasca:1)', 'docs_resources(vasca:2)']."
Storage system not recognized:
rm.get_path(resource="gal_visits_list", storage="foo")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[8], line 1
----> 1 rm.get_path(resource="gal_visits_list", storage="foo")
File /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/vasca/resource_manager.py:300, in ResourceManager.get_path(self, resource, storage)
298 # validate storage name
299 if storage not in self.metadata["catalog"]:
--> 300 raise KeyError(
301 str.format(
302 ("Unknown storage system '{}'. Select one from {}."),
303 storage,
304 [strg for strg in list(self.metadata["catalog"].keys())],
305 )
306 )
307 # list of all known resources: [<resource name>]
308 resource_list = [
309 self.metadata["catalog"][strg][id]["name"]
310 for strg in self.metadata["catalog"]
311 for id in self.metadata["catalog"][strg]
312 ]
KeyError: "Unknown storage system 'foo'. Select one from ['sas_cloud', 'lustre', 'vasca']."
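Since both failures above surface as KeyError, calling code can catch that exception and try another storage system. The following is a minimal sketch of that pattern, assuming a lookup failure is always a KeyError; other exception types are not handled. The helper `first_available_path` and the stand-in `fake_get_path` are hypothetical; in practice `rm.get_path` would be passed instead.

```python
def first_available_path(get_path, resource, storages):
    """Return (storage, path) for the first storage where the lookup succeeds."""
    errors = []
    for storage in storages:
        try:
            return storage, get_path(resource=resource, storage=storage)
        except KeyError as err:
            # Remember the message so the final error stays actionable
            errors.append(str(err))
    raise KeyError(f"'{resource}' not found in any of {list(storages)}: {errors}")


def fake_get_path(resource, storage):
    """Stand-in for ResourceManager.get_path with a single known item."""
    known = {("gal_visits_list", "vasca"): "/tmp/GALEX_visits_list.fits"}
    if (resource, storage) not in known:
        raise KeyError(f"Unknown combination '{resource}' / '{storage}'.")
    return known[(resource, storage)]


found_storage, found_path = first_available_path(
    fake_get_path, "gal_visits_list", ["lustre", "vasca"]
)
print(found_storage, found_path)
```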