Define Environment Variables For Input Workers
In a Wafer-Scale Cluster execution, the input pipeline runs on input worker nodes within the cluster, which are separate processes started by the Appliance on different CPU nodes.
Overview
As such, any environment variables set on the user process are not seen by the input workers as they run on different nodes. But since input workers run user code (e.g., dataloader), it is desirable in some cases to have some environment variables set on user side to be visible by workers as well. This section describes how to set environment variables on the user node that are visible by the input workers that stream data to the system.
Procedure
For this, we provide a drop-in replacement for os.environ
with an identical interface. This object can be imported as follows:
The main difference of this object compared to os.environ
is that it tracks any environment variables set through it and sends them to the input workers. Setting an environment variable through appliance_environ
effectively stores the key and when an execute job is created, values of these environment variables are read and sent to the input workers. The input workers then see these variables as if they were set through os.environ
.
Important considerations
-
This object does not track environment variables that are not set through it. For example, if you set an environment variable through os.environ, while this object will see the value, it will not transfer this variable to the input workers.
-
These environment variables are sent to the Appliance when an execute job is requested. Setting variables after an execute job is created has no effect until the next execute job is created (if any).
-
Environment variables set throug appliance_environ are injected into the input workers after the worker process has been created. This means setting environment variables that need to exist before the Python interpreter is started will not work. For example, setting PYTHONPATH will not have the intended effect. This object is mainly for user-level environment variables set within a run.
Usage
Here’s an example of how to use this object:
Conclusion
The introduction of the specialized environment variable manager, appliance_environ
, provides a robust solution for propagating environment variables from the user node to input worker nodes in a Wafer-Scale Cluster setup. This functionality is essential for ensuring that user code running on input workers can access necessary environment variables, despite the inherent separation of these processes across different CPU nodes. By utilizing appliance_environ to set environment variables, users can ensure that their settings are consistently applied across all nodes involved in the execution, thereby maintaining the desired environment for data streaming and processing. This approach not only simplifies the configuration process but also enhances the reliability and efficiency of the input pipeline in distributed computing environments.