Tutorial#
The motivation for the pyiron_base workflow manager is to provide a self-consistent environment for the development of simulation protocols. Many pyiron_base users previously used the command line to manage their simulation protocols and construct parameter studies. However, given the limitations of coupling command-line approaches to machine learning models, pyiron_base is based on the Python programming language, and it is recommended to use the pyiron_base workflow manager in combination with Jupyter Lab for maximum efficiency. Consequently, the first challenge is guiding users who previously used the command line and now switch to using only Jupyter notebooks.
The three most fundamental objects to learn when interacting with the pyiron_base workflow manager are Jupyter notebooks, pyiron_base Project objects and pyiron_base Job objects. These are introduced below.
Jupyter Lab#
By default, Jupyter notebooks consist of three types of cells: code cells, raw cells and markdown cells. To execute Python code, select a code cell. A selected cell can then be executed using <shift> + <enter>. While it is technically possible to execute cells in a Jupyter notebook in arbitrary order, for reproducibility it is recommended to execute the individual cells in order.
To learn more about Jupyter notebooks, check out the Jupyter notebook documentation.
To start a Jupyter Lab session in a given directory, call the jupyter lab command:
jupyter lab
This opens a Jupyter Lab session in your web browser, typically on port 8888.
Alternatively, the pyiron_base workflow manager can also be used in a regular Python shell or Python script. Still, for most users the Jupyter Lab environment offers more flexibility and simplifies the development and documentation of simulation workflows.
Project#
To encourage rapid prototyping, the pyiron_base workflow manager requires just a single import statement. To start using pyiron_base, import the Project class:
from pyiron_base import Project
The Project class represents a folder on the file system. Instead of hiding all calculations in an abstract database, the pyiron_base workflow manager leverages the file system to store the results of calculations. To initialize a Project object instance, just provide a folder name. This folder is then created right next to the Jupyter notebook.
pr = Project(path="demonstration")
The project object instance is the central object of the pyiron_base workflow manager. All other object instances are created from it. The full path of the project object instance can be retrieved using the path property:
pr.path
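The returned value is the absolute path of the project folder. As an illustration only (the exact path depends on your file system and where the Jupyter notebook is stored, the path shown here is hypothetical), the output might look like:
>>> '/home/user/demonstration/'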
To accelerate the development of simulation protocols, the pyiron_base workflow manager benefits from <tab>-based autocompletion. If you type pr.p and press <tab>, the Jupyter notebook environment automatically completes your entry.
In addition to being represented as a folder on the file system, the project object instance is also connected to an SQL database. The job_table() function lists the job objects in a given project. By default, the pyiron_base workflow manager uses the SQL database to quickly generate a list of all job objects. Still, the pyiron_base workflow manager can also be installed without an SQL database; in that case the job_table() function generates the list of job objects by iterating over the file system.
pr.job_table()
For a freshly created project, the job table is expected to be an empty pandas.DataFrame. To get an overview of all the parameters of the job_table() function, the Jupyter Lab environment provides the question mark operator to look up the documentation:
pr.job_table?
Furthermore, with two question marks it is also possible to take a look at the source code of a given Python function:
pr.job_table??
Just as the job_table() function can be used to get a list of all job objects in a Project object instance, the remove_jobs() function can be used to delete all job objects in a project.
pr.remove_jobs()
In many cases it is also useful to remove just a single job object using the remove_job() function; job objects themselves are introduced in the section below.
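As a minimal sketch (the job name "job_test" is hypothetical and assumes such a job already exists in the project), a single job can be removed by passing its name:
pr.remove_job("job_test")  # hypothetical job name; removes only this job from the project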
Job#
The job object class is the second building block of the pyiron_base workflow manager. In a parameter study, each unique combination of a set of parameters is represented as a single job object. In the most typical cases this is either the call of a Python function or the call of an external executable, but the aggregation of results in a pyiron table is also represented as a job object. The advantage of representing all these different tasks as job objects is that job objects can be submitted to the queuing system to distribute the individual tasks over the computing resources of an HPC cluster.
Python Function#
For Python functions which run for several minutes or hours, it is essential to have a workflow manager. The pyiron_base workflow manager addresses this challenge by storing the input and output of each Python function call in an HDF5 file, which has the advantage that each function call can also be submitted to the queuing system of an HPC cluster. In case your function only requires a run time of several seconds or minutes, it is recommended to combine multiple function calls in a larger function and then submit this larger function as a pyiron job to the HPC cluster. This can be achieved by using the concurrent.futures.ProcessPoolExecutor inside the function to distribute a number of function calls over the CPU cores of a compute node. So the concurrent.futures.ProcessPoolExecutor is used for the task distribution inside a compute node, while the pyiron_base workflow manager handles the task distribution over multiple compute nodes and the submission to the queuing system.
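As a minimal sketch of this pattern (the function names evaluate and batch_evaluate and the number of workers are illustrative, not part of the pyiron_base API), a larger wrapper function could distribute many short calls over the cores of a single node and then be wrapped as one pyiron job:
from concurrent.futures import ProcessPoolExecutor

def evaluate(parameter):
    # stand-in for a function that runs for a few seconds to minutes
    return parameter ** 2

def batch_evaluate(parameters, max_workers=4):
    # distribute the individual evaluate() calls over the cores of one compute node
    with ProcessPoolExecutor(max_workers=max_workers) as exe:
        return list(exe.map(evaluate, parameters))

# batch_evaluate() could then be wrapped as a single pyiron job, e.g.
# job = pr.wrap_python_function(batch_evaluate)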
In the following example a very simple Python function is used:
def test_function(a, b=8):
    return a + b
The run time of this function is far below a millisecond, so it is not reasonable to submit it to a remote computing cluster. It serves primarily as a demonstration to highlight the capabilities of the pyiron_base workflow manager.
from pyiron_base import Project
pr = Project("test")
job = pr.wrap_python_function(test_function)
job.input["a"] = 4
job.input["b"] = 5
job.run()
job.output["result"]
>>> 9
After the import of the Project class, a Project instance connected to the folder test is created. Then the test_function() is wrapped as a job object. This allows the user to set the input using the job.input property of the job object. The individual input parameters can be accessed using square bracket notation, just like a Python dictionary. Finally, when the job.run() function is called, the job is executed. After successful execution the output can be accessed via the job.output property.
To submit the function call to a remote HPC cluster the server object of the job object can be configured:
job.server.queue = "slurm"
job.server.cores = 120
job.server.run_time = 3600 # in seconds
In this example the slurm queue was selected, with a total of 120 CPU cores and a run time of one hour. Still, it is important to mention that assigning 120 CPU cores does not by itself enable parallel execution of the Python function. Only by implementing internal parallelization inside the Python function, with solutions like concurrent.futures.ProcessPoolExecutor, is it possible to parallelize the execution on a single compute node. Finally, the pyiron developers released pympipool to enable the parallelization of Python functions as well as the direct assignment of GPU resources inside a given queuing system allocation over multiple compute nodes using the hierarchical queuing system flux.
External Executable#
As many scientific simulation codes do not have Python bindings, the pyiron_base workflow manager also supports the submission of external executables. In the pyiron_base workflow manager, external executables are interfaced using three components: a write_input() function, a collect_output() function and finally an executable string which is executed after the input files have been written. The write_input() function takes an input_dict dictionary and a working_directory as input parameters and writes the input files into the working directory.
import os

def write_input(input_dict, working_directory="."):
    with open(os.path.join(working_directory, "input_file"), "w") as f:
        f.write(str(input_dict["energy"]))
Analogously, the collect_output() function takes the working_directory as an input parameter and returns a dictionary of the output.
def collect_output(working_directory="."):
    with open(os.path.join(working_directory, "output_file"), "r") as f:
        return {"energy": float(f.readline())}
Once the write_input() and collect_output() functions are defined, the actual workflow can be defined, starting with the definition of the Project object instance, followed by creating the job class using the create_job_class() function. In this example the cat command is used to copy the energy value from the input file to the output file. Again, this is not a task which would typically be submitted to an HPC cluster; it is primarily a demonstration of how to create a job class based on an external executable plus a write_input() and collect_output() function.
from pyiron_base import Project
pr = Project(path="test")
pr.create_job_class(
    class_name="CatJob",
    write_input_funct=write_input,
    collect_output_funct=collect_output,
    default_input_dict={"energy": 1.0},
    executable_str="cat input_file > output_file",
)
job = pr.create.job.CatJob(job_name="job_test")
job.input["energy"] = 2.0
job.run()
job.output["energy"]
>>> 2.0
In analogy to the job objects for Python functions, the job objects for external executables can also be submitted to the queuing system by configuring the same job.server property. Still, it is important to update the executable string executable_str to use mpiexec or other means of parallelization to execute the external executable in parallel.
If the executable referenced in executable_str supports multiple cores, multiple threads or GPU acceleration, then the assigned resources can be accessed via the environment variables PYIRON_CORES for the number of CPU cores, PYIRON_THREADS for the number of threads and PYIRON_GPUS for the number of GPUs. So you can wrap MPI-parallel executables using mpirun -n ${PYIRON_CORES} executable and then set the number of cores on the job object using job.server.cores=10. Alternatively, you can create a shell script like executable.sh:
#!/bin/bash
mpirun -n ${PYIRON_CORES} executable
This shell script can then be set as the executable_str.
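As a sketch of how this could be registered (the class name MpiJob and the job name mpi_test are hypothetical; it is assumed that the write_input() and collect_output() functions from above are reused and that executable.sh is accessible from the job's working directory):
pr.create_job_class(
    class_name="MpiJob",  # hypothetical class name
    write_input_funct=write_input,
    collect_output_funct=collect_output,
    default_input_dict={"energy": 1.0},
    executable_str="bash executable.sh",
)
job = pr.create.job.MpiJob(job_name="mpi_test")
job.server.cores = 10  # exported to the executable as PYIRON_CORES
job.run()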
Jupyter Notebook#
The third category of job objects the pyiron_base workflow manager supports is the ScriptJob, which is used to submit Jupyter notebooks to the queuing system. While it is recommended to use wrap_python_function() to wrap the Python commands, some users prefer to submit a whole Jupyter notebook at once. For this purpose the pyiron_base workflow manager introduces the ScriptJob job type:
from pyiron_base import Project
pr = Project(path="demo")
script = pr.create.job.ScriptJob(job_name="script")
script.script_path = "demo.ipynb"
script.input['my_variable'] = 5
script.run()
A ScriptJob job takes a Jupyter notebook as input via script.script_path as well as a series of input parameters via script.input. These input parameters can then be accessed inside the Jupyter notebook using the get_external_input() function. So an example demo.ipynb Jupyter notebook could include the following code:
from pyiron_base import Project
pr = Project(path="script")
external_input = pr.get_external_input()
external_input
>>> {"my_variable": 5}
Table#
The third category of central objects in the pyiron_base workflow manager, next to the project object and the job objects, is the pyiron table object. While technically the pyiron table object is just another job object, its primary application is to gather the results of the job objects in a given project. The pyiron table object follows the map-reduce pattern.
The pyiron table object takes three kinds of inputs. The first is a filter function which is used to identify which jobs the subsequent functions are applied to. The filter function is not mandatory, but it is very helpful, in particular when a large number of jobs are created in a given project.
def myfilter(job):
    return "test_function" in job.job_name
The second part is the selection of functions which are applied to all job objects in a given project. Again, each of these functions takes a job object as input. Each job is represented as a row in the pandas.DataFrame created by the pyiron table and each function represents a column.
def len_job_name(job):
    return len(job.job_name)
The third part is the project the pyiron table is applied to. This makes it possible to store the pyiron table in a different project than the one the job objects are located in. By default, the analysis is applied to the project the pyiron table object is created in.
from pyiron_base import Project
pr = Project(path="demo")
table = pr.create_table()
table.filter_function = myfilter
table.add["len"] = len_job_name
# table.analysis_project = pr
table.run()
table.get_dataframe()
After the definition of the optional filter function, the column functions and the analysis project, the table is executed just like a job object. This means that the pyiron table can also be submitted to the queuing system for large map-reduce tasks. Finally, the pyiron table collects the results as a pandas.DataFrame, so the results can be directly used in machine learning models or data analysis.
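As a small illustrative sketch of this last step (assuming jobs matching the filter exist in the project and that the column "len" corresponds to the len_job_name() function registered via table.add["len"] above), the resulting DataFrame can be analysed with standard pandas tools:
df = table.get_dataframe()
print(df["len"].describe())  # simple pandas statistics as a stand-in for further analysis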