pyiron_base.storage.flattenedstorage.FlattenedStorage#
- class pyiron_base.storage.flattenedstorage.FlattenedStorage(num_chunks=1, num_elements=1, lock_method='error', **kwargs)[source]#
Bases: Lockable, HasDictfromHDF, HasHDF
Efficient storage of ragged arrays in flattened arrays.
This class stores multiple arrays at the same time. Storage is organized in “chunks” that may be of any size, but all arrays within a chunk are of the same size, e.g.
>>> a = [ [1], [2, 3], [4, 5, 6] ]
>>> b = [ [2], [4, 6], [8, 10, 12] ]
are stored in three chunks as
>>> a_flat = [ 1, 2, 3, 4, 5, 6 ]
>>> b_flat = [ 2, 4, 6, 8, 10, 12 ]
with additional metadata to indicate where the boundaries of each chunk are.
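The layout described above can be modelled in a few lines of plain Python. This is only an illustration of the idea, not the pyiron_base implementation; the names `starts`, `lengths` and `get_chunk` are hypothetical.

```python
# Illustration only: a plain-Python model of the flattened layout,
# not the actual pyiron_base implementation.
ragged = [[1], [2, 3], [4, 5, 6]]

flat = []      # all values stored back to back
starts = []    # hypothetical metadata: offset of each chunk in `flat`
lengths = []   # hypothetical metadata: number of elements in each chunk
for chunk in ragged:
    starts.append(len(flat))
    lengths.append(len(chunk))
    flat.extend(chunk)


def get_chunk(index):
    """Recover one chunk from the flat list using the boundary metadata."""
    begin = starts[index]
    return flat[begin:begin + lengths[index]]


print(flat)          # [1, 2, 3, 4, 5, 6]
print(starts)        # [0, 1, 3]
print(get_chunk(1))  # [2, 3]
```

Keeping the values contiguous means a whole array can be read in one slice, while the per chunk metadata is enough to recover the ragged structure.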
First add arrays and chunks like this
>>> import numpy as np
>>> store = FlattenedStorage()
>>> store.add_array("even", dtype=np.int64)
>>> store.add_chunk(1, even=[2])
>>> store.add_chunk(2, even=[4, 6])
>>> store.add_chunk(3, even=[8, 10, 12])
where the first argument indicates the length of each chunk. You may retrieve stored values like this
>>> store.get_array("even", 1)
array([4, 6])
>>> store.get_array("even", 0)
array([2])
where the second argument is an integer index in the order of insertion. After initial storage you may modify arrays.
>>> store.set_array("even", 0, [0])
>>> store.get_array("even", 0)
array([0])
As a shorthand you can use regular index syntax
>>> store["even", 0] = [2]
>>> store["even", 0]
array([2])
>>> store["even", 1]
array([4, 6])
>>> store["even"]
array([ 2,  4,  6,  8, 10, 12])
>>> store["even", 0] = [0]
You can add arrays to the storage even after you have already added other arrays and chunks.
>>> store.add_array("odd", dtype=np.int64, fill=0)
>>> store.get_array("odd", 1)
array([0, 0])
>>> store.set_array("odd", 0, [1])
>>> store.set_array("odd", 1, [3, 5])
>>> store.set_array("odd", 2, [7, 9, 11])
>>> store.get_array("odd", 2)
array([ 7,  9, 11])
Because the second chunk is already known to be of length two and fill was specified, the ‘odd’ array has been appropriately allocated.
Arrays may also have only one value per chunk (“per chunk”; the previous examples are “per element”).
>>> store.add_array("sum", dtype=np.int64, per="chunk")
>>> for i in range(len(store)):
...     store.set_array("sum", i, sum(store.get_array("even", i) + store.get_array("odd", i)))
>>> store.get_array("sum", 0)
1
>>> store.get_array("sum", 1)
18
>>> store.get_array("sum", 2)
57
Finally you may add multiple arrays in one call to add_chunk() by using keyword arguments
>>> store.add_chunk(4, even=[14, 16, 18, 20], odd=[13, 15, 17, 19], sum=132)
>>> store.get_array("sum", 3)
132
>>> store.get_array("even", 3)
array([14, 16, 18, 20])
It is usually not necessary to call add_array() before add_chunk(); the type of the array will be inferred in this case.
If you skip the frame argument to get_array() it will return a flat array of all the values for that array in storage.
>>> store.get_array("sum")
array([  1,  18,  57, 132])
>>> store.get_array("even")
array([ 0,  4,  6,  8, 10, 12, 14, 16, 18, 20])
Arrays may be of more complicated shape, too; see add_array() for details.
Use copy() to obtain a deep copy of the storage; for shallow copies the builtin copy.copy is sufficient.
>>> copy = store.copy()
>>> copy["even", 0]
array([0])
>>> copy["even", 1]
array([4, 6])
>>> copy["even"]
array([ 0,  4,  6,  8, 10, 12, 14, 16, 18, 20])
Storages can be split with split() and joined again with join() as long as their internal chunk structure is consistent, i.e. same number of chunks and same chunk lengths. If this is not the case a ValueError is raised.
>>> even = store.split(["even"])
>>> bool(even.has_array("even"))
True
>>> bool(even.has_array("odd"))
False
>>> odd = store.split(["odd"])
join() adds new arrays to the storage it is called on in-place. To leave it unchanged, simply call copy() before join().
>>> both = even.copy().join(odd)
Chunks may be given string names, either by passing identifier to add_chunk() or by setting the special per chunk array “identifier”
>>> store.set_array("identifier", 1, "second")
>>> all(store.get_array("even", "second") == store.get_array("even", 1))
True
When adding new arrays follow the convention that per-structure arrays should be named in singular and per-atom arrays should be named in plural.
You may initialize flattened storage objects with ragged lists or numpy arrays of dtype object
>>> even = [ list(range(0, 2, 2)), list(range(2, 6, 2)), list(range(6, 12, 2)) ]
>>> even
[[0], [2, 4], [6, 8, 10]]
>>> import numpy as np
>>> odd = np.array([ np.arange(1, 2, 2), np.arange(3, 6, 2), np.arange(7, 12, 2) ], dtype=object)
>>> odd
array([array([1]), array([3, 5]), array([ 7,  9, 11])], dtype=object)
>>> store = FlattenedStorage(even=even, odd=odd)
>>> store.get_array("even", 1)
array([2, 4])
>>> store.get_array("odd", 2)
array([ 7,  9, 11])
>>> len(store)
3
You can set storages as read-only via methods defined on Lockable.
>>> store.lock()
>>> store.get_array("even", 0)
array([0])
>>> store.set_array("even", 0, np.array([4]))  # rejected while locked
>>> store.get_array("even", 0)
array([0])
>>> with store.unlocked():
...     store.set_array("even", 0, np.array([4]))
>>> store.get_array("even", 0)
array([4])
- __init__(num_chunks=1, num_elements=1, lock_method='error', **kwargs)[source]#
Create new flattened storage.
- Parameters:
num_chunks (int) – number of chunks to pre-allocate in per chunk arrays
num_elements (int) – number of elements to pre-allocate in per element arrays
Methods
__init__([num_chunks, num_elements, lock_method]) – Create new flattened storage.
add_array(name[, shape, dtype, fill, per]) – Add a custom array to the container.
add_chunk(chunk_length[, identifier]) – Add a new chunk to the storage.
copy() – Return a deep copy of the storage.
del_array(name[, ignore_missing]) – Remove an array.
extend(other) – Add chunks from other to this storage.
find_chunk(identifier) – Return integer index for given identifier.
from_dict(obj_dict[, version]) – Populate the object from the serialized object.
from_hdf(hdf[, group_name]) – Read object from HDF.
from_hdf_args(hdf) – Read arguments for instance creation from HDF5 file.
get_array(name[, frame]) – Fetch array for given structure.
get_array_filled(name) – Return elements of array name in all chunks.
get_array_ragged(name) – Return elements of array name in all chunks.
has_array(name) – Check whether an array of the given name exists and return the metadata given to add_array().
instantiate(obj_dict[, version]) – Create a blank instance of this class.
join(store[, lsuffix, rsuffix]) – Merge given storage into this one.
list_arrays([only_user]) – Return a list of names of arrays inside the storage.
lock([method]) – Set read_only.
rewrite_hdf(hdf[, group_name]) – Update the HDF representation.
sample(selector) – Create a new storage with chunks selected by given function.
set_array(name, frame, value) – Add array for given structure.
split(array_names) – Return a new storage with only the selected arrays present.
to_dict() – Reduce the object to a dictionary.
to_hdf(hdf[, group_name]) – Write object to HDF.
to_pandas([explode, include_index]) – Convert arrays to pandas dataframe.
unlocked() – Unlock the object temporarily.
Attributes
read_only – False if the object can currently be written to
- add_array(name, shape=(), dtype=<class 'numpy.float64'>, fill=None, per='element')[source]#
Add a custom array to the container.
When an array is added after some chunks already exist, fill is used as the default value of the array for those chunks.
Adding an array with the same name twice is ignored if dtype and shape match; otherwise a ValueError is raised.
>>> store = FlattenedStorage()
>>> store.add_chunk(1, "foo")
>>> store.add_array("energy", shape=(), dtype=np.float64, fill=42, per="chunk")
>>> store.get_array("energy", 0)
42.0
- Parameters:
name (str) – name of the new array
shape (tuple of int) – shape of the new array per element or chunk; scalars can pass ()
dtype (type) – data type of the new array, string arrays can pass ‘U$n’ where $n is the length of the string
fill (object) – populate the new array with this value for existing chunks, if given; default None
per (str) – either “element” or “chunk”; denotes whether the new array should exist for every element in a chunk or only once for every chunk; case-insensitive
- Raises:
ValueError – if wrong value for per is given
ValueError – if array with same name but different parameters exists already
- add_chunk(chunk_length, identifier=None, **arrays)[source]#
Add a new chunk to the storage.
Additional keyword arguments given specify arrays to store for the chunk. If an array with the given keyword name does not exist yet, it will be added to the container.
>>> container = FlattenedStorage()
>>> container.add_chunk(2, identifier="A", energy=3.14)
>>> container.get_array("energy", 0)
3.14
If the first axis of the extra array matches the length of the chunk, it will be added as a per element array, otherwise as a per chunk array.
>>> container.add_chunk(2, identifier="B", forces=2 * [[0,0,0]])
>>> len(container.get_array("forces", 1)) == 2
True
Reshaping the array so that its first axis has length 1 forces the array to be set as a per chunk array. That axis will then be stripped.
>>> container.add_chunk(2, identifier="C", pressure=np.eye(3)[np.newaxis, :, :])
>>> container.get_array("pressure", 2).shape
(3, 3)
Attention
Edge-case!
This will not work when the chunk length is also 1 and the array does not exist yet! In this case the array will be assumed to be per element and there is no way around explicitly calling add_array().
- Parameters:
chunk_length (int) – length of the new chunk
identifier (str, optional) – human-readable name for the chunk, if None use current chunk index as string
**arrays – additional arrays to store for the chunk
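The inference rule for new arrays passed to add_chunk() can be restated as a small decision function. The sketch below is a plain-Python paraphrase of the documented rule, including the chunk-length-1 edge case; `infer_per` is a hypothetical helper and not part of pyiron_base.

```python
def infer_per(chunk_length, value):
    """Guess "element" or "chunk" for a new array (hypothetical helper).

    Mirrors the documented rule: a first axis equal to the chunk length
    means per element, anything else means per chunk.
    """
    try:
        first_axis = len(value)
    except TypeError:
        return "chunk"  # scalars have no first axis
    return "element" if first_axis == chunk_length else "chunk"


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(infer_per(2, 3.14))             # chunk
print(infer_per(2, 2 * [[0, 0, 0]]))  # element
print(infer_per(2, [identity]))       # chunk: first axis has length 1
print(infer_per(1, [identity]))       # element: the edge case described above
```

The last call shows why a chunk of length 1 is ambiguous: the extra leading axis of length 1 is indistinguishable from a per element array, so the array has to be declared with add_array() first.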
- del_array(name: str, ignore_missing: bool = False)[source]#
Remove an array.
Works with both per chunk and per element arrays.
- Parameters:
name (str) – name of the array
ignore_missing (bool) – if True do not raise an error if no array of the given name exists
- Raises:
KeyError – if no array with given name exists and ignore_missing is False
- extend(other: FlattenedStorage)[source]#
Add chunks from other to this storage.
Afterwards the number of chunks and elements are the sum of the respective previous values.
If other defines new arrays or doesn’t define some of the arrays they are padded by the fill values.
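As a rough picture of the padding behaviour, the following sketch concatenates two sets of per chunk values and fills the gaps for arrays that only one side defines. It is a simplified plain-Python model, not the library code; `extend_sketch` and the fill value of 0 are assumptions made for illustration.

```python
def extend_sketch(mine, other, n_mine, n_other, fill=0):
    """Concatenate per chunk values, padding arrays missing on one side.

    `mine` and `other` map array names to one value per chunk; `fill` is an
    assumed padding value chosen for this example.
    """
    names = list(mine) + [name for name in other if name not in mine]
    return {
        name: mine.get(name, [fill] * n_mine) + other.get(name, [fill] * n_other)
        for name in names
    }


first = {"energy": [1.0, 2.0]}              # two chunks
second = {"energy": [3.0], "charge": [7]}   # one chunk, one extra array

combined = extend_sketch(first, second, n_mine=2, n_other=1)
print(combined)  # {'energy': [1.0, 2.0, 3.0], 'charge': [0, 0, 7]}
```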
- Parameters:
other (FlattenedStorage) – other storage to add
- Raises:
ValueError – if fill values between both storages are not compatible
- Returns:
this storage
- Return type:
FlattenedStorage
- find_chunk(identifier)[source]#
Return integer index for given identifier.
- Parameters:
identifier (str) – name of chunk previously passed to add_chunk()
- Returns:
integer index for chunk
- Return type:
int
- Raises:
KeyError – if identifier is not found in storage
- from_dict(obj_dict: dict, version: str = None)#
Populate the object from the serialized object.
- Parameters:
obj_dict (dict) – data previously returned from to_dict()
version (str) – version tag written together with the data
- from_hdf(hdf: ProjectHDFio, group_name: str = None)#
Read object from HDF.
If group_name is given descend into subgroup in hdf first.
- Parameters:
hdf (ProjectHDFio) – HDF group to read from
group_name (str, optional) – name of subgroup
- classmethod from_hdf_args(hdf: ProjectHDFio) dict#
Read arguments for instance creation from HDF5 file.
- Parameters:
hdf (ProjectHDFio) – HDF5 group object
- Returns:
arguments that can be **kwarg-passed to cls().
- Return type:
dict
- get_array(name, frame=None)[source]#
Fetch array for given chunk.
Works for per chunk and per element arrays.
- Parameters:
name (str) – name of the array to fetch
frame (int, str, optional) – selects structure to fetch, as in get_structure(); if not given return a flat array of all values for either all chunks or elements
- Returns:
requested array
- Return type:
numpy.ndarray
- Raises:
KeyError – if array with name does not exist
- get_array_filled(name: str) ndarray[source]#
Return elements of array name in all chunks. Arrays are padded to be all of the same length.
The padding value depends on the datatype of the array or can be configured via the fill parameter of add_array().
If name specifies a per chunk array, there’s nothing to pad and this method is equivalent to get_array().
- Parameters:
name (str) – name of array to fetch
- Returns:
padded array of all elements in all chunks
- Return type:
numpy.ndarray
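To make the padding concrete, here is a minimal plain-Python sketch that pads ragged chunks to a common length. It illustrates the idea only; `pad_chunks` is a hypothetical helper and the fill value of -1 is an arbitrary choice for the example, not the library's default.

```python
def pad_chunks(chunks, fill):
    """Pad every chunk with `fill` up to the length of the longest chunk."""
    longest = max((len(chunk) for chunk in chunks), default=0)
    return [list(chunk) + [fill] * (longest - len(chunk)) for chunk in chunks]


padded = pad_chunks([[2], [4, 6], [8, 10, 12]], fill=-1)
print(padded)  # [[2, -1, -1], [4, 6, -1], [8, 10, 12]]
```

The result is rectangular (number of chunks by longest chunk length), which is what allows it to be returned as a regular array instead of one of dtype object.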
- get_array_ragged(name: str) ndarray[source]#
Return elements of array name in all chunks. Values are returned in a ragged array of dtype=object.
If name specifies a per chunk array, there’s nothing to pad and this method is equivalent to get_array().
- Parameters:
name (str) – name of array to fetch
- Returns:
ragged array of all elements in all chunks
- Return type:
numpy.ndarray, dtype=object
- has_array(name)[source]#
Checks whether an array of the given name exists and returns meta data given to add_array().
>>> container.has_array("energy")
{'shape': (), 'dtype': np.float64, 'per': 'chunk'}
>>> container.has_array("fnorble")
None
- Parameters:
name (str) – name of the array to check
- Returns:
None if array does not exist; dict if array exists, with keys corresponding to the shape, dtype and per arguments of add_array()
- Return type:
None or dict
- classmethod instantiate(obj_dict: dict, version: str = None) Self#
Create a blank instance of this class.
This can be used when some values are already necessary for the object's __init__.
- Parameters:
obj_dict (dict) – data previously returned from to_dict()
version (str) – version tag written together with the data
- Returns:
a blank instance of the object that is sufficiently initialized to call _from_dict() on it
- Return type:
object
- join(store: FlattenedStorage, lsuffix: str = '', rsuffix: str = '') FlattenedStorage[source]#
Merge given storage into this one.
self and store may not share any array names unless lsuffix or rsuffix is given. Arrays defined on store are copied and then added to self.
- Parameters:
store (FlattenedStorage) – storage to join
lsuffix (str, optional) – if either suffix is given, rename all arrays by appending the suffixes to the array names: lsuffix for arrays in this storage, rsuffix for arrays in the added storage; renamed arrays are no longer available under the old name
rsuffix (str, optional) – see lsuffix
- Returns:
self
- Return type:
FlattenedStorage
- Raises:
ValueError – if the two stores do not have the same number of chunks
ValueError – if the two stores do not have equal chunk lengths
ValueError – if lsuffix and rsuffix are equal and different from “”
ValueError – if the stores share array names but lsuffix and rsuffix are not given
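The four error conditions listed above can be traced in a simplified model where each storage is a dictionary of array names plus a list of chunk lengths. This is a sketch that follows the parameter description literally (every array name receives its suffix); `join_sketch` is hypothetical and the library may differ in detail.

```python
def join_sketch(left, right, left_lengths, right_lengths, lsuffix="", rsuffix=""):
    """Merge two name -> values mappings after the documented checks."""
    if len(left_lengths) != len(right_lengths):
        raise ValueError("stores have a different number of chunks")
    if list(left_lengths) != list(right_lengths):
        raise ValueError("stores have different chunk lengths")
    if lsuffix == rsuffix != "":
        raise ValueError("lsuffix and rsuffix must differ")
    # Assumption: suffixes are appended to every array name on each side.
    renamed_left = {name + lsuffix: values for name, values in left.items()}
    renamed_right = {name + rsuffix: list(values) for name, values in right.items()}
    clashes = renamed_left.keys() & renamed_right.keys()
    if clashes:
        raise ValueError(f"stores share array names: {sorted(clashes)}")
    return {**renamed_left, **renamed_right}


lengths = [1, 2]
even = {"values": [2, 4, 6]}
odd = {"values": [1, 3, 5]}

joined = join_sketch(even, odd, lengths, lengths, lsuffix="_even", rsuffix="_odd")
print(sorted(joined))  # ['values_even', 'values_odd']

try:
    join_sketch(even, odd, lengths, lengths)
except ValueError as error:
    print(error)  # stores share array names: ['values']
```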
- list_arrays(only_user=False) List[str][source]#
Return a list of names of arrays inside the storage.
- Parameters:
only_user (bool) – If True include only array names added by the user via add_array() and the identifier array.
- Returns:
array names
- Return type:
list of str
- lock(method: Literal['error', 'warning'] | None = None)#
Set read_only.
Objects may be safely locked multiple times without further effect.
- Parameters:
method (str, either "error" or "warning") – if “error” raise a Locked exception if modification is attempted; if “warning” issue a LockedWarning warning; default is “error” or the value passed to the constructor.
- Raises:
ValueError – if method is not an allowed value
- property read_only: bool#
False if the object can currently be written to
Setting this value will trigger _on_lock() and _on_unlock() if it changes.
- Type:
bool
- rewrite_hdf(hdf: ProjectHDFio, group_name: str = None)#
Update the HDF representation.
If an object is read from an older layout, this will remove the old data and rewrite it in the newest layout.
- Parameters:
hdf (ProjectHDFio) – HDF group to read/write
group_name (str, optional) – name of subgroup
- sample(selector: Callable[[FlattenedStorage, int], bool]) FlattenedStorage[source]#
Create a new storage with chunks selected by given function.
If called on a subclass this correctly returns an instance of that subclass instead.
- Parameters:
selector (callable) – function that takes this storage as the first argument and the chunk index to sample as the second argument; if it returns True the chunk will be part of the new storage.
- Returns:
storage with the selected chunks
- Return type:
FlattenedStorage or subclass
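The selector protocol (storage first, chunk index second, boolean result) can be tried out on a plain list of chunks. The sketch below only imitates the calling convention; `sample_chunks` is a hypothetical stand-in and a list of lists takes the place of the storage.

```python
def sample_chunks(chunks, selector):
    """Keep the chunks for which selector(chunks, index) returns True."""
    return [chunk for index, chunk in enumerate(chunks) if selector(chunks, index)]


chunks = [[2], [4, 6], [8, 10, 12]]

# Select only chunks with at least two elements.
long_chunks = sample_chunks(chunks, lambda store, index: len(store[index]) >= 2)
print(long_chunks)  # [[4, 6], [8, 10, 12]]
```

Passing the whole storage to the selector, rather than a single chunk, lets the function look at any array of the chunk at that index before deciding.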
- set_array(name, frame, value)[source]#
Set array for given chunk.
Works for per chunk and per element arrays.
- Parameters:
name (str) – name of array to set
frame (int, str) – selects structure to set, as in get_structure()
value – value (for per chunk) or array of values (for per element); type and shape as per has_array().
- Raises:
KeyError – if array with name does not exist
- split(array_names: Iterable[str]) FlattenedStorage[source]#
Return a new storage with only the selected arrays present.
Arrays are deep-copied from self.
- Parameters:
array_names (list of str) – names of the arrays to be present in the new storage
- Returns:
storage with split arrays
- Return type:
FlattenedStorage
- to_dict() dict#
Reduce the object to a dictionary.
- Returns:
serialized state of this object
- Return type:
dict
- to_hdf(hdf: ProjectHDFio, group_name: str = None)#
Write object to HDF.
If group_name is given create a subgroup in hdf first.
- Parameters:
hdf (ProjectHDFio) – HDF group to write to
group_name (str, optional) – name of subgroup
- to_pandas(explode=False, include_index=False) DataFrame[source]#
Convert arrays to pandas dataframe.
- Parameters:
explode (bool) – If False values of per element arrays are stored in the dataframe as arrays, otherwise each row in the dataframe corresponds to an element in the original storage.
- Returns:
table of array values
- Return type:
pandas.DataFrame
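The difference between the two explode settings is easiest to see on the rows that would end up in the table. The following plain-Python sketch builds those rows as dictionaries; `to_rows` is a hypothetical helper, and repeating per chunk values on every exploded row is an assumption modelled on how pandas' DataFrame.explode behaves.

```python
def to_rows(lengths, per_element, per_chunk, explode=False):
    """Build table rows: one per chunk, or one per element if explode=True."""
    rows = []
    for index, length in enumerate(lengths):
        chunk_values = {name: values[index] for name, values in per_chunk.items()}
        if explode:
            for position in range(length):
                row = {name: values[index][position] for name, values in per_element.items()}
                # Assumption: per chunk values are repeated on every element row.
                rows.append({**row, **chunk_values})
        else:
            row = {name: values[index] for name, values in per_element.items()}
            rows.append({**row, **chunk_values})
    return rows


lengths = [1, 2]
per_element = {"even": [[2], [4, 6]]}
per_chunk = {"sum": [3, 18]}

print(to_rows(lengths, per_element, per_chunk))
# [{'even': [2], 'sum': 3}, {'even': [4, 6], 'sum': 18}]
print(to_rows(lengths, per_element, per_chunk, explode=True))
# [{'even': 2, 'sum': 3}, {'even': 4, 'sum': 18}, {'even': 6, 'sum': 18}]
```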
- unlocked() _UnlockContext#
Unlock the object temporarily.
The context manager returns this object again and relocks it after the with statement has finished.
Note
lock() vs. unlocked()
There is a small asymmetry between these two methods.
lock() can only be done once (meaningfully), while unlocked() is a context manager and can be called multiple times.
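The asymmetry can be reproduced with a toy class built on contextlib. This is a sketch of the general pattern, not the Lockable implementation; `ToyLockable` and its use of RuntimeError are stand-ins invented for the example.

```python
from contextlib import contextmanager


class ToyLockable:
    """Toy stand-in for the locking behaviour (hypothetical class)."""

    def __init__(self):
        self.read_only = False
        self.values = {}

    def lock(self):
        # Locking an already locked object has no further effect.
        self.read_only = True

    @contextmanager
    def unlocked(self):
        was_locked = self.read_only
        self.read_only = False
        try:
            yield self
        finally:
            # Restore the previous state when the with block ends.
            self.read_only = was_locked

    def set(self, key, value):
        if self.read_only:
            raise RuntimeError("object is locked")
        self.values[key] = value


toy = ToyLockable()
toy.lock()
toy.lock()  # harmless repeat

try:
    toy.set("even", [4])
    rejected = False
except RuntimeError:
    rejected = True

with toy.unlocked() as writable:
    writable.set("even", [4])

print(rejected, toy.read_only, toy.values)  # True True {'even': [4]}
```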