# Data Storage

The AxoPy data storage model was created so that your experimental dataset can be easily passed on to others and read using a variety of tools on any platform, meaning AxoPy is not required to use the raw dataset. It was also designed so that you can browse the dataset using a standard file browser so you do not need to write any code to have a peek.

Data is stored hierarchically using a specific file structure and common file types. There are two types of files: comma-separated value (CSV) files for trial data (one row per trial) and HDF5 files for array data (one file per array type, with one dataset per trial). Here’s the file structure in abstract terms:

    data/
        subject_id/
            task_name/
                file: trials.csv
                file: array_type1.hdf5
                file: array_type2.hdf5


You have the root of the entire dataset, containing a subfolder for each subject. Each subject folder contains a set of subfolders, one per task. The task subfolders contain a single CSV file which contains all of the attributes (scalars) for each trial, and it contains HDF5 files which store array data, one for each kind of array (e.g. an emg.hdf5 file containing raw electrophysiological recordings for each trial and a cursor.hdf5 file containing cursor positions for each trial).
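Because the layout is nothing more than folders and files, you can inspect it with the standard library alone. The following is a minimal sketch, not part of AxoPy: the helper `list_dataset` and the names `subject_01` and `task_a` are hypothetical, and the files are empty placeholders created in a temporary directory purely to show the traversal.

```python
import tempfile
from pathlib import Path


def list_dataset(root):
    """Return {subject: {task: [file names]}} for a root/subject/task layout."""
    layout = {}
    for subject_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        tasks = {}
        for task_dir in sorted(p for p in subject_dir.iterdir() if p.is_dir()):
            tasks[task_dir.name] = sorted(
                f.name for f in task_dir.iterdir() if f.is_file()
            )
        layout[subject_dir.name] = tasks
    return layout


# Build an empty stand-in dataset (hypothetical names) to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "subject_01" / "task_a"
    task_dir.mkdir(parents=True)
    for name in ("trials.csv", "emg.hdf5", "cursor.hdf5"):
        (task_dir / name).touch()
    layout = list_dataset(tmp)

print(layout)
```

Pointing `list_dataset` at a real data root should give a quick overview of which subjects have completed which tasks.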

As a concrete example, suppose you are running an experiment where subjects are tasked with contracting a muscle to a specified level, determined using electromyography (EMG). For each trial, you want to store the following:

• the time it took the subject to reach the desired contraction level
• the number of times the contraction level went past the desired level (overshoots)
• the raw EMG signals, which are recorded at 2 kHz
• the current “level of contraction,” which is computed at 10 Hz by passing the EMG signals through a processing pipeline you have set up

The trial data variables here are time to target and overshoots, so these are placed in a CSV file with one row per trial:

| trial | time_to_target | overshoots |
|-------|----------------|------------|
| 0     | 3.271942       | 1          |
| 1     | 2.159271       | 0          |
| 2     | 3.212450       | 2          |

Since you have two different array-like entities to store (raw EMG data at 2 kHz and the processed contraction level at 10 Hz), you create two different array types: emg and level. Each array type is stored in its own HDF5 file in the task folder, with one HDF5 dataset (in the root group) per trial. The result of all of this is a structure that looks like:

    data_root/
        subject_id/
            contraction_level_task/
                file: trials.csv
                file: emg.hdf5
                file: level.hdf5
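To illustrate reading such a dataset without AxoPy, here is a minimal sketch using pandas and h5py. It writes a small stand-in dataset to a temporary directory and reads it back. The subject name `subject_01`, the array shape (2 channels by 1000 samples), and the convention of naming each HDF5 dataset after its trial index are assumptions made for this example; check your own files for the actual dataset names.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "subject_01" / "contraction_level_task"
    task_dir.mkdir(parents=True)

    # Write stand-in data: 3 trials, 2 EMG channels, 0.5 s at 2 kHz.
    trials = pd.DataFrame({
        "trial": [0, 1, 2],
        "time_to_target": [3.271942, 2.159271, 3.212450],
        "overshoots": [1, 0, 2],
    })
    trials.to_csv(task_dir / "trials.csv", index=False)

    rng = np.random.default_rng(0)
    with h5py.File(str(task_dir / "emg.hdf5"), "w") as f:
        for trial in trials["trial"]:
            # Assumption: one dataset in the root group, named by trial index.
            f.create_dataset(str(trial), data=rng.standard_normal((2, 1000)))

    # Read everything back using general-purpose tools only.
    loaded_trials = pd.read_csv(task_dir / "trials.csv")
    with h5py.File(str(task_dir / "emg.hdf5"), "r") as f:
        emg_by_trial = {int(name): f[name][()] for name in f}

print(loaded_trials)
print({trial: emg.shape for trial, emg in emg_by_trial.items()})
```

The same files can be opened from other environments that provide CSV and HDF5 readers, which is the point of choosing these formats.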


The HDF5 format was chosen for all array data because it naturally works with NumPy arrays, which are the assumed container for data as it goes from a hardware device through processing code to computer interaction. It also saves the arrays in a binary format instead of converting to strings as something like numpy.savetxt would do, potentially reducing the size of a whole experiment’s dataset significantly if you store many arrays representing high-frequency electrophysiological recordings.
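As a rough illustration of the size difference, the sketch below compares the raw binary payload of an array with the text that `numpy.savetxt` produces for the same data. The array is random stand-in data (4 channels, 10 s at 2 kHz), and `emg.nbytes` only approximates what ends up in an HDF5 file, since a real file adds a small amount of container metadata.

```python
import io

import numpy as np

# Stand-in recording: 4 channels, 10 s at 2 kHz, float64.
emg = np.random.default_rng(0).standard_normal((4, 20_000))

# Raw binary payload: 8 bytes per float64 value.
binary_bytes = emg.nbytes

# Text representation using savetxt's default '%.18e' format.
text_buffer = io.BytesIO()
np.savetxt(text_buffer, emg)
text_bytes = len(text_buffer.getvalue())

print(f"binary payload: {binary_bytes} bytes")
print(f"savetxt output: {text_bytes} bytes")
print(f"ratio: {text_bytes / binary_bytes:.1f}x")
```

A shorter `fmt` argument to `savetxt` would shrink the text output, but at the cost of precision, whereas the binary form keeps the values exactly.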

The goals of this storage layout are to be simple to implement and reason about, to allow for manual browsing of the dataset, and to enable simultaneous sessions (i.e. multiple researchers running the experiment with different subjects) with a very simple and intuitive data merging procedure (i.e. just designate a single root folder and move all subject data there). The layout is not optimized for processing and analyzing data once an experiment is complete, however. For that, see Data Consolidation.
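The merging procedure described above can be sketched with the standard library. This is an illustrative helper, not an AxoPy function: `merge_roots` and the folder names `rig_a`, `rig_b`, and `merged` are hypothetical. It assumes subject IDs are unique across sessions and refuses to overwrite an existing subject folder.

```python
import shutil
import tempfile
from pathlib import Path


def merge_roots(source_roots, dest_root):
    """Move every subject folder from each source root into dest_root."""
    dest_root = Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    moved = []
    for source in map(Path, source_roots):
        for subject_dir in sorted(p for p in source.iterdir() if p.is_dir()):
            target = dest_root / subject_dir.name
            if target.exists():
                # Refuse to overwrite: subject IDs must be unique.
                raise FileExistsError(
                    f"subject {subject_dir.name!r} already exists in {dest_root}"
                )
            shutil.move(str(subject_dir), str(target))
            moved.append(subject_dir.name)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Two researchers' data roots with one subject each (hypothetical names).
    (tmp / "rig_a" / "subject_01" / "task").mkdir(parents=True)
    (tmp / "rig_b" / "subject_02" / "task").mkdir(parents=True)
    moved = merge_roots([tmp / "rig_a", tmp / "rig_b"], tmp / "merged")
    merged_subjects = sorted(p.name for p in (tmp / "merged").iterdir())

print(merged_subjects)
```

Raising on a name clash rather than merging folders is a deliberate choice here: a duplicated subject ID usually signals a bookkeeping mistake that is better resolved by hand.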

## Experiment-Level Storage

The top-level Storage class handles the first two layers of the storage hierarchy: subjects and tasks. It is initialized at the beginning of each session (e.g. once per subject for a single-session experiment), and the data storage hierarchy is built for each subject. Initializing and adding subjects is typically handled for you by axopy.experiment.Experiment in the context of running an experiment. Once a task is given access to the Storage object, however, it is up to the task implementation to set up TaskReader objects for reading data from other tasks and TaskWriter objects for storing its own data. This is done by calling Storage.require_task() and Storage.create_task(), respectively.

Task storage is designed to make implementing a task’s data reading and writing as simple as possible, while being flexible enough to accommodate different kinds of experimental design. If you are interested in processing data after an experiment has been run, see the Data Consolidation documentation.

The TaskReader is used for reading in data from another task. In the context of an experiment, you would access a reader with Storage.require_task(), passing in the name of the task (i.e. the name of the directory corresponding to the task). You can then access the trial data (attrs) with the trials attribute, which returns a pandas DataFrame object. You can also access array data either by reading it all at once (arrays for each trial are stacked) or by iterating over each trial’s array.

Keeping with our example above, suppose we want to run the EMG data from the contraction_level_task through a processing pipeline.

    # storage can either be created for post-processing
    # or it can be given to us if this is another task implementation
    reader = storage.require_task('contraction_level_task')
    for emg in reader.iterarray('emg'):
        # emg holds the EMG data for a single trial
        out = pipeline.process(emg)
        ...


It is also common to need the trial attributes while iterating over the trial arrays, and this can be achieved using zip and the DataFrame.iterrows method:

    for (i, trial_attrs), emg in zip(reader.trials.iterrows(),
                                     reader.iterarray('emg')):
        if trial_attrs['time_to_target'] > 5.0:
            continue
        out = pipeline.process(emg)
        ...
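If you want to try this pattern without a recorded dataset, the following self-contained sketch uses stand-ins: a hand-built DataFrame in place of `reader.trials` and a list of NumPy arrays in place of the per-trial EMG data. The values (including the 6.5 s trial) are invented for illustration, and taking the mean stands in for a real processing pipeline.

```python
import numpy as np
import pandas as pd

# Stand-in for reader.trials: one row of attributes per trial.
trials = pd.DataFrame({
    "time_to_target": [3.271942, 6.5, 3.212450],
    "overshoots": [1, 0, 2],
})

# Stand-in for the per-trial arrays: trial i is filled with the value i.
emg_arrays = [np.full((2, 4), float(i)) for i in range(len(trials))]

kept = []
for (i, trial_attrs), emg in zip(trials.iterrows(), emg_arrays):
    if trial_attrs["time_to_target"] > 5.0:
        continue  # skip trials that took too long
    # Placeholder for pipeline.process(emg).
    kept.append((i, float(emg.mean())))

print(kept)
```

Note that `zip` pairs rows and arrays purely by position, so this only works if the arrays are yielded in the same order as the rows of the trial table.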


## Data Consolidation

Most of the data reading and writing functionality implemented in AxoPy is designed to make implementing an experiment as easy as possible, but there are some convenience functions for compiling an experiment’s dataset into something more amenable to post-processing and analysis.

### Archiving Raw Data

In most cases, you’ll want to archive your entire untouched dataset once an experiment is complete, or maybe even periodically as an experiment is performed. For this purpose, there is the storage_to_zip() function, which creates a ZIP archive of the data contained in the root storage directory. Its usage is simple: you pass it the path to your data storage root directory, which can be relative to the directory you run the function from. Let’s say you have some data in a folder called experiment01_data:

    >>> from axopy.storage import storage_to_zip
    >>> storage_to_zip('experiment01_data')


There should now be a file called experiment01_data.zip in the current directory, containing a copy of the whole dataset hierarchy. You can also specify an output file if you don’t like the default:

    >>> from axopy.storage import storage_to_zip
    >>> storage_to_zip('experiment01_data', outfile='dataset.zip')
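If AxoPy is not installed on the machine doing the archiving, a roughly comparable result can be had from the standard library. The sketch below uses `shutil.make_archive`. It is not how `storage_to_zip()` is implemented, and the subject and task names are hypothetical placeholders created in a temporary directory.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Stand-in dataset with a single trial file (hypothetical names).
    trials_csv = tmp / "experiment01_data" / "subject_01" / "task" / "trials.csv"
    trials_csv.parent.mkdir(parents=True)
    trials_csv.write_text("trial,time_to_target,overshoots\n0,3.271942,1\n")

    # Creates experiment01_data.zip next to the data root; paths inside
    # the archive start with "experiment01_data/".
    archive_path = shutil.make_archive(
        base_name=str(tmp / "experiment01_data"),
        format="zip",
        root_dir=str(tmp),
        base_dir="experiment01_data",
    )

    with zipfile.ZipFile(archive_path) as zf:
        archived = zf.namelist()

print(Path(archive_path).name)
print(archived)
```

Keeping the top-level folder name inside the archive means that extracting it recreates the data root as a single directory.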