easyX: a simple Python library for saving big data

Motivation

As a researcher, I develop advanced machine learning approaches for analyzing cognitive states in human brains. During my Ph.D. studies, I developed the easy fMRI project as a toolbox for analyzing task-based fMRI datasets. This project has already been used in several advanced academic studies, such as modeling consciousness (Michigan University), diagnosing children’s anxiety (University of Alberta), decoding visual stimuli (Oxford University), and working memory and decision making (University of Cambridge).

When I started the easy fMRI project, I selected the MATLAB format for saving preprocessed datasets and analysis results. However, this format cannot save data files larger than 4GB, while fMRI datasets are massive (e.g., a dataset for movie stimuli can be more than 200GB). That was a significant disadvantage for the project. We had been developing advanced machine learning approaches, such as deep neural networks, that could provide an accurate model for analyzing big fMRI datasets, but our toolbox could not save (or even load) a large dataset as a single file, even when enough hardware resources were available. We first came up with the “Easy Data” technique, i.e., we partitioned the complex data structure into a set of MATLAB files that were controlled via a header file. This solution has two disadvantages. First, using a single dataset requires loading and concatenating all of those MATLAB files, which is not time-efficient. Second, the set of files is hard to keep track of after running an analysis: you need to make sure that the processed data is correctly partitioned and stored again.

We then tried to replace the MATLAB files with alternative Python libraries such as Pandas and HDF5. Pandas cannot save and load complex structures the way the MATLAB format can, because a Pandas DataFrame is mostly designed for homogeneous matrices. Further, Pandas cannot efficiently handle large files.

HDF5 does not suffer from these Pandas issues. It saves and loads massive data files in a time-efficient way. However, it can only store homogeneous tensors. That was our motivation for creating the “easyX project” on top of the HDF5 data structure, so that it can handle massive data files with complex structures.

The easyX library enables you to save a Python dictionary with a complex structure to a single HDF5 file. We have tested this library by saving fMRI datasets of 150GB+ (n.b., you need a computer with 155 GB of memory for that). You can stack all of your data in a Python dictionary with a complex structure, including nested dictionaries and nonhomogeneous tensors.

How does easyX work?

easyX uses a simple procedure. It saves homogeneous tensors with the regular HDF5 mechanism and stores them in an HDF5 group called “raw”. If the dictionary contains other complex structures, such as another dictionary or nonhomogeneous tensors, easyX first dumps the bytes of the data from memory and encodes them in base64. The encoded data is stored as a vector in an HDF5 group called “binary”.
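The routing described above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the easyX source code: the helper names (regular_shape, encode_binary, decode_binary, split_dictionary) are hypothetical, two plain dictionaries stand in for the “raw” and “binary” HDF5 groups, and the rule "a rectangular nested list of numbers counts as homogeneous" is an assumption made for illustration.

```python
import codecs
import pickle
from numbers import Number


def regular_shape(value):
    """Return the shape of a rectangular numeric nested list, or None otherwise."""
    if isinstance(value, Number) and not isinstance(value, bool):
        return ()
    if not isinstance(value, list) or not value:
        return None
    shapes = [regular_shape(item) for item in value]
    if shapes[0] is None or any(shape != shapes[0] for shape in shapes):
        return None
    return (len(value),) + shapes[0]


def encode_binary(value):
    """Pickle a value to bytes, then base64-encode the bytes into ASCII text."""
    return codecs.encode(pickle.dumps(value), "base64").decode("ascii")


def decode_binary(text):
    """Invert encode_binary: base64-decode the text, then unpickle the bytes."""
    return pickle.loads(codecs.decode(text.encode("ascii"), "base64"))


def split_dictionary(data):
    """Route each entry to a 'raw' or a 'binary' bucket (stand-ins for the HDF5 groups)."""
    raw, binary = {}, {}
    for key, value in data.items():
        if regular_shape(value) is not None:
            raw[key] = value                    # homogeneous: stored as-is
        else:
            binary[key] = encode_binary(value)  # anything else: pickled + base64
    return raw, binary


# Hypothetical sample: a ragged list, a regular matrix, a string, a nested dict
sample = {"b": [[1], [2, 4]], "c": [[1, 20], [7, 4]], "d": "Hi There", "n": {"x": [1, 2]}}
raw, binary = split_dictionary(sample)

# Rebuild the original dictionary from both buckets
restored = {**raw, **{key: decode_binary(text) for key, text in binary.items()}}
```

In this sketch only "c" ends up in the raw bucket; the ragged list, the string, and the nested dictionary are pickled and base64-encoded, and decoding them restores the original values.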

How to install?

You only need to copy easyX.py into your Python project. You can download easyX with git:

git clone https://gitlab.com/myousefnezhad/easyx.git

Requirements

We have tested easyX on Python 3.7 and Python 3.8. Install the libraries listed in requirements.txt:

pip install -r requirements.txt

easyX relies on the following libraries: numpy, pickle, codecs, and h5py (pickle and codecs are part of the Python standard library).

How to use it?

First, copy easyX.py to the main folder of your project. Then, store all of your variables in a Python dictionary.

As an example, we create sample data as follows:

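import numpy as np

# Sample dictionary mixing homogeneous arrays, ragged lists, and strings
data = {"a": np.array([[1, 2, 5, 8], [2., 4, 1, 6]]),
        "b": [[1], [2, 4]],  # ragged (nonhomogeneous) list
        "c": [[1, 20], [7, 4]],
        "d": "Hi There",
        "e": ["A", "B"],
        "f": [["a", "b"], ["c", "d"]],
        "h": np.random.rand(100, 1000)
        }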

Here, the Python dictionary data contains variables of different types and shapes.

A) Saving a dictionary into a file

You can use the following commands to save a dictionary into a file:

# Import easyX Library
from easyX import easyX
# Create an object from easyX class
ezx = easyX()
# Replace this with the PATH where you want to save your data
fname = "/tmp/a.ezx"
# Here, `data` is the example dictionary; you may replace it with yours
ezx.save(data, fname=fname)

B) Loading a data file into a dictionary

You can use the following commands to load a data file into a dictionary:

# Import easyX Library
from easyX import easyX
# Create an object from easyX class
ezx = easyX()
# Replace this with the PATH of the data file you want to load
fname = "/tmp/a.ezx"
# Data will be recovered in the `data` dictionary
data = ezx.load(fname=fname) 
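After a save and load cycle, it can be useful to confirm that the reloaded dictionary matches the original. The helper below is a small sketch for that check; same_content is a hypothetical name and is not part of easyX. The text does not state whether plain lists come back as lists or as NumPy arrays, so the helper compares array-like values by content rather than by type.

```python
import numpy as np


def same_content(left, right):
    """Recursively compare two values, treating arrays and equal-valued lists as equal."""
    if isinstance(left, dict) and isinstance(right, dict):
        return left.keys() == right.keys() and all(
            same_content(left[key], right[key]) for key in left
        )
    if isinstance(left, np.ndarray) or isinstance(right, np.ndarray):
        try:
            return bool(np.array_equal(np.asarray(left), np.asarray(right)))
        except ValueError:
            # Raised by recent NumPy versions when one side is a ragged list
            return False
    if isinstance(left, (list, tuple)) and isinstance(right, (list, tuple)):
        return len(left) == len(right) and all(
            same_content(a, b) for a, b in zip(left, right)
        )
    return left == right


# In-memory stand-ins for "before saving" and "after loading" (hypothetical values)
original = {"a": np.array([[1, 2], [3, 4]]), "b": [[1], [2, 4]],
            "c": [[1, 20], [7, 4]], "d": "Hi There"}
reloaded = {"a": np.array([[1, 2], [3, 4]]), "b": [[1], [2, 4]],
            "c": np.array([[1, 20], [7, 4]]), "d": "Hi There"}
changed = dict(reloaded, c=np.array([[0, 0], [0, 0]]))

matches = same_content(original, reloaded)
differs = same_content(original, changed)
```

To use it with easyX, keep the dictionary you saved under a separate name and pass it together with the result of ezx.load(fname=fname) to same_content.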

C) Loading the data structures (keys) from a data file into a dictionary

You can use the following commands to load the data structures (keys) from a data file into a dictionary:

# Import easyX Library
from easyX import easyX
# Create an object from easyX class
ezx = easyX()
# Replace this with the PATH of the data file you want to inspect
fname = "/tmp/a.ezx"
# Keys will be recovered in the `keys` dictionary
keys = ezx.load_keys(fname=fname)

How to uninstall easyX?

You only need to remove easyX.py from your project.

Do not forget feedback

I hope you will enjoy using easyX in your project. For support or feedback, you can contact us at info@learningbymachine.com.

Reference

easyX website: https://gitlab.com/myousefnezhad/easyx
