Big Data Distributed Computing
dislib is a distributed computing library focused on machine learning and built on top of PyCOMPSs. Inspired by NumPy and scikit-learn, dislib provides a range of supervised and unsupervised learning algorithms through an easy-to-use API.
Software Author: 
Workflows and Distributed Computing Group

Javier Conejero (javier.conejero@bsc.es)

Rosa M. Badia (rosa.m.badia@bsc.es)


0.6.0 (Latest Version)

New dislib version

Release Notes


Requirements

  • PyCOMPSs >= 2.7
  • scikit-learn >= 0.19.2
  • NumPy >= 1.15.4
  • SciPy >= 1.0.0
  • cvxpy >= 1.1.5
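
One way to confirm that an environment meets the minimum versions listed above is a small check script. The following is a minimal sketch using only the Python standard library; the distribution names used as dictionary keys (for example "pycompss") are assumptions about how the packages are published, and the helper functions are hypothetical, not part of dislib.

```python
from importlib import metadata

# Minimum versions copied from the list above.
# The keys are assumed PyPI distribution names; adjust them if your
# environment installs the packages under different names.
MINIMUMS = {
    "pycompss": "2.7",
    "scikit-learn": "0.19.2",
    "numpy": "1.15.4",
    "scipy": "1.0.0",
    "cvxpy": "1.1.5",
}


def parse_version(text):
    """Turn '1.15.4' into (1, 15, 4); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:
            # A suffix such as 'rc1' follows the number: keep it and stop.
            break
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS):
    """Return {name: status}; status is 'ok', 'too old' or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

The version parser is deliberately simple and ignores pre-release ordering; a dedicated version-parsing library would be a better fit when that matters.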

Upgrade Steps

If you use Docker, switch to the new image.

If you have a local installation, upgrade to COMPSs 2.7 (see the COMPSs documentation) before upgrading to dislib 0.6.0. To use the regression algorithms, also install the Python cvxpy module: pip install cvxpy.

Breaking Changes

  • ds-array doesn't accept a chunk_size bigger than the array.
  • Moved data loading routines to a different file as array.py was getting too big.
  • apply_along_axis for sparse data now returns sparse ds-arrays.
  • Some PyCOMPSs log messages have changed.
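
Code that computed chunk sizes independently of the data may be affected by the first change above. The sketch below illustrates the rule in plain Python: it rejects a chunk size larger than the array and shows one way to clamp a requested size beforehand. Both helpers are hypothetical and written for two-dimensional shapes; they are not dislib functions and do not reproduce its error messages.

```python
import math


def block_grid(shape, chunk_size):
    """Return the number of chunks along each axis of a 2-D array.

    Hypothetical helper, not part of dislib: it mirrors the rule stated
    above by rejecting a chunk size larger than the array.
    """
    rows, cols = shape
    chunk_rows, chunk_cols = chunk_size
    if chunk_rows < 1 or chunk_cols < 1:
        raise ValueError("chunk size must be positive")
    if chunk_rows > rows or chunk_cols > cols:
        raise ValueError(
            f"chunk size {chunk_size} is bigger than the array {shape}"
        )
    # The last chunk along an axis may be smaller than the others.
    return math.ceil(rows / chunk_rows), math.ceil(cols / chunk_cols)


def clamp_chunk_size(shape, chunk_size):
    """Shrink each chunk dimension so that it never exceeds the array."""
    return tuple(min(c, s) for c, s in zip(chunk_size, shape))
```

For example, `clamp_chunk_size((10, 4), (100, 2))` yields a size that `block_grid` accepts, whereas passing `(100, 2)` directly raises a `ValueError`.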

New Features

  • User guide and glossary
  • Method to read from npy files
  • Support for one-dimensional data in ds-array
  • Parametrized ds-array tests
  • identity, full and zeros methods that generate ds-arrays filled with a value
  • ds-array operators: subtraction, division, conjugate, transpose, item setting, etc.
  • matmul, Kronecker product and rechunk methods for ds-arrays
  • Automatic deletion of ds-arrays when the GC is called.
  • Multivariate linear regression.
  • SVD (Singular Value Decomposition)
  • PCA using SVD
  • ADMM Lasso algorithm
  • Daura clustering algorithm
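
To give an idea of what a matrix product over block-partitioned arrays involves, here is a minimal pure-Python sketch of the general blocked multiplication technique. It is an illustration under stated assumptions, not dislib's implementation: matrices are plain lists of lists, all function names are hypothetical, and the column blocking of the left operand must match the row blocking of the right operand. In a distributed setting, each block product could run as an independent task.

```python
def split_blocks(matrix, block_rows, block_cols):
    """Split a list-of-lists matrix into a 2-D grid of blocks."""
    grid = []
    for r in range(0, len(matrix), block_rows):
        row_of_blocks = []
        for c in range(0, len(matrix[0]), block_cols):
            block = [row[c:c + block_cols] for row in matrix[r:r + block_rows]]
            row_of_blocks.append(block)
        grid.append(row_of_blocks)
    return grid


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def blocked_matmul(a_grid, b_grid):
    """Multiply two block grids and return the result as a block grid."""
    out = []
    for i in range(len(a_grid)):
        out_row = []
        for j in range(len(b_grid[0])):
            # Output block (i, j) is the sum over k of A[i][k] @ B[k][j].
            acc = matmul(a_grid[i][0], b_grid[0][j])
            for k in range(1, len(b_grid)):
                acc = add(acc, matmul(a_grid[i][k], b_grid[k][j]))
            out_row.append(acc)
        out.append(out_row)
    return out


def merge_blocks(grid):
    """Reassemble a block grid into a single list-of-lists matrix."""
    merged = []
    for row_of_blocks in grid:
        for r in range(len(row_of_blocks[0])):
            merged.append([v for block in row_of_blocks for v in block[r]])
    return merged


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]

# a is split into 2x3 blocks, so b must be split into blocks of 3 rows.
a_grid = split_blocks(a, 2, 3)
b_grid = split_blocks(b, 3, 1)
result = merge_blocks(blocked_matmul(a_grid, b_grid))
```

The merged result equals the ordinary product of `a` and `b`; only the order in which partial sums are accumulated differs.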

Bug Fixes

  • Some bugs in the ds-array
  • Internal inconsistencies in transformed_array of PCA


Improvements

  • Improved performance testing scripts and added new tests
  • Allow executing applications with params using dislib exec
  • Extended and improved the tutorial notebook
  • Updated dislib-base docker image
  • Replaced COLLECTION_INOUT parameters with COLLECTION_OUT when possible for improving performance
