API

Top level user functions:

DataFrame(dsk, name, meta, divisions) Implements out-of-core DataFrame as a sequence of pandas DataFrames
DataFrame.add(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator add).
DataFrame.append(other) Append rows of other to the end of this frame, returning a new object.
DataFrame.apply(func[, axis, args, meta, ...]) Parallel version of pandas.DataFrame.apply
DataFrame.assign(**kwargs) Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
DataFrame.astype(dtype) Cast object to input numpy.dtype
DataFrame.cache([cache]) Evaluate DataFrame and store in local cache
DataFrame.categorize([columns]) Convert columns of the DataFrame to category dtype
DataFrame.columns
DataFrame.compute(**kwargs) Compute several dask collections at once.
DataFrame.corr([method, min_periods]) Compute pairwise correlation of columns, excluding NA/null values
DataFrame.count([axis, split_every]) Return Series with number of non-NA/null observations over requested axis.
DataFrame.cov([min_periods]) Compute pairwise covariance of columns, excluding NA/null values
DataFrame.cummax([axis, skipna]) Return cumulative max over requested axis.
DataFrame.cummin([axis, skipna]) Return cumulative minimum over requested axis.
DataFrame.cumprod([axis, skipna]) Return cumulative product over requested axis.
DataFrame.cumsum([axis, skipna]) Return cumulative sum over requested axis.
DataFrame.describe([split_every]) Generate various summary statistics, excluding NaN values.
DataFrame.div(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.drop(labels[, axis, dtype]) Return new object with labels in requested axis removed.
DataFrame.drop_duplicates(**kwargs) Return DataFrame with duplicate rows removed, optionally only considering certain columns
DataFrame.dropna([how, subset]) Return object with labels on given axis omitted where alternately any or all of the data are missing
DataFrame.dtypes Return data types
DataFrame.fillna(value) Fill NA/NaN values using the specified method
DataFrame.floordiv(other[, axis, level, ...]) Integer division of dataframe and other, element-wise (binary operator floordiv).
DataFrame.get_partition(n) Get a dask DataFrame/Series representing the nth partition.
DataFrame.groupby(key, **kwargs) Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
DataFrame.head([n, npartitions, compute]) First n rows of the dataset
DataFrame.index Return dask Index instance
DataFrame.iterrows() Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples() Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.
DataFrame.join(other[, on, how, lsuffix, ...]) Join columns with other DataFrame either on index or on a key column.
DataFrame.known_divisions Whether divisions are already known
DataFrame.loc Purely label-location based indexer for selection by label.
DataFrame.map_partitions(func, *args, **kwargs) Apply Python function on each DataFrame partition.
DataFrame.mask(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
DataFrame.max([axis, skipna, split_every]) This method returns the maximum of the values in the object.
DataFrame.mean([axis, skipna, split_every]) Return the mean of the values for the requested axis
DataFrame.merge(right[, how, on, left_on, ...]) Merge DataFrame objects by performing a database-style join operation by columns or indexes.
DataFrame.min([axis, skipna, split_every]) This method returns the minimum of the values in the object.
DataFrame.mod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator mod).
DataFrame.mul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator mul).
DataFrame.ndim Return dimensionality
DataFrame.nlargest([n, columns, split_every]) Get the rows of a DataFrame sorted by the n largest values of columns.
DataFrame.npartitions Return number of partitions
DataFrame.pow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator pow).
DataFrame.quantile([q, axis]) Approximate row-wise and precise column-wise quantiles of DataFrame
DataFrame.query(expr, **kwargs) Blocked version of pd.DataFrame.query
DataFrame.radd(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator radd).
DataFrame.random_split(frac[, random_state]) Pseudorandomly split dataframe into different pieces row-wise
DataFrame.rdiv(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.rename([index, columns]) Alter axes input function or functions.
DataFrame.repartition([divisions, ...]) Repartition dataframe along new divisions
DataFrame.reset_index([drop]) For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc.
DataFrame.rfloordiv(other[, axis, level, ...]) Integer division of dataframe and other, element-wise (binary operator rfloordiv).
DataFrame.rmod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator rmod).
DataFrame.rmul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator rmul).
DataFrame.rpow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator rpow).
DataFrame.rsub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator rsub).
DataFrame.rtruediv(other[, axis, level, ...]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.sample(frac[, replace, random_state]) Random sample of items
DataFrame.set_index(other[, drop, sorted]) Set the DataFrame index (row labels) using an existing column
DataFrame.set_partition(column, divisions, ...) Set explicit divisions for new column index
DataFrame.std([axis, skipna, ddof, split_every]) Return sample standard deviation over requested axis.
DataFrame.sub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator sub).
DataFrame.sum([axis, skipna, split_every]) Return the sum of the values for the requested axis
DataFrame.tail([n, compute]) Last n rows of the dataset
DataFrame.to_bag([index]) Convert to a dask Bag of tuples of each row.
DataFrame.to_csv(filename, **kwargs) Write DataFrame to a series of comma-separated values (csv) files
DataFrame.to_hdf(path_or_buf, key[, mode, ...]) Export frame to hdf file(s)
DataFrame.to_delayed() Convert dataframe into dask Delayed objects
DataFrame.truediv(other[, axis, level, ...]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.var([axis, skipna, ddof, split_every]) Return unbiased variance over requested axis.
DataFrame.visualize([filename, format, ...]) Render the computation of this object’s task graph using graphviz.
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Rolling Operations

rolling.rolling_apply(arg, window, func[, ...]) Generic moving function application.
rolling.rolling_count(arg, window, \*\*kwargs) Rolling count of number of non-NaN observations inside provided window.
rolling.rolling_kurt(arg, window[, ...]) Unbiased moving kurtosis.
rolling.rolling_max(arg, window[, ...]) Moving maximum.
rolling.rolling_mean(arg, window[, ...]) Moving mean.
rolling.rolling_median(arg, window[, ...]) Moving median.
rolling.rolling_min(arg, window[, ...]) Moving minimum.
rolling.rolling_quantile(arg, window, quantile) Moving quantile.
rolling.rolling_skew(arg, window[, ...]) Unbiased moving skewness.
rolling.rolling_std(arg, window[, ...]) Moving standard deviation.
rolling.rolling_sum(arg, window[, ...]) Moving sum.
rolling.rolling_var(arg, window[, ...]) Moving variance.
rolling.rolling_window(arg[, window, ...]) Applies a moving window of type window_type and size window on the data.

Create DataFrames

read_csv(urlpath[, blocksize, collection, ...]) Read CSV files into a Dask.DataFrame
read_table(urlpath[, blocksize, collection, ...]) Read delimited files into a Dask.DataFrame
read_hdf(path_or_buf[, key]) Read from the store, close it if we opened it
from_array(x[, chunksize, columns]) Read dask DataFrame from any sliceable array
from_bcolz(x[, chunksize, categorize, ...]) Read dask DataFrame from bcolz.ctable
from_dask_array(x[, columns]) Convert dask Array to dask DataFrame
from_delayed(dfs[, meta, divisions, prefix, ...]) Create DataFrame from many dask.delayed objects
from_pandas(data[, npartitions, chunksize, ...]) Construct a dask object from a pandas object.

DataFrame Methods

class dask.dataframe.DataFrame(dsk, name, meta, divisions)

Implements out-of-core DataFrame as a sequence of pandas DataFrames

Parameters:

dsk : dict

The dask graph to compute this DataFrame

name: str

The key prefix that specifies which keys in the dask comprise this particular DataFrame

meta: pandas.DataFrame

An empty pandas.DataFrame with names, dtypes, and index matching the expected output.

divisions: tuple of index values

Values along which we partition our blocks on the index
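Users rarely call this constructor directly; a dask DataFrame is normally produced by one of the creation functions above, which build dsk, name, meta, and divisions for you. A minimal sketch (the column names are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [1., 2., 3., 4.]})
>>> ddf = dd.from_pandas(df, npartitions=2)  # constructs the dask DataFrame for us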

add(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator add).

Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.radd

Notes

Mismatched indices will be unioned together
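For illustration, a minimal sketch of fill_value (the frames and values are made up): an entry missing on only one side is replaced by 0 before adding, while entries missing on both sides stay missing:

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> a = dd.from_pandas(pd.DataFrame({'x': [1., np.nan, np.nan]}), npartitions=1)
>>> b = dd.from_pandas(pd.DataFrame({'x': [10., 20., np.nan]}), npartitions=1)
>>> a.add(b, fill_value=0).compute()
      x
0  11.0
1  20.0
2   NaN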

align(other, join='outer', axis=None, fill_value=None)

Align two objects on their axes with the specified join method for each axis Index

Parameters:

other : DataFrame or Series

join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’

axis : allowed axis of the other object, default None

Align on index (0), columns (1), or both (None)

level : int or level name, default None

Broadcast across a level, matching Index values on the passed MultiIndex level

copy : boolean, default True

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_value : scalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value

method : str, default None

limit : int, default None

fill_axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Filling axis, method and limit

broadcast_axis : {0 or ‘index’, 1 or ‘columns’}, default None

Broadcast values along this axis, if aligning two objects of different dimensions

New in version 0.17.0.

Returns:

(left, right) : (DataFrame, type of other)

Aligned objects

Notes

Dask doesn’t support the following argument(s).

  • level
  • copy
  • method
  • limit
  • fill_axis
  • broadcast_axis
all(axis=None, skipna=True, split_every=False)

Return whether all elements are True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

all : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
any(axis=None, skipna=True, split_every=False)

Return whether any element is True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

any : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
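A small sketch of both reductions over boolean columns (the data is illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'a': [True, True, False],
...                                    'b': [True, True, True]}), npartitions=2)
>>> ddf.all().compute()
a    False
b     True
dtype: bool
>>> ddf.any().compute()
a    True
b    True
dtype: bool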
append(other)

Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.

Parameters:

other : DataFrame or Series/dict-like object, or list of these

The data to append.

ignore_index : boolean, default False

If True, do not use the index labels.

verify_integrity : boolean, default False

If True, raise ValueError on creating index with duplicates.

Returns:

appended : DataFrame

See also

pandas.concat
General function to concatenate DataFrame, Series or Panel objects

Notes

Dask doesn’t support the following argument(s).

  • ignore_index
  • verify_integrity

Examples

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))    
>>> df    
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))    
>>> df.append(df2)    
   A  B
0  1  2
1  3  4
0  5  6
1  7  8

With ignore_index set to True:

>>> df.append(df2, ignore_index=True)    
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
apply(func, axis=0, args=(), meta='__no_default__', columns='__no_default__', **kwds)

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

  1. Only axis=1 is supported (and must be specified explicitly).
  2. The user should provide output metadata via the meta keyword.
Parameters:

func : function

Function to apply to each column/row

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

  • 0 or ‘index’: apply function to each column (NOT SUPPORTED)
  • 1 or ‘columns’: apply function to each row

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

columns : list, scalar or None

Deprecated, please use meta instead. If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar or None (no name). If not given, dask tries to infer the result type from the beginning of the data. This inference may take some time and lead to unexpected results.

args : tuple

Positional arguments to pass to function in addition to the array/series

Additional keyword arguments will be passed as keywords to the function

Returns:

applied : Series or DataFrame

See also

dask.DataFrame.map_partitions

Examples

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

Apply a function to row-wise passing in extra arguments in args and kwargs:

>>> def myadd(row, a, b=1):
...     return row.sum() + a + b
>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
applymap(func, meta='__no_default__')

Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame

Parameters:

func : function

Python function, returns a single value from a single value

Returns:

applied : DataFrame

See also

DataFrame.apply
For operations on rows/columns

Examples

>>> df = pd.DataFrame(np.random.randn(3, 3))    
>>> df    
    0         1          2
0  -0.029638  1.081563   1.280300
1   0.647747  0.831136  -1.549481
2   0.513416 -0.884417   0.195343
>>> df = df.applymap(lambda x: '%.2f' % x)    
>>> df    
    0         1          2
0  -0.03      1.08       1.28
1   0.65      0.83      -1.55
2   0.51     -0.88       0.20
assign(**kwargs)

Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

New in version 0.16.0.

Parameters:

kwargs : keyword, value pairs

keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:

df : DataFrame

A new DataFrame with the new columns in addition to all the existing columns.

Notes

Since kwargs is a dictionary, the order of your arguments may not be preserved. To make things predictable, the columns are inserted in alphabetical order, at the end of your DataFrame. Assigning multiple columns within the same assign is possible, but you cannot reference other columns created within the same assign call.

Examples

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})    

Where the value is a callable, evaluated on df:

>>> df.assign(ln_A = lambda x: np.log(x.A))    
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585

Where the value already exists and is inserted:

>>> newcol = np.log(df['A'])    
>>> df.assign(ln_A=newcol)    
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585
astype(dtype)

Cast object to input numpy.dtype. Return a copy when copy=True (be really careful with this!)

Parameters:

dtype : data type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

raise_on_error : raise on invalid input

kwargs : keyword arguments to pass on to the constructor

Returns:

casted : type of caller

Notes

Dask doesn’t support the following argument(s).

  • copy
  • raise_on_error
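For example, casting a single integer column to float (a minimal sketch; the column name is illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=1)
>>> ddf.astype('f8').dtypes
x    float64
dtype: object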
cache(cache=<class 'dict'>)

Evaluate DataFrame and store in local cache

Uses chest by default to store data on disk

categorize(columns=None, **kwargs)

Convert columns of the DataFrame to category dtype

Parameters:

columns : list, optional

A list of column names to convert to the category type. By default any column with an object dtype is converted to a categorical.

kwargs

Keyword arguments are passed on to compute.

See also

dask.dataframe.categorical.categorize

Notes

When dealing with columns of repeated text values, converting to the category dtype is often much more performant, both in terms of memory and in writing to disk or communication over the network.
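A minimal sketch (the column and values are illustrative); note that finding the categories requires a pass over the data, so this triggers computation:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue']})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.categorize(columns=['color']).dtypes
color    category
dtype: object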

clip(lower=None, upper=None, out=None)

Trim values at input threshold(s).

Parameters:

lower : float or array_like, default None

upper : float or array_like, default None

axis : int or string axis name, optional

Align object with lower and upper along the given axis.

Returns:

clipped : Series

Notes

Dask doesn’t support the following argument(s).

  • axis

Examples

>>> df    
  0         1
0  0.335232 -1.256177
1 -1.367855  0.746646
2  0.027753 -1.176076
3  0.230930 -0.679613
4  1.261967  0.570967
>>> df.clip(-1.0, 0.5)    
          0         1
0  0.335232 -1.000000
1 -1.000000  0.500000
2  0.027753 -1.000000
3  0.230930 -0.679613
4  0.500000  0.500000
>>> t    
0   -0.3
1   -0.2
2   -0.1
3    0.0
4    0.1
dtype: float64
>>> df.clip(t, t + 1, axis=0)    
          0         1
0  0.335232 -0.300000
1 -0.200000  0.746646
2  0.027753 -0.100000
3  0.230930  0.000000
4  1.100000  0.570967
clip_lower(threshold)

Return copy of the input with values below given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
clip_upper(threshold)

Return copy of input with values above given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
combine_first(other)

Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns

Parameters:

other : DataFrame

Returns:

combined : DataFrame

Examples

a’s values prioritized, use values from b to fill holes:

>>> a.combine_first(b)    
compute(**kwargs)

Compute several dask collections at once.

Parameters:

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler get function.
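For example, operations on a dask DataFrame are lazy until compute is called (a minimal sketch):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4]}), npartitions=2)
>>> total = ddf.x.sum()  # lazy; nothing has been executed yet
>>> total.compute()      # runs the task graph and returns a concrete value
10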

corr(method='pearson', min_periods=None)

Compute pairwise correlation of columns, excluding NA/null values

Parameters:

method : {‘pearson’, ‘kendall’, ‘spearman’}

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation

Returns:

y : DataFrame

count(axis=None, split_every=False)

Return Series with number of non-NA/null observations over requested axis. Works with non-floating point data as well (detects NaN and None)

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame

numeric_only : boolean, default False

Include only float, int, boolean data

Returns:

count : Series (or DataFrame if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
cov(min_periods=None)

Compute pairwise covariance of columns, excluding NA/null values

Parameters:

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result.

Returns:

y : DataFrame

Notes

y contains the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1 (unbiased estimator).

cummax(axis=None, skipna=True)

Return cumulative max over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummax : Series

cummin(axis=None, skipna=True)

Return cumulative minimum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummin : Series

cumprod(axis=None, skipna=True)

Return cumulative product over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumprod : Series

cumsum(axis=None, skipna=True)

Return cumulative sum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumsum : Series

describe(split_every=False)

Generate various summary statistics, excluding NaN values.

Parameters:

percentiles : array-like, optional

The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.

include, exclude : list-like, ‘all’, or None (default)

Specify the form of the returned result. Either:

  • None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
  • A list of dtypes or strings to be included/excluded. To select all numeric types use numpy numpy.number. To select categorical objects use type object. See also the select_dtypes documentation. eg. df.describe(include=[‘O’])
  • If include is the string ‘all’, the output column-set will match the input one.
Returns:

summary: NDFrame of summary statistics

Notes

Dask doesn’t support the following argument(s).

  • percentiles
  • include
  • exclude
div(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

drop(labels, axis=0, dtype=None)

Return new object with labels in requested axis removed.

Parameters:

labels : single label or list-like

axis : int or axis name

level : int or level name, default None

For MultiIndex

inplace : bool, default False

If True, do operation inplace and return None.

errors : {‘ignore’, ‘raise’}, default ‘raise’

If ‘ignore’, suppress error and existing labels are dropped.

New in version 0.16.1.

Returns:

dropped : type of caller

Notes

Dask doesn’t support the following argument(s).

  • level
  • inplace
  • errors
drop_duplicates(**kwargs)

Return DataFrame with duplicate rows removed, optionally only considering certain columns

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

take_last : deprecated

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy

Returns:

deduplicated : DataFrame

dropna(how='any', subset=None)

Return object with labels on given axis omitted where alternately any or all of the data are missing

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, or tuple/list thereof

Pass tuple or list to drop on multiple axes

how : {‘any’, ‘all’}

  • any : if any NA values are present, drop that label
  • all : if all values are NA, drop that label

thresh : int, default None

int value : require that many non-NA values

subset : array-like

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include

inplace : boolean, default False

If True, do operation inplace and return None.

Returns:

dropped : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • axis
  • thresh
  • inplace
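A small sketch contrasting how='any' (the default) with how='all' (the data is illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1., np.nan, 3.], 'y': [np.nan, np.nan, 6.]})
>>> ddf = dd.from_pandas(df, npartitions=1)
>>> ddf.dropna().compute()           # drop rows with any missing value
     x    y
2  3.0  6.0
>>> ddf.dropna(how='all').compute()  # drop rows only if every value is missing
     x    y
0  1.0  NaN
2  3.0  6.0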
dtypes

Return data types

eq(other, axis='columns', level=None)

Wrapper for flexible comparison methods eq

eval(expr, inplace=None, **kwargs)

Evaluate an expression in the context of the calling DataFrame instance.

Parameters:

expr : string

The expression string to evaluate.

inplace : bool

If the expression contains an assignment, whether to return a new DataFrame or mutate the existing.

WARNING: inplace=None currently falls back to True, but in a future version, will default to False. Use inplace=True explicitly rather than relying on the default.

New in version 0.18.0.

kwargs : dict

See the documentation for eval() for complete details on the keyword arguments accepted by query().

Returns:

ret : ndarray, scalar, or pandas object

See also

pandas.DataFrame.query, pandas.DataFrame.assign, pandas.eval

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

>>> from numpy.random import randn    
>>> from pandas import DataFrame    
>>> df = DataFrame(randn(10, 2), columns=list('ab'))    
>>> df.eval('a + b')    
>>> df.eval('c = a + b')    
fillna(value)

Fill NA/NaN values using the specified method

Parameters:

value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use NEXT valid observation to fill gap

axis : {0, ‘index’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:

filled : Series

See also

reindex, asfreq

Notes

Dask doesn’t support the following argument(s).

  • method
  • axis
  • inplace
  • limit
  • downcast
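For example, filling missing values with a scalar (a minimal sketch):

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1., np.nan, 3.]}), npartitions=1)
>>> ddf.fillna(0).compute()
     x
0  1.0
1  0.0
2  3.0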
floordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator floordiv).

Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

ge(other, axis='columns', level=None)

Wrapper for flexible comparison methods ge

get_dtype_counts()

Return the counts of dtypes in this object.

get_ftype_counts()

Return the counts of ftypes in this object.

get_partition(n)

Get a dask DataFrame/Series representing the nth partition.
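For example, selecting the second of three partitions (a minimal sketch; the result is itself lazy until computed):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6)}), npartitions=3)
>>> ddf.get_partition(1).compute()
   x
2  2
3  3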

groupby(key, **kwargs)

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Parameters:

by : mapping function / list of functions, dict, Series, or tuple / list of column names

Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups

axis : int, default 0

level : int, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : boolean, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : boolean, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

squeeze : boolean, default False

reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns:

GroupBy object

Notes

Dask doesn’t support the following argument(s).

  • by
  • axis
  • level
  • as_index
  • sort
  • group_keys
  • squeeze

Examples

DataFrame results

>>> data.groupby(func, axis=0).mean()    
>>> data.groupby(['col1', 'col2'])['col3'].mean()    

DataFrame with hierarchical index

>>> data.groupby(['col1', 'col2']).mean()    
gt(other, axis='columns', level=None)

Wrapper for flexible comparison methods gt

head(n=5, npartitions=1, compute=True)

First n rows of the dataset

Parameters:

n : int, optional

The number of rows to return. Default is 5.

npartitions : int, optional

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

compute : bool, optional

Whether to compute the result, default is True.
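For example (a minimal sketch): head computes eagerly by default and reads only from the first partition(s), which makes it a cheap way to inspect data:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> ddf.head(3)
   x
0  0
1  1
2  2
>>> lazy = ddf.head(3, compute=False)  # stays a dask DataFrame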

idxmax(axis=None, skipna=True, split_every=False)

Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be first index.

Returns:

idxmax : Series

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.

idxmin(axis=None, skipna=True, split_every=False)

Return index of first occurrence of minimum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

idxmin : Series

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

index

Return dask Index instance

info(buf=None, verbose=False, memory_usage=False)

Concise summary of a Dask DataFrame.

isnull()

Return a boolean same-sized object indicating if the values are null.

See also

notnull
boolean inverse of isnull
iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

Returns:

it : generator

A generator that iterates over the rows of the frame.

See also

itertuples
Iterate over DataFrame rows as namedtuples of the values.
iteritems
Iterate over (column name, Series) pairs.

Notes

  1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])    
    >>> row = next(df.iterrows())[1]    
    >>> row    
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)    
    float64
    >>> print(df['int'].dtype)    
    int64
    

    To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

  2. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

itertuples()

Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.

Parameters:

index : boolean, default True

If True, return the index as the first element of the tuple.

name : string, default “Pandas”

The name of the returned namedtuples or None to return regular tuples.

See also

iterrows
Iterate over DataFrame rows as (index, Series) pairs.
iteritems
Iterate over (column name, Series) pairs.

Notes

Dask doesn’t support the following argument(s).

  • index
  • name

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
...                   index=['a', 'b'])
>>> df    
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():    
...     print(row)
...
Pandas(Index='a', col1=1, col2=0.10000000000000001)
Pandas(Index='b', col1=2, col2=0.20000000000000001)
join(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

Parameters:

other : DataFrame, Series with name field set, or list of DataFrame

Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame

on : column name, tuple/list of column names, or array-like

Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns are given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’

How to handle the operation of the two objects.

  • left: use calling frame’s index (or column if on is specified)
  • right: use other frame’s index
  • outer: form union of calling frame’s index (or column if on is
    specified) with other frame’s index
  • inner: form intersection of calling frame’s index (or column if
    on is specified) with other frame’s index

lsuffix : string

Suffix to use from left frame’s overlapping columns

rsuffix : string

Suffix to use from right frame’s overlapping columns

sort : boolean, default False

Order result DataFrame lexicographically by the join key. If False, preserves the index order of the calling (left) DataFrame

Returns:

joined : DataFrame

See also

DataFrame.merge
For column(s)-on-columns(s) operations

Notes

Dask doesn’t support the following argument(s).

  • sort

Examples

>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],    
...                        'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller    
    A key
0  A0  K0
1  A1  K1
2  A2  K2
3  A3  K3
4  A4  K4
5  A5  K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],    
...                       'B': ['B0', 'B1', 'B2']})
>>> other    
    B key
0  B0  K0
1  B1  K1
2  B2  K2

Join DataFrames using their indexes.

>>> caller.join(other, lsuffix='_caller', rsuffix='_other')
    A key_caller    B key_other
0  A0         K0   B0        K0
1  A1         K1   B1        K1
2  A2         K2   B2        K2
3  A3         K3  NaN       NaN
4  A4         K4  NaN       NaN
5  A5         K5  NaN       NaN

If we want to join using the key columns, we need to set key to be the index in both caller and other. The joined DataFrame will have key as its index.

>>> caller.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in the caller. This method preserves the original caller’s index in the result.

>>> caller.join(other.set_index('key'), on='key')
    A key    B
0  A0  K0   B0
1  A1  K1   B1
2  A2  K2   B2
3  A3  K3  NaN
4  A4  K4  NaN
5  A5  K5  NaN
known_divisions

Whether divisions are already known
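For example (a minimal sketch): from_pandas produces known divisions, which many index-based operations rely on:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(4)}), npartitions=2)
>>> ddf.known_divisions
True
>>> ddf.divisions
(0, 2, 3)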

le(other, axis='columns', level=None)

Wrapper for flexible comparison methods le

loc

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  
lt(other, axis='columns', level=None)

Wrapper for flexible comparison methods lt

map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

Parameters:

func : function

Function applied to each partition.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)
mask(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

try to cast the result back to the input type (if possible),

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error

Examples

>>> s = pd.Series(range(5))    
>>> s.where(s > 0)    
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])    
>>> m = df % 3 == 0    
>>> df.where(m, -df)    
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
max(axis=None, skipna=True, split_every=False)

This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

max : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mean(axis=None, skipna=True, split_every=False)

Return the mean of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

mean : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None)

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:

right : DataFrame

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

  • left: use only keys from left frame (SQL: left outer join)
  • right: use only keys from right frame (SQL: right outer join)
  • outer: use union of keys from both frames (SQL: full outer join)
  • inner: use intersection of keys from both frames (SQL: inner join)

on : label or list

Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like

Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like

Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index

sort : boolean, default False

Sort the join keys lexicographically in the result DataFrame

suffixes : 2-length sequence (tuple, list, ...)

Suffix to apply to overlapping column names in the left and right side, respectively

copy : boolean, default True

If False, do not copy data unnecessarily

indicator : boolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

New in version 0.17.0.

Returns:

merged : DataFrame

The output type will be the same as ‘left’, if it is a subclass of DataFrame.

See also

merge_ordered, merge_asof

Notes

Dask doesn’t support the following argument(s).

  • sort
  • copy

Examples

>>> A              >>> B    
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')    
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
min(axis=None, skipna=True, split_every=False)

This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

min : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator mod).

Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rmod

Notes

Mismatched indices will be unioned together

mul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rmul

Notes

Mismatched indices will be unioned together

ndim

Return dimensionality

ne(other, axis='columns', level=None)

Wrapper for flexible comparison methods ne

nlargest(n=5, columns=None, split_every=None)

Get the rows of a DataFrame sorted by the n largest values of columns.

New in version 0.17.0.

Parameters:

n : int

Number of items to retrieve

columns : list or str

Column name or names to order by

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

Returns:

DataFrame

Notes

Dask doesn’t support the following argument(s).

  • keep

Examples

>>> df = DataFrame({'a': [1, 10, 8, 11, -1],    
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nlargest(3, 'a')    
    a  b   c
3  11  c   3
1  10  b   2
2   8  d NaN
notnull()

Return a boolean same-sized object indicating if the values are not null.

See also

isnull
boolean inverse of notnull
npartitions

Return number of partitions

pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs)

New in version 0.16.2.

Parameters:

func : function

function to apply to the NDFrame. args and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.

args : positional arguments passed into func.

kwargs : a dictionary of keyword arguments passed into func.

Returns:

object : the return type of func.

See also

pandas.DataFrame.apply, pandas.DataFrame.applymap, pandas.Series.map

Notes

Use .pipe when chaining together functions that expect Series or DataFrames. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)    

You can write

>>> (df.pipe(h)    
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:

>>> (df.pipe(h)    
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )
pivot_table(index=None, columns=None, values=None, aggfunc='mean')

Create a spreadsheet-style pivot table as a DataFrame. Target columns must have category dtype to infer result’s columns. index, columns, values and aggfunc must be all scalar.

Parameters:

values : scalar

column to aggregate

index : scalar

column to be index

columns : scalar

column to be columns

aggfunc : {‘mean’, ‘sum’, ‘count’}, default ‘mean’

Returns:

table : DataFrame

pow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator pow).

Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rpow

Notes

Mismatched indices will be unioned together

quantile(q=0.5, axis=0)

Approximate row-wise and precise column-wise quantiles of DataFrame

Parameters:

q : list/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles

axis : {0, 1, ‘index’, ‘columns’} (default 0)

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
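
Examples

A minimal sketch with illustrative data. Quantiles computed along the index are approximate, so the results may differ slightly from pandas:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(100), 'y': range(100, 200)})
>>> ddf = dd.from_pandas(df, npartitions=4)
>>> ddf.quantile([0.25, 0.5, 0.75]).compute()  # one row per requested quantile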

radd(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator radd).

Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.add

Notes

Mismatched indices will be unioned together

random_split(frac, random_state=None)

Pseudorandomly split dataframe into different pieces row-wise

Parameters:

frac : list

List of floats that should sum to one.

random_state: int or np.random.RandomState

If int, create a new RandomState with this as the seed; otherwise, draw from the passed RandomState.

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
rdiv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

reduction(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)

Generic row-wise reductions.

Parameters:

chunk : callable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregate : callable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

  • scalar: Input is a Series, with one row per partition.
  • Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
  • DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combine : callable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

token : str, optional

The name to use for the output keys.

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargs : dict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargs : dict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargs : dict, optional

Keyword arguments to pass on to combine only.

kwargs :

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
rename(index=None, columns=None)

Alter axes using an input function or functions. Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error. Alternatively, change Series.name with a scalar value (Series only).

Parameters:

index, columns : scalar, list-like, dict-like or function, optional

Scalar or list-like will alter the Series.name attribute, and raise on DataFrame or Panel. dict-like or functions are transformations to apply to that axis’ values

copy : boolean, default True

Also copy underlying data

inplace : boolean, default False

Whether to return a new DataFrame. If True then value of copy is ignored.

Returns:

renamed : DataFrame (new object)

See also

pandas.NDFrame.rename_axis

Examples

>>> s = pd.Series([1, 2, 3])    
>>> s    
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name") # scalar, changes Series.name    
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels    
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels    
0    1
3    2
5    3
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})    
>>> df.rename(2)    
...
TypeError: 'int' object is not callable
>>> df.rename(index=str, columns={"A": "a", "B": "c"})    
   a  c
0  1  4
1  2  5
2  3  6
>>> df.rename(index=str, columns={"A": "a", "C": "c"})    
   a  B
0  1  4
1  2  5
2  3  6
repartition(divisions=None, npartitions=None, force=False)

Repartition dataframe along new divisions

Parameters:

divisions : list, optional

List of partitions to be used. If specified npartitions will be ignored.

npartitions : int, optional

Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.

force : bool, default False

Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
resample(rule, how=None, closed=None, label=None)

Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Parameters:

rule : string

the offset string or object representing target conversion

axis : int, optional, default 0

closed : {‘right’, ‘left’}

Which side of bin interval is closed

label : {‘right’, ‘left’}

Which bin edge label to label bucket with

convention : {‘start’, ‘end’, ‘s’, ‘e’}

loffset : timedelta

Adjust the resampled time labels

base : int, default 0

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0

on : string, optional

For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

New in version 0.19.0.

level : string or int, optional

For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.

New in version 0.19.0.

To learn more about the offset strings, please see the pandas documentation on offset aliases: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Notes

Dask doesn’t support the following argument(s).

  • axis
  • fill_method
  • convention
  • kind
  • loffset
  • limit
  • base
  • on
  • level

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')    
>>> series = pd.Series(range(9), index=index)    
>>> series    
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()    
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()    
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()    
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5] #select first 5 rows    
2000-01-01 00:00:00     0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00     1
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00     2
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]    
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]    
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(array_like):    
...     return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler)    
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64
reset_index(drop=False)

For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters:

level : int, str, tuple, or list, default None

Only remove the given levels from the index. Removes all levels by default

drop : boolean, default False

Do not try to insert index into dataframe columns. This resets the index to the default integer index.

inplace : boolean, default False

Modify the DataFrame in place (do not create a new object)

col_level : int or str, default 0

If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.

col_fill : object, default ‘’

If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

Returns:

reset : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • level
  • inplace
  • col_level
  • col_fill
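
Examples

A minimal sketch with illustrative data. Note that the new integer index is generated per partition, so it may restart within each partition rather than being globally sequential:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.reset_index(drop=True).compute()  # discards the string index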
rfloordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator rfloordiv).

Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

rmod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator rmod).

Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.mod

Notes

Mismatched indices will be unioned together

rmul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator rmul).

Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.mul

Notes

Mismatched indices will be unioned together

rolling(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)

Provides rolling transformations.

Parameters:

window : int

Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

center : boolean, default False

Set the labels at the center of the window.

win_type : string, default None

Provide a window type. The recognized window types are identical to pandas.

axis : int, default 0

Returns:

a Rolling object on which to call a method to compute a statistic

Notes

The freq argument is not supported.
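
Examples

A minimal sketch with illustrative data, computing a three-observation rolling mean (the window must not span more than one adjacent partition):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(10)})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.rolling(window=3).mean().compute()  # first two rows are NaN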

round(decimals=0)

Round a DataFrame to a variable number of decimal places.

New in version 0.17.0.

Parameters:

decimals : int, dict, Series

Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

Returns:

DataFrame object

See also

numpy.around, Series.round

Examples

>>> df = pd.DataFrame(np.random.random([3, 3]),    
...     columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
>>> df    
               A         B         C
first   0.028208  0.992815  0.173891
second  0.038683  0.645646  0.577595
third   0.877076  0.149370  0.491027
>>> df.round(2)    
           A     B     C
first   0.03  0.99  0.17
second  0.04  0.65  0.58
third   0.88  0.15  0.49
>>> df.round({'A': 1, 'C': 2})    
          A         B     C
first   0.0  0.992815  0.17
second  0.0  0.645646  0.58
third   0.9  0.149370  0.49
>>> decimals = pd.Series([1, 0, 2], index=['A', 'B', 'C'])    
>>> df.round(decimals)    
          A  B     C
first   0.0  1  0.17
second  0.0  1  0.58
third   0.9  0  0.49
rpow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator rpow).

Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.pow

Notes

Mismatched indices will be unioned together

rsub(other, axis='columns', level=None, fill_value=None)

Subtraction of dataframe and other, element-wise (binary operator rsub).

Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.sub

Notes

Mismatched indices will be unioned together

rtruediv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

sample(frac, replace=False, random_state=None)

Random sample of items

Parameters:

frac : float, optional

Fraction of axis items to return.

replace: boolean, optional

Sample with or without replacement. Default = False.

random_state: int or ``np.random.RandomState``

If int, we create a new RandomState with this as the seed; otherwise, we draw from the passed RandomState.
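
Examples

For example, to draw an approximate 25% sample reproducibly (df here stands for any dask DataFrame):

>>> sampled = df.sample(0.25, random_state=2)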

select_dtypes(include=None, exclude=None)

Return a subset of a DataFrame including/excluding columns based on their dtype.

Parameters:

include, exclude : list-like

A list of dtypes or strings to be included/excluded. You must pass in a non-empty sequence for at least one of these.

Returns:

subset : DataFrame

The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Raises:

ValueError

  • If both of include and exclude are empty
  • If include and exclude have overlapping elements
  • If any kind of string dtype is passed in.

TypeError

  • If either of include or exclude is not a sequence

Notes

  • To select all numeric types use the numpy dtype numpy.number
  • To select strings you must use the object dtype, but note that this will return all object dtype columns
  • See the numpy dtype hierarchy
  • To select Pandas categorical dtypes, use ‘category’

Examples

>>> df = pd.DataFrame({'a': np.random.randn(6).astype('f4'),    
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df    
        a      b  c
0  0.3962   True  1
1  0.1459  False  2
2  0.2623   True  1
3  0.0764  False  2
4 -0.9703   True  1
5 -1.2094  False  2
>>> df.select_dtypes(include=['float64'])    
   c
0  1
1  2
2  1
3  2
4  1
5  2
>>> df.select_dtypes(exclude=['floating'])    
       b
0   True
1  False
2   True
3  False
4   True
5  False
set_index(other, drop=True, sorted=False, **kwargs)

Set the DataFrame index (row labels) using an existing column

This operation in dask.dataframe is expensive. If the input column is sorted then we accomplish the set_index in a single full read of that column. However, if the input column is not sorted then this operation triggers a full shuffle, which can take a while and only works on a single machine (not distributed).

Parameters:

other: Series or label

drop: boolean, default True

Delete columns to be used as the new index

sorted: boolean, default False

Set to True if the new index column is already sorted

Examples

>>> df.set_index('x')  
>>> df.set_index(df.x)
>>> df.set_index(df.timestamp, sorted=True)
set_partition(column, divisions, **kwargs)

Set explicit divisions for new column index

>>> df2 = df.set_partition('new-index-column', divisions=[10, 20, 50])  

See also

set_index

std(axis=None, skipna=True, ddof=1, split_every=False)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

std : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
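
Examples

A minimal sketch with illustrative data; with the default ddof=1 this matches pandas’ sample standard deviation:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.std().compute()
x    1.290994
dtype: float64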
sub(other, axis='columns', level=None, fill_value=None)

Subtraction of dataframe and other, element-wise (binary operator sub).

Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rsub

Notes

Mismatched indices will be unioned together

sum(axis=None, skipna=True, split_every=False)

Return the sum of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

sum : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
tail(n=5, compute=True)

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.

to_bag(index=False)

Convert to a dask Bag of tuples of each row.

Parameters:

index : bool, optional

If True, the index is included as the first element of each tuple. Default is False.
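
Examples

A minimal sketch with illustrative data; each bag element is a tuple of the row’s values, with the index value first because index=True:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2], 'y': [10, 20]}),
...                      npartitions=1)
>>> ddf.to_bag(index=True).compute()
[(0, 1, 10), (1, 2, 20)]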

to_castra(fn=None, categories=None, sorted_index_column=None, compute=True, get=<function get_sync>)

Write DataFrame to Castra on-disk store

See https://github.com/blosc/castra for details

See also

Castra.to_dask

to_csv(filename, **kwargs)

Write DataFrame to a series of comma-separated values (csv) files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, ...

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 
Parameters:

filename : string

Path glob indicating the naming scheme for the output files

name_function : callable, default None

Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions

compression : string or None

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

sep : character, default ‘,’

Field delimiter for the output file

na_rep : string, default ‘’

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None

deprecated, use na_rep

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename

line_terminator : string, default ‘\n’

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default ‘”’

character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

character used to escape sep and quotechar when appropriate

chunksize : int or None

rows to write at a time

tupleize_cols : boolean, default False

Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)

date_format : string, default None

Format string for datetime objects

decimal: string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

to_delayed()

Convert dataframe into dask Delayed objects

Returns a list of delayed values, one value per partition.
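
Examples

A minimal sketch with illustrative data; each delayed value computes to the pandas object for one partition:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)
>>> parts = ddf.to_delayed()
>>> len(parts)
4
>>> parts[0].compute()  # the first partition as a pandas DataFrame
   x
0  0
1  1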

to_hdf(path_or_buf, key, mode='a', append=False, get=None, **kwargs)

Export frame to hdf file(s)

Export dataframe to one or multiple hdf5 files or nodes.

The exported hdf format is pandas’ hdf table format only. Data saved by this function can be read back by any pandas-compatible reader.

By providing a single asterisk in either the path_or_buf or key parameters you direct dask to save each partition to a different file or node (respectively). The asterisk will be replaced with a zero padded partition number, as this is the default implementation of name_function.

When writing to a single hdf node in a single hdf file, all hdf save tasks are required to execute in a specific order, often becoming the bottleneck of the entire execution graph. Saving to multiple nodes or files removes that restriction (order is still preserved by enforcing order on output, using name_function) and enables executing save tasks in parallel.

Parameters:

path_or_buf: HDFStore object or string

Destination file(s). If string, can contain a single asterisk to save each partition to a different file. Only one asterisk is allowed in both the path_or_buf and key parameters.

key: string

A node / group path in the file; can contain a single asterisk to save each partition to a different hdf node in a single file. Only one asterisk is allowed in both the path_or_buf and key parameters.

format: optional, default ‘table’

Default hdf storage format; currently only pandas’ ‘table’ format is supported.

mode: optional, {‘a’, ‘w’, ‘r+’}, default ‘a’

  • 'a': Append; add data to existing file(s) or create new ones.
  • 'w': Write; overwrite any existing files with new ones.
  • 'r+': Append to existing files; the files must already exist.

append: optional, default False

If False, overwrites an existing node with the same name; otherwise appends to it.

complevel: optional, 0-9, default 0

Compression level; higher means a better compression ratio and possibly more CPU time. Depends on complib.

complib: {‘zlib’, ‘bzip2’, ‘lzo’, ‘blosc’, None}, default None

If complevel > 0, compress using this compression library when possible.

fletcher32: bool, default False

If True and compression is used, additionally apply the fletcher32 checksum.

get: callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to defaults for the collections.

dask_kwargs: dict, optional

A dictionary of keyword arguments passed to the get function used.

name_function: callable, optional, default None

A callable called for each partition that accepts a single int representing the partition number. name_function must return a string representation of a partition’s index in a way that will preserve the partition’s location after a string sort. If None, a default name_function is used, which returns a zero-padded string of the received int. See dask.utils.build_name_function for more info.

compute: bool, default True

If True, execute computation of the resulting dask graph. If False, return a Delayed object.

lock: bool, None or lock object, default None

In to_hdf, locks are needed for two reasons. First, to protect against writing to the same file from multiple processes or threads simultaneously. Second, the default libhdf5 is not thread safe, so we must additionally lock on its usage. By default, if lock is None, the lock will be determined optimally based on path_or_buf, key and the scheduler used. Manually setting this parameter is usually not required to improve performance.

Alternatively, you can specify explicit values: if False, no locking will occur. If True, a default lock object will be created (multiprocessing.Manager.Lock on the multiprocessing scheduler, threading.Lock otherwise); this can be used to force locking in scenarios where the default behavior would be to avoid it. Otherwise, the value is assumed to implement the lock interface and will be used as the lock object.

See also

dask.DataFrame.read_hdf
reading hdf files
dask.Series.read_hdf
reading hdf files

Examples

Saving data to a single file:

>>> df.to_hdf('output.hdf', '/data')            

Saving data to multiple nodes:

>>> with pd.HDFStore('output.hdf') as fh:
...     df.to_hdf(fh, '/data*')
...     fh.keys()                               
['/data0', '/data1']

Or multiple files:

>>> df.to_hdf('output_*.hdf', '/data')          

Saving multiple files with the multiprocessing scheduler and manually disabling locks:

>>> df.to_hdf('output_*.hdf', '/data',
...   get=dask.multiprocessing.get, lock=False) 
to_timestamp(freq=None, how='start', axis=0)

Cast to DatetimeIndex of timestamps, at beginning of period

Parameters:

freq : string, default frequency of PeriodIndex

Desired frequency

how : {‘s’, ‘e’, ‘start’, ‘end’}

Convention for converting period to timestamp; start of period vs. end

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The axis to convert (the index by default)

copy : boolean, default True

If false then underlying input data is not copied

Returns:

df : DataFrame with DatetimeIndex

Notes

Dask doesn’t support the following argument(s).

  • copy
truediv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

var(axis=None, skipna=True, ddof=1, split_every=False)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

var : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:

filename : str or None, optional

The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.

format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional

Format in which to write output file. Default is ‘png’.

optimize_graph : bool, optional

If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

**kwargs

Additional keyword arguments to forward to to_graphviz.

Returns:

result : IPython.display.Image, IPython.display.SVG, or None

See dask.dot.dot_graph for more information.

See also

dask.base.visualize, dask.dot.dot_graph

Notes

For more information on optimization see here:

http://dask.pydata.org/en/latest/optimize.html
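
Examples

For example, to render the task graph to an SVG file on disk (requires graphviz; df stands for any dask collection):

>>> df.visualize(filename='mydask', format='svg')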

where(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

try to cast the result back to the input type (if possible)

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

See also

DataFrame.mask()

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error

Examples

>>> s = pd.Series(range(5))    
>>> s.where(s > 0)    
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])    
>>> m = df % 3 == 0    
>>> df.where(m, -df)    
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

Series Methods

class dask.dataframe.Series(dsk, name, meta, divisions)

Out-of-core Series object

Mimics pandas.Series.

Parameters:

dsk: dict

The dask graph to compute this Series

_name: str

The key prefix that specifies which keys in the dask comprise this particular Series

meta: pandas.Series

An empty pandas.Series with names, dtypes, and index matching the expected output.

divisions: tuple of index values

Values along which we partition our blocks on the index

add(other, level=None, fill_value=None, axis=0)

Addition of series and other, element-wise (binary operator add).

Equivalent to series + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.radd

align(other, join='outer', axis=None, fill_value=None)

Align two object on their axes with the specified join method for each axis Index

Parameters:

other : DataFrame or Series

join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’

axis : allowed axis of the other object, default None

Align on index (0), columns (1), or both (None)

level : int or level name, default None

Broadcast across a level, matching Index values on the passed MultiIndex level

copy : boolean, default True

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_value : scalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value

method : str, default None

limit : int, default None

fill_axis : {0, ‘index’}, default 0

Filling axis, method and limit

broadcast_axis : {0, ‘index’}, default None

Broadcast values along this axis, if aligning two objects of different dimensions

New in version 0.17.0.

Returns:

(left, right) : (Series, type of other)

Aligned objects

Notes

Dask doesn’t support the following argument(s).

  • level
  • copy
  • method
  • limit
  • fill_axis
  • broadcast_axis
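
Examples

A minimal sketch with two illustrative series; an inner join keeps only the shared index labels:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s1 = dd.from_pandas(pd.Series([1, 2, 3], index=[1, 2, 3]), npartitions=1)
>>> s2 = dd.from_pandas(pd.Series([4, 5, 6], index=[2, 3, 4]), npartitions=1)
>>> left, right = s1.align(s2, join='inner')
>>> left.compute()
2    2
3    3
dtype: int64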
all(axis=None, skipna=True, split_every=False)

Return whether all elements are True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

all : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
any(axis=None, skipna=True, split_every=False)

Return whether any element is True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

any : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
append(other)

Concatenate two or more Series.

Parameters:

to_append : Series or list/tuple of Series

ignore_index : boolean, default False

If True, do not use the index labels.

verify_integrity : boolean, default False

If True, raise Exception on creating index with duplicates

Returns:

appended : Series

Notes

Dask doesn’t support the following argument(s).

  • to_append
  • ignore_index
  • verify_integrity

Examples

>>> s1 = pd.Series([1, 2, 3])    
>>> s2 = pd.Series([4, 5, 6])    
>>> s3 = pd.Series([4, 5, 6], index=[3,4,5])    
>>> s1.append(s2)    
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
>>> s1.append(s3)    
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With ignore_index set to True:

>>> s1.append(s2, ignore_index=True)    
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With verify_integrity set to True:

>>> s1.append(s2, verify_integrity=True)    
ValueError: Indexes have overlapping values: [0, 1, 2]
apply(func, convert_dtype=True, meta='__no_default__', name='__no_default__', args=(), **kwds)

Parallel version of pandas.Series.apply

Parameters:

func : function

Function to apply

convert_dtype : boolean, default True

Try to find better dtype for elementwise function results. If False, leave as dtype=object.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

name : list, scalar or None, optional

Deprecated, use meta instead. If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar or None (no name). If the name keyword is not given, dask tries to infer the result type using the beginning of the data. This inference may take some time and lead to unexpected results.

args : tuple

Positional arguments to pass to function in addition to the value.

Additional keyword arguments will be passed as keywords to the function.

Returns:

applied : Series or DataFrame if func returns a Series.

See also

dask.Series.map_partitions

Examples

>>> import dask.dataframe as dd
>>> s = pd.Series(range(5), name='x')
>>> ds = dd.from_pandas(s, npartitions=2)

Apply a function elementwise across the Series, passing in extra arguments in args and kwargs:

>>> def myadd(x, a, b=1):
...     return x + a + b
>>> res = ds.apply(myadd, args=(2,), b=1.5)

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ds.apply(myadd, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ds.apply(lambda x: x + 1, meta=ds)
astype(dtype)

Cast object to input numpy.dtype. Returns a copy when copy=True (be really careful with this!)

Parameters:

dtype : data type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

raise_on_error : raise on invalid input

kwargs : keyword arguments to pass on to the constructor

Returns:

casted : type of caller

Notes

Dask doesn’t support the following argument(s).

  • copy
  • raise_on_error
between(left, right, inclusive=True)

Return boolean Series equivalent to left <= series <= right. NA values will be treated as False

Parameters:

left : scalar

Left boundary

right : scalar

Right boundary

Returns:

is_between : Series
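
Examples

A minimal sketch with illustrative data; the bounds are inclusive by default:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1, 2, 3, 4, 5]), npartitions=2)
>>> s.between(2, 4).compute()
0    False
1     True
2     True
3     True
4    False
dtype: bool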

cache(cache=<class 'dict'>)

Evaluate Dataframe and store in local cache

Uses chest by default to store data on disk

clip(lower=None, upper=None, out=None)

Trim values at input threshold(s).

Parameters:

lower : float or array_like, default None

upper : float or array_like, default None

axis : int or string axis name, optional

Align object with lower and upper along the given axis.

Returns:

clipped : Series

Notes

Dask doesn’t support the following argument(s).

  • axis

Examples

>>> df    
  0         1
0  0.335232 -1.256177
1 -1.367855  0.746646
2  0.027753 -1.176076
3  0.230930 -0.679613
4  1.261967  0.570967
>>> df.clip(-1.0, 0.5)    
          0         1
0  0.335232 -1.000000
1 -1.000000  0.500000
2  0.027753 -1.000000
3  0.230930 -0.679613
4  0.500000  0.500000
>>> t    
0   -0.3
1   -0.2
2   -0.1
3    0.0
4    0.1
dtype: float64
>>> df.clip(t, t + 1, axis=0)    
          0         1
0  0.335232 -0.300000
1 -0.200000  0.746646
2  0.027753 -0.100000
3  0.230930  0.000000
4  1.100000  0.570967
clip_lower(threshold)

Return copy of the input with values below given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
clip_upper(threshold)

Return copy of input with values above given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
combine_first(other)

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Parameters:

other : Series

Returns:

y : Series
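
Examples

A minimal sketch with illustrative data; missing values in the calling Series are filled from other:

>>> import pandas as pd
>>> import numpy as np
>>> import dask.dataframe as dd
>>> s1 = dd.from_pandas(pd.Series([1.0, np.nan, 3.0]), npartitions=1)
>>> s2 = dd.from_pandas(pd.Series([9.0, 2.0, 9.0]), npartitions=1)
>>> s1.combine_first(s2).compute()
0    1.0
1    2.0
2    3.0
dtype: float64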
compute(**kwargs)

Compute several dask collections at once.

Parameters:

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler get function.

corr(other, method='pearson', min_periods=None)

Compute correlation with other Series, excluding missing values

Parameters:

other : Series

method : {‘pearson’, ‘kendall’, ‘spearman’}

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation

min_periods : int, optional

Minimum number of observations needed to have a valid result

Returns:

correlation : float

count(split_every=False)

Return number of non-NA/null observations in the Series

Parameters:

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series

Returns:

nobs : int or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
cov(other, min_periods=None)

Compute covariance with Series, excluding missing values

Parameters:

other : Series

min_periods : int, optional

Minimum number of observations needed to have a valid result

Returns:

covariance : float

Normalized by N-1 (unbiased estimator).

cummax(axis=None, skipna=True)

Return cumulative max over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummax : Series

cummin(axis=None, skipna=True)

Return cumulative minimum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummin : Series

cumprod(axis=None, skipna=True)

Return cumulative product over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumprod : Series

cumsum(axis=None, skipna=True)

Return cumulative sum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumsum : Series

describe(split_every=False)

Generate various summary statistics, excluding NaN values.

Parameters:

percentiles : array-like, optional

The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.

include, exclude : list-like, ‘all’, or None (default)

Specify the form of the returned result. Either:

  • None for both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
  • A list of dtypes or strings to be included/excluded. To select all numeric types use the numpy dtype numpy.number. To select categorical objects use type object. See also the select_dtypes documentation, e.g. df.describe(include=[‘O’]).
  • If include is the string ‘all’, the output column-set will match the input one.
Returns:

summary: NDFrame of summary statistics

Notes

Dask doesn’t support the following argument(s).

  • percentiles
  • include
  • exclude
div(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rtruediv

drop_duplicates(**kwargs)

Return DataFrame with duplicate rows removed, optionally only considering certain columns

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

take_last : deprecated

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy

Returns:

deduplicated : DataFrame

dropna()

Return Series without null values

Parameters:

inplace : boolean, default False

Do operation in place.

Returns:

valid : Series

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace
dtype

Return data type

eq(other, level=None, axis=0)

Equal to of series and other, element-wise (binary operator eq).

Equivalent to series == other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

fillna(value)

Fill NA/NaN values using the specified method

Parameters:

value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap

axis : {0, ‘index’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:

filled : Series

See also

reindex, asfreq

Notes

Dask doesn’t support the following argument(s).

  • method
  • axis
  • inplace
  • limit
  • downcast
floordiv(other, level=None, fill_value=None, axis=0)

Integer division of series and other, element-wise (binary operator floordiv).

Equivalent to series // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rfloordiv

ge(other, level=None, axis=0)

Greater than or equal to of series and other, element-wise (binary operator ge).

Equivalent to series >= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


get_partition(n)

Get a dask DataFrame/Series representing the nth partition.
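
For example (illustrative; toy data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(4)}), npartitions=2)
>>> ddf.get_partition(0).compute()
   x
0  0
1  1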

groupby(index, **kwargs)

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Parameters:

by : mapping function / list of functions, dict, Series, or tuple / list of column names

Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups

axis : int, default 0

level : int, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : boolean, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : boolean, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

squeeze : boolean, default False

reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns:

GroupBy object

Notes

Dask doesn’t support the following argument(s):

  • by
  • axis
  • level
  • as_index
  • sort
  • group_keys
  • squeeze

Examples

DataFrame results

>>> data.groupby(func, axis=0).mean()    
>>> data.groupby(['col1', 'col2'])['col3'].mean()    

DataFrame with hierarchical index

>>> data.groupby(['col1', 'col2']).mean()    

gt(other, level=None, axis=0)

Greater than of series and other, element-wise (binary operator gt).

Equivalent to series > other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


head(n=5, npartitions=1, compute=True)

First n rows of the dataset

Parameters:

n : int, optional

The number of rows to return. Default is 5.

npartitions : int, optional

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

compute : bool, optional

Whether to compute the result, default is True.
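
For example (illustrative; toy data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> ddf.head(3)
   x
0  0
1  1
2  2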

idxmax(axis=None, skipna=True, split_every=False)

Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:

idxmax : Series

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.

idxmin(axis=None, skipna=True, split_every=False)

Return index of first occurrence of minimum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

idxmin : Series

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

index

Return dask Index instance

isin(other)

Return a boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.

Parameters:

values : set or list-like

The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

New in version 0.18.1: support for values as a set.

Returns:

isin : Series (bool dtype)

Raises:

TypeError

  • If values is a string

See also

pandas.DataFrame.isin

Notes

Dask doesn’t support the following argument(s):

  • values

Examples

>>> s = pd.Series(list('abc'))    
>>> s.isin(['a', 'c', 'e'])    
0     True
1    False
2     True
dtype: bool

Passing a single string as s.isin('a') will raise an error. Use a list of one element instead:

>>> s.isin(['a'])    
0     True
1    False
2    False
dtype: bool

isnull()

Return a boolean same-sized object indicating if the values are null.

See also

notnull
boolean inverse of isnull

iteritems()

Lazily iterate over (index, value) tuples

known_divisions

Whether divisions are already known

le(other, level=None, axis=0)

Less than or equal to of series and other, element-wise (binary operator le).

Equivalent to series <= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


loc

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  

lt(other, level=None, axis=0)

Less than of series and other, element-wise (binary operator lt).

Equivalent to series < other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


map(arg, na_action=None, meta='__no_default__')

Map values of Series using input correspondence (which can be a dict, Series, or function)

Parameters:

arg : function, dict, or Series

na_action : {None, ‘ignore’}

If ‘ignore’, propagate NA values, without passing them to the mapping function

Returns:

y : Series

same index as caller

Examples

Map inputs to outputs

>>> x    
one   1
two   2
three 3
>>> y    
1  foo
2  bar
3  baz
>>> x.map(y)    
one   foo
two   bar
three baz

Use na_action to control whether NA values are affected by the mapping function.

>>> s = pd.Series([1, 2, 3, np.nan])    
>>> s2 = s.map(lambda x: 'this is a string {}'.format(x),    
               na_action=None)
0    this is a string 1.0
1    this is a string 2.0
2    this is a string 3.0
3    this is a string nan
dtype: object
>>> s3 = s.map(lambda x: 'this is a string {}'.format(x),    
               na_action='ignore')
0    this is a string 1.0
1    this is a string 2.0
2    this is a string 3.0
3                     NaN
dtype: object

map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

Parameters:

func : function

Function applied to each partition.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)

mask(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

try to cast the result back to the input type (if possible),

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

Notes

Dask doesn’t support the following argument(s):

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error

Examples

>>> s = pd.Series(range(5))    
>>> s.where(s > 0)    
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])    
>>> m = df % 3 == 0    
>>> df.where(m, -df)    
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

max(axis=None, skipna=True, split_every=False)

This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

max : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s):

  • level
  • numeric_only

mean(axis=None, skipna=True, split_every=False)

Return the mean of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

mean : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s):

  • level
  • numeric_only

min(axis=None, skipna=True, split_every=False)

This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

min : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s):

  • level
  • numeric_only

mod(other, level=None, fill_value=None, axis=0)

Modulo of series and other, element-wise (binary operator mod).

Equivalent to series % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rmod

mul(other, level=None, fill_value=None, axis=0)

Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rmul

ndim

Return dimensionality

ne(other, level=None, axis=0)

Not equal to of series and other, element-wise (binary operator ne).

Equivalent to series != other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


nlargest(n=5, split_every=None)

Return the largest n elements.

Parameters:

n : int

Return this many descending sorted values

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

take_last : deprecated

Returns:

top_n : Series

The n largest values in the Series, in sorted order

See also

Series.nsmallest

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Examples

>>> import pandas as pd    
>>> import numpy as np    
>>> s = pd.Series(np.random.randn(10 ** 6))    
>>> s.nlargest(10)  # only sorts up to the N requested    

notnull()

Return a boolean same-sized object indicating if the values are not null.

See also

isnull
boolean inverse of notnull

npartitions

Return number of partitions

nunique(split_every=None)

Return number of unique elements in the object.

Excludes NA values by default.

Parameters:

dropna : boolean, default True

Don’t include NaN in the count.

Returns:

nunique : int

Notes

Dask doesn’t support the following argument(s):

  • dropna
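
Examples

A sketch (toy data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1, 1, 2, 3]), npartitions=2)
>>> s.nunique().compute()
3
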
pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs)

New in version 0.16.2.

Parameters:

func : function

function to apply to the NDFrame. args, and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.

args : positional arguments passed into func.

kwargs : a dictionary of keyword arguments passed into func.

Returns:

object : the return type of func.

See also

pandas.DataFrame.apply, pandas.DataFrame.applymap, pandas.Series.map

Notes

Use .pipe when chaining together functions that expect Series or DataFrames. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)    

You can write

>>> (df.pipe(h)    
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:

>>> (df.pipe(h)    
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )

pow(other, level=None, fill_value=None, axis=0)

Exponential power of series and other, element-wise (binary operator pow).

Equivalent to series ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rpow

quantile(q=0.5)

Approximate quantiles of Series

Parameters:

q : list/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles
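
For example (illustrative; quantiles are approximate):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(range(100)), npartitions=4)
>>> quartiles = s.quantile([0.25, 0.75])
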
radd(other, level=None, fill_value=None, axis=0)

Addition of series and other, element-wise (binary operator radd).

Equivalent to other + series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.add

random_split(frac, random_state=None)

Pseudorandomly split dataframe into different pieces row-wise

Parameters:

frac : list

List of floats that should sum to one.

random_state: int or np.random.RandomState

If int create a new RandomState with this as the seed

Otherwise draw from the passed RandomState

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
rdiv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.truediv

reduction(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)

Generic row-wise reductions.

Parameters:

chunk : callable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregate : callable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

  • scalar: Input is a Series, with one row per partition.
  • Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
  • DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combine : callable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

token : str, optional

The name to use for the output keys.

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargs : dict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargs : dict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargs : dict, optional

Keyword arguments to pass on to combine only.

kwargs :

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725

repartition(divisions=None, npartitions=None, force=False)

Repartition dataframe along new divisions

Parameters:

divisions : list, optional

List of partitions to be used. If specified npartitions will be ignored.

npartitions : int, optional

Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.

force : bool, default False

Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
resample(rule, how=None, closed=None, label=None)

Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Parameters:

rule : string

the offset string or object representing target conversion

axis : int, optional, default 0

closed : {‘right’, ‘left’}

Which side of bin interval is closed

label : {‘right’, ‘left’}

Which bin edge label to label bucket with

convention : {‘start’, ‘end’, ‘s’, ‘e’}

loffset : timedelta

Adjust the resampled time labels

base : int, default 0

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0

on : string, optional

For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

New in version 0.19.0.

level : string or int, optional

For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.

New in version 0.19.0.

To learn more about the offset strings, please see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases.

Notes

Dask doesn’t support the following argument(s):

  • axis
  • fill_method
  • convention
  • kind
  • loffset
  • limit
  • base
  • on
  • level

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')    
>>> series = pd.Series(range(9), index=index)    
>>> series    
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()    
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()    
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()    
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5] #select first 5 rows    
2000-01-01 00:00:00     0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00     1
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00     2
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]    
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]    
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(array_like):    
...     return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler)    
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

reset_index(drop=False)

Analogous to the pandas.DataFrame.reset_index() function, see docstring there.

Parameters:

level : int, str, tuple, or list, default None

Only remove the given levels from the index. Removes all levels by default

drop : boolean, default False

Do not try to insert index into dataframe columns

name : object, default None

The name of the column corresponding to the Series values

inplace : boolean, default False

Modify the Series in place (do not create a new object)

Returns:

resetted : DataFrame, or Series if drop == True

Notes

Dask doesn’t support the following argument(s):

  • level
  • name
  • inplace
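
Examples

A sketch (toy data; single partition, so the result matches pandas exactly):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2]}, index=[10, 20])
>>> ddf = dd.from_pandas(df, npartitions=1)
>>> ddf.reset_index(drop=True).compute()
   x
0  1
1  2
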
rfloordiv(other, level=None, fill_value=None, axis=0)

Integer division of series and other, element-wise (binary operator rfloordiv).

Equivalent to other // series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.floordiv

rmod(other, level=None, fill_value=None, axis=0)

Modulo of series and other, element-wise (binary operator rmod).

Equivalent to other % series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.mod

rmul(other, level=None, fill_value=None, axis=0)

Multiplication of series and other, element-wise (binary operator rmul).

Equivalent to other * series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.mul

rolling(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)

Provides rolling transformations.

Parameters:

window : int

Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

center : boolean, default False

Set the labels at the center of the window.

win_type : string, default None

Provide a window type. The recognized window types are identical to pandas.

axis : int, default 0

Returns:

a Rolling object on which to call a method to compute a statistic

Notes

The freq argument is not supported.
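
Examples

A sketch (toy data; the Rolling object exposes the usual statistics):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(range(6)), npartitions=2)
>>> s.rolling(3).mean().compute()
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
dtype: float64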

round(decimals=0)

Round each value in a Series to the given number of decimals.

Parameters:

decimals : int

Number of decimal places to round to (default: 0). If decimals is negative, it specifies the number of positions to the left of the decimal point.

Returns:

Series object

See also

numpy.around, DataFrame.round
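
For example (illustrative; toy data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1.234, 5.678]), npartitions=1)
>>> s.round(1).compute()
0    1.2
1    5.7
dtype: float64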

rpow(other, level=None, fill_value=None, axis=0)

Exponential power of series and other, element-wise (binary operator rpow).

Equivalent to other ** series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.pow

rsub(other, level=None, fill_value=None, axis=0)

Subtraction of series and other, element-wise (binary operator rsub).

Equivalent to other - series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.sub

rtruediv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.truediv

sample(frac, replace=False, random_state=None)

Random sample of items

Parameters:

frac : float, optional

Fraction of axis items to return.

replace: boolean, optional

Sample with or without replacement. Default = False.

random_state: int or ``np.random.RandomState``

If an int, we create a new RandomState with this as the seed; otherwise we draw from the passed RandomState
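
For example (illustrative; pass an int random_state for reproducibility):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
>>> sampled = ddf.sample(0.1, random_state=123)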

std(axis=None, skipna=True, ddof=1, split_every=False)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

std : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s):

  • level
  • numeric_only

sub(other, level=None, fill_value=None, axis=0)

Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rsub

sum(axis=None, skipna=True, split_every=False)

Return the sum of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

sum : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s):

  • level
  • numeric_only

tail(n=5, compute=True)

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.
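
For example (illustrative; toy data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> ddf.tail(2)
   x
8  8
9  9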

to_bag(index=False)

Convert to a dask Bag.

Parameters:

index : bool, optional

If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.
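
For example (illustrative; b is a dask.bag.Bag of row tuples):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}), npartitions=1)
>>> b = ddf.to_bag()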

to_csv(filename, **kwargs)

Write DataFrame to a series of comma-separated values (csv) files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, ...

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 

Parameters:

filename : string

Path glob indicating the naming scheme for the output files

name_function : callable, default None

Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions

compression : string or None

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

sep : character, default ‘,’

Field delimiter for the output file

na_rep : string, default ‘’

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None

deprecated, use na_rep

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename

line_terminator : string, default ‘\n’

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default ‘”’

character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

character used to escape sep and quotechar when appropriate

chunksize : int or None

rows to write at a time

tupleize_cols : boolean, default False

Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)

date_format : string, default None

Format string for datetime objects

decimal: string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

to_delayed()

Convert dataframe into dask Delayed objects

Returns a list of delayed values, one value per partition.
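
For example (illustrative; one delayed value per partition):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(4)}), npartitions=2)
>>> parts = ddf.to_delayed()
>>> len(parts)
2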

to_frame(name=None)

Convert Series to DataFrame

Parameters:

name : object, default None

The passed name should substitute for the series name (if it has one).

Returns:

data_frame : DataFrame
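
For example (illustrative; toy data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1, 2], name='x'), npartitions=1)
>>> s.to_frame().compute()
   x
0  1
1  2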

to_hdf(path_or_buf, key, mode='a', append=False, get=None, **kwargs)

Export frame to hdf file(s)

Export dataframe to one or multiple hdf5 files or nodes.

Exported hdf format is pandas’ hdf table format only. Data saved by this function should be read by a pandas-DataFrame-compatible reader.

By providing a single asterisk in either the path_or_buf or key parameters you direct dask to save each partition to a different file or node (respectively). The asterisk will be replaced with a zero padded partition number, as this is the default implementation of name_function.

When writing to a single hdf node in a single hdf file, all hdf save tasks are required to execute in a specific order, often becoming the bottleneck of the entire execution graph. Saving to multiple nodes or files removes that restriction (order is still preserved by enforcing order on output, using name_function) and enables executing save tasks in parallel.

Parameters:

path_or_buf: HDFStore object or string

Destination file(s). If string, can contain a single asterisk to save each partition to a different file. Only one asterisk is allowed in both path_or_buf and key parameters.

key: string

A node / group path in file, can contain a single asterisk to save each partition to a different hdf node in a single file. Only one asterisk is allowed in both path_or_buf and key parameters.

format: optional, default ‘table’

Default hdf storage format, currently only pandas’ ‘table’ format is supported.

mode: optional, {‘a’, ‘w’, ‘r+’}, default ‘a’

  • ‘a’: Append; add data to existing file(s) or create new.
  • ‘w’: Write; overwrite any existing files with new ones.
  • ‘r+’: Append to existing files; files must already exist.

append: optional, default False

If False, overwrites existing node with the same name, otherwise appends to it.

complevel: optional, 0-9, default 0

Compression level; higher means better compression ratio and possibly more CPU time. Depends on complib.

complib: {‘zlib’, ‘bzip2’, ‘lzo’, ‘blosc’, None}, default None

If complevel > 0, compress using this compression library when possible.

fletcher32: bool, default False

If True and compression is used, additionally apply the fletcher32 checksum.

get: callable, optional

A scheduler `get` function to use. If not provided, the default is to check the global settings first, and then fall back to defaults for the collections.

dask_kwargs: dict, optional

A dictionary of keyword arguments passed to the `get` function used.

name_function: callable, optional, default None

A callable called for each partition that accepts a single int representing the partition number. name_function must return a string representation of a partition’s index in a way that will preserve the partition’s location after a string sort. If None, a default name_function is used, which returns a zero padded string of the received int. See dask.utils.build_name_function for more info.

compute: bool, default True

If True, execute computation of resulting dask graph. If False, return a Delayed object.

lock: bool, None or lock object, default None

In to_hdf locks are needed for two reasons. First, to protect against writing to the same file from multiple processes or threads simultaneously. Second, the default libhdf5 is not thread safe, so we must additionally lock on its usage. By default, if lock is None, the lock will be determined optimally based on path_or_buf, key and the scheduler used. Manually setting this parameter is usually not required to improve performance.

Alternatively, you can specify specific values: if False, no locking will occur; if True, a default lock object will be created (multiprocessing.Manager.Lock on the multiprocessing scheduler, threading.Lock otherwise), which can be used to force locking in scenarios where the default behavior would be to avoid it; else, the value is assumed to implement the lock interface and will be used as the lock object.

See also

dask.DataFrame.read_hdf
reading hdf files
dask.Series.read_hdf
reading hdf files

Examples

Saving data to a single file:

>>> df.to_hdf('output.hdf', '/data')            

Saving data to multiple nodes:

>>> with pd.HDFStore('output.hdf') as fh:
...     df.to_hdf(fh, '/data*')
...     fh.keys()                               
['/data0', '/data1']

Or multiple files:

>>> df.to_hdf('output_*.hdf', '/data')          

Saving multiple files with the multiprocessing scheduler and manually disabling locks:

>>> df.to_hdf('output_*.hdf', '/data',
...   get=dask.multiprocessing.get, lock=False) 

to_timestamp(freq=None, how='start', axis=0)

Cast to DatetimeIndex of timestamps, at beginning of period

Parameters:

freq : string, default frequency of PeriodIndex

Desired frequency

how : {‘s’, ‘e’, ‘start’, ‘end’}

Convention for converting period to timestamp; start of period vs. end

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The axis to convert (the index by default)

copy : boolean, default True

If false then underlying input data is not copied

Returns:

df : DataFrame with DatetimeIndex

Notes

Dask doesn’t support the following argument(s):

  • copy

truediv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rtruediv

unique(split_every=None)

Return Series of unique values in the object. Includes NA values.

Returns:

uniques : Series
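
For example (illustrative; note the result is a Series, unlike pandas):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1, 2, 1]), npartitions=1)
>>> uniques = s.unique()
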
value_counts(split_every=None)

Returns object containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:

normalize : boolean, default False

If True then the object returned will contain the relative frequencies of the unique values.

sort : boolean, default True

Sort by values

ascending : boolean, default False

Sort in ascending order

bins : integer, optional

Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data

dropna : boolean, default True

Don’t include counts of NaN.

Returns:

counts : Series

Notes

Dask doesn’t support the following argument(s):

  • normalize
  • sort
  • ascending
  • bins
  • dropna
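
Examples

A sketch (toy data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(['a', 'b', 'a']), npartitions=1)
>>> s.value_counts().compute()
a    2
b    1
dtype: int64
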
var(axis=None, skipna=True, ddof=1, split_every=False)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

var : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s):

  • level
  • numeric_only

visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:

filename : str or None, optional

The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.

format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional

Format in which to write output file. Default is ‘png’.

optimize_graph : bool, optional

If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

**kwargs

Additional keyword arguments to forward to to_graphviz.

Returns:

result : IPython.display.Image, IPython.display.SVG, or None

See dask.dot.dot_graph for more information.

See also

dask.base.visualize, dask.dot.dot_graph

Notes

For more information on optimization see here:

http://dask.pydata.org/en/latest/optimize.html

where(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

try to cast the result back to the input type (if possible),

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

See also

DataFrame.mask()

Notes

Dask doesn’t support the following argument(s):

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error

Examples

>>> s = pd.Series(range(5))    
>>> s.where(s > 0)    
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])    
>>> m = df % 3 == 0    
>>> df.where(m, -df)    
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)    
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

DataFrameGroupBy

class dask.dataframe.groupby.DataFrameGroupBy(df, index=None, slice=None, **kwargs)

agg(arg, split_every=None)

Aggregate using input function or dict of {column -> function}

Parameters:

arg : function or dict

Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.

Accepted Combinations are:
  • string cythonized function name
  • function
  • list of functions
  • dict of columns -> functions
  • nested dict of names -> dicts of functions

Returns:

aggregated : DataFrame

See also

pandas.Series.groupby, pandas.DataFrame.groupby

Notes

Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).
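
Examples

A sketch (toy data; a dict of column -> function):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'a': [1, 1, 2], 'b': [1.0, 2.0, 3.0]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> res = ddf.groupby('a').agg({'b': 'sum'})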

aggregate(arg, split_every=None)

Aggregate using input function or dict of {column -> function}

Parameters:

arg : function or dict

Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.

Accepted Combinations are:
  • string cythonized function name
  • function
  • list of functions
  • dict of columns -> functions
  • nested dict of names -> dicts of functions

Returns:

aggregated : DataFrame

See also

pandas.Series.groupby, pandas.DataFrame.groupby

Notes

Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).

apply(func, meta='__no_default__', columns='__no_default__')

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

  1. The user should provide output metadata.
  2. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.

Parameters:

func: function

Function to apply

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

columns: list, scalar or None

Deprecated, use meta instead. If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar, or None (no name). If the keyword is not given, dask tries to infer the result type from the beginning of the data; this inference may take some time and lead to unexpected results.

Returns:

applied : Series or DataFrame depending on columns keyword
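
Examples

A sketch (toy data; note the explicit meta, as recommended above):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'a': [1, 1, 2], 'b': [1.0, 2.0, 3.0]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> res = ddf.groupby('a').apply(lambda g: g.b - g.b.mean(),
...                              meta=('b', 'f8'))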

count(split_every=None)

Compute count of group, excluding missing values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

get_group(key)

Constructs NDFrame from group with provided name

Parameters:

name : object

the name of the group to get as a DataFrame

obj : NDFrame, default None

the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used

Returns:

group : type of obj

Notes

Dask doesn’t support the following argument(s):

  • name
  • obj

max(split_every=None)

Compute max of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

mean(split_every=None)

Compute mean of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

min(split_every=None)

Compute min of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

size(split_every=None)

Compute group sizes

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

std(ddof=1, split_every=None)

Compute standard deviation of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

sum(split_every=None)

Compute sum of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

var(ddof=1, split_every=None)

Compute variance of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby
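
As a hedged sketch of the ddof keyword, reusing the hypothetical ddf above:

>>> g = ddf.groupby('x')
>>> g.std(ddof=0).compute()  # population standard deviation
>>> g.var().compute()        # sample variance; ddof=1 by default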

SeriesGroupBy

class dask.dataframe.groupby.SeriesGroupBy(df, index, slice=None, **kwargs)
agg(arg, split_every=None)

Apply an aggregation function or functions to groups, most likely yielding a Series, but in some cases a DataFrame, depending on the output of the aggregation function

Parameters:

func_or_funcs : function or list / dict of functions

A list or dict of functions will produce a DataFrame, with column names determined by the function names themselves (for a list) or by the keys in the dict

Returns:

Series or DataFrame

See also

apply, transform

Notes

Dask doesn’t support the following argument(s).

  • func_or_funcs

Examples

>>> series    
bar    1.0
baz    2.0
qot    3.0
qux    4.0
>>> mapper = lambda x: x[0] # first letter    
>>> grouped = series.groupby(mapper)    
>>> grouped.aggregate(np.sum)    
b    3.0
q    7.0
>>> grouped.aggregate([np.sum, np.mean, np.std])    
   mean  std  sum
b  1.5   0.5  3
q  3.5   0.5  7
>>> grouped.agg({'result' : lambda x: x.mean() / x.std(),    
...              'total' : np.sum})
   result  total
b  2.121   3
q  4.95    7
aggregate(arg, split_every=None)

Apply an aggregation function or functions to groups, most likely yielding a Series, but in some cases a DataFrame, depending on the output of the aggregation function

Parameters:

func_or_funcs : function or list / dict of functions

A list or dict of functions will produce a DataFrame, with column names determined by the function names themselves (for a list) or by the keys in the dict

Returns:

Series or DataFrame

See also

apply, transform

Notes

Dask doesn’t support the following argument(s).

  • func_or_funcs

Examples

>>> series    
bar    1.0
baz    2.0
qot    3.0
qux    4.0
>>> mapper = lambda x: x[0] # first letter    
>>> grouped = series.groupby(mapper)    
>>> grouped.aggregate(np.sum)    
b    3.0
q    7.0
>>> grouped.aggregate([np.sum, np.mean, np.std])    
   mean  std  sum
b  1.5   0.5  3
q  3.5   0.5  7
>>> grouped.agg({'result' : lambda x: x.mean() / x.std(),    
...              'total' : np.sum})
   result  total
b  2.121   3
q  4.95    7
apply(func, meta='__no_default__', columns='__no_default__')

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

  1. The user should provide output metadata.
  2. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters:

func: function

Function to apply

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

columns: list, scalar or None

Deprecated; use meta instead. If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar, or None (no name). If this keyword is not given, dask tries to infer the result type from the beginning of the data. This inference may take some time and lead to unexpected results.

Returns:

applied : Series or DataFrame depending on columns keyword

count(split_every=None)

Compute count of group, excluding missing values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

get_group(key)

Constructs NDFrame from group with provided name

Parameters:

name : object

the name of the group to get as a DataFrame

obj : NDFrame, default None

the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used

Returns:

group : type of obj

Notes

Dask doesn’t support the following argument(s).

  • name
  • obj
max(split_every=None)

Compute max of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

mean(split_every=None)

Compute mean of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

min(split_every=None)

Compute min of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

size(split_every=None)

Compute group sizes

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

std(ddof=1, split_every=None)

Compute standard deviation of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

sum(split_every=None)

Compute sum of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

var(ddof=1, split_every=None)

Compute variance of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

Other functions

dask.dataframe.compute(*args, **kwargs)

Compute several dask collections at once.

Parameters:

args : object

Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to defaults for the collections.

optimize_graph : bool, optional

If True [default], the optimizations for each collection are applied before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler get function.

Examples

>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)
dask.dataframe.map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

Parameters:

func : function

Function applied to each partition.

args, kwargs :

Arguments and keywords to pass to the function. At least one of the args should be a Dask.dataframe.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
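
For illustration, a minimal sketch (the column name x and the doubling function are hypothetical):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6)}), npartitions=3)
>>> dd.map_partitions(lambda part: part.x * 2, ddf, meta=('x', 'i8')).compute()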

dask.dataframe.multi.concat(dfs, axis=0, join='outer', interleave_partitions=False)

Concatenate DataFrames along rows.

  • When axis=0 (default), concatenate DataFrames row-wise:
    • If all divisions are known and ordered, concatenate the DataFrames keeping their divisions. When divisions are not ordered, specifying interleave_partitions=True allows the partitions to be interleaved.
    • If any division is unknown, concatenate the DataFrames, resetting all divisions to unknown (None).
  • When axis=1, concatenate DataFrames column-wise:
    • Allowed only if all divisions are known.
    • If any division is unknown, a ValueError is raised.
Parameters:

dfs : list

List of dask.DataFrames to be concatenated

axis : {0, 1, ‘index’, ‘columns’}, default 0

The axis to concatenate along

join : {‘inner’, ‘outer’}, default ‘outer’

How to handle indexes on other axis

interleave_partitions : bool, default False

Whether to concatenate DataFrames ignoring their order. If True, the divisions are interleaved.

Examples

If all divisions are known and ordered, divisions are kept.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>

Unable to concatenate if divisions are not ordered.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])                               
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order

Specify interleave_partitions=True to ignore the division order.

>>> dd.concat([a, b], interleave_partitions=True)   
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>

If any division is unknown, the resulting divisions will be unknown

>>> a                                               
dd.DataFrame<x, divisions=(None, None)>
>>> b                                               
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(None, None, None, None)>
dask.dataframe.multi.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:

left : DataFrame

right : DataFrame

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

  • left: use only keys from left frame (SQL: left outer join)
  • right: use only keys from right frame (SQL: right outer join)
  • outer: use union of keys from both frames (SQL: full outer join)
  • inner: use intersection of keys from both frames (SQL: inner join)

on : label or list

Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like

Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like

Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index

sort : boolean, default False

Sort the join keys lexicographically in the result DataFrame

suffixes : 2-length sequence (tuple, list, ...)

Suffix to apply to overlapping column names in the left and right side, respectively

copy : boolean, default True

If False, do not copy data unnecessarily

indicator : boolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

New in version 0.17.0.

Returns:

merged : DataFrame

The output type will be the same as ‘left’, if it is a subclass of DataFrame.

See also

merge_ordered, merge_asof

Examples

>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
dask.dataframe.read_csv(urlpath, blocksize=33554432, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, storage_options=None, **kwargs)

Read CSV files into a Dask.DataFrame

This parallelizes the pandas.read_csv function in the following ways:

  1. It supports loading many files at once using globstrings as follows:

    >>> df = dd.read_csv('myfiles.*.csv')  
    
  2. In some cases it can break up large files as follows:

    >>> df = dd.read_csv('largefile.csv', blocksize=25e6)  # 25MB chunks  
    
  3. You can read CSV files from external resources (e.g. S3, HDFS) providing a URL:

    >>> df = dd.read_csv('s3://bucket/myfiles.*.csv')  
    >>> df = dd.read_csv('hdfs:///myfiles.*.csv')  
    >>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')  
    

Internally dd.read_csv uses pandas.read_csv and so supports many of the same keyword arguments with the same performance guarantees.

See the docstring for pandas.read_csv for more information on available keyword arguments.

Note that this function may fail if a CSV file includes quoted strings that contain the line terminator.

Parameters:

urlpath : string

Absolute or relative filepath, URL (may include protocols like s3://), or globstring for CSV files.

blocksize : int or None

Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.

collection : boolean

Return a dask.dataframe if True or list of dask.delayed objects if False

sample : int

Number of bytes to use when determining dtypes

storage_options : dict

Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.

**kwargs : dict

Options to pass down to pandas.read_csv
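
A hedged sketch of forwarding pandas keywords (the file pattern, column names, and dtypes are hypothetical):

>>> df = dd.read_csv('data-*.csv',
...                  dtype={'id': 'int64', 'amount': 'float64'},
...                  usecols=['id', 'amount'])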

dask.dataframe.read_table(urlpath, blocksize=33554432, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, storage_options=None, **kwargs)

Read delimited files into a Dask.DataFrame

This parallelizes the pandas.read_table function in the following ways:

  1. It supports loading many files at once using globstrings as follows:

    >>> df = dd.read_table('myfiles.*.csv')  
    
  2. In some cases it can break up large files as follows:

    >>> df = dd.read_table('largefile.csv', blocksize=25e6)  # 25MB chunks  
    
  3. You can read CSV files from external resources (e.g. S3, HDFS) providing a URL:

    >>> df = dd.read_table('s3://bucket/myfiles.*.csv')  
    >>> df = dd.read_table('hdfs:///myfiles.*.csv')  
    >>> df = dd.read_table('hdfs://namenode.example.com/myfiles.*.csv')  
    

Internally dd.read_table uses pandas.read_table and so supports many of the same keyword arguments with the same performance guarantees.

See the docstring for pandas.read_table for more information on available keyword arguments.

Note that this function may fail if a delimited file includes quoted strings that contain the line terminator.

Parameters:

urlpath : string

Absolute or relative filepath, URL (may include protocols like s3://), or globstring for delimited files.

blocksize : int or None

Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.

collection : boolean

Return a dask.dataframe if True or list of dask.delayed objects if False

sample : int

Number of bytes to use when determining dtypes

storage_options : dict

Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.

**kwargs : dict

Options to pass down to pandas.read_table

dask.dataframe.read_hdf(path_or_buf, key=None, **kwargs)

Read from the store, closing it if we opened it

Retrieve pandas object stored in file, optionally based on where criteria

Parameters:

path_or_buf : path (string), buffer, or path object (pathlib.Path or py._path.local.LocalPath) to read from

New in version 0.19.0: support for pathlib, py.path.

key : group identifier in the store; can be omitted if the HDF file contains a single pandas object

where : list of Term (or convertible) objects, optional

start : integer, optional (defaults to None), row number to start selection

stop : integer, optional (defaults to None), row number to stop selection

columns : optional, a list of columns that, if not None, will limit the returned columns

iterator : optional, boolean, return an iterator, default False

chunksize : optional, nrows to include in iteration, return an iterator

Returns:

The selected object
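
For illustration, a hedged sketch (the file names and key are hypothetical; chunksize here assumes dask's rows-per-partition keyword):

>>> import dask.dataframe as dd
>>> df = dd.read_hdf('data.h5', key='/df')
>>> df = dd.read_hdf('data-*.h5', key='/df', chunksize=100000)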

dask.dataframe.from_array(x, chunksize=50000, columns=None)

Read a dask DataFrame from any sliceable array

Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax

x[50000:100000]

and have 2 dimensions:

x.ndim == 2

or have a record dtype:

x.dtype == [('name', 'O'), ('balance', 'i8')]
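
A minimal sketch with a plain 2-d array (the column names a and b are hypothetical):

>>> import numpy as np
>>> import dask.dataframe as dd
>>> x = np.arange(20).reshape(10, 2)
>>> df = dd.from_array(x, chunksize=5, columns=['a', 'b'])
>>> df.head(3)
   a  b
0  0  1
1  2  3
2  4  5
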
dask.dataframe.from_pandas(data, npartitions=None, chunksize=None, sort=True, name=None)

Construct a dask object from a pandas object.

If given a pandas.Series a dask.Series will be returned. If given a pandas.DataFrame a dask.DataFrame will be returned. All other pandas objects will raise a TypeError.

Parameters:

df : pandas.DataFrame or pandas.Series

The DataFrame/Series with which to construct a dask DataFrame/Series

npartitions : int, optional

The number of partitions of the index to create.

chunksize : int, optional

The size of the partitions of the index.

sort: bool

Sort the input first to obtain cleanly divided partitions, or don’t sort and accept partitions that are not cleanly divided

name: string, optional

An optional keyname for the dataframe. Defaults to hashing the input

Returns:

dask.DataFrame or dask.Series

A dask DataFrame/Series partitioned along the index

Raises:

TypeError

If something other than a pandas.DataFrame or pandas.Series is passed in.

See also

from_array
Construct a dask.DataFrame from an array that has record dtype
from_bcolz
Construct a dask.DataFrame from a bcolz ctable
read_csv
Construct a dask.DataFrame from a CSV file

Examples

>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
dask.dataframe.from_bcolz(x, chunksize=None, categorize=True, index=None, lock=<unlocked _thread.lock object>, **kwargs)

Read a dask DataFrame from a bcolz.ctable

Parameters:

x : bcolz.ctable

Input data

chunksize : int, optional

The size of blocks to pull out from ctable. Ideally as large as can comfortably fit in memory

categorize : bool, defaults to True

Automatically categorize all string dtypes

index : string, optional

Column to make the index

lock: bool or Lock

Lock to use when reading or False for no lock (not-thread-safe)

See also

from_array
more generic function not optimized for bcolz
dask.dataframe.from_dask_array(x, columns=None)

Convert dask Array to dask DataFrame

Converts a 2d array into a DataFrame and a 1d array into a Series.

Parameters:

x: da.Array

columns: list or string

list of column names if DataFrame, single string if Series

Examples

>>> import dask.array as da
>>> import dask.dataframe as dd
>>> x = da.ones((4, 2), chunks=(2, 2))
>>> df = dd.io.from_dask_array(x, columns=['a', 'b'])
>>> df.compute()
     a    b
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
dask.dataframe.from_delayed(dfs, meta=None, divisions=None, prefix='from-delayed', metadata=None)

Create DataFrame from many dask.delayed objects

Parameters:

dfs : list of Delayed

An iterable of dask.delayed.Delayed objects, such as come from dask.delayed. These comprise the individual partitions of the resulting dataframe.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

divisions : tuple, str, optional

Partition boundaries along the index. For a tuple, see http://dask.pydata.io/en/latest/dataframe-partitions.html. For the string ‘sorted’, the delayed values are computed to find the index values; this assumes the indexes are mutually sorted. If None, index information is not used.

prefix : str, optional

Prefix to prepend to the keys.
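
As a hedged sketch (the file names and meta are hypothetical):

>>> import pandas as pd
>>> from dask import delayed
>>> import dask.dataframe as dd
>>> parts = [delayed(pd.read_csv)(fn) for fn in ['a.csv', 'b.csv']]
>>> df = dd.from_delayed(parts, meta={'x': 'i8', 'y': 'f8'})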

dask.dataframe.rolling.rolling_apply(arg, window, func, min_periods=None, freq=None, center=False, args=(), kwargs={})

Generic moving function application.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

func : function

Must produce a single value from an ndarray input

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

args : tuple

Passed on to func

kwargs : dict

Passed on to func

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see this link.
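
For illustration, a minimal sketch (the range-spread function is hypothetical):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import rolling_apply
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> rolling_apply(s, 3, lambda a: a.max() - a.min()).compute()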

dask.dataframe.rolling.rolling_chunk(func, part1, part2, window, *args)
dask.dataframe.rolling.rolling_count(arg, window, **kwargs)

Rolling count of number of non-NaN observations inside provided window.

Parameters:

arg : DataFrame or numpy ndarray-like

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

how : string, default ‘mean’

Method for down- or re-sampling

Returns:

rolling_count : type of caller

Notes

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see this link.

dask.dataframe.rolling.rolling_kurt(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Unbiased moving kurtosis.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_max(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving maximum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘max’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_mean(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving mean.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
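
A hedged sketch, reusing the hypothetical series s from the rolling_apply sketch above:

>>> from dask.dataframe.rolling import rolling_mean
>>> rolling_mean(s, window=3, min_periods=1).compute()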

dask.dataframe.rolling.rolling_median(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving median.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘median’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_min(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving minimum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘min’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_quantile(arg, window, quantile, min_periods=None, freq=None, center=False)

Moving quantile.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

quantile : float

0 <= quantile <= 1

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see this link.

dask.dataframe.rolling.rolling_skew(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Unbiased moving skewness.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_std(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving standard deviation.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

ddof : int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_sum(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving sum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_var(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving variance.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

ddof : int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_window(arg, window=None, win_type=None, min_periods=None, freq=None, center=False, mean=True, axis=0, how=None, **kwargs)

Applies a moving window of type window_type and size window on the data.

Parameters:

arg : Series, DataFrame

window : int or ndarray

Weighting window specification. If the window is an integer, then it is treated as the window length and win_type is required

win_type : str, default None

Window type (see Notes)

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

mean : boolean, default True

If True computes weighted mean, else weighted sum

axis : {0, 1}, default 0

how : string, default ‘mean’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

The recognized window types are:

  • boxcar
  • triang
  • blackman
  • hamming
  • bartlett
  • parzen
  • bohman
  • blackmanharris
  • nuttall
  • barthann
  • kaiser (needs beta)
  • gaussian (needs std)
  • general_gaussian (needs power, width)
  • slepian (needs width).

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see this link.
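
As a hedged sketch of a weighted window (win_type ‘triang’ is one of the recognized types above; the series is hypothetical):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import rolling_window
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> rolling_window(s, window=5, win_type='triang', min_periods=1).compute()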