API¶
Top level user functions:
DataFrame (dsk, name, meta, divisions) |
Implements out-of-core DataFrame as a sequence of pandas DataFrames |
DataFrame.add (other[, axis, level, fill_value]) |
Addition of dataframe and other, element-wise (binary operator add). |
DataFrame.append (other) |
Append rows of other to the end of this frame, returning a new object. |
DataFrame.apply (func[, axis, args, meta, ...]) |
Parallel version of pandas.DataFrame.apply |
DataFrame.assign (**kwargs) |
Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones. |
DataFrame.astype (dtype) |
Cast object to input numpy.dtype |
DataFrame.cache ([cache]) |
Evaluate Dataframe and store in local cache |
DataFrame.categorize ([columns]) |
Convert columns of the DataFrame to category dtype |
DataFrame.columns |
|
DataFrame.compute (**kwargs) |
Compute several dask collections at once. |
DataFrame.corr ([method, min_periods]) |
Compute pairwise correlation of columns, excluding NA/null values |
DataFrame.count ([axis, split_every]) |
Return Series with number of non-NA/null observations over requested axis. |
DataFrame.cov ([min_periods]) |
Compute pairwise covariance of columns, excluding NA/null values |
DataFrame.cummax ([axis, skipna]) |
Return cumulative max over requested axis. |
DataFrame.cummin ([axis, skipna]) |
Return cumulative minimum over requested axis. |
DataFrame.cumprod ([axis, skipna]) |
Return cumulative product over requested axis. |
DataFrame.cumsum ([axis, skipna]) |
Return cumulative sum over requested axis. |
DataFrame.describe ([split_every]) |
Generate various summary statistics, excluding NaN values. |
DataFrame.div (other[, axis, level, fill_value]) |
Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.drop (labels[, axis, dtype]) |
Return new object with labels in requested axis removed. |
DataFrame.drop_duplicates (**kwargs) |
Return DataFrame with duplicate rows removed, optionally only considering certain columns |
DataFrame.dropna ([how, subset]) |
Return object with labels on given axis omitted where (alternately) any or all of the data are missing |
DataFrame.dtypes |
Return data types |
DataFrame.fillna (value) |
Fill NA/NaN values using the specified method |
DataFrame.floordiv (other[, axis, level, ...]) |
Integer division of dataframe and other, element-wise (binary operator floordiv). |
DataFrame.get_partition (n) |
Get a dask DataFrame/Series representing the nth partition. |
DataFrame.groupby (key, **kwargs) |
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns. |
DataFrame.head ([n, npartitions, compute]) |
First n rows of the dataset |
DataFrame.index |
Return dask Index instance |
DataFrame.iterrows () |
Iterate over DataFrame rows as (index, Series) pairs. |
DataFrame.itertuples () |
Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple. |
DataFrame.join (other[, on, how, lsuffix, ...]) |
Join columns with other DataFrame either on index or on a key column. |
DataFrame.known_divisions |
Whether divisions are already known |
DataFrame.loc |
Purely label-location based indexer for selection by label. |
DataFrame.map_partitions (func, *args, **kwargs) |
Apply Python function on each DataFrame partition. |
DataFrame.mask (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other. |
DataFrame.max ([axis, skipna, split_every]) |
This method returns the maximum of the values in the object. |
DataFrame.mean ([axis, skipna, split_every]) |
Return the mean of the values for the requested axis |
DataFrame.merge (right[, how, on, left_on, ...]) |
Merge DataFrame objects by performing a database-style join operation by columns or indexes. |
DataFrame.min ([axis, skipna, split_every]) |
This method returns the minimum of the values in the object. |
DataFrame.mod (other[, axis, level, fill_value]) |
Modulo of dataframe and other, element-wise (binary operator mod). |
DataFrame.mul (other[, axis, level, fill_value]) |
Multiplication of dataframe and other, element-wise (binary operator mul). |
DataFrame.ndim |
Return dimensionality |
DataFrame.nlargest ([n, columns, split_every]) |
Get the rows of a DataFrame sorted by the n largest values of columns. |
DataFrame.npartitions |
Return number of partitions |
DataFrame.pow (other[, axis, level, fill_value]) |
Exponential power of dataframe and other, element-wise (binary operator pow). |
DataFrame.quantile ([q, axis]) |
Approximate row-wise and precise column-wise quantiles of DataFrame |
DataFrame.query (expr, **kwargs) |
|
DataFrame.radd (other[, axis, level, fill_value]) |
Addition of dataframe and other, element-wise (binary operator radd). |
DataFrame.random_split (frac[, random_state]) |
Pseudorandomly split dataframe into different pieces row-wise |
DataFrame.rdiv (other[, axis, level, fill_value]) |
Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.rename ([index, columns]) |
Alter axes using input function or functions. |
DataFrame.repartition ([divisions, ...]) |
Repartition dataframe along new divisions |
DataFrame.reset_index ([drop]) |
For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. |
DataFrame.rfloordiv (other[, axis, level, ...]) |
Integer division of dataframe and other, element-wise (binary operator rfloordiv). |
DataFrame.rmod (other[, axis, level, fill_value]) |
Modulo of dataframe and other, element-wise (binary operator rmod). |
DataFrame.rmul (other[, axis, level, fill_value]) |
Multiplication of dataframe and other, element-wise (binary operator rmul). |
DataFrame.rpow (other[, axis, level, fill_value]) |
Exponential power of dataframe and other, element-wise (binary operator rpow). |
DataFrame.rsub (other[, axis, level, fill_value]) |
Subtraction of dataframe and other, element-wise (binary operator rsub). |
DataFrame.rtruediv (other[, axis, level, ...]) |
Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.sample (frac[, replace, random_state]) |
Random sample of items |
DataFrame.set_index (other[, drop, sorted]) |
Set the DataFrame index (row labels) using an existing column |
DataFrame.set_partition (column, divisions, ...) |
Set explicit divisions for new column index |
DataFrame.std ([axis, skipna, ddof, split_every]) |
Return sample standard deviation over requested axis. |
DataFrame.sub (other[, axis, level, fill_value]) |
Subtraction of dataframe and other, element-wise (binary operator sub). |
DataFrame.sum ([axis, skipna, split_every]) |
Return the sum of the values for the requested axis |
DataFrame.tail ([n, compute]) |
Last n rows of the dataset |
DataFrame.to_bag ([index]) |
Convert to a dask Bag of tuples of each row. |
DataFrame.to_csv (filename, **kwargs) |
Write DataFrame to a series of comma-separated values (csv) files |
DataFrame.to_hdf (path_or_buf, key[, mode, ...]) |
Export frame to hdf file(s) |
DataFrame.to_delayed () |
Convert dataframe into dask Delayed objects |
DataFrame.truediv (other[, axis, level, ...]) |
Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.var ([axis, skipna, ddof, split_every]) |
Return unbiased variance over requested axis. |
DataFrame.visualize ([filename, format, ...]) |
Render the computation of this object’s task graph using graphviz. |
DataFrame.where (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other. |
Rolling Operations¶
rolling.rolling_apply (arg, window, func[, ...]) |
Generic moving function application. |
rolling.rolling_count (arg, window, **kwargs) |
Rolling count of number of non-NaN observations inside provided window. |
rolling.rolling_kurt (arg, window[, ...]) |
Unbiased moving kurtosis. |
rolling.rolling_max (arg, window[, ...]) |
Moving maximum. |
rolling.rolling_mean (arg, window[, ...]) |
Moving mean. |
rolling.rolling_median (arg, window[, ...]) |
Moving median. |
rolling.rolling_min (arg, window[, ...]) |
Moving minimum. |
rolling.rolling_quantile (arg, window, quantile) |
Moving quantile. |
rolling.rolling_skew (arg, window[, ...]) |
Unbiased moving skewness. |
rolling.rolling_std (arg, window[, ...]) |
Moving standard deviation. |
rolling.rolling_sum (arg, window[, ...]) |
Moving sum. |
rolling.rolling_var (arg, window[, ...]) |
Moving variance. |
rolling.rolling_window (arg[, window, ...]) |
Applies a moving window of type window_type and size window on the data. |
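A quick, hedged sketch of how the moving-window functions above are called (assuming they live in dask.dataframe.rolling, as the rolling. prefix suggests; the window size here is arbitrary):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe import rolling
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> rolling.rolling_mean(s, window=3).compute()  # moving mean over a window of 3 observations
>>> rolling.rolling_sum(s, window=3).compute()   # moving sum over the same window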
Create DataFrames¶
read_csv (urlpath[, blocksize, collection, ...]) |
Read CSV files into a Dask.DataFrame |
read_table (urlpath[, blocksize, collection, ...]) |
Read delimited files into a Dask.DataFrame |
read_hdf (path_or_buf[, key]) |
Read HDF files into a Dask.DataFrame |
from_array (x[, chunksize, columns]) |
Read dask DataFrame from any sliceable array |
from_bcolz (x[, chunksize, categorize, ...]) |
Read dask Dataframe from bcolz.ctable |
from_dask_array (x[, columns]) |
Convert dask Array to dask DataFrame |
from_delayed (dfs[, meta, divisions, prefix, ...]) |
Create DataFrame from many dask.delayed objects |
from_pandas (data[, npartitions, chunksize, ...]) |
Construct a dask object from a pandas object. |
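A minimal illustrative sketch of two of the constructors above, from_pandas and read_csv (the CSV glob and blocksize are hypothetical):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': range(6), 'y': list('abcabc')})
>>> ddf = dd.from_pandas(pdf, npartitions=3)                 # split the pandas frame into 3 partitions
>>> csv_ddf = dd.read_csv('data-*.csv', blocksize=64000000)  # hypothetical glob of CSV files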
DataFrame Methods¶
-
class dask.dataframe.DataFrame(dsk, name, meta, divisions)¶
Implements out-of-core DataFrame as a sequence of pandas DataFrames
Parameters: dask: dict
The dask graph to compute this DataFrame
name: str
The key prefix that specifies which keys in the dask comprise this particular DataFrame
meta: pandas.DataFrame
An empty pandas.DataFrame with names, dtypes, and index matching the expected output.
divisions: tuple of index values
Values along which we partition our blocks on the index
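In practice a DataFrame is rarely built from a raw graph; constructors such as from_pandas fill in dsk, name, meta and divisions for you. A minimal sketch (partition count chosen arbitrarily):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'a': range(8)}), npartitions=4)
>>> ddf.npartitions   # the frame is backed by 4 pandas DataFrames
4
>>> ddf.divisions     # index values along which the partitions are split
(0, 2, 4, 6, 7)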
-
add
(other, axis='columns', level=None, fill_value=None)¶ Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
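A small hedged sketch of the elementwise arithmetic methods, here add with a fill_value (data invented for illustration):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> left = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3]}), npartitions=1)
>>> right = dd.from_pandas(pd.DataFrame({'a': [10, None, 30]}), npartitions=1)
>>> left.add(right, fill_value=0).compute()  # missing entries in one input are treated as 0
      a
0  11.0
1   2.0
2  33.0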
-
align
(other, join='outer', axis=None, fill_value=None)¶ Align two objects on their axes with the specified join method for each axis Index
Parameters: other : DataFrame or Series
join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None)
level : int or level name, default None
Broadcast across a level, matching Index values on the passed MultiIndex level
copy : boolean, default True
Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value
method : str, default None
limit : int, default None
fill_axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Filling axis, method and limit
broadcast_axis : {0 or ‘index’, 1 or ‘columns’}, default None
Broadcast values along this axis, if aligning two objects of different dimensions
New in version 0.17.0.
Returns: (left, right) : (DataFrame, type of other)
Aligned objects
Notes
Dask doesn't support the following argument(s).
- level
- copy
- method
- limit
- fill_axis
- broadcast_axis
-
all
(axis=None, skipna=True, split_every=False)¶ Return whether all elements are True over requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: all : Series or DataFrame (if level specified)
Notes
Dask doesn't support the following argument(s).
- bool_only
- level
-
any
(axis=None, skipna=True, split_every=False)¶ Return whether any element is True over requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: any : Series or DataFrame (if level specified)
Notes
Dask doesn't support the following argument(s).
- bool_only
- level
-
append
(other)¶ Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
Parameters: other : DataFrame or Series/dict-like object, or list of these
The data to append.
ignore_index : boolean, default False
If True, do not use the index labels.
verify_integrity : boolean, default False
If True, raise ValueError on creating index with duplicates.
Returns: appended : DataFrame
See also
pandas.concat
- General function to concatenate DataFrame, Series or Panel objects
Notes
Dask doesn't support the following argument(s).
- ignore_index
- verify_integrity
Examples
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df.append(df2)
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
With ignore_index set to True:
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
-
apply
(func, axis=0, args=(), meta='__no_default__', columns='__no_default__', **kwds)¶ Parallel version of pandas.DataFrame.apply
This mimics the pandas version except for the following:
- Only axis=1 is supported (and must be specified explicitly).
- The user should provide output metadata via the meta keyword.
Parameters: func : function
Function to apply to each column/row
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- 0 or ‘index’: apply function to each column (NOT SUPPORTED)
- 1 or ‘columns’: apply function to each row
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
columns : list, scalar or None
Deprecated, please use meta instead. If list is given, the result is a DataFrame which columns is specified list. Otherwise, the result is a Series which name is given scalar or None (no name). If name keyword is not given, dask tries to infer the result type using its beginning of data. This inference may take some time and lead to unexpected result
args : tuple
Positional arguments to pass to function in addition to the array/series
Additional keyword arguments will be passed as keywords to the function
Returns: applied : Series or DataFrame
See also
dask.DataFrame.map_partitions
Examples
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)
Apply a function to row-wise passing in extra arguments in
args
andkwargs
:>>> def myadd(row, a, b=1): ... return row.sum() + a + b >>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the
meta
keyword. This can be specified in many forms, for more information seedask.dataframe.utils.make_meta
.Here we specify the output is a Series with name
'x'
, and dtypefloat64
:>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
-
applymap
(func, meta='__no_default__')¶ Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame
Parameters: func : function
Python function, returns a single value from a single value
Returns: applied : DataFrame
See also
DataFrame.apply
- For operations on rows/columns
Examples
>>> df = pd.DataFrame(np.random.randn(3, 3)) >>> df 0 1 2 0 -0.029638 1.081563 1.280300 1 0.647747 0.831136 -1.549481 2 0.513416 -0.884417 0.195343 >>> df = df.applymap(lambda x: '%.2f' % x) >>> df 0 1 2 0 -0.03 1.08 1.28 1 0.65 0.83 -1.55 2 0.51 -0.88 0.20
-
assign
(**kwargs)¶ Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
New in version 0.16.0.
Parameters: kwargs : keyword, value pairs
keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
Returns: df : DataFrame
A new DataFrame with the new columns in addition to all the existing columns.
Notes
Since kwargs is a dictionary, the order of your arguments may not be preserved. To make things predictable, the columns are inserted in alphabetical order, at the end of your DataFrame. Assigning multiple columns within the same assign is possible, but you cannot reference other columns created within the same assign call.
Examples
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Where the value is a callable, evaluated on df:
>>> df.assign(ln_A = lambda x: np.log(x.A)) A B ln_A 0 1 0.426905 0.000000 1 2 -0.780949 0.693147 2 3 -0.418711 1.098612 3 4 -0.269708 1.386294 4 5 -0.274002 1.609438 5 6 -0.500792 1.791759 6 7 1.649697 1.945910 7 8 -1.495604 2.079442 8 9 0.549296 2.197225 9 10 -0.758542 2.302585
Where the value already exists and is inserted:
>>> newcol = np.log(df['A']) >>> df.assign(ln_A=newcol) A B ln_A 0 1 0.426905 0.000000 1 2 -0.780949 0.693147 2 3 -0.418711 1.098612 3 4 -0.269708 1.386294 4 5 -0.274002 1.609438 5 6 -0.500792 1.791759 6 7 1.649697 1.945910 7 8 -1.495604 2.079442 8 9 0.549296 2.197225 9 10 -0.758542 2.302585
-
astype
(dtype)¶ Cast object to input numpy.dtype Return a copy when copy = True (be really careful with this!)
Parameters: dtype : data type, or dict of column name -> data type
Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
raise_on_error : raise on invalid input
kwargs : keyword arguments to pass on to the constructor
Returns: casted : type of caller
Notes
Dask doesn't support the following argument(s).
- copy
- raise_on_error
-
cache
(cache=<class 'dict'>)¶ Evaluate Dataframe and store in local cache
Uses chest by default to store data on disk
-
categorize
(columns=None, **kwargs)¶ Convert columns of the DataFrame to category dtype
Parameters: columns : list, optional
A list of column names to convert to the category type. By default any column with an object dtype is converted to a categorical.
kwargs
Keyword arguments are passed on to compute.
See also
dask.dataframes.categorical.categorize
Notes
When dealing with columns of repeated text values converting to categorical type is often much more performant, both in terms of memory and in writing to disk or communication over the network.
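A minimal sketch of converting a repeated-text column to the category dtype (column names are illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF'],
...                                    'amount': [1., 2., 3., 4.]}), npartitions=2)
>>> cat = ddf.categorize(columns=['city'])  # 'city' becomes categorical; other columns are untouched
>>> cat.dtypes
city      category
amount     float64
dtype: object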
-
clip
(lower=None, upper=None, out=None)¶ Trim values at input threshold(s).
Parameters: lower : float or array_like, default None
upper : float or array_like, default None
axis : int or string axis name, optional
Align object with lower and upper along the given axis.
Returns: clipped : Series
Notes
Dask doesn't support the following argument(s).
- axis
Examples
>>> df 0 1 0 0.335232 -1.256177 1 -1.367855 0.746646 2 0.027753 -1.176076 3 0.230930 -0.679613 4 1.261967 0.570967 >>> df.clip(-1.0, 0.5) 0 1 0 0.335232 -1.000000 1 -1.000000 0.500000 2 0.027753 -1.000000 3 0.230930 -0.679613 4 0.500000 0.500000 >>> t 0 -0.3 1 -0.2 2 -0.1 3 0.0 4 0.1 dtype: float64 >>> df.clip(t, t + 1, axis=0) 0 1 0 0.335232 -0.300000 1 -0.200000 0.746646 2 0.027753 -0.100000 3 0.230930 0.000000 4 1.100000 0.570967
-
clip_lower
(threshold)¶ Return copy of the input with values below given value(s) truncated.
Parameters: threshold : float or array_like
axis : int or string axis name, optional
Align object with threshold along the given axis.
Returns: clipped : same type as input
See also
Notes
Dask doesn't support the following argument(s).
- axis
-
clip_upper
(threshold)¶ Return copy of input with values above given value(s) truncated.
Parameters: threshold : float or array_like
axis : int or string axis name, optional
Align object with threshold along the given axis.
Returns: clipped : same type as input
See also
Notes
Dask doesn't support the following argument(s).
- axis
-
combine_first
(other)¶ Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
Parameters: other : DataFrame
Returns: combined : DataFrame
Examples
a’s values prioritized, use values from b to fill holes:
>>> a.combine_first(b)
-
compute
(**kwargs)¶ Compute several dask collections at once.
Parameters: get : callable, optional
A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs
Extra keywords to forward to the scheduler get function.
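A minimal sketch: dask DataFrame operations are lazy until compute is called.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4]}), npartitions=2)
>>> total = ddf.x.sum()   # lazy: only builds the task graph
>>> total.compute()       # run the graph with the default scheduler
10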
-
corr
(method='pearson', min_periods=None)¶ Compute pairwise correlation of columns, excluding NA/null values
Parameters: method : {‘pearson’, ‘kendall’, ‘spearman’}
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
Returns: y : DataFrame
-
count
(axis=None, split_every=False)¶ Return Series with number of non-NA/null observations over requested axis. Works with non-floating point data as well (detects NaN and None)
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame
numeric_only : boolean, default False
Include only float, int, boolean data
Returns: count : Series (or DataFrame if level specified)
Notes
Dask doesn't support the following argument(s).
- level
- numeric_only
-
cov
(min_periods=None)¶ Compute pairwise covariance of columns, excluding NA/null values
Parameters: min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result.
Returns: y : DataFrame
Notes
y contains the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1 (unbiased estimator).
-
cummax
(axis=None, skipna=True)¶ Return cumulative max over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummax : Series
-
cummin
(axis=None, skipna=True)¶ Return cumulative minimum over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummin : Series
-
cumprod
(axis=None, skipna=True)¶ Return cumulative product over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumprod : Series
-
cumsum
(axis=None, skipna=True)¶ Return cumulative sum over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumsum : Series
-
describe
(split_every=False)¶ Generate various summary statistics, excluding NaN values.
Parameters: percentiles : array-like, optional
The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
include, exclude : list-like, ‘all’, or None (default)
Specify the form of the returned result. Either:
- None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
- A list of dtypes or strings to be included/excluded. To select all numeric types use numpy numpy.number. To select categorical objects use type object. See also the select_dtypes documentation. eg. df.describe(include=[‘O’])
- If include is the string ‘all’, the output column-set will match the input one.
Returns: summary: NDFrame of summary statistics
See also
Notes
Dask doesn't support the following argument(s).
- percentiles
- include
- exclude
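A minimal sketch of describe on a small frame (invented data, output omitted):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1., 2., 3., 4., 5.]}), npartitions=2)
>>> ddf.describe().compute()  # count, mean, std, min, quartiles and max for each numeric column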
-
div
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
drop
(labels, axis=0, dtype=None)¶ Return new object with labels in requested axis removed.
Parameters: labels : single label or list-like
axis : int or axis name
level : int or level name, default None
For MultiIndex
inplace : bool, default False
If True, do operation inplace and return None.
errors : {‘ignore’, ‘raise’}, default ‘raise’
If ‘ignore’, suppress error and existing labels are dropped.
New in version 0.16.1.
Returns: dropped : type of caller
Notes
Dask doesn't support the following argument(s).
- level
- inplace
- errors
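A small illustrative sketch of dropping a column (labels and data are invented):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]}), npartitions=1)
>>> smaller = ddf.drop('z', axis=1)  # drop column 'z'; the result is still a lazy dask DataFrame
>>> list(smaller.columns)            # ['x', 'y']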
-
drop_duplicates
(**kwargs)¶ Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters: subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
take_last : deprecated
inplace : boolean, default False
Whether to drop duplicates in place or to return a copy
Returns: deduplicated : DataFrame
-
dropna
(how='any', subset=None)¶ Return object with labels on given axis omitted where alternately any or all of the data are missing
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, or tuple/list thereof
Pass tuple or list to drop on multiple axes
how : {‘any’, ‘all’}
- any : if any NA values are present, drop that label
- all : if all values are NA, drop that label
thresh : int, default None
int value : require that many non-NA values
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include
inplace : boolean, default False
If True, do operation inplace and return None.
Returns: dropped : DataFrame
Notes
Dask doesn't support the following argument(s).
- axis
- thresh
- inplace
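A small sketch of the supported dropna arguments (how and subset), with invented data:
>>> import pandas as pd
>>> import numpy as np
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': [1., np.nan, 3.], 'y': [np.nan, np.nan, 6.]})
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> ddf.dropna().compute()              # drop rows containing any NaN
>>> ddf.dropna(how='all').compute()     # drop only rows that are entirely NaN
>>> ddf.dropna(subset=['y']).compute()  # only column 'y' decides which rows are dropped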
-
dtypes
¶ Return data types
-
eq
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods eq
-
eval
(expr, inplace=None, **kwargs)¶ Evaluate an expression in the context of the calling DataFrame instance.
Parameters: expr : string
The expression string to evaluate.
inplace : bool
If the expression contains an assignment, whether to return a new DataFrame or mutate the existing.
WARNING: inplace=None currently falls back to True, but in a future version, will default to False. Use inplace=True explicitly rather than relying on the default.
New in version 0.18.0.
kwargs : dict
See the documentation for eval() for complete details on the keyword arguments accepted by query().
Returns: ret : ndarray, scalar, or pandas object
See also
pandas.DataFrame.query
,pandas.DataFrame.assign
,pandas.eval
Notes
For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.
Examples
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.eval('a + b')
>>> df.eval('c = a + b')
-
fillna
(value)¶ Fill NA/NaN values using the specified method
Parameters: value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
axis : {0, ‘index’}
inplace : boolean, default False
If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
downcast : dict, default is None
a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
Returns: filled : Series
See also
reindex
,asfreq
Notes
Dask doesn't support the following argument(s).
- method
- axis
- inplace
- limit
- downcast
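A minimal sketch of fillna with the supported value argument (scalar or per-column dict), using invented data:
>>> import pandas as pd
>>> import numpy as np
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1., np.nan, 3.]}), npartitions=1)
>>> ddf.fillna(0).compute()          # fill every NaN with a scalar
>>> ddf.fillna({'x': -1}).compute()  # per-column fill values via a dict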
-
floordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
ge
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ge
-
get_dtype_counts
()¶ Return the counts of dtypes in this object.
-
get_ftype_counts
()¶ Return the counts of ftypes in this object.
-
get_partition
(n)¶ Get a dask DataFrame/Series representing the nth partition.
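A minimal sketch; the returned partition is itself a lazy, one-partition dask object:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6)}), npartitions=3)
>>> part = ddf.get_partition(0)  # still lazy
>>> part.compute()               # the rows of the first partition, as a pandas DataFrame
   x
0  0
1  1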
-
groupby
(key, **kwargs)¶ Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Parameters: by : mapping function / list of functions, dict, Series, or tuple /
list of column names. Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
reduce the dimensionality of the return type if possible, otherwise return a consistent type
Returns: GroupBy object
Notes
Dask doesn't support the following argument(s).
- by
- axis
- level
- as_index
- sort
- group_keys
- squeeze
Examples
DataFrame results
>>> data.groupby(func, axis=0).mean() >>> data.groupby(['col1', 'col2'])['col3'].mean()
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
-
gt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods gt
-
head
(n=5, npartitions=1, compute=True)¶ First n rows of the dataset
Parameters: n : int, optional
The number of rows to return. Default is 5.
npartitions : int, optional
Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions, a warning will be raised and any found rows returned. Pass -1 to use all partitions.
compute : bool, optional
Whether to compute the result, default is True.
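A minimal sketch of the head arguments described above:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=10)
>>> ddf.head(3)                        # computed immediately; returns a pandas DataFrame
   x
0  0
1  1
2  2
>>> ddf.head(20, npartitions=2)        # read beyond the first partition when it is too small
>>> lazy = ddf.head(3, compute=False)  # keep the result as a lazy dask DataFrame instead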
-
idxmax
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be first index.
Returns: idxmax : Series
See also
Notes
This method is the DataFrame version of ndarray.argmax.
-
idxmin
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: idxmin : Series
See also
Notes
This method is the DataFrame version of ndarray.argmin.
-
index
¶ Return dask Index instance
-
info
(buf=None, verbose=False, memory_usage=False)¶ Concise summary of a Dask DataFrame.
-
isnull
()¶ Return a boolean same-sized object indicating if the values are null.
See also
notnull
- boolean inverse of isnull
-
iterrows
()¶ Iterate over DataFrame rows as (index, Series) pairs.
Returns: it : generator
A generator that iterates over the rows of the frame.
See also
itertuples
- Iterate over DataFrame rows as namedtuples of the values.
iteritems
- Iterate over (column name, Series) pairs.
Notes
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
-
itertuples
()¶ Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.
Parameters: index : boolean, default True
If True, return the index as the first element of the tuple.
name : string, default “Pandas”
The name of the returned namedtuples or None to return regular tuples.
See also
iterrows
- Iterate over DataFrame rows as (index, Series) pairs.
iteritems
- Iterate over (column name, Series) pairs.
Notes
Dask doesn't support the following argument(s).
- index
- name
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b']) >>> df col1 col2 a 1 0.1 b 2 0.2 >>> for row in df.itertuples(): ... print(row) ... Pandas(Index='a', col1=1, col2=0.10000000000000001) Pandas(Index='b', col1=2, col2=0.20000000000000001)
-
join
(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)¶ Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.
Parameters: other : DataFrame, Series with name field set, or list of DataFrame
Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame
on : column name, tuple/list of column names, or array-like
Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiples columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
How to handle the operation of the two objects.
- left: use calling frame’s index (or column if on is specified)
- right: use other frame’s index
- outer: form union of calling frame's index (or column if on is specified) with other frame's index
- inner: form intersection of calling frame's index (or column if on is specified) with other frame's index
lsuffix : string
Suffix to use from left frame’s overlapping columns
rsuffix : string
Suffix to use from right frame’s overlapping columns
sort : boolean, default False
Order result DataFrame lexicographically by the join key. If False, preserves the index order of the calling (left) DataFrame
Returns: joined : DataFrame
See also
DataFrame.merge
- For column(s)-on-columns(s) operations
Notes
Dask doesn't support the following argument(s).
- sort
Examples
>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller A key 0 A0 K0 1 A1 K1 2 A2 K2 3 A3 K3 4 A4 K4 5 A5 K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], ... 'B': ['B0', 'B1', 'B2']})
>>> other B key 0 B0 K0 1 B1 K1 2 B2 K2
Join DataFrames using their indexes.
>>> caller.join(other, lsuffix='_caller', rsuffix='_other')
>>> A key_caller B key_other 0 A0 K0 B0 K0 1 A1 K1 B1 K1 2 A2 K2 B2 K2 3 A3 K3 NaN NaN 4 A4 K4 NaN NaN 5 A5 K5 NaN NaN
If we want to join using the key columns, we need to set key to be the index in both caller and other. The joined DataFrame will have key as its index.
>>> caller.set_index('key').join(other.set_index('key'))
>>> A B key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in the caller. This method preserves the original caller’s index in the result.
>>> caller.join(other.set_index('key'), on='key')
>>> A key B 0 A0 K0 B0 1 A1 K1 B1 2 A2 K2 B2 3 A3 K3 NaN 4 A4 K4 NaN 5 A5 K5 NaN
-
known_divisions
¶ Whether divisions are already known
-
le
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods le
-
loc
¶ Purely label-location based indexer for selection by label.
>>> df.loc["b"]
>>> df.loc["b":"d"]
-
lt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods lt
-
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Parameters: func : function
Function applied to each partition.
args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)
One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.
Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:
>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.
Here we specify the output is a Series with no name, and dtype float64:
>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object
As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
In the case where the metadata doesn't change, you can also pass in the object itself directly:
>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)
-
mask
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
Parameters: cond : boolean NDFrame, array or callable
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
other : scalar, NDFrame, or callable
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh : same type as caller
See also
Notes
Dask doesn't support the following argument(s).
- inplace
- axis
- level
- try_cast
- raise_on_error
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
-
max
(axis=None, skipna=True, split_every=False)¶ - This method returns the maximum of the values in the object.
- If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: max : Series or DataFrame (if level specified)
Notes
Dask doesn't support the following argument(s).
- level
- numeric_only
-
mean
(axis=None, skipna=True, split_every=False)¶ Return the mean of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: mean : Series or DataFrame (if level specified)
Notes
Dask doesn't support the following argument(s).
- level
- numeric_only
-
merge
(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None)¶ Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters: right : DataFrame
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
- left: use only keys from left frame (SQL: left outer join)
- right: use only keys from right frame (SQL: right outer join)
- outer: use union of keys from both frames (SQL: full outer join)
- inner: use intersection of keys from both frames (SQL: inner join)
on : label or list
Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
left_on : label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
right_on : label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per left_on docs
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
right_index : boolean, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index
sort : boolean, default False
Sort the join keys lexicographically in the result DataFrame
suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right side, respectively
copy : boolean, default True
If False, do not copy data unnecessarily
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
New in version 0.17.0.
Returns: merged : DataFrame
The output type will the be same as ‘left’, if it is a subclass of DataFrame.
See also
merge_ordered
,merge_asof
Notes
Dask doesn't support the following argument(s).
- sort
- copy
Examples
>>> A >>> B lkey value rkey value 0 foo 1 0 foo 5 1 bar 2 1 bar 6 2 baz 3 2 qux 7 3 foo 4 3 bar 8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer') lkey value_x rkey value_y 0 foo 1 foo 5 1 foo 4 foo 5 2 bar 2 bar 6 3 bar 2 bar 8 4 baz 3 NaN NaN 5 NaN NaN qux 7
-
min
(axis=None, skipna=True, split_every=False)¶ - This method returns the minimum of the values in the object.
- If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: min : Series or DataFrame (if level specified)
Notes
Dask doesn't support the following argument(s).
- level
- numeric_only
-
mod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
mul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
ndim
¶ Return dimensionality
-
ne
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ne
-
nlargest
(n=5, columns=None, split_every=None)¶ Get the rows of a DataFrame sorted by the n largest values of columns.
New in version 0.17.0.
Parameters: n : int
Number of items to retrieve
columns : list or str
Column name or names to order by
keep : {‘first’, ‘last’, False}, default ‘first’
Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
Returns: DataFrame
Notes
Dask doesn't support the following argument(s).
- keep
Examples
>>> df = DataFrame({'a': [1, 10, 8, 11, -1], ... 'b': list('abdce'), ... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]}) >>> df.nlargest(3, 'a') a b c 3 11 c 3 1 10 b 2 2 8 d NaN
-
notnull
()¶ Return a boolean same-sized object indicating if the values are not null.
See also
isnull
- boolean inverse of notnull
-
npartitions
¶ Return number of partitions
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs)
New in version 0.16.2.
Parameters: func : function
function to apply to the NDFrame. args, and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.
args : positional arguments passed into func.
kwargs : a dictionary of keyword arguments passed into func.
Returns: object : the return type of func.
See also
pandas.DataFrame.apply
,pandas.DataFrame.applymap
,pandas.Series.map
Notes
Use .pipe when chaining together functions that expect Series or DataFrames. Instead of writing
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
... )
-
pivot_table
(index=None, columns=None, values=None, aggfunc='mean')¶ Create a spreadsheet-style pivot table as a DataFrame. Target
columns
must have category dtype to infer result’scolumns
.index
,columns
,values
andaggfunc
must be all scalar.Parameters: values : scalar
column to aggregate
index : scalar
column to be index
columns : scalar
column to be columns
aggfunc : {‘mean’, ‘sum’, ‘count’}, default ‘mean’
Returns: table : DataFrame
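A hedged sketch; the columns argument must be categorical first (e.g. via categorize), and the column names here are invented:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'store': ['a', 'b', 'a', 'b'],
...                     'month': [1, 1, 2, 2],
...                     'sales': [10., 20., 30., 40.]})
>>> ddf = dd.from_pandas(pdf, npartitions=2).categorize(columns=['store'])
>>> table = ddf.pivot_table(index='month', columns='store', values='sales', aggfunc='sum')
>>> table.compute()  # one row per month, one column per store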
-
pow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
quantile
(q=0.5, axis=0)¶ Approximate row-wise and precise column-wise quantiles of DataFrame
Parameters: q : list/array of floats, default 0.5 (50%)
Iterable of numbers ranging from 0 to 1 for the desired quantiles
axis : {0, 1, ‘index’, ‘columns’} (default 0)
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
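A minimal sketch of column-wise quantiles (values invented):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100), 'y': range(100, 200)}), npartitions=4)
>>> ddf.quantile(0.5).compute()           # median of each column
>>> ddf.quantile([0.25, 0.75]).compute()  # several quantiles at once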
-
radd
(other, axis='columns', level=None, fill_value=None)¶ Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
random_split
(frac, random_state=None)¶ Pseudorandomly split dataframe into different pieces row-wise
Parameters: frac : list
List of floats that should sum to one.
random_state: int or np.random.RandomState
If int create a new RandomState with this as the seed
Otherwise draw from the passed RandomState
See also
dask.DataFrame.sample
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
-
rdiv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
reduction
(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶ Generic row-wise reductions.
Parameters: chunk : callable
Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.
aggregate : callable, optional
Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.
The input to aggregate depends on the output of chunk. If the output of chunk is a:
- scalar: Input is a Series, with one row per partition.
- Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
- DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.
Should return a pandas.DataFrame, pandas.Series, or a scalar.
combine : callable, optional
Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
token : str, optional
The name to use for the output keys.
split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.
chunk_kwargs : dict, optional
Keyword arguments to pass on to chunk only.
aggregate_kwargs : dict, optional
Keyword arguments to pass on to aggregate only.
combine_kwargs : dict, optional
Keyword arguments to pass on to combine only.
kwargs :
All remaining keywords will be passed to chunk, combine, and aggregate.
Examples
>>> import pandas as pd >>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)}) >>> ddf = dd.from_pandas(df, npartitions=4)
Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:
>>> res = ddf.reduction(lambda x: x.count(), ... aggregate=lambda x: x.sum()) >>> res.compute() x 50 y 50 dtype: int64
Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).
>>> def count_greater(x, value=0): ... return (x >= value).sum() >>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(), ... chunk_kwargs={'value': 25}) >>> res.compute() 25
Aggregate both the sum and count of a Series at the same time:
>>> def sum_and_count(x): ... return pd.Series({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum()) >>> res.compute() count 50 sum 1225 dtype: int64
Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.
>>> def sum_and_count(x): ... return pd.DataFrame({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.reduction(sum_and_count, ... aggregate=lambda x: x.groupby(level=0).sum()) >>> res.compute() count sum x 50 1225 y 50 3725
-
rename
(index=None, columns=None)¶ Alter axes input function or functions. Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error. Alternatively, change Series.name with a scalar value (Series only).
Parameters: index, columns : scalar, list-like, dict-like or function, optional
Scalar or list-like will alter the Series.name attribute, and raise on DataFrame or Panel. dict-like or functions are transformations to apply to that axis’ values
copy : boolean, default True
Also copy underlying data
inplace : boolean, default False
Whether to return a new DataFrame. If True then value of copy is ignored.
Returns: renamed : DataFrame (new object)
See also
pandas.NDFrame.rename_axis
Examples
>>> s = pd.Series([1, 2, 3]) >>> s 0 1 1 2 2 3 dtype: int64 >>> s.rename("my_name") # scalar, changes Series.name 0 1 1 2 2 3 Name: my_name, dtype: int64 >>> s.rename(lambda x: x ** 2) # function, changes labels 0 1 1 2 4 3 dtype: int64 >>> s.rename({1: 3, 2: 5}) # mapping, changes labels 0 1 3 2 5 3 dtype: int64 >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.rename(2) ... TypeError: 'int' object is not callable >>> df.rename(index=str, columns={"A": "a", "B": "c"}) a c 0 1 4 1 2 5 2 3 6 >>> df.rename(index=str, columns={"A": "a", "C": "c"}) a B 0 1 4 1 2 5 2 3 6
-
repartition
(divisions=None, npartitions=None, force=False)¶ Repartition dataframe along new divisions
Parameters: divisions : list, optional
List of partitions to be used. If specified npartitions will be ignored.
npartitions : int, optional
Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
force : bool, default False
Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.
Examples
>>> df = df.repartition(npartitions=10) >>> df = df.repartition(divisions=[0, 5, 10, 20])
-
resample
(rule, how=None, closed=None, label=None)¶ Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
Parameters: rule : string
the offset string or object representing target conversion
axis : int, optional, default 0
closed : {‘right’, ‘left’}
Which side of bin interval is closed
label : {‘right’, ‘left’}
Which bin edge label to label bucket with
convention : {‘start’, ‘end’, ‘s’, ‘e’}
loffset : timedelta
Adjust the resampled time labels
base : int, default 0
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
on : string, optional
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
level : string or int, optional
For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.
New in version 0.19.0.
To learn more about the offset strings, please see this link: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Notes
Dask doesn’t support the following argument(s):
- axis
- fill_method
- convention
- kind
- loffset
- limit
- base
- on
- level
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T') >>> series = pd.Series(range(9), index=index) >>> series 2000-01-01 00:00:00 0 2000-01-01 00:01:00 1 2000-01-01 00:02:00 2 2000-01-01 00:03:00 3 2000-01-01 00:04:00 4 2000-01-01 00:05:00 5 2000-01-01 00:06:00 6 2000-01-01 00:07:00 7 2000-01-01 00:08:00 8 Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum() 2000-01-01 00:00:00 3 2000-01-01 00:03:00 12 2000-01-01 00:06:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval as illustrated in the example below this one.
>>> series.resample('3T', label='right').sum() 2000-01-01 00:03:00 3 2000-01-01 00:06:00 12 2000-01-01 00:09:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum() 2000-01-01 00:00:00 0 2000-01-01 00:03:00 6 2000-01-01 00:06:00 15 2000-01-01 00:09:00 15 Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] #select first 5 rows 2000-01-01 00:00:00 0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 1 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2 Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 0 2000-01-01 00:01:00 1 2000-01-01 00:01:30 1 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 1 2000-01-01 00:01:00 1 2000-01-01 00:01:30 2 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Pass a custom function via
apply
>>> def custom_resampler(array_like): ... return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler) 2000-01-01 00:00:00 8 2000-01-01 00:03:00 17 2000-01-01 00:06:00 26 Freq: 3T, dtype: int64
-
reset_index
(drop=False)¶ For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.
Parameters: level : int, str, tuple, or list, default None
Only remove the given levels from the index. Removes all levels by default
drop : boolean, default False
Do not try to insert index into dataframe columns. This resets the index to the default integer index.
inplace : boolean, default False
Modify the DataFrame in place (do not create a new object)
col_level : int or str, default 0
If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
col_fill : object, default ‘’
If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
Returns: resetted : DataFrame
Notes
Dask doesn’t support the following argument(s):
- level
- inplace
- col_level
- col_fill
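A small sketch of reset_index on illustrative data; only the drop keyword is supported by dask here:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': [1, 2, 3, 4]}, index=pd.Index([10, 20, 30, 40], name='id'))
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.reset_index().compute()           # old index becomes the 'id' column
>>> ddf.reset_index(drop=True).compute()  # old index is discarded entirely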
-
rfloordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
rmod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
rmul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
rolling
(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)¶ Provides rolling transformations.
Parameters: window : int
Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
center : boolean, default False
Set the labels at the center of the window.
win_type : string, default None
Provide a window type. The recognized window types are identical to pandas.
axis : int, default 0
Returns: a Rolling object on which to call a method to compute a statistic
Notes
The freq argument is not supported.
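A brief sketch on illustrative data; as noted above, the window must not span more than one adjacent partition:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> ddf.rolling(window=3).mean().compute()  # 3-row moving average; first two rows are NaN, as in pandas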
-
round
(decimals=0)¶ Round a DataFrame to a variable number of decimal places.
New in version 0.17.0.
Parameters: decimals : int, dict, Series
Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
Returns: DataFrame object
See also
numpy.around, Series.round
Examples
>>> df = pd.DataFrame(np.random.random([3, 3]), ... columns=['A', 'B', 'C'], index=['first', 'second', 'third']) >>> df A B C first 0.028208 0.992815 0.173891 second 0.038683 0.645646 0.577595 third 0.877076 0.149370 0.491027 >>> df.round(2) A B C first 0.03 0.99 0.17 second 0.04 0.65 0.58 third 0.88 0.15 0.49 >>> df.round({'A': 1, 'C': 2}) A B C first 0.0 0.992815 0.17 second 0.0 0.645646 0.58 third 0.9 0.149370 0.49 >>> decimals = pd.Series([1, 0, 2], index=['A', 'B', 'C']) >>> df.round(decimals) A B C first 0.0 1 0.17 second 0.0 1 0.58 third 0.9 0 0.49
-
rpow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
rsub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
rtruediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
sample
(frac, replace=False, random_state=None)¶ Random sample of items
Parameters: frac : float, optional
Fraction of axis items to return.
replace: boolean, optional
Sample with or without replacement. Default = False.
random_state: int or ``np.random.RandomState``
If int, we create a new RandomState with this as the seed; otherwise we draw from the passed RandomState.
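A minimal sketch on illustrative data:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
>>> sampled = ddf.sample(frac=0.1, random_state=42)  # roughly 10% of rows, reproducibly
>>> len(sampled.compute())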
-
select_dtypes
(include=None, exclude=None)¶ Return a subset of a DataFrame including/excluding columns based on their dtype.
Parameters: include, exclude : list-like
A list of dtypes or strings to be included/excluded. You must pass in a non-empty sequence for at least one of these.
Returns: subset : DataFrame
The subset of the frame including the dtypes in include and excluding the dtypes in exclude.
Raises: ValueError
- If both of include and exclude are empty
- If include and exclude have overlapping elements
- If any kind of string dtype is passed in.
TypeError
- If either of include or exclude is not a sequence
Notes
- To select all numeric types use the numpy dtype numpy.number
- To select strings you must use the object dtype, but note that this will return all object dtype columns
- See the numpy dtype hierarchy
- To select Pandas categorical dtypes, use ‘category’
Examples
>>> df = pd.DataFrame({'a': np.random.randn(6).astype('f4'), ... 'b': [True, False] * 3, ... 'c': [1.0, 2.0] * 3}) >>> df a b c 0 0.3962 True 1 1 0.1459 False 2 2 0.2623 True 1 3 0.0764 False 2 4 -0.9703 True 1 5 -1.2094 False 2 >>> df.select_dtypes(include=['float64']) c 0 1 1 2 2 1 3 2 4 1 5 2 >>> df.select_dtypes(exclude=['floating']) b 0 True 1 False 2 True 3 False 4 True 5 False
-
set_index
(other, drop=True, sorted=False, **kwargs)¶ Set the DataFrame index (row labels) using an existing column
This operation in dask.dataframe is expensive. If the input column is sorted then we accomplish the set_index in a single full read of that column. However, if the input column is not sorted then this operation triggers a full shuffle, which can take a while and only works on a single machine (not distributed).
Parameters: other: Series or label
drop: boolean, default True
Delete columns to be used as the new index
sorted: boolean, default False
Set to True if the new index column is already sorted
Examples
>>> df.set_index('x') >>> df.set_index(d.x) >>> df.set_index(d.timestamp, sorted=True)
-
set_partition
(column, divisions, **kwargs)¶ Set explicit divisions for new column index
>>> df2 = df.set_partition('new-index-column', divisions=[10, 20, 50])
See also
-
std
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
degrees of freedom
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: std : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s):
- level
- numeric_only
-
sub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
sum
(axis=None, skipna=True, split_every=False)¶ Return the sum of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sum : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s):
- level
- numeric_only
-
tail
(n=5, compute=True)¶ Last n rows of the dataset
Caveat: this only checks the last n rows of the last partition.
-
to_bag
(index=False)¶ Convert to a dask Bag of tuples of each row.
Parameters: index : bool, optional
If True, the index is included as the first element of each tuple. Default is False.
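A minimal sketch on illustrative data; each row becomes a tuple of its values:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}), npartitions=1)
>>> ddf.to_bag(index=False).compute()
[(1, 'a'), (2, 'b')]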
-
to_castra
(fn=None, categories=None, sorted_index_column=None, compute=True, get=<function get_sync>)¶ Write DataFrame to Castra on-disk store
See https://github.com/blosc/castra for details
See also
Castra.to_dask
-
to_csv
(filename, **kwargs)¶ Write DataFrame to a series of comma-separated values (csv) files
One filename per partition will be created. You can specify the filenames in a variety of ways.
Use a globstring:
>>> df.to_csv('/path/to/data/export-*.csv')
The * will be replaced by the increasing sequence 0, 1, 2, ...
/path/to/data/export-0.csv /path/to/data/export-1.csv
Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.
>>> from datetime import date, timedelta >>> def name(i): ... return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0) '2015-01-01' >>> name(15) '2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)
/path/to/data/export-2015-01-01.csv /path/to/data/export-2015-01-02.csv ...
You can also provide an explicit list of paths:
>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...] >>> df.to_csv(paths)
Parameters: filename : string
Path glob indicating the naming scheme for the output files
name_function : callable, default None
Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions
compression : string or None
String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
sep : character, default ‘,’
Field delimiter for the output file
na_rep : string, default ‘’
Missing data representation
float_format : string, default None
Format string for floating point numbers
columns : sequence, optional
Columns to write
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed to be aliases for the column names
index : boolean, default True
Write row names (index)
index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R
nanRep : None
deprecated, use na_rep
mode : str
Python write mode, default ‘w’
encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
compression : string, optional
a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
line_terminator : string, default ‘\n’
The newline character or character sequence to use in the output file
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL
quotechar : string (length 1), default ‘”’
character used to quote fields
doublequote : boolean, default True
Control quoting of quotechar inside a field
escapechar : string (length 1), default None
character used to escape sep and quotechar when appropriate
chunksize : int or None
rows to write at a time
tupleize_cols : boolean, default False
Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)
date_format : string, default None
Format string for datetime objects
decimal: string, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for European data
-
to_delayed
()¶ Convert dataframe into dask Delayed objects
Returns a list of delayed values, one value per partition.
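A short sketch on illustrative data; each partition becomes one Delayed object that computes to a pandas DataFrame:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)
>>> parts = ddf.to_delayed()
>>> len(parts)
4
>>> parts[0].compute()  # the first partition as a pandas DataFrame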
-
to_hdf
(path_or_buf, key, mode='a', append=False, get=None, **kwargs)¶ Export frame to hdf file(s)
Export dataframe to one or multiple hdf5 files or nodes.
The exported hdf format is pandas’ hdf table format only. Data saved by this function should be read with a pandas-compatible reader.
By providing a single asterisk in either the path_or_buf or key parameters you direct dask to save each partition to a different file or node (respectively). The asterisk will be replaced with a zero padded partition number, as this is the default implementation of name_function.
When writing to a single hdf node in a single hdf file, all hdf save tasks are required to execute in a specific order, often becoming the bottleneck of the entire execution graph. Saving to multiple nodes or files removes that restriction (order is still preserved by enforcing order on output, using name_function) and enables executing save tasks in parallel.
Parameters: path_or_buf: HDFStore object or string
Destination file(s). If string, can contain a single asterisk to
save each partition to a different file. Only one asterisk is
allowed in both path_or_buf and key parameters.
key: string
A node / group path in file, can contain a single asterisk to save
each partition to a different hdf node in a single file. Only one
asterisk is allowed in both path_or_buf and key parameters.
format: optional, default ‘table’
Default hdf storage format, currently only pandas’ ‘table’ format is supported.
mode: optional, {‘a’, ‘w’, ‘r+’}, default ‘a’
'a'
Append: Add data to existing file(s) or create new.
'w'
Write: overwrite any existing files with new ones.
'r+'
Append to existing files, files must already exist.
append: optional, default False
If False, overwrites existing node with the same name otherwise
appends to it.
complevel: optional, 0-9, default 0
compression level, higher means better compression ratio and possibly more CPU time. Depends on complib.
complib: {‘zlib’, ‘bzip2’, ‘lzo’, ‘blosc’, None}, default None
If complevel > 0 compress using this compression library when
possible
fletcher32: bool, default False
If True and compression is used, additionally apply the fletcher32
checksum.
get: callable, optional
A scheduler `get` function to use. If not provided, the default is
to check the global settings first, and then fall back to defaults
for the collections.
dask_kwargs: dict, optional
A dictionary of keyword arguments passed to the `get` function
used.
name_function: callable, optional, default None
A callable called for each partition that accepts a single int
representing the partition number. name_function must return a
string representation of a partition’s index in a way that will
preserve the partition’s location after a string sort.
If None, a default name_function is used. The default name_function
will return a zero padded string of received int. See
dask.utils.build_name_function for more info.
compute: bool, default True
If True, execute computation of resulting dask graph. If False, return a Delayed object.
lock: bool, None or lock object, default None
In to_hdf, locks are needed for two reasons. First, to protect against writing to the same file from multiple processes or threads simultaneously. Second, the default libhdf5 is not thread safe, so we must additionally lock on its usage. By default, if lock is None the lock will be determined optimally based on path_or_buf, key and the scheduler used. Manually setting this parameter is usually not required to improve performance.
Alternatively, you can specify specific values:
If False, no locking will occur. If True, a default lock object will be created (multiprocessing.Manager.Lock on the multiprocessing scheduler, threading.Lock otherwise); this can be used to force locking in scenarios where the default behavior would be to avoid it. Otherwise, the value is assumed to implement the lock interface and will be used as the lock object.
See also
dask.DataFrame.read_hdf
- reading hdf files
dask.Series.read_hdf
- reading hdf files
Examples
Saving data to a single file:
>>> df.to_hdf('output.hdf', '/data')
Saving data to multiple nodes:
>>> with pd.HDFStore('output.hdf') as fh: ... df.to_hdf(fh, '/data*') ... fh.keys() ['/data0', '/data1']
Or multiple files:
>>> df.to_hdf('output_*.hdf', '/data')
Saving multiple files with the multiprocessing scheduler and manually disabling locks:
>>> df.to_hdf('output_*.hdf', '/data', ... get=dask.multiprocessing.get, lock=False)
-
to_timestamp
(freq=None, how='start', axis=0)¶ Cast to DatetimeIndex of timestamps, at beginning of period
Parameters: freq : string, default frequency of PeriodIndex
Desired frequency
how : {‘s’, ‘e’, ‘start’, ‘end’}
Convention for converting period to timestamp; start of period vs. end
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert (the index by default)
copy : boolean, default True
If false then underlying input data is not copied
Returns: df : DataFrame with DatetimeIndex
Notes
Dask doesn’t support the following argument(s):
- copy
-
truediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
var
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
degrees of freedom
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: var : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s):
- level
- numeric_only
-
visualize
(filename='mydask', format=None, optimize_graph=False, **kwargs)¶ Render the computation of this object’s task graph using graphviz.
Requires graphviz to be installed.
Parameters: filename : str or None, optional
The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional
Format in which to write output file. Default is ‘png’.
optimize_graph : bool, optional
If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
**kwargs
Additional keyword arguments to forward to to_graphviz.
Returns: result : IPython.display.Image, IPython.display.SVG, or None
See dask.dot.dot_graph for more information.
See also
dask.base.visualize, dask.dot.dot_graph
Notes
For more information on graph optimization, see the Dask optimization documentation.
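A small sketch (requires the graphviz library; the filename is illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> result = (ddf.x + 1).sum()
>>> result.visualize(filename='sum-graph', format='png', optimize_graph=True)  # writes sum-graph.png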
-
where
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Parameters: cond : boolean NDFrame, array or callable
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
other : scalar, NDFrame, or callable
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh : same type as caller
See also
Notes
Dask doesn’t support the following argument(s):
- inplace
- axis
- level
- try_cast
- raise_on_error
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
-
Series Methods¶
-
class
dask.dataframe.
Series
(dsk, name, meta, divisions)¶ Out-of-core Series object
Mimics pandas.Series.
Parameters: dsk: dict
The dask graph to compute this Series
_name: str
The key prefix that specifies which keys in the dask comprise this particular Series
meta: pandas.Series
An empty pandas.Series with names, dtypes, and index matching the expected output.
divisions: tuple of index values
Values along which we partition our blocks on the index
See also
-
add
(other, level=None, fill_value=None, axis=0)¶ Addition of series and other, element-wise (binary operator add).
Equivalent to series + other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
align
(other, join='outer', axis=None, fill_value=None)¶ Align two objects on their axes with the specified join method for each axis Index
Parameters: other : DataFrame or Series
join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None)
level : int or level name, default None
Broadcast across a level, matching Index values on the passed MultiIndex level
copy : boolean, default True
Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value
method : str, default None
limit : int, default None
fill_axis : {0, ‘index’}, default 0
Filling axis, method and limit
broadcast_axis : {0, ‘index’}, default None
Broadcast values along this axis, if aligning two objects of different dimensions
New in version 0.17.0.
Returns: (left, right) : (Series, type of other)
Aligned objects
Notes
Dask doesn’t support the following argument(s):
- level
- copy
- method
- limit
- fill_axis
- broadcast_axis
-
all
(axis=None, skipna=True, split_every=False)¶ Return whether all elements are True over requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: all : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s):
- bool_only
- level
-
any
(axis=None, skipna=True, split_every=False)¶ Return whether any element is True over requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: any : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s):
- bool_only
- level
-
append
(other)¶ Concatenate two or more Series.
Parameters: to_append : Series or list/tuple of Series
ignore_index : boolean, default False
If True, do not use the index labels.
verify_integrity : boolean, default False
If True, raise Exception on creating index with duplicates
Returns: appended : Series
Notes
Dask doesn’t support the following argument(s):
- to_append
- ignore_index
- verify_integrity
Examples
>>> s1 = pd.Series([1, 2, 3]) >>> s2 = pd.Series([4, 5, 6]) >>> s3 = pd.Series([4, 5, 6], index=[3,4,5]) >>> s1.append(s2) 0 1 1 2 2 3 0 4 1 5 2 6 dtype: int64
>>> s1.append(s3) 0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
With ignore_index set to True:
>>> s1.append(s2, ignore_index=True) 0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
With verify_integrity set to True:
>>> s1.append(s2, verify_integrity=True) ValueError: Indexes have overlapping values: [0, 1, 2]
-
apply
(func, convert_dtype=True, meta='__no_default__', name='__no_default__', args=(), **kwds)¶ Parallel version of pandas.Series.apply
Parameters: func : function
Function to apply
convert_dtype : boolean, default True
Try to find better dtype for elementwise function results. If False, leave as dtype=object.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
name : list, scalar or None, optional
Deprecated, use meta instead. If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar or None (no name). If the name keyword is not given, dask tries to infer the result type from the beginning of the data. This inference may take some time and lead to unexpected results.
args : tuple
Positional arguments to pass to function in addition to the value.
Additional keyword arguments will be passed as keywords to the function.
Returns: applied : Series or DataFrame if func returns a Series.
See also
dask.Series.map_partitions
Examples
>>> import dask.dataframe as dd >>> s = pd.Series(range(5), name='x') >>> ds = dd.from_pandas(s, npartitions=2)
Apply a function elementwise across the Series, passing in extra arguments in args and kwargs:
>>> def myadd(x, a, b=1): ... return x + a + b >>> res = ds.apply(myadd, args=(2,), b=1.5)
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.
Here we specify that the output is a Series with name 'x' and dtype float64:
>>> res = ds.apply(myadd, args=(2,), b=1.5, meta=('x', 'f8'))
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ds.apply(lambda x: x + 1, meta=ds)
-
astype
(dtype)¶ Cast object to input numpy.dtype. Return a copy when copy = True (be really careful with this!)
Parameters: dtype : data type, or dict of column name -> data type
Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
raise_on_error : raise on invalid input
kwargs : keyword arguments to pass on to the constructor
Returns: casted : type of caller
Notes
Dask doesn’t support the following argument(s):
- copy
- raise_on_error
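A tiny sketch on illustrative data:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([1, 2, 3]), npartitions=2)
>>> ds.astype('float64').compute()
0    1.0
1    2.0
2    3.0
dtype: float64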
-
between
(left, right, inclusive=True)¶ Return boolean Series equivalent to left <= series <= right. NA values will be treated as False
Parameters: left : scalar
Left boundary
right : scalar
Right boundary
Returns: is_between : Series
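A minimal sketch on illustrative data:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([1, 5, 10, 15]), npartitions=2)
>>> ds.between(5, 10).compute()  # True where 5 <= value <= 10
0    False
1     True
2     True
3    False
dtype: bool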
-
cache
(cache=<class 'dict'>)¶ Evaluate Dataframe and store in local cache
Uses chest by default to store data on disk
-
clip
(lower=None, upper=None, out=None)¶ Trim values at input threshold(s).
Parameters: lower : float or array_like, default None
upper : float or array_like, default None
axis : int or string axis name, optional
Align object with lower and upper along the given axis.
Returns: clipped : Series
Notes
Dask doesn’t support the following argument(s):
- axis
Examples
>>> df 0 1 0 0.335232 -1.256177 1 -1.367855 0.746646 2 0.027753 -1.176076 3 0.230930 -0.679613 4 1.261967 0.570967 >>> df.clip(-1.0, 0.5) 0 1 0 0.335232 -1.000000 1 -1.000000 0.500000 2 0.027753 -1.000000 3 0.230930 -0.679613 4 0.500000 0.500000 >>> t 0 -0.3 1 -0.2 2 -0.1 3 0.0 4 0.1 dtype: float64 >>> df.clip(t, t + 1, axis=0) 0 1 0 0.335232 -0.300000 1 -0.200000 0.746646 2 0.027753 -0.100000 3 0.230930 0.000000 4 1.100000 0.570967
-
clip_lower
(threshold)¶ Return copy of the input with values below given value(s) truncated.
Parameters: threshold : float or array_like
axis : int or string axis name, optional
Align object with threshold along the given axis.
Returns: clipped : same type as input
See also
Notes
Dask doesn’t support the following argument(s):
- axis
-
clip_upper
(threshold)¶ Return copy of input with values above given value(s) truncated.
Parameters: threshold : float or array_like
axis : int or string axis name, optional
Align object with threshold along the given axis.
Returns: clipped : same type as input
See also
Notes
Dask doesn’t support the following argument(s):
- axis
-
combine_first
(other)¶ Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Parameters: other : Series Returns: y : Series
-
compute
(**kwargs)¶ Compute several dask collections at once.
Parameters: get : callable, optional
A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs
Extra keywords to forward to the scheduler get function.
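A brief sketch of materializing a lazy dask Series into a concrete pandas object (illustrative data):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> result = (ds * 2).compute()  # executes the task graph, returns a pandas Series
>>> result.sum()
90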
-
corr
(other, method='pearson', min_periods=None)¶ Compute correlation with other Series, excluding missing values
Parameters: other : Series
method : {‘pearson’, ‘kendall’, ‘spearman’}
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations needed to have a valid result
Returns: correlation : float
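A small sketch of correlating two dask Series (illustrative data); the result is a lazy scalar until computed:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> a = dd.from_pandas(pd.Series([1, 2, 3, 4, 5]), npartitions=2)
>>> b = dd.from_pandas(pd.Series([2, 4, 6, 8, 10]), npartitions=2)
>>> a.corr(b).compute()  # 1.0 for perfectly linearly related data
1.0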
-
count
(split_every=False)¶ Return number of non-NA/null observations in the Series
Parameters: level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series
Returns: nobs : int or Series (if level specified)
Notes
Dask doesn’t support the following argument(s):
- level
-
cov
(other, min_periods=None)¶ Compute covariance with Series, excluding missing values
Parameters: other : Series
min_periods : int, optional
Minimum number of observations needed to have a valid result
Returns: covariance : float
Normalized by N-1 (unbiased estimator).
-
cummax
(axis=None, skipna=True)¶ Return cumulative max over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummax : Series
-
cummin
(axis=None, skipna=True)¶ Return cumulative minimum over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummin : Series
-
cumprod
(axis=None, skipna=True)¶ Return cumulative product over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumprod : Series
-
cumsum
(axis=None, skipna=True)¶ Return cumulative sum over requested axis.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumsum : Series
-
describe
(split_every=False)¶ Generate various summary statistics, excluding NaN values.
Parameters: percentiles : array-like, optional
The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
include, exclude : list-like, ‘all’, or None (default)
Specify the form of the returned result. Either:
- None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
- A list of dtypes or strings to be included/excluded. To select all numeric types use numpy numpy.number. To select categorical objects use type object. See also the select_dtypes documentation. eg. df.describe(include=[‘O’])
- If include is the string ‘all’, the output column-set will match the input one.
Returns: summary: NDFrame of summary statistics
See also
Notes
Dask doesn’t support the following argument(s):
- percentiles
- include
- exclude
-
div
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
drop_duplicates
(**kwargs)¶ Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters: subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
take_last : deprecated
inplace : boolean, default False
Whether to drop duplicates in place or to return a copy
Returns: deduplicated : DataFrame
-
dropna
()¶ Return Series without null values
Returns: valid : Series
inplace : boolean, default False
Do operation in place.
Notes
Dask doesn’t support the following argument(s):
- axis
- inplace
-
dtype
¶ Return data type
-
eq
(other, level=None, axis=0)¶ Equal to of series and other, element-wise (binary operator eq).
Equivalent to series == other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Series.None
-
fillna
(value)¶ Fill NA/NaN values using the specified method
Parameters: value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
axis : {0, ‘index’}
inplace : boolean, default False
If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
downcast : dict, default is None
a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
Returns: filled : Series
See also
reindex, asfreq
Notes
Dask doesn’t support the following argument(s):
- method
- axis
- inplace
- limit
- downcast
-
floordiv
(other, level=None, fill_value=None, axis=0)¶ Integer division of series and other, element-wise (binary operator floordiv).
Equivalent to series // other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
ge
(other, level=None, axis=0)¶ Greater than or equal to of series and other, element-wise (binary operator ge).
Equivalent to series >= other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Series.None
-
get_partition
(n)¶ Get a dask DataFrame/Series representing the nth partition.
-
groupby
(index, **kwargs)¶ Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Parameters: by : mapping function / list of functions, dict, Series, or tuple /
list of column names. Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
reduce the dimensionality of the return type if possible, otherwise return a consistent type
Returns: GroupBy object
Notes
Dask doesn’t support the following argument(s):
- by
- axis
- level
- as_index
- sort
- group_keys
- squeeze
Examples
DataFrame results
>>> data.groupby(func, axis=0).mean() >>> data.groupby(['col1', 'col2'])['col3'].mean()
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
-
gt
(other, level=None, axis=0)¶ Greater than of series and other, element-wise (binary operator gt).
Equivalent to series > other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Series.None
-
head
(n=5, npartitions=1, compute=True)¶ First n rows of the dataset
Parameters: n : int, optional
The number of rows to return. Default is 5.
npartitions : int, optional
Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions, a warning will be raised and any found rows returned. Pass -1 to use all partitions.
compute : bool, optional
Whether to compute the result, default is True.
-
idxmax
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be first index.
Returns: idxmax : Series
See also
Notes
This method is the DataFrame version of
ndarray.argmax
.
-
idxmin
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: idxmin : Series
See also
Notes
This method is the DataFrame version of
ndarray.argmin
.
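As a small sketch of both methods (illustrative only; the tiny frame is built just for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'a': [1, 5, 3], 'b': [9, 2, 4]}), npartitions=2)
>>> ddf.idxmax().compute()    # index label of the maximum of each column
>>> ddf.a.idxmin().compute()  # index label of the minimum of column a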
-
index
¶ Return dask Index instance
-
isin
(other)¶ Return a boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.
Parameters: values : set or list-like
The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.
New in version 0.18.1.
Support for values as a set
Returns: isin : Series (bool dtype)
Raises: TypeError
- If values is a string
See also
pandas.DataFrame.isin
Notes
Dask doesn’t support the following argument(s).
- values
Examples
>>> s = pd.Series(list('abc')) >>> s.isin(['a', 'c', 'e']) 0 True 1 False 2 True dtype: bool
Passing a single string as s.isin('a') will raise an error. Use a list of one element instead:
>>> s.isin(['a']) 0 True 1 False 2 False dtype: bool
-
isnull
()¶ Return a boolean same-sized object indicating if the values are null.
See also
notnull
- boolean inverse of isnull
-
iteritems
()¶ Lazily iterate over (index, value) tuples
-
known_divisions
¶ Whether divisions are already known
-
le
(other, level=None, axis=0)¶ Less than or equal to of series and other, element-wise (binary operator le).
Equivalent to series <= other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
loc
¶ Purely label-location based indexer for selection by label.
>>> df.loc["b"] >>> df.loc["b":"d"]
-
lt
(other, level=None, axis=0)¶ Less than of series and other, element-wise (binary operator lt).
Equivalent to series < other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
map
(arg, na_action=None, meta='__no_default__')¶ Map values of Series using input correspondence (which can be a dict, Series, or function)
Parameters: arg : function, dict, or Series
na_action : {None, ‘ignore’}
If ‘ignore’, propagate NA values, without passing them to the mapping function
Returns: y : Series
same index as caller
Examples
Map inputs to outputs
>>> x one 1 two 2 three 3
>>> y 1 foo 2 bar 3 baz
>>> x.map(y) one foo two bar three baz
Use na_action to control whether NA values are affected by the mapping function.
>>> s = pd.Series([1, 2, 3, np.nan])
>>> s2 = s.map(lambda x: 'this is a string {}'.format(x), na_action=None) 0 this is a string 1.0 1 this is a string 2.0 2 this is a string 3.0 3 this is a string nan dtype: object
>>> s3 = s.map(lambda x: 'this is a string {}'.format(x), na_action='ignore') 0 this is a string 1.0 1 this is a string 2.0 2 this is a string 3.0 3 NaN dtype: object
-
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Parameters: func : function
Function applied to each partition.
args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.
Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:
>>> def myadd(df, a, b=1): ... return df.x + df.y + a + b >>> res = ddf.map_partitions(myadd, 1, b=2) >>> res.dtype dtype('float64')
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms; for more information see dask.dataframe.utils.make_meta.
Here we specify that the output is a Series with no name and dtype float64:
>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y)) >>> res.dtypes x int64 y float64 z float64 dtype: object
As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y), ... meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)
-
mask
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
Parameters: cond : boolean NDFrame, array or callable
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
other : scalar, NDFrame, or callable
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh : same type as caller
See also
Notes
Dask doesn’t support the following argument(s).
- inplace
- axis
- level
- try_cast
- raise_on_error
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
-
max
(axis=None, skipna=True, split_every=False)¶ - This method returns the maximum of the values in the object.
- If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: max : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
mean
(axis=None, skipna=True, split_every=False)¶ Return the mean of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: mean : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
min
(axis=None, skipna=True, split_every=False)¶ - This method returns the minimum of the values in the object.
- If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: min : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
mod
(other, level=None, fill_value=None, axis=0)¶ Modulo of series and other, element-wise (binary operator mod).
Equivalent to series % other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
mul
(other, level=None, fill_value=None, axis=0)¶ Multiplication of series and other, element-wise (binary operator mul).
Equivalent to series * other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
ndim
¶ Return dimensionality
-
ne
(other, level=None, axis=0)¶ Not equal to of series and other, element-wise (binary operator ne).
Equivalent to series != other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
nlargest
(n=5, split_every=None)¶ Return the largest n elements.
Parameters: n : int
Return this many descending sorted values
keep : {‘first’, ‘last’, False}, default ‘first’
Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
take_last : deprecated
Returns: top_n : Series
The n largest values in the Series, in sorted order
See also
Series.nsmallest
Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.
Examples
>>> import pandas as pd >>> import numpy as np >>> s = pd.Series(np.random.randn(10**6)) >>> s.nlargest(10) # only sorts up to the N requested
-
notnull
()¶ Return a boolean same-sized object indicating if the values are not null.
See also
isnull
- boolean inverse of notnull
-
npartitions
¶ Return number of partitions
-
nunique
(split_every=None)¶ Return number of unique elements in the object.
Excludes NA values by default.
Parameters: dropna : boolean, default True
Don’t include NaN in the count.
Returns: nunique : int
Notes
Dask doesn’t support the following argument(s).
- dropna
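A minimal sketch (the series s is constructed here only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1, 1, 2, 3, 3, 3]), npartitions=2)
>>> s.nunique().compute()  # 3 distinct values: 1, 2 and 3
3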
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs)
New in version 0.16.2.
Parameters: func : function
function to apply to the NDFrame. args, and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.
args : positional arguments passed into func.
kwargs : a dictionary of keyword arguments passed into func.
Returns: object : the return type of func.
See also
pandas.DataFrame.apply
,pandas.DataFrame.applymap
,pandas.Series.map
Notes
Use .pipe when chaining together functions that expect Series or DataFrames. Instead of writing
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe(f, arg2=b, arg3=c) ... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe((f, 'arg2'), arg1=a, arg3=c) ... )
-
pow
(other, level=None, fill_value=None, axis=0)¶ Exponential power of series and other, element-wise (binary operator pow).
Equivalent to series ** other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
quantile
(q=0.5)¶ Approximate quantiles of Series
- q : list/array of floats, default 0.5 (50%)
- Iterable of numbers ranging from 0 to 1 for the desired quantiles
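For example, a brief sketch (illustrative only; s is built just for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(range(100)), npartitions=4)
>>> s.quantile(0.5).compute()           # approximate median
>>> s.quantile([0.25, 0.75]).compute()  # approximate quartiles, returned as a Series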
-
radd
(other, level=None, fill_value=None, axis=0)¶ Addition of series and other, element-wise (binary operator radd).
Equivalent to other + series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
random_split
(frac, random_state=None)¶ Pseudorandomly split dataframe into different pieces row-wise
Parameters: frac : list
List of floats that should sum to one.
random_state: int or np.random.RandomState
If int create a new RandomState with this as the seed
Otherwise draw from the passed RandomState
See also
dask.DataFrame.sample
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
-
rdiv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
reduction
(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶ Generic row-wise reductions.
Parameters: chunk : callable
Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.
aggregate : callable, optional
Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.
The input to aggregate depends on the output of chunk. If the output of chunk is a:
- scalar: Input is a Series, with one row per partition.
- Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
- DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.
Should return a pandas.DataFrame, pandas.Series, or a scalar.
combine : callable, optional
Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match those of aggregate described above.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
token : str, optional
The name to use for the output keys.
split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.
chunk_kwargs : dict, optional
Keyword arguments to pass on to chunk only.
aggregate_kwargs : dict, optional
Keyword arguments to pass on to aggregate only.
combine_kwargs : dict, optional
Keyword arguments to pass on to combine only.
kwargs :
All remaining keywords will be passed to chunk, combine, and aggregate.
Examples
>>> import pandas as pd >>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)}) >>> ddf = dd.from_pandas(df, npartitions=4)
Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:
>>> res = ddf.reduction(lambda x: x.count(), ... aggregate=lambda x: x.sum()) >>> res.compute() x 50 y 50 dtype: int64
Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).
>>> def count_greater(x, value=0): ... return (x >= value).sum() >>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(), ... chunk_kwargs={'value': 25}) >>> res.compute() 25
Aggregate both the sum and count of a Series at the same time:
>>> def sum_and_count(x): ... return pd.Series({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum()) >>> res.compute() count 50 sum 1225 dtype: int64
Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.
>>> def sum_and_count(x): ... return pd.DataFrame({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.reduction(sum_and_count, ... aggregate=lambda x: x.groupby(level=0).sum()) >>> res.compute() count sum x 50 1225 y 50 3725
-
repartition
(divisions=None, npartitions=None, force=False)¶ Repartition dataframe along new divisions
Parameters: divisions : list, optional
List of partitions to be used. If specified npartitions will be ignored.
npartitions : int, optional
Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
force : bool, default False
Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.
Examples
>>> df = df.repartition(npartitions=10) >>> df = df.repartition(divisions=[0, 5, 10, 20])
-
resample
(rule, how=None, closed=None, label=None)¶ Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
Parameters: rule : string
the offset string or object representing target conversion
axis : int, optional, default 0
closed : {‘right’, ‘left’}
Which side of bin interval is closed
label : {‘right’, ‘left’}
Which bin edge label to label bucket with
convention : {‘start’, ‘end’, ‘s’, ‘e’}
loffset : timedelta
Adjust the resampled time labels
base : int, default 0
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
on : string, optional
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
level : string or int, optional
For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.
New in version 0.19.0.
To learn more about the offset strings, please see the pandas documentation on offset aliases: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Notes
Dask doesn’t support the following argument(s).
- axis
- fill_method
- convention
- kind
- loffset
- limit
- base
- on
- level
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T') >>> series = pd.Series(range(9), index=index) >>> series 2000-01-01 00:00:00 0 2000-01-01 00:01:00 1 2000-01-01 00:02:00 2 2000-01-01 00:03:00 3 2000-01-01 00:04:00 4 2000-01-01 00:05:00 5 2000-01-01 00:06:00 6 2000-01-01 00:07:00 7 2000-01-01 00:08:00 8 Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum() 2000-01-01 00:00:00 3 2000-01-01 00:03:00 12 2000-01-01 00:06:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval as illustrated in the example below this one.
>>> series.resample('3T', label='right').sum() 2000-01-01 00:03:00 3 2000-01-01 00:06:00 12 2000-01-01 00:09:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum() 2000-01-01 00:00:00 0 2000-01-01 00:03:00 6 2000-01-01 00:06:00 15 2000-01-01 00:09:00 15 Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] #select first 5 rows 2000-01-01 00:00:00 0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 1 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2 Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 0 2000-01-01 00:01:00 1 2000-01-01 00:01:30 1 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 1 2000-01-01 00:01:00 1 2000-01-01 00:01:30 2 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Pass a custom function via apply:
>>> def custom_resampler(array_like): ... return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler) 2000-01-01 00:00:00 8 2000-01-01 00:03:00 17 2000-01-01 00:06:00 26 Freq: 3T, dtype: int64
-
reset_index
(drop=False)¶ Analogous to the pandas.DataFrame.reset_index() function, see docstring there.
Parameters: level : int, str, tuple, or list, default None
Only remove the given levels from the index. Removes all levels by default
drop : boolean, default False
Do not try to insert index into dataframe columns
name : object, default None
The name of the column corresponding to the Series values
inplace : boolean, default False
Modify the Series in place (do not create a new object)
Returns: resetted : DataFrame, or Series if drop == True
Notes
Dask doesn’t support the following argument(s).
- level
- name
- inplace
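A minimal sketch (the frame below is constructed only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.reset_index().compute()           # old index becomes an ordinary column
>>> ddf.reset_index(drop=True).compute()  # old index is discarded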
-
rfloordiv
(other, level=None, fill_value=None, axis=0)¶ Integer division of series and other, element-wise (binary operator rfloordiv).
Equivalent to other // series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
rmod
(other, level=None, fill_value=None, axis=0)¶ Modulo of series and other, element-wise (binary operator rmod).
Equivalent to other % series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
rmul
(other, level=None, fill_value=None, axis=0)¶ Multiplication of series and other, element-wise (binary operator rmul).
Equivalent to other * series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
rolling
(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)¶ Provides rolling transformations.
Parameters: window : int
Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
center : boolean, default False
Set the labels at the center of the window.
win_type : string, default None
Provide a window type. The recognized window types are identical to pandas.
axis : int, default 0
Returns: a Rolling object on which to call a method to compute a statistic
Notes
The freq argument is not supported.
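A short illustrative sketch (the series s is built only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> s.rolling(3).mean().compute()    # 3-element moving average
>>> s.rolling(3, center=True).sum()  # lazy rolling sum with centered labels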
-
round
(decimals=0)¶ Round each value in a Series to the given number of decimals.
Parameters: decimals : int
Number of decimal places to round to (default: 0). If decimals is negative, it specifies the number of positions to the left of the decimal point.
Returns: Series object
See also
numpy.around
,DataFrame.round
-
rpow
(other, level=None, fill_value=None, axis=0)¶ Exponential power of series and other, element-wise (binary operator rpow).
Equivalent to other ** series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
rsub
(other, level=None, fill_value=None, axis=0)¶ Subtraction of series and other, element-wise (binary operator rsub).
Equivalent to other - series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
rtruediv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
sample
(frac, replace=False, random_state=None)¶ Random sample of items
Parameters: frac : float, optional
Fraction of axis items to return.
replace: boolean, optional
Sample with or without replacement. Default = False.
random_state: int or np.random.RandomState
If int, we create a new RandomState with this as the seed. Otherwise we draw from the passed RandomState.
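For illustration, a minimal sketch (the frame ddf is constructed only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
>>> small = ddf.sample(0.1, random_state=123)  # roughly 10% of rows, reproducible
>>> small.compute()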
-
std
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
degrees of freedom
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: std : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
sub
(other, level=None, fill_value=None, axis=0)¶ Subtraction of series and other, element-wise (binary operator sub).
Equivalent to series - other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
sum
(axis=None, skipna=True, split_every=False)¶ Return the sum of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sum : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
tail
(n=5, compute=True)¶ Last n rows of the dataset
Caveat: this only checks the last n rows of the last partition.
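A brief sketch (the frame ddf is built here only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
>>> ddf.tail()    # pandas DataFrame with the last 5 rows of the last partition
>>> ddf.tail(10)  # last 10 rows, still taken only from the final partition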
-
to_bag
(index=False)¶ Convert to a dask Bag.
Parameters: index : bool, optional
If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.
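A minimal sketch (the frame is constructed only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}), npartitions=1)
>>> bag = ddf.to_bag()      # dask Bag with one element per row
>>> bag.compute()
>>> ddf.to_bag(index=True)  # include the index with each element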
-
to_csv
(filename, **kwargs)¶ Write DataFrame to a series of comma-separated values (csv) files
One filename per partition will be created. You can specify the filenames in a variety of ways.
Use a globstring:
>>> df.to_csv('/path/to/data/export-*.csv')
The * will be replaced by the increasing sequence 0, 1, 2, ...
/path/to/data/export-0.csv /path/to/data/export-1.csv
Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.
>>> from datetime import date, timedelta >>> def name(i): ... return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0) '2015-01-01' >>> name(15) '2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)
/path/to/data/export-2015-01-01.csv /path/to/data/export-2015-01-02.csv ...
You can also provide an explicit list of paths:
>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...] >>> df.to_csv(paths)
Parameters: filename : string
Path glob indicating the naming scheme for the output files
name_function : callable, default None
Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions
compression : string or None
String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
sep : character, default ‘,’
Field delimiter for the output file
na_rep : string, default ‘’
Missing data representation
float_format : string, default None
Format string for floating point numbers
columns : sequence, optional
Columns to write
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed to be aliases for the column names
index : boolean, default True
Write row names (index)
index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R
nanRep : None
deprecated, use na_rep
mode : str
Python write mode, default ‘w’
encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
compression : string, optional
a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
line_terminator : string, default ‘\n’
The newline character or character sequence to use in the output file
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL
quotechar : string (length 1), default ‘”’
character used to quote fields
doublequote : boolean, default True
Control quoting of quotechar inside a field
escapechar : string (length 1), default None
character used to escape sep and quotechar when appropriate
chunksize : int or None
rows to write at a time
tupleize_cols : boolean, default False
write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)
date_format : string, default None
Format string for datetime objects
decimal: string, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for European data
-
to_delayed
()¶ Convert dataframe into dask Delayed objects
Returns a list of delayed values, one value per partition.
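A short sketch (the frame ddf is built only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=3)
>>> parts = ddf.to_delayed()  # list of 3 Delayed objects, one per partition
>>> parts[0].compute()        # pandas DataFrame for the first partition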
-
to_frame
(name=None)¶ Convert Series to DataFrame
Parameters: name : object, default None
The passed name should substitute for the series name (if it has one).
Returns: data_frame : DataFrame
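For illustration (a minimal sketch; the series s is built only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1, 2, 3], name='x'), npartitions=2)
>>> s.to_frame().compute()          # single-column DataFrame named 'x'
>>> s.to_frame(name='y').compute()  # rename the column in the result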
-
to_hdf
(path_or_buf, key, mode='a', append=False, get=None, **kwargs)¶ Export frame to hdf file(s)
Export dataframe to one or multiple hdf5 files or nodes.
Exported hdf format is pandas’ hdf table format only. Data saved by this function should be read by pandas dataframe compatible reader.
By providing a single asterisk in either the path_or_buf or key parameters you direct dask to save each partition to a different file or node (respectively). The asterisk will be replaced with a zero padded partition number, as this is the default implementation of name_function.
When writing to a single hdf node in a single hdf file, all hdf save tasks are required to execute in a specific order, often becoming the bottleneck of the entire execution graph. Saving to multiple nodes or files removes that restriction (order is still preserved by enforcing order on output, using name_function) and enables executing save tasks in parallel.
Parameters: path_or_buf: HDFStore object or string
Destination file(s). If string, can contain a single asterisk to
save each partition to a different file. Only one asterisk is
allowed in both path_or_buf and key parameters.
key: string
A node / group path in file, can contain a single asterisk to save
each partition to a different hdf node in a single file. Only one
asterisk is allowed in both path_or_buf and key parameters.
format: optional, default ‘table’
Default hdf storage format, currently only pandas’ ‘table’ format is supported.
mode: optional, {‘a’, ‘w’, ‘r+’}, default ‘a’
'a'
Append: Add data to existing file(s) or create new.
'w'
Write: overwrite any existing files with new ones.
'r+'
Append to existing files, files must already exist.
append: optional, default False
If False, overwrites existing node with the same name otherwise
appends to it.
complevel: optional, 0-9, default 0
compression level, higher means better compression ratio and possibly more CPU time. Depends on complib.
complib: {‘zlib’, ‘bzip2’, ‘lzo’, ‘blosc’, None}, default None
If complevel > 0 compress using this compression library when
possible
fletcher32: bool, default False
If True and compression is used, additionally apply the fletcher32
checksum.
get: callable, optional
A scheduler `get` function to use. If not provided, the default is
to check the global settings first, and then fall back to defaults
for the collections.
dask_kwargs: dict, optional
A dictionary of keyword arguments passed to the `get` function
used.
name_function: callable, optional, default None
A callable called for each partition that accepts a single int
representing the partition number. name_function must return a
string representation of a partition’s index in a way that will
preserve the partition’s location after a string sort.
If None, a default name_function is used. The default name_function
will return a zero padded string of received int. See
dask.utils.build_name_function for more info.
compute: bool, default True
If True, execute computation of resulting dask graph. If False, return a Delayed object.
lock: bool, None or lock object, default None
In to_hdf locks are needed for two reasons. First, to protect against writing to the same file from multiple processes or threads simultaneously. Second, the default libhdf5 is not thread safe, so we must additionally lock on its usage. By default, if lock is None, the lock will be determined optimally based on path_or_buf, key and the scheduler used. Manually setting this parameter is usually not required to improve performance.
Alternatively, you can specify specific values:
If False, no locking will occur. If True, a default lock object will be created (multiprocessing.Manager.Lock on the multiprocessing scheduler, threading.Lock otherwise); this can be used to force using a lock in scenarios where the default behavior would be to avoid locking. Otherwise, the value is assumed to implement the lock interface and will be used as the lock object.
See also
dask.DataFrame.read_hdf
- reading hdf files
dask.Series.read_hdf
- reading hdf files
Examples
Saving data to a single file:
>>> df.to_hdf('output.hdf', '/data')
Saving data to multiple nodes:
>>> with pd.HDFStore('output.hdf') as fh: ... df.to_hdf(fh, '/data*') ... fh.keys() ['/data0', '/data1']
Or multiple files:
>>> df.to_hdf('output_*.hdf', '/data')
Saving multiple files with the multiprocessing scheduler and manually disabling locks:
>>> df.to_hdf('output_*.hdf', '/data', ... get=dask.multiprocessing.get, lock=False)
-
to_timestamp
(freq=None, how='start', axis=0)¶ Cast to DatetimeIndex of timestamps, at beginning of period
Parameters: freq : string, default frequency of PeriodIndex
Desired frequency
how : {‘s’, ‘e’, ‘start’, ‘end’}
Convention for converting period to timestamp; start of period vs. end
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert (the index by default)
copy : boolean, default True
If false then underlying input data is not copied
Returns: df : DataFrame with DatetimeIndex
Notes
Dask doesn’t support the following argument(s).
- copy
-
truediv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other: Series or scalar value
fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
-
unique
(split_every=None)¶ Return Series of unique values in the object. Includes NA values.
Returns: uniques : Series
-
value_counts
(split_every=None)¶ Returns object containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
Parameters: normalize : boolean, default False
If True then the object returned will contain the relative frequencies of the unique values.
sort : boolean, default True
Sort by values
ascending : boolean, default False
Sort in ascending order
bins : integer, optional
Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data
dropna : boolean, default True
Don’t include counts of NaN.
Returns: counts : Series
Notes
Dask doesn’t support the following argument(s).
- normalize
- sort
- ascending
- bins
- dropna
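A minimal sketch (the series s is constructed only for the example):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(['a', 'b', 'a', 'c', 'a']), npartitions=2)
>>> s.value_counts().compute()  # counts per unique value, most frequent first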
-
var
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
degrees of freedom
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: var : Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
visualize
(filename='mydask', format=None, optimize_graph=False, **kwargs)¶ Render the computation of this object’s task graph using graphviz.
Requires graphviz to be installed.
Parameters: filename : str or None, optional
The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional
Format in which to write output file. Default is ‘png’.
optimize_graph : bool, optional
If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
**kwargs
Additional keyword arguments to forward to to_graphviz.
Returns: result : IPython.display.Image, IPython.display.SVG, or None
See dask.dot.dot_graph for more information.
See also
dask.base.visualize
,dask.dot.dot_graph
Notes
For more information on optimization see here:
-
where
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Parameters: cond : boolean NDFrame, array or callable
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
other : scalar, NDFrame, or callable
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh : same type as caller
See also
Notes
Dask doesn’t support the following argument(s).
- inplace
- axis
- level
- try_cast
- raise_on_error
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
-
DataFrameGroupBy¶
-
class
dask.dataframe.groupby.
DataFrameGroupBy
(df, index=None, slice=None, **kwargs)¶ -
agg
(arg, split_every=None)¶ Aggregate using input function or dict of {column -> function}
Parameters: arg : function or dict
Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.
- Accepted Combinations are:
- string cythonized function name
- function
- list of functions
- dict of columns -> functions
- nested dict of names -> dicts of functions
Returns: aggregated : DataFrame
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
Notes
Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).
-
aggregate
(arg, split_every=None)¶ Aggregate using input function or dict of {column -> function}
Parameters: arg : function or dict
Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.
- Accepted Combinations are:
- string cythonized function name
- function
- list of functions
- dict of columns -> functions
- nested dict of names -> dicts of functions
Returns: aggregated : DataFrame
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
Notes
Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).
-
apply
(func, meta='__no_default__', columns='__no_default__')¶ Parallel version of pandas GroupBy.apply
This mimics the pandas version except for the following:
- The user should provide output metadata.
- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters: func: function
Function to apply
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
columns: list, scalar or None
Deprecated, use meta instead. If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar or None (no name). If the name keyword is not given, dask tries to infer the result type from the beginning of the data. This inference may take some time and lead to unexpected results.
Returns: applied : Series or DataFrame depending on columns keyword
-
count
(split_every=None)¶ Compute count of group, excluding missing values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
get_group
(key)¶ Constructs NDFrame from group with provided name
Parameters: name : object
the name of the group to get as a DataFrame
obj : NDFrame, default None
the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group : type of obj
Notes
Dask doesn’t support the following argument(s).
- name
- obj
-
max
(split_every=None)¶ Compute max of group values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
mean
(split_every=None)¶ Compute mean of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
min
(split_every=None)¶ Compute min of group values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
size
(split_every=None)¶ Compute group sizes
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
std
(ddof=1, split_every=None)¶ Compute standard deviation of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
sum
(split_every=None)¶ Compute sum of group values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
var
(ddof=1, split_every=None)¶ Compute variance of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
SeriesGroupBy¶
-
class
dask.dataframe.groupby.
SeriesGroupBy
(df, index, slice=None, **kwargs)¶ -
agg
(arg, split_every=None)¶ Apply aggregation function or functions to groups, yielding most likely Series but in some cases DataFrame depending on the output of the aggregation function
Parameters: func_or_funcs : function or list / dict of functions
List/dict of functions will produce DataFrame with column names determined by the function names themselves (list) or the keys in the dict
Returns: Series or DataFrame
See also
apply
,transform
Notes
Dask doesn’t support the following argument(s).
- func_or_funcs
Examples
>>> series bar 1.0 baz 2.0 qot 3.0 qux 4.0
>>> mapper = lambda x: x[0] # first letter >>> grouped = series.groupby(mapper)
>>> grouped.aggregate(np.sum) b 3.0 q 7.0
>>> grouped.aggregate([np.sum, np.mean, np.std]) mean std sum b 1.5 0.5 3 q 3.5 0.5 7
>>> grouped.agg({'result' : lambda x: x.mean() / x.std(), ... 'total' : np.sum}) result total b 2.121 3 q 4.95 7
-
aggregate
(arg, split_every=None)¶ Apply aggregation function or functions to groups, yielding most likely Series but in some cases DataFrame depending on the output of the aggregation function
Parameters: func_or_funcs : function or list / dict of functions
List/dict of functions will produce DataFrame with column names determined by the function names themselves (list) or the keys in the dict
Returns: Series or DataFrame
See also
apply
,transform
Notes
Dask doesn’t support the following argument(s).
- func_or_funcs
Examples
>>> series bar 1.0 baz 2.0 qot 3.0 qux 4.0
>>> mapper = lambda x: x[0] # first letter >>> grouped = series.groupby(mapper)
>>> grouped.aggregate(np.sum) b 3.0 q 7.0
>>> grouped.aggregate([np.sum, np.mean, np.std]) mean std sum b 1.5 0.5 3 q 3.5 0.5 7
>>> grouped.agg({'result' : lambda x: x.mean() / x.std(), ... 'total' : np.sum}) result total b 2.121 3 q 4.95 7
-
apply
(func, meta='__no_default__', columns='__no_default__')¶ Parallel version of pandas GroupBy.apply
This mimics the pandas version except for the following:
- The user should provide output metadata.
- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters: func: function
Function to apply
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
columns: list, scalar or None
Deprecated, use meta instead. If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar or None (no name). If the name keyword is not given, dask tries to infer the result type from the beginning of the data. This inference may take some time and lead to unexpected results.
Returns: applied : Series or DataFrame depending on columns keyword
-
count
(split_every=None)¶ Compute count of group, excluding missing values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
get_group
(key)¶ Constructs NDFrame from group with provided name
Parameters: name : object
the name of the group to get as a DataFrame
obj : NDFrame, default None
the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group : type of obj
Notes
Dask doesn’t support the following argument(s).
- name
- obj
-
max
(split_every=None)¶ Compute max of group values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
mean
(split_every=None)¶ Compute mean of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
min
(split_every=None)¶ Compute min of group values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
size
(split_every=None)¶ Compute group sizes
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
std
(ddof=1, split_every=None)¶ Compute standard deviation of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
sum
(split_every=None)¶ Compute sum of group values
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
-
var
(ddof=1, split_every=None)¶ Compute variance of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
See also
pandas.Series.groupby
,pandas.DataFrame.groupby
,pandas.Panel.groupby
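The grouped reductions above (count, max, mean, min, size, std, sum, var) all accept split_every, which bounds how many partition-level results are combined at each step of the reduction tree; the result is unchanged, only the task graph shape differs. A minimal sketch (the column names 'k' and 'v' are illustrative):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'k': list('abab') * 25, 'v': range(100)})
ddf = dd.from_pandas(pdf, npartitions=10)
grouped = ddf.groupby('k')

# Combine at most 4 intermediate results per reduction step.
totals = grouped['v'].sum(split_every=4)
spread = grouped['v'].std(ddof=1, split_every=4)
print(totals.compute())
print(spread.compute())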
-
Other functions¶
-
dask.dataframe.
compute
(*args, **kwargs)¶ Compute several dask collections at once.
Parameters: args : object
Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.
get : callable, optional
A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to defaults for the collections.
optimize_graph : bool, optional
If True [default], the optimizations for each collection are applied before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs
Extra keywords to forward to the scheduler get function.
Examples
>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)
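The same call works for dask.dataframe collections; computing related results together lets them share intermediate work instead of re-reading and re-partitioning the data. A small sketch with illustrative data:
import pandas as pd
import dask.dataframe as dd
from dask import compute

ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)

# One call computes both results over a single shared graph.
total, average = compute(ddf.x.sum(), ddf.x.mean())
print(total, average)  # 45 4.5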
-
dask.dataframe.
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Parameters: func : function
Function applied to each partition.
args, kwargs :
Arguments and keywords to pass to the function. At least one of the args should be a Dask.dataframe.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
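A minimal sketch of the call pattern; the column names and the add_ratio helper are assumptions made for this example, not part of the API:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'a': range(6), 'b': range(6)}),
                     npartitions=3)

def add_ratio(df):
    # func receives a plain pandas DataFrame for each partition.
    return df.assign(ratio=df.a / (df.b + 1))

# meta describes the output schema so dask does not have to infer it.
result = dd.map_partitions(add_ratio, ddf,
                           meta={'a': 'int64', 'b': 'int64', 'ratio': 'float64'})
print(result.compute())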
-
dask.dataframe.multi.
concat
(dfs, axis=0, join='outer', interleave_partitions=False)¶ Concatenate DataFrames along rows.
- When axis=0 (default), concatenate DataFrames row-wise:
- If all divisions are known and ordered, the DataFrames are concatenated keeping their divisions. When divisions are not ordered, specifying interleave_partitions=True allows the divisions to be interleaved.
- If any divisions are unknown, the DataFrames are concatenated with the resulting divisions reset to unknown (None).
- When axis=1, concatenate DataFrames column-wise:
- Allowed if all divisions are known.
- If any divisions are unknown, a ValueError is raised.
Parameters: dfs : list
List of dask.DataFrames to be concatenated
axis : {0, 1, ‘index’, ‘columns’}, default 0
The axis to concatenate along
join : {‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis
interleave_partitions : bool, default False
Whether to concatenate DataFrames ignoring the order of their divisions. If True, the divisions of the inputs are interleaved.
Examples
If all divisions are known and ordered, divisions are kept.
>>> a
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>
Unable to concatenate if divisions are not ordered.
>>> a
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
Specify interleave_partitions=True to ignore the division order.
>>> dd.concat([a, b], interleave_partitions=True)
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>
If any divisions are unknown, the resulting divisions will be unknown.
>>> a
dd.DataFrame<x, divisions=(None, None)>
>>> b
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])
dd.DataFrame<concat-..., divisions=(None, None, None, None)>
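A runnable sketch of both cases, using small frames built with from_pandas (the index values and column names are illustrative):
import pandas as pd
import dask.dataframe as dd

a = dd.from_pandas(pd.DataFrame({'x': range(4)}, index=[1, 2, 3, 4]),
                   npartitions=2)
b = dd.from_pandas(pd.DataFrame({'x': range(4)}, index=[6, 7, 8, 9]),
                   npartitions=2)

# Row-wise: divisions are known and ordered, so they are preserved.
stacked = dd.concat([a, b])
print(stacked.divisions)

# Column-wise: allowed here because both inputs share known divisions.
wide = dd.concat([a, a.rename(columns={'x': 'y'})], axis=1)
print(wide.compute())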
-
dask.dataframe.multi.
merge
(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)¶ Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters: left : DataFrame
right : DataFrame
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
- left: use only keys from left frame (SQL: left outer join)
- right: use only keys from right frame (SQL: right outer join)
- outer: use union of keys from both frames (SQL: full outer join)
- inner: use intersection of keys from both frames (SQL: inner join)
on : label or list
Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
left_on : label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
right_on : label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per left_on docs
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
right_index : boolean, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index
sort : boolean, default False
Sort the join keys lexicographically in the result DataFrame
suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right side, respectively
copy : boolean, default True
If False, do not copy data unnecessarily
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
New in version 0.17.0.
Returns: merged : DataFrame
The output type will be the same as ‘left’, if it is a subclass of DataFrame.
See also
merge_ordered
,merge_asof
Examples
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x rkey  value_y
0  foo   1       foo   5
1  foo   4       foo   5
2  bar   2       bar   6
3  bar   2       bar   8
4  baz   3       NaN   NaN
5  NaN   NaN     qux   7
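The dask version is called the same way; a sketch mirroring the pandas example above, with the frames rebuilt via from_pandas (the data is illustrative):
import pandas as pd
import dask.dataframe as dd

A = dd.from_pandas(pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                                 'value': [1, 2, 3, 4]}), npartitions=2)
B = dd.from_pandas(pd.DataFrame({'rkey': ['foo', 'bar', 'qux', 'bar'],
                                 'value': [5, 6, 7, 8]}), npartitions=2)

# Joining on columns (not the index) requires shuffling both sides.
merged = dd.merge(A, B, left_on='lkey', right_on='rkey', how='outer',
                  suffixes=('_x', '_y'))
print(merged.compute())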
-
dask.dataframe.
read_csv
(urlpath, blocksize=33554432, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, storage_options=None, **kwargs)¶ Read CSV files into a Dask.DataFrame
This parallelizes the pandas.read_csv function in the following ways:
It supports loading many files at once using globstrings as follows:
>>> df = dd.read_csv('myfiles.*.csv')
In some cases it can break up large files as follows:
>>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks
You can read CSV files from external resources (e.g. S3, HDFS) providing a URL:
>>> df = dd.read_csv('s3://bucket/myfiles.*.csv')
>>> df = dd.read_csv('hdfs:///myfiles.*.csv')
>>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')
Internally dd.read_csv uses pandas.read_csv and so supports many of the same keyword arguments with the same performance guarantees.
See the docstring for pandas.read_csv for more information on available keyword arguments.
Note that this function may fail if a CSV file includes quoted strings that contain the line terminator.
Parameters: urlpath : string
Absolute or relative filepath, URL (may include protocols like s3://), or globstring for CSV files.
blocksize : int or None
Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.
collection : boolean
Return a dask.dataframe if True or list of dask.delayed objects if False
sample : int
Number of bytes to use when determining dtypes
storage_options : dict
Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.
**kwargs : dict
Options to pass down to
pandas.read_csv
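Since keyword arguments are forwarded to pandas.read_csv, per-partition parsing options can be passed directly. The path, column names, and dtypes below are assumptions for illustration only:
import dask.dataframe as dd

df = dd.read_csv('data/2016-*.csv',
                 blocksize=64000000,                 # ~64 MB per partition
                 dtype={'id': 'int64', 'amount': 'float64'},
                 parse_dates=['timestamp'])
print(df.npartitions)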
-
dask.dataframe.
read_table
(urlpath, blocksize=33554432, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, storage_options=None, **kwargs)¶ Read delimited files into a Dask.DataFrame
This parallelizes the pandas.read_table function in the following ways:
It supports loading many files at once using globstrings as follows:
>>> df = dd.read_table('myfiles.*.csv')
In some cases it can break up large files as follows:
>>> df = dd.read_table('largefile.csv', blocksize=25e6) # 25MB chunks
You can read CSV files from external resources (e.g. S3, HDFS) providing a URL:
>>> df = dd.read_table('s3://bucket/myfiles.*.csv')
>>> df = dd.read_table('hdfs:///myfiles.*.csv')
>>> df = dd.read_table('hdfs://namenode.example.com/myfiles.*.csv')
Internally dd.read_table uses pandas.read_table and so supports many of the same keyword arguments with the same performance guarantees.
See the docstring for pandas.read_table for more information on available keyword arguments.
Note that this function may fail if a delimited file includes quoted strings that contain the line terminator.
Parameters: urlpath : string
Absolute or relative filepath, URL (may include protocols like s3://), or globstring for delimited files.
blocksize : int or None
Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.
collection : boolean
Return a dask.dataframe if True or list of dask.delayed objects if False
sample : int
Number of bytes to use when determining dtypes
storage_options : dict
Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.
**kwargs : dict
Options to pass down to
pandas.read_table
-
dask.dataframe.
read_hdf
(path_or_buf, key=None, **kwargs)¶ Read from the store, and close it if we opened it
Retrieve pandas object stored in file, optionally based on where criteria
Parameters: path_or_buf : path (string), buffer, or path object (pathlib.Path or
py._path.local.LocalPath) to read from
New in version 0.19.0: support for pathlib, py.path.
key : group identifier in the store. Can be omitted if the HDF file
contains a single pandas object.
where : list of Term (or convertible) objects, optional
start : optional, integer (defaults to None), row number to start
selection
stop : optional, integer (defaults to None), row number to stop
selection
columns : optional, a list of columns that if not None, will limit the
return columns
iterator : optional, boolean, return an iterator, default False
chunksize : optional, nrows to include in iteration, return an iterator
Returns: The selected object
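A minimal round-trip sketch, assuming PyTables is installed and using an illustrative file name and key; extra keywords are forwarded to the underlying pandas HDF reader:
import pandas as pd
import dask.dataframe as dd

# Write a small table-format HDF file first (path and key are examples).
pd.DataFrame({'x': range(10)}).to_hdf('example.h5', 'mydata', format='table')

# Read it back lazily, restricting to a subset of columns.
df = dd.read_hdf('example.h5', key='/mydata', columns=['x'])
print(df.compute().head())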
-
dask.dataframe.
from_array
(x, chunksize=50000, columns=None)¶ Read dask DataFrame from any sliceable array
Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax x[50000:100000] and have 2 dimensions (x.ndim == 2), or have a record dtype (x.dtype == [('name', 'O'), ('balance', 'i8')]).
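A short sketch covering both accepted inputs (a record array and a plain 2-d array); the names and values are illustrative:
import numpy as np
import dask.dataframe as dd

# Record dtype: column names come from the dtype fields.
records = np.array([('alice', 100), ('bob', 200), ('carol', 300)],
                   dtype=[('name', 'O'), ('balance', 'i8')])
print(dd.from_array(records, chunksize=2).compute())

# Plain 2-d array: column names can be supplied explicitly.
matrix = np.arange(8).reshape(4, 2)
print(dd.from_array(matrix, chunksize=2, columns=['a', 'b']).compute())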
-
dask.dataframe.
from_pandas
(data, npartitions=None, chunksize=None, sort=True, name=None)¶ Construct a dask object from a pandas object.
If given a pandas.Series, a dask.Series will be returned. If given a pandas.DataFrame, a dask.DataFrame will be returned. All other pandas objects will raise a TypeError.
Parameters: df : pandas.DataFrame or pandas.Series
The DataFrame/Series with which to construct a dask DataFrame/Series
npartitions : int, optional
The number of partitions of the index to create.
chunksize : int, optional
The size of the partitions of the index.
sort : bool
If True (default), sort the input first so the resulting partitions are cleanly divided with known divisions; if False, skip the sort and accept partitions that are not cleanly divided.
name: string, optional
An optional keyname for the dataframe. Defaults to hashing the input
Returns: dask.DataFrame or dask.Series
A dask DataFrame/Series partitioned along the index
Raises: TypeError
If something other than a pandas.DataFrame or pandas.Series is passed in.
See also
from_array
- Construct a dask.DataFrame from an array that has record dtype
from_bcolz
- Construct a dask.DataFrame from a bcolz ctable
read_csv
- Construct a dask.DataFrame from a CSV file
Examples
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
-
dask.dataframe.
from_bcolz
(x, chunksize=None, categorize=True, index=None, lock=<unlocked _thread.lock object>, **kwargs)¶ Read dask DataFrame from bcolz.ctable
Parameters: x : bcolz.ctable
Input data
chunksize : int, optional
The size of blocks to pull out from ctable. Ideally as large as can comfortably fit in memory
categorize : bool, defaults to True
Automatically categorize all string dtypes
index : string, optional
Column to make the index
lock: bool or Lock
Lock to use when reading or False for no lock (not-thread-safe)
See also
from_array
- more generic function not optimized for bcolz
-
dask.dataframe.
from_dask_array
(x, columns=None)¶ Convert dask Array to dask DataFrame
Converts a 2d array into a DataFrame and a 1d array into a Series.
Parameters: x: da.Array
columns: list or string
list of column names if DataFrame, single string if Series
Examples
>>> import dask.array as da
>>> import dask.dataframe as dd
>>> x = da.ones((4, 2), chunks=(2, 2))
>>> df = dd.io.from_dask_array(x, columns=['a', 'b'])
>>> df.compute()
     a    b
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
-
dask.dataframe.
from_delayed
(dfs, meta=None, divisions=None, prefix='from-delayed', metadata=None)¶ Create DataFrame from many dask.delayed objects
Parameters: dfs : list of Delayed
An iterable of dask.delayed.Delayed objects, such as come from dask.delayed. These comprise the individual partitions of the resulting dataframe.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
divisions : tuple, str, optional
Partition boundaries along the index. For a tuple, see http://dask.pydata.io/en/latest/dataframe-partitions.html For the string ‘sorted’, the delayed values will be computed to find the index values; this assumes the indexes are mutually sorted. If None, index information will not be used.
prefix : str, optional
Prefix to prepend to the keys.
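A minimal sketch; the load function is a hypothetical stand-in for whatever produces each partition:
import pandas as pd
from dask import delayed
import dask.dataframe as dd

@delayed
def load(i):
    # Stand-in for reading one partition from an external source.
    return pd.DataFrame({'part': [i] * 3, 'value': list(range(3))})

parts = [load(i) for i in range(4)]

# meta gives the schema of each partition; divisions stay unknown (None).
df = dd.from_delayed(parts, meta={'part': 'int64', 'value': 'int64'})
print(df.npartitions)  # 4
print(df.compute().head())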
-
dask.dataframe.rolling.
rolling_apply
(arg, window, func, min_periods=None, freq=None, center=False, args=(), kwargs={})¶ Generic moving function application.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
func : function
Must produce a single value from an ndarray input
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Whether the label should correspond with center of window
args : tuple
Passed on to func
kwargs : dict
Passed on to func
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
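A minimal sketch of the functional rolling API documented here (newer pandas/dask versions expose this via the .rolling(...) method instead); the series values are illustrative:
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.rolling import rolling_apply

s = dd.from_pandas(pd.Series(range(10), dtype='float64'), npartitions=2)

# Peak-to-peak over a window of 3 observations; func sees an ndarray.
result = rolling_apply(s, 3, lambda x: x.max() - x.min())
print(result.compute())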
-
dask.dataframe.rolling.
rolling_chunk
(func, part1, part2, window, *args)¶
-
dask.dataframe.rolling.
rolling_count
(arg, window, **kwargs)¶ Rolling count of number of non-NaN observations inside provided window.
Parameters: arg : DataFrame or numpy ndarray-like
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Whether the label should correspond with center of window
how : string, default ‘mean’
Method for down- or re-sampling
Returns: rolling_count : type of caller
Notes
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
-
dask.dataframe.rolling.
rolling_kurt
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Unbiased moving kurtosis.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘None’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_max
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Moving maximum.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘max’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_mean
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Moving mean.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘None’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_median
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Moving median.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘median’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_min
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Moving minimum.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘min’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_quantile
(arg, window, quantile, min_periods=None, freq=None, center=False)¶ Moving quantile.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
quantile : float
0 <= quantile <= 1
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Whether the label should correspond with center of window
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
-
dask.dataframe.rolling.
rolling_skew
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Unbiased moving skewness.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘None’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_std
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Moving standard deviation.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘None’
Method for down- or re-sampling
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_sum
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Moving sum.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘None’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_var
(arg, window, min_periods=None, freq=None, center=False, **kwargs)¶ Moving variance.
Parameters: arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default ‘None’
Method for down- or re-sampling
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
Returns: y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
dask.dataframe.rolling.
rolling_window
(arg, window=None, win_type=None, min_periods=None, freq=None, center=False, mean=True, axis=0, how=None, **kwargs)¶ Applies a moving window of type win_type and size window on the data.
Parameters: arg : Series, DataFrame
window : int or ndarray
Weighting window specification. If the window is an integer, then it is treated as the window length and win_type is required
win_type : str, default None
Window type (see Notes)
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Whether the label should correspond with center of window
mean : boolean, default True
If True computes weighted mean, else weighted sum
axis : {0, 1}, default 0
how : string, default ‘mean’
Method for down- or re-sampling
Returns: y : type of input argument
Notes
The recognized window types are:
boxcar
triang
blackman
hamming
bartlett
parzen
bohman
blackmanharris
nuttall
barthann
kaiser (needs beta)
gaussian (needs std)
general_gaussian (needs power, width)
slepian (needs width).
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
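A brief sketch of the plain and weighted variants, assuming this functional rolling API and, for the weighted window, that scipy is available; the series values are illustrative:
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.rolling import rolling_mean, rolling_window

s = dd.from_pandas(pd.Series(range(12), dtype='float64'), npartitions=3)

# Unweighted moving average over 4 observations.
print(rolling_mean(s, 4).compute().tail())

# Weighted moving average using one of the recognized window types.
print(rolling_window(s, 4, win_type='triang', mean=True).compute().tail())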