Numerical utilities (pwkit.numutil)

The numpy and scipy packages provide a whole host of routines, but there are still some that are missing. The pwkit.numutil module provides several useful additions.

The functionality in this module can be grouped into these categories:

Making functions that auto-broadcast their arguments

@pwkit.numutil.broadcastize(n_arr, ret_spec=0, force_float=True)

Wrap a function to automatically broadcast numpy.ndarray arguments.

It’s often desirable to write numerical utility functions in a way that’s compatible with vectorized processing. Doing this can be tedious, however, since the function arguments need to be turned into arrays and checked for compatible shapes, and scalar values need to be special-cased.

The @broadcastize decorator takes care of these matters. The decorated function can be implemented in vectorized form under the assumption that all array arguments have been broadcast to the same shape. The broadcasting of inputs and (potentially) de-vectorizing of the return values are done automatically. For instance, if you decorate a function foo(x,y) with @numutil.broadcastize(2), you can implement it assuming that both x and y are numpy.ndarray objects that have at least one dimension and are both of the same shape. If the function is called with only scalar arguments, x and y will have shape (1,) and the function’s return value will be turned back into a scalar before reaching the caller.

The n_arr argument specifies the number of array arguments that the function takes. These are required to be at the beginning of its argument list.

The ret_spec argument specifies the structure of the function’s return value.

  • 0 indicates that the value has the same shape as the (broadcasted) vector arguments. If the arguments are all scalar, the return value will be scalar too.

  • 1 indicates that the value is an array of higher rank than the input arguments. For instance, if the input has shape (3,), the output might have shape (4,4,3); in general, if the input has shape s, the output will have shape t + s for some tuple t. If the arguments are all scalar, the output will have a shape of just t. The numpy.asarray() function is called on such arguments, so (for instance) you can return a list of arrays [a, b] and it will be converted into a numpy.ndarray.

  • None indicates that the value is completely independent of the inputs. It is returned as-is.

  • A tuple t indicates that the return value is also a tuple. Each element of the ret_spec tuple should be one of the values listed above, and the corresponding element of the return value will be handled accordingly.

The default ret_spec is 0, i.e. the return value is expected to be an array of the same shape as the argument(s).

If force_float is true (the default), the input arrays will be converted to floating-point types if necessary (with numpy.asfarray()) before being passed to the function.

Example:

import numpy as np
from pwkit import numutil

@numutil.broadcastize(2, ret_spec=(0, 1, None))
def myfunction(x, y, extra_arg):
    print('a random non-vector argument is:', extra_arg)
    z = x + y
    z[np.where(y)] *= 2
    higher_vector = [x, y, z]
    return z, higher_vector, 'hello'
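
To show how calls behave, here is a minimal, hypothetical sketch (the hypot name and values are illustrative, not part of the module):

import numpy as np
from pwkit import numutil

@numutil.broadcastize(2)  # ret_spec defaults to 0
def hypot(x, y):
    # x and y arrive as broadcast ndarrays of the same shape
    return np.sqrt(x**2 + y**2)

hypot(3, 4)       # scalar inputs -> scalar output: 5.0
hypot([3, 6], 4)  # broadcast inputs -> array([5.        , 7.21110255])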

Convenience functions for statistics

rms(x)

Return the square root of the mean of the squares of x.

weighted_mean(values, uncerts, **kwargs)

Compute a weighted mean from data values and their uncertainties.

weighted_mean_df(df, **kwargs)

The same as weighted_mean(), except the argument is expected to be a two-column pandas.DataFrame whose first column gives the data values and second column gives their uncertainties.

weighted_variance(x, weights)

Return the variance of a weighted sample.

pwkit.numutil.rms(x)[source]

Return the square root of the mean of the squares of x.

pwkit.numutil.weighted_mean(values, uncerts, **kwargs)[source]

Compute a weighted mean from data values and their uncertainties. Returns (weighted_mean, uncertainty_in_mean).

pwkit.numutil.weighted_mean_df(df, **kwargs)[source]

The same as weighted_mean(), except the argument is expected to be a two-column pandas.DataFrame whose first column gives the data values and second column gives their uncertainties. Returns (weighted_mean, uncertainty_in_mean).

pwkit.numutil.weighted_variance(x, weights)[source]

Return the variance of a weighted sample.

The weighted sample mean is calculated and subtracted off, so the returned variance is upweighted by n / (n - 1). If the sample mean is known to be zero, you should just compute np.average(x**2, weights=weights).
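
As a quick sketch (hypothetical values; assuming the usual inverse-variance convention, weights = uncerts**-2):

import numpy as np
from pwkit import numutil

x = np.array([10.0, 11.0, 12.0])
u = np.array([0.5, 1.0, 0.5])  # 1-sigma uncertainties on x

mean, umean = numutil.weighted_mean(x, u)  # -> (weighted mean, its uncertainty)
var = numutil.weighted_variance(x, u**-2)  # note: takes weights, not uncertainties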

Convenience functions for pandas.DataFrame objects

reduce_data_frame(df, chunk_slicers[, ...])

"Reduce" a DataFrame by collapsing rows in grouped chunks.

reduce_data_frame_evenly_with_gaps(df, ...)

"Reduce" a DataFrame by collapsing rows in grouped chunks, grouping based on gaps in one of the columns.

slice_around_gaps(values, maxgap)

Given an ordered array of values, generate a set of slices that traverse all of the values, breaking the array at gaps larger than maxgap.

slice_evenly_with_gaps(values, target_len, ...)

Given an ordered array of values, generate a set of slices that traverse all of the values, with each slice containing about target_len items and no gap larger than maxgap.

dfsmooth(window, df, ucol[, k])

Smooth a pandas.DataFrame according to a window, weighting based on uncertainties.

smooth_data_frame_with_gaps(window, df, ...)

Smooth a pandas.DataFrame according to a window, weighting based on uncertainties, and breaking the smoothing process at gaps in a time axis.

fits_recarray_to_data_frame(recarray[, ...])

Convert a FITS data table, stored as a Numpy record array, into a Pandas DataFrame object. By default, non-scalar columns are discarded, but if drop_nonscalar_ok is False then a ValueError is raised. Column names are lower-cased.

data_frame_to_astropy_table(dataframe)

This is a backport of the Astropy method astropy.table.table.Table.from_pandas().

usmooth(window, uncerts, *data, **kwargs)

Smooth data series according to a window, weighting based on uncertainties.

page_data_frame(df[, pager_argv])

Render a DataFrame as text and send it to a terminal pager program (e.g. less), so that one can browse a full table conveniently.

pwkit.numutil.reduce_data_frame(df, chunk_slicers, avg_cols=(), uavg_cols=(), minmax_cols=(), nchunk_colname='nchunk', uncert_prefix='u', min_points_per_chunk=3)[source]

“Reduce” a DataFrame by collapsing rows in grouped chunks. Returns another DataFrame with similar columns but fewer rows.

Arguments:

df

The input pandas.DataFrame.

chunk_slicers

An iterable yielding values that are used to slice df with its pandas.DataFrame.iloc() indexer. An example value might be the generator returned from slice_evenly_with_gaps().

avg_cols

An iterable of names of columns that are to be reduced by taking the mean.

uavg_cols

An iterable of names of columns that are to be reduced by taking a weighted mean.

minmax_cols

An iterable of names of columns that are to be reduced by reporting minimum and maximum values.

nchunk_colname

The name of a column to create reporting the number of rows contributing to each chunk.

uncert_prefix

The column name prefix for locating uncertainty estimates. By default, the uncertainty on the column "temp" is given in the column "utemp".

min_points_per_chunk

Require at least this many rows in each chunk. Smaller chunks are discarded.

Returns a new pandas.DataFrame.
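
A minimal sketch of a typical reduction, with hypothetical column names ('time', 'temp', and 'utemp', the latter following the default uncert_prefix convention):

import numpy as np
import pandas as pd
from pwkit import numutil

df = pd.DataFrame({
    'time': np.arange(10.),
    'temp': np.random.normal(20., 1., 10),
    'utemp': np.full(10, 0.5),  # uncertainty on 'temp'
})
slicers = numutil.slice_evenly_with_gaps(df['time'].values, 5, 100.)
reduced = numutil.reduce_data_frame(df, slicers,
                                    avg_cols=['time'],
                                    uavg_cols=['temp'])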

pwkit.numutil.reduce_data_frame_evenly_with_gaps(df, valcol, target_len, maxgap, **kwargs)[source]

“Reduce” a DataFrame by collapsing rows in grouped chunks, grouping based on gaps in one of the columns.

This function combines reduce_data_frame() with slice_evenly_with_gaps().

pwkit.numutil.slice_around_gaps(values, maxgap)[source]

Given an ordered array of values, generate a set of slices that traverse all of the values. Within each slice, no gap between adjacent values is larger than maxgap. In other words, these slices break the array into chunks separated by gaps of size larger than maxgap.

pwkit.numutil.slice_evenly_with_gaps(values, target_len, maxgap)[source]

Given an ordered array of values, generate a set of slices that traverse all of the values. Each slice contains about target_len items. However, no slice contains a gap larger than maxgap, so a slice may contain only a single item (if it is surrounded on both sides by a large gap). If a non-gapped run of values does not divide evenly into target_len, the algorithm errs on the side of making the slices contain more than target_len items, rather than fewer. It also attempts to keep the slice size uniform within each non-gapped run.
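
A small sketch of both slicing helpers on a hypothetical time series with one large gap (assuming the generated values are slice objects usable for indexing, as in the reduce_data_frame() documentation):

import numpy as np
from pwkit import numutil

t = np.array([0., 1., 2., 3., 50., 51., 52.])

for sl in numutil.slice_around_gaps(t, maxgap=10.):
    print(t[sl])  # [0. 1. 2. 3.], then [50. 51. 52.]

for sl in numutil.slice_evenly_with_gaps(t, target_len=2, maxgap=10.):
    print(t[sl])  # chunks of about 2 items, never spanning the gap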

pwkit.numutil.dfsmooth(window, df, ucol, k=None)[source]

Smooth a pandas.DataFrame according to a window, weighting based on uncertainties.

Arguments are:

window

The smoothing window.

df

The pandas.DataFrame.

ucol

The name of the column in df that contains the uncertainties to weight by.

k = None

If specified, only every k-th point of the results will be kept. If k is None (the default), it is set to window.size, i.e. correlated points will be discarded.

Returns: a smoothed data frame.

The returned data frame has a default integer index.

Example:

sdata = numutil.dfsmooth(np.hamming(7), data, 'u_temp')

pwkit.numutil.smooth_data_frame_with_gaps(window, df, uncert_col, time_col, max_gap, min_points_per_chunk=3, k=None)[source]

Smooth a pandas.DataFrame according to a window, weighting based on uncertainties, and breaking the smoothing process at gaps in a time axis.

Arguments are:

window

The smoothing window.

df

The pandas.DataFrame.

uncert_col

The name of the column in df that contains the uncertainties to weight by.

time_col

The name of the column in df that contains the time-like quantities used to determine where the gaps are.

max_gap

If a difference larger than this value is encountered in df[time_col], the smoothing will be discontinuous around this gap. The units of this value are therefore the same as those of df[time_col].

k = None

If specified, only every k-th point of the results will be kept. If k is None (the default), it is set to window.size, i.e. correlated points will be discarded. This decimation does not cross over gaps.

min_points_per_chunk = 3

When gaps are identified, if any chunk of data has fewer than this many elements, it is dropped altogether. This counting happens before decimation by k.

Returns: a smoothed data frame with a default integer index.
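
A hypothetical usage sketch, reusing the 'time' and 'utemp' column names and the df from the reduce_data_frame() example above:

import numpy as np
from pwkit import numutil

sdf = numutil.smooth_data_frame_with_gaps(np.hamming(7), df, 'utemp', 'time',
                                          max_gap=10.)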

pwkit.numutil.fits_recarray_to_data_frame(recarray, drop_nonscalar_ok=True)[source]

Convert a FITS data table, stored as a Numpy record array, into a Pandas DataFrame object. By default, non-scalar columns are discarded, but if drop_nonscalar_ok is False then a ValueError is raised. Column names are lower-cased. Example:

from pwkit import io, numutil
hdu_list = io.Path('my-table.fits').read_fits()
# assuming the first FITS extension is a binary table:
df = numutil.fits_recarray_to_data_frame(hdu_list[1].data)

FITS data are big-endian, whereas nowadays almost everything is little-endian. This seems to be an issue for Pandas DataFrames, where df[['col1', 'col2']] triggers an assertion for me if the underlying data are not native-byte-ordered. This function normalizes the read-in data to native endianness to avoid this.

See also pwkit.io.Path.read_fits_bintable().

pwkit.numutil.data_frame_to_astropy_table(dataframe)[source]

This is a backport of the Astropy method astropy.table.table.Table.from_pandas(). It converts a Pandas pandas.DataFrame object to an Astropy astropy.table.Table.
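
For instance (a minimal sketch with hypothetical data):

import pandas as pd
from pwkit import numutil

df = pd.DataFrame({'ra': [10.68, 83.82], 'dec': [41.27, -5.39]})
tbl = numutil.data_frame_to_astropy_table(df)  # an astropy.table.Table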

pwkit.numutil.usmooth(window, uncerts, *data, **kwargs)[source]

Smooth data series according to a window, weighting based on uncertainties.

Arguments:

window

The smoothing window.

uncerts

An array of uncertainties used to weight the smoothing.

data

One or more data series, of the same size as uncerts.

k = None

If specified, only every k-th point of the results will be kept. If k is None (the default), it is set to window.size, i.e. correlated points will be discarded.

Returns: (s_uncerts, s_data[0], s_data[1], ...), the smoothed uncertainties and data series.

Example:

u, x, y = numutil.usmooth(np.hamming(7), u, x, y)

pwkit.numutil.page_data_frame(df, pager_argv=['less'], **kwargs)[source]

Render a DataFrame as text and send it to a terminal pager program (e.g. less), so that one can browse a full table conveniently.

df

The DataFrame to view

pager_argv: default ['less']

A list of strings passed to subprocess.Popen that launches the pager program

kwargs

Additional keywords are passed to pandas.DataFrame.to_string().

Returns None. Execution blocks until the pager subprocess exits.
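
A short sketch, given some DataFrame df; '-S' is the standard less option that chops rather than wraps long lines, and max_rows is a pandas.DataFrame.to_string() keyword:

numutil.page_data_frame(df, pager_argv=['less', '-S'], max_rows=None)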

Parallelized versions of simple math algorithms

parallel_newton(func, x0[, fprime, ...])

A parallelized version of scipy.optimize.newton().

parallel_quad(func, a, b[, par_args, ...])

A parallelized version of scipy.integrate.quad().

pwkit.numutil.parallel_newton(func, x0, fprime=None, par_args=(), simple_args=(), tol=1.48e-08, maxiter=50, parallel=True, **kwargs)[source]

A parallelized version of scipy.optimize.newton().

Arguments:

func

The function to search for zeros, called as f(x, [*par_args...], [*simple_args...]).

x0

The initial point for the zero search.

fprime

(Optional) The first derivative of func, called the same way.

par_args

Tuple of additional parallelized arguments.

simple_args

Tuple of additional arguments passed identically to every invocation.

tol

The allowable error of the zero value.

maxiter

Maximum number of iterations.

parallel

Controls parallelization; default uses all available cores. See pwkit.parallel.make_parallel_helper().

kwargs

Passed to scipy.optimize.newton().

Returns: an array of locations of zeros.

Finds zeros in parallel. The values x0, tol, maxiter, and the items of par_args should all be numeric, and may be N-dimensional Numpy arrays. They are all broadcast to a common shape, and one zero-finding run is performed for each element in the resulting array. The return value is an array of zero locations having the same shape as the common broadcast of the parameters named above.

The simple_args are passed to the function identically for each zero-finding run. They do not need to be Pickle-able.

Example:

>>> parallel_newton(lambda x, a: x - 2 * a, 2,
                    par_args=(np.arange(6),))
<<< array([  0.,   2.,   4.,   6.,   8.,  10.])
>>> parallel_newton(lambda x: np.sin(x), np.arange(6))
<<< array([  0.00000000e+00,   3.65526589e-26,   3.14159265e+00,
             3.14159265e+00,   3.14159265e+00,   6.28318531e+00])

pwkit.numutil.parallel_quad(func, a, b, par_args=(), simple_args=(), parallel=True, **kwargs)[source]

A parallelized version of scipy.integrate.quad().

Arguments are:

func

The function to integrate, called as f(x, [*par_args...], [*simple_args...]).

a

The lower limit(s) of integration.

b

The upper limit(s) of integration.

par_args

Tuple of additional parallelized arguments.

simple_args

Tuple of additional arguments passed identically to every invocation.

parallel

Controls parallelization; default uses all available cores. See pwkit.parallel.make_parallel_helper().

kwargs

Passed to scipy.integrate.quad(). Don’t set full_output to True.

Returns: integrals and errors; see below.

Computes many integrals in parallel. The values a, b, and the items of par_args should all be numeric, and may be N-dimensional Numpy arrays. They are all broadcast to a common shape, and one integral is performed for each element in the resulting array. If this common shape is (X,Y,Z), the return value has shape (2,X,Y,Z), where the subarray [0,…] contains the computed integrals and the subarray [1,…] contains the absolute error estimates. If a, b, and the items in par_args are all scalars, the return value has shape (2,).

The simple_args are passed to each integrand function identically for each integration. They do not need to be Pickle-able.

Example:

>>> parallel_quad(lambda x, u, v, q: u * x + v,
                  0, # a
                  [3, 4], # b
                  (np.arange(6).reshape((3,2)), np.arange(3).reshape((3,1))), # par_args
                  ('hello',),)

Computes six integrals and returns an array of shape (2,3,2). The functions that are evaluated are:

[[ 0*x + 0, 1*x + 0 ],
 [ 2*x + 1, 3*x + 1 ],
 [ 4*x + 2, 5*x + 2 ]]

and the bounds of the integrals are:

[[ (0, 3), (0, 4) ],
 [ (0, 3), (0, 4) ],
 [ (0, 3), (0, 4) ]]

In all cases the unused fourth parameter q is 'hello'.

Tophat and step functions

unit_tophat_ee(x)

Tophat function on the unit interval, left-exclusive and right-exclusive.

unit_tophat_ei(x)

Tophat function on the unit interval, left-exclusive and right-inclusive.

unit_tophat_ie(x)

Tophat function on the unit interval, left-inclusive and right-exclusive.

unit_tophat_ii(x)

Tophat function on the unit interval, left-inclusive and right-inclusive.

make_tophat_ee(lower, upper)

Return a ufunc-like tophat function on the defined range, left-exclusive and right-exclusive.

make_tophat_ei(lower, upper)

Return a ufunc-like tophat function on the defined range, left-exclusive and right-inclusive.

make_tophat_ie(lower, upper)

Return a ufunc-like tophat function on the defined range, left-inclusive and right-exclusive.

make_tophat_ii(lower, upper)

Return a ufunc-like tophat function on the defined range, left-inclusive and right-inclusive.

make_step_lcont(transition)

Return a ufunc-like step function that is left-continuous.

make_step_rcont(transition)

Return a ufunc-like step function that is right-continuous.

pwkit.numutil.unit_tophat_ee(x)[source]

Tophat function on the unit interval, left-exclusive and right-exclusive. Returns 1 if 0 < x < 1, 0 otherwise.

pwkit.numutil.unit_tophat_ei(x)[source]

Tophat function on the unit interval, left-exclusive and right-inclusive. Returns 1 if 0 < x <= 1, 0 otherwise.

pwkit.numutil.unit_tophat_ie(x)[source]

Tophat function on the unit interval, left-inclusive and right-exclusive. Returns 1 if 0 <= x < 1, 0 otherwise.

pwkit.numutil.unit_tophat_ii(x)[source]

Tophat function on the unit interval, left-inclusive and right-inclusive. Returns 1 if 0 <= x <= 1, 0 otherwise.

pwkit.numutil.make_tophat_ee(lower, upper)[source]

Return a ufunc-like tophat function on the defined range, left-exclusive and right-exclusive. Returns 1 if lower < x < upper, 0 otherwise.

pwkit.numutil.make_tophat_ei(lower, upper)[source]

Return a ufunc-like tophat function on the defined range, left-exclusive and right-inclusive. Returns 1 if lower < x <= upper, 0 otherwise.

pwkit.numutil.make_tophat_ie(lower, upper)[source]

Return a ufunc-like tophat function on the defined range, left-inclusive and right-exclusive. Returns 1 if lower <= x < upper, 0 otherwise.

pwkit.numutil.make_tophat_ii(lower, upper)[source]

Return a ufunc-like tophat function on the defined range, left-inclusive and right-inclusive. Returns 1 if lower <= x <= upper, 0 otherwise.

pwkit.numutil.make_step_lcont(transition)[source]

Return a ufunc-like step function that is left-continuous. Returns 1 if x > transition, 0 otherwise.

pwkit.numutil.make_step_rcont(transition)[source]

Return a ufunc-like step function that is right-continuous. Returns 1 if x >= transition, 0 otherwise.
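
A closing sketch showing the factory functions in action (values hypothetical):

import numpy as np
from pwkit import numutil

box = numutil.make_tophat_ie(2., 5.)  # 1 where 2 <= x < 5
step = numutil.make_step_rcont(3.)    # 1 where x >= 3

x = np.arange(7.)
print(box(x))   # [0, 0, 1, 1, 1, 0, 0]
print(step(x))  # [0, 0, 0, 1, 1, 1, 1]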