102.7. Masked array pitfalls¶
For the Rubin Science Platform at data.lsst.cloud.
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.2.0
Last verified to run: 2026-01-02
Repository: github.com/lsst/tutorial-notebooks
DOI: 10.11578/rubin/dc.20250909.20
Learning objective: How to correctly compute statistics on masked arrays by identifying and avoiding common numpy pitfalls.
LSST data products: CcdVisit table
Packages: lsst.rsp.get_tap_service
Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.
Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.
1. Introduction¶
A masked array pairs a raw data array with a boolean mask that flags invalid entries. Crucially, this mask layer exists even when all data are valid. Scientific errors can occur when standard functions ignore the mask and compute results from the underlying raw data. This tutorial demonstrates how to correctly handle masked arrays, which may have missing or invalid entries, to avoid silent scientific errors. Use the specific failure modes presented in this tutorial as a guide to the fundamental disconnect between standard numpy functions and the masked array structure.
While this tutorial investigates common operations like numpy.median, these risks extend to any function that ignores the internal mask mechanism. Develop a rigorous habit of verifying whether your statistical tools respect the mask. Adopt defensive strategies, such as calling .compressed(), which flattens the input array and discards masked values, or using modules designed for masked arrays like numpy.ma and scipy.stats.mstats, to keep your scientific results robust against invalid underlying data.
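As a minimal illustration of this disconnect (using a toy array rather than LSST data), the sketch below builds a masked array and computes its median by two mask-aware routes; both exclude the flagged entry:

```python
import numpy as np

# Toy masked array: the raw buffer holds a NaN that the mask flags as invalid.
raw = np.array([1.0, 2.0, np.nan, 4.0])
masked = np.ma.masked_invalid(raw)

# Mask-aware routes ignore the flagged entry; both return 2.0.
print(np.ma.median(masked))             # module designed for masked arrays
print(np.median(masked.compressed()))   # discard masked values first
```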
Note: The examples demonstrated herein are based on numpy 2.2.6, the default version on the RSP as of January 2, 2026. Future updates to the library may alter how these functions handle masked input, potentially rendering some examples obsolete.
1.1. Import packages¶
Import the Python packages numpy and scipy. From the lsst.rsp package, import the function for accessing the Table Access Protocol (TAP) service.
import numpy as np
import scipy.stats.mstats as scistats
from lsst.rsp import get_tap_service
1.2. Define parameters¶
Instantiate the TAP service.
service = get_tap_service("tap")
assert service is not None
Option to check the numpy version.
# print(np.__version__)
2. Retrieve data¶
Start with a simple query executing a cone search on the CcdVisit table for CCD metadata within 1 degree of the center of the 47 Tuc field, RA, Dec = 6.128, -72.090 degrees.
ra_cen = 6.128
dec_cen = -72.090
radius = 1.0
query = """
SELECT ra, psfSigma, ccdVisitId
FROM dp1.CcdVisit
WHERE CONTAINS(POINT('ICRS', ra, dec),
               CIRCLE('ICRS', {}, {}, {}))=1
""".format(ra_cen, dec_cen, radius)
Submit the query to the TAP service.
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
    job.raise_if_error()
Job phase is COMPLETED
2.1. Check column types¶
Fetch the results and store them as an Astropy table. Then check whether every column is a MaskedColumn object by testing for a mask attribute.
results = job.fetch_result().to_table()
masked_type_cols = [
    name for name in results.colnames
    if hasattr(results[name], 'mask')
]
all_masked = len(masked_type_cols) == len(results.colnames)
if all_masked:
    print(f"All {len(results.colnames)} columns are MaskedColumn objects.")
else:
    print("Some columns are standard arrays (not masked).")
All 3 columns are MaskedColumn objects.
2.2. Identify active masks¶
While it is confirmed that every column is technically a MaskedColumn object, this does not mean every column contains bad data. Many columns (like IDs or coordinates) may be fully populated.
Inspect each column's boolean mask using np.any(results[name].mask) to check if at least one entry is flagged as True (invalid). Based on this check, separate the columns into two categories:
- Active Masks: Columns that contain at least one invalid/flagged entry.
- Fully Valid: Columns that are entirely clean (all mask values are False).
active_mask_cols = [
    name for name in results.colnames
    if np.any(results[name].mask)
]
fully_valid_cols = [
    name for name in results.colnames
    if name not in active_mask_cols
]
print("Columns with missing/invalid data:")
print(f" -> {active_mask_cols}\n")
print("Columns that are fully valid:")
print(f" -> {fully_valid_cols}")
Columns with missing/invalid data:
 -> ['psfSigma']

Columns that are fully valid:
 -> ['ra', 'ccdVisitId']
3. Compute statistics with numpy¶
Compute statistics for both "active mask" columns (those containing invalid entries) and "fully valid" columns (those with only valid entries) to investigate how different functions/methods handle valid and invalid entries.
Define a data_sources dictionary containing one sample column from each category (fully_valid_cols and active_mask_cols) to streamline the upcoming tests.
data_sources = {
    "A fully valid column": results[fully_valid_cols[0]],
    "An active mask column": results[active_mask_cols[0]]
}
3.1. Mean¶
Iterate through the selected columns and display the output of standard numpy functions (np.mean, np.nanmean) against explicit mask-handling techniques (via np.ma or .compressed()) to reveal differences in resultant statistics and precision.
print(f"{'Source': <21} | {'dtype_in'} | {'dtype_out'} | {'Method': <13} | {'Result'}")
print("-" * 83)
for name, data in data_sources.items():
    input_dtype = str(data.dtype)
    out_dtype_mean = str(np.mean(data).dtype)
    out_dtype_nanmean = str(np.nanmean(data).dtype)
    out_dtype_mamean = str(np.ma.mean(data).dtype)
    out_dtype_compmean = str(np.mean(data.compressed()).dtype)
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_mean: <9} | {'np.mean': <13} | {np.mean(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_nanmean: <9} | {'np.nanmean': <13} | {np.nanmean(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_mamean: <9} | {'np.ma.mean': <13} | {np.ma.mean(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_compmean: <9} | {'.compressed()': <13} | {np.mean(data.compressed()): .17f}")
    print("-" * 83)
Source                | dtype_in | dtype_out | Method        |  Result
-----------------------------------------------------------------------------------
A fully valid column  | float64  | float64   | np.mean       |  6.06579679440014097
A fully valid column  | float64  | float64   | np.nanmean    |  6.06579679440014097
A fully valid column  | float64  | float64   | np.ma.mean    |  6.06579679440014097
A fully valid column  | float64  | float64   | .compressed() |  6.06579679440014097
-----------------------------------------------------------------------------------
An active mask column | float32  | float64   | np.mean       |  2.65416754138675604
An active mask column | float32  | float32   | np.nanmean    |  2.65416765213012695
An active mask column | float32  | float64   | np.ma.mean    |  2.65416754138675604
An active mask column | float32  | float32   | .compressed() |  2.65416789054870605
-----------------------------------------------------------------------------------
Conclusion: Calculating the mean works correctly across all four methods for both "fully valid" and "active mask" columns. However, notice that while the "fully valid" column (input dtype np.float64) yields identical results, the "active mask" column (input dtype np.float32) shows slight variations. This discrepancy arises because different reduction functions handle accumulator precision differently: some promote np.float32 inputs to np.float64 for the internal calculation (yielding higher precision), while others compute and return results in the original np.float32. Because the input data types provided by the LSST Science Pipelines are fixed, verify the return type of your chosen reduction functions.
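The accumulator effect described above can be reproduced on a toy np.float32 array (the array below is illustrative, not tutorial data): passing dtype=np.float64 to np.mean forces a double-precision accumulator regardless of the input dtype.

```python
import numpy as np

# One million float32 values; accumulating them in float32 loses precision.
rng = np.random.default_rng(0)
x = rng.random(1_000_000).astype(np.float32)

m32 = np.mean(x)                    # float32 accumulator and result
m64 = np.mean(x, dtype=np.float64)  # float64 accumulator and result
print(m32.dtype, m64.dtype)         # float32 float64
```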
3.2. Median¶
Repeat the same analysis for the median.
Warning: When you first run this notebook tutorial, the following cell produces a warning stating that "'partition' will ignore the 'mask' of the MaskedColumn". Disregard this alert; the code intentionally triggers this behavior to demonstrate how np.median and np.nanmedian fail to respect MaskedColumn masks.
print(f"{'Source': <21} | {'dtype_in'} | {'dtype_out'} | {'Method': <13} | {'Result'}")
print("-" * 83)
for name, data in data_sources.items():
    input_dtype = str(data.dtype)
    out_dtype_median = str(np.median(data).dtype)
    out_dtype_nanmedian = str(np.nanmedian(data).dtype)
    out_dtype_mamedian = str(np.ma.median(data).dtype)
    out_dtype_compmedian = str(np.median(data.compressed()).dtype)
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_median: <9} | {'np.median': <13} | {np.median(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_nanmedian: <9} | {'np.nanmedian': <13} | {np.nanmedian(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_mamedian: <9} | {'np.ma.median': <13} | {np.ma.median(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_compmedian: <9} | {'.compressed()': <13} | {np.median(data.compressed()): .17f}")
    print("-" * 83)
Source                | dtype_in | dtype_out | Method        |  Result
-----------------------------------------------------------------------------------
A fully valid column  | float64  | float64   | np.median     |  6.06872133075949094
A fully valid column  | float64  | float64   | np.nanmedian  |  6.06872133075949094
A fully valid column  | float64  | float64   | np.ma.median  |  6.06872133075949094
A fully valid column  | float64  | float64   | .compressed() |  6.06872133075949094
-----------------------------------------------------------------------------------
An active mask column | float32  | float32   | np.median     |  nan
An active mask column | float32  | float32   | np.nanmedian  |  nan
An active mask column | float32  | float32   | np.ma.median  |  2.58667993545532227
An active mask column | float32  | float32   | .compressed() |  2.58667993545532227
-----------------------------------------------------------------------------------
/opt/lsst/software/stack/conda/envs/lsst-scipipe-10.1.0/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:868: UserWarning: Warning: 'partition' will ignore the 'mask' of the MaskedColumn. a.partition(kth, axis=axis, kind=kind, order=order)
Conclusion: Both np.median and np.nanmedian returned nan for the "active mask" array. These functions ignored the mask and accessed the underlying invalid data (likely NaN or other bad values), which propagated into and corrupted the final result. In contrast, both np.ma.median and the .compressed() method handled the mask correctly, excluding the invalid entries and producing the correct result. Users should still remain cautious about the precision match between the input array and the returned value. Always use functions or methods that correctly handle the mask when computing the median.
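Note that the underlying invalid data need not be NaN: masks often hide sentinel values, which a mask-unaware statistic would silently absorb rather than turn into nan. A toy sketch (not the CcdVisit column) makes the failure visible:

```python
import numpy as np

# -99.0 is a sentinel for "missing"; masked_values flags it as invalid.
data = np.ma.masked_values([1.0, 2.0, -99.0, 3.0], -99.0)

# Mask-unaware: the raw buffer includes the sentinel, dragging the median down.
naive = np.median(data.data)   # 1.5, silently wrong
# Mask-aware: median of the valid entries [1, 2, 3].
safe = np.ma.median(data)      # 2.0
```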
3.3. Quantile/Percentile¶
Repeat the same analysis for the quantile. Percentile calculations share the exact same behavior regarding masked data and precision types, so the findings here apply to both.
Warning: When you first run this notebook tutorial, the following cell produces a warning stating that "'partition' will ignore the 'mask' of the MaskedColumn". Disregard this alert; the code intentionally triggers this behavior to demonstrate how np.quantile and np.nanquantile fail to respect MaskedColumn masks.
print(f"{'Source': <21} | {'dtype_in'} | {'dtype_out'} | {'Method': <14} | {'Result'}")
print("-" * 84)
for name, data in data_sources.items():
    input_dtype = str(data.dtype)
    out_dtype_quantile = str(np.quantile(data, 0.5).dtype)
    out_dtype_nanquantile = str(np.nanquantile(data, 0.5).dtype)
    out_dtype_compquantile = str(np.quantile(data.compressed(), 0.5).dtype)
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_quantile: <9} | {'np.quantile': <14} | {np.quantile(data, 0.5): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_nanquantile: <9} | {'np.nanquantile': <14} | {np.nanquantile(data, 0.5): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {'N/A': <9} | {'np.ma.quantile': <14} | {' Not available'}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_compquantile: <9} | {'.compressed()': <14} | {np.quantile(data.compressed(), 0.5): .17f}")
    print("-" * 84)
Source                | dtype_in | dtype_out | Method         |  Result
------------------------------------------------------------------------------------
A fully valid column  | float64  | float64   | np.quantile    |  6.06872133075949094
A fully valid column  | float64  | float64   | np.nanquantile |  6.06872133075949094
A fully valid column  | float64  | N/A       | np.ma.quantile |  Not available
A fully valid column  | float64  | float64   | .compressed()  |  6.06872133075949094
------------------------------------------------------------------------------------
An active mask column | float32  | float32   | np.quantile    |  nan
An active mask column | float32  | float32   | np.nanquantile |  nan
An active mask column | float32  | N/A       | np.ma.quantile |  Not available
An active mask column | float32  | float32   | .compressed()  |  2.58667993545532227
------------------------------------------------------------------------------------
/opt/lsst/software/stack/conda/envs/lsst-scipipe-10.1.0/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:4842: UserWarning: Warning: 'partition' will ignore the 'mask' of the MaskedColumn. arr.partition(
Conclusion: Similar to the median functions, np.quantile and np.nanquantile fail on the "active mask" array because they do not properly handle the mask. Since numpy.ma currently lacks a direct np.quantile equivalent, the only robust approach is to perform these operations on compressed data. Always compress your MaskedColumn before calculating quantiles by explicitly calling .compressed() to remove invalid entries. This conclusion also applies to percentile calculations.
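The compress-first pattern for quantiles and percentiles, sketched on a toy masked array (not LSST data):

```python
import numpy as np

# Mask a sentinel value, then keep only valid entries as a plain ndarray.
data = np.ma.masked_values([1.0, 2.0, -99.0, 3.0, 4.0], -99.0)
valid = data.compressed()        # array([1., 2., 3., 4.])

# Quantile and percentile now operate on valid data only.
q25 = np.quantile(valid, 0.25)   # 1.75
p75 = np.percentile(valid, 75)   # 3.25
```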
3.4. Min/Max¶
Repeat the same analysis for the minimum. Maximum calculations share the exact same behavior regarding masked data and precision types, so the findings here apply to both.
print(f"{'Source': <21} | {'dtype_in'} | {'dtype_out'} | {'Method': <13} | {'Result'}")
print("-" * 83)
for name, data in data_sources.items():
    input_dtype = str(data.dtype)
    out_dtype_min = str(np.min(data).dtype)
    out_dtype_nanmin = str(np.nanmin(data).dtype)
    out_dtype_mamin = str(np.ma.min(data).dtype)
    out_dtype_compmin = str(np.min(data.compressed()).dtype)
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_min: <9} | {'np.min': <13} | {np.min(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_nanmin: <9} | {'np.nanmin': <13} | {np.nanmin(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_mamin: <9} | {'np.ma.min': <13} | {np.ma.min(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_compmin: <9} | {'.compressed()': <13} | {np.min(data.compressed()): .17f}")
    print("-" * 83)
Source                | dtype_in | dtype_out | Method        |  Result
-----------------------------------------------------------------------------------
A fully valid column  | float64  | float64   | np.min        |  4.46371685993863654
A fully valid column  | float64  | float64   | np.nanmin     |  4.46371685993863654
A fully valid column  | float64  | float64   | np.ma.min     |  4.46371685993863654
A fully valid column  | float64  | float64   | .compressed() |  4.46371685993863654
-----------------------------------------------------------------------------------
An active mask column | float32  | float32   | np.min        |  0.28867501020431519
An active mask column | float32  | float32   | np.nanmin     |  0.28867501020431519
An active mask column | float32  | float32   | np.ma.min     |  0.28867501020431519
An active mask column | float32  | float32   | .compressed() |  0.28867501020431519
-----------------------------------------------------------------------------------
Conclusion: Both the "fully valid" and "active mask" columns yield consistent results across all four methods. This conclusion also applies to the maximum calculations.
3.5. Standard deviation and variance¶
Repeat the same analysis for the standard deviation. Variance calculations share the exact same behavior regarding masked data and precision types, so the findings here apply to both.
print(f"{'Source': <21} | {'dtype_in'} | {'dtype_out'} | {'Method': <13} | {'Result'}")
print("-" * 83)
for name, data in data_sources.items():
    input_dtype = str(data.dtype)
    out_dtype_std = str(np.std(data).dtype)
    out_dtype_nanstd = str(np.nanstd(data).dtype)
    out_dtype_mastd = str(np.ma.std(data).dtype)
    out_dtype_compstd = str(np.std(data.compressed()).dtype)
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_std: <9} | {'np.std': <13} | {np.std(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_nanstd: <9} | {'np.nanstd': <13} | {np.nanstd(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_mastd: <9} | {'np.ma.std': <13} | {np.ma.std(data): .17f}")
    print(f"{name: <21} | {input_dtype: <8} | {out_dtype_compstd: <9} | {'.compressed()': <13} | {np.std(data.compressed()): .17f}")
    print("-" * 83)
Source                | dtype_in | dtype_out | Method        |  Result
-----------------------------------------------------------------------------------
A fully valid column  | float64  | float64   | np.std        |  0.70406182027289055
A fully valid column  | float64  | float64   | np.nanstd     |  0.70406182027289055
A fully valid column  | float64  | float64   | np.ma.std     |  0.70406182027289055
A fully valid column  | float64  | float64   | .compressed() |  0.70406182027289055
-----------------------------------------------------------------------------------
An active mask column | float32  | float64   | np.std        |  0.48106138027025397
An active mask column | float32  | float32   | np.nanstd     |  0.48106136918067932
An active mask column | float32  | float64   | np.ma.std     |  0.48106138027025397
An active mask column | float32  | float32   | .compressed() |  0.48106136918067932
-----------------------------------------------------------------------------------
Conclusion: Both the "fully valid" and "active mask" columns yield consistent results across all four methods. This conclusion also applies to the variance calculations.
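The standard deviation is safe here because np.std delegates to the array's own .std method, which for a MaskedColumn is mask-aware. A toy sketch (not LSST data) contrasts this with what a computation on the raw buffer would have produced:

```python
import numpy as np

data = np.ma.masked_values([1.0, 2.0, -99.0, 4.0], -99.0)

# np.std delegates to MaskedArray.std, so the sentinel is excluded.
safe = np.std(data)
# Operating on the raw .data buffer includes -99.0 and inflates the spread.
naive = np.std(data.data)
print(safe, naive)
```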
3.6. Summary¶
While functions like np.mean correctly handle MaskedColumn objects, others (e.g., np.median) do not respect the mask layer and instead operate on the underlying raw array. This exposes invalid data to the calculation, resulting in errors or scientifically incorrect values.
A safe recommendation when using numpy for MaskedColumn objects is to compress them before performing computations, or to use the numpy.ma module when the desired functions are available.
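This recommendation can be captured in a small convenience wrapper (a hypothetical helper sketched here, not part of any LSST package): compress masked input before applying any reduction, including those with no numpy.ma equivalent.

```python
import numpy as np

def safe_reduce(func, col):
    """Apply a reduction only to the valid entries of array-like `col`.

    Hypothetical helper: plain arrays pass through unchanged because
    np.ma.asanyarray gives them an all-False mask.
    """
    return func(np.ma.asanyarray(col).compressed())

# Usage sketch: works even for quantiles, which numpy.ma does not provide.
m = np.ma.masked_invalid([1.0, 2.0, np.nan, 4.0])
print(safe_reduce(np.median, m))                      # 2.0
print(safe_reduce(lambda a: np.quantile(a, 0.5), m))  # 2.0
```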
4. Compute statistics with scipy.stats.mstats¶
scipy.stats.mstats is a specialized sub-module within scipy dedicated to statistical functions for masked arrays, including MaskedColumn objects. It serves as the masked-array equivalent of the standard scipy.stats module. While numpy.ma mainly provides basic reductions (like mean and median), mstats offers a broader suite of statistical tools that automatically respect the input mask, including correlation functions such as the Spearman rank-order correlation and statistical tests like Pearson's chi-squared and the Kolmogorov-Smirnov test. This eliminates the need to manually compress arrays before performing advanced statistical analysis.
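For example, the Spearman rank-order correlation in mstats discards masked pairs automatically; a minimal sketch on toy arrays (not LSST columns):

```python
import numpy as np
import scipy.stats.mstats as scistats

x = np.ma.masked_invalid([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.ma.array([2.0, 4.0, 6.0, 8.0, 10.0])

# The pair containing the masked x entry is dropped; the remaining pairs
# are perfectly monotonic, so the correlation is 1.
rho, pval = scistats.spearmanr(x, y)
print(rho)
```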
4.1. Compare with numpy¶
Compare the 0.75 quantile (75th percentile) results for the "active mask" column using mstats versus numpy with .compressed().
target_percentile = 0.75
data = data_sources['An active mask column']
scipy_val = scistats.mquantiles(data, prob=[target_percentile])[0]
numpy_val = np.quantile(data.compressed(), target_percentile)
print(f"{'Method': <24} | {'dtype_out': <9} | {'Result'}")
print("-" * 59)
print(f"{'mstats': <24} | {str(scipy_val.dtype): <9} | {scipy_val: .17f}")
print(f"{'numpy with .compressed()': <24} | {str(numpy_val.dtype): <9} | {numpy_val: .17f}")
Method                   | dtype_out | Result
-----------------------------------------------------------
mstats                   | float64   |  2.82579901218414298
numpy with .compressed() | float32   |  2.82483005523681641
Conclusion: The two methods returned similar, but not identical, results. The small difference arises partly because mquantiles computes in np.float64, and largely because mquantiles uses a different default quantile estimator (plotting positions with alphap=betap=0.4) than numpy's linear interpolation.
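When an exact match to numpy's estimator is needed, mquantiles accepts plotting-position parameters: alphap=1, betap=1 reproduces the type-7 (linear interpolation) definition that numpy uses by default. A sketch on a toy masked array:

```python
import numpy as np
import scipy.stats.mstats as scistats

data = np.ma.masked_invalid([1.0, 2.0, np.nan, 3.0, 4.0])

# alphap=betap=1 selects the R type-7 estimator, matching np.quantile.
a = scistats.mquantiles(data, prob=[0.75], alphap=1, betap=1)[0]
b = np.quantile(data.compressed(), 0.75)
print(a, b)
```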
Option to list all the functions available in the scipy.stats.mstats module.
# fns = [fn for fn in dir(scistats) if not fn.startswith('_')]
# print(fns)
5. Exercise for the learner¶
Retrieve a table you use frequently and check the data types of the columns most relevant to your research. Then, repeat the analysis presented in this tutorial.