207.2. Timeseries distributions#

207_2_Timeseries_distributions

207.2. Timeseries distributions¶

For the Rubin Science Platform at data.lsst.cloud.
Data Release: Data Preview 1
Container Size: Large
LSST Science Pipelines version: r29.2.0
Last verified to run: 2025-12-30
Repository: github.com/lsst/tutorial-notebooks
DOI: 10.11578/rubin/dc.20250909.20

Learning objective: Understand the distributions of timeseries features values in the DiaObject table.

LSST data products: DiaObject, DiaSource

Packages:

Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.

Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.

1. Introduction¶

This tutorial illustrates the distributions of the timeseries features values in the DiaObject table, in the context of the limited time baseline and cadence of Data Preview 1.

This tutorial serves as a guide for interpreting the timerseries features values, and an illustration of "inlier" and "outlier" values. For each of the timeseries features this tutorial plots their distributions and correlations with other features, and marks the values of a known variable star and a known extragalactic transient as examples.

The values and distributions of these timeseries features - and thus what constitutes "inliers" and "outliers" - will be different in future data releases, for a few reasons. One reason is that future values would be derived from observations obtained with different cadences and seasons (the data for DP1 was obtained over only a seven week period). Another reason is that the DP1 template images were made with the third of the images from that seven week period which had the best seeing. Astrophysical transients are thus present in the template image, leading to over-subtractions and negative difference-image fluxes. In the future, templates will be made from data obtained the year before.

For future data releases, additional timeseries features are likely to be included, as discussed in Data Management Tech Note 118 Review of Timeseries Features (DMTN-118).

Related tutorials: The 200-level tutorials on the DiaObject and DiaSource tables, and the other tutorial in the 207 series on timeseries features.

1.1. Import packages¶

Import numpy, a fundamental package for scientific computing with arrays in Python (numpy.org), and matplotlib, a comprehensive library for data visualization (matplotlib.org; matplotlib gallery).

From the lsst package, import modules for accessing the Table Access Protocol (TAP) service.

In [1]:

import numpy as np
import matplotlib.pyplot as plt
from lsst.rsp import get_tap_service

1.2. Define parameters and functions¶

Get an instance of the TAP service, and assert that it exists.

In [2]:

service = get_tap_service("tap")
assert service is not None

Define two functions: one to plot a standard histogram for a given timeseries feature with vertical lines to mark the two example objects, and one to plot a standard scatter plot to compare two features with colored edges for the points representing the two example objects. This function will only work in this notebook, while the datasets retrieved in Section 2 are in memory.

In [3]:

def feature_histogram(feature, nbins, bounds=None):
    """
    Plot the histogram for a given feature, and mark the values
      of the known variable star and transient.

    Parameters
    ----------
    feature: str
        Column name of the timeseries feature.
    nbins: int
        Number of histogram bins.
    bounds: float[4]
        A four-element array of x1, x2, y1, y2.
    """

    plt.figure(figsize=(6, 3))
    temp = np.concatenate((np.asarray(star_sample_diaObject[feature]),
                           np.asarray(at_sample_diaObject[feature])))
    use_bins = np.linspace(np.min(temp), np.max(temp), nbins)
    plt.hist(star_sample_diaObject[feature], bins=use_bins, histtype='step', log=True,
             lw=3, alpha=0.4, color='black', label='Low-b field')
    plt.hist(at_sample_diaObject[feature], bins=use_bins, histtype='step', log=True,
             lw=1, color='black', label='ECDFS')
    plt.axvline(star_diaObject[feature], lw=3, alpha=0.4, color='black', label='star')
    plt.axvline(at_diaObject[feature], lw=1, color='black', label='transient')
    plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
    plt.xlabel(feature)
    plt.ylabel('N(diaObjects)')
    if bounds is not None:
        plt.xlim([bounds[0], bounds[1]])
        plt.ylim([bounds[2], bounds[3]])
    plt.tight_layout()
    plt.show()

In [4]:

def feature_scatter(feature1, feature2, bounds=None):
    """
    Create a scatter plot of the relation between two features,
      and mark the points for the known variable star and transient.

    Parameters
    ----------
    feature: str
        Column name of the timeseries feature.
    nbins: int
        Number of histogram bins.
    bounds: float[4]
        A four-element array of x1, x2, y1, y2.
    """

    plt.figure(figsize=(6, 3))
    plt.plot(star_sample_diaObject[feature1], star_sample_diaObject[feature2],
             'o', ms=3, alpha=0.2, mew=0, color='black', label='Low-b field')
    plt.plot(at_sample_diaObject[feature1], at_sample_diaObject[feature2],
             'o', ms=1, alpha=0.8, mew=0, color='black', label='ECDFS')
    plt.plot(star_diaObject[feature1], star_diaObject[feature2],
             '*', ms=10, mew=1, mec='cyan', color='black', label='star')
    plt.plot(at_diaObject[feature1], at_diaObject[feature2],
             's', ms=6, mew=1, mec='cyan', color='black', label='transient')
    plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
    plt.xlabel(feature1)
    plt.ylabel(feature2)
    if bounds is not None:
        plt.xlim([bounds[0], bounds[1]])
        plt.ylim([bounds[2], bounds[3]])
    plt.tight_layout()
    plt.show()

2. Retrieve data¶

Retrieve the column names for all of the $r$-band timeseries features from the DiaObject table. Create a comma-separated list of columns to retrieve from the DiaObject table, and add the nDiaSources column.

In [5]:

query = "SELECT column_name " \
        "FROM tap_schema.columns " \
        "WHERE table_name = 'dp1.DiaObject'"
results = service.search(query).to_table()

columns_list = ''
for name in results['column_name']:
    if name.find('r_') == 0:
        columns_list += name + ', '
columns_list += 'nDiaSources'

del query, results

2.1. Variable star¶

This tutorial uses a known SX Phoenicis-type pulsating variable star captured in the Low Galactic Latitude ("Low-$b$") Field of DP1: Gaia DR3 2912281258855051520 (see Carlin et al. 2025).

In [6]:

star_diaObjectId = 614435753623027782

Retrieve all columns in the list from the DiaObject table for the row corresponding to the variable star.

In [7]:

query = """SELECT ra, dec, {} FROM dp1.DiaObject WHERE diaObjectId = {}
        """.format(columns_list, star_diaObjectId)
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
    job.raise_if_error()

Job phase is COMPLETED

In [8]:

assert job.phase == 'COMPLETED'
star_diaObject = job.fetch_result().to_table()
assert len(star_diaObject) == 1

Option to view star_diaObject.

In [9]:

# star_diaObject

Retrieve the $r$-band difference-image (psfFlux) and direct-image (scienceFlux) measurements from the DiaSource table, their errors, and the midpoint of the exposure time as a Modified Julian Date (MJD).

In [10]:

query = """SELECT psfFlux, psfFluxErr, scienceFlux,
        scienceFluxErr, band, midpointMjdTai
        FROM dp1.DiaSource
        WHERE band = 'r' AND diaObjectId = {}
        """.format(str(star_diaObjectId))
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
    job.raise_if_error()

Job phase is COMPLETED

In [11]:

assert job.phase == 'COMPLETED'
star_diaSources = job.fetch_result().to_table()
assert len(star_diaSources) == 68

Visualize the measured fluxes.

In [12]:

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 4), sharex=True)
ax1.errorbar(star_diaSources['midpointMjdTai'], star_diaSources['psfFlux'],
             yerr=star_diaSources['psfFluxErr'],
             fmt='o', ms=5, alpha=0.5, mew=0, color='black')
ax2.errorbar(star_diaSources['midpointMjdTai'], star_diaSources['scienceFlux'],
             yerr=star_diaSources['scienceFluxErr'],
             fmt='s', ms=5, alpha=0.5, mew=0, color='black')
ax1.set_ylabel('psfFlux [nJy]')
ax2.set_ylabel('scienceFlux [nJy]')
ax2.set_xlabel('MJD [d]')
plt.suptitle('r-band DiaSource fluxes vs. time for a known variable star')
plt.tight_layout()
plt.show()

No description has been provided for this image

Figure 1: The psfFlux (difference image flux; top) and scienceFlux (direct image flux; bottom) for all visits in which the star was detected with SNR>5 in the difference image, versus the MJD of the visit, for the $r$-band. The errors bars on the flux are, for most data points, relatively too small to be seen.

2.2. Extragalactic transient¶

This tutorial uses an extragalactic astrophysical transient (AT) that was captured in the Extended Chandra Deep Field South (ECDFS) for DP1: AT 2024aigg.

In [13]:

at_diaObjectId = 611255759837069401

Retrieve the DiaObject record and the associated DiaSource records, and plot the lightcurve.

In [14]:

query = """SELECT ra, dec, {} FROM dp1.DiaObject WHERE diaObjectId = {}
        """.format(columns_list, at_diaObjectId)
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
    job.raise_if_error()

Job phase is COMPLETED

In [15]:

assert job.phase == 'COMPLETED'
at_diaObject = job.fetch_result().to_table()
assert len(at_diaObject) == 1

In [16]:

# at_diaObject

In [17]:

query = """SELECT psfFlux, psfFluxErr, scienceFlux,
        scienceFluxErr, band, midpointMjdTai
        FROM dp1.DiaSource
        WHERE band = 'r' AND diaObjectId = {}
        """.format(str(at_diaObjectId))
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
    job.raise_if_error()

Job phase is COMPLETED

In [18]:

assert job.phase == 'COMPLETED'
at_diaSources = job.fetch_result().to_table()
assert len(at_diaSources) == 91

In [19]:

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 4), sharex=True)
ax1.errorbar(at_diaSources['midpointMjdTai'], at_diaSources['psfFlux'],
             yerr=at_diaSources['psfFluxErr'],
             fmt='o', ms=5, alpha=0.5, mew=0, color='black')
ax2.errorbar(at_diaSources['midpointMjdTai'], at_diaSources['scienceFlux'],
             yerr=at_diaSources['scienceFluxErr'],
             fmt='s', ms=5, alpha=0.5, mew=0, color='black')
ax1.set_ylabel('psfFlux [nJy]')
ax2.set_ylabel('scienceFlux [nJy]')
ax2.set_xlabel('MJD [d]')
plt.suptitle('r-band DiaSource fluxes vs. time for an extragalactic transient')
plt.tight_layout()
plt.show()

Figure 2: The psfFlux (difference image flux; top) and scienceFlux (direct image flux; bottom) for all visits in which the star was detected with SNR>5 in the difference image, versus the MJD of the visit, for the $r$-band. The errors bars on the flux are, for most data points, relatively too small to be seen.

Notice that visits for which this object's brightness was similar to that of its brightness in the template image are missing from the lightcurve above (i.e., visits around MJD $\sim 606404$ days, when psfFlux was near 0 nJy).

Notice also that because this object is an astrophysical transient embedded in its host galaxy, the scienceFlux measurements are contaminated by host galaxy light. This contamination is not simply a "pedestal" (i.e., are not a constant amount added to the psfFlux) because the forced photometry uses the PSF of the image. For images with worse seeing, the PSF FWHM is larger and the forced flux measurement includes more host galaxy light. The timeseries features based on scienceFlux will reflect this contamination.

2.3. Samples of diaObjects¶

Retrieve two samples of diaObjects: one from the same DP1 field as the star (the "Low-$b$" field), and one from the same DP1 field as the transient (the ECDFS field).

To define the cone search, use the coordinates of the star and transient, and a 2.0 deg radius, which will encompass the relevant DP1 field. Print the coordinates of the star and transient.

In [20]:

print('star coordinates (Low-b field):      %9.6f %10.6f' %
      (float(star_diaObject['ra']), float(star_diaObject['dec'])))
print('transient coordinates (ECDFS field): %9.6f %10.6f' %
      (float(at_diaObject['ra']), float(at_diaObject['dec'])))

star coordinates (Low-b field):      94.922635 -25.231851
transient coordinates (ECDFS field): 53.124768 -27.739815

Set maximum and minimum values to apply to the psfFluxMean and scienceFluxMean values of diaObjects. Use a magnitude of 17.5 mag, and convert it into positive and negative difference-image fluxes. These flux constraints will help to omit sources that are anywhere near saturation from the samples.

In [21]:

flux_max = np.power(10, (17.5-31.4)/(-2.5))
flux_min = -1.0 * flux_max
print('difference flux max and min: %9.0f %9.0f' % (flux_max, flux_min))

difference flux max and min:    363078   -363078

Apply the constraints in the ADQL query statement, and add a requirement that only diaObjects which were detected at least 10 times in the $r$-band be retrieved. This will constraint will mean that the timeseries features used in this tutorial are all measured from at least 10 detections.

Create the query for the star's field (the "Low-$b$" field) and submit the query job.

In [22]:

query = """SELECT ra, dec, {} FROM dp1.DiaObject
        WHERE CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', {}, {}, 2.0)) = 1
        AND r_psfFluxMean < {} AND r_psfFluxMean > {}
        AND r_psfFluxMax < {} AND r_psfFluxMin > {}
        AND r_scienceFluxMean < {} AND r_psfFluxNdata > 10
        """.format(columns_list, float(star_diaObject['ra']), float(star_diaObject['dec']),
                   flux_max, flux_min, flux_max, flux_min, flux_max)
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
    job.raise_if_error()

Job phase is COMPLETED

In [23]:

assert job.phase == 'COMPLETED'
star_sample_diaObject = job.fetch_result().to_table()
print(len(star_sample_diaObject))
assert len(star_sample_diaObject) == 2769

Create the query for the transients's field (the ECDFS) and submit the query job.

In [24]:

query = """SELECT ra, dec, {} FROM dp1.DiaObject
        WHERE CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', {}, {}, 2.0)) = 1
        AND r_psfFluxMean < {} AND r_psfFluxMean > {}
        AND r_psfFluxMax < {} AND r_psfFluxMin > {}
        AND r_scienceFluxMean < {} AND r_psfFluxNdata > 10
        """.format(columns_list, float(at_diaObject['ra']), float(at_diaObject['dec']),
                   flux_max, flux_min, flux_max, flux_min, flux_max)
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
    job.raise_if_error()

Job phase is COMPLETED

In [25]:

assert job.phase == 'COMPLETED'
at_sample_diaObject = job.fetch_result().to_table()
print(len(at_sample_diaObject))
assert len(at_sample_diaObject) == 919

3. Timeseries features¶

For each of the timeseries features in turn, generate histograms and scatter plots to illustrate the distributions and correlations between the values. On each plot, mark the values of the known variable star and transient.

Recall that, as explained in the introduction, the timeseries features for future data releases will have different values, because they will be derived from datasets with very different cadences and observing windows than Data Preview 1.

3.1. Ndata¶

Create the histogram for the psfFluxNdata feature.

In [26]:

feature_histogram('r_psfFluxNdata', 30)

Figure 3: The number of diaObjects with a given number of diaSources (number of difference image detections) in the $r$-band. While the star is in the tail of its population's distribution, the transient is not.

3.2. Min, Max, Mean¶

Option to display any or all of the standard histograms for these features.

In [27]:

# feature_histogram('r_psfFluxMin', 50)
# feature_histogram('r_psfFluxMax', 50)
# feature_histogram('r_psfFluxMean', 50)

Instead of considering the minimum and maximum, derive the difference-flux amplitude as the maximum, minus the minimum: $A = f_{max}-f_{min}$. The flux amplitude will always be positive, and so can be logged.

In [28]:

star_diaObject['r_psfFluxAmp'] = star_diaObject['r_psfFluxMax'] - star_diaObject['r_psfFluxMin']
at_diaObject['r_psfFluxAmp'] = at_diaObject['r_psfFluxMax'] - at_diaObject['r_psfFluxMin']
star_sample_diaObject['r_psfFluxAmp'] = star_sample_diaObject['r_psfFluxMax'] \
    - star_sample_diaObject['r_psfFluxMin']
at_sample_diaObject['r_psfFluxAmp'] = at_sample_diaObject['r_psfFluxMax'] \
    - at_sample_diaObject['r_psfFluxMin']

star_diaObject['r_psfFluxAmpLog'] = np.log10(star_diaObject['r_psfFluxAmp'])
at_diaObject['r_psfFluxAmpLog'] = np.log10(at_diaObject['r_psfFluxAmp'])
star_sample_diaObject['r_psfFluxAmpLog'] = np.log10(star_sample_diaObject['r_psfFluxAmp'])
at_sample_diaObject['r_psfFluxAmpLog'] = np.log10(at_sample_diaObject['r_psfFluxAmp'])

Display the histogram for the newly-derived difference-flux amplitude feature.

In [29]:

feature_histogram('r_psfFluxAmpLog', 50)

Figure 4: The distribution of the log of the amplitude values. The star's amplitude is in the tail of the "Low-$b$" field sample, and the transient is not. However, as shown in Figure 2, only part of the transient's lightcurve was observed by DP1.

3.3. Sigma, MAD, Chi2¶

Add the logarithm for each of the Sigma, MAD, and Chi2 features to the tables.

In [30]:

star_sample_diaObject['r_psfFluxSigmaLog'] = np.log10(star_sample_diaObject['r_psfFluxSigma'])
at_sample_diaObject['r_psfFluxSigmaLog'] = np.log10(at_sample_diaObject['r_psfFluxSigma'])
star_diaObject['r_psfFluxSigmaLog'] = np.log10(star_diaObject['r_psfFluxSigma'])
at_diaObject['r_psfFluxSigmaLog'] = np.log10(at_diaObject['r_psfFluxSigma'])

star_sample_diaObject['r_psfFluxMADLog'] = np.log10(star_sample_diaObject['r_psfFluxMAD'])
at_sample_diaObject['r_psfFluxMADLog'] = np.log10(at_sample_diaObject['r_psfFluxMAD'])
star_diaObject['r_psfFluxMADLog'] = np.log10(star_diaObject['r_psfFluxMAD'])
at_diaObject['r_psfFluxMADLog'] = np.log10(at_diaObject['r_psfFluxMAD'])

star_sample_diaObject['r_psfFluxChi2Log'] = np.log10(star_sample_diaObject['r_psfFluxChi2'])
at_sample_diaObject['r_psfFluxChi2Log'] = np.log10(at_sample_diaObject['r_psfFluxChi2'])
star_diaObject['r_psfFluxChi2Log'] = np.log10(star_diaObject['r_psfFluxChi2'])
at_diaObject['r_psfFluxChi2Log'] = np.log10(at_diaObject['r_psfFluxChi2'])

Plot the histogram for the log of each of these features.

In [31]:

feature_histogram('r_psfFluxSigmaLog', 50)
feature_histogram('r_psfFluxMADLog', 50)
feature_histogram('r_psfFluxChi2Log', 50)

Figure 5: The distribution of psfFluxSigma (top), MAD (middle), and Chi2 (bottom). Vertical lines mark the star and transient. Both are in the tails of their Chi2 distribution.

Create a scatter plot to show the correlation between the difference-image flux amplitude and the log of Chi2.

In [32]:

feature_scatter('r_psfFluxAmpLog', 'r_psfFluxChi2Log')

Figure 6: A scatter plot of the log Chi2 versus psfFlux amplitude, in which the known objects are outlier values.

3.4. Skew¶

The skewness of a distribution describes its asymetry about its peak, as in the diagram below.

Figure 7: An illustration of negatively and positively skewed distributions. By Rodolfo Hermans (Godot) at en.wikipedia (CC BY-SA 3.0).

Display the histogram of skew values.

In [33]:

feature_histogram('r_psfFluxSkew', 50)

Figure 8: The histograms of skew values, with the known variable star and transient values plotted as vertical lines.

As with all timeseries features, the skew is affected by detection bias: i.e., brighter difference-image sources are more likely to be detected, which can affect the shape of the flux distribution and thus the skew.

3.5. Percentiles¶

The psfFluxPercentiles are 5, 25, 50, 75, and 95th percentiles of the cumulative distribution of difference-image fluxes.

Option to display any of the histograms for the flux percentiles.

In [34]:

# feature_histogram('r_psfFluxPercentile05', 50)
# feature_histogram('r_psfFluxPercentile25', 50)
# feature_histogram('r_psfFluxPercentile50', 50)
# feature_histogram('r_psfFluxPercentile75', 50)
# feature_histogram('r_psfFluxPercentile95', 50)

Create a scatter plot to show how the 50th percentile correlates with the mean of the psfFlux values.

In [35]:

feature_scatter('r_psfFluxMean', 'r_psfFluxPercentile50',
                bounds=[-15000, 15000, -20000, 20000])

Figure 9: The correlation between the 50th percentile and the mean of the psfFlux, with the star's value seen as a clear outlier.

3.6. StetsonJ¶

A variability index developed for Cepheids (defined in Stetson 1996).

Plot the distributions of the StetsonJ parameters for the sample of diaObjects, and mark the values of the known variable star.

In [36]:

feature_histogram('r_psfFluxStetsonJ', 50)

Figure 10: The distributions of StetsonJ parameter for the $g$- and $r$-band (top and bottom), with the StetsonJ parameter for the known variable star marked as a vertical line.

Since the known variable star is pulsating, and the StetsonJ parameter is designed to identify and characterize pulsating stars (Cepheids, specifically), it is not surprising that the known variable star is an outlier on this distribution.

3.7. Linear fit¶

The psfFluxLinearSlope and psfFluxLinearIntercept are the result of fitting a linear regression using the scipy.optimize.lsq_linear function. The psfFluxMaxSlope is the maximum "instantaneous" slope, or the maximum ration of the series of time-ordered values of $\Delta f / \Delta t$.

Plot the histogram of the best fit linear slopes.

In [37]:

feature_histogram('r_psfFluxLinearSlope', 100, bounds=[-1000, 1000, 0.5, 1400])

Figure 11: The distribution of the slope of linear fits to all difference-image flux light curves, with axis boundaries set to display only the main distribution and not the few outliers.

As shown in Figure 2, the transient was only observed during its decline, and thus has a negative slope, as also seen above in Figure 11.

Option to also plot histograms for the other two linear fit parameters.

In [38]:

# feature_histogram('r_psfFluxLinearIntercept', 50)
# feature_histogram('r_psfFluxMaxSlope', 50)

3.8. Science flux features¶

As explained in Section 1, the "science flux" is derived from forced photometry on the science image (the direct image, not the difference image). As further demonstrated in Section 2.2., and Figure 2 especially, for transient objects the science flux has host galaxy contamination. The science flux and its related timeseries features should really only be used for known variable stars, and so the general distributions of the features are not very useful.

Option to display the histograms for the timeseries features derived from the science flux.

In [39]:

# feature_histogram('r_scienceFluxMean', 50)
# feature_histogram('r_scienceFluxMeanErr', 50)
# feature_histogram('r_scienceFluxSigma', 50)

In [ ]: