311.3. Large_datasets#

311_3_Large_datasets

311.3. Large datasets visualization¶

For the Rubin Science Platform at data.lsst.cloud.
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.2.0
Last verified to run: 2026-03-23
Repository: github.com/lsst/tutorial-notebooks
DOI: 10.11578/rubin/dc.20250909.20

Learning objective: Interactive catalog data visualizations for large datasets.

LSST data products: Object table

Packages: lsst.rsp.get_tap_service, holoviews, bokeh, datashader

Credit: Originally developed by the Rubin Community Science team, inspired by the tutorials of Leanne Guy and Keith Bechtol. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.

Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.

1. Introduction¶

The Rubin Science Platform was designed to enable scientific analysis of the LSST data sets, which will be unprecedentedly large and complex. The software and techniques that are best suited for visualizing large data sets might be new to many astronomers. This series of notebooks introduce learners with some knowledge of python to three open-source Python libraries that enable powerful interactive visualization of catalogs.

HoloViews: Produce high-quality interactive visualizations easily by annotating plots and images rather than using direct calls to a plotting library.
Bokeh: A powerful data visualization library that provides interactive tools including brushing and linking between multiple plots.
Datashader: Accurately render very large datasets quickly and flexibly.

This notebook focuses on all three packages for representing large datasets.

These packages are part of the Holoviz ecosystem of tools intended for visualization in a web browser and can be used to create quite sophisticated dashboard-like interactive displays and widgets. The goal of this tutorial is to provide an introduction and starting point from which to create more advanced, custom interactive visualizations. Holoviz has a vibrant and active community where you can ask questions and discuss visualizations with a global community.

Notice: If the notebook or any interactive features seem to stall, first try going a few cells back and rerunning them in order (the order in which cells are run is imporant for this notebook's functionality). If that does not work, try restarting the kernel. If issues persist, try logging out and restarting the Notebook aspect using a "large" instance of the JupyterLab environment.

Warning: It is not recommended to "Restart Kernel and Run All Cells" in this notebook, or to execute multiple cells very quickly. Some of the examples require interaction (e.g., for the user to select points on a graph) in order to run correctly, and going too fast can cause some plots to not display properly.

Related tutorials: This notebook is part three in a series of three tutorials on interactive catalog data visualizations.

1.1. Import packages¶

Import general scientific python packages (os, numpy, pandas), functions from the astronomy python package astropy, the Rubin function for accessing the TAP service (lsst.rsp.get_tap_service), and various functions from the holoviews and bokeh packages that are used in this tutorial.

In [1]:

import numpy as np
import pandas as pd

from astropy import units as u
from astropy.coordinates import SkyCoord

from lsst.rsp import get_tap_service

import bokeh
from bokeh.io import output_notebook, show, reset_output
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure
from bokeh.transform import factor_cmap

import holoviews as hv
from holoviews import streams
from holoviews.operation.datashader import datashade, dynspread
from holoviews.plotting.util import process_cmap

import datashader as dsh

Show which version of Bokeh, HoloViews, and DataShader we are working with. This is important when referring to online documentation as APIs can change between versions.

In [2]:

print("Bokeh version: " + bokeh.__version__)
print("HoloViews version: " + hv.__version__)
print("Datashader version: " + dsh.__version__)

Bokeh version: 3.8.2
HoloViews version: 1.22.1
Datashader version: 0.18.2

1.2. Define functions and parameters¶

Update the maximum number of display rows for Pandas tables.

In [3]:

pd.set_option('display.max_rows', 5)

Set the display of the output Bokeh plots to be inline, in the notebook.

In [4]:

output_notebook()

Loading BokehJS ...

Set the HoloViews plotting library to be bokeh. The HoloViews and Bokeh icons are displayed when the library is loaded successfully. Always set the extension after executing output_notebook() to avoid issues with plot display.

In [5]:

hv.extension('bokeh')

ⓘ

Notice: Sometimes the bokeh.io.show function can be finicky when output modes are switched (e.g., from inline to an HTML file and back again).

To avert a "Models must be owned by only a single document" error (see, e.g., https://github.com/bokeh/bokeh/issues/8579), define the following function and use it in Section 3.

In [6]:

def show_bokeh_inline(p):
    try:
        reset_output()
        output_notebook()
        show(p)
    except:
        output_notebook()
        show(p)

2. Use the TAP service to obtain table data¶

The basis for any data visualization is the underlying data. This tutorial works with tabular data that is retrieved from a cone search around a defined coordinate with a specified radius using the Rubin TAP service.

Get a Rubin TAP service instance.

In [7]:

service = get_tap_service("tap")
assert service is not None

Define a reference position on the sky (in the Extended Chandra Deep Field South or ECDFS field, as an example) and a radius in degrees for a cone search.

In [8]:

coord = SkyCoord(ra=53.2*u.degree, dec=-28.1*u.degree, frame='icrs')
radius = 1 * u.deg

Define the query pass to the TAP service.

In [9]:

query = """SELECT coord_ra, coord_dec, objectId, r_extendedness,
        g_cModelMag, r_cModelMag, i_cModelMag
        FROM dp1.Object
        WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
        CIRCLE('ICRS', {}, {}, {})) = 1
        AND r_cModelMag < 27
        AND r_extendedness IS NOT NULL
        """.format(coord.ra.value, coord.dec.value, radius.to(u.deg).value)

In [10]:

print(query)

SELECT coord_ra, coord_dec, objectId, r_extendedness,
        g_cModelMag, r_cModelMag, i_cModelMag
        FROM dp1.Object
        WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
        CIRCLE('ICRS', 53.2, -28.1, 1.0)) = 1
        AND r_cModelMag < 27
        AND r_extendedness IS NOT NULL

As query will return a very large dataset, use an asynchronous query. This will take a few minutes.

In [11]:

job = service.submit_job(query)

In [12]:

job.run()

Out[12]:

<pyvo.dal.tap.AsyncTAPJob at 0x7f9150356de0>

In [13]:

job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
assert job.phase == 'COMPLETED'

Job phase is COMPLETED

After the job phase has completed, fetch the results into a Pandas table.

In [14]:

data = job.fetch_result().to_table().to_pandas()

Confirm that the expected number of rows has been returned (258807). If true, the following cell will not return an assertion error.

In [15]:

assert len(data) == 258807

Compute three colors from the apparent magnitudes.

In [16]:

data['gmi'] = data['g_cModelMag'] - data['i_cModelMag']
data['rmi'] = data['r_cModelMag'] - data['i_cModelMag']
data['gmr'] = data['g_cModelMag'] - data['r_cModelMag']

Use the r-band extendedness parameter to classify objects as having a "shape_type" of "point" or "extended".

In [17]:

data['shape_type'] = data['r_extendedness'].map({0: 'point', 1: 'extended'})

Convert the objectId to a string because as an integer, it's too large for bokeh to handle.

In [18]:

data['objectId'] = np.array(data['objectId']).astype('str')

Option to print the number of objects with shape_type as "point" or "extended".

In [19]:

# tx1 = np.where(data['shape_type'] == 'point')[0]
# tx2 = np.where(data['shape_type'] == 'extended')[0]
# print(len(tx1), len(tx2))
# del tx1, tx2

Confirm that the expected number of each shape_type have been returned.

In [20]:

assert data[data["shape_type"] == "point"].shape[0] == 48454
assert data[data["shape_type"] == "extended"].shape[0] == 210353

Create a 8% random subsample of this dataset with which to demonstrate some basic functionality. Print the length of this subset and confirm that it is 8% of the original size.

In [21]:

frac = 0.08
data20K = data.sample(frac=frac, axis='index')
print(len(data20K))
assert len(data20K) == round(frac * len(data))

3. Datashader¶

The interactive features of Bokeh work well with datasets up to a few tens of thousands of data points. To efficiently explore larger datasets it is recommended to use another visualization model that offers better scalability: Datashader.

The examples below will show how, when zooming in on the datashaded two-dimensional histograms, the bin sizes are dynamically adjusted to show finer or coarser granularity in the distribution. This allows for the interactive exploration of large datasets without having to manually adjust the bin sizes while panning and zooming.

Zoom in all the way to see individual points (i.e., bins contain either zero or one count). Zoom in far enough and find that the individual points are represented by extremely small pixels in datashader that are difficult to see. A solution is to dynspread instead of datashade, which will preserve a finite size of the plotted points.

3.1. Data preparation¶

Getting the data preparation phase right is key to creating powerful visualizations. Bokeh works with a ColumnDataSource (CDS). A CDS is essentially a collection of sequences of data that have their own unique column name.

The CDS is the core of bokeh plots. Bokeh automatically creates a CDS from data passed as python lists or numpy arrays. CDS are useful as they allow data to be shared between multiple plots and renderers, enabling brushing and linking.

Below, a CDS is created from the data returned by the query above and passed directly to Bokeh.

Create a CDS.

In [22]:

col_data = dict(ra=data20K['coord_ra'], dec=data20K['coord_dec'])
source = ColumnDataSource(data=col_data)

Additional data can be added to the CDS after creation.

In [23]:

source.data['objectId'] = data20K['objectId']
source.data['rmi'] = data20K['rmi']
source.data['gmr'] = data20K['gmr']
source.data['r_cModelMag'] = data20K['r_cModelMag']
source.data['shape_type'] = data20K['shape_type']
source.data['r_extendedness'] = data20K['r_extendedness']

3.2. Plotting a small set of data points¶

Plot a color-color diagram using Bokeh with a customized hover tool.

In [24]:

plot_options = {'height': 400, 'width': 800,
                'tools': ['pan', 'box_zoom', 'box_select',
                          'wheel_zoom', 'reset', 'help']}

p = figure(title="Color-Color Diagram",
           x_axis_label="(g-r)", y_axis_label="(r-i)",
           x_range=(-2.0, 3.0), y_range=(-2.0, 3.0),
           **plot_options)
p.scatter(x='gmr', y='rmi', source=source,
          size=2, alpha=0.2,
          hover_color='firebrick',
          legend_field="shape_type",
          color=factor_cmap('shape_type', 'Category10_3',
                            ['point', 'extended']))

hover = HoverTool(tooltips=[("objectId", "@objectId"),
                            ("(ra,dec)", "(@ra, @dec)"),
                            ("(g-r,r-i)", "(@gmr, @rmi)"),
                            ("type", "@shape_type")])
p.add_tools(hover)

In [25]:

show_bokeh_inline(p)

Loading BokehJS ...

Figure 1: The r-i color vs. the g-r color, with extended objects in orange and point-like objects in blue. The interactive tool bar is at right.

The plot above suffers from overplotting (confusion), even though the dataset only contains ~20K points. A classic strategy to visualize the dense areas is to specify the transparency (alpha) of the glyphs; in the plot above, alpha=0.2 was used. This has helped, but washes out the detail in the sparser regions.

HoloViews + Datashader allow millions to billions of points to be plotted, and this produces much more informative plots. Datashader rasterizes or aggregates datasets into regular grids that can then be further analysed or viewed as plots or images.

Create a Holoviews object "points" to hold and plot data.

In [26]:

points = hv.Points((source.to_df()['gmr'], source.to_df()['rmi']))

Create the linked streams instance.

In [27]:

boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

Apply the datashader.

In [28]:

p = dynspread(datashade(points, cmap="Viridis"))
p = p.opts(width=800, height=300, padding=0.05, show_grid=True,
           xlim=(-2.0, 7.0), ylim=(-5.0, 3.0), xlabel="(g-r)", ylabel="(r-i)",
           tools=['box_select'])

Render the datashaded plot.

In [29]:

p * bounds

Out[29]:

Figure 2: The r-i vs. the g-r color as in Fig 1, but displayed as a 2-dimensional density map (with a purple-green-yellow colormap) in regions of high density, and as individual points in regions of low density. Interactive toolbar at right.

This datashaded plot of the same color-color diagram as above does not require any magic-number parameters such as size and alpha and automatically ensures that there is no saturation or overplotting.

Above, select the wheel zoom and adjust the image to interact with the plot. Note how the shades of color of the data points change according to the local density.

3.3. Plotting a large set of data points¶

The dataset of ~20K points used above is actually too small to demonstrate the power of datashader.

Below, visualize the full >200k object dataset returned by the query.

Create a Points Element for the data.

In [30]:

points = hv.Points((data['gmr'], data['rmi']))

Create the linked streams instance.

In [31]:

boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

Apply the datashader.

In [32]:

p = dynspread(datashade(points, cmap="Viridis"))
p = p.opts(width=800, height=300, padding=0.05, show_grid=True,
           xlim=(-2.0, 7.0), ylim=(-5.0, 3.0), xlabel="(g-r)", ylabel="(r-i)",
           tools=['box_select'])

Render the datashaded plot.

In [33]:

p * bounds

Out[33]:

Figure 3: Same as Fig 2, but with many more objects included.

Above, use the box zoom or wheel zoom functionality to zoom in on the plot, and watch how datashader adapts the view to optimally display the data.

3.4. Adding a callback function¶

Add callback functionality to the color-color diagram above to retrieve the indices of selected points.

Above, use the box selection tool to select data.

STOP - Select some data points from the plot above using the box select tool before proceeding

Print the number of objects selected.

In [34]:

selection = (points.data.x > box.bounds[0]) \
    & (points.data.y > box.bounds[1]) \
    & (points.data.x < box.bounds[2]) \
    & (points.data.y < box.bounds[3])
print('The selection box contains {} data points'.format(np.sum(selection)))

The selection box contains 0 data points

3.5. Interactive selection¶

Below, create two side-by-side plots.

The left-hand plot will show the datashaded spatial distribution on the sky, and the right-hand plot will be a linked and brushed plot showing the r-band magnitude distribution for objects selected in the left-hand plot. It will be possible to use the box selection in the spatial distribution plot to change which data are included in the histogram.

First, create a holoviews dataset instance, and label some of the columns.

In [35]:

kdims = [('coord_ra', 'RA(deg)'), ('coord_dec', 'dec(deg)')]
vdims = [('r_cModelMag', 'r(mag)')]
ds = hv.Dataset(data, kdims, vdims)

In [36]:

points = hv.Points(ds)
boundsxy = (np.min(ds.data['coord_ra']), np.min(ds.data['coord_dec']),
            np.max(ds.data['coord_ra']), np.max(ds.data['coord_dec']))
box = streams.BoundsXY(source=points, bounds=boundsxy)
box_plot = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

Create custom callback functionality to update the linked histogram.

These functions are defined here (and not in Section 1.2) because they are only used in the following cell. The first function is used for setting the frequency of a magnitude to log scale because of the large number of data points. The second function updates the histogram based on the new selected sky area.

In [37]:

def log_inf(x) -> float:
    return np.log(x) if x > 0 else 0


def update_histogram(bounds=bounds) -> hv.Histogram:
    selection = (ds.data['coord_ra'] > bounds[0]) & \
                (ds.data['coord_dec'] > bounds[1]) & \
                (ds.data['coord_ra'] < bounds[2]) & \
                (ds.data['coord_dec'] < bounds[3])
    selected_mag = ds.data.loc[selection]['r_cModelMag']
    frequencies, edges = np.histogram(selected_mag)
    hist = hv.Histogram(
        (list(map(log_inf, frequencies)), edges)).opts(xlabel='r (mag)')
    return hist

Make the histogram for r-band magnitude and plot it together with the 2D density map of celestial coordinates. Compared to the plots above, this plot shows two panels at the same time, and applying the "Box Select" to the left panel changes the histogram on the right.

In [38]:

dmap = hv.DynamicMap(update_histogram, streams=[box]).options(height=400, width=400)

datashade(points, cmap=process_cmap("Viridis", provider="bokeh")) * \
    box_plot.options(height=400, width=400, tools=['box_select']) + dmap

Out[38]:

Figure 4: At right, Dec vs. RA as a 2-dimensional density map with a purple-green-yellow colormap. At left, a histogram (blue bar chart) of the fraction of objects in bins of r-band magnitude.

Try changing the box selection and watch as the histogram is recomputed and displayed.

4. Conclusion¶

This notebook demonstrates visualizing large catalog datasets. Additional notebooks in this series will present single plots and paired plots.