102.1. Get started with TAP¶
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.1.1
Last verified to run: 2025-06-25
Repository: github.com/lsst/tutorial-notebooks
Learning objective: How to use the TAP service efficiently.
LSST data products: Object table
Packages: lsst.rsp.get_tap_service
Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.
Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.
1. Introduction¶
TAP (Table Access Protocol) provides standardized access to catalog data for discovery, search, and retrieval. Full documentation for TAP is provided by the International Virtual Observatory Alliance (IVOA).
Related tutorials: The other 100-level tutorials in this series demonstrate advanced ADQL statements. The 200-level tutorials describe the contents of the catalog data and have example queries for each.
1.1. TAP glossary¶
- schema - Database terminology for the abstract design that represents the storage of data in a database.
- TAP schema - The set of tables that describe the data tables and their columns.
- table collection - A collection of tables (catalogs), e.g., for a given data release.
- table - A collection of related data held in a table format in a database.
- query - A string formatted in ADQL that selects data from a table, with constraints if desired.
- results - The output of the TAP service's search method when a query is passed.
1.2. ADQL statements¶
The documentation for ADQL includes more information about syntax, keywords, operators, functions, and so on. ADQL is similar to SQL (Structured Query Language).
A typical ADQL statement has at least three components:
SELECT <columns> FROM <catalog> WHERE <constraints>
where
- <columns> is a comma-separated list of the columns to return,
- <catalog> is the name of the catalog to retrieve data from, and
- <constraints> imposes a restriction that only rows with column values that meet the constraints are returned.
For example, say there is a catalog called "mysurveydata" with 5 columns, "col1", "col2", and so on. The ADQL statement:
SELECT col3, col4, col5 FROM mysurveydata WHERE col1 > 0.5 AND col5 < 10
would return a table that has three columns, and as many rows as meet both of the restrictions in the WHERE statement.
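Import the NumPy package and the Rubin Science Platform utility for creating an instance of the TAP service.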
import numpy as np
from lsst.rsp import get_tap_service
2. Instantiate the TAP service¶
Get an instance of the TAP service, and assert that it exists.
service = get_tap_service("tap")
assert service is not None
3. Explore the TAP schema¶
The browsable TAP schema is accessible at sdm-schemas.lsst.io and in the Portal Aspect's user interface.
In a Notebook, it is possible to interact programmatically with the TAP schema.
3.1. Use caution with synchronous queries¶
Caution:
Typically users should not make synchronous queries that retrieve all available columns (SELECT *). The only time this is acceptable and useful is when querying table schema.
A "synchronous" query is when the search is executed and the results are retrieved at the same time, and no other cells can be run while the search is executing.
Synchronous queries are best used sparingly, and only for retrieving table metadata; asynchronous queries should be used whenever table data is being returned.
Queries that use SELECT *, which will retrieve all columns, are only appropriate for schemas or tables with very few (<20) columns. Queries that return data should specify which of the tens to thousands of columns are necessary to return.
3.2. List all schemas¶
Pass a query for the TAP schemas to the TAP service, and store the results in results.
results = service.search('SELECT * FROM tap_schema.schemas')
Show the results.
results
<DALResultsTable length=4>
   schema_name    utype  ... schema_index
      str64       str512 ...    int32
----------------- ------ ... ------------
dp02_dc2_catalogs        ...            2
              dp1        ...            0
             ivoa        ...            1
       tap_schema        ...       100000
The to_table() method converts results into an Astropy table. Convert results to a table and display the table.
results.to_table()
schema_name | utype | description | schema_index
---|---|---|---
str64 | str512 | str512 | int32
dp02_dc2_catalogs | | Data Preview 0.2 contains the image and catalog products of the Rubin Science Pipelines v23 processing of the DESC Data Challenge 2 simulation, which covered 300 square degrees of the wide-fast-deep LSST survey region over 5 years. | 2
dp1 | | Data Preview 1 | 0
ivoa | | ObsCore v1.1 attributes in ObsTAP realization | 1
tap_schema | | A TAP-standard-mandated schema to describe tablesets in a TAP 1.1 service | 100000
del results
3.3. List all tables¶
Create a query that selects all table names from the TAP schema for DP1. Retrieve the results as an Astropy table and display it.
query = "SELECT * FROM tap_schema.tables " \
"WHERE tap_schema.tables.schema_name = 'dp1'" \
"ORDER BY table_index ASC"
results = service.search(query).to_table()
results
schema_name | table_name | table_type | utype | description | table_index
---|---|---|---|---|---
str512 | str64 | str8 | str512 | str512 | int32
dp1 | dp1.CcdVisit | table | | Defines a single detector of a visit | 0
dp1 | dp1.DiaObject | table | | this is diaObjectTable_tract in the butler repo | 0
dp1 | dp1.DiaSource | table | | this is diaSourceTable_tract in the butler repo | 0
dp1 | dp1.ForcedSource | table | | The forced source table contains forced-photometry source measurement on an individual Exposure based on a Multifit shape model derived from a deep detection. | 0
dp1 | dp1.ForcedSourceOnDiaObject | table | | this is forcedSourceOnDiaObjectTable_tract in the butler repo | 0
dp1 | dp1.MPCORB | table | | The orbit catalog produced by the Minor Planet Center. Ingested daily. O(10M) rows by survey end. The columns are described at https://minorplanetcenter.net//iau/info/MPOrbitFormat.html | 0
dp1 | dp1.Object | table | | The object table contains descriptions of the multi-epoch static astronomical objects, in particular their astrophysical properties as derived from analysis of the Sources that are associated with them. Note that fast moving objects are kept in the MovingObject tables. Note that less-frequently used columns are stored in a separate table called Object_Extra. | 0
dp1 | dp1.Source | table | | Table to store high signal-to-noise sources. A source is a measurement of Object's properties from a single image that contains its footprint on the sky. | 0
dp1 | dp1.SSObject | table | | LSST-computed per-object quantities. 1:1 relationship with MPCORB. Recomputed daily, upon MPCORB ingestion. | 0
dp1 | dp1.SSSource | table | | LSST-computed per-source quantities. 1:1 relationship with DIASource. Recomputed daily, upon MPCORB ingestion. | 0
dp1 | dp1.Visit | table | | Defines a single Visit. | 0
del query, results
3.4. List all columns¶
Create a query that selects all column names, data types, descriptions, and units from the DP1 Object table. Retrieve the results as an Astropy table.
query = "SELECT column_name, datatype, description, unit " \
"FROM tap_schema.columns " \
"WHERE table_name = 'dp1.Object'"
results = service.search(query).to_table()
Optionally, display the results. The table's display will automatically be truncated by the notebook because the Object table has 1296 columns.
# results
Use a for loop to look for the presence of coord in the column_name column for every row of the results table, and print the name when there is a match.
search_string = 'coord'
# search_string = 'g_psfFlux'
for cname in results['column_name']:
if cname.find(search_string) > -1:
print(cname)
coord_dec
coord_decErr
coord_ra
coord_ra_dec_Cov
coord_raErr
Uncomment the second line in the cell above and re-execute it to print all columns containing the string g_psfFlux.
Print the unique set of data types or units.
print(np.unique(results['datatype']))
# print(np.unique(results['unit']))
datatype
--------
 boolean
    char
  double
   float
     int
    long
del query, results
4. Best practices for TAP queries¶
The Object table used in this demonstration contains forced measurements in the multi-band deep_coadd images at the coordinates of all signal-to-noise ratio >5 detections in a deep_coadd of any filter. The 200-level tutorial for the Object catalog contains further detail.
4.1. Use asynchronous queries¶
For all TAP searches for data (not schema), one should use asynchronous queries.
Asynchronous means that the query is submitted as a job to the TAP service, and it can run in the background until it completes. The results can then be retrieved right away, or at a later time.
The step-by-step process for asynchronous queries starts by defining the job by passing the query (an ADQL string, like those constructed above) to the TAP service.
job = service.submit_job(query)
The next step is to run the job and, optionally, wait for it to return either a COMPLETED or ERROR result. While waiting is optional, tutorial notebooks always wait because the next steps of a tutorial depend on having the retrieved data.
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
Finally, display the error message if the query was not successful, or retrieve the results if the query was completed.
if job.phase == 'ERROR':
job.raise_if_error()
assert job.phase == 'COMPLETED'
results = job.fetch_result()
4.2. Use spatial constraints¶
All TAP searches on tables that include sky coordinates should include spatial constraints.
It is recommended to always start with spatial constraints for a small region and then expand the search area.
Spatial constraints should be implemented with either an ADQL cone or polygon search.
4.2.1. Table sharding by coordinate¶
On disk, the catalog tables are sharded by coordinate (RA, Dec). This can be thought of as the database being divided up by spatial region (shard) and distributed across multiple servers. Queries that include spatial constraints which minimize the number of shards that have to be searched through will be much faster than queries that have no (or very wide) spatial constraints.
4.2.2. Cone searches¶
A cone search means a catalog query that returns all objects within a given radius of a sky coordinate. Since a given angular diameter corresponds to a larger physical diameter at larger distances, the volume queried is a cone shape in 3D, not a cylinder.
In ADQL, a cone search is executed with WHERE CONTAINS(POINT(), CIRCLE()) = 1.
- POINT() passes the catalog's columns for sky coordinates.
- CIRCLE() defines the center and radius of the search circle.
- CONTAINS() = 1 constrains the query to only return rows for which the statement "this circle contains this point" is "True" (=1).
Column constraints on the coordinates, either with ADQL like -30 < coord_dec AND coord_dec < -25 or using a WHERE ... BETWEEN statement, are not recommended because the TAP service doesn't interpret them the same as a cone or polygon search.
Create a cone search on the Object table for objects within 0.01 degrees of coordinates near the center of the ECDFS field, RA, Dec = 53, -28 degrees.
query = """SELECT coord_ra, coord_dec
FROM dp1.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
CIRCLE('ICRS', 53, -28, 0.01)) = 1"""
print(query)
SELECT coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          CIRCLE('ICRS', 53, -28, 0.01)) = 1
Execute the query asynchronously, and print the number of rows returned.
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
job.raise_if_error()
assert job.phase == 'COMPLETED'
results = job.fetch_result()
print(len(results))
Job phase is COMPLETED
141
4.2.3. Polygon searches¶
For a polygon search, replace CIRCLE with POLYGON(ra1, dec1, ra2, dec2, ra3, dec3, ...), where each ra, dec pair is a vertex of the polygon.
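As a minimal sketch, a polygon query follows the same pattern as the cone search above. The three vertex coordinates here are illustrative assumptions, chosen to enclose a small triangular region near the same ECDFS coordinates; they are not values from the original tutorial.
# Define a triangular polygon search region near RA, Dec = 53, -28 degrees.
# The three vertex coordinates below are illustrative.
query = """SELECT coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          POLYGON('ICRS', 52.99, -28.01,
                                  53.01, -28.01,
                                  53.00, -27.99)) = 1"""
print(query)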
4.3. Sort results with ORDER BY¶
Add an ORDER BY statement to the TAP query to sort the results in ascending (ASC) or descending (DESC) order. Order the results by ascending coord_ra values.
query = """SELECT coord_ra, coord_dec
FROM dp1.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
CIRCLE('ICRS', 53, -28, 0.01)) = 1
ORDER BY coord_ra ASC"""
print(query)
SELECT coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          CIRCLE('ICRS', 53, -28, 0.01)) = 1
           ORDER BY coord_ra ASC
4.4. Limit rows with TOP¶
For debugging and testing queries, the recommended way to restrict the number of rows returned is to use very small spatial regions, not to use TOP.
The TAP service first applies the WHERE constraints, then ORDER BY, and then TOP. If the query is not well constrained, i.e., if thousands or more objects meet the WHERE constraints, then they all must first be sorted before the top N are returned. This is a waste of time and compute resources.
However, it can be useful to retrieve only a subset of the rows which meet the query constraints. To do this, add TOP N after SELECT to return only the top N rows.
query = """SELECT TOP 10 coord_ra, coord_dec
FROM dp1.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
CIRCLE('ICRS', 53, -28, 0.01)) = 1
ORDER BY coord_ra ASC"""
print(query)
SELECT TOP 10 coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          CIRCLE('ICRS', 53, -28, 0.01)) = 1
           ORDER BY coord_ra ASC
Execute the query and confirm that only 10 rows are returned.
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
job.raise_if_error()
assert job.phase == 'COMPLETED'
results = job.fetch_result()
print(len(results))
Job phase is COMPLETED
10
For users with TAP experience, including maxrec in the job definition will also work, but use of TOP is recommended.
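As a minimal sketch, assuming the service accepts the standard pyvo maxrec argument to submit_job, the job definition would look like:
# Submit the same query, capping the result at 10 rows via maxrec
# instead of TOP (assumes the standard pyvo submit_job signature).
job = service.submit_job(query, maxrec=10)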
4.4.1. Use caution with TOP and ORDER BY¶
Combined use of TOP and ORDER BY in ADQL queries can be dangerous: it may take an unexpectedly long time because the database is trying to first sort, and then extract the top N elements. It is best to only combine TOP and ORDER BY if the query's WHERE statements significantly cut down the number of objects that would need to be sorted.
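The query in Section 4.4 above is an example of safe combined use: its cone-search WHERE constraint limits the sort to the 141 objects within the 0.01-degree radius before the top 10 rows are returned.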